[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-03 Thread Tim Triche, Jr.
It would be nice (for a number of reasons) to have chromosome lengths
readily available in a foundational package like GenomeInfoDb, so that,
say,

data(seqinfo.hg19)
seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]

would work without issues.  Is there any particular reason this couldn't
happen for the supported/available BSgenomes?  It would seem like a simple
matter to do

R> library(BSgenome.Hsapiens.UCSC.hg19)
R> seqinfo.hg19 <- seqinfo(Hsapiens)
R> save(seqinfo.hg19,
file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")

and be done with it until (say) the next release or next released
BSgenome.  I considered looping through the following BSgenomes myself...
and if it isn't strongly opposed by (everyone) I may still do exactly
that.  Seems useful, no?

e.g. for the following 42 builds,

grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
value=TRUE)
 [1] "BSgenome.Amellifera.UCSC.apiMel2"   "BSgenome.Btaurus.UCSC.bosTau3"

 [3] "BSgenome.Btaurus.UCSC.bosTau4"  "BSgenome.Btaurus.UCSC.bosTau6"

 [5] "BSgenome.Btaurus.UCSC.bosTau8"  "BSgenome.Celegans.UCSC.ce10"

 [7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"

 [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
 "BSgenome.Cfamiliaris.UCSC.canFam3"
[11] "BSgenome.Dmelanogaster.UCSC.dm2""BSgenome.Dmelanogaster.UCSC.dm3"

[13] "BSgenome.Dmelanogaster.UCSC.dm6""BSgenome.Drerio.UCSC.danRer5"

[15] "BSgenome.Drerio.UCSC.danRer6"   "BSgenome.Drerio.UCSC.danRer7"

[17] "BSgenome.Ecoli.NCBI.20080805"
"BSgenome.Gaculeatus.UCSC.gasAcu1"
[19] "BSgenome.Ggallus.UCSC.galGal3"  "BSgenome.Ggallus.UCSC.galGal4"

[21] "BSgenome.Hsapiens.NCBI.GRCh38"  "BSgenome.Hsapiens.UCSC.hg17"

[23] "BSgenome.Hsapiens.UCSC.hg18""BSgenome.Hsapiens.UCSC.hg19"

[25] "BSgenome.Hsapiens.UCSC.hg38""BSgenome.Mfascicularis.NCBI.5.0"

[27] "BSgenome.Mfuro.UCSC.musFur1""BSgenome.Mmulatta.UCSC.rheMac2"

[29] "BSgenome.Mmulatta.UCSC.rheMac3" "BSgenome.Mmusculus.UCSC.mm10"

[31] "BSgenome.Mmusculus.UCSC.mm8""BSgenome.Mmusculus.UCSC.mm9"

[33] "BSgenome.Ptroglodytes.UCSC.panTro2"
"BSgenome.Ptroglodytes.UCSC.panTro3"
[35] "BSgenome.Rnorvegicus.UCSC.rn4"  "BSgenome.Rnorvegicus.UCSC.rn5"

[37] "BSgenome.Rnorvegicus.UCSC.rn6"
 "BSgenome.Scerevisiae.UCSC.sacCer1"
[39] "BSgenome.Scerevisiae.UCSC.sacCer2"
 "BSgenome.Scerevisiae.UCSC.sacCer3"
[41] "BSgenome.Sscrofa.UCSC.susScr3"  "BSgenome.Tguttata.UCSC.taeGut1"




Am I insane for suggesting this?  It would make things a little easier for
rtracklayer, most SummarizedExperiment and SE-derived objects, blah, blah,
blah...


Best,

--t




Statistics is the grammar of science.
Karl Pearson 

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-03 Thread Kasper Daniel Hansen
Let me rephrase this slightly.  From one POV the purpose of GenomeInfoDb is
clean up the seqinfo slot.  Currently it does most of the cleaning, but it
does not add seqlengths.

It is clear that seqlengths depends on the version of the genome, but I
will argue so does the seqnames.  Of course, for human, chr22 will not
change.  But what about the names of all the random contigs?  Or for other
organisms, what about going from a draft genome with 10k contigs to a more
completely genome assembled into fewer, larger chromosomes.

I acknowledge that this information is present in the BSgenome packages,
but it seems (to me) to be very appropriate to have them around for
cleaning up the seqinfo slot.  For some situations it is not great to
depend on 1 GB> download for something that is a few bytes.

Best,
Kasper

On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. 
wrote:

> It would be nice (for a number of reasons) to have chromosome lengths
> readily available in a foundational package like GenomeInfoDb, so that,
> say,
>
> data(seqinfo.hg19)
> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>
> would work without issues.  Is there any particular reason this couldn't
> happen for the supported/available BSgenomes?  It would seem like a simple
> matter to do
>
> R> library(BSgenome.Hsapiens.UCSC.hg19)
> R> seqinfo.hg19 <- seqinfo(Hsapiens)
> R> save(seqinfo.hg19,
> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>
> and be done with it until (say) the next release or next released
> BSgenome.  I considered looping through the following BSgenomes myself...
> and if it isn't strongly opposed by (everyone) I may still do exactly
> that.  Seems useful, no?
>
> e.g. for the following 42 builds,
>
> grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
> value=TRUE)
>  [1] "BSgenome.Amellifera.UCSC.apiMel2"   "BSgenome.Btaurus.UCSC.bosTau3"
>
>  [3] "BSgenome.Btaurus.UCSC.bosTau4"  "BSgenome.Btaurus.UCSC.bosTau6"
>
>  [5] "BSgenome.Btaurus.UCSC.bosTau8"  "BSgenome.Celegans.UCSC.ce10"
>
>  [7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"
>
>  [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
>  "BSgenome.Cfamiliaris.UCSC.canFam3"
> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
>  "BSgenome.Dmelanogaster.UCSC.dm3"
> [13] "BSgenome.Dmelanogaster.UCSC.dm6""BSgenome.Drerio.UCSC.danRer5"
>
> [15] "BSgenome.Drerio.UCSC.danRer6"   "BSgenome.Drerio.UCSC.danRer7"
>
> [17] "BSgenome.Ecoli.NCBI.20080805"
> "BSgenome.Gaculeatus.UCSC.gasAcu1"
> [19] "BSgenome.Ggallus.UCSC.galGal3"  "BSgenome.Ggallus.UCSC.galGal4"
>
> [21] "BSgenome.Hsapiens.NCBI.GRCh38"  "BSgenome.Hsapiens.UCSC.hg17"
>
> [23] "BSgenome.Hsapiens.UCSC.hg18""BSgenome.Hsapiens.UCSC.hg19"
>
> [25] "BSgenome.Hsapiens.UCSC.hg38"
>  "BSgenome.Mfascicularis.NCBI.5.0"
> [27] "BSgenome.Mfuro.UCSC.musFur1""BSgenome.Mmulatta.UCSC.rheMac2"
>
> [29] "BSgenome.Mmulatta.UCSC.rheMac3" "BSgenome.Mmusculus.UCSC.mm10"
>
> [31] "BSgenome.Mmusculus.UCSC.mm8""BSgenome.Mmusculus.UCSC.mm9"
>
> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
> "BSgenome.Ptroglodytes.UCSC.panTro3"
> [35] "BSgenome.Rnorvegicus.UCSC.rn4"  "BSgenome.Rnorvegicus.UCSC.rn5"
>
> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
>  "BSgenome.Scerevisiae.UCSC.sacCer1"
> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
>  "BSgenome.Scerevisiae.UCSC.sacCer3"
> [41] "BSgenome.Sscrofa.UCSC.susScr3"  "BSgenome.Tguttata.UCSC.taeGut1"
>
>
>
>
> Am I insane for suggesting this?  It would make things a little easier for
> rtracklayer, most SummarizedExperiment and SE-derived objects, blah, blah,
> blah...
>
>
> Best,
>
> --t
>
>
>
>
> Statistics is the grammar of science.
> Karl Pearson 
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-03 Thread Vincent Carey
I typically get this info from Homo.sapiens.  The result is parasitic on
the TxDb that is in there.  I don't know how easy it is to swap alternate
TxDb in to get a different build.  I think it would make sense to regard
the OrganismDb instances as foundational for this sort of structural data.

On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:

> Let me rephrase this slightly.  From one POV the purpose of GenomeInfoDb is
> clean up the seqinfo slot.  Currently it does most of the cleaning, but it
> does not add seqlengths.
>
> It is clear that seqlengths depends on the version of the genome, but I
> will argue so does the seqnames.  Of course, for human, chr22 will not
> change.  But what about the names of all the random contigs?  Or for other
> organisms, what about going from a draft genome with 10k contigs to a more
> completely genome assembled into fewer, larger chromosomes.
>
> I acknowledge that this information is present in the BSgenome packages,
> but it seems (to me) to be very appropriate to have them around for
> cleaning up the seqinfo slot.  For some situations it is not great to
> depend on 1 GB> download for something that is a few bytes.
>
> Best,
> Kasper
>
> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. 
> wrote:
>
> > It would be nice (for a number of reasons) to have chromosome lengths
> > readily available in a foundational package like GenomeInfoDb, so that,
> > say,
> >
> > data(seqinfo.hg19)
> > seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
> >
> > would work without issues.  Is there any particular reason this couldn't
> > happen for the supported/available BSgenomes?  It would seem like a
> simple
> > matter to do
> >
> > R> library(BSgenome.Hsapiens.UCSC.hg19)
> > R> seqinfo.hg19 <- seqinfo(Hsapiens)
> > R> save(seqinfo.hg19,
> > file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
> >
> > and be done with it until (say) the next release or next released
> > BSgenome.  I considered looping through the following BSgenomes myself...
> > and if it isn't strongly opposed by (everyone) I may still do exactly
> > that.  Seems useful, no?
> >
> > e.g. for the following 42 builds,
> >
> > grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
> > value=TRUE)
> >  [1] "BSgenome.Amellifera.UCSC.apiMel2"   "BSgenome.Btaurus.UCSC.bosTau3"
> >
> >  [3] "BSgenome.Btaurus.UCSC.bosTau4"  "BSgenome.Btaurus.UCSC.bosTau6"
> >
> >  [5] "BSgenome.Btaurus.UCSC.bosTau8"  "BSgenome.Celegans.UCSC.ce10"
> >
> >  [7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"
> >
> >  [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
> >  "BSgenome.Cfamiliaris.UCSC.canFam3"
> > [11] "BSgenome.Dmelanogaster.UCSC.dm2"
> >  "BSgenome.Dmelanogaster.UCSC.dm3"
> > [13] "BSgenome.Dmelanogaster.UCSC.dm6""BSgenome.Drerio.UCSC.danRer5"
> >
> > [15] "BSgenome.Drerio.UCSC.danRer6"   "BSgenome.Drerio.UCSC.danRer7"
> >
> > [17] "BSgenome.Ecoli.NCBI.20080805"
> > "BSgenome.Gaculeatus.UCSC.gasAcu1"
> > [19] "BSgenome.Ggallus.UCSC.galGal3"  "BSgenome.Ggallus.UCSC.galGal4"
> >
> > [21] "BSgenome.Hsapiens.NCBI.GRCh38"  "BSgenome.Hsapiens.UCSC.hg17"
> >
> > [23] "BSgenome.Hsapiens.UCSC.hg18""BSgenome.Hsapiens.UCSC.hg19"
> >
> > [25] "BSgenome.Hsapiens.UCSC.hg38"
> >  "BSgenome.Mfascicularis.NCBI.5.0"
> > [27] "BSgenome.Mfuro.UCSC.musFur1"
> "BSgenome.Mmulatta.UCSC.rheMac2"
> >
> > [29] "BSgenome.Mmulatta.UCSC.rheMac3" "BSgenome.Mmusculus.UCSC.mm10"
> >
> > [31] "BSgenome.Mmusculus.UCSC.mm8""BSgenome.Mmusculus.UCSC.mm9"
> >
> > [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
> > "BSgenome.Ptroglodytes.UCSC.panTro3"
> > [35] "BSgenome.Rnorvegicus.UCSC.rn4"  "BSgenome.Rnorvegicus.UCSC.rn5"
> >
> > [37] "BSgenome.Rnorvegicus.UCSC.rn6"
> >  "BSgenome.Scerevisiae.UCSC.sacCer1"
> > [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
> >  "BSgenome.Scerevisiae.UCSC.sacCer3"
> > [41] "BSgenome.Sscrofa.UCSC.susScr3"
> "BSgenome.Tguttata.UCSC.taeGut1"
> >
> >
> >
> >
> > Am I insane for suggesting this?  It would make things a little easier
> for
> > rtracklayer, most SummarizedExperiment and SE-derived objects, blah,
> blah,
> > blah...
> >
> >
> > Best,
> >
> > --t
> >
> >
> >
> >
> > Statistics is the grammar of science.
> > Karl Pearson 
> >
>
> [[alternative HTML version deleted]]
>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-03 Thread Tim Triche, Jr.
Right, I typically do that too, and if you're working on human data it
isn't a big deal.  What makes things a lot more of a drag is when you work
on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where Mus.musculus
is essentially a "build ahead" of Homo.sapiens.

R> seqinfo(Homo.sapiens)
Seqinfo object with 93 sequences (1 circular) from hg19 genome

R> seqinfo(Mus.musculus)
Seqinfo object with 66 sequences (1 circular) from mm10 genome:

It's not as explicit as directly assigning the seqinfo from a genome that
corresponds to that of your annotations/results/whatever.  I know we could
all use crossmap or liftOver or whatever, but that's not really the same,
and it takes time, whereas assigning the proper seqinfo for relationships
is very fast.

That's all I was getting at...


Statistics is the grammar of science.
Karl Pearson 

On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey 
wrote:

>  I typically get this info from Homo.sapiens.  The result is parasitic on
> the TxDb that is in there.  I don't know how easy it is to swap alternate
> TxDb in to get a different build.  I think it would make sense to regard
> the OrganismDb instances as foundational for this sort of structural data.
>
> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
> kasperdanielhan...@gmail.com> wrote:
>
>> Let me rephrase this slightly.  From one POV the purpose of GenomeInfoDb
>> is
>> clean up the seqinfo slot.  Currently it does most of the cleaning, but it
>> does not add seqlengths.
>>
>> It is clear that seqlengths depends on the version of the genome, but I
>> will argue so does the seqnames.  Of course, for human, chr22 will not
>> change.  But what about the names of all the random contigs?  Or for other
>> organisms, what about going from a draft genome with 10k contigs to a more
>> completely genome assembled into fewer, larger chromosomes.
>>
>> I acknowledge that this information is present in the BSgenome packages,
>> but it seems (to me) to be very appropriate to have them around for
>> cleaning up the seqinfo slot.  For some situations it is not great to
>> depend on 1 GB> download for something that is a few bytes.
>>
>> Best,
>> Kasper
>>
>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. 
>> wrote:
>>
>> > It would be nice (for a number of reasons) to have chromosome lengths
>> > readily available in a foundational package like GenomeInfoDb, so that,
>> > say,
>> >
>> > data(seqinfo.hg19)
>> > seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>> >
>> > would work without issues.  Is there any particular reason this couldn't
>> > happen for the supported/available BSgenomes?  It would seem like a
>> simple
>> > matter to do
>> >
>> > R> library(BSgenome.Hsapiens.UCSC.hg19)
>> > R> seqinfo.hg19 <- seqinfo(Hsapiens)
>> > R> save(seqinfo.hg19,
>> > file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>> >
>> > and be done with it until (say) the next release or next released
>> > BSgenome.  I considered looping through the following BSgenomes
>> myself...
>> > and if it isn't strongly opposed by (everyone) I may still do exactly
>> > that.  Seems useful, no?
>> >
>> > e.g. for the following 42 builds,
>> >
>> > grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
>> > value=TRUE)
>> >  [1] "BSgenome.Amellifera.UCSC.apiMel2"
>>  "BSgenome.Btaurus.UCSC.bosTau3"
>> >
>> >  [3] "BSgenome.Btaurus.UCSC.bosTau4"
>> "BSgenome.Btaurus.UCSC.bosTau6"
>> >
>> >  [5] "BSgenome.Btaurus.UCSC.bosTau8"  "BSgenome.Celegans.UCSC.ce10"
>> >
>> >  [7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"
>> >
>> >  [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
>> >  "BSgenome.Cfamiliaris.UCSC.canFam3"
>> > [11] "BSgenome.Dmelanogaster.UCSC.dm2"
>> >  "BSgenome.Dmelanogaster.UCSC.dm3"
>> > [13] "BSgenome.Dmelanogaster.UCSC.dm6""BSgenome.Drerio.UCSC.danRer5"
>> >
>> > [15] "BSgenome.Drerio.UCSC.danRer6"   "BSgenome.Drerio.UCSC.danRer7"
>> >
>> > [17] "BSgenome.Ecoli.NCBI.20080805"
>> > "BSgenome.Gaculeatus.UCSC.gasAcu1"
>> > [19] "BSgenome.Ggallus.UCSC.galGal3"
>> "BSgenome.Ggallus.UCSC.galGal4"
>> >
>> > [21] "BSgenome.Hsapiens.NCBI.GRCh38"  "BSgenome.Hsapiens.UCSC.hg17"
>> >
>> > [23] "BSgenome.Hsapiens.UCSC.hg18""BSgenome.Hsapiens.UCSC.hg19"
>> >
>> > [25] "BSgenome.Hsapiens.UCSC.hg38"
>> >  "BSgenome.Mfascicularis.NCBI.5.0"
>> > [27] "BSgenome.Mfuro.UCSC.musFur1"
>> "BSgenome.Mmulatta.UCSC.rheMac2"
>> >
>> > [29] "BSgenome.Mmulatta.UCSC.rheMac3" "BSgenome.Mmusculus.UCSC.mm10"
>> >
>> > [31] "BSgenome.Mmusculus.UCSC.mm8""BSgenome.Mmusculus.UCSC.mm9"
>> >
>> > [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
>> > "BSgenome.Ptroglodytes.UCSC.panTro3"
>> > [35] "BSgenome.Rnorvegicus.UCSC.rn4"
>> "BSgenome.Rnorvegicus.UCSC.rn5"
>> >
>> > [37] "BSgenome.Rnorvegicus.UCSC.rn6"
>> >  "BSgenome.Scerevisiae.UCSC.sacCer1"
>> > [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
>> >  "BSgenome.Scerevisiae.UCSC.sacCer3"
>> > [41] "BSgeno

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-03 Thread Vincent Carey
It really isn't hard to have multiple OrganismDb packages in place -- the
process of making new ones is documented and was given as an exercise in
the EdX course.  I don't know if we want to institutionalize it and
distribute such -- I think we might, so that there would be Hs19, Hs38,
mm9, etc. packages.  They have very little content, they just coordinate
interactions with packages that you'll already have.

On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. 
wrote:

> Right, I typically do that too, and if you're working on human data it
> isn't a big deal.  What makes things a lot more of a drag is when you work
> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where Mus.musculus
> is essentially a "build ahead" of Homo.sapiens.
>
> R> seqinfo(Homo.sapiens)
> Seqinfo object with 93 sequences (1 circular) from hg19 genome
>
> R> seqinfo(Mus.musculus)
> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
>
> It's not as explicit as directly assigning the seqinfo from a genome that
> corresponds to that of your annotations/results/whatever.  I know we could
> all use crossmap or liftOver or whatever, but that's not really the same,
> and it takes time, whereas assigning the proper seqinfo for relationships
> is very fast.
>
> That's all I was getting at...
>
>
> Statistics is the grammar of science.
> Karl Pearson 
>
> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey  > wrote:
>
>>  I typically get this info from Homo.sapiens.  The result is parasitic
>> on
>> the TxDb that is in there.  I don't know how easy it is to swap alternate
>> TxDb in to get a different build.  I think it would make sense to regard
>> the OrganismDb instances as foundational for this sort of structural data.
>>
>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
>> kasperdanielhan...@gmail.com> wrote:
>>
>>> Let me rephrase this slightly.  From one POV the purpose of GenomeInfoDb
>>> is
>>> clean up the seqinfo slot.  Currently it does most of the cleaning, but
>>> it
>>> does not add seqlengths.
>>>
>>> It is clear that seqlengths depends on the version of the genome, but I
>>> will argue so does the seqnames.  Of course, for human, chr22 will not
>>> change.  But what about the names of all the random contigs?  Or for
>>> other
>>> organisms, what about going from a draft genome with 10k contigs to a
>>> more
>>> completely genome assembled into fewer, larger chromosomes.
>>>
>>> I acknowledge that this information is present in the BSgenome packages,
>>> but it seems (to me) to be very appropriate to have them around for
>>> cleaning up the seqinfo slot.  For some situations it is not great to
>>> depend on 1 GB> download for something that is a few bytes.
>>>
>>> Best,
>>> Kasper
>>>
>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. 
>>> wrote:
>>>
>>> > It would be nice (for a number of reasons) to have chromosome lengths
>>> > readily available in a foundational package like GenomeInfoDb, so that,
>>> > say,
>>> >
>>> > data(seqinfo.hg19)
>>> > seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>>> >
>>> > would work without issues.  Is there any particular reason this
>>> couldn't
>>> > happen for the supported/available BSgenomes?  It would seem like a
>>> simple
>>> > matter to do
>>> >
>>> > R> library(BSgenome.Hsapiens.UCSC.hg19)
>>> > R> seqinfo.hg19 <- seqinfo(Hsapiens)
>>> > R> save(seqinfo.hg19,
>>> > file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>>> >
>>> > and be done with it until (say) the next release or next released
>>> > BSgenome.  I considered looping through the following BSgenomes
>>> myself...
>>> > and if it isn't strongly opposed by (everyone) I may still do exactly
>>> > that.  Seems useful, no?
>>> >
>>> > e.g. for the following 42 builds,
>>> >
>>> > grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
>>> > value=TRUE)
>>> >  [1] "BSgenome.Amellifera.UCSC.apiMel2"
>>>  "BSgenome.Btaurus.UCSC.bosTau3"
>>> >
>>> >  [3] "BSgenome.Btaurus.UCSC.bosTau4"
>>> "BSgenome.Btaurus.UCSC.bosTau6"
>>> >
>>> >  [5] "BSgenome.Btaurus.UCSC.bosTau8"  "BSgenome.Celegans.UCSC.ce10"
>>> >
>>> >  [7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"
>>> >
>>> >  [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
>>> >  "BSgenome.Cfamiliaris.UCSC.canFam3"
>>> > [11] "BSgenome.Dmelanogaster.UCSC.dm2"
>>> >  "BSgenome.Dmelanogaster.UCSC.dm3"
>>> > [13] "BSgenome.Dmelanogaster.UCSC.dm6"
>>> "BSgenome.Drerio.UCSC.danRer5"
>>> >
>>> > [15] "BSgenome.Drerio.UCSC.danRer6"
>>>  "BSgenome.Drerio.UCSC.danRer7"
>>> >
>>> > [17] "BSgenome.Ecoli.NCBI.20080805"
>>> > "BSgenome.Gaculeatus.UCSC.gasAcu1"
>>> > [19] "BSgenome.Ggallus.UCSC.galGal3"
>>> "BSgenome.Ggallus.UCSC.galGal4"
>>> >
>>> > [21] "BSgenome.Hsapiens.NCBI.GRCh38"  "BSgenome.Hsapiens.UCSC.hg17"
>>> >
>>> > [23] "BSgenome.Hsapiens.UCSC.hg18""BSgenome.Hsapiens.UCSC.hg19"
>>> >
>>> > [25] "BSgenome.Hsapiens.UCSC.hg38"
>>> >  "BSgenome.Mfascicula

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-03 Thread Tim Triche, Jr.
That would be perfect actually.  And it would radically reduce & modularize 
maintenance.  Maybe that's the best way to go after all.  Quite sensible. 

--t

> On Jun 3, 2015, at 12:46 PM, Vincent Carey  wrote:
> 
> It really isn't hard to have multiple OrganismDb packages in place -- the
> process of making new ones is documented and was given as an exercise in
> the EdX course.  I don't know if we want to institutionalize it and
> distribute such -- I think we might, so that there would be Hs19, Hs38,
> mm9, etc. packages.  They have very little content, they just coordinate
> interactions with packages that you'll already have.
> 
> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. 
> wrote:
> 
>> Right, I typically do that too, and if you're working on human data it
>> isn't a big deal.  What makes things a lot more of a drag is when you work
>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where Mus.musculus
>> is essentially a "build ahead" of Homo.sapiens.
>> 
>> R> seqinfo(Homo.sapiens)
>> Seqinfo object with 93 sequences (1 circular) from hg19 genome
>> 
>> R> seqinfo(Mus.musculus)
>> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
>> 
>> It's not as explicit as directly assigning the seqinfo from a genome that
>> corresponds to that of your annotations/results/whatever.  I know we could
>> all use crossmap or liftOver or whatever, but that's not really the same,
>> and it takes time, whereas assigning the proper seqinfo for relationships
>> is very fast.
>> 
>> That's all I was getting at...
>> 
>> 
>> Statistics is the grammar of science.
>> Karl Pearson 
>> 
>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey >> wrote:
>> 
>>> I typically get this info from Homo.sapiens.  The result is parasitic
>>> on
>>> the TxDb that is in there.  I don't know how easy it is to swap alternate
>>> TxDb in to get a different build.  I think it would make sense to regard
>>> the OrganismDb instances as foundational for this sort of structural data.
>>> 
>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
>>> kasperdanielhan...@gmail.com> wrote:
>>> 
 Let me rephrase this slightly.  From one POV the purpose of GenomeInfoDb
 is
 clean up the seqinfo slot.  Currently it does most of the cleaning, but
 it
 does not add seqlengths.
 
 It is clear that seqlengths depends on the version of the genome, but I
 will argue so does the seqnames.  Of course, for human, chr22 will not
 change.  But what about the names of all the random contigs?  Or for
 other
 organisms, what about going from a draft genome with 10k contigs to a
 more
 completely genome assembled into fewer, larger chromosomes.
 
 I acknowledge that this information is present in the BSgenome packages,
 but it seems (to me) to be very appropriate to have them around for
 cleaning up the seqinfo slot.  For some situations it is not great to
 depend on 1 GB> download for something that is a few bytes.
 
 Best,
 Kasper
 
 On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. 
 wrote:
 
> It would be nice (for a number of reasons) to have chromosome lengths
> readily available in a foundational package like GenomeInfoDb, so that,
> say,
> 
> data(seqinfo.hg19)
> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
> 
> would work without issues.  Is there any particular reason this
 couldn't
> happen for the supported/available BSgenomes?  It would seem like a
 simple
> matter to do
> 
> R> library(BSgenome.Hsapiens.UCSC.hg19)
> R> seqinfo.hg19 <- seqinfo(Hsapiens)
> R> save(seqinfo.hg19,
> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
> 
> and be done with it until (say) the next release or next released
> BSgenome.  I considered looping through the following BSgenomes
 myself...
> and if it isn't strongly opposed by (everyone) I may still do exactly
> that.  Seems useful, no?
> 
> e.g. for the following 42 builds,
> 
> grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
> value=TRUE)
> [1] "BSgenome.Amellifera.UCSC.apiMel2"
 "BSgenome.Btaurus.UCSC.bosTau3"
> 
> [3] "BSgenome.Btaurus.UCSC.bosTau4"
 "BSgenome.Btaurus.UCSC.bosTau6"
> 
> [5] "BSgenome.Btaurus.UCSC.bosTau8"  "BSgenome.Celegans.UCSC.ce10"
> 
> [7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"
> 
> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
> "BSgenome.Cfamiliaris.UCSC.canFam3"
> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
> "BSgenome.Dmelanogaster.UCSC.dm3"
> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
 "BSgenome.Drerio.UCSC.danRer5"
> 
> [15] "BSgenome.Drerio.UCSC.danRer6"
 "BSgenome.Drerio.UCSC.danRer7"
> 
> [17] "BSgenome.Ecoli.NCBI.20080805"
> "BSgenome.Gaculeatus.UCSC.gasAcu1"

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-04 Thread Hervé Pagès

Hi,

FWIW I started to work on supporting quick generation of a standalone
Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.

It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
and sacCer2. I'll add more.

See ?Seqinfo for some examples.

Right now it fetches the information from internet every time you
call it but maybe we should just store that information in the
GenomeInfoDb package as Tim suggested?

H.

On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:

That would be perfect actually.  And it would radically reduce & modularize 
maintenance.  Maybe that's the best way to go after all.  Quite sensible.

--t


On Jun 3, 2015, at 12:46 PM, Vincent Carey  wrote:

It really isn't hard to have multiple OrganismDb packages in place -- the
process of making new ones is documented and was given as an exercise in
the EdX course.  I don't know if we want to institutionalize it and
distribute such -- I think we might, so that there would be Hs19, Hs38,
mm9, etc. packages.  They have very little content, they just coordinate
interactions with packages that you'll already have.

On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. 
wrote:


Right, I typically do that too, and if you're working on human data it
isn't a big deal.  What makes things a lot more of a drag is when you work
on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where Mus.musculus
is essentially a "build ahead" of Homo.sapiens.

R> seqinfo(Homo.sapiens)
Seqinfo object with 93 sequences (1 circular) from hg19 genome

R> seqinfo(Mus.musculus)
Seqinfo object with 66 sequences (1 circular) from mm10 genome:

It's not as explicit as directly assigning the seqinfo from a genome that
corresponds to that of your annotations/results/whatever.  I know we could
all use crossmap or liftOver or whatever, but that's not really the same,
and it takes time, whereas assigning the proper seqinfo for relationships
is very fast.

That's all I was getting at...


Statistics is the grammar of science.
Karl Pearson 

On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey 
wrote:



I typically get this info from Homo.sapiens.  The result is parasitic
on
the TxDb that is in there.  I don't know how easy it is to swap alternate
TxDb in to get a different build.  I think it would make sense to regard
the OrganismDb instances as foundational for this sort of structural data.

On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:


Let me rephrase this slightly.  From one POV the purpose of GenomeInfoDb
is
clean up the seqinfo slot.  Currently it does most of the cleaning, but
it
does not add seqlengths.

It is clear that seqlengths depends on the version of the genome, but I
will argue so does the seqnames.  Of course, for human, chr22 will not
change.  But what about the names of all the random contigs?  Or for
other
organisms, what about going from a draft genome with 10k contigs to a
more
completely genome assembled into fewer, larger chromosomes.

I acknowledge that this information is present in the BSgenome packages,
but it seems (to me) to be very appropriate to have them around for
cleaning up the seqinfo slot.  For some situations it is not great to
depend on 1 GB> download for something that is a few bytes.

Best,
Kasper

On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. 
wrote:


It would be nice (for a number of reasons) to have chromosome lengths
readily available in a foundational package like GenomeInfoDb, so that,
say,

data(seqinfo.hg19)
seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]

would work without issues.  Is there any particular reason this

couldn't

happen for the supported/available BSgenomes?  It would seem like a

simple

matter to do

R> library(BSgenome.Hsapiens.UCSC.hg19)
R> seqinfo.hg19 <- seqinfo(Hsapiens)
R> save(seqinfo.hg19,
file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")

and be done with it until (say) the next release or next released
BSgenome.  I considered looping through the following BSgenomes

myself...

and if it isn't strongly opposed by (everyone) I may still do exactly
that.  Seems useful, no?

e.g. for the following 42 builds,

grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
value=TRUE)
[1] "BSgenome.Amellifera.UCSC.apiMel2"

"BSgenome.Btaurus.UCSC.bosTau3"


[3] "BSgenome.Btaurus.UCSC.bosTau4"

"BSgenome.Btaurus.UCSC.bosTau6"


[5] "BSgenome.Btaurus.UCSC.bosTau8"  "BSgenome.Celegans.UCSC.ce10"

[7] "BSgenome.Celegans.UCSC.ce2" "BSgenome.Celegans.UCSC.ce6"

[9] "BSgenome.Cfamiliaris.UCSC.canFam2"
"BSgenome.Cfamiliaris.UCSC.canFam3"
[11] "BSgenome.Dmelanogaster.UCSC.dm2"
"BSgenome.Dmelanogaster.UCSC.dm3"
[13] "BSgenome.Dmelanogaster.UCSC.dm6"

"BSgenome.Drerio.UCSC.dan

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-04 Thread Kasper Daniel Hansen
Herve,

This sounds great.  I do think it would be highly advantageous to not use
the internet.  For example, you could imagine using this function everytime
a Granges() is created, and it would be useful not to have internet access
for this.

Best,
Kasper

On Thu, Jun 4, 2015 at 7:28 PM, Hervé Pagès  wrote:

> Hi,
>
> FWIW I started to work on supporting quick generation of a standalone
> Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>
> It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
> bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
> mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
> gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
> and sacCer2. I'll add more.
>
> See ?Seqinfo for some examples.
>
> Right now it fetches the information from internet every time you
> call it but maybe we should just store that information in the
> GenomeInfoDb package as Tim suggested?
>
> H.
>
>
> On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
>
>> That would be perfect actually.  And it would radically reduce &
>> modularize maintenance.  Maybe that's the best way to go after all.  Quite
>> sensible.
>>
>> --t
>>
>>  On Jun 3, 2015, at 12:46 PM, Vincent Carey 
>>> wrote:
>>>
>>> It really isn't hard to have multiple OrganismDb packages in place -- the
>>> process of making new ones is documented and was given as an exercise in
>>> the EdX course.  I don't know if we want to institutionalize it and
>>> distribute such -- I think we might, so that there would be Hs19, Hs38,
>>> mm9, etc. packages.  They have very little content, they just coordinate
>>> interactions with packages that you'll already have.
>>>
>>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. 
>>> wrote:
>>>
>>>  Right, I typically do that too, and if you're working on human data it
 isn't a big deal.  What makes things a lot more of a drag is when you
 work
 on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
 Mus.musculus
 is essentially a "build ahead" of Homo.sapiens.

 R> seqinfo(Homo.sapiens)
 Seqinfo object with 93 sequences (1 circular) from hg19 genome

 R> seqinfo(Mus.musculus)
 Seqinfo object with 66 sequences (1 circular) from mm10 genome:

 It's not as explicit as directly assigning the seqinfo from a genome
 that
 corresponds to that of your annotations/results/whatever.  I know we
 could
 all use crossmap or liftOver or whatever, but that's not really the
 same,
 and it takes time, whereas assigning the proper seqinfo for
 relationships
 is very fast.

 That's all I was getting at...


 Statistics is the grammar of science.
 Karl Pearson 

 On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey <
 st...@channing.harvard.edu

> wrote:
>

  I typically get this info from Homo.sapiens.  The result is parasitic
> on
> the TxDb that is in there.  I don't know how easy it is to swap
> alternate
> TxDb in to get a different build.  I think it would make sense to
> regard
> the OrganismDb instances as foundational for this sort of structural
> data.
>
> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
> kasperdanielhan...@gmail.com> wrote:
>
>  Let me rephrase this slightly.  From one POV the purpose of
>> GenomeInfoDb
>> is
>> clean up the seqinfo slot.  Currently it does most of the cleaning,
>> but
>> it
>> does not add seqlengths.
>>
>> It is clear that seqlengths depends on the version of the genome, but
>> I
>> will argue so does the seqnames.  Of course, for human, chr22 will not
>> change.  But what about the names of all the random contigs?  Or for
>> other
>> organisms, what about going from a draft genome with 10k contigs to a
>> more
>> completely genome assembled into fewer, larger chromosomes.
>>
>> I acknowledge that this information is present in the BSgenome
>> packages,
>> but it seems (to me) to be very appropriate to have them around for
>> cleaning up the seqinfo slot.  For some situations it is not great to
>> depend on 1 GB> download for something that is a few bytes.
>>
>> Best,
>> Kasper
>>
>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. > >
>> wrote:
>>
>>  It would be nice (for a number of reasons) to have chromosome lengths
>>> readily available in a foundational package like GenomeInfoDb, so
>>> that,
>>> say,
>>>
>>> data(seqinfo.hg19)
>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>>>
>>> would work without issues.  Is there any particular reason this
>>>
>> couldn't
>>
>>> happen for the supported/available BSgenomes?  It would seem like a
>>>
>> simple
>>
>>> matter to do
>>>
>>>

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-04 Thread Michael Lawrence
Maybe this could eventually support setting the seqinfo with:

genome(gr) <- "hg19"

Or is that being too clever?

On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès  wrote:
> Hi,
>
> FWIW I started to work on supporting quick generation of a standalone
> Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>
> It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
> bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
> mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
> gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
> and sacCer2. I'll add more.
>
> See ?Seqinfo for some examples.
>
> Right now it fetches the information from internet every time you
> call it but maybe we should just store that information in the
> GenomeInfoDb package as Tim suggested?
>
> H.
>
>
> On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
>>
>> That would be perfect actually.  And it would radically reduce &
>> modularize maintenance.  Maybe that's the best way to go after all.  Quite
>> sensible.
>>
>> --t
>>
>>> On Jun 3, 2015, at 12:46 PM, Vincent Carey 
>>> wrote:
>>>
>>> It really isn't hard to have multiple OrganismDb packages in place -- the
>>> process of making new ones is documented and was given as an exercise in
>>> the EdX course.  I don't know if we want to institutionalize it and
>>> distribute such -- I think we might, so that there would be Hs19, Hs38,
>>> mm9, etc. packages.  They have very little content, they just coordinate
>>> interactions with packages that you'll already have.
>>>
>>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. 
>>> wrote:
>>>
 Right, I typically do that too, and if you're working on human data it
 isn't a big deal.  What makes things a lot more of a drag is when you
 work
 on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
 Mus.musculus
 is essentially a "build ahead" of Homo.sapiens.

 R> seqinfo(Homo.sapiens)
 Seqinfo object with 93 sequences (1 circular) from hg19 genome

 R> seqinfo(Mus.musculus)
 Seqinfo object with 66 sequences (1 circular) from mm10 genome:

 It's not as explicit as directly assigning the seqinfo from a genome
 that
 corresponds to that of your annotations/results/whatever.  I know we
 could
 all use crossmap or liftOver or whatever, but that's not really the
 same,
 and it takes time, whereas assigning the proper seqinfo for
 relationships
 is very fast.

 That's all I was getting at...


 Statistics is the grammar of science.
 Karl Pearson 

 On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
 
> wrote:


> I typically get this info from Homo.sapiens.  The result is parasitic
> on
> the TxDb that is in there.  I don't know how easy it is to swap
> alternate
> TxDb in to get a different build.  I think it would make sense to
> regard
> the OrganismDb instances as foundational for this sort of structural
> data.
>
> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
> kasperdanielhan...@gmail.com> wrote:
>
>> Let me rephrase this slightly.  From one POV the purpose of
>> GenomeInfoDb
>> is
>> clean up the seqinfo slot.  Currently it does most of the cleaning,
>> but
>> it
>> does not add seqlengths.
>>
>> It is clear that seqlengths depends on the version of the genome, but
>> I
>> will argue so does the seqnames.  Of course, for human, chr22 will not
>> change.  But what about the names of all the random contigs?  Or for
>> other
>> organisms, what about going from a draft genome with 10k contigs to a
>> more
>> completely genome assembled into fewer, larger chromosomes.
>>
>> I acknowledge that this information is present in the BSgenome
>> packages,
>> but it seems (to me) to be very appropriate to have them around for
>> cleaning up the seqinfo slot.  For some situations it is not great to
>> depend on 1 GB> download for something that is a few bytes.
>>
>> Best,
>> Kasper
>>
>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. 
>> wrote:
>>
>>> It would be nice (for a number of reasons) to have chromosome lengths
>>> readily available in a foundational package like GenomeInfoDb, so
>>> that,
>>> say,
>>>
>>> data(seqinfo.hg19)
>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>>>
>>> would work without issues.  Is there any particular reason this
>>
>> couldn't
>>>
>>> happen for the supported/available BSgenomes?  It would seem like a
>>
>> simple
>>>
>>> matter to do
>>>
>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
>>> R> save(seqinfo.hg19,
>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rd

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-04 Thread Tim Triche, Jr.
that's kind of always been my goal...


Statistics is the grammar of science.
Karl Pearson 

On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence 
wrote:

> Maybe this could eventually support setting the seqinfo with:
>
> genome(gr) <- "hg19"
>
> Or is that being too clever?
>
> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès  wrote:
> > Hi,
> >
> > FWIW I started to work on supporting quick generation of a standalone
> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
> >
> > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
> > and sacCer2. I'll add more.
> >
> > See ?Seqinfo for some examples.
> >
> > Right now it fetches the information from internet every time you
> > call it but maybe we should just store that information in the
> > GenomeInfoDb package as Tim suggested?
> >
> > H.
> >
> >
> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
> >>
> >> That would be perfect actually.  And it would radically reduce &
> >> modularize maintenance.  Maybe that's the best way to go after all.
> Quite
> >> sensible.
> >>
> >> --t
> >>
> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey  >
> >>> wrote:
> >>>
> >>> It really isn't hard to have multiple OrganismDb packages in place --
> the
> >>> process of making new ones is documented and was given as an exercise
> in
> >>> the EdX course.  I don't know if we want to institutionalize it and
> >>> distribute such -- I think we might, so that there would be Hs19, Hs38,
> >>> mm9, etc. packages.  They have very little content, they just
> coordinate
> >>> interactions with packages that you'll already have.
> >>>
> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. 
> >>> wrote:
> >>>
>  Right, I typically do that too, and if you're working on human data it
>  isn't a big deal.  What makes things a lot more of a drag is when you
>  work
>  on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
>  Mus.musculus
>  is essentially a "build ahead" of Homo.sapiens.
> 
>  R> seqinfo(Homo.sapiens)
>  Seqinfo object with 93 sequences (1 circular) from hg19 genome
> 
>  R> seqinfo(Mus.musculus)
>  Seqinfo object with 66 sequences (1 circular) from mm10 genome:
> 
>  It's not as explicit as directly assigning the seqinfo from a genome
>  that
>  corresponds to that of your annotations/results/whatever.  I know we
>  could
>  all use crossmap or liftOver or whatever, but that's not really the
>  same,
>  and it takes time, whereas assigning the proper seqinfo for
>  relationships
>  is very fast.
> 
>  That's all I was getting at...
> 
> 
>  Statistics is the grammar of science.
>  Karl Pearson 
> 
>  On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
>   >
> > wrote:
> 
> 
> > I typically get this info from Homo.sapiens.  The result is parasitic
> > on
> > the TxDb that is in there.  I don't know how easy it is to swap
> > alternate
> > TxDb in to get a different build.  I think it would make sense to
> > regard
> > the OrganismDb instances as foundational for this sort of structural
> > data.
> >
> > On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
> > kasperdanielhan...@gmail.com> wrote:
> >
> >> Let me rephrase this slightly.  From one POV the purpose of
> >> GenomeInfoDb
> >> is
> >> clean up the seqinfo slot.  Currently it does most of the cleaning,
> >> but
> >> it
> >> does not add seqlengths.
> >>
> >> It is clear that seqlengths depends on the version of the genome,
> but
> >> I
> >> will argue so does the seqnames.  Of course, for human, chr22 will
> not
> >> change.  But what about the names of all the random contigs?  Or for
> >> other
> >> organisms, what about going from a draft genome with 10k contigs to
> a
> >> more
> >> completely genome assembled into fewer, larger chromosomes.
> >>
> >> I acknowledge that this information is present in the BSgenome
> >> packages,
> >> but it seems (to me) to be very appropriate to have them around for
> >> cleaning up the seqinfo slot.  For some situations it is not great
> to
> >> depend on 1 GB> download for something that is a few bytes.
> >>
> >> Best,
> >> Kasper
> >>
> >> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. <
> tim.tri...@gmail.com>
> >> wrote:
> >>
> >>> It would be nice (for a number of reasons) to have chromosome
> lengths
> >>> readily available in a foundational package like GenomeInfoDb, so
> >>> that,
> >>> say,
> >>>
> >>> data(seqi

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-04 Thread Hervé Pagès

I also think that we're heading towards something like that.

So genome(gr) <- "hg19" would:

  (a) Add any missing information to the seqinfo.
  (b) Sort the seqlevels in "canonical" order.
  (c) Change the seqlevels style to UCSC style if they are not.

The 3 tasks are orthogonal. I guess most of the times people want
an easy way to perform them all at once.

We could easily support (a) and (b). This assumes that the current
seqlevels are already valid hg19 seqlevels:

  si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
  gr1 <- GRanges(seqinfo=si1)
  hg19_si <- Seqinfo(genome="hg19")

  ## (a):
  seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
  seqinfo(gr1)
  # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
  #   seqnames   seqlengths isCircular genome
  #   chrX155270560  FALSE   hg19
  #   chrUn_gl000249  38502  FALSE   hg19
  #   chr2243199373  FALSE   hg19
  #   chr6_cox_hap2 4795371  FALSE   hg19

  ## (b):
  seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
  seqinfo(gr1)
  # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
  #   seqnames   seqlengths isCircular genome
  #   chr2243199373  FALSE   hg19
  #   chrX155270560  FALSE   hg19
  #   chr6_cox_hap2 4795371  FALSE   hg19
  #   chrUn_gl000249  38502  FALSE   hg19

(c) is harder because seqlevelsStyle() doesn't know how to rename
scaffolds yet:

  si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", 
"HSCHR6_MHC_COX_CTG1"))

  gr2 <- GRanges(seqinfo=si2)

  seqlevelsStyle(gr2)
  # [1] "NCBI"

  seqlevelsStyle(gr2) <- "UCSC"
  seqlevels(gr2)
  # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2" 


  # [4] "HSCHR6_MHC_COX_CTG1"

So we need to work on this.

I'm not sure about using genome(gr) <- "hg19" for this. Right now
it sets the genome column of the seqinfo with the supplied string
and nothing else. Aren't there valid use cases for this? What about
using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
whole seqinfo component actually gets filled.

H.

On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:

that's kind of always been my goal...


Statistics is the grammar of science.
Karl Pearson 

On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
mailto:lawrence.mich...@gene.com>> wrote:

Maybe this could eventually support setting the seqinfo with:

genome(gr) <- "hg19"

Or is that being too clever?

On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès mailto:hpa...@fredhutch.org>> wrote:
 > Hi,
 >
 > FWIW I started to work on supporting quick generation of a standalone
 > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
 >
 > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
 > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
 > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
 > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
 > and sacCer2. I'll add more.
 >
 > See ?Seqinfo for some examples.
 >
 > Right now it fetches the information from internet every time you
 > call it but maybe we should just store that information in the
 > GenomeInfoDb package as Tim suggested?
 >
 > H.
 >
 >
 > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
 >>
 >> That would be perfect actually.  And it would radically reduce &
 >> modularize maintenance.  Maybe that's the best way to go after
all.  Quite
 >> sensible.
 >>
 >> --t
 >>
 >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
mailto:st...@channing.harvard.edu>>
 >>> wrote:
 >>>
 >>> It really isn't hard to have multiple OrganismDb packages in
place -- the
 >>> process of making new ones is documented and was given as an
exercise in
 >>> the EdX course.  I don't know if we want to institutionalize it and
 >>> distribute such -- I think we might, so that there would be
Hs19, Hs38,
 >>> mm9, etc. packages.  They have very little content, they just
coordinate
 >>> interactions with packages that you'll already have.
 >>>
 >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
mailto:tim.tri...@gmail.com>>
 >>> wrote:
 >>>
  Right, I typically do that too, and if you're working on human
data it
  isn't a big deal.  What makes things a lot more of a drag is
when you
  work
  on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
  Mus.musculus
  is essentially a "build ahead" of Homo.sapiens.
 
  R> seqinfo(Homo.sapiens)
  Seqinfo object with 93 sequences (1 circular) from hg19 genome
 
  R> seqinfo(Mus.musculus)
  Seqinfo object with 66 sequences (1 circular) from mm10 genome:
 
 

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Michael Lawrence
On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès  wrote:
> I also think that we're heading towards something like that.
>
> So genome(gr) <- "hg19" would:
>
>   (a) Add any missing information to the seqinfo.
>   (b) Sort the seqlevels in "canonical" order.
>   (c) Change the seqlevels style to UCSC style if they are not.
>
> The 3 tasks are orthogonal. I guess most of the times people want
> an easy way to perform them all at once.
>
> We could easily support (a) and (b). This assumes that the current
> seqlevels are already valid hg19 seqlevels:
>
>   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
>   gr1 <- GRanges(seqinfo=si1)
>   hg19_si <- Seqinfo(genome="hg19")
>
>   ## (a):
>   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
>   seqinfo(gr1)
>   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>   #   seqnames   seqlengths isCircular genome
>   #   chrX155270560  FALSE   hg19
>   #   chrUn_gl000249  38502  FALSE   hg19
>   #   chr2243199373  FALSE   hg19
>   #   chr6_cox_hap2 4795371  FALSE   hg19
>
>   ## (b):
>   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
>   seqinfo(gr1)
>   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>   #   seqnames   seqlengths isCircular genome
>   #   chr2243199373  FALSE   hg19
>   #   chrX155270560  FALSE   hg19
>   #   chr6_cox_hap2 4795371  FALSE   hg19
>   #   chrUn_gl000249  38502  FALSE   hg19
>
> (c) is harder because seqlevelsStyle() doesn't know how to rename
> scaffolds yet:
>
>   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", "HSCHR6_MHC_COX_CTG1"))
>   gr2 <- GRanges(seqinfo=si2)
>
>   seqlevelsStyle(gr2)
>   # [1] "NCBI"
>
>   seqlevelsStyle(gr2) <- "UCSC"
>   seqlevels(gr2)
>   # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
>   # [4] "HSCHR6_MHC_COX_CTG1"
>
> So we need to work on this.
>
> I'm not sure about using genome(gr) <- "hg19" for this. Right now
> it sets the genome column of the seqinfo with the supplied string
> and nothing else. Aren't there valid use cases for this?

Not sure. People would almost always want the seqname style and order
to be consistent with the given genome.

> What about
> using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
> whole seqinfo component actually gets filled.
>

Yea, but "genome" is so intuitive compared to "seqinfo".



> H.
>
> On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
>>
>> that's kind of always been my goal...
>>
>>
>> Statistics is the grammar of science.
>> Karl Pearson 
>>
>> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
>> mailto:lawrence.mich...@gene.com>> wrote:
>>
>> Maybe this could eventually support setting the seqinfo with:
>>
>> genome(gr) <- "hg19"
>>
>> Or is that being too clever?
>>
>> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès > > wrote:
>>  > Hi,
>>  >
>>  > FWIW I started to work on supporting quick generation of a
>> standalone
>>  > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>>  >
>>  > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
>>  > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
>> mm10,
>>  > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
>> galGal3,
>>  > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
>>  > and sacCer2. I'll add more.
>>  >
>>  > See ?Seqinfo for some examples.
>>  >
>>  > Right now it fetches the information from internet every time you
>>  > call it but maybe we should just store that information in the
>>  > GenomeInfoDb package as Tim suggested?
>>  >
>>  > H.
>>  >
>>  >
>>  > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
>>  >>
>>  >> That would be perfect actually.  And it would radically reduce &
>>  >> modularize maintenance.  Maybe that's the best way to go after
>> all.  Quite
>>  >> sensible.
>>  >>
>>  >> --t
>>  >>
>>  >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
>> mailto:st...@channing.harvard.edu>>
>>  >>> wrote:
>>  >>>
>>  >>> It really isn't hard to have multiple OrganismDb packages in
>> place -- the
>>  >>> process of making new ones is documented and was given as an
>> exercise in
>>  >>> the EdX course.  I don't know if we want to institutionalize it
>> and
>>  >>> distribute such -- I think we might, so that there would be
>> Hs19, Hs38,
>>  >>> mm9, etc. packages.  They have very little content, they just
>> coordinate
>>  >>> interactions with packages that you'll already have.
>>  >>>
>>  >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
>> mailto:tim.tri...@gmail.com>>
>>
>>  >>> wrote:
>>  >>>
>>   Right, I typically do that to

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Hervé Pagès

On 06/05/2015 11:43 AM, Michael Lawrence wrote:

On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès  wrote:

I also think that we're heading towards something like that.

So genome(gr) <- "hg19" would:

   (a) Add any missing information to the seqinfo.
   (b) Sort the seqlevels in "canonical" order.
   (c) Change the seqlevels style to UCSC style if they are not.

The 3 tasks are orthogonal. I guess most of the times people want
an easy way to perform them all at once.

We could easily support (a) and (b). This assumes that the current
seqlevels are already valid hg19 seqlevels:

   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
   gr1 <- GRanges(seqinfo=si1)
   hg19_si <- Seqinfo(genome="hg19")

   ## (a):
   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
   seqinfo(gr1)
   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
   #   seqnames   seqlengths isCircular genome
   #   chrX155270560  FALSE   hg19
   #   chrUn_gl000249  38502  FALSE   hg19
   #   chr2243199373  FALSE   hg19
   #   chr6_cox_hap2 4795371  FALSE   hg19

   ## (b):
   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
   seqinfo(gr1)
   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
   #   seqnames   seqlengths isCircular genome
   #   chr2243199373  FALSE   hg19
   #   chrX155270560  FALSE   hg19
   #   chr6_cox_hap2 4795371  FALSE   hg19
   #   chrUn_gl000249  38502  FALSE   hg19

(c) is harder because seqlevelsStyle() doesn't know how to rename
scaffolds yet:

   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", "HSCHR6_MHC_COX_CTG1"))
   gr2 <- GRanges(seqinfo=si2)

   seqlevelsStyle(gr2)
   # [1] "NCBI"

   seqlevelsStyle(gr2) <- "UCSC"
   seqlevels(gr2)
   # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
   # [4] "HSCHR6_MHC_COX_CTG1"

So we need to work on this.

I'm not sure about using genome(gr) <- "hg19" for this. Right now
it sets the genome column of the seqinfo with the supplied string
and nothing else. Aren't there valid use cases for this?


Not sure. People would almost always want the seqname style and order
to be consistent with the given genome.


Agreed but my worry is that when they don't, then they would be left
with no way to just set the genome column.

H.




What about
using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
whole seqinfo component actually gets filled.



Yea, but "genome" is so intuitive compared to "seqinfo".




H.

On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:


that's kind of always been my goal...


Statistics is the grammar of science.
Karl Pearson 

On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
mailto:lawrence.mich...@gene.com>> wrote:

 Maybe this could eventually support setting the seqinfo with:

 genome(gr) <- "hg19"

 Or is that being too clever?

 On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès mailto:hpa...@fredhutch.org>> wrote:
  > Hi,
  >
  > FWIW I started to work on supporting quick generation of a
standalone
  > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
  >
  > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
  > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
mm10,
  > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
galGal3,
  > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
  > and sacCer2. I'll add more.
  >
  > See ?Seqinfo for some examples.
  >
  > Right now it fetches the information from internet every time you
  > call it but maybe we should just store that information in the
  > GenomeInfoDb package as Tim suggested?
  >
  > H.
  >
  >
  > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
  >>
  >> That would be perfect actually.  And it would radically reduce &
  >> modularize maintenance.  Maybe that's the best way to go after
 all.  Quite
  >> sensible.
  >>
  >> --t
  >>
  >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
 mailto:st...@channing.harvard.edu>>
  >>> wrote:
  >>>
  >>> It really isn't hard to have multiple OrganismDb packages in
 place -- the
  >>> process of making new ones is documented and was given as an
 exercise in
  >>> the EdX course.  I don't know if we want to institutionalize it
and
  >>> distribute such -- I think we might, so that there would be
 Hs19, Hs38,
  >>> mm9, etc. packages.  They have very little content, they just
 coordinate
  >>> interactions with packages that you'll already have.
  >>>
  >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
 mailto:tim.tri...@gmail.com>>

  >>> wrote:
  >>>
   Right, I typically do that too, and if you're working on human
 data it
   isn't

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Gabe Becker
Herve,

This is probably a naive question, but what usecases are there for creating
an object with the wrong seqinfo for its genome?

~G

On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence  wrote:

> On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès  wrote:
> > I also think that we're heading towards something like that.
> >
> > So genome(gr) <- "hg19" would:
> >
> >   (a) Add any missing information to the seqinfo.
> >   (b) Sort the seqlevels in "canonical" order.
> >   (c) Change the seqlevels style to UCSC style if they are not.
> >
> > The 3 tasks are orthogonal. I guess most of the times people want
> > an easy way to perform them all at once.
> >
> > We could easily support (a) and (b). This assumes that the current
> > seqlevels are already valid hg19 seqlevels:
> >
> >   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
> >   gr1 <- GRanges(seqinfo=si1)
> >   hg19_si <- Seqinfo(genome="hg19")
> >
> >   ## (a):
> >   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
> >   seqinfo(gr1)
> >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> >   #   seqnames   seqlengths isCircular genome
> >   #   chrX155270560  FALSE   hg19
> >   #   chrUn_gl000249  38502  FALSE   hg19
> >   #   chr2243199373  FALSE   hg19
> >   #   chr6_cox_hap2 4795371  FALSE   hg19
> >
> >   ## (b):
> >   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
> >   seqinfo(gr1)
> >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> >   #   seqnames   seqlengths isCircular genome
> >   #   chr2243199373  FALSE   hg19
> >   #   chrX155270560  FALSE   hg19
> >   #   chr6_cox_hap2 4795371  FALSE   hg19
> >   #   chrUn_gl000249  38502  FALSE   hg19
> >
> > (c) is harder because seqlevelsStyle() doesn't know how to rename
> > scaffolds yet:
> >
> >   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
> "HSCHR6_MHC_COX_CTG1"))
> >   gr2 <- GRanges(seqinfo=si2)
> >
> >   seqlevelsStyle(gr2)
> >   # [1] "NCBI"
> >
> >   seqlevelsStyle(gr2) <- "UCSC"
> >   seqlevels(gr2)
> >   # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
> >   # [4] "HSCHR6_MHC_COX_CTG1"
> >
> > So we need to work on this.
> >
> > I'm not sure about using genome(gr) <- "hg19" for this. Right now
> > it sets the genome column of the seqinfo with the supplied string
> > and nothing else. Aren't there valid use cases for this?
>
> Not sure. People would almost always want the seqname style and order
> to be consistent with the given genome.
>
> > What about
> > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
> > whole seqinfo component actually gets filled.
> >
>
> Yea, but "genome" is so intuitive compared to "seqinfo".
>
>
>
> > H.
> >
> > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
> >>
> >> that's kind of always been my goal...
> >>
> >>
> >> Statistics is the grammar of science.
> >> Karl Pearson 
> >>
> >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
> >> mailto:lawrence.mich...@gene.com>> wrote:
> >>
> >> Maybe this could eventually support setting the seqinfo with:
> >>
> >> genome(gr) <- "hg19"
> >>
> >> Or is that being too clever?
> >>
> >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès  >> > wrote:
> >>  > Hi,
> >>  >
> >>  > FWIW I started to work on supporting quick generation of a
> >> standalone
> >>  > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
> >>  >
> >>  > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
> >>  > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
> >> mm10,
> >>  > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
> >> galGal3,
> >>  > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2,
> sacCer3,
> >>  > and sacCer2. I'll add more.
> >>  >
> >>  > See ?Seqinfo for some examples.
> >>  >
> >>  > Right now it fetches the information from internet every time you
> >>  > call it but maybe we should just store that information in the
> >>  > GenomeInfoDb package as Tim suggested?
> >>  >
> >>  > H.
> >>  >
> >>  >
> >>  > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
> >>  >>
> >>  >> That would be perfect actually.  And it would radically reduce &
> >>  >> modularize maintenance.  Maybe that's the best way to go after
> >> all.  Quite
> >>  >> sensible.
> >>  >>
> >>  >> --t
> >>  >>
> >>  >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
> >> mailto:st...@channing.harvard.edu>>
> >>  >>> wrote:
> >>  >>>
> >>  >>> It really isn't hard to have multiple OrganismDb packages in
> >> place -- the
> >>  >>> process of making new ones is documented and was given as an
> >> exercise in
> >>  >>> the EdX course.  I don't know if we want to

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Kasper Daniel Hansen
In WGBS we frequently sequence a human with spikein from the lambda
genome.  In this case, most of the chromosomes of the Granges are from
human, except one.  This is a usecase where genome(GR) is not constant.  I
suggest, partly for compatibility, to keep genome, but perhaps do something
like
  standardizeGenome()
or something like this.

I would indeed love, love, love a function which just cleans it up.

Kasper

On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker  wrote:

> Herve,
>
> This is probably a naive question, but what usecases are there for creating
> an object with the wrong seqinfo for its genome?
>
> ~G
>
> On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence <
> lawrence.mich...@gene.com
> > wrote:
>
> > On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès 
> wrote:
> > > I also think that we're heading towards something like that.
> > >
> > > So genome(gr) <- "hg19" would:
> > >
> > >   (a) Add any missing information to the seqinfo.
> > >   (b) Sort the seqlevels in "canonical" order.
> > >   (c) Change the seqlevels style to UCSC style if they are not.
> > >
> > > The 3 tasks are orthogonal. I guess most of the times people want
> > > an easy way to perform them all at once.
> > >
> > > We could easily support (a) and (b). This assumes that the current
> > > seqlevels are already valid hg19 seqlevels:
> > >
> > >   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
> > >   gr1 <- GRanges(seqinfo=si1)
> > >   hg19_si <- Seqinfo(genome="hg19")
> > >
> > >   ## (a):
> > >   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
> > >   seqinfo(gr1)
> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> > >   #   seqnames   seqlengths isCircular genome
> > >   #   chrX155270560  FALSE   hg19
> > >   #   chrUn_gl000249  38502  FALSE   hg19
> > >   #   chr2243199373  FALSE   hg19
> > >   #   chr6_cox_hap2 4795371  FALSE   hg19
> > >
> > >   ## (b):
> > >   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
> > >   seqinfo(gr1)
> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> > >   #   seqnames   seqlengths isCircular genome
> > >   #   chr2243199373  FALSE   hg19
> > >   #   chrX155270560  FALSE   hg19
> > >   #   chr6_cox_hap2 4795371  FALSE   hg19
> > >   #   chrUn_gl000249  38502  FALSE   hg19
> > >
> > > (c) is harder because seqlevelsStyle() doesn't know how to rename
> > > scaffolds yet:
> > >
> > >   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
> > "HSCHR6_MHC_COX_CTG1"))
> > >   gr2 <- GRanges(seqinfo=si2)
> > >
> > >   seqlevelsStyle(gr2)
> > >   # [1] "NCBI"
> > >
> > >   seqlevelsStyle(gr2) <- "UCSC"
> > >   seqlevels(gr2)
> > >   # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
> > >   # [4] "HSCHR6_MHC_COX_CTG1"
> > >
> > > So we need to work on this.
> > >
> > > I'm not sure about using genome(gr) <- "hg19" for this. Right now
> > > it sets the genome column of the seqinfo with the supplied string
> > > and nothing else. Aren't there valid use cases for this?
> >
> > Not sure. People would almost always want the seqname style and order
> > to be consistent with the given genome.
> >
> > > What about
> > > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
> > > whole seqinfo component actually gets filled.
> > >
> >
> > Yea, but "genome" is so intuitive compared to "seqinfo".
> >
> >
> >
> > > H.
> > >
> > > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
> > >>
> > >> that's kind of always been my goal...
> > >>
> > >>
> > >> Statistics is the grammar of science.
> > >> Karl Pearson 
> > >>
> > >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
> > >> mailto:lawrence.mich...@gene.com>> wrote:
> > >>
> > >> Maybe this could eventually support setting the seqinfo with:
> > >>
> > >> genome(gr) <- "hg19"
> > >>
> > >> Or is that being too clever?
> > >>
> > >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès  > >> > wrote:
> > >>  > Hi,
> > >>  >
> > >>  > FWIW I started to work on supporting quick generation of a
> > >> standalone
> > >>  > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
> > >>  >
> > >>  > It already supports hg38, hg19, hg18, panTro4, panTro3,
> panTro2,
> > >>  > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
> > >> mm10,
> > >>  > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
> > >> galGal3,
> > >>  > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2,
> > sacCer3,
> > >>  > and sacCer2. I'll add more.
> > >>  >
> > >>  > See ?Seqinfo for some examples.
> > >>  >
> > >>  > Right now it fetches the information from internet every time
> you
> > >>  > call it but maybe we should just store that information in the
> > >>  > GenomeInfoDb package as Tim suggested?

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Gabe Becker
That sounds like it calls for an (class-style) inheritence/genome-union
 model to me. I should probably stop talking now before the people who
would have to implement that start throwing things at me, though.

~G

On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:

> In WGBS we frequently sequence a human with spikein from the lambda
> genome.  In this case, most of the chromosomes of the Granges are from
> human, except one.  This is a usecase where genome(GR) is not constant.  I
> suggest, partly for compatibility, to keep genome, but perhaps do something
> like
>   standardizeGenome()
> or something like this.
>
> I would indeed love, love, love a function which just cleans it up.
>
> Kasper
>
> On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker  wrote:
>
>> Herve,
>>
>> This is probably a naive question, but what usecases are there for
>> creating
>> an object with the wrong seqinfo for its genome?
>>
>> ~G
>>
>> On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence <
>> lawrence.mich...@gene.com
>> > wrote:
>>
>> > On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès 
>> wrote:
>> > > I also think that we're heading towards something like that.
>> > >
>> > > So genome(gr) <- "hg19" would:
>> > >
>> > >   (a) Add any missing information to the seqinfo.
>> > >   (b) Sort the seqlevels in "canonical" order.
>> > >   (c) Change the seqlevels style to UCSC style if they are not.
>> > >
>> > > The 3 tasks are orthogonal. I guess most of the times people want
>> > > an easy way to perform them all at once.
>> > >
>> > > We could easily support (a) and (b). This assumes that the current
>> > > seqlevels are already valid hg19 seqlevels:
>> > >
>> > >   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
>> > >   gr1 <- GRanges(seqinfo=si1)
>> > >   hg19_si <- Seqinfo(genome="hg19")
>> > >
>> > >   ## (a):
>> > >   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
>> > >   seqinfo(gr1)
>> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>> > >   #   seqnames   seqlengths isCircular genome
>> > >   #   chrX155270560  FALSE   hg19
>> > >   #   chrUn_gl000249  38502  FALSE   hg19
>> > >   #   chr2243199373  FALSE   hg19
>> > >   #   chr6_cox_hap2 4795371  FALSE   hg19
>> > >
>> > >   ## (b):
>> > >   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
>> > >   seqinfo(gr1)
>> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>> > >   #   seqnames   seqlengths isCircular genome
>> > >   #   chr2243199373  FALSE   hg19
>> > >   #   chrX155270560  FALSE   hg19
>> > >   #   chr6_cox_hap2 4795371  FALSE   hg19
>> > >   #   chrUn_gl000249  38502  FALSE   hg19
>> > >
>> > > (c) is harder because seqlevelsStyle() doesn't know how to rename
>> > > scaffolds yet:
>> > >
>> > >   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
>> > "HSCHR6_MHC_COX_CTG1"))
>> > >   gr2 <- GRanges(seqinfo=si2)
>> > >
>> > >   seqlevelsStyle(gr2)
>> > >   # [1] "NCBI"
>> > >
>> > >   seqlevelsStyle(gr2) <- "UCSC"
>> > >   seqlevels(gr2)
>> > >   # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
>> > >   # [4] "HSCHR6_MHC_COX_CTG1"
>> > >
>> > > So we need to work on this.
>> > >
>> > > I'm not sure about using genome(gr) <- "hg19" for this. Right now
>> > > it sets the genome column of the seqinfo with the supplied string
>> > > and nothing else. Aren't there valid use cases for this?
>> >
>> > Not sure. People would almost always want the seqname style and order
>> > to be consistent with the given genome.
>> >
>> > > What about
>> > > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
>> > > whole seqinfo component actually gets filled.
>> > >
>> >
>> > Yea, but "genome" is so intuitive compared to "seqinfo".
>> >
>> >
>> >
>> > > H.
>> > >
>> > > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
>> > >>
>> > >> that's kind of always been my goal...
>> > >>
>> > >>
>> > >> Statistics is the grammar of science.
>> > >> Karl Pearson 
>> > >>
>> > >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
>> > >> mailto:lawrence.mich...@gene.com>>
>> wrote:
>> > >>
>> > >> Maybe this could eventually support setting the seqinfo with:
>> > >>
>> > >> genome(gr) <- "hg19"
>> > >>
>> > >> Or is that being too clever?
>> > >>
>> > >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <
>> hpa...@fredhutch.org
>> > >> > wrote:
>> > >>  > Hi,
>> > >>  >
>> > >>  > FWIW I started to work on supporting quick generation of a
>> > >> standalone
>> > >>  > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>> > >>  >
>> > >>  > It already supports hg38, hg19, hg18, panTro4, panTro3,
>> panTro2,
>> > >>  > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
>> > >> mm10,
>> > >>  > mm9, mm8

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Hervé Pagès

Hi Gabe,

I wouldn't say "wrong" but "not normalized". So maybe for example
everything is ok with their seqinfo but the seqlevels are not in
canonical order, and, for whatever reason, they want to keep them
that way (the actual order of the seqlevels impacts the sorting of
the ranges in the GRanges object). If it turns out that the genome
column is empty and they want to set it, then they would be left
with no way to do so.

Another concern I have is that genome<- has been the low-level setter
for the genome column for a while and changing its behavior now might
break some existing code. It's fixable, but we can avoid that.

H.

On 06/05/2015 11:51 AM, Gabe Becker wrote:

Herve,

This is probably a naive question, but what usecases are there for
creating an object with the wrong seqinfo for its genome?

~G

On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
mailto:lawrence.mich...@gene.com>> wrote:

On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès mailto:hpa...@fredhutch.org>> wrote:
 > I also think that we're heading towards something like that.
 >
 > So genome(gr) <- "hg19" would:
 >
 >   (a) Add any missing information to the seqinfo.
 >   (b) Sort the seqlevels in "canonical" order.
 >   (c) Change the seqlevels style to UCSC style if they are not.
 >
 > The 3 tasks are orthogonal. I guess most of the times people want
 > an easy way to perform them all at once.
 >
 > We could easily support (a) and (b). This assumes that the current
 > seqlevels are already valid hg19 seqlevels:
 >
 >   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2",
"chr6_cox_hap2"))
 >   gr1 <- GRanges(seqinfo=si1)
 >   hg19_si <- Seqinfo(genome="hg19")
 >
 >   ## (a):
 >   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
 >   seqinfo(gr1)
 >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
 >   #   seqnames   seqlengths isCircular genome
 >   #   chrX155270560  FALSE   hg19
 >   #   chrUn_gl000249  38502  FALSE   hg19
 >   #   chr2243199373  FALSE   hg19
 >   #   chr6_cox_hap2 4795371  FALSE   hg19
 >
 >   ## (b):
 >   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
 >   seqinfo(gr1)
 >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
 >   #   seqnames   seqlengths isCircular genome
 >   #   chr2243199373  FALSE   hg19
 >   #   chrX155270560  FALSE   hg19
 >   #   chr6_cox_hap2 4795371  FALSE   hg19
 >   #   chrUn_gl000249  38502  FALSE   hg19
 >
 > (c) is harder because seqlevelsStyle() doesn't know how to rename
 > scaffolds yet:
 >
 >   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
"HSCHR6_MHC_COX_CTG1"))
 >   gr2 <- GRanges(seqinfo=si2)
 >
 >   seqlevelsStyle(gr2)
 >   # [1] "NCBI"
 >
 >   seqlevelsStyle(gr2) <- "UCSC"
 >   seqlevels(gr2)
 >   # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
 >   # [4] "HSCHR6_MHC_COX_CTG1"
 >
 > So we need to work on this.
 >
 > I'm not sure about using genome(gr) <- "hg19" for this. Right now
 > it sets the genome column of the seqinfo with the supplied string
 > and nothing else. Aren't there valid use cases for this?

Not sure. People would almost always want the seqname style and order
to be consistent with the given genome.

> What about
> using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
> whole seqinfo component actually gets filled.
>

Yea, but "genome" is so intuitive compared to "seqinfo".



 > H.
 >
 > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
 >>
 >> that's kind of always been my goal...
 >>
 >>
 >> Statistics is the grammar of science.
 >> Karl Pearson 
 >>
 >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
 >> mailto:lawrence.mich...@gene.com>
>> wrote:
 >>
 >> Maybe this could eventually support setting the seqinfo with:
 >>
 >> genome(gr) <- "hg19"
 >>
 >> Or is that being too clever?
 >>
 >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès
mailto:hpa...@fredhutch.org>
 >> >>
wrote:
 >>  > Hi,
 >>  >
 >>  > FWIW I started to work on supporting quick generation of a
 >> standalone
 >>  > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
 >>  >
 >>  > It already supports hg38, hg19, hg18, panTro4, panTro3,
panTro2,
 >>  > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1,
musFur1,
 >> mm10,
 >>  > mm9, mm8, susScr3, susScr2, rn6, rheMac

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Michael Lawrence
To support the multi-genome case, one could set the genome as a
vector, one value for each seqname, and it would fix the
style/seqlength per seqname. It could sort by the combination of
seqname and species. Presumably it would do nothing for unknown
genomes.

But I agree that a standardizeSeqinfo() that amounts to "genome(x) <-
genome(x)" would make sense.

I don't think people sort too often by seqnames (except to the
"natural" ordering), but I could be wrong. I do sympathize though with
the need for a low-level accessor. At least one would want a parameter
for disabling the standardization.

On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen
 wrote:
> In WGBS we frequently sequence a human with spikein from the lambda genome.
> In this case, most of the chromosomes of the Granges are from human, except
> one.  This is a usecase where genome(GR) is not constant.  I suggest, partly
> for compatibility, to keep genome, but perhaps do something like
>   standardizeGenome()
> or something like this.
>
> I would indeed love, love, love a function which just cleans it up.
>
> Kasper
>
> On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker  wrote:
>>
>> Herve,
>>
>> This is probably a naive question, but what usecases are there for
>> creating
>> an object with the wrong seqinfo for its genome?
>>
>> ~G
>>
>> On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
>> > > wrote:
>>
>> > On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès 
>> > wrote:
>> > > I also think that we're heading towards something like that.
>> > >
>> > > So genome(gr) <- "hg19" would:
>> > >
>> > >   (a) Add any missing information to the seqinfo.
>> > >   (b) Sort the seqlevels in "canonical" order.
>> > >   (c) Change the seqlevels style to UCSC style if they are not.
>> > >
>> > > The 3 tasks are orthogonal. I guess most of the times people want
>> > > an easy way to perform them all at once.
>> > >
>> > > We could easily support (a) and (b). This assumes that the current
>> > > seqlevels are already valid hg19 seqlevels:
>> > >
>> > >   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
>> > >   gr1 <- GRanges(seqinfo=si1)
>> > >   hg19_si <- Seqinfo(genome="hg19")
>> > >
>> > >   ## (a):
>> > >   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
>> > >   seqinfo(gr1)
>> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>> > >   #   seqnames   seqlengths isCircular genome
>> > >   #   chrX155270560  FALSE   hg19
>> > >   #   chrUn_gl000249  38502  FALSE   hg19
>> > >   #   chr2243199373  FALSE   hg19
>> > >   #   chr6_cox_hap2 4795371  FALSE   hg19
>> > >
>> > >   ## (b):
>> > >   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
>> > >   seqinfo(gr1)
>> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>> > >   #   seqnames   seqlengths isCircular genome
>> > >   #   chr2243199373  FALSE   hg19
>> > >   #   chrX155270560  FALSE   hg19
>> > >   #   chr6_cox_hap2 4795371  FALSE   hg19
>> > >   #   chrUn_gl000249  38502  FALSE   hg19
>> > >
>> > > (c) is harder because seqlevelsStyle() doesn't know how to rename
>> > > scaffolds yet:
>> > >
>> > >   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
>> > "HSCHR6_MHC_COX_CTG1"))
>> > >   gr2 <- GRanges(seqinfo=si2)
>> > >
>> > >   seqlevelsStyle(gr2)
>> > >   # [1] "NCBI"
>> > >
>> > >   seqlevelsStyle(gr2) <- "UCSC"
>> > >   seqlevels(gr2)
>> > >   # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
>> > >   # [4] "HSCHR6_MHC_COX_CTG1"
>> > >
>> > > So we need to work on this.
>> > >
>> > > I'm not sure about using genome(gr) <- "hg19" for this. Right now
>> > > it sets the genome column of the seqinfo with the supplied string
>> > > and nothing else. Aren't there valid use cases for this?
>> >
>> > Not sure. People would almost always want the seqname style and order
>> > to be consistent with the given genome.
>> >
>> > > What about
>> > > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
>> > > whole seqinfo component actually gets filled.
>> > >
>> >
>> > Yea, but "genome" is so intuitive compared to "seqinfo".
>> >
>> >
>> >
>> > > H.
>> > >
>> > > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
>> > >>
>> > >> that's kind of always been my goal...
>> > >>
>> > >>
>> > >> Statistics is the grammar of science.
>> > >> Karl Pearson 
>> > >>
>> > >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
>> > >> mailto:lawrence.mich...@gene.com>> wrote:
>> > >>
>> > >> Maybe this could eventually support setting the seqinfo with:
>> > >>
>> > >> genome(gr) <- "hg19"
>> > >>
>> > >> Or is that being too clever?
>> > >>
>> > >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès > > >> > wrote:
>> > >>  > Hi,
>> > >>  >
>> > >>  > FWIW I started to work on supporting quick generation of a
>> > >> st

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Gabe Becker
On Fri, Jun 5, 2015 at 1:19 PM, Michael Lawrence 
wrote:

> To support the multi-genome case, one could set the genome as a
> vector, one value for each seqname, and it would fix the
> style/seqlength per seqname. It could sort by the combination of
> seqname and species. Presumably it would do nothing for unknown
> genomes.
>

That is one way. Another would be it takes a vector of the length of
genomes being unioned, and each genome knows it's seqinfo and fixes things
within it's domain. The seqlevels, for example, would be
c(seqlevels(genome1), ..., seqlevels(genomeK)). There shouldn't be overlap,
because if there is there was already an identifiability problem in the
data which is basically guaranteed to be an error (I think).

If combinations of genomes were more formally modeled, though, you could do
fun things like

genome(x, strict=FALSE) = "GRCh38"

Which would do nothing if the genome on x was already a union containing
GRCh38, and otherwise would fix the "human part" of the seqinfo to be
GRCh37, but would leave anything that had been unioned on alone.

~G


>
> But I agree that a standardizeSeqinfo() that amounts to "genome(x) <-
> genome(x)" would make sense.
>
> I don't think people sort too often by seqnames (except to the
> "natural" ordering), but I could be wrong. I do sympathize though with
> the need for a low-level accessor. At least one would want a parameter
> for disabling the standardization.
>
> On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen
>  wrote:
> > In WGBS we frequently sequence a human with spikein from the lambda
> genome.
> > In this case, most of the chromosomes of the Granges are from human,
> except
> > one.  This is a usecase where genome(GR) is not constant.  I suggest,
> partly
> > for compatibility, to keep genome, but perhaps do something like
> >   standardizeGenome()
> > or something like this.
> >
> > I would indeed love, love, love a function which just cleans it up.
> >
> > Kasper
> >
> > On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker 
> wrote:
> >>
> >> Herve,
> >>
> >> This is probably a naive question, but what usecases are there for
> >> creating
> >> an object with the wrong seqinfo for its genome?
> >>
> >> ~G
> >>
> >> On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
> >>  >> > wrote:
> >>
> >> > On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès 
> >> > wrote:
> >> > > I also think that we're heading towards something like that.
> >> > >
> >> > > So genome(gr) <- "hg19" would:
> >> > >
> >> > >   (a) Add any missing information to the seqinfo.
> >> > >   (b) Sort the seqlevels in "canonical" order.
> >> > >   (c) Change the seqlevels style to UCSC style if they are not.
> >> > >
> >> > > The 3 tasks are orthogonal. I guess most of the times people want
> >> > > an easy way to perform them all at once.
> >> > >
> >> > > We could easily support (a) and (b). This assumes that the current
> >> > > seqlevels are already valid hg19 seqlevels:
> >> > >
> >> > >   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2",
> "chr6_cox_hap2"))
> >> > >   gr1 <- GRanges(seqinfo=si1)
> >> > >   hg19_si <- Seqinfo(genome="hg19")
> >> > >
> >> > >   ## (a):
> >> > >   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
> >> > >   seqinfo(gr1)
> >> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> >> > >   #   seqnames   seqlengths isCircular genome
> >> > >   #   chrX155270560  FALSE   hg19
> >> > >   #   chrUn_gl000249  38502  FALSE   hg19
> >> > >   #   chr2243199373  FALSE   hg19
> >> > >   #   chr6_cox_hap2 4795371  FALSE   hg19
> >> > >
> >> > >   ## (b):
> >> > >   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
> >> > >   seqinfo(gr1)
> >> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> >> > >   #   seqnames   seqlengths isCircular genome
> >> > >   #   chr2243199373  FALSE   hg19
> >> > >   #   chrX155270560  FALSE   hg19
> >> > >   #   chr6_cox_hap2 4795371  FALSE   hg19
> >> > >   #   chrUn_gl000249  38502  FALSE   hg19
> >> > >
> >> > > (c) is harder because seqlevelsStyle() doesn't know how to rename
> >> > > scaffolds yet:
> >> > >
> >> > >   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
> >> > "HSCHR6_MHC_COX_CTG1"))
> >> > >   gr2 <- GRanges(seqinfo=si2)
> >> > >
> >> > >   seqlevelsStyle(gr2)
> >> > >   # [1] "NCBI"
> >> > >
> >> > >   seqlevelsStyle(gr2) <- "UCSC"
> >> > >   seqlevels(gr2)
> >> > >   # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
> >> > >   # [4] "HSCHR6_MHC_COX_CTG1"
> >> > >
> >> > > So we need to work on this.
> >> > >
> >> > > I'm not sure about using genome(gr) <- "hg19" for this. Right now
> >> > > it sets the genome column of the seqinfo with the supplied string
> >> > > and nothing else. Aren't there valid use cases for this?
> >> >
> >> > Not sure. People would almost always want the seqname style and order
> >> > to be consistent with the gi

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Hervé Pagès

On 06/05/2015 01:19 PM, Michael Lawrence wrote:

To support the multi-genome case, one could set the genome as a
vector, one value for each seqname, and it would fix the
style/seqlength per seqname. It could sort by the combination of
seqname and species. Presumably it would do nothing for unknown
genomes.

But I agree that a standardizeSeqinfo() that amounts to "genome(x) <-
genome(x)" would make sense.

I don't think people sort too often by seqnames (except to the
"natural" ordering),


That's what sort(), order(), rank() do by default: they sort by seqnames
first, then by start, then by end, and finally by strand.


but I could be wrong. I do sympathize though with
the need for a low-level accessor. At least one would want a parameter
for disabling the standardization.


Ok. So the candidates are:

 (a) standardizeSeqinfo(gr) <- "hg19"

 (b) gr <- standardizeSeqinfo(gr, "hg19")

 (c) standardizeGenome(gr) <- "hg19"

 (d) gr <- standardizeGenome(gr, "hg19")

 (e) seqinfo(gr) <- "hg19"

Is there a risk of confusion with keepStandardChromosomes where
"standard" means a very different thing? I'll add 2 more:

 (f) normalizeSeqinfo(gr) <- "hg19"

 (g) gr <- normalizeSeqinfo(gr, "hg19")

Anyway, we're not here yet. As pointed in an earlier post, there are
still some missing pieces to complete the puzzle.

Thanks,
H.



On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen
 wrote:

In WGBS we frequently sequence a human with spikein from the lambda genome.
In this case, most of the chromosomes of the Granges are from human, except
one.  This is a usecase where genome(GR) is not constant.  I suggest, partly
for compatibility, to keep genome, but perhaps do something like
   standardizeGenome()
or something like this.

I would indeed love, love, love a function which just cleans it up.

Kasper

On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker  wrote:


Herve,

This is probably a naive question, but what usecases are there for
creating
an object with the wrong seqinfo for its genome?

~G

On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence

wrote:



On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès 
wrote:

I also think that we're heading towards something like that.

So genome(gr) <- "hg19" would:

   (a) Add any missing information to the seqinfo.
   (b) Sort the seqlevels in "canonical" order.
   (c) Change the seqlevels style to UCSC style if they are not.

The 3 tasks are orthogonal. I guess most of the times people want
an easy way to perform them all at once.

We could easily support (a) and (b). This assumes that the current
seqlevels are already valid hg19 seqlevels:

   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
   gr1 <- GRanges(seqinfo=si1)
   hg19_si <- Seqinfo(genome="hg19")

   ## (a):
   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
   seqinfo(gr1)
   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
   #   seqnames   seqlengths isCircular genome
   #   chrX155270560  FALSE   hg19
   #   chrUn_gl000249  38502  FALSE   hg19
   #   chr2243199373  FALSE   hg19
   #   chr6_cox_hap2 4795371  FALSE   hg19

   ## (b):
   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
   seqinfo(gr1)
   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
   #   seqnames   seqlengths isCircular genome
   #   chr2243199373  FALSE   hg19
   #   chrX155270560  FALSE   hg19
   #   chr6_cox_hap2 4795371  FALSE   hg19
   #   chrUn_gl000249  38502  FALSE   hg19

(c) is harder because seqlevelsStyle() doesn't know how to rename
scaffolds yet:

   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",

"HSCHR6_MHC_COX_CTG1"))

   gr2 <- GRanges(seqinfo=si2)

   seqlevelsStyle(gr2)
   # [1] "NCBI"

   seqlevelsStyle(gr2) <- "UCSC"
   seqlevels(gr2)
   # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
   # [4] "HSCHR6_MHC_COX_CTG1"

So we need to work on this.

I'm not sure about using genome(gr) <- "hg19" for this. Right now
it sets the genome column of the seqinfo with the supplied string
and nothing else. Aren't there valid use cases for this?


Not sure. People would almost always want the seqname style and order
to be consistent with the given genome.


What about
using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
whole seqinfo component actually gets filled.



Yea, but "genome" is so intuitive compared to "seqinfo".




H.

On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:


that's kind of always been my goal...


Statistics is the grammar of science.
Karl Pearson 

On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
mailto:lawrence.mich...@gene.com>> wrote:

 Maybe this could eventually support setting the seqinfo with:

 genome(gr) <- "hg19"

 Or is that being too clever?

 On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès mailto:hpa...@fredhutch.org>> wrote:
  > Hi,

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Tim Triche, Jr.
how about just

gr <- addSeqinfo(gr, "hg19")

Statistics is the grammar of science.
Karl Pearson 

On Fri, Jun 5, 2015 at 1:36 PM, Hervé Pagès  wrote:

> On 06/05/2015 01:19 PM, Michael Lawrence wrote:
>
>> To support the multi-genome case, one could set the genome as a
>> vector, one value for each seqname, and it would fix the
>> style/seqlength per seqname. It could sort by the combination of
>> seqname and species. Presumably it would do nothing for unknown
>> genomes.
>>
>> But I agree that a standardizeSeqinfo() that amounts to "genome(x) <-
>> genome(x)" would make sense.
>>
>> I don't think people sort too often by seqnames (except to the
>> "natural" ordering),
>>
>
> That's what sort(), order(), rank() do by default: they sort by seqnames
> first, then by start, then by end, and finally by strand.
>
>  but I could be wrong. I do sympathize though with
>> the need for a low-level accessor. At least one would want a parameter
>> for disabling the standardization.
>>
>
> Ok. So the candidates are:
>
>  (a) standardizeSeqinfo(gr) <- "hg19"
>
>  (b) gr <- standardizeSeqinfo(gr, "hg19")
>
>  (c) standardizeGenome(gr) <- "hg19"
>
>  (d) gr <- standardizeGenome(gr, "hg19")
>
>  (e) seqinfo(gr) <- "hg19"
>
> Is there a risk of confusion with keepStandardChromosomes where
> "standard" means a very different thing? I'll add 2 more:
>
>  (f) normalizeSeqinfo(gr) <- "hg19"
>
>  (g) gr <- normalizeSeqinfo(gr, "hg19")
>
> Anyway, we're not here yet. As pointed in an earlier post, there are
> still some missing pieces to complete the puzzle.
>
> Thanks,
> H.
>
>
>> On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen
>>  wrote:
>>
>>> In WGBS we frequently sequence a human with spikein from the lambda
>>> genome.
>>> In this case, most of the chromosomes of the Granges are from human,
>>> except
>>> one.  This is a usecase where genome(GR) is not constant.  I suggest,
>>> partly
>>> for compatibility, to keep genome, but perhaps do something like
>>>standardizeGenome()
>>> or something like this.
>>>
>>> I would indeed love, love, love a function which just cleans it up.
>>>
>>> Kasper
>>>
>>> On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker 
>>> wrote:
>>>

 Herve,

 This is probably a naive question, but what usecases are there for
 creating
 an object with the wrong seqinfo for its genome?

 ~G

 On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
 >>>
> wrote:
>

  On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès 
> wrote:
>
>> I also think that we're heading towards something like that.
>>
>> So genome(gr) <- "hg19" would:
>>
>>(a) Add any missing information to the seqinfo.
>>(b) Sort the seqlevels in "canonical" order.
>>(c) Change the seqlevels style to UCSC style if they are not.
>>
>> The 3 tasks are orthogonal. I guess most of the times people want
>> an easy way to perform them all at once.
>>
>> We could easily support (a) and (b). This assumes that the current
>> seqlevels are already valid hg19 seqlevels:
>>
>>si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2",
>> "chr6_cox_hap2"))
>>gr1 <- GRanges(seqinfo=si1)
>>hg19_si <- Seqinfo(genome="hg19")
>>
>>## (a):
>>seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
>>seqinfo(gr1)
>># Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>>#   seqnames   seqlengths isCircular genome
>>#   chrX155270560  FALSE   hg19
>>#   chrUn_gl000249  38502  FALSE   hg19
>>#   chr2243199373  FALSE   hg19
>>#   chr6_cox_hap2 4795371  FALSE   hg19
>>
>>## (b):
>>seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
>>seqinfo(gr1)
>># Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>>#   seqnames   seqlengths isCircular genome
>>#   chr2243199373  FALSE   hg19
>>#   chrX155270560  FALSE   hg19
>>#   chr6_cox_hap2 4795371  FALSE   hg19
>>#   chrUn_gl000249  38502  FALSE   hg19
>>
>> (c) is harder because seqlevelsStyle() doesn't know how to rename
>> scaffolds yet:
>>
>>si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
>>
> "HSCHR6_MHC_COX_CTG1"))
>
>>gr2 <- GRanges(seqinfo=si2)
>>
>>seqlevelsStyle(gr2)
>># [1] "NCBI"
>>
>>seqlevelsStyle(gr2) <- "UCSC"
>>seqlevels(gr2)
>># [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
>># [4] "HSCHR6_MHC_COX_CTG1"
>>
>> So we need to work on this.
>>
>> I'm not sure about using genome(gr) <- "hg19" for this. Right now
>> it sets the genome column of the seqinfo with the supplied string
>> and nothing el

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Gabe Becker
On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr. 
wrote:

> how about just
>
> gr <- addSeqinfo(gr, "hg19")
>

Add sounds like it's, well, adding rather than replacing (Which it
sometimes would do.

gr <- fixSeqInfo(gr, "hg19")

instead?

~G


-- 
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Hervé Pagès

On 06/05/2015 01:39 PM, Tim Triche, Jr. wrote:

how about just

gr <- addSeqinfo(gr, "hg19")


mmh, we don't really "add" a seqinfo. We transform the existing one.
This transformation can be called "standardization", or "normalization",
or... but I wouldn't call it "addition".

H.



Statistics is the grammar of science.
Karl Pearson 

On Fri, Jun 5, 2015 at 1:36 PM, Hervé Pagès mailto:hpa...@fredhutch.org>> wrote:

On 06/05/2015 01:19 PM, Michael Lawrence wrote:

To support the multi-genome case, one could set the genome as a
vector, one value for each seqname, and it would fix the
style/seqlength per seqname. It could sort by the combination of
seqname and species. Presumably it would do nothing for unknown
genomes.

But I agree that a standardizeSeqinfo() that amounts to
"genome(x) <-
genome(x)" would make sense.

I don't think people sort too often by seqnames (except to the
"natural" ordering),


That's what sort(), order(), rank() do by default: they sort by seqnames
first, then by start, then by end, and finally by strand.

but I could be wrong. I do sympathize though with
the need for a low-level accessor. At least one would want a
parameter
for disabling the standardization.


Ok. So the candidates are:

  (a) standardizeSeqinfo(gr) <- "hg19"

  (b) gr <- standardizeSeqinfo(gr, "hg19")

  (c) standardizeGenome(gr) <- "hg19"

  (d) gr <- standardizeGenome(gr, "hg19")

  (e) seqinfo(gr) <- "hg19"

Is there a risk of confusion with keepStandardChromosomes where
"standard" means a very different thing? I'll add 2 more:

  (f) normalizeSeqinfo(gr) <- "hg19"

  (g) gr <- normalizeSeqinfo(gr, "hg19")

Anyway, we're not here yet. As pointed in an earlier post, there are
still some missing pieces to complete the puzzle.

Thanks,
H.


On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen
mailto:kasperdanielhan...@gmail.com>> wrote:

In WGBS we frequently sequence a human with spikein from the
lambda genome.
In this case, most of the chromosomes of the Granges are
from human, except
one.  This is a usecase where genome(GR) is not constant.  I
suggest, partly
for compatibility, to keep genome, but perhaps do something like
standardizeGenome()
or something like this.

I would indeed love, love, love a function which just cleans
it up.

Kasper

On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker
mailto:becker.g...@gene.com>> wrote:


Herve,

This is probably a naive question, but what usecases are
there for
creating
an object with the wrong seqinfo for its genome?

~G

On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
mailto:lawrence.mich...@gene.com>

wrote:


On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès
mailto:hpa...@fredhutch.org>>
wrote:

I also think that we're heading towards
something like that.

So genome(gr) <- "hg19" would:

(a) Add any missing information to the seqinfo.
(b) Sort the seqlevels in "canonical" order.
(c) Change the seqlevels style to UCSC style
if they are not.

The 3 tasks are orthogonal. I guess most of the
times people want
an easy way to perform them all at once.

We could easily support (a) and (b). This
assumes that the current
seqlevels are already valid hg19 seqlevels:

si1 <- Seqinfo(c("chrX", "chrUn_gl000249",
"chr2", "chr6_cox_hap2"))
gr1 <- GRanges(seqinfo=si1)
hg19_si <- Seqinfo(genome="hg19")

## (a):
seqinfo(gr1) <- merge(seqinfo(gr1),
hg19_si)[seqlevels(gr1)]
seqinfo(gr1)
# Seqinfo object with 4 sequences (1
circular) from hg19 genome:
#   seqnames   seqlengths isCircular genome
#   chrX155270560  FALSE   hg19
#   chrUn_gl000249  38502  FALSE   hg19
#   chr2243199373  FALSE   hg19
#   chr6_cox_hap2 

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Hervé Pagès

On 06/05/2015 01:41 PM, Gabe Becker wrote:



On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr. mailto:tim.tri...@gmail.com>> wrote:

how about just

gr <- addSeqinfo(gr, "hg19")


Add sounds like it's, well, adding rather than replacing (Which it
sometimes would do.

gr <- fixSeqInfo(gr, "hg19")


As I tried to explain earlier, it's not broken.

Come on, we've already a list of 7 proposals! I was hoping you guys
would help pick up one instead of growing the list :-b

H.



instead?

~G


--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Michael Lawrence
Not a big fan of the gets (<-) syntax for "verb" functions.
standardizeSeqinfo() is probably the most "proper" name according to
our nomenclature. But novice users are still just going to want to do
genome(gr) <- "hg19" and have it just the right thing.

On Fri, Jun 5, 2015 at 1:36 PM, Hervé Pagès  wrote:
> On 06/05/2015 01:19 PM, Michael Lawrence wrote:
>>
>> To support the multi-genome case, one could set the genome as a
>> vector, one value for each seqname, and it would fix the
>> style/seqlength per seqname. It could sort by the combination of
>> seqname and species. Presumably it would do nothing for unknown
>> genomes.
>>
>> But I agree that a standardizeSeqinfo() that amounts to "genome(x) <-
>> genome(x)" would make sense.
>>
>> I don't think people sort too often by seqnames (except to the
>> "natural" ordering),
>
>
> That's what sort(), order(), rank() do by default: they sort by seqnames
> first, then by start, then by end, and finally by strand.
>
>> but I could be wrong. I do sympathize though with
>> the need for a low-level accessor. At least one would want a parameter
>> for disabling the standardization.
>
>
> Ok. So the candidates are:
>
>  (a) standardizeSeqinfo(gr) <- "hg19"
>
>  (b) gr <- standardizeSeqinfo(gr, "hg19")
>
>  (c) standardizeGenome(gr) <- "hg19"
>
>  (d) gr <- standardizeGenome(gr, "hg19")
>
>  (e) seqinfo(gr) <- "hg19"
>
> Is there a risk of confusion with keepStandardChromosomes where
> "standard" means a very different thing? I'll add 2 more:
>
>  (f) normalizeSeqinfo(gr) <- "hg19"
>
>  (g) gr <- normalizeSeqinfo(gr, "hg19")
>
> Anyway, we're not here yet. As pointed in an earlier post, there are
> still some missing pieces to complete the puzzle.
>
> Thanks,
>
> H.
>
>>
>> On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen
>>  wrote:
>>>
>>> In WGBS we frequently sequence a human with spikein from the lambda
>>> genome.
>>> In this case, most of the chromosomes of the Granges are from human,
>>> except
>>> one.  This is a usecase where genome(GR) is not constant.  I suggest,
>>> partly
>>> for compatibility, to keep genome, but perhaps do something like
>>>standardizeGenome()
>>> or something like this.
>>>
>>> I would indeed love, love, love a function which just cleans it up.
>>>
>>> Kasper
>>>
>>> On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker  wrote:


 Herve,

 This is probably a naive question, but what usecases are there for
 creating
 an object with the wrong seqinfo for its genome?

 ~G

 On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence
 
> wrote:


> On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès 
> wrote:
>>
>> I also think that we're heading towards something like that.
>>
>> So genome(gr) <- "hg19" would:
>>
>>(a) Add any missing information to the seqinfo.
>>(b) Sort the seqlevels in "canonical" order.
>>(c) Change the seqlevels style to UCSC style if they are not.
>>
>> The 3 tasks are orthogonal. I guess most of the times people want
>> an easy way to perform them all at once.
>>
>> We could easily support (a) and (b). This assumes that the current
>> seqlevels are already valid hg19 seqlevels:
>>
>>si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2",
>> "chr6_cox_hap2"))
>>gr1 <- GRanges(seqinfo=si1)
>>hg19_si <- Seqinfo(genome="hg19")
>>
>>## (a):
>>seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
>>seqinfo(gr1)
>># Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>>#   seqnames   seqlengths isCircular genome
>>#   chrX155270560  FALSE   hg19
>>#   chrUn_gl000249  38502  FALSE   hg19
>>#   chr2243199373  FALSE   hg19
>>#   chr6_cox_hap2 4795371  FALSE   hg19
>>
>>## (b):
>>seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
>>seqinfo(gr1)
>># Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>>#   seqnames   seqlengths isCircular genome
>>#   chr2243199373  FALSE   hg19
>>#   chrX155270560  FALSE   hg19
>>#   chr6_cox_hap2 4795371  FALSE   hg19
>>#   chrUn_gl000249  38502  FALSE   hg19
>>
>> (c) is harder because seqlevelsStyle() doesn't know how to rename
>> scaffolds yet:
>>
>>si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
>
> "HSCHR6_MHC_COX_CTG1"))
>>
>>gr2 <- GRanges(seqinfo=si2)
>>
>>seqlevelsStyle(gr2)
>># [1] "NCBI"
>>
>>seqlevelsStyle(gr2) <- "UCSC"
>>seqlevels(gr2)
>># [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
>># [4] "HSCHR6_MHC_COX_CTG1"
>>
>> So we need to work on this.
>>
>> I'm not sure about using genome(gr) <- "hg19" for this. R

Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Tim Triche, Jr.
maybe standardizeSeqinfo or fixSeqinfo is clearer after all

Statistics is the grammar of science.
Karl Pearson 

On Fri, Jun 5, 2015 at 1:41 PM, Gabe Becker  wrote:

>
>
> On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr. 
> wrote:
>
>> how about just
>>
>>  gr <- addSeqinfo(gr, "hg19")
>>
>
>  Add sounds like it's, well, adding rather than replacing (Which it
> sometimes would do.
>
>  gr <- fixSeqInfo(gr, "hg19")
>
>  instead?
>
>  ~G
>
>
>  --
>   Gabriel Becker, Ph.D
> Computational Biologist
> Genentech Research
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Gabe Becker
I dunno, standardizeSeqInfo just seems really long for a function name
users are going to have to call.

At the risk of annoying Herve further, what about

gr <- castSeqInfo(gr, "gh19")

?

~G

On Fri, Jun 5, 2015 at 1:46 PM, Tim Triche, Jr. 
wrote:

> maybe standardizeSeqinfo or fixSeqinfo is clearer after all
>
> Statistics is the grammar of science.
> Karl Pearson 
>
> On Fri, Jun 5, 2015 at 1:41 PM, Gabe Becker  wrote:
>
>>
>>
>> On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr. 
>> wrote:
>>
>>> how about just
>>>
>>>  gr <- addSeqinfo(gr, "hg19")
>>>
>>
>>  Add sounds like it's, well, adding rather than replacing (Which it
>> sometimes would do.
>>
>>  gr <- fixSeqInfo(gr, "hg19")
>>
>>  instead?
>>
>>  ~G
>>
>>
>>  --
>>   Gabriel Becker, Ph.D
>> Computational Biologist
>> Genentech Research
>>
>
>


-- 
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Hervé Pagès

On 06/05/2015 01:48 PM, Gabe Becker wrote:

I dunno, standardizeSeqInfo just seems really long for a function name
users are going to have to call.

At the risk of annoying Herve further, what about

gr <- castSeqInfo(gr, "gh19")


gr!



?

~G

On Fri, Jun 5, 2015 at 1:46 PM, Tim Triche, Jr. mailto:tim.tri...@gmail.com>> wrote:

maybe standardizeSeqinfo or fixSeqinfo is clearer after all

Statistics is the grammar of science.
Karl Pearson 

On Fri, Jun 5, 2015 at 1:41 PM, Gabe Becker mailto:becker.g...@gene.com>> wrote:



On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr.
mailto:tim.tri...@gmail.com>> wrote:

how about just

gr <- addSeqinfo(gr, "hg19")


Add sounds like it's, well, adding rather than replacing (Which
it sometimes would do.

gr <- fixSeqInfo(gr, "hg19")

instead?

~G


--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research





--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Tim Triche, Jr.
how about:
foo <- paintBikeShed("mauve")  ## :-D

more serious-like, though,

genome(gr) <- "hg19" ## and
gr <- standardizeGenome(gr, "hg19")

seem like the apotheosis of clarity and thus the way to go.



Statistics is the grammar of science.
Karl Pearson 

On Fri, Jun 5, 2015 at 1:49 PM, Hervé Pagès  wrote:

> On 06/05/2015 01:48 PM, Gabe Becker wrote:
> > I dunno, standardizeSeqInfo just seems really long for a function name
> > users are going to have to call.
> >
> > At the risk of annoying Herve further, what about
> >
> > gr <- castSeqInfo(gr, "gh19")
>
> gr!
>
> >
> > ?
> >
> > ~G
> >
> > On Fri, Jun 5, 2015 at 1:46 PM, Tim Triche, Jr.  > > wrote:
> >
> > maybe standardizeSeqinfo or fixSeqinfo is clearer after all
> >
> > Statistics is the grammar of science.
> > Karl Pearson 
> >
> > On Fri, Jun 5, 2015 at 1:41 PM, Gabe Becker  > > wrote:
> >
> >
> >
> > On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr.
> > mailto:tim.tri...@gmail.com>> wrote:
> >
> > how about just
> >
> > gr <- addSeqinfo(gr, "hg19")
> >
> >
> > Add sounds like it's, well, adding rather than replacing (Which
> > it sometimes would do.
> >
> > gr <- fixSeqInfo(gr, "hg19")
> >
> > instead?
> >
> > ~G
> >
> >
> > --
> > Gabriel Becker, Ph.D
> > Computational Biologist
> > Genentech Research
> >
> >
> >
> >
> >
> > --
> > Gabriel Becker, Ph.D
> > Computational Biologist
> > Genentech Research
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone:  (206) 667-5791
> Fax:(206) 667-1319
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Hector Corrada Bravo
Sorry if I'm too late for this...

I think having "hg19" in 'gr <- standardizeSeqinfo(gr, "hg19")' be
responsible for so much, can make this new function hard to maintain.
Might it be better if that argument takes an object corresponding to a
canonical annotation? E.g.,


library(Mus.musculus)
gr <- standardizeSeqinfo(gr, Mus.musculus)


On Fri, Jun 5, 2015 at 4:49 PM, Hervé Pagès  wrote:

> On 06/05/2015 01:48 PM, Gabe Becker wrote:
>
>> I dunno, standardizeSeqInfo just seems really long for a function name
>> users are going to have to call.
>>
>> At the risk of annoying Herve further, what about
>>
>> gr <- castSeqInfo(gr, "gh19")
>>
>
> gr!
>
>
>> ?
>>
>> ~G
>>
>> On Fri, Jun 5, 2015 at 1:46 PM, Tim Triche, Jr. > > wrote:
>>
>> maybe standardizeSeqinfo or fixSeqinfo is clearer after all
>>
>> Statistics is the grammar of science.
>> Karl Pearson 
>>
>> On Fri, Jun 5, 2015 at 1:41 PM, Gabe Becker > > wrote:
>>
>>
>>
>> On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr.
>> mailto:tim.tri...@gmail.com>> wrote:
>>
>> how about just
>>
>> gr <- addSeqinfo(gr, "hg19")
>>
>>
>> Add sounds like it's, well, adding rather than replacing (Which
>> it sometimes would do.
>>
>> gr <- fixSeqInfo(gr, "hg19")
>>
>> instead?
>>
>> ~G
>>
>>
>> --
>> Gabriel Becker, Ph.D
>> Computational Biologist
>> Genentech Research
>>
>>
>>
>>
>>
>> --
>> Gabriel Becker, Ph.D
>> Computational Biologist
>> Genentech Research
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone:  (206) 667-5791
> Fax:(206) 667-1319
>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-05 Thread Michael Lawrence
That's already possible, basically:

seqinfo(gr) <- seqinfo(Mus.musculus)

Anyway, I'm with Tim's last suggestion. Just support both. Have an
argument to genome<- like standardize=FALSE for low-level
manipulation.


On Fri, Jun 5, 2015 at 1:56 PM, Hector Corrada Bravo  wrote:
> Sorry if I'm too late for this...
>
> I think having "hg19" in 'gr <- standardizeSeqinfo(gr, "hg19")' be
> responsible for so much, can make this new function hard to maintain.
> Might it be better if that argument takes an object corresponding to a
> canonical annotation? E.g.,
>
>
> library(Mus.musculus)
> gr <- standardizeSeqinfo(gr, Mus.musculus)
>
>
> On Fri, Jun 5, 2015 at 4:49 PM, Hervé Pagès  wrote:
>>
>> On 06/05/2015 01:48 PM, Gabe Becker wrote:
>>>
>>> I dunno, standardizeSeqInfo just seems really long for a function name
>>> users are going to have to call.
>>>
>>> At the risk of annoying Herve further, what about
>>>
>>> gr <- castSeqInfo(gr, "gh19")
>>
>>
>> gr!
>>
>>>
>>> ?
>>>
>>> ~G
>>>
>>> On Fri, Jun 5, 2015 at 1:46 PM, Tim Triche, Jr. >> > wrote:
>>>
>>> maybe standardizeSeqinfo or fixSeqinfo is clearer after all
>>>
>>> Statistics is the grammar of science.
>>> Karl Pearson 
>>>
>>> On Fri, Jun 5, 2015 at 1:41 PM, Gabe Becker >> > wrote:
>>>
>>>
>>>
>>> On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr.
>>> mailto:tim.tri...@gmail.com>> wrote:
>>>
>>> how about just
>>>
>>> gr <- addSeqinfo(gr, "hg19")
>>>
>>>
>>> Add sounds like it's, well, adding rather than replacing (Which
>>> it sometimes would do.
>>>
>>> gr <- fixSeqInfo(gr, "hg19")
>>>
>>> instead?
>>>
>>> ~G
>>>
>>>
>>> --
>>> Gabriel Becker, Ph.D
>>> Computational Biologist
>>> Genentech Research
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Gabriel Becker, Ph.D
>>> Computational Biologist
>>> Genentech Research
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpa...@fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:(206) 667-1319
>>
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

2015-06-15 Thread Kasper Daniel Hansen
Hector: we still need genome version in whatever call we make.

I assume we will get an update when this has been implemented; I am itching
to use it.

Kasper

On Fri, Jun 5, 2015 at 5:04 PM, Michael Lawrence 
wrote:

> That's already possible, basically:
>
> seqinfo(gr) <- seqinfo(Mus.musculus)
>
> Anyway, I'm with Tim's last suggestion. Just support both. Have an
> argument to genome<- like standardize=FALSE for low-level
> manipulation.
>
>
> On Fri, Jun 5, 2015 at 1:56 PM, Hector Corrada Bravo 
> wrote:
> > Sorry if I'm too late for this...
> >
> > I think having "hg19" in 'gr <- standardizeSeqinfo(gr, "hg19")' be
> > responsible for so much, can make this new function hard to maintain.
> > Might it be better if that argument takes an object corresponding to a
> > canonical annotation? E.g.,
> >
> >
> > library(Mus.musculus)
> > gr <- standardizeSeqinfo(gr, Mus.musculus)
> >
> >
> > On Fri, Jun 5, 2015 at 4:49 PM, Hervé Pagès 
> wrote:
> >>
> >> On 06/05/2015 01:48 PM, Gabe Becker wrote:
> >>>
> >>> I dunno, standardizeSeqInfo just seems really long for a function name
> >>> users are going to have to call.
> >>>
> >>> At the risk of annoying Herve further, what about
> >>>
> >>> gr <- castSeqInfo(gr, "gh19")
> >>
> >>
> >> gr!
> >>
> >>>
> >>> ?
> >>>
> >>> ~G
> >>>
> >>> On Fri, Jun 5, 2015 at 1:46 PM, Tim Triche, Jr.  >>> > wrote:
> >>>
> >>> maybe standardizeSeqinfo or fixSeqinfo is clearer after all
> >>>
> >>> Statistics is the grammar of science.
> >>> Karl Pearson 
> >>>
> >>> On Fri, Jun 5, 2015 at 1:41 PM, Gabe Becker  >>> > wrote:
> >>>
> >>>
> >>>
> >>> On Fri, Jun 5, 2015 at 1:39 PM, Tim Triche, Jr.
> >>> mailto:tim.tri...@gmail.com>> wrote:
> >>>
> >>> how about just
> >>>
> >>> gr <- addSeqinfo(gr, "hg19")
> >>>
> >>>
> >>> Add sounds like it's, well, adding rather than replacing (Which
> >>> it sometimes would do.
> >>>
> >>> gr <- fixSeqInfo(gr, "hg19")
> >>>
> >>> instead?
> >>>
> >>> ~G
> >>>
> >>>
> >>> --
> >>> Gabriel Becker, Ph.D
> >>> Computational Biologist
> >>> Genentech Research
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Gabriel Becker, Ph.D
> >>> Computational Biologist
> >>> Genentech Research
> >>
> >>
> >> --
> >> Hervé Pagès
> >>
> >> Program in Computational Biology
> >> Division of Public Health Sciences
> >> Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N, M1-B514
> >> P.O. Box 19024
> >> Seattle, WA 98109-1024
> >>
> >> E-mail: hpa...@fredhutch.org
> >> Phone:  (206) 667-5791
> >> Fax:(206) 667-1319
> >>
> >> ___
> >> Bioc-devel@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
> >
>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel