Re: [Bioc-devel] BSgenome changes

2020-08-20 Thread Hervé Pagès

Kasper,

The tradition so far has been to package all UCSC human genomes since 
hg17. We could also start producing BSgenome packages for other, non-UCSC 
human assemblies. We just need to draw a line somewhere. If there is a 
need for it, we can make BSgenome.Hsapiens.NCBI.GRCh37.p13 available, as 
I said earlier. Is this what you are asking for?


H.

On 8/20/20 03:23, Kasper Daniel Hansen wrote:
Well, the presence of two mitochondrial genomes is to fix a mistake by 
UCSC. I can appreciate the importance of representing this mistake when 
you build off UCSC. But it strikes me as not actually representing the 
h37 version of the genome, and it seems to me that we want such a 
representation in the project - not everything comes through UCSC. But 
perhaps I have not given this sufficient thought, this is just my 
immediate reaction.


On Tue, Aug 18, 2020 at 8:18 PM Leonard Goldstein <goldstein.leon...@gene.com> wrote:


Thanks for the explanation Hervé.

Best wishes

Leonard


On Tue, Aug 18, 2020 at 10:06 AM Hervé Pagès <hpa...@fredhutch.org> wrote:

On 8/18/20 01:40, Kasper Daniel Hansen wrote:
 > In light of this, could we get a version of GRCh37 with only a single
 > mitochondrial genome?

You mean a BSgenome.Hsapiens.NCBI.GRCh37.p13 package? So it would
contain the same sequences as BSgenome.Hsapiens.UCSC.hg19 but without
the hg19:chrM sequence?

Certainly doable but note that by using BSgenome.Hsapiens.UCSC.hg38 you
stay away from this mess. I'm not sure that adding yet another BSgenome
package would make the situation less confusing.

 >
 > On Fri, Aug 14, 2020 at 6:17 PM Hervé Pagès <hpa...@fredhutch.org> wrote:
 >
 >     Hi Felix,
 >
 >     On 8/13/20 21:43, Felix Ernst wrote:
 >      > Hi Leonard, Hi Herve,
 >      >
 >      > I followed your conversation, since I have noticed the same
 >      > problem. Thanks, Herve, for the explanation of the recent changes on
 >      > hg19.
 >      >
 >      > The GRCh37.P13 report states in its last line:
 >      >
 >      > MT    assembled-molecule      MT      Mitochondrion   J01415.2        =       NC_012920.1     non-nuclear     16569   chrM
 >      >
 >      > Since the last name is called "UCSC-style-name", wouldn't that
 >      > mean that chrM has to be renamed to MT and not chrMT?
 >
 >     This is a mistake in the sequence report for GRCh37.p13. GRCh37.p13:MT
 >     is the same as hg19:chrMT, not hg19:chrM.
 >
 >     hg19:chrM and hg19:chrMT are **not** the same sequences. The former is
 >     NC_001807 and has length 16571 and the latter is NC_012920.1 and has
 >     length 16569.
 >
 >     Yes, seqlevelsStyle() is sorting out all this mess for you ;-)
 >
 >     Cheers,
 >     H.
 >
 >      >
 >      > Thanks again for the explanation.
 >      >
 >      > Cheers,
 >      > Felix
 >      >
 >      > -----Original Message-----
 >      > From: Bioc-devel <bioc-devel-boun...@r-project.org> On Behalf Of Hervé Pagès
 >      > Sent: Friday, August 14, 2020 01:08
 >      > To: Leonard Goldstein <goldstein.leon...@gene.com>; bioc-devel@r-project.org
 >      > Cc: charlotte.sone...@fmi.ch
 >      > Subject: Re: [Bioc-devel] BSgenome changes
 >      >
 >      > Hi Leonard,
 >      >
 >      > On 8/12/20 15:22, Leonard Goldstein via Bioc-devel wrote:
 >      >> Dear Bioc team,
 >      >>
 >      >> I'm following up on this recent GitHub issue
 >      >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ldg21_SGSeq_issues_5&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=n5bIFHTIgC1B4EdjWUDLIlVcRJdXScYvfbojaqTJZVg&s=Tfk-tDM99P63dnsvMydG2phv5WQPVbJzPk0hzi-_1SE&e=>.
 >      >> Please see the issue for more details and code examples.
 >      >>
 >     
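
The naming problem discussed in the message above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the accessions and lengths are the ones quoted in the thread, while the dictionary layout and the function name ucsc_to_ncbi are hypothetical and are not how seqlevelsStyle() or GenomeInfoDb is implemented. It matches names by the underlying sequence rather than by the "UCSC-style-name" column of the GRCh37.p13 report, which the thread identifies as mistaken.

```python
# Toy model of the hg19 / GRCh37.p13 mitochondrial naming issue.
# Accessions and lengths are taken from the thread; everything else
# (data layout, function name) is a hypothetical illustration.

HG19_MITO = {
    "chrM": {"refseq": "NC_001807", "length": 16571},
    "chrMT": {"refseq": "NC_012920.1", "length": 16569},
}

GRCH37_P13_MITO = {
    "MT": {"refseq": "NC_012920.1", "length": 16569},
}


def ucsc_to_ncbi(ucsc_name):
    """Return the GRCh37.p13 name holding the same sequence, or None."""
    source = HG19_MITO[ucsc_name]
    for ncbi_name, record in GRCH37_P13_MITO.items():
        same_accession = record["refseq"] == source["refseq"]
        same_length = record["length"] == source["length"]
        if same_accession and same_length:
            return ncbi_name
    # No GRCh37.p13 sequence corresponds to this hg19 sequence.
    return None


print(ucsc_to_ncbi("chrMT"))  # MT
print(ucsc_to_ncbi("chrM"))   # None
```

In this model only hg19:chrMT has a counterpart in GRCh37.p13, which is consistent with the explanation that GRCh37.p13:MT is the same sequence as hg19:chrMT and not hg19:chrM.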

Re: [Bioc-devel] BSgenome changes

2020-08-20 Thread Kasper Daniel Hansen
Well, the presence of two mitochondrial genomes is to fix a mistake by
UCSC. I can appreciate the importance of representing this mistake when you
build off UCSC. But it strikes me as not actually representing the h37
version of the genome, and it seems to me that we want such a
representation in the project - not everything comes through UCSC. But
perhaps I have not given this sufficient thought, this is just my immediate
reaction.

On Tue, Aug 18, 2020 at 8:18 PM Leonard Goldstein <
goldstein.leon...@gene.com> wrote:

> Thanks for the explanation Hervé.
>
> Best wishes
>
> Leonard
>
>
> On Tue, Aug 18, 2020 at 10:06 AM Hervé Pagès wrote:
>
>> On 8/18/20 01:40, Kasper Daniel Hansen wrote:
>> > In light of this, could we get a version of GRCh37 with only a single
>> > mitochondrial genome?
>>
>> You mean a BSgenome.Hsapiens.NCBI.GRCh37.p13 package? So it would
>> contain the same sequences as BSgenome.Hsapiens.UCSC.hg19 but without
>> the hg19:chrM sequence?
>>
>> Certainly doable but note that by using BSgenome.Hsapiens.UCSC.hg38 you
>> stay away from this mess. I'm not sure that adding yet another BSgenome
>> package would make the situation less confusing.
>>
>> >
>> > On Fri, Aug 14, 2020 at 6:17 PM Hervé Pagès wrote:
>> >
>> > Hi Felix,
>> >
>> > On 8/13/20 21:43, Felix Ernst wrote:
>> >  > Hi Leonard, Hi Herve,
>> >  >
>> >  > I followed your conversation, since I have noticed the same
>> >  > problem. Thanks, Herve, for the explanation of the recent changes on
>> >  > hg19.
>> >  >
>> >  > The GRCh37.P13 report states in its last line:
>> >  >
>> >  > MT    assembled-molecule      MT      Mitochondrion   J01415.2        =       NC_012920.1     non-nuclear     16569   chrM
>> >  >
>> >  > Since the last name is called "UCSC-style-name", wouldn't that
>> >  > mean that chrM has to be renamed to MT and not chrMT?
>> >
>> > This is a mistake in the sequence report for GRCh37.p13. GRCh37.p13:MT
>> > is the same as hg19:chrMT, not hg19:chrM.
>> >
>> > hg19:chrM and hg19:chrMT are **not** the same sequences. The former is
>> > NC_001807 and has length 16571 and the latter is NC_012920.1 and has
>> > length 16569.
>> >
>> > Yes, seqlevelsStyle() is sorting out all this mess for you ;-)
>> >
>> > Cheers,
>> > H.
>> >
>> >  >
>> >  > Thanks again for the explanation.
>> >  >
>> >  > Cheers,
>> >  > Felix
>> >  >
>> >  > -----Original Message-----
>> >  > From: Bioc-devel On Behalf Of Hervé Pagès
>> >  > Sent: Friday, August 14, 2020 01:08
>> >  > To: Leonard Goldstein; bioc-devel@r-project.org
>> >  > Cc: charlotte.sone...@fmi.ch
>> >  > Subject: Re: [Bioc-devel] BSgenome changes
>> >  >
>> >  > Hi Leonard,
>> >  >
>> >  > On 8/12/20 15:22, Leonard Goldstein via Bioc-devel wrote:
>> >  >> Dear Bioc team,
>> >  >>
>> >  >> I'm following up on this recent GitHub issue
>> >  >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_ldg21_SGSeq_issues_5&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=n5bIFHTIgC1B4EdjWUDLIlVcRJdXScYvfbojaqTJZVg&s=Tfk-tDM99P63dnsvMydG2phv5WQPVbJzPk0hzi-_1SE&e=>.
>> >  >> Please see the issue for more details and code examples.
>> >  >>
>> >  >> It looks like changes in Bioc devel result in two copies of the
>> >  >> mitochondrial chromosome for BSgenome.Hsapiens.UCSC.hg19 -- one named
>> >  >> chrM like in previous package versions (length 16571) and one named
>> >  >> chrMT (length 16569).
>> >  >>
>> >  >> When using seqlevelsStyle() to change chromosome names from UCSC to
>> >  >> NCBI format, this results in new behavior -- in the past chrM was
>> >  >> simply renamed MT, now the different sequence chrMT is used. Is
>> >  >> this intended?
>> >  >
>> >  > Absolutely intended.
>> >  >
>> >  > There is a long story behind the unfortunate fate of the
>> >  > mitochondrial chromosome in hg19. I'll try to keep it short.
>> >  >
>> >  > When the UCSC folks released the hg19 browser more than 10 years
>> >  > ago, they based it on assembly GRCh37:
>> >  >
>> >  > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ncbi.nlm.nih.gov_assembly_GCF-5F01405.13&d=DwIGaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=49jni5SmG_DH80nnPZXXqvFNceB5jkZtlb7eKEA8558&s=jWtgKVQGC-SQp6i4prhKBiD5cBh2kEc8R1gL2uPlzy0&e=
>> >  >
>> >  > See sequence report for GRCh37:
>> >  >
>> >  >
>> >  >
>> >
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__ftp.ncbi.nlm.nih.gov_genomes_all