Re: [Bioc-devel] proposal for additional seqlevelsStyle

Hervé Pagès Thu, 02 Jul 2020 22:51:14 -0700

Hi Vince, Robert, Kasper,

I've done some work on this. Starting with GenomeInfoDb_1.25.7 the seqlevelsStyle() setter has 2 major improvements:


1. It knows how to rename contigs and scaffolds, not just the chromosomes:

  library(TxDb.Mmusculus.UCSC.mm10.knownGene)
  txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene

  seqinfo(txdb)
  # Seqinfo object with 66 sequences (1 circular) from mm10 genome:
  # seqnames       seqlengths isCircular genome
  # chr1            195471971       <NA>   mm10
  # chr2            182113224       <NA>   mm10
  # chr3            160039680       <NA>   mm10
  # chr4            156508116       <NA>   mm10
  # chr5            151834684       <NA>   mm10
  # ...                   ...        ...    ...
  # chrUn_GL456392      23629       <NA>   mm10
  # chrUn_GL456393      55711       <NA>   mm10
  # chrUn_GL456394      24323       <NA>   mm10
  # chrUn_GL456396      21240       <NA>   mm10
  # chrUn_JH584304     114452       <NA>   mm10

  seqlevelsStyle(txdb) <- "NCBI"

  seqinfo(txdb)
  # Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
  # seqnames      seqlengths isCircular genome
  # 1              195471971       <NA> GRCm38
  # 2              182113224       <NA> GRCm38
  # 3              160039680       <NA> GRCm38
  # 4              156508116       <NA> GRCm38
  # 5              151834684       <NA> GRCm38
  # ...                  ...        ...    ...
  # MSCHRUN_CTG10      23629       <NA> GRCm38
  # MSCHRUN_CTG11      55711       <NA> GRCm38
  # MSCHRUN_CTG12      24323       <NA> GRCm38
  # MSCHRUN_CTG15      21240       <NA> GRCm38
  # MSCHRUN_CTG23     114452       <NA> GRCm38

2. It supports the new RefSeq style for renaming to/from RefSeq accessions:

  seqlevelsStyle(txdb) <- "RefSeq"

  seqinfo(txdb)
  # Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
  # seqnames    seqlengths isCircular genome
  # NC_000067.6  195471971       <NA> GRCm38
  # NC_000068.7  182113224       <NA> GRCm38
  # NC_000069.6  160039680       <NA> GRCm38
  # NC_000070.6  156508116       <NA> GRCm38
  # NC_000071.6  151834684       <NA> GRCm38
  # ...                ...        ...    ...
  # NT_166476.1      23629       <NA> GRCm38
  # NT_166477.1      55711       <NA> GRCm38
  # NT_166478.1      24323       <NA> GRCm38
  # NT_166480.1      21240       <NA> GRCm38
  # NT_187064.1     114452       <NA> GRCm38

These new features only work on objects for which the genome is set to an NCBI assembly (e.g. WBcel235) or UCSC genome (e.g. ce11). This is the case with TxDb, BSgenome, and SNPlocs objects.

The workhorses behind them are new low-level utilities getChromInfoFromNCBI() and getChromInfoFromUCSC(). These support 141 NCBI assemblies and 74 UCSC genomes at the moment, respectively. It's easy to add new organisms. The gotcha is that they require internet access and so does the seqlevelsStyle() setter. This could be mitigated by caching the data via BiocFileCache.
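
As a rough illustration of the caching idea, here is a minimal base-R sketch. It does not use BiocFileCache: a plain saveRDS()/readRDS() file cache stands in for it. The names cached_chrom_info() and fake_fetch() are hypothetical and not part of GenomeInfoDb; fake_fetch() only exists so the sketch runs offline. In real use the fetch argument would presumably be something like getChromInfoFromUCSC or getChromInfoFromNCBI.

```r
# Hypothetical helper (not part of GenomeInfoDb): wrap any chrom-info
# fetcher with a simple on-disk cache keyed by genome/assembly name.
# 'fetch' is a function that takes the name and returns a data frame.
cached_chrom_info <- function(genome, fetch,
                              cache_dir = file.path(tempdir(), "chrominfo_cache")) {
    dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
    path <- file.path(cache_dir, paste0(genome, ".rds"))
    if (file.exists(path))
        return(readRDS(path))      # cache hit: no network access needed
    chrom_info <- fetch(genome)    # cache miss: this is the online step
    saveRDS(chrom_info, path)
    chrom_info
}

# Stand-in fetcher so the sketch runs offline. It counts its own calls
# and returns the first two mm10 rows shown earlier in this message.
n_calls <- 0L
fake_fetch <- function(genome) {
    n_calls <<- n_calls + 1L
    data.frame(chrom = c("chr1", "chr2"),
               size = c(195471971, 182113224),
               stringsAsFactors = FALSE)
}

cache_dir <- tempfile("chrominfo_cache_")
first  <- cached_chrom_info("mm10", fake_fetch, cache_dir)  # calls fake_fetch
second <- cached_chrom_info("mm10", fake_fetch, cache_dir)  # read from disk
```

Only the first call reaches the fetcher; the second is served from the .rds file. A real cache would also need some way to refresh stale entries, which this sketch leaves out.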

Next thing on the list is to support the GenBank style (Vince's original request) to rename to/from GenBank accessions.


Cheers,
H.


On 12/13/19 10:51, Kasper Daniel Hansen wrote:

If the chromosome name depends on the assembly, that makes GenomeInfoDb
even more useful and necessary.  Provided it is supported of course.

On Fri, Dec 13, 2019 at 11:45 AM Vincent Carey <st...@channing.harvard.edu>
wrote:

I tried an inline png but I think it was rejected by bioc-devel.  Here's
another try.

On Fri, Dec 13, 2019 at 11:40 AM Vincent Carey <st...@channing.harvard.edu>
wrote:

Thanks -- It is good to know more about the complications of adding
seqlevelsStyle elements. I am not sure how pervasive this will be in SNP
annotation in the future. The "new API" for dbSNP references SPDI
annotation conventions.

https://api.ncbi.nlm.nih.gov/variation/v0/

At least one dbSNP build 152 resource uses this nomenclature. The one
referenced below is the "go-to" resource for current rsid-coordinate
correspondence, as far as I know.

  library(VariantAnnotation)
  # *0/0 packages newly attached/loaded, see sessionInfo() for details.*

  mypar = GRanges("NC_000001.11", IRanges(100000,120000)) # note seqnames

  nn = readVcf("ftp://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.38.gz",
               genome="GRCh38", param=mypar)

  head(rowRanges(nn), 3)
  # GRanges object with 3 ranges and 5 metadata columns:
  #                     seqnames    ranges strand | paramRangeID            REF
  #                        <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
  #    rs1331956057 NC_000001.11    100000      * |         <NA>
  #    rs1252351580 NC_000001.11    100036      * |         <NA>
  #    rs1238523913 NC_000001.11    100051      * |         <NA>
  #                                ALT      QUAL      FILTER
  #                 <DNAStringSetList> <numeric> <character>
  #    rs1331956057                  T      <NA>           .
  #    rs1252351580                  G      <NA>           .
  #    rs1238523913                  C      <NA>           .
  #    -------
  #    seqinfo: 1 sequence from GRCh38 genome; no seqlengths


On Fri, Dec 13, 2019 at 11:01 AM Robert Castelo <robert.cast...@upf.edu>
wrote:

hi Hervé,

i didn't know about this new sequence style until Vince posted his
message and we briefly talked about it at the European BioC meeting this
week in Brussels. however, i didn't know that the style was specific to
a particular assembly. i have no use case of this at the moment,
i.e., i have not encountered myself any annotation or BAM file with
chromosome names written that way, so i don't know how pressing this
issue is, maybe Vince can tell us how widespread such chromosome naming
style may become in the near future.

naively, i'd think that it would be a matter of adding a
reference-specific column, i.e., 'GRCh38.p13', 'GRCh37.p13', etc., but i
can imagine that maybe the "reference style" concept might not be the
appropriate placeholder to map all different chromosome names of all
different individual human genomes uploaded to NCBI. maybe we should
wait until we have a specific use case .. Vince?

robert.

On 12/11/19 10:06 PM, Pages, Herve wrote:

Hi Vince, Robert,

Looks like Vince wants the RefSeq accession e.g. NC_000017.11 for chrom
17 in GRCh38.

@Robert: Is this what you're also interested in?

The problem is that the RefSeq accessions are specific to a particular
assembly (e.g. NC_000017.11 for chrom 17 in GRCh38 but NC_000017.10 for
the same chrom in GRCh37).

Currently seqlevelsStyle() doesn't know how to distinguish between
different assemblies of the same organism. Not saying it couldn't but it
would require some thinking and some significant refactoring. It
wouldn't be just a matter of adding a column to
genomeStyles()$Homo_sapiens.
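
To make the point concrete, here is a small base-R sketch of why one extra column per organism is not enough: the lookup has to be keyed by assembly as well as by chromosome. The table and the to_refseq() helper are hypothetical and only for illustration; the two accessions are the ones quoted above. This is not how GenomeInfoDb stores or resolves styles.

```r
# Toy lookup table. The two accessions are the ones quoted in this
# thread; a real table would have one row per sequence per assembly.
refseq_accn <- data.frame(
    assembly = c("GRCh37", "GRCh38"),
    chrom    = c("17", "17"),
    refseq   = c("NC_000017.10", "NC_000017.11"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: translate chromosome names to RefSeq accessions
# for one given assembly. Names missing from the table map to NA.
to_refseq <- function(chroms, assembly) {
    tab <- refseq_accn[refseq_accn$assembly == assembly, ]
    tab$refseq[match(chroms, tab$chrom)]
}

chr17_grch37 <- to_refseq("17", "GRCh37")
chr17_grch38 <- to_refseq("17", "GRCh38")
```

The same chromosome name yields a different accession depending on the assembly argument, which a single per-organism column cannot express.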

H.


On 12/10/19 14:19, Robert Castelo wrote:

I second this, and would suggest to name the style as 'GRC' for "Genome
Reference Consortium".

thanks Vince for bringing this up, being able to easily switch between
genome styles is great.

if 'paste0()' in R is one of the most influential contributions to
statistical computing

https://simplystatistics.org/2013/01/31/paste0-is-statistical-computings-most-influential-contribution-of-the-21st-century

i think that 'seqlevelsStyle()' from the GenomeInfoDb package is one of
the most influential contributions to human genetics, if you think about
the time invested by researchers in parsing and changing between
different styles of chromosome names :)

robert.

On 06/12/2019 15:03, Vincent Carey wrote:

I raised this issue previously with little response.

I'd propose that we add a column or two to genomeStyles()$Homo_sapiens

  head(genomeStyles()$Homo_sapiens, 2)
  #   circular auto   sex NCBI UCSC dbSNP Ensembl
  # 1    FALSE TRUE FALSE    1 chr1   ch1       1
  # 2    FALSE TRUE FALSE    2 chr2   ch2       2


that includes the values for "NCBI reference sequence names"

See https://www.ncbi.nlm.nih.gov/nuccore/568815581 for one report on chr17,
and https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39 for a table that
includes the GenBank labels.

Should I just file a PR at https://github.com/Bioconductor/GenomeInfoDb/
after testing?


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550




