[Bioc-devel] contradictions in the return value of predictCoding() from VariantAnnotation

2013-03-07 Thread Michael Stadler
Dear Valerie and colleagues,

I have recently started using the VariantAnnotation package, which is a
huge asset for much of my work with sequence variants - thank you!

I have a question regarding the following example (it makes use of two
small text files, "annot.gtf" and "variants.vcf", attached to this message):

library(VariantAnnotation)
library(GenomicFeatures)
library(BSgenome.Celegans.UCSC.ce6)

# build a TranscriptDb from "annot.gtf"
chrominfo <- data.frame(chrom=as.character(seqnames(Celegans)),
length=seqlengths(Celegans),
is_circular=FALSE)
txdb <- makeTranscriptDbFromGFF(file="annot.gtf",
format="gtf",
chrominfo=chrominfo,
dataSource="WormBase v. WS190",
species="Caenorhabditis elegans")
txdb
#TranscriptDb object:
#| Db type: TranscriptDb
#| Supporting package: GenomicFeatures
#| Data source: WormBase v. WS190
#| Organism: Caenorhabditis elegans
#| miRBase build ID: NA
#| transcript_nrow: 7
#| exon_nrow: 42
#| cds_nrow: 42
#| Db created by: GenomicFeatures package from Bioconductor
#| Creation time: 2013-03-07 14:21:08 +0100 (Thu, 07 Mar 2013)
#| GenomicFeatures version at creation time: 1.11.14
#| RSQLite version at creation time: 0.11.2
#| DBSCHEMAVERSION: 1.0

# read in sequence variants
vcf <- readVcf("variants.vcf", genome="ce6")
dim(vcf)
#[1] 3 1

# predict the impact on coding sequences
aa <- predictCoding(query=vcf, subject=txdb, seqSource=Celegans)
length(aa)
#[1] 7


# My question is related to the impact of variant "chrV:15822727":
# This SNV is overlapping two transcripts (same base, plus strand),
# yet the reported VARCODON values are different (GAT -> TAT/AAT):
aa[names(aa)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]

#GRanges with 2 ranges and 5 metadata columns:
#seqnames   ranges strand |
#  |
#  chrV:15822727 chrV [15822727, 15822727]  + |
#  chrV:15822727 chrV [15822727, 15822727]  + |
#   REFALT  varAllele
#  
#  chrV:15822727  G  A  A
#  chrV:15822727  G  A  A
#  REFCODON   VARCODON
# 
#  chrV:15822727GATTAT
#  chrV:15822727GATAAT
#  ---
#  seqlengths:
#   chrI chrV
# NA   NA


# interestingly, when altering the annotation or the set of variants,
# this contradictions disapears (GAT -> AAT):
vcf2 <- vcf[-1] # remove the first SNV
aa2 <- predictCoding(query=vcf2, subject=txdb, seqSource=Celegans)
length(aa2)
#[1] 5
aa2[names(aa2)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]

#GRanges with 2 ranges and 5 metadata columns:
#seqnames   ranges strand |
#  |
#  chrV:15822727 chrV [15822727, 15822727]  + |
#  chrV:15822727 chrV [15822727, 15822727]  + |
#   REFALT  varAllele
#  
#  chrV:15822727  G  A  A
#  chrV:15822727  G  A  A
#  REFCODON   VARCODON
# 
#  chrV:15822727GATAAT
#  chrV:15822727GATAAT
#  ---
#  seqlengths:
#   chrI chrV
# NA   NA

Do you know why this could be the case?
Regards,
Michael


Here is my envirnoment:
sessionInfo()
R Under development (unstable) (2013-02-25 r62061)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats graphics  grDevices utils datasets
[7] methods   base

other attached packages:
 [1] BSgenome.Celegans.UCSC.ce6_1.3.19
 [2] BSgenome_1.27.1
 [3] GenomicFeatures_1.11.14
 [4] AnnotationDbi_1.21.12
 [5] Biobase_2.19.3
 [6] VariantAnnotation_1.5.42
 [7] Rsamtools_1.11.21
 [8] Biostrings_2.27.11
 [9] GenomicRanges_1.11.35
[10] IRanges_1.17.35
[11] BiocGenerics_0.5.6
[12] RColorBrewer_1.0-5

loaded via a namespace (and not attached):
 [1] DBI_0.2-5  RCurl_1.95-4.1 RSQLite_0.11.2
 [4] XML_3.95-0.1   biomaRt_2.15.0 bitops_1.0-5
 [7] rtracklayer_1.19.9 stats4_3.0.0   tools_3.0.0
[10] zlibbioc_1.5.0

chrVCoding_transcript   exon6512065 6512138 .   -   2   
gene_id "WBGene1073"; transcript_id "F46E10.9.1";
chrVCoding_transcript   exon6512189 6512279 .   -   0   
gene_id "WBGene1073"; transcript_id "F46E10.9.1";
chrVCoding_transcript   exon6512325 6512472 .   -   1   
gene_id "WBGene1073"; transcript_id "F46E10.9.1";
chrVCoding_transcript   exon6512521 6512649 .   -   1   
gene_id "WBGene1073"; transcript_id "F46E10.9.1";
chrVCoding_t

Re: [Bioc-devel] contradictions in the return value of predictCoding() from VariantAnnotation

2013-03-07 Thread Michael Stadler
Here is the second attachment ("variants.vcf") that was missing from the
first message.

Michael

On 03/07/2013 02:40 PM, Michael Stadler wrote:
> Dear Valerie and colleagues,
> 
> I have recently started using the VariantAnnotation package, which is a
> huge asset for much of my work with sequence variants - thank you!
> 
> I have a question regarding the following example (it makes use of two
> small text files, "annot.gtf" and "variants.vcf", attached to this message):
> 
> library(VariantAnnotation)
> library(GenomicFeatures)
> library(BSgenome.Celegans.UCSC.ce6)
> 
> # build a TranscriptDb from "annot.gtf"
> chrominfo <- data.frame(chrom=as.character(seqnames(Celegans)),
> length=seqlengths(Celegans),
> is_circular=FALSE)
> txdb <- makeTranscriptDbFromGFF(file="annot.gtf",
> format="gtf",
> chrominfo=chrominfo,
> dataSource="WormBase v. WS190",
> species="Caenorhabditis elegans")
> txdb
> #TranscriptDb object:
> #| Db type: TranscriptDb
> #| Supporting package: GenomicFeatures
> #| Data source: WormBase v. WS190
> #| Organism: Caenorhabditis elegans
> #| miRBase build ID: NA
> #| transcript_nrow: 7
> #| exon_nrow: 42
> #| cds_nrow: 42
> #| Db created by: GenomicFeatures package from Bioconductor
> #| Creation time: 2013-03-07 14:21:08 +0100 (Thu, 07 Mar 2013)
> #| GenomicFeatures version at creation time: 1.11.14
> #| RSQLite version at creation time: 0.11.2
> #| DBSCHEMAVERSION: 1.0
> 
> # read in sequence variants
> vcf <- readVcf("variants.vcf", genome="ce6")
> dim(vcf)
> #[1] 3 1
> 
> # predict the impact on coding sequences
> aa <- predictCoding(query=vcf, subject=txdb, seqSource=Celegans)
> length(aa)
> #[1] 7
> 
> 
> # My question is related to the impact of variant "chrV:15822727":
> # This SNV is overlapping two transcripts (same base, plus strand),
> # yet the reported VARCODON values are different (GAT -> TAT/AAT):
> aa[names(aa)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]
> 
> #GRanges with 2 ranges and 5 metadata columns:
> #seqnames   ranges strand |
> #  |
> #  chrV:15822727 chrV [15822727, 15822727]  + |
> #  chrV:15822727 chrV [15822727, 15822727]  + |
> #   REFALT  varAllele
> #  
> #  chrV:15822727  G  A  A
> #  chrV:15822727  G  A  A
> #  REFCODON   VARCODON
> # 
> #  chrV:15822727GATTAT
> #  chrV:15822727GATAAT
> #  ---
> #  seqlengths:
> #   chrI chrV
> # NA   NA
> 
> 
> # interestingly, when altering the annotation or the set of variants,
> # this contradictions disapears (GAT -> AAT):
> vcf2 <- vcf[-1] # remove the first SNV
> aa2 <- predictCoding(query=vcf2, subject=txdb, seqSource=Celegans)
> length(aa2)
> #[1] 5
> aa2[names(aa2)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]
> 
> #GRanges with 2 ranges and 5 metadata columns:
> #seqnames   ranges strand |
> #  |
> #  chrV:15822727 chrV [15822727, 15822727]  + |
> #  chrV:15822727 chrV [15822727, 15822727]  + |
> #   REFALT  varAllele
> #  
> #  chrV:15822727  G  A  A
> #  chrV:15822727  G  A  A
> #  REFCODON   VARCODON
> # 
> #  chrV:15822727GATAAT
> #  chrV:15822727GATAAT
> #  ---
> #  seqlengths:
> #   chrI chrV
> # NA   NA
> 
> Do you know why this could be the case?
> Regards,
> Michael
> 
> 
> Here is my envirnoment:
> sessionInfo()
> R Under development (unstable) (2013-02-25 r62061)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] parallel  stats graphics  grDevices utils datasets
> [7] methods   base
> 
> other attached packages:
>  [1] BSgenome.Celegans.UCSC.ce6_1.3.19
>  [2] BSgenome_1.27.1
>  [3] GenomicFeatures_1.11.14
>  [4] AnnotationDbi_1.21.12
>  [5] Biobase_2.19.3
>  [6] VariantAnnotation_1.5.42
>  [7] Rsamtools_1.11.21
>  [8] Biostrings_2.27.11
>  [9] GenomicRanges_1.11.35
> [10] IRanges_1.17.35
> [11] BiocGenerics_0.5.6
> [12] RColorBrewer_1.0-5
> 
> loaded via a namespace (and not attached):
>  [1] DBI_0.2-5  RCurl_1.95-4.1 RSQLite_0.11.2
>  [4] XML_3.95-0.1   biomaRt_2.15.0 bitops_1.0-5
>  [7] rtracklayer_1.19.9 stats4_3.0.0   tools_3.0.0
> [10] zlibbioc_1.5.0
> 
> 
> 
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-

Re: [Bioc-devel] contradictions in the return value of predictCoding() from VariantAnnotation

2013-03-09 Thread Valerie Obenchain

Hi Michael,

Thanks for reporting the bug and sending a reproducible example. Sorry 
it took a few days to get to this.


A fix has been checked into versions 1.4.12 (release) and 1.5.44 (devel).

Valerie


On 03/07/13 05:43, Michael Stadler wrote:

Here is the second attachment ("variants.vcf") that was missing from the
first message.

Michael

On 03/07/2013 02:40 PM, Michael Stadler wrote:

Dear Valerie and colleagues,

I have recently started using the VariantAnnotation package, which is a
huge asset for much of my work with sequence variants - thank you!

I have a question regarding the following example (it makes use of two
small text files, "annot.gtf" and "variants.vcf", attached to this message):

library(VariantAnnotation)
library(GenomicFeatures)
library(BSgenome.Celegans.UCSC.ce6)

# build a TranscriptDb from "annot.gtf"
chrominfo <- data.frame(chrom=as.character(seqnames(Celegans)),
 length=seqlengths(Celegans),
 is_circular=FALSE)
txdb <- makeTranscriptDbFromGFF(file="annot.gtf",
 format="gtf",
 chrominfo=chrominfo,
 dataSource="WormBase v. WS190",
 species="Caenorhabditis elegans")
txdb
#TranscriptDb object:
#| Db type: TranscriptDb
#| Supporting package: GenomicFeatures
#| Data source: WormBase v. WS190
#| Organism: Caenorhabditis elegans
#| miRBase build ID: NA
#| transcript_nrow: 7
#| exon_nrow: 42
#| cds_nrow: 42
#| Db created by: GenomicFeatures package from Bioconductor
#| Creation time: 2013-03-07 14:21:08 +0100 (Thu, 07 Mar 2013)
#| GenomicFeatures version at creation time: 1.11.14
#| RSQLite version at creation time: 0.11.2
#| DBSCHEMAVERSION: 1.0

# read in sequence variants
vcf <- readVcf("variants.vcf", genome="ce6")
dim(vcf)
#[1] 3 1

# predict the impact on coding sequences
aa <- predictCoding(query=vcf, subject=txdb, seqSource=Celegans)
length(aa)
#[1] 7


# My question is related to the impact of variant "chrV:15822727":
# This SNV is overlapping two transcripts (same base, plus strand),
# yet the reported VARCODON values are different (GAT -> TAT/AAT):
aa[names(aa)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]

#GRanges with 2 ranges and 5 metadata columns:
#seqnames   ranges strand |
#  |
#  chrV:15822727 chrV [15822727, 15822727]  + |
#  chrV:15822727 chrV [15822727, 15822727]  + |
#   REFALT  varAllele
#  
#  chrV:15822727  G  A  A
#  chrV:15822727  G  A  A
#  REFCODON   VARCODON
# 
#  chrV:15822727GATTAT
#  chrV:15822727GATAAT
#  ---
#  seqlengths:
#   chrI chrV
# NA   NA


# interestingly, when altering the annotation or the set of variants,
# this contradictions disapears (GAT -> AAT):
vcf2 <- vcf[-1] # remove the first SNV
aa2 <- predictCoding(query=vcf2, subject=txdb, seqSource=Celegans)
length(aa2)
#[1] 5
aa2[names(aa2)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]

#GRanges with 2 ranges and 5 metadata columns:
#seqnames   ranges strand |
#  |
#  chrV:15822727 chrV [15822727, 15822727]  + |
#  chrV:15822727 chrV [15822727, 15822727]  + |
#   REFALT  varAllele
#  
#  chrV:15822727  G  A  A
#  chrV:15822727  G  A  A
#  REFCODON   VARCODON
# 
#  chrV:15822727GATAAT
#  chrV:15822727GATAAT
#  ---
#  seqlengths:
#   chrI chrV
# NA   NA

Do you know why this could be the case?
Regards,
Michael


Here is my envirnoment:
sessionInfo()
R Under development (unstable) (2013-02-25 r62061)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats graphics  grDevices utils datasets
[7] methods   base

other attached packages:
  [1] BSgenome.Celegans.UCSC.ce6_1.3.19
  [2] BSgenome_1.27.1
  [3] GenomicFeatures_1.11.14
  [4] AnnotationDbi_1.21.12
  [5] Biobase_2.19.3
  [6] VariantAnnotation_1.5.42
  [7] Rsamtools_1.11.21
  [8] Biostrings_2.27.11
  [9] GenomicRanges_1.11.35
[10] IRanges_1.17.35
[11] BiocGenerics_0.5.6
[12] RColorBrewer_1.0-5

loaded via a namespace (and not attached):
  [1] DBI_0.2-5  RCurl_1.95-4.1 RSQLite_0.11.2
  [4] XML_3.95-0.1   biomaRt_2.15.0 bitops_1.0-5
  [7] rtracklayer_1.19.9 stats4_3.0.0   tools_3.0.0
[10] zlibbioc_1.5.0



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel





Re: [Bioc-devel] contradictions in the return value of predictCoding() from VariantAnnotation

2013-03-12 Thread Michael Stadler
Hi Valerie,

Thank you for the quick fix - it works like a charm!

Michael

On 03/10/2013 02:45 AM, Valerie Obenchain wrote:
> Hi Michael,
> 
> Thanks for reporting the bug and sending a reproducible example. Sorry
> it took a few days to get to this.
> 
> A fix has been checked into versions 1.4.12 (release) and 1.5.44 (devel).
> 
> Valerie
> 
> 
> On 03/07/13 05:43, Michael Stadler wrote:
>> Here is the second attachment ("variants.vcf") that was missing from the
>> first message.
>>
>> Michael
>>
>> On 03/07/2013 02:40 PM, Michael Stadler wrote:
>>> Dear Valerie and colleagues,
>>>
>>> I have recently started using the VariantAnnotation package, which is a
>>> huge asset for much of my work with sequence variants - thank you!
>>>
>>> I have a question regarding the following example (it makes use of two
>>> small text files, "annot.gtf" and "variants.vcf", attached to this
>>> message):
>>>
>>> library(VariantAnnotation)
>>> library(GenomicFeatures)
>>> library(BSgenome.Celegans.UCSC.ce6)
>>>
>>> # build a TranscriptDb from "annot.gtf"
>>> chrominfo <- data.frame(chrom=as.character(seqnames(Celegans)),
>>>  length=seqlengths(Celegans),
>>>  is_circular=FALSE)
>>> txdb <- makeTranscriptDbFromGFF(file="annot.gtf",
>>>  format="gtf",
>>>  chrominfo=chrominfo,
>>>  dataSource="WormBase v. WS190",
>>>  species="Caenorhabditis elegans")
>>> txdb
>>> #TranscriptDb object:
>>> #| Db type: TranscriptDb
>>> #| Supporting package: GenomicFeatures
>>> #| Data source: WormBase v. WS190
>>> #| Organism: Caenorhabditis elegans
>>> #| miRBase build ID: NA
>>> #| transcript_nrow: 7
>>> #| exon_nrow: 42
>>> #| cds_nrow: 42
>>> #| Db created by: GenomicFeatures package from Bioconductor
>>> #| Creation time: 2013-03-07 14:21:08 +0100 (Thu, 07 Mar 2013)
>>> #| GenomicFeatures version at creation time: 1.11.14
>>> #| RSQLite version at creation time: 0.11.2
>>> #| DBSCHEMAVERSION: 1.0
>>>
>>> # read in sequence variants
>>> vcf <- readVcf("variants.vcf", genome="ce6")
>>> dim(vcf)
>>> #[1] 3 1
>>>
>>> # predict the impact on coding sequences
>>> aa <- predictCoding(query=vcf, subject=txdb, seqSource=Celegans)
>>> length(aa)
>>> #[1] 7
>>>
>>>
>>> # My question is related to the impact of variant "chrV:15822727":
>>> # This SNV is overlapping two transcripts (same base, plus strand),
>>> # yet the reported VARCODON values are different (GAT -> TAT/AAT):
>>> aa[names(aa)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]
>>>
>>>
>>> #GRanges with 2 ranges and 5 metadata columns:
>>> #seqnames   ranges strand |
>>> #  |
>>> #  chrV:15822727 chrV [15822727, 15822727]  + |
>>> #  chrV:15822727 chrV [15822727, 15822727]  + |
>>> #   REFALT  varAllele
>>> #  
>>> #  chrV:15822727  G  A  A
>>> #  chrV:15822727  G  A  A
>>> #  REFCODON   VARCODON
>>> # 
>>> #  chrV:15822727GATTAT
>>> #  chrV:15822727GATAAT
>>> #  ---
>>> #  seqlengths:
>>> #   chrI chrV
>>> # NA   NA
>>>
>>>
>>> # interestingly, when altering the annotation or the set of variants,
>>> # this contradictions disapears (GAT -> AAT):
>>> vcf2 <- vcf[-1] # remove the first SNV
>>> aa2 <- predictCoding(query=vcf2, subject=txdb, seqSource=Celegans)
>>> length(aa2)
>>> #[1] 5
>>> aa2[names(aa2)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]
>>>
>>>
>>> #GRanges with 2 ranges and 5 metadata columns:
>>> #seqnames   ranges strand |
>>> #  |
>>> #  chrV:15822727 chrV [15822727, 15822727]  + |
>>> #  chrV:15822727 chrV [15822727, 15822727]  + |
>>> #   REFALT  varAllele
>>> #  
>>> #  chrV:15822727  G  A  A
>>> #  chrV:15822727  G  A  A
>>> #  REFCODON   VARCODON
>>> # 
>>> #  chrV:15822727GATAAT
>>> #  chrV:15822727GATAAT
>>> #  ---
>>> #  seqlengths:
>>> #   chrI chrV
>>> # NA   NA
>>>
>>> Do you know why this could be the case?
>>> Regards,
>>> Michael
>>>
>>>
>>> Here is my envirnoment:
>>> sessionInfo()
>>> R Under development (unstable) (2013-02-25 r62061)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] parallel  stats graphics  grDevices utils datasets
>>> [7] methods   base
>>>
>>> other attached packages:
>>>   [1] BSgenome.Celegans.UCSC.ce6_1.3.19
>>>   [2] BSgenome_1.27.1
>>>   [3]