Val
On 06/11/2015 10:19 AM, Michael Lawrence wrote:
Val, I wasn't suggesting that LOCSTRAND be added to the locateVariants() output. Rather, it would be added to the VRanges during the merge. Michael On Thu, Jun 11, 2015 at 10:10 AM, Valerie Obenchain <voben...@fredhutch.org> wrote:locateVariants(), predictCoding() and the family of mapToTranscripts() functions all return strand according to the annotation matched. The only time the strand of the output could possibly be different from the strand of the input 'query' is when 'ignore.strand = TRUE' (FALSE by default). I wouldn't think you (Robert) are using 'ignore.strand = TRUE', are you? By just using the default, the output will have the same strand as the input 'query' (unless 'query' is '*' of course). That said, do you still feel it's necessary to add a LOCSTRAND column to the output? Val On 06/11/2015 09:38 AM, Michael Lawrence wrote:I didn't realize that locateVariants() returned an object with its strand matching that of the subject. I would have expected the subject strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it sounds like you want to merge the locateVariants() output with the input. Merging the output strand as LOCSTRAND on the VRanges sounds like a reasonable approach, for now. I don't know if Val is listening, but it sounds like it would be nice to have convenient functions for merging locateVariants() output with its input. The one for VRanges might do something like the above. Michael On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo <robert.cast...@upf.edu> wrote:Of course, the inclusion of strand would imply an interpretation of the variant and its strand (e.g., "-") with respect to an annotated feature. I can see a practical problem of integrity of the information on a VRanges object, by which a mandatory column, such as strand, depends on a non-mandatory column, such as some feature annotation stored as a metadata column. A solution would be to add the transcript identifier (TXID) as mandatory column on the VRanges object but I suspect this is a big change to do, so adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by locateVariants) in the metadata columns of the VRanges object would allow me to use a VRanges object as a container of variant x allele x sample x annotation. Just to clear up the issue of merging strand and variant: a noisy variant (a variant that is not silent) and has a, e.g., loss-of-function effect such as the gain of a stop codon, is usually interpreted in the strand of the transcript and coding sequence in which the stop codon is gained, saying something like and A changed to a T producting the stop codon TAA. Ref and alt alleles are called in the strand of the reference chromosome, so if the transcript was annotated in the negative strand, we would know that we need to reverse-complement ref and alt to interpret the variant, although I see no need to do anything on the VRanges object to ref and alt because we know they are always in the strand of the reference chromosome. Only if you want to detect this stop-gain event (with predictCoding) then you would have to reverse-complement the ref and alt alleles. Conversely, if the variant falls in an intergenic region, then obviously the strand plays no role in the interpretation of the variant and nothing needs to be done when interpreting the ref and alt alleles. On 6/11/15 5:47 PM, Michael Lawrence wrote:The fact that the position describes the variant, but the strand refers to the transcript is confusing to me. What is the concrete use case for merging the two features like that? VRanges constrains its strand for at least 2 reasons: (1) to be less error prone [of course this runs completely counter to flexibility] and (2) simplicity [we don't have to worry about what "-" means for ref/alt, overlap, etc]. On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo <robert.cast...@upf.edu> wrote:one option for me is just to add a metadata column with the strand of the overlapping feature. however, i'm interested to fully understand the rationale behind this aspect of the design of the VRanges object. a VRanges object unrolls variants in a VCF file per alternative allele and sample. variants in VCF files are obtained from tallying reads aligned on a reference genome. so, my understanding is that the reference allele is the allele of the reference genome against which the reads were aligned while the alternate allele(s) are allele calls different from the reference. from this perspective, my interpretation is that ref and alt alleles have already a strand, which is the strand of the reference chromosome against which the reads were aligned to. i'm interested in this interpretation of the strand of the variants because i'm interested in the interpretation of sequence-features containing the reference and the alternate alleles, such as differences in a binding site with the reference and the alternate allele. if we relax the meaning of elements in a VRanges object to, not only variants x allele x sample, but to variants x allele x sample x annotated-feature, then i think it would make sense to have the strand-specific annotation in the strand slot of the VRanges object. while this idea may be good or not for a number of reasons, i'm now mostly interested in knowing whether i'm misinterpreting the design of VRanges objects, and maybe variant calling in general or i'm in a more or less right path in using a VRanges object to hold variant annotations. thanks!!! robert. On 06/11/2015 04:30 AM, Michael Lawrence wrote:I guess it depends on what the strand should mean. Would having a negative strand indicate that the ref/alt should be complemented? I'm not sure it's a good idea to conflate the strand of the variant itself with the strand of an overlapping feature. On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo<robert.cast...@upf.edu> wrote:my understanding was that VRanges is a container for variants and variant annotations and strand is just one annotation more. when we use locateVariants() a variant can be annotated to multiple transcripts where in one overlaps an exon, in another an intron and so on. In all those transcripts annotations there is a strand annotation, the strand of the transcript. if the transcript is annotated in the negative strand of the reference chromosome then the annotation of a transcript region to a variant is going to be also on the negative strand. both locateVariants() and predictCoding() return GRanges objects with strand annotations according to the transcripts being annotated. I thought it made sense in VariantFiltering to use VRanges objects as a container for variants and annotations and, for this reason, I would like to carry on the strand annotation into the VRanges object. Is there a strong reason for a VRanges object, derived from GRanges, not to have strand? On 6/10/15 6:26 PM, Michael Lawrence wrote:VRanges is supposed to enforce strand. The goal is to use "*" always, for simplicity and consistency with the result of readVcf(). Is there a use case for negative strand variants? On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo<robert.cast...@upf.edu> wrote:Michael, regarding our email exchange three weeks ago, I found a couple of places in VariantAnnotation that IMO need to be updated to avoid enforcing strand on VRanges. these places occur when constructing and validating VRanges objects, concretely at: 1. file R/methods-VRanges-class.R at the VRanges class constructor: VRanges<- function(seqnames = Rle(), ranges = IRanges(), ref = character(), alt = NA_character_, totalDepth = NA_integer_, refDepth = NA_integer_, altDepth = NA_integer_, ..., sampleNames = NA_character_, softFilterMatrix = FilterMatrix(matrix(nrow = length(gr), ncol = 0L), FilterRules()), hardFilters = FilterRules()) { gr<- GRanges(seqnames, ranges, strand = .rleRecycleVector("*", length(ranges)), ...) [...] that precludes setting the strand at construction time: library(VariantAnnotation) VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C", strand="-") Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*", length(ranges)), : formal argument "strand" matched by multiple actual arguments 2. R/AllClasses.R at the VRanges class validity function .valid.VRanges(): .valid.VRanges.strand<- function(object) { if (any(object@strand == "-")) paste("'strand' must always be '+' or '*'") } [...] .valid.VRanges<- function(object) { c(.valid.VRanges.ref(object), .valid.VRanges.alt(object), .valid.VRanges.strand(object), .valid.VRanges.depth(object)) } that prompts an error when variants annotated on the negative strand are detected: library(VariantAnnotation) example(VRanges) strand(vr)<- "-" c(vr) Error in validObject(.Object) : invalid class “VRanges” object: 'strand' must always be '+' or '*' cheers, robert. On 05/22/2015 09:49 PM, Michael Lawrence wrote:This changed recently. VariantAnnotation in devel no longer enforces a strand on VRanges, or at least it allows the "*" case. On Fri, May 22, 2015 at 11:33 AM, Robert Castelo<robert.cast...@upf.edu <mailto:robert.cast...@upf.edu>> wrote: Hi, I have encountered myself in a strange situation when using the function locateVariants() from VariantAnnotation with an input VRanges object. The problem is that some of the expected coding annotations are not showing up when using locateVariants() with default parameters. After investigating this situation I think I found the reason, which does not look like a bug but I would like that you give me some clarification about the logic behind using locateVariants() with VRanges objects. The documentation of the VRanges-class says that in this class of objects "The strand is always constrained to be positive (+).". I guess there may be a good reason for this but I could not find it in the documentation or googling about it. This means that when you coerce a CollapsedVCF object (obtained, for example, from a VCF file via readVcf()) to a VRanges object, even though variants in the VCF may have no strand, they get a positive strand in the VRanges object. The problem arises then, when you call locateVariants() with this VRanges object, because features on the negative strand are never going to overlap with the variants since, by default, the argument ignore.strand=FALSE. Let me illustrate this with a toy example. Consider the SNP rs1129038 (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at chr15:28356859 with allels A/G. It is located on the 3' UTR of the gene HERC2 coded on the negative strand of the human reference genome. Let's build a toy VRanges object having this variant: library(VariantAnnotation) vr<- VRanges(seqnames="chr15", ranges=IRanges(28356859, 28356859), ref="A", alt="G", refDepth=5, altDepth=7, totalDepth=12, sampleNames="A") strand(vr) factor-Rle of length 1 with 1 run Lengths: 1 Values : + Levels(3): + - * Let's build now its CollapsedVCF counterpart by using the corresponding coercion method and set the strand to "*": vcf<- asVCF(vr) strand(vcf)<- "*" Now run locateVariants() on both objects with UCSC annotations: library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene locateVariants(vcf, txdb, region=AllVariants()) GRanges object with 2 ranges and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID <Rle> <IRanges> <Rle> |<factor> <integer> <integer> <integer> <character> <IntegerList> [1] chr15 [28356859, 28356859] * | threeUTR 50 50 1 55386 [2] chr15 [28356859, 28356859] * | threeUTR 50 50 1 55387 GENEID PRECEDEID FOLLOWID <character> <CharacterList> <CharacterList> [1] 8924 [2] 8924 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths locateVariants(vr, txdb, region=AllVariants()) GRanges object with 1 range and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID <Rle> <IRanges> <Rle> |<factor> <integer> <integer> <integer> <integer> <IntegerList> [1] chr15 [28356859, 28356859] + | intergenic<NA> <NA> 1<NA> GENEID PRECEDEID FOLLOWID <character> <CharacterList> <CharacterList> [1]<NA> 100132565,100289656,100616223,... 2567 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths Note that while we get the 3' UTR annotation from the strandless VCF object we do not get it from the VRanges object with the positive strand. To make my point clear: this positive strand shows up when you coerce a strandless VCF object to a VRanges one, because positive strandness seems to be the convention for VRanges objects: as(vcf, VRanges) VRanges object with 1 range and 1 metadata column: seqnames ranges strand ref alt totalDepth refDepth altDepth <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle> <integerOrRle> <integerOrRle> [1] chr15 [28356859, 28356859] + A G 12 5 7 sampleNames softFilterMatrix | QUAL <factorOrRle> <matrix> |<numeric> [1] A |<NA> ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths hardFilters: NULL Of course, if I run locateVariants() with the argument ignore.strand=TRUE, then I get the expected annotation: locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE) GRanges object with 2 ranges and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID <Rle> <IRanges> <Rle> |<factor> <integer> <integer> <integer> <character> <IntegerList> [1] chr15 [28356859, 28356859] + | threeUTR 677 677 1 55386 [2] chr15 [28356859, 28356859] + | threeUTR 677 677 1 55387 GENEID PRECEDEID FOLLOWID <character> <CharacterList> <CharacterList> [1] 8924 [2] 8924 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths So, my question is, given that VRanges objects are enforced to have a positive strand, would not be better to have ignore.strand=TRUE as default in locateVariants? Alternatively, I would recommend that locateVariants() issues a warning, or maybe an error, when the input object is VRanges and ignore.strand=FALSE. Finally, out of curiosity, why a VRanges object enforces the positive strand in all its genomic ranges? Would not be better just taking the strand of the CollapsedVCF object when coercing the CollapsedVCF object to VRanges? thanks!! robert. _______________________________________________ Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org> mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel-- Robert Castelo, PhD Associate Professor Dept. of Experimental and Health Sciences Universitat Pompeu Fabra (UPF) Barcelona Biomedical Research Park (PRBB) Dr Aiguader 88 E-08003 Barcelona, Spain telf: +34.933.160.514 fax: +34.933.160.550 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel-- Robert Castelo, PhD Associate Professor Dept. of Experimental and Health Sciences Universitat Pompeu Fabra (UPF) Barcelona Biomedical Research Park (PRBB) Dr Aiguader 88 E-08003 Barcelona, Spain telf: +34.933.160.514 fax: +34.933.160.550_______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: voben...@fredhutch.org Phone: (206) 667-3158
-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: voben...@fredhutch.org Phone: (206) 667-3158 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel