Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-10 Thread Ludwig Geistlinger
Dear Johannes,

one follow-up question/comment on the EnsDb packages:

The reason they escaped my notice (and thus potentially will also others)
is that I expected such packages to be named ^TxDb

What actually argues against sticking to existing Bioc vocabulary and
naming eg EnsDb.Hsapiens.v79

TxDb.Hsapiens.Ensembl.hg38.ensGene

(or alternatively, if packages like BSgenome.Hsapiens.NCBI.GRCh38 will
indeed make it in the long run:  TxDb.Hsapiens.Ensembl.GRCh38.ensGene)

This would also have the advantage that genome build and idType could be
inferred right from the package name.

Best,
Ludwig


 dear Robert and Ludwig,

 the EnsDb packages provide all the gene/transcript etc annotations for all
 genes defined in the Ensembl database (for a given species and Ensembl
 release). Except the column/attribute entrezid that is stored in the
 internal database there is however no link to NCBI or UCSC annotations.
 So, basically, if you want to use pure Ensembl based annotations: use
 EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

 In case you need EnsDbs of other species or Ensembl versions, the
 ensembldb package provides functionality to generate such packages either
 using the Ensembl Perl API or using GTF files provided by Ensembl. If you
 have problems building the packages, just drop me a line and I'll do
 that.

 cheers, jo

 On 03 Jun 2015, at 15:56, Robert M. Flight rfligh...@gmail.com wrote:

 Ludwig,

 If you do this search on the UCSC genome browser (which this annotation
 package is built from), you will see that the longest variant is what
 is
 shown

 http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammalorg=Humandb=hg38position=brca1hgt.positionInput=brca1hgt.suggestTrack=knownGeneSubmit=submithgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAzpix=1885

 If instead of genes you do transcripts, you will see 20 different
 transcripts for this gene, including the one listed by NCBI.

 I havent tried it yet (haven't upgraded R or bioconductor to latest
 version), but there is now an Ensembl based annotation package as well,
 that may work better??
 http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

 -Robert



 On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger 
 ludwig.geistlin...@bio.ifi.lmu.de wrote:

 Dear Bioc annotation team,

 Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
 for

 BRCA1; ENSG0012048; entrez:672

 via

 genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id=672))

 gives me:

 GRanges object with 1 range and 1 metadata column:
  seqnames   ranges strand | gene_id
 RleIRanges  Rle | character
  672chr17 [43044295, 43170403]  - | 672
  ---
  seqinfo: 455 sequences (1 circular) from hg38 genome


 However, querying Ensembl and NCBI Gene
 http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0012048
 http://www.ncbi.nlm.nih.gov/gene/672

 the gene is located at (note the difference in the end position)

 Chromosome 17: 43,044,295-43,125,483 reverse strand


 How is the inconsistency explained and how to extract an ENSEMBL/NCBI
 conform annotation from the TxDb object?
 (I am aware of biomaRt, but I want to explicitely use the Bioc
 annotation
 functionality).

 Thanks!
 Ludwig


 --
 Dipl.-Bioinf. Ludwig Geistlinger

 Lehr- und Forschungseinheit für Bioinformatik
 Institut für Informatik
 Ludwig-Maximilians-Universität München
 Amalienstrasse 17, 2. Stock, Büro A201
 80333 München

 Tel.: 089-2180-4067
 eMail: ludwig.geistlin...@bio.ifi.lmu.de

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel


  [[alternative HTML version deleted]]

 ___
 Bioc-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-10 Thread Robert Castelo

Michael,

regarding our email exchange three weeks ago, I found a couple of places 
in VariantAnnotation that IMO need to be updated to avoid enforcing 
strand on VRanges.


these places occur when constructing and validating VRanges objects, 
concretely at:


1. file R/methods-VRanges-class.R at the VRanges class constructor:

VRanges -
  function(seqnames = Rle(), ranges = IRanges(),
   ref = character(), alt = NA_character_,
   totalDepth = NA_integer_, refDepth = NA_integer_,
   altDepth = NA_integer_, ..., sampleNames = NA_character_,
   softFilterMatrix = FilterMatrix(matrix(nrow = length(gr), 
ncol = 0L),

 FilterRules()),
   hardFilters = FilterRules())
{
  gr - GRanges(seqnames, ranges,
strand = .rleRecycleVector(*, length(ranges)), ...)
[...]

that precludes setting the strand at construction time:

library(VariantAnnotation)
VRanges(seqnames=chr1, ranges=IRanges(1, 5), ref=T, alt=C, strand=-)
Error in GRanges(seqnames, ranges, strand = .rleRecycleVector(*, 
length(ranges)),  :

  formal argument strand matched by multiple actual arguments


2. R/AllClasses.R at the VRanges class validity function .valid.VRanges():

.valid.VRanges.strand - function(object) {
  if (any(object@strand == -))
paste('strand' must always be '+' or '*')
}

[...]

.valid.VRanges - function(object)
{
  c(.valid.VRanges.ref(object),
.valid.VRanges.alt(object),
.valid.VRanges.strand(object),
.valid.VRanges.depth(object))
}

that prompts an error when variants annotated on the negative strand are 
detected:


library(VariantAnnotation)
example(VRanges)
strand(vr) - -
c(vr)
Error in validObject(.Object) :
  invalid class “VRanges” object: 'strand' must always be '+' or '*'


cheers,

robert.

On 05/22/2015 09:49 PM, Michael Lawrence wrote:

This changed recently. VariantAnnotation in devel no longer enforces a
strand on VRanges, or at least it allows the * case.


On Fri, May 22, 2015 at 11:33 AM, Robert Castelo robert.cast...@upf.edu
mailto:robert.cast...@upf.edu wrote:

Hi,

I have encountered myself in a strange situation when using the
function locateVariants() from VariantAnnotation with an input
VRanges object. The problem is that some of the expected coding
annotations are not showing up when using locateVariants() with
default parameters.

After investigating this situation I think I found the reason, which
does not look like a bug but I would like that you give me some
clarification about the logic behind using locateVariants() with
VRanges objects.

The documentation of the VRanges-class says that in this class of
objects The strand is always constrained to be positive (+).. I
guess there may be a good reason for this but I could not find it in
the documentation or googling about it.

This means that when you coerce a CollapsedVCF object (obtained, for
example, from a VCF file via readVcf()) to a VRanges object, even
though variants in the VCF may have no strand, they get a positive
strand in the VRanges object.

The problem arises then, when you call locateVariants() with this
VRanges object, because features on the negative strand are never
going to overlap with the variants since, by default, the argument
ignore.strand=FALSE.

Let me illustrate this with a toy example. Consider the SNP
rs1129038
(http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at
chr15:28356859 with allels A/G. It is located on the 3' UTR of the
gene HERC2 coded on the negative strand of the human reference
genome. Let's build a toy VRanges object having this variant:

library(VariantAnnotation)
vr - VRanges(seqnames=chr15,
   ranges=IRanges(28356859, 28356859),
   ref=A, alt=G,
   refDepth=5, altDepth=7,
   totalDepth=12, sampleNames=A)
strand(vr)
factor-Rle of length 1 with 1 run
   Lengths: 1
   Values : +
Levels(3): + - *

Let's build now its CollapsedVCF counterpart by using the
corresponding coercion method and set the strand to *:

vcf - asVCF(vr)
strand(vcf) - *

Now run locateVariants() on both objects with UCSC annotations:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb - TxDb.Hsapiens.UCSC.hg19.knownGene

locateVariants(vcf, txdb, region=AllVariants())
GRanges object with 2 ranges and 9 metadata columns:
   seqnames   ranges strand | LOCATION  LOCSTART
LOCEND   QUERYIDTXID CDSID
Rle IRanges Rle | factor integer integer integer
character IntegerList
   [1]chr15 [28356859, 28356859]  * | threeUTR50 50
1   55386
   [2]chr15 [28356859, 28356859]  * | threeUTR50 50
1   55387
GENEID   PRECEDEIDFOLLOWID
character CharacterList CharacterList
   [1] 

Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-10 Thread Rainer Johannes
update:

seems something is strange with the index... when I load the ensembl-77 DNA for 
mouse I get two files, the fasta file and the index and I can extract the 
sequences, while in the example below I just got the fasta file

On 10 Jun 2015, at 10:11, Rainer Johannes 
johannes.rai...@eurac.edumailto:johannes.rai...@eurac.edu wrote:

Dear Martin,

the AnnotationHub approach looks awesome! However, somehow it does not work for 
me, I always get an error:

library(AnnotationHub)
library(Rsamtools)
library(GenomicFeatures)
ah - AnnotationHub()

## somehow I don't see DNA sequences for release-80... thus using 75
query(ah, c(Takifugu, release-75))

## load the gtf and the dna.toplevel.fa
gtf - ah[[AH10717]]
dna - ah[[AH20637]]

## create txdb and get exons:
txdb - makeTxDbFromGRanges(gtf)
exons - exons(txdb)

## getting the sequences
getSeq(dna, exons)

and I get the following error:
Error in value[[3L]](cond) : 'open' index failed
  file: /Users/jo/~/.AnnotationHub/24732
In addition: Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  [fai_load] fail to open FASTA index.

## creating the index
indexFa(dna)

## trying again:
getSeq(dna, exons)
Error in value[[3L]](cond) :  record 1 (MT:1-68) was truncated
  file: /Users/jo/~/.AnnotationHub/24732

This one is from the current R-devel, but I get the same error with the stable 
version... any idea?

cheers, jo


my session info:
 sessionInfo()
R Under development (unstable) (2015-06-06 r68485)
Platform: x86_64-apple-darwin14.4.0/x86_64 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4stats graphics  grDevices utils datasets
[8] methods   base

other attached packages:
 [1] GenomicFeatures_1.20.1 AnnotationDbi_1.30.1   Biobase_2.28.0
 [4] Rsamtools_1.20.4   Biostrings_2.36.1  XVector_0.8.0
 [7] GenomicRanges_1.20.4   GenomeInfoDb_1.4.0 IRanges_2.2.3
[10] S4Vectors_0.6.0BiocGenerics_0.14.0AnnotationHub_2.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6  magrittr_1.5
 [3] GenomicAlignments_1.4.1  zlibbioc_1.14.0
 [5] BiocParallel_1.2.2   xtable_1.7-4
 [7] R6_2.0.1 stringr_1.0.0
 [9] httr_0.6.1   tools_3.3.0
[11] DBI_0.3.1lambda.r_1.1.7
[13] futile.logger_1.4.1  htmltools_0.2.6
[15] digest_0.6.8 interactiveDisplayBase_1.6.0
[17] shiny_0.12.0 rtracklayer_1.28.4
[19] futile.options_1.0.0 bitops_1.0-6
[21] biomaRt_2.24.0   RCurl_1.95-4.6
[23] RSQLite_1.0.0mime_0.3
[25] stringi_0.4-1BiocInstaller_1.18.2
[27] XML_3.98-1.2 httpuv_1.3.2


On 09 Jun 2015, at 16:25, Martin Morgan 
mtmor...@fredhutch.orgmailto:mtmor...@fredhutch.org wrote:

On 06/08/2015 11:43 PM, Rainer Johannes wrote:
dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute entrezid that is stored in the
internal database there is however no link to NCBI or UCSC annotations. So,
basically, if you want to use pure Ensembl based annotations: use EnsDb, if
you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the ensembldb
package provides functionality to generate such packages either using the
Ensembl Perl API or using GTF files provided by Ensembl. If you have problems
building the packages, just drop me a line and I'll do that.

Two other sources of Ensembl TxDb's are GenomicFeatures::makeTxDbFromBiomart() 
and AnnotationHub. For the latter, I'll add a variant of the following to the 
AnnotationHub HOWTO vignette 
http://bioconductor.org/packages/devel/bioc/html/AnnotationHub.html later today.

## Gene models

_Bioconductor_ represents gene models using 'transcript'
databases. These are available via packages such as
[TxDb.Hsapiens.UCSC.hg38.knownGene](http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.knownGene.html),
or can be constructed using functions such as
`[GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures.html)::makeTxDbFromBiomart()`
 or `GenomicFeatures::makeTxDbFromGRanges()`.

_AnnotationHub_ provides an easy way to work with gene models
published by Ensembl. Here we discover the Ensemble release 80 r
esources for pufferfish,_Takifugu rubripes_

```{r takifugu-gene-models}
query(ah, c(Takifugu, release-80))
```

We see that there is a GTF file, as well as various DNA
sequences. Let's retrieve the GTF and top-level sequence files. The
GTF file is imported as a _GRanges_ instance, the DNA sequence as a
compressed, indexed Fasta file


```{r takifugi-data}
gtf - ah[[AH47101]]
dna - ah[[AH47477]]

head(gtf, 3)
dna
head(seqlevels(dna))
```

It is trivial to make a TxDb instance


Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-10 Thread Robert Castelo
my understanding was that VRanges is a container for variants and 
variant annotations and strand is just one annotation more. when we use 
locateVariants() a variant can be annotated to multiple transcripts 
where in one overlaps an exon, in another an intron and so on. In all 
those transcripts annotations there is a strand annotation, the strand 
of the transcript. if the transcript is annotated in the negative strand 
of the reference chromosome then the annotation of a transcript region 
to a variant is going to be also on the negative strand.


both locateVariants() and predictCoding() return GRanges objects with 
strand annotations according to the transcripts being annotated. I 
thought it made sense in VariantFiltering to use VRanges objects as a  
container for variants and annotations and, for this reason, I would 
like to carry on the strand annotation into the VRanges object. Is there 
a strong reason for a VRanges object, derived from GRanges, not to have 
strand?


On 6/10/15 6:26 PM, Michael Lawrence wrote:

VRanges is supposed to enforce strand. The goal is to use * always,
for simplicity and consistency with the result of readVcf(). Is there
a use case for negative strand variants?

On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo robert.cast...@upf.edu wrote:

Michael,

regarding our email exchange three weeks ago, I found a couple of places in
VariantAnnotation that IMO need to be updated to avoid enforcing strand on
VRanges.

these places occur when constructing and validating VRanges objects,
concretely at:

1. file R/methods-VRanges-class.R at the VRanges class constructor:

VRanges -
   function(seqnames = Rle(), ranges = IRanges(),
ref = character(), alt = NA_character_,
totalDepth = NA_integer_, refDepth = NA_integer_,
altDepth = NA_integer_, ..., sampleNames = NA_character_,
softFilterMatrix = FilterMatrix(matrix(nrow = length(gr), ncol =
0L),
  FilterRules()),
hardFilters = FilterRules())
{
   gr - GRanges(seqnames, ranges,
 strand = .rleRecycleVector(*, length(ranges)), ...)
[...]

that precludes setting the strand at construction time:

library(VariantAnnotation)
VRanges(seqnames=chr1, ranges=IRanges(1, 5), ref=T, alt=C, strand=-)
Error in GRanges(seqnames, ranges, strand = .rleRecycleVector(*,
length(ranges)),  :
   formal argument strand matched by multiple actual arguments


2. R/AllClasses.R at the VRanges class validity function .valid.VRanges():

.valid.VRanges.strand - function(object) {
   if (any(object@strand == -))
 paste('strand' must always be '+' or '*')
}

[...]

.valid.VRanges - function(object)
{
   c(.valid.VRanges.ref(object),
 .valid.VRanges.alt(object),
 .valid.VRanges.strand(object),
 .valid.VRanges.depth(object))
}

that prompts an error when variants annotated on the negative strand are
detected:

library(VariantAnnotation)
example(VRanges)
strand(vr) - -
c(vr)
Error in validObject(.Object) :
   invalid class “VRanges” object: 'strand' must always be '+' or '*'


cheers,

robert.

On 05/22/2015 09:49 PM, Michael Lawrence wrote:

This changed recently. VariantAnnotation in devel no longer enforces a
strand on VRanges, or at least it allows the * case.


On Fri, May 22, 2015 at 11:33 AM, Robert Castelo robert.cast...@upf.edu
mailto:robert.cast...@upf.edu wrote:

 Hi,

 I have encountered myself in a strange situation when using the
 function locateVariants() from VariantAnnotation with an input
 VRanges object. The problem is that some of the expected coding
 annotations are not showing up when using locateVariants() with
 default parameters.

 After investigating this situation I think I found the reason, which
 does not look like a bug but I would like that you give me some
 clarification about the logic behind using locateVariants() with
 VRanges objects.

 The documentation of the VRanges-class says that in this class of
 objects The strand is always constrained to be positive (+).. I
 guess there may be a good reason for this but I could not find it in
 the documentation or googling about it.

 This means that when you coerce a CollapsedVCF object (obtained, for
 example, from a VCF file via readVcf()) to a VRanges object, even
 though variants in the VCF may have no strand, they get a positive
 strand in the VRanges object.

 The problem arises then, when you call locateVariants() with this
 VRanges object, because features on the negative strand are never
 going to overlap with the variants since, by default, the argument
 ignore.strand=FALSE.

 Let me illustrate this with a toy example. Consider the SNP
 rs1129038
 (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at
 chr15:28356859 with allels A/G. It is located on the 3' UTR of the
 gene HERC2 coded on the negative strand of the human reference
 genome. Let's build a 

[Bioc-devel] Cairo error on Mac OS X Snow Leopard (10.6.8) / x86_64

2015-06-10 Thread Leonardo Collado Torres
Hi,

My package regionReport just reported errors on both release and devel
when running R CMD check on Mac OS X Snow Leopard (10.6.8) / x86_64.

The error is:

Error: .onLoad failed in loadNamespace() for 'Cairo', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object
'/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Cairo/libs/Cairo.so':
  
dlopen(/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Cairo/libs/Cairo.so,
6): Symbol not found: _FcConfigGetCurrent
  Referenced from:
/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Cairo/libs/Cairo.so
  Expected in: flat namespace
 in 
/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Cairo/libs/Cairo.so
Execution halted

I'm guessing that there's not much I can do, beyond sending this
email. But let me know if I can help in some way.

Thanks,
Leo

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-10 Thread Michael Lawrence
VRanges is supposed to enforce strand. The goal is to use * always,
for simplicity and consistency with the result of readVcf(). Is there
a use case for negative strand variants?

On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo robert.cast...@upf.edu wrote:
 Michael,

 regarding our email exchange three weeks ago, I found a couple of places in
 VariantAnnotation that IMO need to be updated to avoid enforcing strand on
 VRanges.

 these places occur when constructing and validating VRanges objects,
 concretely at:

 1. file R/methods-VRanges-class.R at the VRanges class constructor:

 VRanges -
   function(seqnames = Rle(), ranges = IRanges(),
ref = character(), alt = NA_character_,
totalDepth = NA_integer_, refDepth = NA_integer_,
altDepth = NA_integer_, ..., sampleNames = NA_character_,
softFilterMatrix = FilterMatrix(matrix(nrow = length(gr), ncol =
 0L),
  FilterRules()),
hardFilters = FilterRules())
 {
   gr - GRanges(seqnames, ranges,
 strand = .rleRecycleVector(*, length(ranges)), ...)
 [...]

 that precludes setting the strand at construction time:

 library(VariantAnnotation)
 VRanges(seqnames=chr1, ranges=IRanges(1, 5), ref=T, alt=C, strand=-)
 Error in GRanges(seqnames, ranges, strand = .rleRecycleVector(*,
 length(ranges)),  :
   formal argument strand matched by multiple actual arguments


 2. R/AllClasses.R at the VRanges class validity function .valid.VRanges():

 .valid.VRanges.strand - function(object) {
   if (any(object@strand == -))
 paste('strand' must always be '+' or '*')
 }

 [...]

 .valid.VRanges - function(object)
 {
   c(.valid.VRanges.ref(object),
 .valid.VRanges.alt(object),
 .valid.VRanges.strand(object),
 .valid.VRanges.depth(object))
 }

 that prompts an error when variants annotated on the negative strand are
 detected:

 library(VariantAnnotation)
 example(VRanges)
 strand(vr) - -
 c(vr)
 Error in validObject(.Object) :
   invalid class “VRanges” object: 'strand' must always be '+' or '*'


 cheers,

 robert.

 On 05/22/2015 09:49 PM, Michael Lawrence wrote:

 This changed recently. VariantAnnotation in devel no longer enforces a
 strand on VRanges, or at least it allows the * case.


 On Fri, May 22, 2015 at 11:33 AM, Robert Castelo robert.cast...@upf.edu
 mailto:robert.cast...@upf.edu wrote:

 Hi,

 I have encountered myself in a strange situation when using the
 function locateVariants() from VariantAnnotation with an input
 VRanges object. The problem is that some of the expected coding
 annotations are not showing up when using locateVariants() with
 default parameters.

 After investigating this situation I think I found the reason, which
 does not look like a bug but I would like that you give me some
 clarification about the logic behind using locateVariants() with
 VRanges objects.

 The documentation of the VRanges-class says that in this class of
 objects The strand is always constrained to be positive (+).. I
 guess there may be a good reason for this but I could not find it in
 the documentation or googling about it.

 This means that when you coerce a CollapsedVCF object (obtained, for
 example, from a VCF file via readVcf()) to a VRanges object, even
 though variants in the VCF may have no strand, they get a positive
 strand in the VRanges object.

 The problem arises then, when you call locateVariants() with this
 VRanges object, because features on the negative strand are never
 going to overlap with the variants since, by default, the argument
 ignore.strand=FALSE.

 Let me illustrate this with a toy example. Consider the SNP
 rs1129038
 (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at
 chr15:28356859 with allels A/G. It is located on the 3' UTR of the
 gene HERC2 coded on the negative strand of the human reference
 genome. Let's build a toy VRanges object having this variant:

 library(VariantAnnotation)
 vr - VRanges(seqnames=chr15,
ranges=IRanges(28356859, 28356859),
ref=A, alt=G,
refDepth=5, altDepth=7,
totalDepth=12, sampleNames=A)
 strand(vr)
 factor-Rle of length 1 with 1 run
Lengths: 1
Values : +
 Levels(3): + - *

 Let's build now its CollapsedVCF counterpart by using the
 corresponding coercion method and set the strand to *:

 vcf - asVCF(vr)
 strand(vcf) - *

 Now run locateVariants() on both objects with UCSC annotations:

 library(TxDb.Hsapiens.UCSC.hg19.knownGene)
 txdb - TxDb.Hsapiens.UCSC.hg19.knownGene

 locateVariants(vcf, txdb, region=AllVariants())
 GRanges object with 2 ranges and 9 metadata columns:
seqnames   ranges strand | LOCATION  LOCSTART
 LOCEND   QUERYIDTXID CDSID
 Rle IRanges Rle | factor 

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-10 Thread Michael Lawrence
I guess it depends on what the strand should mean. Would having a
negative strand indicate that the ref/alt should be complemented? I'm
not sure it's a good idea to conflate the strand of the variant itself
with the strand of an overlapping feature.

On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo robert.cast...@upf.edu wrote:
 my understanding was that VRanges is a container for variants and variant
 annotations and strand is just one annotation more. when we use
 locateVariants() a variant can be annotated to multiple transcripts where in
 one overlaps an exon, in another an intron and so on. In all those
 transcripts annotations there is a strand annotation, the strand of the
 transcript. if the transcript is annotated in the negative strand of the
 reference chromosome then the annotation of a transcript region to a variant
 is going to be also on the negative strand.

 both locateVariants() and predictCoding() return GRanges objects with strand
 annotations according to the transcripts being annotated. I thought it made
 sense in VariantFiltering to use VRanges objects as a  container for
 variants and annotations and, for this reason, I would like to carry on the
 strand annotation into the VRanges object. Is there a strong reason for a
 VRanges object, derived from GRanges, not to have strand?


 On 6/10/15 6:26 PM, Michael Lawrence wrote:

 VRanges is supposed to enforce strand. The goal is to use * always,
 for simplicity and consistency with the result of readVcf(). Is there
 a use case for negative strand variants?

 On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo robert.cast...@upf.edu
 wrote:

 Michael,

 regarding our email exchange three weeks ago, I found a couple of places
 in
 VariantAnnotation that IMO need to be updated to avoid enforcing strand
 on
 VRanges.

 these places occur when constructing and validating VRanges objects,
 concretely at:

 1. file R/methods-VRanges-class.R at the VRanges class constructor:

 VRanges -
function(seqnames = Rle(), ranges = IRanges(),
 ref = character(), alt = NA_character_,
 totalDepth = NA_integer_, refDepth = NA_integer_,
 altDepth = NA_integer_, ..., sampleNames = NA_character_,
 softFilterMatrix = FilterMatrix(matrix(nrow = length(gr),
 ncol =
 0L),
   FilterRules()),
 hardFilters = FilterRules())
 {
gr - GRanges(seqnames, ranges,
  strand = .rleRecycleVector(*, length(ranges)), ...)
 [...]

 that precludes setting the strand at construction time:

 library(VariantAnnotation)
 VRanges(seqnames=chr1, ranges=IRanges(1, 5), ref=T, alt=C,
 strand=-)
 Error in GRanges(seqnames, ranges, strand = .rleRecycleVector(*,
 length(ranges)),  :
formal argument strand matched by multiple actual arguments


 2. R/AllClasses.R at the VRanges class validity function
 .valid.VRanges():

 .valid.VRanges.strand - function(object) {
if (any(object@strand == -))
  paste('strand' must always be '+' or '*')
 }

 [...]

 .valid.VRanges - function(object)
 {
c(.valid.VRanges.ref(object),
  .valid.VRanges.alt(object),
  .valid.VRanges.strand(object),
  .valid.VRanges.depth(object))
 }

 that prompts an error when variants annotated on the negative strand are
 detected:

 library(VariantAnnotation)
 example(VRanges)
 strand(vr) - -
 c(vr)
 Error in validObject(.Object) :
invalid class “VRanges” object: 'strand' must always be '+' or '*'


 cheers,

 robert.

 On 05/22/2015 09:49 PM, Michael Lawrence wrote:

 This changed recently. VariantAnnotation in devel no longer enforces a
 strand on VRanges, or at least it allows the * case.


 On Fri, May 22, 2015 at 11:33 AM, Robert Castelo robert.cast...@upf.edu
 mailto:robert.cast...@upf.edu wrote:

  Hi,

  I have encountered myself in a strange situation when using the
  function locateVariants() from VariantAnnotation with an input
  VRanges object. The problem is that some of the expected coding
  annotations are not showing up when using locateVariants() with
  default parameters.

  After investigating this situation I think I found the reason,
 which
  does not look like a bug but I would like that you give me some
  clarification about the logic behind using locateVariants() with
  VRanges objects.

  The documentation of the VRanges-class says that in this class of
  objects The strand is always constrained to be positive (+).. I
  guess there may be a good reason for this but I could not find it
 in
  the documentation or googling about it.

  This means that when you coerce a CollapsedVCF object (obtained,
 for
  example, from a VCF file via readVcf()) to a VRanges object, even
  though variants in the VCF may have no strand, they get a positive
  strand in the VRanges object.

  The problem arises then, when you call locateVariants() with this
  VRanges object, because features on the negative strand are never