Re: [Bioc-devel] strandless introns with VariantAnnotation::locateVariants()

2015-06-11 Thread Robert Castelo

hi Valerie,

thanks for the fix, it seems that it required a more complex update than 
i initially thought looking at the code. actually, i found out that this 
update is breaking with the sample VCF file from the VariantFiltering 
package. i guess this requires some further fix in VariantAnnotation. 
The problem shows up in both, devel and release, please find at the 
bottom both sessionInfo() outputs.


thanks!!
robert.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

fl <- system.file("extdata", "CEUtrio.vcf.bgz", package="VariantFiltering")

vcf <- readVcf(fl, genome="hg19")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)

## this is necessary to avoid the chrM length glitch between
## b37 (GATK bundle 2.8 reference genome) and hg19 versions
commonlev <- intersect(seqlevels(vcf), seqlevels(txdb))
commonlev <- commonlev[seqlengths(vcf)[commonlev] == 
seqlengths(txdb)[commonlev]]

vcf <- keepSeqlevels(vcf, commonlev)

loc_intron <- locateVariants(vcf, txdb, region=IntronVariants())
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
  subscript contains NAs or out-of-bounds indices
traceback()
27: stop("subscript contains NAs or out-of-bounds indices")
26: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
25: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
24: normalizeSingleBracketSubscript(i, x, as.NSBS = TRUE)
23: extractROWS(x, i)
22: extractROWS(x, i)
21: ss[txHits]
20: ss[txHits]
19: is(strand, "Rle")
18: newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = 
strand,

mcols = DataFrame(...), seqlengths = seqlengths, seqinfo = seqinfo)
17: GRanges(seqnames = seqnames(query)[xHits], ranges = 
IRanges(ranges(query)[xHits]),

strand = ss[txHits], LOCATION = .location(length(xHits),
vtype), LOCSTART = start(map), LOCEND = end(map), QUERYID = 
xHits,
TXID = txid, CDSID = cdsid, GENEID = NA_character_, PRECEDEID = 
CharacterList(character(0)),

FOLLOWID = CharacterList(character(0)))
16: .makeResult(query, subject, "intron", ignore.strand = ignore.strand,
asHits = asHits)
15: .local(query, subject, region, ...)
14: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand 
= ignore.strand,

asHits = asHits)
13: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand 
= ignore.strand,

asHits = asHits)
12: eval(expr, envir, enclos)
11: eval(call, parent.frame())
10: callGeneric(query, cache[["intbytx"]], region, ..., ignore.strand = 
ignore.strand,

asHits = asHits)
9: .local(query, subject, region, ...)
8: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
   ignore.strand = ignore.strand, asHits = asHits)
7: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
   ignore.strand = ignore.strand, asHits = asHits)
6: eval(expr, envir, enclos)
5: eval(call, parent.frame())
4: callGeneric(rowRanges(query), subject, region, ..., cache = cache,
   ignore.strand = ignore.strand, asHits = asHits)
3: .local(query, subject, region, ...)
2: locateVariants(vcf, txdb, region = IntronVariants())
1: locateVariants(vcf, txdb, region = IntronVariants())
R Under development (unstable) (2015-04-28 r68268)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora release 12 (Constantine)

locale:
 [1] LC_CTYPE=en_US.UTF8   LC_NUMERIC=C 
LC_TIME=en_US.UTF8
 [4] LC_COLLATE=en_US.UTF8 LC_MONETARY=en_US.UTF8 
LC_MESSAGES=en_US.UTF8
 [7] LC_PAPER=en_US.UTF8   LC_NAME=C LC_ADDRESS=C 

[10] LC_TELEPHONE=CLC_MEASUREMENT=en_US.UTF8 
LC_IDENTIFICATION=C


attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets 
 methods   base


other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3 GenomicFeatures_1.21.12 

 [3] AnnotationDbi_1.31.14   VariantAnnotation_1.15.11 

 [5] Rsamtools_1.21.8Biostrings_2.37.2 

 [7] XVector_0.9.1   SummarizedExperiment_0.1.5 

 [9] Biobase_2.29.1  GenomicRanges_1.21.14 

[11] GenomeInfoDb_1.5.7  IRanges_2.3.10 

[13] S4Vectors_0.7.3 BiocGenerics_0.15.2 

[15] vimcom_1.0-0setwidth_1.0-3 


[17] colorout_1.0-3

loaded via a namespace (and not attached):
 [1] GenomicAlignments_1.5.9 zlibbioc_1.15.0 
BiocParallel_1.3.22 BSgenome_1.37.1
 [5] tools_3.3.0 DBI_0.3.1   lambda.r_1.1.7 
  futile.logger_1.4.1
 [9] rtracklayer_1.29.7  futile.options_1.0.0bitops_1.0-6 
  RCurl_1.95-4.6

[13] biomaRt_2.25.1  RSQLite_1.0.0   XML_3.98-1.2


*** SESSION INFORMATION FOR RELEASE **

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora release 12 (Constantine)

locale:
 [1] LC_CTYPE=en

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Robert Castelo
one option for me is just to add a metadata column with the strand of 
the overlapping feature. however, i'm interested to fully understand the 
rationale behind this aspect of the design of the VRanges object.


a VRanges object unrolls variants in a VCF file per alternative allele 
and sample. variants in VCF files are obtained from tallying reads 
aligned on a reference genome. so, my understanding is that the 
reference allele is the allele of the reference genome against which the 
reads were aligned while the alternate allele(s) are allele calls 
different from the reference. from this perspective, my interpretation 
is that ref and alt alleles have already a strand, which is the strand 
of the reference chromosome against which the reads were aligned to. i'm 
interested in this interpretation of the strand of the variants because 
i'm interested in the interpretation of sequence-features containing the 
reference and the alternate alleles, such as differences in a binding 
site with the reference and the alternate allele.


if we relax the meaning of elements in a VRanges object to, not only 
variants x allele x sample, but to variants x allele x sample x 
annotated-feature, then i think it would make sense to have the 
strand-specific annotation in the strand slot of the VRanges object.


while this idea may be good or not for a number of reasons, i'm now 
mostly interested in knowing whether i'm misinterpreting the design of 
VRanges objects, and maybe variant calling in general or i'm in a more 
or less right path in using a VRanges object to hold variant annotations.



thanks!!!

robert.

On 06/11/2015 04:30 AM, Michael Lawrence wrote:

I guess it depends on what the strand should mean. Would having a
negative strand indicate that the ref/alt should be complemented? I'm
not sure it's a good idea to conflate the strand of the variant itself
with the strand of an overlapping feature.

On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo  wrote:

my understanding was that VRanges is a container for variants and variant
annotations and strand is just one annotation more. when we use
locateVariants() a variant can be annotated to multiple transcripts where in
one overlaps an exon, in another an intron and so on. In all those
transcripts annotations there is a strand annotation, the strand of the
transcript. if the transcript is annotated in the negative strand of the
reference chromosome then the annotation of a transcript region to a variant
is going to be also on the negative strand.

both locateVariants() and predictCoding() return GRanges objects with strand
annotations according to the transcripts being annotated. I thought it made
sense in VariantFiltering to use VRanges objects as a  container for
variants and annotations and, for this reason, I would like to carry on the
strand annotation into the VRanges object. Is there a strong reason for a
VRanges object, derived from GRanges, not to have strand?


On 6/10/15 6:26 PM, Michael Lawrence wrote:


VRanges is supposed to enforce strand. The goal is to use "*" always,
for simplicity and consistency with the result of readVcf(). Is there
a use case for negative strand variants?

On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo
wrote:


Michael,

regarding our email exchange three weeks ago, I found a couple of places
in
VariantAnnotation that IMO need to be updated to avoid enforcing strand
on
VRanges.

these places occur when constructing and validating VRanges objects,
concretely at:

1. file R/methods-VRanges-class.R at the VRanges class constructor:

VRanges<-
function(seqnames = Rle(), ranges = IRanges(),
 ref = character(), alt = NA_character_,
 totalDepth = NA_integer_, refDepth = NA_integer_,
 altDepth = NA_integer_, ..., sampleNames = NA_character_,
 softFilterMatrix = FilterMatrix(matrix(nrow = length(gr),
ncol =
0L),
   FilterRules()),
 hardFilters = FilterRules())
{
gr<- GRanges(seqnames, ranges,
  strand = .rleRecycleVector("*", length(ranges)), ...)
[...]

that precludes setting the strand at construction time:

library(VariantAnnotation)
VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C",
strand="-")
Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*",
length(ranges)),  :
formal argument "strand" matched by multiple actual arguments


2. R/AllClasses.R at the VRanges class validity function
.valid.VRanges():

.valid.VRanges.strand<- function(object) {
if (any(object@strand == "-"))
  paste("'strand' must always be '+' or '*'")
}

[...]

.valid.VRanges<- function(object)
{
c(.valid.VRanges.ref(object),
  .valid.VRanges.alt(object),
  .valid.VRanges.strand(object),
  .valid.VRanges.depth(object))
}

that prompts an error when variants annotated on the negative strand are
detected:

library(VariantAnnotation)
example(VRanges)
strand(vr)<- "-"
c(vr)
Error in validObject(.Object) :
invali

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Michael Lawrence
The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].

On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo  wrote:
> one option for me is just to add a metadata column with the strand of the
> overlapping feature. however, i'm interested to fully understand the
> rationale behind this aspect of the design of the VRanges object.
>
> a VRanges object unrolls variants in a VCF file per alternative allele and
> sample. variants in VCF files are obtained from tallying reads aligned on a
> reference genome. so, my understanding is that the reference allele is the
> allele of the reference genome against which the reads were aligned while
> the alternate allele(s) are allele calls different from the reference. from
> this perspective, my interpretation is that ref and alt alleles have already
> a strand, which is the strand of the reference chromosome against which the
> reads were aligned to. i'm interested in this interpretation of the strand
> of the variants because i'm interested in the interpretation of
> sequence-features containing the reference and the alternate alleles, such
> as differences in a binding site with the reference and the alternate
> allele.
>
> if we relax the meaning of elements in a VRanges object to, not only
> variants x allele x sample, but to variants x allele x sample x
> annotated-feature, then i think it would make sense to have the
> strand-specific annotation in the strand slot of the VRanges object.
>
> while this idea may be good or not for a number of reasons, i'm now mostly
> interested in knowing whether i'm misinterpreting the design of VRanges
> objects, and maybe variant calling in general or i'm in a more or less right
> path in using a VRanges object to hold variant annotations.
>
>
> thanks!!!
>
> robert.
>
>
> On 06/11/2015 04:30 AM, Michael Lawrence wrote:
>>
>> I guess it depends on what the strand should mean. Would having a
>> negative strand indicate that the ref/alt should be complemented? I'm
>> not sure it's a good idea to conflate the strand of the variant itself
>> with the strand of an overlapping feature.
>>
>> On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo
>> wrote:
>>>
>>> my understanding was that VRanges is a container for variants and variant
>>> annotations and strand is just one annotation more. when we use
>>> locateVariants() a variant can be annotated to multiple transcripts where
>>> in
>>> one overlaps an exon, in another an intron and so on. In all those
>>> transcripts annotations there is a strand annotation, the strand of the
>>> transcript. if the transcript is annotated in the negative strand of the
>>> reference chromosome then the annotation of a transcript region to a
>>> variant
>>> is going to be also on the negative strand.
>>>
>>> both locateVariants() and predictCoding() return GRanges objects with
>>> strand
>>> annotations according to the transcripts being annotated. I thought it
>>> made
>>> sense in VariantFiltering to use VRanges objects as a  container for
>>> variants and annotations and, for this reason, I would like to carry on
>>> the
>>> strand annotation into the VRanges object. Is there a strong reason for a
>>> VRanges object, derived from GRanges, not to have strand?
>>>
>>>
>>> On 6/10/15 6:26 PM, Michael Lawrence wrote:


 VRanges is supposed to enforce strand. The goal is to use "*" always,
 for simplicity and consistency with the result of readVcf(). Is there
 a use case for negative strand variants?

 On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo
 wrote:
>
>
> Michael,
>
> regarding our email exchange three weeks ago, I found a couple of
> places
> in
> VariantAnnotation that IMO need to be updated to avoid enforcing strand
> on
> VRanges.
>
> these places occur when constructing and validating VRanges objects,
> concretely at:
>
> 1. file R/methods-VRanges-class.R at the VRanges class constructor:
>
> VRanges<-
> function(seqnames = Rle(), ranges = IRanges(),
>  ref = character(), alt = NA_character_,
>  totalDepth = NA_integer_, refDepth = NA_integer_,
>  altDepth = NA_integer_, ..., sampleNames = NA_character_,
>  softFilterMatrix = FilterMatrix(matrix(nrow = length(gr),
> ncol =
> 0L),
>FilterRules()),
>  hardFilters = FilterRules())
> {
> gr<- GRanges(seqnames, ranges,
>   strand = .rleRecycleVector("*", length(ranges)), ...)
> [...]
>
> that precludes setting the strand at construction time:
>
>>

Re: [Bioc-devel] strandless introns with VariantAnnotation::locateVariants()

2015-06-11 Thread Valerie Obenchain
This issue is that 'subject' is a GRangesList which has multiple strands 
per list element (so a simple strand(subject) can't be used to construct 
a GRanges). Additionally, there could be different strands in a single 
list element. I've added a check for this and a warning is thrown and 
strand is set to '*' if different strands are found in the same list 
element.


The fix has been checked into release and devel.

Valerie


On 06/11/2015 03:26 AM, Robert Castelo wrote:

hi Valerie,

thanks for the fix, it seems that it required a more complex update than
i initially thought looking at the code. actually, i found out that this
update is breaking with the sample VCF file from the VariantFiltering
package. i guess this requires some further fix in VariantAnnotation.
The problem shows up in both, devel and release, please find at the
bottom both sessionInfo() outputs.

thanks!!
robert.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

fl <- system.file("extdata", "CEUtrio.vcf.bgz", package="VariantFiltering")

vcf <- readVcf(fl, genome="hg19")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)

## this is necessary to avoid the chrM length glitch between
## b37 (GATK bundle 2.8 reference genome) and hg19 versions
commonlev <- intersect(seqlevels(vcf), seqlevels(txdb))
commonlev <- commonlev[seqlengths(vcf)[commonlev] ==
seqlengths(txdb)[commonlev]]
vcf <- keepSeqlevels(vcf, commonlev)

loc_intron <- locateVariants(vcf, txdb, region=IntronVariants())
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
   subscript contains NAs or out-of-bounds indices
traceback()
27: stop("subscript contains NAs or out-of-bounds indices")
26: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
25: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
24: normalizeSingleBracketSubscript(i, x, as.NSBS = TRUE)
23: extractROWS(x, i)
22: extractROWS(x, i)
21: ss[txHits]
20: ss[txHits]
19: is(strand, "Rle")
18: newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand =
strand,
 mcols = DataFrame(...), seqlengths = seqlengths, seqinfo =
seqinfo)
17: GRanges(seqnames = seqnames(query)[xHits], ranges =
IRanges(ranges(query)[xHits]),
 strand = ss[txHits], LOCATION = .location(length(xHits),
 vtype), LOCSTART = start(map), LOCEND = end(map), QUERYID =
xHits,
 TXID = txid, CDSID = cdsid, GENEID = NA_character_, PRECEDEID =
CharacterList(character(0)),
 FOLLOWID = CharacterList(character(0)))
16: .makeResult(query, subject, "intron", ignore.strand = ignore.strand,
 asHits = asHits)
15: .local(query, subject, region, ...)
14: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand
= ignore.strand,
 asHits = asHits)
13: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand
= ignore.strand,
 asHits = asHits)
12: eval(expr, envir, enclos)
11: eval(call, parent.frame())
10: callGeneric(query, cache[["intbytx"]], region, ..., ignore.strand =
ignore.strand,
 asHits = asHits)
9: .local(query, subject, region, ...)
8: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
ignore.strand = ignore.strand, asHits = asHits)
7: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
ignore.strand = ignore.strand, asHits = asHits)
6: eval(expr, envir, enclos)
5: eval(call, parent.frame())
4: callGeneric(rowRanges(query), subject, region, ..., cache = cache,
ignore.strand = ignore.strand, asHits = asHits)
3: .local(query, subject, region, ...)
2: locateVariants(vcf, txdb, region = IntronVariants())
1: locateVariants(vcf, txdb, region = IntronVariants())
R Under development (unstable) (2015-04-28 r68268)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora release 12 (Constantine)

locale:
  [1] LC_CTYPE=en_US.UTF8   LC_NUMERIC=C LC_TIME=en_US.UTF8
  [4] LC_COLLATE=en_US.UTF8 LC_MONETARY=en_US.UTF8
LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=en_US.UTF8   LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=CLC_MEASUREMENT=en_US.UTF8
LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets
  methods   base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3 GenomicFeatures_1.21.12
  [3] AnnotationDbi_1.31.14   VariantAnnotation_1.15.11
  [5] Rsamtools_1.21.8Biostrings_2.37.2
  [7] XVector_0.9.1   SummarizedExperiment_0.1.5
  [9] Biobase_2.29.1  GenomicRanges_1.21.14
[11] GenomeInfoDb_1.5.7  IRanges_2.3.10
[13] S4Vectors_0.7.3 BiocGenerics_0.15.2
[15] vimcom_1.0-0setwidth_1.0-3
[17] colorout_1.0-3

loaded via a namespace (and not attached):
  [1] GenomicAlignments_1.5.9 zlibbioc_1.15.0 BiocParallel_1.3.22
BSgenome_1.37.1
  [5]

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Robert Castelo
Of course, the inclusion of strand would imply an interpretation of the 
variant and its strand (e.g., "-") with respect to an annotated feature. 
I can see a practical problem of integrity of the information on a 
VRanges object, by which a mandatory column, such as strand, depends on 
a non-mandatory column, such as some feature annotation stored as a 
metadata column.


A solution would be to add the transcript identifier (TXID) as mandatory 
column on the VRanges object but I suspect this is a big change to do, 
so adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by 
locateVariants) in the metadata columns of the VRanges object would 
allow me to use a VRanges object as a container of variant x allele x 
sample x annotation.


Just to clear up the issue of merging strand and variant: a noisy 
variant (a variant that is not silent) and has a, e.g., loss-of-function 
effect such as the gain of a stop codon, is usually interpreted in the 
strand of the transcript and coding sequence in which the stop codon is 
gained, saying something like and A changed to a T producting the stop 
codon TAA. Ref and alt alleles are called in the strand of the reference 
chromosome, so if the transcript was annotated in the negative strand, 
we would know that we need to reverse-complement ref and alt to 
interpret the variant, although I see no need to do anything on the 
VRanges object to ref and alt because we know they are always in the 
strand of the reference chromosome. Only if you want to detect this 
stop-gain event (with predictCoding) then you would have to 
reverse-complement the ref and alt alleles. Conversely, if the variant 
falls in an intergenic region, then obviously the strand plays no role 
in the interpretation of the variant and nothing needs to be done when 
interpreting the ref and alt alleles.


On 6/11/15 5:47 PM, Michael Lawrence wrote:

The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].

On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo  wrote:

one option for me is just to add a metadata column with the strand of the
overlapping feature. however, i'm interested to fully understand the
rationale behind this aspect of the design of the VRanges object.

a VRanges object unrolls variants in a VCF file per alternative allele and
sample. variants in VCF files are obtained from tallying reads aligned on a
reference genome. so, my understanding is that the reference allele is the
allele of the reference genome against which the reads were aligned while
the alternate allele(s) are allele calls different from the reference. from
this perspective, my interpretation is that ref and alt alleles have already
a strand, which is the strand of the reference chromosome against which the
reads were aligned to. i'm interested in this interpretation of the strand
of the variants because i'm interested in the interpretation of
sequence-features containing the reference and the alternate alleles, such
as differences in a binding site with the reference and the alternate
allele.

if we relax the meaning of elements in a VRanges object to, not only
variants x allele x sample, but to variants x allele x sample x
annotated-feature, then i think it would make sense to have the
strand-specific annotation in the strand slot of the VRanges object.

while this idea may be good or not for a number of reasons, i'm now mostly
interested in knowing whether i'm misinterpreting the design of VRanges
objects, and maybe variant calling in general or i'm in a more or less right
path in using a VRanges object to hold variant annotations.


thanks!!!

robert.


On 06/11/2015 04:30 AM, Michael Lawrence wrote:

I guess it depends on what the strand should mean. Would having a
negative strand indicate that the ref/alt should be complemented? I'm
not sure it's a good idea to conflate the strand of the variant itself
with the strand of an overlapping feature.

On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo
wrote:

my understanding was that VRanges is a container for variants and variant
annotations and strand is just one annotation more. when we use
locateVariants() a variant can be annotated to multiple transcripts where
in
one overlaps an exon, in another an intron and so on. In all those
transcripts annotations there is a strand annotation, the strand of the
transcript. if the transcript is annotated in the negative strand of the
reference chromosome then the annotation of a transcript region to a
variant
is going to be also on the negative strand.

both locateVariants() and predictCoding() return GRanges objects with
strand
annotations according to the transcripts being a

Re: [Bioc-devel] strandless introns with VariantAnnotation::locateVariants()

2015-06-11 Thread Robert Castelo
Valerie, thanks for fixing this quickly, I guess that if 
intronsByTranscript() returns introns with different strand within a 
single transcript is rather an annotation error than a biological 
feature, so it is good to prompt a warning about it.


robert.

On 6/11/15 5:51 PM, Valerie Obenchain wrote:
This issue is that 'subject' is a GRangesList which has multiple 
strands per list element (so a simple strand(subject) can't be used to 
construct a GRanges). Additionally, there could be different strands 
in a single list element. I've added a check for this and a warning is 
thrown and strand is set to '*' if different strands are found in the 
same list element.


The fix has been checked into release and devel.

Valerie


On 06/11/2015 03:26 AM, Robert Castelo wrote:

hi Valerie,

thanks for the fix, it seems that it required a more complex update than
i initially thought looking at the code. actually, i found out that this
update is breaking with the sample VCF file from the VariantFiltering
package. i guess this requires some further fix in VariantAnnotation.
The problem shows up in both, devel and release, please find at the
bottom both sessionInfo() outputs.

thanks!!
robert.

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

fl <- system.file("extdata", "CEUtrio.vcf.bgz", 
package="VariantFiltering")


vcf <- readVcf(fl, genome="hg19")

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)

## this is necessary to avoid the chrM length glitch between
## b37 (GATK bundle 2.8 reference genome) and hg19 versions
commonlev <- intersect(seqlevels(vcf), seqlevels(txdb))
commonlev <- commonlev[seqlengths(vcf)[commonlev] ==
seqlengths(txdb)[commonlev]]
vcf <- keepSeqlevels(vcf, commonlev)

loc_intron <- locateVariants(vcf, txdb, region=IntronVariants())
Error in NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append) :
   subscript contains NAs or out-of-bounds indices
traceback()
27: stop("subscript contains NAs or out-of-bounds indices")
26: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
25: NSBS(i, x, exact = exact, upperBoundIsStrict = !allow.append)
24: normalizeSingleBracketSubscript(i, x, as.NSBS = TRUE)
23: extractROWS(x, i)
22: extractROWS(x, i)
21: ss[txHits]
20: ss[txHits]
19: is(strand, "Rle")
18: newGRanges("GRanges", seqnames = seqnames, ranges = ranges, strand =
strand,
 mcols = DataFrame(...), seqlengths = seqlengths, seqinfo =
seqinfo)
17: GRanges(seqnames = seqnames(query)[xHits], ranges =
IRanges(ranges(query)[xHits]),
 strand = ss[txHits], LOCATION = .location(length(xHits),
 vtype), LOCSTART = start(map), LOCEND = end(map), QUERYID =
xHits,
 TXID = txid, CDSID = cdsid, GENEID = NA_character_, PRECEDEID =
CharacterList(character(0)),
 FOLLOWID = CharacterList(character(0)))
16: .makeResult(query, subject, "intron", ignore.strand = ignore.strand,
 asHits = asHits)
15: .local(query, subject, region, ...)
14: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand
= ignore.strand,
 asHits = asHits)
13: locateVariants(query, cache[["intbytx"]], region, ..., ignore.strand
= ignore.strand,
 asHits = asHits)
12: eval(expr, envir, enclos)
11: eval(call, parent.frame())
10: callGeneric(query, cache[["intbytx"]], region, ..., ignore.strand =
ignore.strand,
 asHits = asHits)
9: .local(query, subject, region, ...)
8: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
ignore.strand = ignore.strand, asHits = asHits)
7: locateVariants(rowRanges(query), subject, region, ..., cache = cache,
ignore.strand = ignore.strand, asHits = asHits)
6: eval(expr, envir, enclos)
5: eval(call, parent.frame())
4: callGeneric(rowRanges(query), subject, region, ..., cache = cache,
ignore.strand = ignore.strand, asHits = asHits)
3: .local(query, subject, region, ...)
2: locateVariants(vcf, txdb, region = IntronVariants())
1: locateVariants(vcf, txdb, region = IntronVariants())
R Under development (unstable) (2015-04-28 r68268)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora release 12 (Constantine)

locale:
  [1] LC_CTYPE=en_US.UTF8   LC_NUMERIC=C LC_TIME=en_US.UTF8
  [4] LC_COLLATE=en_US.UTF8 LC_MONETARY=en_US.UTF8
LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=en_US.UTF8   LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=CLC_MEASUREMENT=en_US.UTF8
LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets
  methods   base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3 GenomicFeatures_1.21.12
  [3] AnnotationDbi_1.31.14 VariantAnnotation_1.15.11
  [5] Rsamtools_1.21.8Biostrings_2.37.2
  [7] XVector_0.9.1 SummarizedExperiment_0.1.5
  [9] Biobase_2.29.1 GenomicRanges_1.21.14
[11] GenomeInfoDb_1.5.7  IRanges_2.3.10
[13] S4Vectors_0.7.3 BiocGenerics_0

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Michael Lawrence
I didn't realize that locateVariants() returned an object with its
strand matching that of the subject. I would have expected the subject
strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
sounds like you want to merge the locateVariants() output with the
input. Merging the output strand as LOCSTRAND on the VRanges sounds
like a reasonable approach, for now. I don't know if Val is listening,
but it sounds like it would be nice to have convenient functions for
merging locateVariants() output with its input. The one for VRanges
might do something like the above.

Michael

On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo  wrote:
> Of course, the inclusion of strand would imply an interpretation of the
> variant and its strand (e.g., "-") with respect to an annotated feature. I
> can see a practical problem of integrity of the information on a VRanges
> object, by which a mandatory column, such as strand, depends on a
> non-mandatory column, such as some feature annotation stored as a metadata
> column.
>
> A solution would be to add the transcript identifier (TXID) as mandatory
> column on the VRanges object but I suspect this is a big change to do, so
> adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
> locateVariants) in the metadata columns of the VRanges object would allow me
> to use a VRanges object as a container of variant x allele x sample x
> annotation.
>
> Just to clear up the issue of merging strand and variant: a noisy variant (a
> variant that is not silent) and has a, e.g., loss-of-function effect such as
> the gain of a stop codon, is usually interpreted in the strand of the
> transcript and coding sequence in which the stop codon is gained, saying
> something like and A changed to a T producting the stop codon TAA. Ref and
> alt alleles are called in the strand of the reference chromosome, so if the
> transcript was annotated in the negative strand, we would know that we need
> to reverse-complement ref and alt to interpret the variant, although I see
> no need to do anything on the VRanges object to ref and alt because we know
> they are always in the strand of the reference chromosome. Only if you want
> to detect this stop-gain event (with predictCoding) then you would have to
> reverse-complement the ref and alt alleles. Conversely, if the variant falls
> in an intergenic region, then obviously the strand plays no role in the
> interpretation of the variant and nothing needs to be done when interpreting
> the ref and alt alleles.
>
>
> On 6/11/15 5:47 PM, Michael Lawrence wrote:
>>
>> The fact that the position describes the variant, but the strand
>> refers to the transcript is confusing to me. What is the concrete use
>> case for merging the two features like that? VRanges constrains its
>> strand for at least 2 reasons: (1) to be less error prone [of course
>> this runs completely counter to flexibility] and (2) simplicity [we
>> don't have to worry about what "-" means for ref/alt, overlap, etc].
>>
>> On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo 
>> wrote:
>>>
>>> one option for me is just to add a metadata column with the strand of the
>>> overlapping feature. however, i'm interested to fully understand the
>>> rationale behind this aspect of the design of the VRanges object.
>>>
>>> a VRanges object unrolls variants in a VCF file per alternative allele
>>> and
>>> sample. variants in VCF files are obtained from tallying reads aligned on
>>> a
>>> reference genome. so, my understanding is that the reference allele is
>>> the
>>> allele of the reference genome against which the reads were aligned while
>>> the alternate allele(s) are allele calls different from the reference.
>>> from
>>> this perspective, my interpretation is that ref and alt alleles have
>>> already
>>> a strand, which is the strand of the reference chromosome against which
>>> the
>>> reads were aligned to. i'm interested in this interpretation of the
>>> strand
>>> of the variants because i'm interested in the interpretation of
>>> sequence-features containing the reference and the alternate alleles,
>>> such
>>> as differences in a binding site with the reference and the alternate
>>> allele.
>>>
>>> if we relax the meaning of elements in a VRanges object to, not only
>>> variants x allele x sample, but to variants x allele x sample x
>>> annotated-feature, then i think it would make sense to have the
>>> strand-specific annotation in the strand slot of the VRanges object.
>>>
>>> while this idea may be good or not for a number of reasons, i'm now
>>> mostly
>>> interested in knowing whether i'm misinterpreting the design of VRanges
>>> objects, and maybe variant calling in general or i'm in a more or less
>>> right
>>> path in using a VRanges object to hold variant annotations.
>>>
>>>
>>> thanks!!!
>>>
>>> robert.
>>>
>>>
>>> On 06/11/2015 04:30 AM, Michael Lawrence wrote:

 I guess it depends on what the strand should mean. Would having a
 negative strand i

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Robert Castelo
In fact, a feature like locateVariants() returning an object of the same 
class of the input query argument would be (IMO) a useful feature at 
least in the case of an input CollapsedVCF object, because the user 
could then dump directly the annotations to a VCF file with writeVcf(), 
which is one way to store variant annotations into files.


On 6/11/15 6:38 PM, Michael Lawrence wrote:

I didn't realize that locateVariants() returned an object with its
strand matching that of the subject. I would have expected the subject
strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
sounds like you want to merge the locateVariants() output with the
input. Merging the output strand as LOCSTRAND on the VRanges sounds
like a reasonable approach, for now. I don't know if Val is listening,
but it sounds like it would be nice to have convenient functions for
merging locateVariants() output with its input. The one for VRanges
might do something like the above.

Michael

On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo  wrote:

Of course, the inclusion of strand would imply an interpretation of the
variant and its strand (e.g., "-") with respect to an annotated feature. I
can see a practical problem of integrity of the information on a VRanges
object, by which a mandatory column, such as strand, depends on a
non-mandatory column, such as some feature annotation stored as a metadata
column.

A solution would be to add the transcript identifier (TXID) as mandatory
column on the VRanges object but I suspect this is a big change to do, so
adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
locateVariants) in the metadata columns of the VRanges object would allow me
to use a VRanges object as a container of variant x allele x sample x
annotation.

Just to clear up the issue of merging strand and variant: a noisy variant (a
variant that is not silent) and has a, e.g., loss-of-function effect such as
the gain of a stop codon, is usually interpreted in the strand of the
transcript and coding sequence in which the stop codon is gained, saying
something like and A changed to a T producting the stop codon TAA. Ref and
alt alleles are called in the strand of the reference chromosome, so if the
transcript was annotated in the negative strand, we would know that we need
to reverse-complement ref and alt to interpret the variant, although I see
no need to do anything on the VRanges object to ref and alt because we know
they are always in the strand of the reference chromosome. Only if you want
to detect this stop-gain event (with predictCoding) then you would have to
reverse-complement the ref and alt alleles. Conversely, if the variant falls
in an intergenic region, then obviously the strand plays no role in the
interpretation of the variant and nothing needs to be done when interpreting
the ref and alt alleles.


On 6/11/15 5:47 PM, Michael Lawrence wrote:

The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].

On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo 
wrote:

one option for me is just to add a metadata column with the strand of the
overlapping feature. however, i'm interested to fully understand the
rationale behind this aspect of the design of the VRanges object.

a VRanges object unrolls variants in a VCF file per alternative allele
and
sample. variants in VCF files are obtained from tallying reads aligned on
a
reference genome. so, my understanding is that the reference allele is
the
allele of the reference genome against which the reads were aligned while
the alternate allele(s) are allele calls different from the reference.
from
this perspective, my interpretation is that ref and alt alleles have
already
a strand, which is the strand of the reference chromosome against which
the
reads were aligned to. i'm interested in this interpretation of the
strand
of the variants because i'm interested in the interpretation of
sequence-features containing the reference and the alternate alleles,
such
as differences in a binding site with the reference and the alternate
allele.

if we relax the meaning of elements in a VRanges object to, not only
variants x allele x sample, but to variants x allele x sample x
annotated-feature, then i think it would make sense to have the
strand-specific annotation in the strand slot of the VRanges object.

while this idea may be good or not for a number of reasons, i'm now
mostly
interested in knowing whether i'm misinterpreting the design of VRanges
objects, and maybe variant calling in general or i'm in a more or less
right
path in using a VRanges object to hold variant annotations.


thanks!!!

robert.


On 06/11/2015

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Valerie Obenchain
locateVariants(), predictCoding() and the family of mapToTranscripts() 
functions all return strand according to the annotation matched. The 
only time the strand of the output could possibly be different from the 
strand of the input 'query' is when 'ignore.strand = TRUE' (FALSE by 
default).


I wouldn't think you (Robert) are using 'ignore.strand = TRUE', are you? 
By just using the default, the output will have the same strand as the 
input 'query' (unless 'query' is '*' of course).


That said, do you still feel it's necessary to add a LOCSTRAND column to 
the output?


Val

On 06/11/2015 09:38 AM, Michael Lawrence wrote:

I didn't realize that locateVariants() returned an object with its
strand matching that of the subject. I would have expected the subject
strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
sounds like you want to merge the locateVariants() output with the
input. Merging the output strand as LOCSTRAND on the VRanges sounds
like a reasonable approach, for now. I don't know if Val is listening,
but it sounds like it would be nice to have convenient functions for
merging locateVariants() output with its input. The one for VRanges
might do something like the above.

Michael

On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo  wrote:

Of course, the inclusion of strand would imply an interpretation of the
variant and its strand (e.g., "-") with respect to an annotated feature. I
can see a practical problem of integrity of the information on a VRanges
object, by which a mandatory column, such as strand, depends on a
non-mandatory column, such as some feature annotation stored as a metadata
column.

A solution would be to add the transcript identifier (TXID) as mandatory
column on the VRanges object but I suspect this is a big change to do, so
adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
locateVariants) in the metadata columns of the VRanges object would allow me
to use a VRanges object as a container of variant x allele x sample x
annotation.

Just to clear up the issue of merging strand and variant: a noisy variant (a
variant that is not silent) and has a, e.g., loss-of-function effect such as
the gain of a stop codon, is usually interpreted in the strand of the
transcript and coding sequence in which the stop codon is gained, saying
something like and A changed to a T producting the stop codon TAA. Ref and
alt alleles are called in the strand of the reference chromosome, so if the
transcript was annotated in the negative strand, we would know that we need
to reverse-complement ref and alt to interpret the variant, although I see
no need to do anything on the VRanges object to ref and alt because we know
they are always in the strand of the reference chromosome. Only if you want
to detect this stop-gain event (with predictCoding) then you would have to
reverse-complement the ref and alt alleles. Conversely, if the variant falls
in an intergenic region, then obviously the strand plays no role in the
interpretation of the variant and nothing needs to be done when interpreting
the ref and alt alleles.


On 6/11/15 5:47 PM, Michael Lawrence wrote:


The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].

On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo 
wrote:


one option for me is just to add a metadata column with the strand of the
overlapping feature. however, i'm interested to fully understand the
rationale behind this aspect of the design of the VRanges object.

a VRanges object unrolls variants in a VCF file per alternative allele
and
sample. variants in VCF files are obtained from tallying reads aligned on
a
reference genome. so, my understanding is that the reference allele is
the
allele of the reference genome against which the reads were aligned while
the alternate allele(s) are allele calls different from the reference.
from
this perspective, my interpretation is that ref and alt alleles have
already
a strand, which is the strand of the reference chromosome against which
the
reads were aligned to. i'm interested in this interpretation of the
strand
of the variants because i'm interested in the interpretation of
sequence-features containing the reference and the alternate alleles,
such
as differences in a binding site with the reference and the alternate
allele.

if we relax the meaning of elements in a VRanges object to, not only
variants x allele x sample, but to variants x allele x sample x
annotated-feature, then i think it would make sense to have the
strand-specific annotation in the strand slot of the VRanges object.

while this idea may be good or not for a number of reasons, i'm now
mostl

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Michael Lawrence
Val,

I wasn't suggesting that LOCSTRAND be added to the locateVariants()
output. Rather, it would be added to the VRanges during the merge.

Michael

On Thu, Jun 11, 2015 at 10:10 AM, Valerie Obenchain
 wrote:
> locateVariants(), predictCoding() and the family of mapToTranscripts()
> functions all return strand according to the annotation matched. The only
> time the strand of the output could possibly be different from the strand of
> the input 'query' is when 'ignore.strand = TRUE' (FALSE by default).
>
> I wouldn't think you (Robert) are using 'ignore.strand = TRUE', are you? By
> just using the default, the output will have the same strand as the input
> 'query' (unless 'query' is '*' of course).
>
> That said, do you still feel it's necessary to add a LOCSTRAND column to the
> output?
>
> Val
>
>
> On 06/11/2015 09:38 AM, Michael Lawrence wrote:
>>
>> I didn't realize that locateVariants() returned an object with its
>> strand matching that of the subject. I would have expected the subject
>> strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
>> sounds like you want to merge the locateVariants() output with the
>> input. Merging the output strand as LOCSTRAND on the VRanges sounds
>> like a reasonable approach, for now. I don't know if Val is listening,
>> but it sounds like it would be nice to have convenient functions for
>> merging locateVariants() output with its input. The one for VRanges
>> might do something like the above.
>>
>> Michael
>>
>> On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo 
>> wrote:
>>>
>>> Of course, the inclusion of strand would imply an interpretation of the
>>> variant and its strand (e.g., "-") with respect to an annotated feature.
>>> I
>>> can see a practical problem of integrity of the information on a VRanges
>>> object, by which a mandatory column, such as strand, depends on a
>>> non-mandatory column, such as some feature annotation stored as a
>>> metadata
>>> column.
>>>
>>> A solution would be to add the transcript identifier (TXID) as mandatory
>>> column on the VRanges object but I suspect this is a big change to do, so
>>> adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
>>> locateVariants) in the metadata columns of the VRanges object would allow
>>> me
>>> to use a VRanges object as a container of variant x allele x sample x
>>> annotation.
>>>
>>> Just to clear up the issue of merging strand and variant: a noisy variant
>>> (a
>>> variant that is not silent) and has a, e.g., loss-of-function effect such
>>> as
>>> the gain of a stop codon, is usually interpreted in the strand of the
>>> transcript and coding sequence in which the stop codon is gained, saying
>>> something like and A changed to a T producting the stop codon TAA. Ref
>>> and
>>> alt alleles are called in the strand of the reference chromosome, so if
>>> the
>>> transcript was annotated in the negative strand, we would know that we
>>> need
>>> to reverse-complement ref and alt to interpret the variant, although I
>>> see
>>> no need to do anything on the VRanges object to ref and alt because we
>>> know
>>> they are always in the strand of the reference chromosome. Only if you
>>> want
>>> to detect this stop-gain event (with predictCoding) then you would have
>>> to
>>> reverse-complement the ref and alt alleles. Conversely, if the variant
>>> falls
>>> in an intergenic region, then obviously the strand plays no role in the
>>> interpretation of the variant and nothing needs to be done when
>>> interpreting
>>> the ref and alt alleles.
>>>
>>>
>>> On 6/11/15 5:47 PM, Michael Lawrence wrote:


 The fact that the position describes the variant, but the strand
 refers to the transcript is confusing to me. What is the concrete use
 case for merging the two features like that? VRanges constrains its
 strand for at least 2 reasons: (1) to be less error prone [of course
 this runs completely counter to flexibility] and (2) simplicity [we
 don't have to worry about what "-" means for ref/alt, overlap, etc].

 On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo 
 wrote:
>
>
> one option for me is just to add a metadata column with the strand of
> the
> overlapping feature. however, i'm interested to fully understand the
> rationale behind this aspect of the design of the VRanges object.
>
> a VRanges object unrolls variants in a VCF file per alternative allele
> and
> sample. variants in VCF files are obtained from tallying reads aligned
> on
> a
> reference genome. so, my understanding is that the reference allele is
> the
> allele of the reference genome against which the reads were aligned
> while
> the alternate allele(s) are allele calls different from the reference.
> from
> this perspective, my interpretation is that ref and alt alleles have
> already
> a strand, which is the strand of the reference chromosome against which
> th

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Valerie Obenchain
I see. So the merge functions would just add the output of 
locateVariants() to the GRanges or VCF used as 'query'. Or I guess, as 
Robert suggested, locateVariants() could return the same object as 'query'.


Val


On 06/11/2015 10:19 AM, Michael Lawrence wrote:

Val,

I wasn't suggesting that LOCSTRAND be added to the locateVariants()
output. Rather, it would be added to the VRanges during the merge.

Michael

On Thu, Jun 11, 2015 at 10:10 AM, Valerie Obenchain
 wrote:

locateVariants(), predictCoding() and the family of mapToTranscripts()
functions all return strand according to the annotation matched. The only
time the strand of the output could possibly be different from the strand of
the input 'query' is when 'ignore.strand = TRUE' (FALSE by default).

I wouldn't think you (Robert) are using 'ignore.strand = TRUE', are you? By
just using the default, the output will have the same strand as the input
'query' (unless 'query' is '*' of course).

That said, do you still feel it's necessary to add a LOCSTRAND column to the
output?

Val


On 06/11/2015 09:38 AM, Michael Lawrence wrote:


I didn't realize that locateVariants() returned an object with its
strand matching that of the subject. I would have expected the subject
strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
sounds like you want to merge the locateVariants() output with the
input. Merging the output strand as LOCSTRAND on the VRanges sounds
like a reasonable approach, for now. I don't know if Val is listening,
but it sounds like it would be nice to have convenient functions for
merging locateVariants() output with its input. The one for VRanges
might do something like the above.

Michael

On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo 
wrote:


Of course, the inclusion of strand would imply an interpretation of the
variant and its strand (e.g., "-") with respect to an annotated feature.
I
can see a practical problem of integrity of the information on a VRanges
object, by which a mandatory column, such as strand, depends on a
non-mandatory column, such as some feature annotation stored as a
metadata
column.

A solution would be to add the transcript identifier (TXID) as mandatory
column on the VRanges object but I suspect this is a big change to do, so
adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
locateVariants) in the metadata columns of the VRanges object would allow
me
to use a VRanges object as a container of variant x allele x sample x
annotation.

Just to clear up the issue of merging strand and variant: a noisy variant
(a
variant that is not silent) and has a, e.g., loss-of-function effect such
as
the gain of a stop codon, is usually interpreted in the strand of the
transcript and coding sequence in which the stop codon is gained, saying
something like and A changed to a T producting the stop codon TAA. Ref
and
alt alleles are called in the strand of the reference chromosome, so if
the
transcript was annotated in the negative strand, we would know that we
need
to reverse-complement ref and alt to interpret the variant, although I
see
no need to do anything on the VRanges object to ref and alt because we
know
they are always in the strand of the reference chromosome. Only if you
want
to detect this stop-gain event (with predictCoding) then you would have
to
reverse-complement the ref and alt alleles. Conversely, if the variant
falls
in an intergenic region, then obviously the strand plays no role in the
interpretation of the variant and nothing needs to be done when
interpreting
the ref and alt alleles.


On 6/11/15 5:47 PM, Michael Lawrence wrote:



The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].

On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo 
wrote:



one option for me is just to add a metadata column with the strand of
the
overlapping feature. however, i'm interested to fully understand the
rationale behind this aspect of the design of the VRanges object.

a VRanges object unrolls variants in a VCF file per alternative allele
and
sample. variants in VCF files are obtained from tallying reads aligned
on
a
reference genome. so, my understanding is that the reference allele is
the
allele of the reference genome against which the reads were aligned
while
the alternate allele(s) are allele calls different from the reference.
from
this perspective, my interpretation is that ref and alt alleles have
already
a strand, which is the strand of the reference chromosome against which
the
reads were aligned to. i'm interested in this interpretation of the
strand
of the variants because i'm interested in the interpretation of
sequence-features cont

[Bioc-devel] version increments for unchanged packages

2015-06-11 Thread Stephanie M. Gogarten
Why is it that packages with no changes still get new version numbers at 
every release? For example, my experiment data package GWASdata has not 
changed since the last release, but the version was bumped from 1.4.0 to 
1.6.0. I imagine most users expect that a change in version number 
indicates some change in content.


Stephanie

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] S4 classes extending GRangesList

2015-06-11 Thread Leonard Goldstein
Hi all,

I noticed that when combining instances of a class that inherits from
GRangesList, the result does not preserve the class (it is returned as
a GRangesList instead). The class is preserved in other situations
(e.g. when a class extends GRanges). See below for an example. Is
there a reason why the class cannot be preserved in the first case?
Thanks in advance for your help.

Leonard


> ## define a new class 'A' inheriting from GRanges
> setClass(Class = "A", contains = "GRanges")
>
> ## combining two instances of class 'A' returns an object of class 'A'
> gr <- GRanges("1", IRanges(1, 100))
> a <- new("A", gr)
> a
A object with 1 range and 0 metadata columns:
  seqnamesranges strand

  [1]1  [1, 100]  *
  ---
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> c(a, a)
A object with 2 ranges and 0 metadata columns:
  seqnamesranges strand

  [1]1  [1, 100]  *
  [2]1  [1, 100]  *
  ---
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> ## define a new class 'b' inheriting from GRangesList
> setClass(Class = "B", contains = "GRangesList")
>
> ## combining two instances of class 'B' returns a GRangesList
> grl <- split(gr, 1)
> b <- new("B", grl)
> b
B object of length 1:
$1
GRanges object with 1 range and 0 metadata columns:
  seqnamesranges strand

  [1]1  [1, 100]  *

---
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> c(b, b)
GRangesList object of length 2:
$1
GRanges object with 1 range and 0 metadata columns:
  seqnamesranges strand

  [1]1  [1, 100]  *

$1
GRanges object with 1 range and 0 metadata columns:
  seqnames   ranges strand
  [1]1 [1, 100]  *

---
seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> sessionInfo()
R Under development (unstable) (2014-11-04 r66932)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets
[8] methods   base

other attached packages:
[1] GenomicRanges_1.21.15 GenomeInfoDb_1.5.7IRanges_2.3.11
[4] S4Vectors_0.7.4   BiocGenerics_0.15.2

loaded via a namespace (and not attached):
[1] XVector_0.9.1
>

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] S4 classes extending GRangesList

2015-06-11 Thread Hervé Pagès

Hi Leonard,

It's a bug in the "c" method for "CompressedList" object. I'll fix
that. Thanks for the report.

H.

On 06/11/2015 10:48 AM, Leonard Goldstein wrote:

Hi all,

I noticed that when combining instances of a class that inherits from
GRangesList, the result does not preserve the class (it is returned as
a GRangesList instead). The class is preserved in other situations
(e.g. when a class extends GRanges). See below for an example. Is
there a reason why the class cannot be preserved in the first case?
Thanks in advance for your help.

Leonard



## define a new class 'A' inheriting from GRanges
setClass(Class = "A", contains = "GRanges")

## combining two instances of class 'A' returns an object of class 'A'
gr <- GRanges("1", IRanges(1, 100))
a <- new("A", gr)
a

A object with 1 range and 0 metadata columns:
   seqnamesranges strand
 
   [1]1  [1, 100]  *
   ---
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

c(a, a)

A object with 2 ranges and 0 metadata columns:
   seqnamesranges strand
 
   [1]1  [1, 100]  *
   [2]1  [1, 100]  *
   ---
   seqinfo: 1 sequence from an unspecified genome; no seqlengths


## define a new class 'b' inheriting from GRangesList
setClass(Class = "B", contains = "GRangesList")

## combining two instances of class 'B' returns a GRangesList
grl <- split(gr, 1)
b <- new("B", grl)
b

B object of length 1:
$1
GRanges object with 1 range and 0 metadata columns:
   seqnamesranges strand
 
   [1]1  [1, 100]  *

---
seqinfo: 1 sequence from an unspecified genome; no seqlengths

c(b, b)

GRangesList object of length 2:
$1
GRanges object with 1 range and 0 metadata columns:
   seqnamesranges strand
 
   [1]1  [1, 100]  *

$1
GRanges object with 1 range and 0 metadata columns:
   seqnames   ranges strand
   [1]1 [1, 100]  *

---
seqinfo: 1 sequence from an unspecified genome; no seqlengths


sessionInfo()

R Under development (unstable) (2014-11-04 r66932)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
  [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets
[8] methods   base

other attached packages:
[1] GenomicRanges_1.21.15 GenomeInfoDb_1.5.7IRanges_2.3.11
[4] S4Vectors_0.7.4   BiocGenerics_0.15.2

loaded via a namespace (and not attached):
[1] XVector_0.9.1




___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] version increments for unchanged packages

2015-06-11 Thread Gabe Becker
Stephanie,

As far as I know, it is so that package versions are unique to specific
releases of bioconductor. This has the benefits of

   1. providing assurances that that particular version of a package is
   tested and confirmed to work within that release, and
   2. enforces that the source code/data for a particular version of a
   package appears only and exactly once within the Bioconductor SVN
   structure.


Together, these prevent there being ambiguity when a package needs to be
updated/fixed in the context of a particular release, both in terms of what
version needs to be fixed and where that fix needs to be applied.

Best,
~G

On Thu, Jun 11, 2015 at 10:40 AM, Stephanie M. Gogarten <
sdmor...@u.washington.edu> wrote:

> Why is it that packages with no changes still get new version numbers at
> every release? For example, my experiment data package GWASdata has not
> changed since the last release, but the version was bumped from 1.4.0 to
> 1.6.0. I imagine most users expect that a change in version number
> indicates some change in content.
>
> Stephanie
>
> ___
> Bioc-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>



-- 
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] version increments for unchanged packages

2015-06-11 Thread Hervé Pagès

Hi Stephanie,

On 06/11/2015 11:33 AM, Gabe Becker wrote:

Stephanie,

As far as I know, it is so that package versions are unique to specific
releases of bioconductor. This has the benefits of

1. providing assurances that that particular version of a package is
tested and confirmed to work within that release, and
2. enforces that the source code/data for a particular version of a
package appears only and exactly once within the Bioconductor SVN
structure.


Together, these prevent there being ambiguity when a package needs to be
updated/fixed in the context of a particular release, both in terms of what
version needs to be fixed and where that fix needs to be applied.


Exactly.

One more thing: because the exact version of R that is used to build
binary packages is not necessarily the same between 2 versions of
Bioconductor, we would end up with binary packages that are different
(and possibly not interchangeable) but impossible to distinguish based
on their name if we were not bumping versions (e.g. both would be 
GWASdata_1.4.0.zip). Could make troubleshooting nightmarish.


Hope that makes sense,
H.



Best,
~G

On Thu, Jun 11, 2015 at 10:40 AM, Stephanie M. Gogarten <
sdmor...@u.washington.edu> wrote:


Why is it that packages with no changes still get new version numbers at
every release? For example, my experiment data package GWASdata has not
changed since the last release, but the version was bumped from 1.4.0 to
1.6.0. I imagine most users expect that a change in version number
indicates some change in content.

Stephanie

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel







--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] S4 classes extending GRangesList

2015-06-11 Thread Leonard Goldstein
Thanks Hervé.

Best wishes,

Leonard


On Thu, Jun 11, 2015 at 10:58 AM, Hervé Pagès  wrote:
> Hi Leonard,
>
> It's a bug in the "c" method for "CompressedList" object. I'll fix
> that. Thanks for the report.
>
> H.
>
>
> On 06/11/2015 10:48 AM, Leonard Goldstein wrote:
>>
>> Hi all,
>>
>> I noticed that when combining instances of a class that inherits from
>> GRangesList, the result does not preserve the class (it is returned as
>> a GRangesList instead). The class is preserved in other situations
>> (e.g. when a class extends GRanges). See below for an example. Is
>> there a reason why the class cannot be preserved in the first case?
>> Thanks in advance for your help.
>>
>> Leonard
>>
>>
>>> ## define a new class 'A' inheriting from GRanges
>>> setClass(Class = "A", contains = "GRanges")
>>>
>>> ## combining two instances of class 'A' returns an object of class 'A'
>>> gr <- GRanges("1", IRanges(1, 100))
>>> a <- new("A", gr)
>>> a
>>
>> A object with 1 range and 0 metadata columns:
>>seqnamesranges strand
>>  
>>[1]1  [1, 100]  *
>>---
>>seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>
>>> c(a, a)
>>
>> A object with 2 ranges and 0 metadata columns:
>>seqnamesranges strand
>>  
>>[1]1  [1, 100]  *
>>[2]1  [1, 100]  *
>>---
>>seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>
>>>
>>> ## define a new class 'b' inheriting from GRangesList
>>> setClass(Class = "B", contains = "GRangesList")
>>>
>>> ## combining two instances of class 'B' returns a GRangesList
>>> grl <- split(gr, 1)
>>> b <- new("B", grl)
>>> b
>>
>> B object of length 1:
>> $1
>> GRanges object with 1 range and 0 metadata columns:
>>seqnamesranges strand
>>  
>>[1]1  [1, 100]  *
>>
>> ---
>> seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>
>>> c(b, b)
>>
>> GRangesList object of length 2:
>> $1
>> GRanges object with 1 range and 0 metadata columns:
>>seqnamesranges strand
>>  
>>[1]1  [1, 100]  *
>>
>> $1
>> GRanges object with 1 range and 0 metadata columns:
>>seqnames   ranges strand
>>[1]1 [1, 100]  *
>>
>> ---
>> seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>
>>>
>>> sessionInfo()
>>
>> R Under development (unstable) (2014-11-04 r66932)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
>>   [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
>>   [9] LC_ADDRESS=C   LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats4parallel  stats graphics  grDevices utils datasets
>> [8] methods   base
>>
>> other attached packages:
>> [1] GenomicRanges_1.21.15 GenomeInfoDb_1.5.7IRanges_2.3.11
>> [4] S4Vectors_0.7.4   BiocGenerics_0.15.2
>>
>> loaded via a namespace (and not attached):
>> [1] XVector_0.9.1
>>>
>>>
>>
>> ___
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone:  (206) 667-5791
> Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

2015-06-11 Thread Rainer Johannes
update: this approach bases on Martin's suggestion and I think is ideal for 
EnsDb or any Ensembl based annotation packages. It's not required to install an 
explicit genome package, but load (and eventually cache) dynamically the 
correct genomic sequence from Ensembl via AnnotationHub. That way chromosome 
names, variants etc are perfectly aligned. Also, this guarantees that the 
genome version and gene definitions, both from Ensembl match.

Example code (is now also included in the ensembldb vignette):

  library(AnnotationHub)
  library(EnsDb.Hsapiens.v75)
  library(Rsamtools)
  ah <- AnnotationHub()

  edb <- EnsDb.Hsapiens.v75

  ## get the Ensembl version
  eVersion <- metadata(edb)[metadata(edb)$name=="ensembl_version", "value"]
  ## query all available files for the Ensembl version
  eData <- query(ah, c(organism(edb), paste0("release-", eVersion)))
  eData

  ## retrieve the *dna.toplevel.fa file; this might take some time.
  Dna <- ah[["AH20439"]]
  ## generate an index if none is available
  if(is.na(index(Dna))){
  indexFa(Dna)
  Dna <- FaFile(path(Dna))
  }

  ## get start/end coordinates of all genes
  genes <- genes(edb)
  ## subset to all genes that are encoded on chromosomes for which
  ## we do have DNA sequence available.
  genes <- genes[seqnames(genes) %in% seqnames(seqinfo(Dna))]
  ## get the gene sequences, i.e. the sequence including the sequence of
  ## all of the gene's exons and introns
  geneSeqs <- getSeq(Dna, genes)

cheers, jo

On 09 Jun 2015, at 11:38, Rainer Johannes 
mailto:johannes.rai...@eurac.edu>> wrote:

dear Ludwig,

On 09 Jun 2015, at 10:46, Ludwig Geistlinger 
mailto:ludwig.geistlin...@bio.ifi.lmu.de>> 
wrote:

Dear Johannes,

Thx for providing the great EnsDb packages!

One question:

As of now, I am able to choose between TxDb and EnsDb for genomic
coordinates of genomic features such as genes, transcripts, and exons.
For the sequences themselves I need the corresponding BSgenome package.

While it is easy to automatically map from a specific TxDb package (eg
TxDb.Hsapiens.UCSC..knownGene) to the corresponding BSgenome package
(here: BSgenome.Hsapiens.UCSC.), I wonder how to do that for an
EnsDb package as the package name (eg EnsDb.Hsapiens.v79) contains no
information about the genome build.

A cumbersome option would be to extract the genome_build from the metadata
of the EnsDb package (which would give me for EnsDb.Hsapiens.v79:
'GRCh38') and then ask all existing BSgenome.Hsapiens packages for their
metadata release name (eg 'GRCh38' for BSgenome.Hsapiens.UCSC.hg38).

This however needs all BSgenome.Hsapiens packages installed and takes thus
too much time and space for a programmatic access.

Can you suggest a better way to map from coordinates to sequence (within
the BioC annotation functionality)?


agree, there's no easy mapping (yet). I'll implement a method 
"suggestGenomePackage" in the ensembldb package. In the long run I hope that 
also NCBI BSgenome packages (like the BSgenome.Hsapiens.NCBI.GRCh38) will 
become available for all species... that would make the mapping much easier...

cheers, jo

Thanks & Best,
Ludwig





dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute "entrezid" that is stored in the
internal database there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use "pure" Ensembl based annotations: use
EnsDb, if you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the
ensembldb package provides functionality to generate such packages either
using the Ensembl Perl API or using GTF files provided by Ensembl. If you
have problems building the packages, just drop me a line and I'll do
that.

cheers, jo

On 03 Jun 2015, at 15:56, Robert M. Flight 
mailto:rfligh...@gmail.com>> wrote:

Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what
is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885

If instead of "genes" you do "transcripts", you will see 20 different
transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
ludwig.geistlin...@bio.ifi.lmu.de> 
wrote:

Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
for

BRCA1; ENSG0

Re: [Bioc-devel] version increments for unchanged packages

2015-06-11 Thread Hervé Pagès

On 06/11/2015 11:50 AM, Hervé Pagès wrote:

Hi Stephanie,

On 06/11/2015 11:33 AM, Gabe Becker wrote:

Stephanie,

As far as I know, it is so that package versions are unique to specific
releases of bioconductor. This has the benefits of

1. providing assurances that that particular version of a package is
tested and confirmed to work within that release, and
2. enforces that the source code/data for a particular version of a
package appears only and exactly once within the Bioconductor SVN
structure.


Together, these prevent there being ambiguity when a package needs to be
updated/fixed in the context of a particular release, both in terms of
what
version needs to be fixed and where that fix needs to be applied.


Exactly.

One more thing: because the exact version of R that is used to build
binary packages is not necessarily the same between 2 versions of
Bioconductor, we would end up with binary packages that are different
(and possibly not interchangeable) but impossible to distinguish based
on their name if we were not bumping versions (e.g. both would be
GWASdata_1.4.0.zip). Could make troubleshooting nightmarish.


I forgot that we don't build binaries for data experiment packages
anymore. Because these packages don't need compilation, the source
package can be installed on any platform without the need of any extra
tool (e.g. Rtools or Xcode).

That argument still applies for software packages though...

H.



Hope that makes sense,
H.



Best,
~G

On Thu, Jun 11, 2015 at 10:40 AM, Stephanie M. Gogarten <
sdmor...@u.washington.edu> wrote:


Why is it that packages with no changes still get new version numbers at
every release? For example, my experiment data package GWASdata has not
changed since the last release, but the version was bumped from 1.4.0 to
1.6.0. I imagine most users expect that a change in version number
indicates some change in content.

Stephanie

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel









--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Use and Usability metrics / shields

2015-06-11 Thread Leonardo Collado Torres
Is it ok if we use the Bioconductor shields elsewhere? For example, in
a github repo landing page or in our website listing the software
we've contributed to.

In particular, I'm thinking of adding
http://www.bioconductor.org/shields/posts/derfinder.svg (and the other
shields) to https://github.com/lcolladotor/derfinder to go along the
Travis CI shield I use right now.

On Thu, Jun 4, 2015 at 11:22 AM, Jim Hester  wrote:
> Henrik,
>
> While I proposed the idea for the shields/badges Dan gets all the credit
> for the implementation.
>
> As far as your (implied) idea of a coverage badge, the thought had occurred
> to us!
>
> Jim
>
> On Wed, May 20, 2015 at 1:34 PM, Henrik Bengtsson > wrote:
>
>> So, lots of things are happening in a few months: Jim Hester starts
>> working at Bioconductor, we get Bioc shields/badges, Jim's covr
>> package is released on CRAN, snare drum, ...  am I to eager if I
>> already now start wishing for a hi-hat as well?
>>
>> /Henrik
>>
>> On Tue, May 19, 2015 at 12:47 PM, Dan Tenenbaum 
>> wrote:
>> >
>> >
>> > - Original Message -
>> >> From: "Leonardo Collado Torres" 
>> >> To: "Dan Tenenbaum" 
>> >> Cc: "Jim Hester" , bioc-devel@r-project.org
>> >> Sent: Tuesday, May 19, 2015 12:37:18 PM
>> >> Subject: Re: [Bioc-devel] Use and Usability metrics / shields
>> >>
>> >> Regarding the 'posts' tag, I can see that it includes a "closed
>> >> questions" component. For example,
>> >> http://www.bioconductor.org/packages/release/bioc/html/derfinder.html
>> >> is 3/1/9/0 right now meaning that 0 questions are closed. From
>> >> https://support.bioconductor.org/info/faq/, only moderators can close
>> >> questions. That seems like quite a bit of work for the moderators. So
>> >> maybe it would be best to drop the "closed questions" component. Or
>> >> alternatively, can the author of a package moderate the posts that
>> >> have a tag corresponding to their package?
>> >>
>> >
>> > Perhaps the wording is wrong; what 'closed' is supposed to mean is that
>> the original poster has accepted an answer. I'll change 'closed' to
>> 'accepted'.
>> >
>> >
>> >> As for 'build: warnings', it seems like it will show for some devel
>> >> packages all the time. For example,
>> >> http://www.bioconductor.org/packages/devel/bioc/html/regionReport.html
>> >> always has a warning in the Windows build machine due to a mismatch
>> >> in
>> >> the version of Rtools installed.
>> >
>> > This is a bug in devtools and may have already been fixed (but not yet
>> propagated to CRAN).
>> > IMO this should be reflected in the build shield.
>> >
>> > Dan
>> >
>> >
>> >>
>> >> I do like these changes and the addition of shields =)
>> >>
>> >>
>> >> On Thu, May 14, 2015 at 10:56 AM, Dan Tenenbaum
>> >>  wrote:
>> >> >
>> >> > - Original Message -
>> >> >> From: "Jim Hester" 
>> >> >> To: "Martin Morgan" 
>> >> >> Cc: bioc-devel@r-project.org
>> >> >> Sent: Thursday, May 14, 2015 7:53:03 AM
>> >> >> Subject: Re: [Bioc-devel] Use and Usability metrics / shields
>> >> >>
>> >> >> The common shield convention is to use blue or orange when the
>> >> >> information
>> >> >> is not qualitatively good or bad, but the color choice is just
>> >> >> subjective
>> >> >> in the end.
>> >> >
>> >> > It does seem though that we should indicate the non-changing nature
>> >> > of these shields with some kind of color change. Perhaps we can
>> >> > come up with one that works with the other design elements on the
>> >> > page.
>> >> >
>> >> > BTW, the 'posts' tag does change color; if there are 0 posts tagged
>> >> > with a package name, the shield is yellow; otherwise it's green.
>> >> >
>> >> > Dan
>> >> >
>> >> >
>> >> >>
>> >> >> On Thu, May 14, 2015 at 10:47 AM, Martin Morgan
>> >> >> 
>> >> >> wrote:
>> >> >>
>> >> >> > On 05/10/2015 11:39 AM, COMMO Frederic wrote:
>> >> >> >
>> >> >> >> Dear Martin,
>> >> >> >>
>> >> >> >> All of these suggestions sound good.
>> >> >> >>
>> >> >> >> Wolfgang's suggestion regarding possible associated papers
>> >> >> >> might
>> >> >> >> be also
>> >> >> >> great.
>> >> >> >>
>> >> >> >> Another useful information would be to point to other
>> >> >> >> publications
>> >> >> >> where
>> >> >> >> a given package was used, and cited.
>> >> >> >> I don't know if it's technically possible, but it would be
>> >> >> >> greatly
>> >> >> >> informative to know how frequently a package is used, and how
>> >> >> >> it
>> >> >> >> performs,
>> >> >> >> in real contexts.
>> >> >> >>
>> >> >> >> Frederic Commo
>> >> >> >> Bioinformatics, U981
>> >> >> >> Gustave Roussy
>> >> >> >>
>> >> >> >> 
>> >> >> >> De : Bioc-devel [bioc-devel-boun...@r-project.org] de la part
>> >> >> >> de
>> >> >> >> Wolfgang Huber [whu...@embl.de]
>> >> >> >> Date d'envoi : samedi 9 mai 2015 19:57
>> >> >> >> À : Martin Morgan
>> >> >> >> Cc: bioc-devel@r-project.org
>> >> >> >> Objet : Re: [Bioc-devel] Use and Usability metrics / shields
>> >> >> >>
>> >> >> >> Dear Martin
>> >> >> >

Re: [Bioc-devel] Use and Usability metrics / shields

2015-06-11 Thread Dan Tenenbaum


- Original Message -
> From: "Leonardo Collado Torres" 
> To: "Dan Tenenbaum" 
> Cc: bioc-devel@r-project.org
> Sent: Thursday, June 11, 2015 9:26:13 PM
> Subject: Re: [Bioc-devel] Use and Usability metrics / shields
> 
> Is it ok if we use the Bioconductor shields elsewhere? For example,
> in
> a github repo landing page or in our website listing the software
> we've contributed to.
> 
> In particular, I'm thinking of adding
> http://www.bioconductor.org/shields/posts/derfinder.svg (and the
> other
> shields) to https://github.com/lcolladotor/derfinder to go along the
> Travis CI shield I use right now.
> 

Feel free, that was part of the idea.

Dan


> On Thu, Jun 4, 2015 at 11:22 AM, Jim Hester
>  wrote:
> > Henrik,
> >
> > While I proposed the idea for the shields/badges Dan gets all the
> > credit
> > for the implementation.
> >
> > As far as your (implied) idea of a coverage badge, the thought had
> > occurred
> > to us!
> >
> > Jim
> >
> > On Wed, May 20, 2015 at 1:34 PM, Henrik Bengtsson
> >  >> wrote:
> >
> >> So, lots of things are happening in a few months: Jim Hester
> >> starts
> >> working at Bioconductor, we get Bioc shields/badges, Jim's covr
> >> package is released on CRAN, snare drum, ...  am I to eager if I
> >> already now start wishing for a hi-hat as well?
> >>
> >> /Henrik
> >>
> >> On Tue, May 19, 2015 at 12:47 PM, Dan Tenenbaum
> >> 
> >> wrote:
> >> >
> >> >
> >> > - Original Message -
> >> >> From: "Leonardo Collado Torres" 
> >> >> To: "Dan Tenenbaum" 
> >> >> Cc: "Jim Hester" ,
> >> >> bioc-devel@r-project.org
> >> >> Sent: Tuesday, May 19, 2015 12:37:18 PM
> >> >> Subject: Re: [Bioc-devel] Use and Usability metrics / shields
> >> >>
> >> >> Regarding the 'posts' tag, I can see that it includes a "closed
> >> >> questions" component. For example,
> >> >> http://www.bioconductor.org/packages/release/bioc/html/derfinder.html
> >> >> is 3/1/9/0 right now meaning that 0 questions are closed. From
> >> >> https://support.bioconductor.org/info/faq/, only moderators can
> >> >> close
> >> >> questions. That seems like quite a bit of work for the
> >> >> moderators. So
> >> >> maybe it would be best to drop the "closed questions"
> >> >> component. Or
> >> >> alternatively, can the author of a package moderate the posts
> >> >> that
> >> >> have a tag corresponding to their package?
> >> >>
> >> >
> >> > Perhaps the wording is wrong; what 'closed' is supposed to mean
> >> > is that
> >> the original poster has accepted an answer. I'll change 'closed'
> >> to
> >> 'accepted'.
> >> >
> >> >
> >> >> As for 'build: warnings', it seems like it will show for some
> >> >> devel
> >> >> packages all the time. For example,
> >> >> http://www.bioconductor.org/packages/devel/bioc/html/regionReport.html
> >> >> always has a warning in the Windows build machine due to a
> >> >> mismatch
> >> >> in
> >> >> the version of Rtools installed.
> >> >
> >> > This is a bug in devtools and may have already been fixed (but
> >> > not yet
> >> propagated to CRAN).
> >> > IMO this should be reflected in the build shield.
> >> >
> >> > Dan
> >> >
> >> >
> >> >>
> >> >> I do like these changes and the addition of shields =)
> >> >>
> >> >>
> >> >> On Thu, May 14, 2015 at 10:56 AM, Dan Tenenbaum
> >> >>  wrote:
> >> >> >
> >> >> > - Original Message -
> >> >> >> From: "Jim Hester" 
> >> >> >> To: "Martin Morgan" 
> >> >> >> Cc: bioc-devel@r-project.org
> >> >> >> Sent: Thursday, May 14, 2015 7:53:03 AM
> >> >> >> Subject: Re: [Bioc-devel] Use and Usability metrics /
> >> >> >> shields
> >> >> >>
> >> >> >> The common shield convention is to use blue or orange when
> >> >> >> the
> >> >> >> information
> >> >> >> is not qualitatively good or bad, but the color choice is
> >> >> >> just
> >> >> >> subjective
> >> >> >> in the end.
> >> >> >
> >> >> > It does seem though that we should indicate the non-changing
> >> >> > nature
> >> >> > of these shields with some kind of color change. Perhaps we
> >> >> > can
> >> >> > come up with one that works with the other design elements on
> >> >> > the
> >> >> > page.
> >> >> >
> >> >> > BTW, the 'posts' tag does change color; if there are 0 posts
> >> >> > tagged
> >> >> > with a package name, the shield is yellow; otherwise it's
> >> >> > green.
> >> >> >
> >> >> > Dan
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> On Thu, May 14, 2015 at 10:47 AM, Martin Morgan
> >> >> >> 
> >> >> >> wrote:
> >> >> >>
> >> >> >> > On 05/10/2015 11:39 AM, COMMO Frederic wrote:
> >> >> >> >
> >> >> >> >> Dear Martin,
> >> >> >> >>
> >> >> >> >> All of these suggestions sound good.
> >> >> >> >>
> >> >> >> >> Wolfgang's suggestion regarding possible associated
> >> >> >> >> papers
> >> >> >> >> might
> >> >> >> >> be also
> >> >> >> >> great.
> >> >> >> >>
> >> >> >> >> Another useful information would be to point to other
> >> >> >> >> publications
> >> >> >> >> where
> >> >> >> >> a given package was used, and cited.
> >> >> >> >> I don't know if it's technically pos