Re: [Bioc-devel] Windows-specific build error of 'seqCAT' package

2019-03-16 Thread Valerie Obenchain via Bioc-devel
It doesn't look like it was the patch - the error is still there. I'll look
into this some more.
Valerie
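
A quick way to probe this class of failure (a generic debugging sketch,
not the eventual fix) is to parse the header directly with
VariantAnnotation and look at the DataFrames the validity check complains
about:

## info() and geno() on a VCFHeader should each be a 3-column DataFrame
## (Number, Type, Description); a malformed header trips the validity
## error quoted below.
library(VariantAnnotation)

fl  <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")
hdr <- scanVcfHeader(fl)
info(hdr)
geno(hdr)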

On Fri, Mar 15, 2019 at 11:10 AM Valerie Obenchain 
wrote:

> Hi Erik,
> I think the problem was introduced in a contributed patch applied to
> VariantAnnotation in devel. I've reverted the patch and expect
> VariantAnnotation (and downstream packages) to clear up on Windows with
> tomorrow's builds.
> Valerie
>
> On Thu, Mar 7, 2019 at 4:10 AM Erik Fasterius 
> wrote:
>
>> Hi,
>>
>> My seqCAT package recently stopped building on the Windows platform, which
>> I assume has to do with changes to the VariantAnnotation package (upon
>> which seqCAT depends). The error (which occurs during creation of the
>> vignette) looks like this:
>>
>> Quitting from lines 102-112 (seqCAT.Rmd)
>> Error: processing vignette 'seqCAT.Rmd' failed with diagnostics:
>> invalid class "VCFHeader" object: 1: 'info(VCFHeader)' must be a 3 column
>> DataFrame with names Number, Type, Description
>> invalid class "VCFHeader" object: 2: 'geno(VCFHeader)' must be a 3 column
>> DataFrame with names Number, Type, Description
>> --- failed re-building 'seqCAT.Rmd'
>>
>> It builds and completes all checks successfully on all other platforms,
>> though. I have no experience with Windows, so I don’t really know how to
>> even start debugging this issue (I’m working on OS X). Given that other
>> platforms are working fine I’m thinking (hoping) that this is some
>> incompatibility with recent changes to VariantAnnotation (or its
>> dependencies) and the Windows platform, but I have no clue if this is the
>> case.
>>
>> Does anybody have any idea as to what the issue is here, or any tips on
>> how to debug something on a platform other than the one you are using
>> yourself?
>>
>> Erik
>>
>



Re: [Bioc-devel] Windows-specific build error of 'seqCAT' package

2019-03-15 Thread Valerie Obenchain via Bioc-devel
Hi Erik,
I think the problem was introduced in a contributed patch applied to
VariantAnnotation in devel. I've reverted the patch and expect
VariantAnnotation (and downstream packages) to clear up on Windows with
tomorrow's builds.
Valerie

On Thu, Mar 7, 2019 at 4:10 AM Erik Fasterius 
wrote:

> Hi,
>
> My seqCAT package recently stopped building on the Windows platform, which I
> assume has to do with changes to the VariantAnnotation package (upon which
> seqCAT depends). The error (which occurs during creation of the
> vignette) looks like this:
>
> Quitting from lines 102-112 (seqCAT.Rmd)
> Error: processing vignette 'seqCAT.Rmd' failed with diagnostics:
> invalid class "VCFHeader" object: 1: 'info(VCFHeader)' must be a 3 column
> DataFrame with names Number, Type, Description
> invalid class "VCFHeader" object: 2: 'geno(VCFHeader)' must be a 3 column
> DataFrame with names Number, Type, Description
> --- failed re-building 'seqCAT.Rmd'
>
> It builds and completes all checks successfully on all other platforms,
> though. I have no experience with Windows, so I don’t really know how to
> even start debugging this issue (I’m working on OS X). Given that other
> platforms are working fine I’m thinking (hoping) that this is some
> incompatibility with recent changes to VariantAnnotation (or its
> dependencies) and the Windows platform, but I have no clue if this is the
> case.
>
> Does anybody have any idea as to what the issue is here, or any tips on
> how to debug something on a platform other than the one you are using
> yourself?
>
> Erik
>
>



[Bioc-devel] Bioconductor 3.7 release candidate

2018-04-25 Thread Valerie Obenchain
Hi developers,

Today is the deadline for updating NEWS files and marks the release 
candidate for Bioconductor 3.7. After today, pushes to master should be 
bug fixes only.

   https://www.bioconductor.org/developers/release-schedule/

Thanks.

Valerie




[Bioc-devel] 3.5 branch created in svn

2017-04-24 Thread Valerie Obenchain

The BioC 3.5 branch is now ready.

Remember, you always have access to 2 versions of your package in svn:
the "release" and "devel" versions.

Right now the "release" version of your package (which is not officially 
released yet but will be tomorrow if everything goes well) is in the 3.5 branch and 
accessible at:

 
https://hedgehog.fhcrc.org/bioconductor/branches/RELEASE_3_5/madman/Rpacks/


Only bug fixes and documentation improvements should go here.

As always the "devel" version of your package is at:

   https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/


Normal development of your package can resume here.

Similarly for experiment packages, where the "release" version of your package 
is at:

 
https://hedgehog.fhcrc.org/bioc-data/branches/RELEASE_3_5/experiment/pkgs/


and the "devel" version at:

   https://hedgehog.fhcrc.org/bioc-data/trunk/experiment/pkgs/


Please let us know if you have any questions.


Valerie



Re: [Bioc-devel] Bioconductor 3.5 release: April 7 deadlines

2017-04-21 Thread Valerie Obenchain
Yes, that's what the API freeze means. The goal is to prevent last 
minute changes in chimeraviz from breaking downstream packages that 
depend on chimeraviz. By the way, the API freeze was Friday, April 7.


Valerie


On 04/21/2017 03:21 AM, Stian Lågstad wrote:
I've added a new function to my package chimeraviz (so far it's in a 
separate branch and not committed to Bioconductor). Does the API freeze 
mean that I have to wait until the Bioconductor release before I 
commit it to the devel branch?


On Fri, Apr 7, 2017 at 2:38 AM, Valerie Obenchain wrote:


Hi,

Some recent activity has caused some (red) ripples across the
builds and the El Capitan Mac builds are still unsettled. To
accommodate, we're extending the deadline to pass R CMD build and
check with no errors to next Friday
(http://www.bioconductor.org/developers/release-schedule/).
Thanks to everyone for cleaning up their packages and
communicating problems (and solutions!) on the lists.

Key deadlines for April 7 are

- No new packages added to the 3.5 roster

- No more API changes

- Current release 3.4 builds will stop

- Annotation packages (both internal and contributed) posted to
the 3.5 repo

Valerie





--
Stian Lågstad
+47 41 80 80 25



[Bioc-devel] Bioconductor 3.5 release: April 7 deadlines

2017-04-06 Thread Valerie Obenchain

Hi,

Some recent activity has caused some (red) ripples across the builds and 
the El Capitan Mac builds are still unsettled. To accommodate, we're 
extending the deadline to pass R CMD build and check with no errors to 
next Friday (http://www.bioconductor.org/developers/release-schedule/). 
Thanks to everyone for cleaning up their packages and communicating 
problems (and solutions!) on the lists.


Key deadlines for April 7 are

- No new packages added to the 3.5 roster

- No more API changes

- Current release 3.4 builds will stop

- Annotation packages (both internal and contributed) posted to the 3.5 repo

Valerie



Re: [Bioc-devel] VariantAnnotation: filterVcf with chunks duplicates header

2015-08-07 Thread Valerie Obenchain
Thanks for the report and reproducible example. Now fixed in 1.14.8 and 
1.15.22.


Valerie
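
For installations predating the fix, a user-side workaround is to
deduplicate the repeated header blocks after the fact. A minimal sketch,
assuming the chunked output file 'out1' from the example below:

## Keep only the first occurrence of each header line ("##..." meta
## lines and the "#CHROM" column header); keep all record lines.
tx   <- readLines(out1)
isH  <- grepl("^#", tx)
keep <- rep(TRUE, length(tx))
keep[isH] <- !duplicated(tx[isH])
writeLines(tx[keep], tempfile(fileext = ".vcf"))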



On 08/07/2015 07:25 AM, Julian Gehring wrote:

Hi,

When I use the 'filterVcf' function to process a VCF file in chunks, the
output file contains as many copies of the header as there are
chunks. Consider the example, adapted from the 'filterVcf' man page:

   library(VariantAnnotation)

   fl <- system.file(package = "VariantAnnotation", "extdata", "chr22.vcf.gz")
   filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP"))

   ## Control: Processing without chunking
   t0 <- TabixFile(fl) ## yieldSize -> NA
   out0 <- filterVcf(t0, "hg19", tempfile(), filters = filt)

   tx0 = readLines(out0)
   header_lines0 = grep("^##", tx0)
   ## 30 header lines, in line 1,..,30

   ## Case: Processing in 3 chunks
   t1 <- TabixFile(fl, yieldSize = 5e3) ## gives us 3 chunks here
   out1 <- filterVcf(t1, "hg19", tempfile(), filters = filt)

   tx1 = readLines(out1)
   header_lines1 = grep("^##", tx1)
   ## 90 header lines, header 3 times duplicated
   ## 1) 1,..,30, 2) 4827,..,4856, 3) 9673,..,9702

It seems that for each chunk, a complete VCF file including the header
gets written.  See the relevant part of VariantAnnotation:::.filter:

while (nrow(vcfChunk <- readVcf(tbxFile, genome, ..., param = param))) {
  vcfChunk <- subsetByFilter(vcfChunk, filters)
  writeVcf(vcfChunk, filtered)
}

For the processing of VCFs with large headers in many chunks
(e.g. 1000genomes callsets), this can result in the paradox situation
that the filtered file ends up being significantly larger than the
original.

Best wishes
Julian



R version 3.2.1 (2015-06-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] VariantAnnotation_1.14.6 Rsamtools_1.20.4         Biostrings_2.36.1
[4] XVector_0.8.0            GenomicRanges_1.20.5     GenomeInfoDb_1.4.1
[7] IRanges_2.2.5            S4Vectors_0.6.3          BiocGenerics_0.14.0

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.30.1    zlibbioc_1.14.0         GenomicAlignments_1.4.1
 [4] BiocParallel_1.2.9      BSgenome_1.36.3         tools_3.2.1
 [7] Biobase_2.28.0          DBI_0.3.1               lambda.r_1.1.7
[10] futile.logger_1.4.1     rtracklayer_1.28.6      futile.options_1.0.0
[13] bitops_1.0-6            RCurl_1.95-4.7          biomaRt_2.24.0
[16] RSQLite_1.0.0           GenomicFeatures_1.20.1  XML_3.98-1.3





--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158



Re: [Bioc-devel] testing class and length of function args

2015-07-22 Thread Valerie Obenchain
There is a collection in S4Vectors that test atomic types and return a 
logical.


isSingleInteger
isSingleNumber
isSingleNumberOrNA
isSingleString
isSingleStringOrNA
isTRUEorFALSE


> isSingleNumber(1:5)
[1] FALSE
> isSingleNumber(NA)
[1] FALSE


Val
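
These only return a logical. A sketch of the one-call class-plus-length
check with a stop message that Mike asks about below (checkNumericN is a
hypothetical helper, not an existing core function):

checkNumericN <- function(x, n) {
    ## test type and length together, fail with an informative message
    if (!(is.numeric(x) && length(x) == n))
        stop("'", deparse(substitute(x)), "' must be numeric of length ", n)
    invisible(x)
}

foo <- function(x) {
    checkNumericN(x, 2)
    c(x[2], x[1])
}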


On 07/22/2015 04:22 PM, Jim Hester wrote:

Not sure about within Bioconductor but Hadley has a package to do this.

https://github.com/hadley/assertthat

On Wed, Jul 22, 2015 at 4:13 PM, Michael Love 
wrote:


it's slightly annoying to write

foo <- function(x) {
   if ( ! is.numeric(x) ) stop("x should be numeric")
   if ( ! length(x) == 2 ) stop("x should be length 2")
   c(x[2], x[1])
}

i wonder if we could have some core functions that test the class and
the length in one and give the appropriate stop message.

maybe this exists already

-Mike








--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158



Re: [Bioc-devel] VariantAnnotation: verbose output of readVcf

2015-07-16 Thread Valerie Obenchain

Hi Julian,

Yes, the behavior is intentional though I hadn't thought of the annoying 
chatter in the case of chunking. Sorry about that.


The current readVcf() reads/parses fields according to header 
multiplicity and type; fields without headers are skipped. It is on the 
TODO to be more liberal in reading (especially FORMAT) fields.


Quite a number of people are (a) unaware of header lines and (b) have 
vcf files with incomplete headers. It's confusing to those with 
incomplete headers why all fields aren't read in. So, printing the 
"found" fields was an attempt to communicate which would be read / 
parsed by readVcf().


I've made the following changes in 1.15.21:
- added a 'Header lines' section to the man page to explain this further
- added a 'verbose' arg to readVcf(); when TRUE, the fields found in the
header are printed (see the sketch below)
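
A usage sketch, reusing the TabixFile 'tab' and ScanVcfParam 'param' from
Julian's chunking example below (assumes VariantAnnotation >= 1.15.21):

## no per-chunk header chatter inside a loop
vcf <- readVcf(tab, "hg19", param = param, verbose = FALSE)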


Valerie




On 07/16/2015 02:35 AM, Julian Gehring wrote:

Hi,

In recent versions of 'VariantAnnotation', the 'readVcf' function prints
information about the header lines:

   fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
   vcf <- readVcf(fl, "hg19")

shows

   found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER
   found header lines for 6 ‘info’ fields: NS, DP, AF, AA, DB, H2
   found header lines for 4 ‘geno’ fields: GT, GQ, DP, HQ

When one reads a VCF in chunks, this gets displayed once per chunk:

   fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
   param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"), info=c("LDAF"))
   tab <- TabixFile(fl, yieldSize=4000)
   open(tab)
   while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
 cat("vcf dim:", dim(vcf_yield), "\n")

   found header lines for 1 ‘fixed’ fields: ALT
   found header lines for 1 ‘info’ fields: LDAF
   found header lines for 2 ‘geno’ fields: GT, GL
   vcf dim: 5 3
   found header lines for 1 ‘fixed’ fields: ALT
   found header lines for 1 ‘info’ fields: LDAF
   found header lines for 2 ‘geno’ fields: GT, GL
   vcf dim: 5 3
   found header lines for 1 ‘fixed’ fields: ALT
   found header lines for 1 ‘info’ fields: LDAF
   found header lines for 2 ‘geno’ fields: GT, GL
   vcf dim: 5 3
   found header lines for 1 ‘fixed’ fields: ALT
   found header lines for 1 ‘info’ fields: LDAF
   found header lines for 2 ‘geno’ fields: GT, GL

For larger files, this gets a bit cumbersome. It looks to me like debug
information. Is this behavior intentional?

Best wishes
Julian





--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158



Re: [Bioc-devel] Memory issues with BiocParallel::SnowParam()

2015-07-15 Thread Valerie Obenchain

Correction below.

On 07/15/2015 04:14 PM, Valerie Obenchain wrote:

Hi,

BiocParallel in release and devel are quite similar so I'd like to
narrow the focus to "before the changes to SnowParam" and after. This
means BiocParallel 1.0.3 (R 3.1) vs BiocParallel 1.3.34 (R 3.2.1) which
is the current devel.

(1) master vs worker memory

I'm more concerned about memory use on the master than the workers. This
is where the code has changed the most and where the data are touched
the most, ie, split into tasks and divided among workers. So it's the
numbers in the mem_email files instead of the logs that I'm more focused
on right now.

(2) virtual vs actual memory

The mem_email files show the virtual memory requested (Max vmem) for the
job but not actual memory used. Can you output the actual used? That
would be helpful.

It does look like more memory is requested (not necessarily used) in
BiocParallel 1.3.34 vs 1.0.3.

(3) SGE and mem_free

This is more of an fyi -

It was my understanding that once Grid engine offered the mem_free
option, h_vmem was no longer necessary and was actively discouraged by many
cluster admins.


A Grid engine update can enable cgroups which allow mem_free to be 
enforced, ie, can kill jobs that exceed specified memory. It's in this 
case where mem_free serves the greatest purpose and h_vmem isn't so 
useful. Not sure if your cluster has cgroups enabled.


Valerie



mem_free is used to track the available physical memory on the node;
when a job is submitted to that node, SGE subtracts the requested value
from the available memory. h_vmem sets a threshold on the virtual
memory requested - a job is (usually) killed when the requested memory
exceeds this amount. Because many apps request more memory than they use,
scheduling by h_vmem makes the cluster very inefficient, as you
are reserving chunks of memory that never get used.

You may want to follow up with your cluster admin about this. Maybe
you've already gone down this road and there are relatively small
default memory allocations per slot so it's necessary to specify h_vmem.


I like how you've made information related to this bug report available
in github. Having that space to record thoughts, store log/memory files
and provide code really works well.

Valerie



On 07/14/2015 01:33 PM, Leonardo Collado Torres wrote:

Hi Valerie,

I have re-run my two examples twice using "log = TRUE" and updated the
output at http://lcolladotor.github.io/SnowParam-memory/ As I was
writing this email (all morning...), I made a 4th run where I save the
gc() information to compare against R 3.1.x That fourth run kind of
debunked what I was taking away from replicate runs 2 and 3.



## Between runs


Between runs there's almost no variability with R 3.1.x in the memory
used, a little bit with 3.2.1 and more with R 3.2.0. The variability
shown could be due to using BiocParallel 1.2.9 in runs 2 to 4 vs 1.2.7
in run 1 for R 3.2.0 and 1.3.31 vs 1.3.34 in R 3.2.1.



## gc() output differences

Now, from the gc() output from using "log = TRUE", I see hardly any
differences in the first example between R:

* 3.2.0
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.2.o6463210

* 3.2.1
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.2.x.o6463213


which makes sense given that the memory was the same

* 3.2.0
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.2.txt#L24

* 3.2.1
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.2.x.txt#L24


However, I do notice a large difference versus R 3.1.x

* log
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.1.x.o6463536

* mem info
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.1.x.txt#L50




In the derfinder example, the memory goes from 10.9 GB in R 3.2.0 to
12.87 GB in R 3.2.1 (run 2) with SnowParam(). The 18% increase
reported there is kind of similar to the increase from comparing the
max used mb output from gc():

# R 3.2.1 vs 3.2.0
* gc() max mem used mb ratio: (303.6 + 548.9) / (251.3 + 547.6) =~ 1.07
* max mem used in GB from cluster email:12.871 / 10.904  =~ 1.18

 From this observation, maybe using "log = TRUE" and checking that the
memory used reported by gc() goes down will be useful for testing
changes in BiocParallel() to see if they increase/decrease memory use.
When I run the tests locally (using 2 cores), I get similar ratios;
data at
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/local_run.txt.



However, the same type of comparison, now between R 3.2.1 and R
3.1.x, shows that the numbers can be off, although the 2.8 ratio is
closer to what I saw in my analysis scenario ( > 2.5).

# R 3.2.1 vs 3.1.x
* gc() max mem used mb ratio: (303.6 + 548.9) / (236 + 66) =~ 2.83
* max mem used in GB from clust

Re: [Bioc-devel] Memory issues with BiocParallel::SnowParam()

2015-07-15 Thread Valerie Obenchain
ory/blob/gh-pages/logs/der-snow-3.1.x.o6463545#L10-L11
# cluster mem info
* 3.2.1 
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/der-snow-3.2.x.txt#L24
* 3.2.0 
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/der-snow-3.2.txt#L24
* 3.1.x 
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/der-snow-3.1.x.txt#L50






## Max mem vs gc()


Comparing the memory used in GB from the cluster email to the max used
mb from gc() multiplied by 10 (number of cores used), I see that the
ratio is somewhat consistent.


# first example, R 3.2.0, SnowParam() run 2
7.285 * 1024 / (10 * (23.9 + 461.8)) =~ 1.54
# first example, R 3.2.1, SnowParam(), run 2
7.286 * 1024 / (10 * (24 + 461.8)) =~ 1.54

# derfinder example, R 3.2.0, SnowParam(), run 2
10.904 * 1024 / (10 * (251.3 + 547.6)) =~ 1.4
# derfinder example, R 3.2.1, SnowParam(), run 2
12.871 * 1024 / (10 * (303.6 + 548.9)) =~ 1.55

# derfinder example, R 3.2.0, MulticoreParam(), run 2
13.789 * 1024 / (10 * (230.8 + 770.2)) =~ 1.41
# derfinder example, R 3.2.1, MulticoreParam(), run 2
13.671 * 1024 / (10 * (245.9 + 757.5)) =~ 1.4

And from this other observation, maybe I can use the gc() output from
"log = TRUE" to get an idea if the memory use reported by my cluster
is in line with previous runs, or if there's a cluster issue. This
ratio could also be used to compare different cluster environments to
see which ones are reporting greater/lower memory use.


However, the above ratios are different with R 3.1.x

# first example, R 3.1.x, SnowParam(), run 4
5.036 * 1024 / (10 * (32.1 + 218.5)) =~ 2.06

# derfinder example, R 3.1.x, SnowParam(), run 4
7.175 * 1024 / (10 * (236 + 66)) =~ 2.43
# derfinder example, R 3.1.x, MulticoreParam(), run 4
8.473 * 1024 / (10 * (240.7 + 189.1)) =~ 2.02


I'm not sure if this is a hint, but the largest difference between R
3.1.x and the other two is in the max used mb from Vcells.




Summarizing, I was thinking that

(A) we could use the output from gc() to compare between versions and
check which changes lowered the memory required,
and (B) estimate the actual memory needed as measured by the cluster
as well as compare cluster environments.

However, the gc() numbers from R 3.1.x (only available on the 4th
replicate run) don't seem to support these thoughts.



Or do you interpret these numbers differently?



Best,
Leo

On Sun, Jul 12, 2015 at 11:00 AM, Valerie Obenchain
 wrote:

Hi Leo,

Thanks for the sample code I'll take a look.

You're right, SnowParam has changed quite a bit - logging, error handling
etc. The memory use you're seeing is a concern - thanks for reporting it.

As an fyi, the log output for SnowParam and MulticoreParam now includes
gc(), system.time() and other stats from the workers.

SnowParam(log = TRUE)


Valerie




On 07/10/2015 01:12 PM, Leonardo Collado Torres wrote:


Hi,

I ran my example code with SerialParam(), which had a negligible 4%
memory increase between R 3.2.x and 3.1.x. This 4% could very well
fluctuate a little bit and might not be significantly different from 0
if I run the test more times.

I also added a second example using code based on my analysis script.
With SerialParam(), the memory change is 13%, but with SnowParam()
it's 82% between the R versions mentioned already using 10 cores. It's
still far from the > 150% increase (2.5 fold change) I'm seeing with
the real data.

I initially thought that these observations ruled out everything else
except SnowParam(). However, maybe the initial 13% memory increase
multiplied by 10 (well, less then linear) is what I'm seeing with 10
cores (82% increase).

The updated information is available at
http://lcolladotor.github.io/SnowParam-memory/



As for what Vincent suggested of an AMI and EC2, I don't have
experience with them. I'm not sure I'll be able to look into them and
create a reproducible environment.


Cheers,
Leo

On Fri, Jul 10, 2015 at 7:12 AM, Vincent Carey
 wrote:


I have had (potentially transient and environment-related) problems with
bplapply
in gQTLstats.   I substituted the foreach abstractions and the code
worked.
I still
have difficulty seeing how to diagnose the trouble I ran into.

I'd suggest that you code so that you can easily substitute parallel- or
foreach- or
BatchJobs-based cluster control.  This can help crudely isolate the
source
of trouble.

It would be very nice to have a way of measuring resource usage in
cluster
settings,
both for diagnosis and strategy selection.  For jobs that succeed,
BatchJobs
records
memory used in its registry database, based on gc().  I would hope that
there are
tools that could be used to help one figure out how to factor a task so
that
it is feasible
given some view of environment constraints.

It might be useful for you to build an AMI and then a cluster that allows
replication of
the condition you are seeing on EC2.  This could hel

Re: [Bioc-devel] Redirect workers output to STDERR, how to do so with current BiocParallel::SnowParam()?

2015-07-14 Thread Valerie Obenchain

Hi,

Thanks for sending the updated information. To make sure I know how the 
files relate, let's take snow-3.2.x, BiocParallel 1.3.34 run on July 14 
as an example.


memory files:

The mem_email files report max virtual memory (Max vmem) on the master, 
for the whole job. I see 3 outputs here - one run from July 9 and two 
from July 14:


https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/mem_emails/snow-3.2.x.txt

This last entry corresponds to BiocParallel 1.3.34 from July 14:


Job 6463239 (snow-3.2.x) Complete
 User = lcollado
 Queue= share...@compute-082.cm.cluster
 Host = compute-082.cm.cluster
 Start Time   = 07/14/2015 12:20:48
 End Time = 07/14/2015 12:22:27
 User Time= 00:00:18
 System Time  = 00:00:04
 Wallclock Time   = 00:01:39
 CPU  = 00:10:14
 Max vmem = 7.286G
 Exit Status  = 0



log files:

Corresponding log files for BiocParallel 1.3.34 on July 14 show memory 
used on workers:


https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.2.x.o6463213
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/snow-3.2.x.o6463239



On 07/14/2015 10:23 AM, Leonardo Collado Torres wrote:

Hi Valerie,

My other recent thread about SnowParam
(https://stat.ethz.ch/pipermail/bioc-devel/2015-July/007788.html)
allowed me to see a small difference.

In R 3.1.x, using 'outfile' as in
https://github.com/lcolladotor/SnowParam-memory/blob/d9d70086016c7e720714bec48f121aeca05b1416/SnowParam-memory-derfinder.R#L83
showed more info than in R 3.2.0 and 3.2.1 as shown below:

* R 3.1.x 
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.1.x.e6416599
* R 3.2.0 
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.e6416601
* R 3.2.1 
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.x.e6416603

using BiocParallel versions

* 1.0.3 
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.1.x.o6416599#L57
* 1.2.8 
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.o6416601#L58
* 1.3.33 
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.x.o6416603#L58

Basically, only some of the message() calls get printed in stderr in R
3.2.0 and 3.2.1 which is something I wasn't expecting. However, using
"log = TRUE", all the message() calls get printed in stdout, as
expected given the docs, in which case using 'outfile' is kind of
useless.


The idea was to phase out the snow-centric 'outfile' and replace it with 
'log' and 'logdir'. You're right, previous versions of BiocParallel did 
not capture stdout and stderr on the workers; the current release and 
devel do. As you've now seen, when log = TRUE stderr and stdout messages 
print to the console, if logdir is given they go to a file. One thing I 
have not done yet is capture stdout and stderr from the master to a 
'logdir' file (workers captured only).
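
A minimal sketch of the 'log'/'logdir' pair that replaces 'outfile'
(assuming the current devel BiocParallel discussed here):

## worker stdout/stderr are captured when log = TRUE; with 'logdir' set
## they go to one file per task instead of the console
library(BiocParallel)
p <- SnowParam(workers = 2, log = TRUE, logdir = tempdir())
res <- bplapply(1:2, function(i) { message("worker ", i); sqrt(i) }, BPPARAM = p)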




See below for the output under R 3.2.0 when log = TRUE
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.o6463201
https://github.com/lcolladotor/SnowParam-memory/blob/gh-pages/logs/der-snow-3.2.e6463201

Assuming that it's all working as intended, I'll switch to using log =
TRUE in some of my code when users specify 'verbose = TRUE'.


Sounds good.



Cheers,
Leo

PS In page 1 of
http://www.bioconductor.org/packages/3.2/bioc/vignettes/BiocParallel/inst/doc/Errors_Logs_And_Debugging.pdf
there's a typo. It reads "SnowParma" instead of "SnowParam".


Thanks, I'll fix that.

Valerie



On Fri, May 22, 2015 at 5:33 PM, Valerie Obenchain
 wrote:

Hi,

Thanks for reporting the bug. Now fixed in 1.13.4 (devel) and 1.2.2
(release).


bplapply(1:2, print, BPPARAM=SnowParam(outfile = NULL))

starting worker for localhost:11031
starting worker for localhost:11031
Type: EXEC
[1] 1
Type: EXEC
[1] 2
Type: DONE
Type: DONE
[[1]]
[1] 1

[[2]]
[1] 2


Some new features have been added to BiocParallel - logging with
futile.logger, writing out log and result files and control over how a job
is divided into tasks.

## send log file to 'logdir':
bplapply(1:2, print, BPPARAM=SnowParam(log=TRUE, logdir=tempdir()))

## write results to 'resdir':
bplapply(1:2, print, BPPARAM=SnowParam(resdir=tempdir()))

## by default jobs are divided evenly over the workers
bplapply(1:100, print, BPPARAM=SnowParam(workers=4))

## force the job to be run in 2 tasks
## (useful when you know runtime or memory requirements)
bplapply(1:100, print, BPPARAM=SnowParam(tasks=2))


It would be great if you had a chance to try any of these out - I'd be
interested in the feedback. Logging was intended to take the idea of the
'outfile' from snow further with the ability to add 

Re: [Bioc-devel] Memory issues with BiocParallel::SnowParam()

2015-07-12 Thread Valerie Obenchain

Vince,

On 07/10/2015 04:12 AM, Vincent Carey wrote:

I have had (potentially transient and environment-related) problems with
bplapply
in gQTLstats.


Was the problem during build or check where a man page example or unit 
test could be isolated as the problem?



  I substituted the foreach abstractions and the code

worked.  I still
have difficulty seeing how to diagnose the trouble I ran into.

I'd suggest that you code so that you can easily substitute parallel- or
foreach- or
BatchJobs-based cluster control.  This can help crudely isolate the source
of trouble.

It would be very nice to have a way of measuring resource usage in cluster
settings,
both for diagnosis and strategy selection.


SnowParam and MulticoreParam log output includes gc(), system.time() and 
all messages sent to stdout and stderr. Turn logging on with,


SnowParam(log = TRUE)

If files are more convenient, logs are written to files (one per task)
with 'logdir',

SnowParam(log = TRUE, logdir = tempdir())



 For jobs that succeed,

BatchJobs records
memory used in its registry database, based on gc().  I would hope that
there are
tools that could be used to help one figure out how to factor a task so
that it is feasible
given some view of environment constraints.


Once you have an idea of memory use from the log output you can modify 
how 'X' is divided over the workers with the 'tasks' arg.


A job is defined as the 'X' in bplapply(). A task is the element(s) of 
'X' sent to a worker, eg,


bplapply(X = 1:5, sqrt)


SnowParam()            ## X is divided ~ evenly over max workers
SnowParam(workers = 3) ## X divided ~ evenly over 3 workers
SnowParam(tasks = 5)   ## X divided into 5 tasks
SnowParam(workers = 2, tasks = 3) ## X divided by 3, run on 2 workers
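
A runnable illustration of those divisions (a sketch; the exact
scheduling is up to the backend):

## 100 elements, 2 workers, 4 tasks: each task holds ~25 elements, so
## per-worker peak memory is smaller than one 50-element chunk, at the
## cost of more dispatch overhead
library(BiocParallel)
p <- SnowParam(workers = 2, tasks = 4)
res <- bplapply(1:100, sqrt, BPPARAM = p)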


If you have problems with BiocParallel, no matter how transient or 
difficult to reproduce, please let me know.


Thanks.
Valerie




It might be useful for you to build an AMI and then a cluster that allows
replication of
the condition you are seeing on EC2.  This could help with diagnosis and
might be
a basis for defining better instrumentation tools for both diagnosis and
planning.

On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres 
wrote:


Hi,

I have a script that at some point generates a list of DataFrame
objects which are rather large matrices. I then feed this list to
BiocParallel::bplapply() and process them.

Previously, I noticed that in our SGE managed cluster using
MulticoreParam() lead to 5 to 8 times higher memory usage as I posted
in https://support.bioconductor.org/p/62551/#62877. Martin posted in
https://support.bioconductor.org/p/62551/#62880 that "Probably the
tools used to assess memory usage are misleading you." This could be
true, but they are the tools that determine memory usage for all jobs
in the cluster. Meaning that if my memory usage blows up according to
these tools, my jobs get killed.

That was with R 3.1.x and in particular running

https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
with

$ sh step1-fullCoverage.sh brainspan

which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
I recently tried to reproduce this (to check changes in run time given
rtracklayer's improvements with BigWig files) using R 3.2.x and the
memory went up to 450 GB before the job got killed given the maximum
memory I specified for the job. The same is true using R 3.2.0.

Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (just one
bug fix is different, for other code not used in this script). I know
that BiocParallel changed quite a bit between those versions, and in
particular SnowParam(). So that's why my prime suspect is
BiocParallel.

I made a smaller reproducible example which you can view at
http://lcolladotor.github.io/SnowParam-memory/. This example uses a
list of data frames with random data, and also uses 10 cores. You can
see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam()
does use more memory than SnowParam(), as reported by SGE. Beyond the
actual session info differences due to changes in BiocParalell's
implementation, I noticed that the cluster type changed from PSOCK to
SOCK. I don't know if this could explain the memory increase.

The example doesn't generate the huge fold change between R 3.1.x and
the other two versions (still 1.27x > 1x) that I see with my analysis
script, so in that sense it's not the best example for the problem I'm
observing. My tests with

https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
were between June 23rd and 28th, so maybe some recent changes in
BiocParallel addressed this issue.


I'm not sure how to proceed now. One idea is to make another example
with the same type of objects and operations I use in my analysis
script.

A second one is to run my analysis script with SerialParam() on the
different R versions to check if they use different amounts of memory
which would suggest that the memory issue is not 

Re: [Bioc-devel] Memory issues with BiocParallel::SnowParam()

2015-07-12 Thread Valerie Obenchain

Hi Leo,

Thanks for the sample code I'll take a look.

You're right, SnowParam has changed quite a bit - logging, error 
handling etc. The memory use you're seeing is a concern - thanks for 
reporting it.


As an fyi, the log output for SnowParam and MulticoreParam now includes 
gc(), system.time() and other stats from the workers.


SnowParam(log = TRUE)


Valerie



On 07/10/2015 01:12 PM, Leonardo Collado Torres wrote:

Hi,

I ran my example code with SerialParam(), which had a negligible 4%
memory increase between R 3.2.x and 3.1.x. This 4% could very well
fluctuate a little bit and might not be significantly different from 0
if I run the test more times.

I also added a second example using code based on my analysis script.
With SerialParam(), the memory change is 13%, but with SnowParam()
it's 82% between the R versions mentioned already using 10 cores. It's
still far from the > 150% increase (2.5 fold change) I'm seeing with
the real data.

I initially thought that these observations ruled out everything else
except SnowParam(). However, maybe the initial 13% memory increase
multiplied by 10 (well, less than linear) is what I'm seeing with 10
cores (82% increase).

The updated information is available at
http://lcolladotor.github.io/SnowParam-memory/



As for what Vincent suggested of an AMI and EC2, I don't have
experience with them. I'm not sure I'll be able to look into them and
create a reproducible environment.


Cheers,
Leo

On Fri, Jul 10, 2015 at 7:12 AM, Vincent Carey
 wrote:

I have had (potentially transient and environment-related) problems with
bplapply
in gQTLstats.   I substituted the foreach abstractions and the code worked.
I still
have difficulty seeing how to diagnose the trouble I ran into.

I'd suggest that you code so that you can easily substitute parallel- or
foreach- or
BatchJobs-based cluster control.  This can help crudely isolate the source
of trouble.

It would be very nice to have a way of measuring resource usage in cluster
settings,
both for diagnosis and strategy selection.  For jobs that succeed, BatchJobs
records
memory used in its registry database, based on gc().  I would hope that
there are
tools that could be used to help one figure out how to factor a task so that
it is feasible
given some view of environment constraints.

It might be useful for you to build an AMI and then a cluster that allows
replication of
the condition you are seeing on EC2.  This could help with diagnosis and
might be
a basis for defining better instrumentation tools for both diagnosis and
planning.

On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres 
wrote:


Hi,

I have a script that at some point generates a list of DataFrame
objects which are rather large matrices. I then feed this list to
BiocParallel::bplapply() and process them.

Previously, I noticed that in our SGE managed cluster using
MulticoreParam() led to 5 to 8 times higher memory usage, as I posted
in https://support.bioconductor.org/p/62551/#62877. Martin posted in
https://support.bioconductor.org/p/62551/#62880 that "Probably the
tools used to assess memory usage are misleading you." This could be
true, but they are the tools that determine memory usage for all jobs
in the cluster. Meaning that if my memory usage blows up according to
these tools, my jobs get killed.

That was with R 3.1.x and in particular running

https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
with

$ sh step1-fullCoverage.sh brainspan

which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
I recently tried to reproduce this (to check changes in run time given
rtracklayer's improvements with BigWig files) using R 3.2.x and the
memory went up to 450 GB before the job got killed given the maximum
memory I specified for the job. The same is true using R 3.2.0.

Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (just one
bug fix is different, for other code not used in this script). I know
that BiocParallel changed quite a bit between those versions, and in
particular SnowParam(). So that's why my prime suspect is
BiocParallel.

I made a smaller reproducible example which you can view at
http://lcolladotor.github.io/SnowParam-memory/. This example uses a
list of data frames with random data, and also uses 10 cores. You can
see there that in R versions 3.1.x, 3.2.0 and 3.2.x, MulticoreParam()
does use more memory than SnowParam(), as reported by SGE. Beyond the
actual session info differences due to changes in BiocParalell's
implementation, I noticed that the cluster type changed from PSOCK to
SOCK. I don't know if this could explain the memory increase.

The example doesn't generate the huge fold change between R 3.1.x and
the other two versions (still 1.27x > 1x) that I see with my analysis
script, so in that sense it's not the best example for the problem I'm
observing. My tests with

https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
were between June 23rd and 28th, so 

[Bioc-devel] Bioconductor Newsletter - July 2015

2015-07-01 Thread Valerie Obenchain

July newsletter available at

http://www.bioconductor.org/help/newsletters/2015_July/

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158



Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Valerie Obenchain
I see. So the merge functions would just add the output of 
locateVariants() to the GRanges or VCF used as 'query'. Or I guess, as 
Robert suggested, locateVariants() could return the same object as 'query'.


Val
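
A minimal sketch of such a merge, assuming a query 'vr' (VRanges or
GRanges) and a TxDb 'txdb'; locateVariants() can return several rows per
query, so a real merge needs a policy for multiple hits (here the last
hit wins):

## hypothetical merge: copy the annotation strand from locateVariants()
## back onto the query as a LOCSTRAND metadata column
library(VariantAnnotation)

loc <- locateVariants(vr, txdb, AllVariants())
vr$LOCSTRAND <- "*"
vr$LOCSTRAND[loc$QUERYID] <- as.character(strand(loc))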


On 06/11/2015 10:19 AM, Michael Lawrence wrote:

Val,

I wasn't suggesting that LOCSTRAND be added to the locateVariants()
output. Rather, it would be added to the VRanges during the merge.

Michael

On Thu, Jun 11, 2015 at 10:10 AM, Valerie Obenchain
 wrote:

locateVariants(), predictCoding() and the family of mapToTranscripts()
functions all return strand according to the annotation matched. The only
time the strand of the output could possibly be different from the strand of
the input 'query' is when 'ignore.strand = TRUE' (FALSE by default).

I wouldn't think you (Robert) are using 'ignore.strand = TRUE', are you? By
just using the default, the output will have the same strand as the input
'query' (unless 'query' is '*' of course).

That said, do you still feel it's necessary to add a LOCSTRAND column to the
output?

Val


On 06/11/2015 09:38 AM, Michael Lawrence wrote:


I didn't realize that locateVariants() returned an object with its
strand matching that of the subject. I would have expected the subject
strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
sounds like you want to merge the locateVariants() output with the
input. Merging the output strand as LOCSTRAND on the VRanges sounds
like a reasonable approach, for now. I don't know if Val is listening,
but it sounds like it would be nice to have convenient functions for
merging locateVariants() output with its input. The one for VRanges
might do something like the above.

Michael

On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo 
wrote:


Of course, the inclusion of strand would imply an interpretation of the
variant and its strand (e.g., "-") with respect to an annotated feature.
I
can see a practical problem of integrity of the information on a VRanges
object, by which a mandatory column, such as strand, depends on a
non-mandatory column, such as some feature annotation stored as a
metadata
column.

A solution would be to add the transcript identifier (TXID) as mandatory
column on the VRanges object but I suspect this is a big change to do, so
adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
locateVariants) in the metadata columns of the VRanges object would allow
me
to use a VRanges object as a container of variant x allele x sample x
annotation.

Just to clear up the issue of merging strand and variant: a noisy variant
(a variant that is not silent) that has, e.g., a loss-of-function effect
such as the gain of a stop codon is usually interpreted in the strand of
the transcript and coding sequence in which the stop codon is gained,
saying something like an A changed to a T producing the stop codon TAA. Ref
and
alt alleles are called in the strand of the reference chromosome, so if
the
transcript was annotated in the negative strand, we would know that we
need
to reverse-complement ref and alt to interpret the variant, although I
see
no need to do anything on the VRanges object to ref and alt because we
know
they are always in the strand of the reference chromosome. Only if you
want
to detect this stop-gain event (with predictCoding) then you would have
to
reverse-complement the ref and alt alleles. Conversely, if the variant
falls
in an intergenic region, then obviously the strand plays no role in the
interpretation of the variant and nothing needs to be done when
interpreting
the ref and alt alleles.


On 6/11/15 5:47 PM, Michael Lawrence wrote:



The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].

On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo 
wrote:



one option for me is just to add a metadata column with the strand of
the
overlapping feature. however, i'm interested to fully understand the
rationale behind this aspect of the design of the VRanges object.

a VRanges object unrolls variants in a VCF file per alternative allele
and
sample. variants in VCF files are obtained from tallying reads aligned
on
a
reference genome. so, my understanding is that the reference allele is
the
allele of the reference genome against which the reads were aligned
while
the alternate allele(s) are allele calls different from the reference.
from
this perspective, my interpretation is that ref and alt alleles have
already
a strand, which is the strand of the reference chromosome against which
the
reads were aligned to

Re: [Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

2015-06-11 Thread Valerie Obenchain
locateVariants(), predictCoding() and the family of mapToTranscripts() 
functions all return strand according to the annotation matched. The 
only time the strand of the output could possibly be different from the 
strand of the input 'query' is when 'ignore.strand = TRUE' (FALSE by 
default).


I wouldn't think you (Robert) are using 'ignore.strand = TRUE', are you? 
By just using the default, the output will have the same strand as the 
input 'query' (unless 'query' is '*' of course).


That said, do you still feel it's necessary to add a LOCSTRAND column to 
the output?


Val

On 06/11/2015 09:38 AM, Michael Lawrence wrote:

I didn't realize that locateVariants() returned an object with its
strand matching that of the subject. I would have expected the subject
strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
sounds like you want to merge the locateVariants() output with the
input. Merging the output strand as LOCSTRAND on the VRanges sounds
like a reasonable approach, for now. I don't know if Val is listening,
but it sounds like it would be nice to have convenient functions for
merging locateVariants() output with its input. The one for VRanges
might do something like the above.

Michael

On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo  wrote:

Of course, the inclusion of strand would imply an interpretation of the
variant and its strand (e.g., "-") with respect to an annotated feature. I
can see a practical problem of integrity of the information on a VRanges
object, by which a mandatory column, such as strand, depends on a
non-mandatory column, such as some feature annotation stored as a metadata
column.

A solution would be to add the transcript identifier (TXID) as mandatory
column on the VRanges object but I suspect this is a big change to do, so
adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
locateVariants) in the metadata columns of the VRanges object would allow me
to use a VRanges object as a container of variant x allele x sample x
annotation.

Just to clear up the issue of merging strand and variant: a noisy variant (a
variant that is not silent) that has, e.g., a loss-of-function effect such as
the gain of a stop codon is usually interpreted in the strand of the
transcript and coding sequence in which the stop codon is gained, saying
something like an A changed to a T producing the stop codon TAA. Ref and
alt alleles are called in the strand of the reference chromosome, so if the
transcript was annotated in the negative strand, we would know that we need
to reverse-complement ref and alt to interpret the variant, although I see
no need to do anything on the VRanges object to ref and alt because we know
they are always in the strand of the reference chromosome. Only if you want
to detect this stop-gain event (with predictCoding) then you would have to
reverse-complement the ref and alt alleles. Conversely, if the variant falls
in an intergenic region, then obviously the strand plays no role in the
interpretation of the variant and nothing needs to be done when interpreting
the ref and alt alleles.


On 6/11/15 5:47 PM, Michael Lawrence wrote:


The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].

On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo 
wrote:


one option for me is just to add a metadata column with the strand of the
overlapping feature. however, i'm interested to fully understand the
rationale behind this aspect of the design of the VRanges object.

a VRanges object unrolls variants in a VCF file per alternative allele
and
sample. variants in VCF files are obtained from tallying reads aligned on
a
reference genome. so, my understanding is that the reference allele is
the
allele of the reference genome against which the reads were aligned while
the alternate allele(s) are allele calls different from the reference.
from
this perspective, my interpretation is that ref and alt alleles have
already
a strand, which is the strand of the reference chromosome against which
the
reads were aligned to. i'm interested in this interpretation of the
strand
of the variants because i'm interested in the interpretation of
sequence-features containing the reference and the alternate alleles,
such
as differences in a binding site with the reference and the alternate
allele.

if we relax the meaning of elements in a VRanges object to, not only
variants x allele x sample, but to variants x allele x sample x
annotated-feature, then i think it would make sense to have the
strand-specific annotation in the strand slot of the VRanges object.

while this idea may be good or not for a number of reasons, i'm now
mostl

Re: [Bioc-devel] strandless introns with VariantAnnotation::locateVariants()

2015-06-11 Thread Valerie Obenchain
etwidth_1.0-3
[17] colorout_1.0-3

loaded via a namespace (and not attached):
 [1] GenomicAlignments_1.5.9 zlibbioc_1.15.0      BiocParallel_1.3.22  BSgenome_1.37.1
 [5] tools_3.3.0             DBI_0.3.1            lambda.r_1.1.7       futile.logger_1.4.1
 [9] rtracklayer_1.29.7      futile.options_1.0.0 bitops_1.0-6         RCurl_1.95-4.6
[13] biomaRt_2.25.1          RSQLite_1.0.0        XML_3.98-1.2


*** SESSION INFORMATION FOR RELEASE **

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Fedora release 12 (Constantine)

locale:
 [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C              LC_TIME=en_US.UTF8
 [4] LC_COLLATE=en_US.UTF8     LC_MONETARY=en_US.UTF8    LC_MESSAGES=en_US.UTF8
 [7] LC_PAPER=en_US.UTF8       LC_NAME=C                 LC_ADDRESS=C
[10] LC_TELEPHONE=C            LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2 GenomicFeatures_1.20.1
 [3] AnnotationDbi_1.30.1                    Biobase_2.28.0
 [5] VariantAnnotation_1.14.2                Rsamtools_1.20.4
 [7] Biostrings_2.36.1                       XVector_0.8.0
 [9] GenomicRanges_1.20.5                    GenomeInfoDb_1.4.0
[11] IRanges_2.2.4                           S4Vectors_0.6.0
[13] BiocGenerics_0.14.0                     vimcom_1.2-3
[15] setwidth_1.0-3                          colorout_1.1-0

loaded via a namespace (and not attached):
 [1] zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.2      BSgenome_1.36.0
 [5] tools_3.2.0             DBI_0.3.1               lambda.r_1.1.7          futile.logger_1.4.1
 [9] rtracklayer_1.28.4      futile.options_1.0.0    bitops_1.0-6            RCurl_1.95-4.6
[13] biomaRt_2.24.0          RSQLite_1.0.0           XML_3.98-1.2


On 06/11/2015 01:55 AM, Valerie Obenchain wrote:

Thanks for the report.

Now fixed in release (1.14.2) and devel (1.3.25).

Valerie



On 06/09/2015 09:44 AM, Robert Castelo wrote:

hi,

currently, the annotation of variants in intronic regions by
VariantAnnotation and the locateVariants() function does not assign
strand to annotations in introns:

library(VariantAnnotation)
example(locateVariants)
loc_all[loc_all$LOCATION == "intron"]
GRanges object with 2 ranges and 9 metadata columns:
      seqnames         ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID    GENEID PRECEDEID FOLLOWID
  [1]     chr1 [13302, 13302]      * |   intron      948    948       3    2         100287102
  [2]     chr1 [13327, 13327]      * |   intron      973    973       4    2         100287102
  -------
  seqinfo: 1 sequence from hg19 genome; no seqlengths


however, introns are stranded, so I would suggest including the strand
information in variants annotated to intronic regions. After a quick
look at the source code I believe the relevant line is within the
private function .makeResult(), concretely at:

strand=strand(query)[xHits],

where it is taking the strand of the query (the variant) while I guess it
should be the subject (the annotation).


cheers,

robert.









--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158



Re: [Bioc-devel] strandless introns with VariantAnnotation::locateVariants()

2015-06-10 Thread Valerie Obenchain

Thanks for the report.

Now fixed in release (1.14.2) and devel (1.3.25).

Valerie



On 06/09/2015 09:44 AM, Robert Castelo wrote:

hi,

currently, the annotation of variants in intronic regions by
VariantAnnotation and the locateVariants() function does not assign
strand to annotations in introns:

library(VariantAnnotation)
example(locateVariants)
loc_all[loc_all$LOCATION == "intron"]
GRanges object with 2 ranges and 9 metadata columns:
      seqnames         ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID    GENEID PRECEDEID FOLLOWID
  [1]     chr1 [13302, 13302]      * |   intron      948    948       3    2         100287102
  [2]     chr1 [13327, 13327]      * |   intron      973    973       4    2         100287102
  -------
  seqinfo: 1 sequence from hg19 genome; no seqlengths


however, introns are stranded, so I would suggest including the strand
information in variants annotated to intronic regions. After a quick
look at the source code I believe the relevant line is within the
private function .makeResult(), concretely at:

 strand=strand(query)[xHits],

where it is taking the strand of the query (the variant) while I guess it
should be the subject (the annotation).


cheers,

robert.




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158



Re: [Bioc-devel] reproducible with mclapply?

2015-06-04 Thread Valerie Obenchain

I'll add a section to the BiocParallel docs.

Valerie
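
A minimal sketch of the two-stage seeding discussed below: pre-generate
one seed per element outside the loop, set it inside the worker, so
results are independent of mc.cores. Kasper's caveat still applies -
distinct seeds do not guarantee independent streams.

library(parallel)

input <- 1:4
set.seed(1)                                  # user-controlled master seed
seeds <- sample.int(.Machine$integer.max, length(input))

res <- mclapply(seq_along(input), function(i) {
    set.seed(seeds[i])                       # fixed per-element stream
    rnorm(3, mean = input[i])
}, mc.cores = 2)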

On 06/04/2015 07:55 AM, Kasper Daniel Hansen wrote:

Yes, based on the documentation that particular random stream generator
would work with mclapply.

This is absolutely a subject which ought to be covered in the BiocParallel
documentation.

And commenting on another set of recommendations: please NEVER use
set.seed inside a function.  Unfortunately, because of the way R works,
this is a really bad idea.  As are functions with arguments like (set.seed =
FALSE).  Users need to be educated about this.  The main issue with using
set.seed is when your work is wrapped into other peoples code, for example
with an external bootstrap or similar.  I understand the desire for
reproducibility, but the design of the random generator in R is such that
this should really be left to the user.

Kasper

On Thu, Jun 4, 2015 at 10:39 AM, Vincent Carey 
wrote:


It does appear to me that the doRNG vignette sec 1.1 describes a solution
to the problem posed.  It is less clear to me that this method is readily
adopted with BiocParallel unless registerDoPar is in use.  Should we
address this topic explicitly in the vignette?
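
For reference, the pattern from that vignette section, sketched (assumes
the doRNG and doParallel packages, outside of BiocParallel):

library(doRNG)
library(doParallel)

## %dorng% derives per-iteration seeds from one reproducible stream, so
## the result should not depend on the number of workers registered.
registerDoParallel(cores = 2)
set.seed(123)
res1 <- foreach(i = 1:4) %dorng% runif(1)
set.seed(123)
res2 <- foreach(i = 1:4) %dorng% runif(1)
identical(res1, res2)  # TRUE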

On Thu, Jun 4, 2015 at 9:50 AM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:


Note you're not guaranteed that two random streams starting with different
seeds will be (approximately) independent, so the suggestion on SO makes
the numbers reproducible but technically wrong.

If you want true independence you either need to use a parallel version of
the random number generator or you do what I suggested.  Because of how
mclapply works (via fork) it is not clear to me that it is possible to use
a parallel version of the random number generator, but I am not sure about
this.  The snippet from the documentation quoted above suggests I am
wrong.

Best,
Kasper
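
For the archive, the built-in alternative referred to above, sketched
(reproducible for a fixed mc.cores; as noted elsewhere in this thread,
not across different worker counts):

library(parallel)

## L'Ecuyer-CMRG gives each fork its own RNG stream (see ?nextRNGStream)
RNGkind("L'Ecuyer-CMRG")
set.seed(0)
mc.reset.stream()
m1 <- mclapply(1:10, function(x) sample(1:10), mc.cores = 2)
set.seed(0)
mc.reset.stream()
m2 <- mclapply(1:10, function(x) sample(1:10), mc.cores = 2)
identical(m1, m2)  # TRUE, with the same mc.cores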

On Wed, Jun 3, 2015 at 11:25 PM, Vladislav Petyuk 
wrote:


There are different ways set.seed can be used.  The way it is suggested on
the aforementioned stackoverflow post is basically a two-stage process.
The first seed is provided by the user (set.seed(1)); that is, the user can
change the outcome from run to run.  Based on that seed, a vector of
randomized seeds is generated (seeds <- sample.int(length(input),
replace=TRUE)).  Those seeds are basically arguments to the function under
mclapply/lapply that help to control random number generation for each
iteration (set.seed(seeds[idx])).
There are two different roles of set.seed here.  The first lets the user
control random number generation and the second (within the function) makes
sure that it is the same for individual iterations regardless of how the
loop is executed.
Does that make sense?
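
As a runnable sketch of the two-stage pattern just described (mirroring
the stackoverflow code quoted in this thread):

library(parallel)

input <- 1:10
set.seed(1)                                        # stage 1: user-level seed
seeds <- sample.int(length(input), replace=TRUE)   # one seed per iteration
m <- mclapply(seq_along(input), function(idx) {
    set.seed(seeds[idx])                           # stage 2: fixed per iteration
    sample(1:10)
}, mc.cores = 2)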

On Wed, Jun 3, 2015 at 7:07 PM, Yu, Guangchuang 
wrote:


There is one possible solution posted in
http://stackoverflow.com/questions/30610375/how-to-run-permutations-using-mclapply-in-a-reproducible-way-regardless-of-numbe/30627984#30627984.

As Kasper suggested, it's not proper to use set.seed inside a package.

I suggest using a parameter, for example seed=FALSE, to disable the
set.seed call; if users want the result reproducible, e.g. in a
demonstration, they set seed=TRUE explicitly and set.seed will be run
inside the function.

Bests,
Guangchuang

On Wed, Jun 3, 2015 at 8:42 PM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:


For this situation, generate the permutation indexes outside of the
mclapply, and then do mclapply over a list with the indices.

And btw., please don't use set.seed inside a package; that control should
completely be left to the user.

Best,
Kasper

On Wed, Jun 3, 2015 at 7:08 AM, Vincent Carey <st...@channing.harvard.edu>
wrote:


This document indicates how to achieve reproducibility independent of the
underlying physical environment.

http://cran.r-project.org/web/packages/doRNG/vignettes/doRNG.pdf

Let me know if that satisfies the question.

On Wed, Jun 3, 2015 at 5:32 AM, Yu, Guangchuang <g...@connect.hku.hk>
wrote:


Dear Vincent,

RNGkind("L'Ecuyer-CMRG") works as using mc.set.seed=FALSE.

When mc.cores changes, the output is not reproducible.

I think this issue is also of concern within the Bioconductor community,
as parallel versions of permutation tests are commonly used now.


Best Regards,

Guangchuang



On Wed, Jun 3, 2015 at 5:17 PM, Vincent Carey <st...@channing.harvard.edu>
wrote:


Hi, this question belongs on R-help, but perhaps

https://stat.ethz.ch/R-manual/R-devel/library/parallel/html/RngStream.html

will be useful.

Best regards

On Wed, Jun 3, 2015 at 3:11 AM, Yu, Guangchuang <g...@connect.hku.hk>
wrote:


Dear all,

I have an issue with setting the seed value when using the parallel
package.



library("parallel")
library("digest")

set.seed(0)
m <- mclapply(1:10, function(x) sample(1:10),

+   mc.cores=2)

digest(m, 'crc32')

[1] "4827c80c"


set.seed(0)
m <- mclapply(1:10, function(x) sample(1:10),

+   mc.cores=2)

digest(m, 'crc32')

[1] "e95

Re: [Bioc-devel] Redirect workers output to STDERR, how to do so with current BiocParallel::SnowParam()?

2015-05-22 Thread Valerie Obenchain

Hi,

Thanks for reporting the bug. Now fixed in 1.13.4 (devel) and 1.2.2 
(release).


> bplapply(1:2, print, BPPARAM=SnowParam(outfile = NULL))
starting worker for localhost:11031
starting worker for localhost:11031
Type: EXEC
[1] 1
Type: EXEC
[1] 2
Type: DONE
Type: DONE
[[1]]
[1] 1

[[2]]
[1] 2


Some new features have been added to BiocParallel - logging with 
futile.logger, writing out log and result files and control over how a 
job is divided into tasks.


## send log file to 'logdir':
bplapply(1:2, print, BPPARAM=SnowParam(log=TRUE, logdir=tempdir()))

## write results to 'resdir':
bplapply(1:2, print, BPPARAM=SnowParam(resdir=tempdir()))

## by default jobs are divided evenly over the workers
bplapply(1:100, print, BPPARAM=SnowParam(workers=4))

## force the job to be run in 2 tasks
## (useful when you know runtime or memory requirements)
bplapply(1:100, print, BPPARAM=SnowParam(tasks=2))


It would be great if you had a chance to try any of these out - I'd be 
interested in the feedback. Logging was intended to take the idea of the 
'outfile' from snow further with the ability to add messages that can be 
filtered by threshold.
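
For example, a sketch of logging with a threshold (the flog.* calls are
plain futile.logger; 'threshold' is the SnowParam field shown further
below):

library(BiocParallel)

p <- SnowParam(workers = 2, log = TRUE, threshold = "INFO")
res <- bplapply(1:2, function(i) {
    futile.logger::flog.debug("dropped at INFO threshold: %d", i)
    futile.logger::flog.info("processing element %d", i)
    sqrt(i)
}, BPPARAM = p)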


The package also has a new errors/logging vignette:

http://www.bioconductor.org/packages/3.2/bioc/vignettes/BiocParallel/inst/doc/Errors_Logs_And_Debugging.pdf

Valerie



On 05/21/2015 07:58 AM, Leonardo Collado Torres wrote:

Hi,

This might be a BioC support website question, but maybe it's a bug.

In previous versions of BiocParallel, you could specify where to
direct the output from the workers by using the 'outfile' argument.
For example, SnowParam(outfile = Sys.getenv('SGE_STDERR_PATH')). I'm
not finding how to do so with the current version (1.3.12).

I understand that SnowParam() has a ... argument that gets passed to
snow::makeCluster(), according to the SnowParam() docs. I also see
that when log = TRUE, a script is used instead of snow. But log =
FALSE by default.

If I use snow, the 'outfile' argument does work as shown below:

## Nothing gets printed by default

cl <- makeCluster(2, type = 'SOCK')
y <- clusterApply(cl, 1:2, print)



## Use outfile now, print works

cl <- makeCluster(2, type = 'SOCK', outfile = NULL)

starting worker for localhost:11671
starting worker for localhost:11671

y <- clusterApply(cl, 1:2, print)

Type: EXEC
Type: EXEC
[1] 1
[1] 2


packageVersion('snow')

[1] ‘0.3.13’


However, I can't use 'outfile' with SnowParam:

## SerialParam works

x <- bplapply(1:2, print, BPPARAM = SerialParam())

[1] 1
[1] 2

## No printing by default with SnowParam, as expected

x <- bplapply(1:2, print, BPPARAM = SnowParam(workers = 1))


## Can't pass 'outfile' argument

x <- bplapply(1:2, print, BPPARAM = SnowParam(workers = 1, outfile = NULL))

Error in bplapply(1:2, print, BPPARAM = SnowParam(workers = 1, outfile
= NULL)) :
   error in evaluating the argument 'BPPARAM' in selecting a method for
function 'bplapply': Error in envRefSetField(.Object, field, classDef,
selfEnv, elements[[field]]) :
   ‘outfile’ is not a field in class “SnowParam”


Digging at the code, I see that in SnowParam() the ... argument is
saved in .clusterargs

args <- c(list(spec = workers, type = type), list(...))
 .clusterargs <- lapply(args, force)

However, ... is still passed to .SnowParam(), and .SnowParam() fields are:

fields=list(
 cluster="cluster",
 .clusterargs="list",
 .controlled="logical",
 log="logical",
 threshold="ANY",
 logdir="character",
 resultdir="character")


So, I'm wondering if ... should not be passed to .SnowParam().



## Trying to pass outfile to .clusterargs directly doesn't work
## Actually, I'm surprised it didn't crash .SnowParam()

x <- bplapply(1:2, print, BPPARAM = SnowParam(workers = 1, 
.clusterargs=list(outfile = NULL)))



Full log: https://gist.github.com/9ac957c6cad1c07f4ea4


Thanks,
Leo

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] rtracklayer::import( as = 'NumericList' ) doesn't work with more than 1 range

2015-05-20 Thread Valerie Obenchain

Hi Leo,

Thanks for reporting the bug with import. The problem was in how the 
length of the output was computed. This has been fixed in both release 
(1.28.3) and devel (1.29.6).


I'll let Michael answer the summary() question.

Valerie


On 05/19/2015 12:10 PM, Leonardo Collado Torres wrote:

Hi,

While playing around with importing BigWig files I found that
import.bw() fails when you use a `which` or `selection` that has more
than one range and you specify `as = 'NumericList'`. The code and
output are available at
https://gist.github.com/lcolladotor/a0eafc335a2738de42f6. From the
BigWigFile-class documentation, I suspect that this is a bug. The same
thing happens even if I use BigWigFileSelection() instead of supplying
a GRanges of length 2.

Also, what is the summary() function doing when you calculate the
mean? I would expect it to be the same mean if I import the data as an
Rle and calculate the mean there. See (after running the code in the
gist):


x <- import(BigWigFile(bw[1]), as = 'RleList')
mean(x)

  chr21
0.02474045

summary(BigWigFile(bw[1]), type = 'mean')[[1]]$score

[1] 0.9037462

mean(x[x > 0])

chr21
1.202603

It's not the mean of the non-zero positions either.
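
One way to cross-check what summary() might be averaging over (an
assumption about its behaviour, not something documented): the mean of
the scores weighted by the widths of the intervals actually stored in
the file, which excludes bases the BigWig does not cover.

library(rtracklayer)

gr <- import(BigWigFile(bw[1]))         # 'bw' as defined in the gist
weighted.mean(gr$score, w = width(gr))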


Cheers,
Leo


Leonardo Collado Torres, PhD Candidate
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://lcolladotor.github.io/

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] BioC 2015 Conference Posters

2015-04-24 Thread Valerie Obenchain

Poster registration is now open. Visit the web site

  http://www.bioconductor.org/help/course-materials/2015/BioC2015/

and registration page

  https://register.bioconductor.org/BioC2015/poster_submit.php

for more information.

Posters can be from any area of computational biology, medicine, 
computer science, mathematics and statistics. Purely experimental work 
or package overviews are welcome. They will be up for viewing during 
Tuesday's (and probably Wednesday's) social hour.


Deadline is July 15 so we can estimate the number of display boards. Max 
size is 48 x 36 inches.


For questions please contact voben...@fredhutch.org.

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Bioconductor Newsletter - April 2015

2015-04-03 Thread Valerie Obenchain

Now available at http://bioconductor.org/help/newsletters/2015_April/

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] zero-width ranges representing insertions

2015-03-19 Thread Valerie Obenchain

On 03/16/2015 05:31 PM, Hervé Pagès wrote:

On 03/16/2015 04:06 PM, Michael Lawrence wrote:



On Mon, Mar 16, 2015 at 3:12 PM, Robert Castelo <robert.cast...@upf.edu> wrote:

+1 IMO BioC could adopt the zero-width ranges representation for
insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to
deal with each corresponding beast, be it VCF, dbSNP or the like. Who
knows, VCF could also change their representation in the future and
it'll be a headache to update the affected packages if we decide to
keep using its insertion representation internally to store variant
ranges in BioC.


That would break just about every tool, so let's hope not. There's a
bunch of code on top of Bioc that currently depends on the current
representation. For example, zero width ranges do not overlap anything,
so they need special treatment to e.g. detect whether an insertion falls
within a gene. There are real benefits to keeping the representation of
indels consistent with the rest of the field (VCF). There was much
thought put into this.


Note that findOverlaps() now handles zero-width ranges.


I've had a chance to take a closer look at how VA handles zero-width ranges.

Previously, both predictCoding() and locateVariants() treated zero-width 
ranges as width 1 (start decremented to equal end). In VA 1.13.42 this 
has been changed for predictCoding() so now zero-width ranges are dropped. The 
function internals expect REF and ALT to conform with the vcf specs and 
zero-width ranges aren't used. So, it seemed wise to drop the zero-width 
ranges for now.


locateVariants() remains the same because this is more general. I think 
it's still useful to identify where a zero width range falls with 
respect to gene features.
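
For instance, a small sketch (assumes an IRanges recent enough to
support zero-width overlaps, per the note above):

library(IRanges)

ins  <- IRanges(start = 77055, width = 0)   # insertion point, zero width
gene <- IRanges(76646, 77058)
findOverlaps(ins, gene, minoverlap = 0L)    # the zero-width range now hits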





Straight use of findOverlaps() on the ranges of a VCF object leads to
some subtle problems on insertions. For example predictCoding() (which
I guess uses findOverlaps() internally) reports strange things for
these 2 insertions (1 right before and 1 right after the stop codon):



This output is actually fine. The VARCODON values may be slightly 
misleading but the data are correct. predictCoding() only computes amino 
acid sequences for snps or indels that conform to the 'groups of 3' 
idea. The substitution or deletion must result in the sequence being 
divisible by 3; otherwise there is a partial codon at the end that must be 
inferred (consider all possible combinations) and then one must be 
chosen (consensus). The code does not currently do this and I'm not sure 
there is common agreement on how to do it.
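
The 'groups of 3' rule in code form, using the alleles from the example
that follows:

ref <- "AT"
alt <- c("ATC", "ATCG", "ATCGG")        # 1, 2 and 3 bp insertions
(nchar(alt) - nchar(ref)) %% 3 == 0     # FALSE FALSE TRUE: only the 3bp
                                        # insertion can be translated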


This GRanges has a snv followed by 1, 2, and 3 base pair insertions:


rowRanges(vcf)

GRanges object with 4 ranges and 5 metadata columns:
          seqnames         ranges strand | paramRangeID REF   ALT QUAL FILTER
  snv        chr20 [77055, 77055]      * |         <NA>   T     G   70   PASS
  1bp ins    chr20 [77054, 77055]      * |         <NA>  AT   ATC   70   PASS
  2bp ins    chr20 [77054, 77055]      * |         <NA>  AT  ATCG   70   PASS
  3bp ins    chr20 [77054, 77055]      * |         <NA>  AT ATCGG   70   PASS
  ---
  seqinfo: 1 sequence from an unspecified genome; no seqlengths


Coding changes are computed for the snv and 3bp insertion but the others 
are marked as 'frameshift'. Previously when an indel couldn't be 
translated the VARCODON was the same as the REFCODON which may have been 
confusing (was intended to mean nothing has changed). I've changed this 
so VARCODON is now missing (like VARAA) when it can't be translated.





predictCoding(vcf, txdb, Hsapiens)

GRanges object with 4 ranges and 17 metadata columns:
          seqnames         ranges strand | paramRangeID REF   ALT QUAL FILTER varAllele     CDSLOC PROTEINLOC QUERYID  TXID  CDSID GENEID
  snv        chr20 [77055, 77055]      + |         <NA>   T     G   70   PASS         G [468, 468]        156       1 70477 206101 245938
  1bp ins    chr20 [77054, 77055]      + |         <NA>  AT   ATC   70   PASS       ATC [467, 468]        156       2 70477 206101 245938
  2bp ins    chr20 [77054, 77055]      + |         <NA>  AT  ATCG   70   PASS      ATCG [467, 468]        156       3 70477 206101 245938
  3bp ins    chr20 [77054, 77055]      + |         <NA>  AT ATCGG   70   PASS     ATCGG [467, 468]        156       4

Re: [Bioc-devel] little tiny bug in CDSID annotations from predictCoding()

2015-03-17 Thread Valerie Obenchain
The second mapping in .localCoordinates() in AllUtilities.R was using a 
listed 'to' instead of an unlisted 'to'. This means the mapping was at the 
level of the outer list elements vs the inner.  We want to map/overlap 
with an unlisted object because each range can have a different CDSID. 
This second mapping provided the index for retrieving the CDSIDs, which 
is why you saw a difference between the two.


This was a bug for predictCoding only.

Valerie
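
A sketch of the distinction ('variants' stands in for the query GRanges;
'txdb' as in the report below):

library(GenomicFeatures)

## mapping against the *unlisted* cds-by-transcript object keeps one row
## per CDS piece, so each hit indexes its own cds_id
cds  <- cdsBy(txdb, by = "tx")              # GRangesList
flat <- unlist(cds, use.names = FALSE)      # one range per CDS piece
map  <- mapToTranscripts(variants, flat)
flat$cds_id[map$transcriptsHits]            # per-hit CDSID lookup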


On 03/17/2015 11:39 AM, Robert Castelo wrote:

Thanks! Do you have an explanation for the apparent disagreement in
CDSID annotations that i described below the bug, between
predictCoding() and locateVariants()?

robert.


---- Original message ----
From: Valerie Obenchain
Date:17/03/2015 19:18 (GMT+01:00)
To: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] little tiny bug in CDSID annotations from
predictCoding()

Hi Robert,

Thanks for reporting the typo and bug. Now fixed in 1.13.41.

Valerie

On 03/17/2015 10:58 AM, Robert Castelo wrote:
 > in my message below, the line that it says:
 >
 > head(loc_all$CDSID)
 >
 > it should say
 >
 > head(coding2$CDSID)
 >
 > cheers,
 >
 > robert.
 >
 > ==
 > hi,
 >
 > there is a little tiny bug in the current devel version of
 > VariantAnnotation::predictCoding(), and more concretely within
 > VariantAnnotation:::.localCoordinates(), that precludes the correct
 > annotation of the CDSID column:
 >
 > library(VariantAnnotation)
 > library(TxDb.Hsapiens.UCSC.hg19.knownGene)
 > library(BSgenome.Hsapiens.UCSC.hg19)
 >
 > vcf <- readVcf(system.file("extdata", "CEUtrio.vcf.bgz",
 > package="VariantFiltering"), genome="hg19")
 > seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
 > vcf <- dropSeqlevels(vcf, "chrM")
 > coding1 <- predictCoding(vcf, txdb, Hsapiens)
 > head(coding1$CDSID)
 > IntegerList of length 6
 > [[1]] integer(0)
 > [[2]] integer(0)
 > [[3]] integer(0)
 > [[4]] integer(0)
 > [[5]] integer(0)
 > [[6]] integer(0)
 > table(elementLengths(coding1$CDSID))
 >
 > 0
 > 6038
 >
 > my sessionInfo() is at the end of the message.
 >
 > here is the patch, just replacing 'map2$trancriptHits' by
 > 'map2$transcriptsHits':
 >
 > --- R/AllUtilities.R(revision 100756)
 > +++ R/AllUtilities.R(working copy)
 > @@ -284,7 +284,7 @@
 >   cdsid <- IntegerList(integer(0))
 >   map2 <- mapToTranscripts(unname(from)[xHits], to,
 >ignore.strand=ignore.strand)
 > -cds <- mcols(unlist(to, use.names=FALSE))$cds_id[map2$trancriptsHits]
 > +cds <- mcols(unlist(to, use.names=FALSE))$cds_id[map2$transcriptsHits]
 >   if (length(cds)) {
 >   cdslst <- unique(splitAsList(cds, map2$xHits))
 >   cdsid <- cdslst
 >
 > with this fix then things seem to work again:
 >
 > coding1 <- predictCoding(vcf, txdb, Hsapiens)
 >> head(coding1$CDSID)
 > IntegerList of length 6
 > [["1"]] 21771
 > [["2"]] 21771
 > [["3"]] 21771
 > [["4"]] 21771
 > [["5"]] 21428
 > [["6"]] 21428
 > table(elementLengths(coding1$CDSID))
 >
 >    1    2    3    4    5    6    7    8    9   10   12   13   14   16   19
 >  873 1229 1024  993  615  524  324  168   82   21   12   15   42   76   40
 >
 > while investigating this bug i used VariantAnnotation::locateVariants()
 > which also annotates the CDSID column, and it seemed to be working.
 > however, i noticed that both, predictCoding() and locateVariants(), do
 > not give an identical annotation for the CDSID column in coding variants:
 >
 > coding2 <- locateVariants(vcf, txdb, CodingVariants())
 > head(loc_all$CDSID)
 > IntegerList of length 6
 > [["1"]] 210777
 > [["2"]] 210777
 > [["3"]] 210777
 > [["4"]] 210778
 > [["5"]] 208140
 > [["6"]] 208141
 > table(elementLengths(coding2$CDSID))
 >
 >    1    2    3    4
 > 4987  901  138   12
 >
 > in principle, it seems that both are annotating valid CDSID keys:
 >
 > allcdsinfo <- select(txdb, keys=keys(txdb, keytype="CDSID"),
 > columns="CDSID", keytype="CDSID")
 > sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in%
 > allcdsinfo$CDSID)
 > [1] 0
 > sum(!as.character(unlist(coding2$CDSID, use.names=FALSE)) %in%
 > allcdsinfo$CDSID)
 > [1] 0
 >
 > but predictCoding() annotates CDSID values that are not present in
 > locateVariants() annotations and viceversa:
 >
 > sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in%
 > as.char

Re: [Bioc-devel] little tiny bug in CDSID annotations from predictCoding()

2015-03-17 Thread Valerie Obenchain

Hi Robert,

Thanks for reporting the typo and bug. Now fixed in 1.13.41.

Valerie

On 03/17/2015 10:58 AM, Robert Castelo wrote:

in my message below, the line that it says:

head(loc_all$CDSID)

it should say

head(coding2$CDSID)

cheers,

robert.

==
hi,

there is a little tiny bug in the current devel version of
VariantAnnotation::predictCoding(), and more concretely within
VariantAnnotation:::.localCoordinates(), that precludes the correct
annotation of the CDSID column:

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)

vcf <- readVcf(system.file("extdata", "CEUtrio.vcf.bgz",
package="VariantFiltering"), genome="hg19")
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
vcf <- dropSeqlevels(vcf, "chrM")
coding1 <- predictCoding(vcf, txdb, Hsapiens)
head(coding1$CDSID)
IntegerList of length 6
[[1]] integer(0)
[[2]] integer(0)
[[3]] integer(0)
[[4]] integer(0)
[[5]] integer(0)
[[6]] integer(0)
table(elementLengths(coding1$CDSID))

0
6038

my sessionInfo() is at the end of the message.

here is the patch, just replacing 'map2$trancriptHits' by
'map2$transcriptsHits':

--- R/AllUtilities.R(revision 100756)
+++ R/AllUtilities.R(working copy)
@@ -284,7 +284,7 @@
  cdsid <- IntegerList(integer(0))
  map2 <- mapToTranscripts(unname(from)[xHits], to,
   ignore.strand=ignore.strand)
-cds <- mcols(unlist(to, use.names=FALSE))$cds_id[map2$trancriptsHits]
+cds <- mcols(unlist(to, use.names=FALSE))$cds_id[map2$transcriptsHits]
  if (length(cds)) {
  cdslst <- unique(splitAsList(cds, map2$xHits))
  cdsid <- cdslst

with this fix then things seem to work again:

coding1 <- predictCoding(vcf, txdb, Hsapiens)

head(coding1$CDSID)

IntegerList of length 6
[["1"]] 21771
[["2"]] 21771
[["3"]] 21771
[["4"]] 21771
[["5"]] 21428
[["6"]] 21428
table(elementLengths(coding1$CDSID))

   1    2    3    4    5    6    7    8    9   10   12   13   14   16   19
 873 1229 1024  993  615  524  324  168   82   21   12   15   42   76   40

while investigating this bug i used VariantAnnotation::locateVariants()
which also annotates the CDSID column, and it seemed to be working.
however, i noticed that both, predictCoding() and locateVariants(), do
not give an identical annotation for the CDSID column in coding variants:

coding2 <- locateVariants(vcf, txdb, CodingVariants())
head(loc_all$CDSID)
IntegerList of length 6
[["1"]] 210777
[["2"]] 210777
[["3"]] 210777
[["4"]] 210778
[["5"]] 208140
[["6"]] 208141
table(elementLengths(coding2$CDSID))

   1    2    3    4
4987  901  138   12

in principle, it seems that both are annotating valid CDSID keys:

allcdsinfo <- select(txdb, keys=keys(txdb, keytype="CDSID"),
columns="CDSID", keytype="CDSID")
sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in%
allcdsinfo$CDSID)
[1] 0
sum(!as.character(unlist(coding2$CDSID, use.names=FALSE)) %in%
allcdsinfo$CDSID)
[1] 0

but predictCoding() annotates CDSID values that are not present in
locateVariants() annotations and viceversa:

sum(!as.character(unlist(coding1$CDSID, use.names=FALSE)) %in%
as.character(unlist(coding2$CDSID, use.names=FALSE)))
[1] 24057
sum(!as.character(unlist(coding2$CDSID, use.names=FALSE)) %in%
as.character(unlist(coding1$CDSID, use.names=FALSE)))
[1] 7251

length(unique(intersect(as.character(unlist(coding2$CDSID, use.names=FALSE)),
                        as.character(unlist(coding1$CDSID, use.names=FALSE)))))
[1] 0

should not both annotate the same CDSID values on coding variants?


thanks!
robert.
ps: sessionInfo()
R Under development (unstable) (2014-10-14 r66765)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF8LC_COLLATE=en_US.UTF8
  [5] LC_MONETARY=en_US.UTF8LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=en_US.UTF8   LC_NAME=C
  [9] LC_ADDRESS=C  LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices
[6] utils datasets  methods   base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.1
  [2] GenomicFeatures_1.19.31
  [3] AnnotationDbi_1.29.17
  [4] Biobase_2.27.2
  [5] BSgenome.Hsapiens.UCSC.hg19_1.4.0
  [6] BSgenome_1.35.17
  [7] rtracklayer_1.27.8
  [8] VariantAnnotation_1.13.40
  [9] Rsamtools_1.19.44
[10] Biostrings_2.35.11
[11] XVector_0.7.4
[12] GenomicRanges_1.19.46
[13] GenomeInfoDb_1.3.13
[14] IRanges_2.1.43
[15] S4Vectors_0.5.22
[16] BiocGenerics_0.13.6
[17] vimcom_1.0-0
[18] setwidth_1.0-3
[19] colorout_1.0-3

loaded via a namespace (and not attached):
  [1] BiocParallel_1.1.15  biomaRt_2.23.5
  [3] bitops_1.0-6 DBI_0.3.1
  [5] GenomicAlignments_1.3.31 RCurl_1.95-4.5
  [7] RSQLite_1.0.0tools_3.2.0
  [9] XML_3.98-1.1 zlibbioc_1.13.2

___
Bioc-devel@r-project.org mailing list

Re: [Bioc-devel] zero-width ranges representing insertions

2015-03-16 Thread Valerie Obenchain

On 03/16/2015 05:31 PM, Hervé Pagès wrote:

On 03/16/2015 04:06 PM, Michael Lawrence wrote:



On Mon, Mar 16, 2015 at 3:12 PM, Robert Castelo <robert.cast...@upf.edu> wrote:

+1 IMO BioC could adopt the zero-width ranges representation for
insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to
deal with each corresponding beast, be it VCF, dbSNP or the like. Who
knows, VCF could also change their representation in the future and
it'll be a headache to update the affected packages if we decide to
keep using its insertion representation internally to store variant
ranges in BioC.


That would break just about every tool, so let's hope not. There's a
bunch of code on top of Bioc that currently depends on the current
representation. For example, zero width ranges do not overlap anything,
so they need special treatment to e.g. detect whether an insertion falls
within a gene. There are real benefits to keeping the representation of
indels consistent with the rest of the field (VCF). There was much
thought put into this.


Note that findOverlaps() now handles zero-width ranges.



predictCoding and locateVariants had to work around the fact that 
findOverlaps didn't work on zero-length ranges. Now that we can handle 
zero-length overlaps this code should be updated (yes, my fault for not 
updating).


How predictCoding and locateVariants currently handle insertions can be 
modified. The current behavior (bug) should have no bearing on how 
we want to represent indels in the VCF class in general.


I agree with Michael that much thought went into making the VCF class 
consistent with the VCF specs. While information has been added to the 
file format over the years I am not aware of any changes related to the 
positional representation of the variant (ie, it has been stable) and 
think it unlikely this would change in the future.


Val




Straight use of findOverlaps() on the ranges of a VCF object leads to
some subtle problems on insertions. For example predictCoding() (which
I guess uses findOverlaps() internally) reports strange things for
these 2 insertions (1 right before and 1 right after the stop codon):

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

 > cdsBy(txdb, use.names=TRUE)$uc002wcw.3
GRanges object with 2 ranges and 3 metadata columns:
      seqnames         ranges strand | cds_id cds_name exon_rank
  [1]    chr20 [68351, 68408]      + | 206100     <NA>         1
  [2]    chr20 [76646, 77058]      + | 206101     <NA>         2
  ---
  seqinfo: 93 sequences (1 circular) from hg19 genome


library(VariantAnnotation)
 > rowRanges(vcf)  # hand-made VCF
GRanges object with 2 ranges and 5 metadata columns:
                          seqnames         ranges strand | paramRangeID REF ALT QUAL FILTER
  ins before stop codon      chr20 [77055, 77055]      * |         <NA>   T  TG   70   PASS
  ins after stop codon       chr20 [77058, 77058]      * |         <NA>   A  AG   70   PASS
  ---
  seqinfo: 1 sequence from hg19 genome

Calling predictCoding():

 > library(BSgenome.Hsapiens.UCSC.hg19)
 > predictCoding(vcf, txdb, Hsapiens)
GRanges object with 2 ranges and 17 metadata columns:
                          seqnames         ranges strand | paramRangeID REF ALT QUAL FILTER varAllele     CDSLOC PROTEINLOC QUERYID  TXID  CDSID GENEID CONSEQUENCE REFCODON VARCODON REFAA VARAA
  ins before stop codon      chr20 [77055, 77055]      + |         <NA>   T  TG   70   PASS        TG [468, 468]        156       1 70477 206101 245938  frameshift      AAT      AAT
  ins after stop codon       chr20 [77058, 77058]      + |         <NA>   A  AG   70   PASS        AG [471, 471]        157       2 70477 206101 245938  frameshift      TAA      TAA
  ---
  seqinfo: 1 sequence from hg19 genome

PROTEINLOC, REFCODON, VARCODON, and CONSEQUENCE don't seem quite right
to me. Could be that my hand-made vcf 

Re: [Bioc-devel] ShortRead srdistance is broken

2015-03-16 Thread Valerie Obenchain

Hi Nico,

Can you try again with BiocParallel 1.1.16? Likely related to this post,

https://stat.ethz.ch/pipermail/bioc-devel/2015-March/007114.html


Valerie


On 03/16/2015 07:05 AM, Nicolas Delhomme wrote:

Hej!

Martin, I guess that's for you :-) And I suppose it might only be temporary (due 
to development in BiocParallel/ShortRead), but I stumbled on the following:

library(ShortRead)
srdistance(DNAStringSet("AA"),"AAATAA")

fails with the following error:

Error in bplapply(X, FUN, ..., BPRESUME = BPRESUME, BPPARAM = BPPARAM) :
  error in evaluating the argument 'BPPARAM' in selecting a method for function 
'bplapply': Error in registered()[[bpparamClass]] :
  attempt to select less than one element

All packages freshly updated from Bioc and R devel a month old:


sessionInfo()

R Under development (unstable) (2015-02-11 r67792)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.2 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets  
methods   base

other attached packages:
[1] ShortRead_1.25.9 GenomicAlignments_1.3.31 Rsamtools_1.19.44
GenomicRanges_1.19.46
[5] GenomeInfoDb_1.3.13  Biostrings_2.35.11   XVector_0.7.4
IRanges_2.1.43
[9] S4Vectors_0.5.22 BiocParallel_1.1.15  BiocGenerics_0.13.6

loaded via a namespace (and not attached):
[1] lattice_0.20-30 bitops_1.0-6grid_3.2.0  DBI_0.3.1   
RSQLite_1.0.0
[6] zlibbioc_1.13.2 hwriter_1.3.2   latticeExtra_0.6-26 
RColorBrewer_1.1-2  tools_3.2.0
[11] Biobase_2.27.2

Cheers,

Nico

---
Nicolas Delhomme

The Street Lab
Department of Plant Physiology
Umeå Plant Science Center

Tel: +46 90 786 5478
Email: nicolas.delho...@umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BiocParallel: changes to SnowParam(), should it import futile.logger?

2015-03-15 Thread Valerie Obenchain

Hi Leo,

Thanks for reporting this. It was missed by the build system (and 
myself!) because the package was already installed. Should be fixed in 
1.1.16.


futile.logger (and others) were moved to Suggests to lighten the load of 
the NAMESPACE. Logging isn't technically turned on until the cluster is 
started with bpstart(), however, futile.logger is used to check the 
threshold when the class is instantiated. This step is important in 
class validation so I've moved futile.logger back to Imports.


Valerie


On 03/12/2015 08:35 PM, Leonardo Collado Torres wrote:

Hi,

I noticed that BiocParallel::SnowParam() changed. It now uses the
futile.logger package, but it's only suggested by BiocParallel as seen here
https://github.com/Bioconductor/BiocParallel/blob/master/DESCRIPTION#L19
This leads to some errors as shown at
https://travis-ci.org/lcolladotor/derfinder/builds/54192145#L3570-L3571

Hopefully futile.logger can be imported. I mean, hopefully selectively
importing futile.logger is a quick task.

Thanks,
Leo

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] requirement for named assays in SummarizedExperiment

2015-03-11 Thread Valerie Obenchain

Hi,

After talking with others the vote was against enforcing names on 
assays() and for positional matching if all names are NULL. A mixture of 
names and NULL throws an error.


example(SummarizedExperiment)

## all named
> se2 = se1
> assays(cbind(se1, se2))
List of length 1
names(1): counts

## mixture of names and NULL -> error
> names(assays(se1)) = NULL
> assays(cbind(se1, se2))
Error in assays(cbind(se1, se2)) :
  error in evaluating the argument 'x' in selecting a method for 
function 'assays': Error in .bind.arrays(args, cbind, "assays") :

  elements in ‘assays’ must have the same names

## all NULL -> positional matching
> names(assays(se2)) = NULL
> assays(cbind(se1, se2))
List of length 1

If we find common use cases where positional matching is needed with a 
mixture of names and NULL we can always relax this constraint.


Changes are in 1.19.46.

Valerie



On 03/06/2015 08:20 AM, Valerie Obenchain wrote:

Hi Aaron,

Thanks for catching this.

I favor enforcing names in 'assays'. Combining by position alone is too
dangerous. I'm thinking of the VCF class where the genome information is
stored in 'assays' and the fields are rarely in the same order.

Looks like we also need a more informative error message when names
don't match.

 > assays(se1)
List of length 1
names(1): counts1

 > assays(se2)
List of length 1
names(1): counts2

 > cbind(se1, se2)
Error in sQuote(accessorName) :
   argument "accessorName" is missing, with no default


Valerie


On 03/05/2015 11:09 PM, Aaron Lun wrote:

Dear all,

I stumbled upon some unexpected behaviour with cbind'ing
SummarizedExperiment objects with unnamed assays:


require(GenomicRanges)
nrows <- 5; ncols <- 4
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
rowData <- GRanges("chr1", IRanges(1:nrows, 1:nrows))
colData <- DataFrame(Treatment=1:ncols, row.names=LETTERS[1:ncols])
sset <- SummarizedExperiment(counts, rowData=rowData, colData=colData)
sset

class: SummarizedExperiment
dim: 5 4
exptData(0):
assays(1): ''
rownames: NULL
rowData metadata column names(0):
colnames(4): A B C D
colData names(1): Treatment


cbind(sset, sset)

dim: 5 8
exptData(0):
assays(0):
rownames: NULL
rowData metadata column names(0):
colnames(8): A B ... C1 D1
colData names(1): Treatment

Upon cbind'ing, the assays in the SE object are lost. I think this is
due to the fact that the cbind code matches up assays by their names.
Thus, if there are no names, the code assumes that there are no assays.

I guess this could be prevented by enforcing naming of assays in the
SummarizedExperiment constructor. Or, the binding code could be modified
to work positionally when there are no assay names, e.g., by cbind'ing
the first assays across all SE objects, then the second assays, etc.
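
A sketch of that positional alternative (assumes every object carries
the same number of assays and that GenomicRanges is loaded as above; the
helper name is made up):

cbindAssaysByPosition <- function(...) {
    ses <- list(...)
    n <- length(assays(ses[[1]]))
    ## bind the i-th assay across objects, ignoring names entirely
    lapply(seq_len(n), function(i)
        do.call(cbind, lapply(ses, assay, i)))
}
cbindAssaysByPosition(sset, sset)  # one 5 x 8 matrix, built positionally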

Any thoughts?

Regards,

Aaron


sessionInfo()

R Under development (unstable) (2014-12-14 r67167)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
  [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils
datasets
[8] methods   base

other attached packages:
[1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41
[4] S4Vectors_0.5.21  BiocGenerics_0.13.6

loaded via a namespace (and not attached):
[1] XVector_0.7.4


__
The information in this email is confidential and inte...{{dropped:15}}


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] requirement for named assays in SummarizedExperiment

2015-03-06 Thread Valerie Obenchain

Hi Aaron,

Thanks for catching this.

I favor enforcing names in 'assays'. Combining by position alone is too 
dangerous. I'm thinking of the VCF class where the genome information is 
stored in 'assays' and the fields are rarely in the same order.


Looks like we also need a more informative error message when names 
don't match.


> assays(se1)
List of length 1
names(1): counts1

> assays(se2)
List of length 1
names(1): counts2

> cbind(se1, se2)
Error in sQuote(accessorName) :
  argument "accessorName" is missing, with no default


Valerie


On 03/05/2015 11:09 PM, Aaron Lun wrote:

Dear all,

I stumbled upon some unexpected behaviour with cbind'ing
SummarizedExperiment objects with unnamed assays:


require(GenomicRanges)
nrows <- 5; ncols <- 4
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
rowData <- GRanges("chr1", IRanges(1:nrows, 1:nrows))
colData <- DataFrame(Treatment=1:ncols, row.names=LETTERS[1:ncols])
sset <- SummarizedExperiment(counts, rowData=rowData, colData=colData)
sset

class: SummarizedExperiment
dim: 5 4
exptData(0):
assays(1): ''
rownames: NULL
rowData metadata column names(0):
colnames(4): A B C D
colData names(1): Treatment


cbind(sset, sset)

dim: 5 8
exptData(0):
assays(0):
rownames: NULL
rowData metadata column names(0):
colnames(8): A B ... C1 D1
colData names(1): Treatment

Upon cbind'ing, the assays in the SE object are lost. I think this is
due to the fact that the cbind code matches up assays by their names.
Thus, if there are no names, the code assumes that there are no assays.

I guess this could be prevented by enforcing naming of assays in the
SummarizedExperiment constructor. Or, the binding code could be modified
to work positionally when there are no assay names, e.g., by cbind'ing
the first assays across all SE objects, then the second assays, etc.

Any thoughts?

Regards,

Aaron


sessionInfo()

R Under development (unstable) (2014-12-14 r67167)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
  [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils
datasets
[8] methods   base

other attached packages:
[1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41
[4] S4Vectors_0.5.21  BiocGenerics_0.13.6

loaded via a namespace (and not attached):
[1] XVector_0.7.4


__
The information in this email is confidential and inte...{{dropped:15}}


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-06 Thread Valerie Obenchain

Hi Mike,

Our error - we didn't bump GenomicRanges when rowRanges was added. 
Hopefully 1.19.43 will propagate today and things will be sorted out.


Val


On 03/06/2015 07:40 AM, Michael Love wrote:

hi all,

just a practical issue: I have GenomicRanges version 1.19.42 on my
computer which does not have rowRanges defined, although the 1.19.42
version on the Bioc website does have rowRanges in the man page:

http://master.bioconductor.org/packages/3.1/bioc/html/GenomicRanges.html

So I pass check locally but not in the devel branch on Bioc servers.


library(GenomicRanges)
rowRanges

Error: object 'rowRanges' not found

sessionInfo()

R Under development (unstable) (2014-12-08 r67137)
Platform: x86_64-apple-darwin12.5.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4parallel  stats graphics  grDevices datasets  utils
methods   base

other attached packages:
[1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41
S4Vectors_0.5.21
[5] BiocGenerics_0.13.6   RUnit_0.4.28  devtools_1.7.0knitr_1.9
[9] BiocInstaller_1.17.5



On Wed, Mar 4, 2015 at 3:03 PM, Martin Morgan  wrote:


On 03/04/2015 10:03 AM, Peter Haverty wrote:


Michael has a good point. The complexity of the BioC universe of classes
hurts our ability to attract new users. More classes would be a minus there
... but a small set of common, explicit APIs would simplify things.
Rectangular things implement the matrix Interface.  :-) Deprecating old
stuff, like eSet, might help more than it hurts, on the simplicity front.

P.S. apropos of understanding this universe of classes, I *love* the
methods(class=x) thing Vincent mentioned.



The current version, under R-devel, is at

   
devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4";)

   > methods(class="SummarizedExperiment")
[1] [ [[[[<-  [<-
[5] $ $<-   assay assay<-
[9] assayNamesassayNames<-  assaysassays<-
   [13] cbind coercecolData   colData<-
   [17] compare   Compare   countOverlaps coverage
   [21] dim   dimnames  dimnames<-disjointBins
   [25] distance  distanceToNearest duplicatedelementMetadata
   [29] elementMetadata<- end   end<- exptData
   [33] exptData<-extractROWS   findOverlaps  flank
   [37] followgranges   isDisjointmcols
   [41] mcols<-   narrownearest   order
   [45] overlapsAny   precede   rangesranges<-
   [49] rank  rbind replaceROWS   resize
   [53] restrict  rowData   rowData<- seqinfo
   [57] seqinfo<- seqnames  shift show
   [61] sort  split start start<-
   [65] strandstrand<-  subsetsubsetByOverlaps
   [69] updateObject  valuesvalues<-  width
   [73] width<-

   see ?"methods" for accessing help and source code

and


head(attr(methods(class="SummarizedExperiment"), "info"))

  generic visible
[,SummarizedExperiment,ANY-method  [TRUE
[[,SummarizedExperiment,ANY,missing-method[[TRUE
[[<-,SummarizedExperiment,ANY,missing-method[[<-TRUE
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method [<-TRUE
$,SummarizedExperiment-method  $TRUE
$<-,SummarizedExperiment-method  $<-TRUE
  isS4  from
[,SummarizedExperiment,ANY-methodTRUE GenomicRanges
[[,SummarizedExperiment,ANY,missing-method   TRUE GenomicRanges
[[<-,SummarizedExperiment,ANY,missing-method TRUE GenomicRanges
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method TRUE GenomicRanges
$,SummarizedExperiment-methodTRUE GenomicRanges
$<-,SummarizedExperiment-method  TRUE GenomicRanges

Martin



Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence 
wrote:


I think we need to make sure that there are enough benefits of something
like GRangesFrame before we introduce yet another complicated and
overlapping data structure into the framework. Prior to summarization, the
ranges seem primary, after summarization, it may often make sense for them
to be secondary. But I'm just not sure what we gain from a new data
structure.

On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès  wrote:


GRangesFrame is

Re: [Bioc-devel] GRanges to VRanges coercion

2015-02-23 Thread Valerie Obenchain

Hi Thomas, Michael,

makeVRangesFromGRanges has been added to 1.13.34. The 
methods-VRanges-class.R file was already large so I added 
makeVRangesFromGRanges.R.


A few notes:

- '...' was replaced with 'keep.extra.columns'

- row names are propagated automatically so I removed the explicit 
row.names(vr)


- Errors are thrown in 2 cases: when arguments are not a single string 
and if 'ref' can't be found. You may want to consider a warning if one 
of the 'field' args can't be found in the GRanges metadata.


- I have not added unit tests. If you have some I'll put them in.

- The function is documented in VRanges-class.Rd. If you want to make a 
stand alone man page similar to makeGRangesFromDataFrame I'll add that too.


Thanks for the contribution Thomas. Nice addition.

Valerie
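
For the archive, a small usage sketch along the lines of Thomas's
pseudocode (column names are illustrative; the default 'field' arguments
are assumed to pick up 'ref' and 'alt'):

library(VariantAnnotation)

df <- data.frame(seqnames = "chr1", start = 10, end = 10,
                 ref = "A", alt = "T", stringsAsFactors = FALSE)
gr <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
vr <- makeVRangesFromGRanges(gr)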



On 02/20/2015 04:17 PM, Michael Lawrence wrote:



On Thu, Feb 19, 2015 at 12:46 PM, Thomas Sandmann
<sandmann.tho...@gene.com> wrote:

Hi Valerie, hi Michael,

I find myself frequently moving back and forth between data.frames,
GRanges and VRanges objects.

The makeGRangesFromDataFrame function from the GenomicRanges package makes
the coercion between the former straightforward, but I couldn't find
anything similar for the second step, coercion from GRanges to VRanges.

There is a coercion method defined in the GenomicRanges package:

getMethod(coerce, c("GRanges", "VRanges"))
Method Definition:

function (from, to = "VRanges", strict = TRUE)
{
 obj <- new("VRanges")
 as(obj, "GRanges") <- from
 obj
}


Signatures:
 from  to
target  "GRanges" "VRanges"
defined "GRanges" "VRanges"

but I haven't been able to get it to work (or find where it is
documented). The source code shown above doesn't indicate how the
coercion method would check for the presence of required / optional
VRanges columns, e.g. 'ref', 'alt', 'altDepth', etc.


This is just the default coercion method added by the methods package
for a conversion of a class to its parent class. It obviously will not
do the right thing, in general.


Would it be useful to add an explicit makeVRangesFromGRanges
function to the VariantAnnotation package (and/or the
corresponding coercion method)?

Then it would be easy to go from a data.frame to a VRanges object,
e.g. as in this pseudocode:

makeVRangesFromGRanges(
makeGRangesFromDataFrame( data.frame )
)

You can find a first attempt at implementing the
makeVRangesFromGRanges function here, which you
are welcome to use / modify if you find it useful.

If this functionality should already be available, I'd be happy to
learn about that, too !


Val, do you think you could review and incorporate Thomas's code? It
seems like a good addition to me.

Thanks,
Michael

Thank you,
Thomas


SessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils
datasets  methods   base

other attached packages:
  [1] VariantAnnotation_1.12.9 Rsamtools_1.18.2
Biostrings_2.34.1XVector_0.6.0GenomicRanges_1.18.4
  [6] GenomeInfoDb_1.2.4   IRanges_2.0.1
  S4Vectors_0.4.0  BiocGenerics_0.12.1
  BiocInstaller_1.16.1
[11] roxygen2_4.1.0   devtools_1.7.0

loaded via a namespace (and not attached):
  [1] AnnotationDbi_1.28.1base64enc_0.1-2 BatchJobs_1.5
   BBmisc_1.9  Biobase_2.26.0
  [6] BiocParallel_1.0.3  biomaRt_2.22.0  bitops_1.0-6
  brew_1.0-6  BSgenome_1.34.1
[11] checkmate_1.5.1 codetools_0.2-10DBI_0.3.1
 digest_0.6.8fail_1.2
[16] foreach_1.4.2   GenomicAlignments_1.2.1
GenomicFeatures_1.18.3  iterators_1.0.7 Rcpp_0.11.4
[21] RCurl_1.95-4.5  RSQLite_1.0.0
rtracklayer_1.26.2  sendmailR_1.2-1 stringr_0.6.2
[26] tools_3.1.2 XML_3.98-1.1zlibbioc_1.12.0




___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] cryptic error in VariantAnnotation::locateVariants()

2015-02-16 Thread Valerie Obenchain

Hi,

Thanks for the report. Now fixed in 1.13.31.

This problem, like the last one you reported, was related to CDSID 
becoming an IntegerList. When region=AllVariants, results are combined 
and ordered by QUERYID, TXID, CDSID and GENEID. In your example ties 
were not broken by QUERYID or TXID and so CDSID was tested. order() does 
work on 'List' objects but they must be the only argument. The error was 
complaining that when order() has multiple arguments they can't be 
'List's. This is a bit of a special case and is why you haven't seen 
this error for all runs of locateVariants().


Now that CDSID is an IntegerList (holds multiple values) it doesn't make 
sense to include it in the ordering so I have removed it.
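
The order() behaviour described above, in isolation (a sketch mirroring
the reported error):

library(IRanges)

il <- IntegerList(3:1, integer(0), 5L)
order(il)          # a lone 'List' argument is supported
## order(1:3, il)  # errors: unimplemented type 'list' in 'listgreater'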


Valerie


On 02/16/2015 07:44 AM, Robert Castelo wrote:

hi Valerie,

i'm afraid i have hit another cryptic error in locateVariants() related
to some recent update of the devel version of the package, could you
take a look at it? (code below..)

thanks!!

robert.

==

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

gr <- GRanges("chrX", IRanges(start=128913961, width=1))

loc <- locateVariants(gr, txdb,
   AllVariants(intergenic=IntergenicVariants(0, 0)))
Error in ans[order(meta$QUERYID, meta$TXID, meta$CDSID, meta$GENEID),  :
   error in evaluating the argument 'i' in selecting a method for
function '[': Error in order(c(1L, 1L), c("76534", "76534"),
list(integer(0), integer(0)),  :
   unimplemented type 'list' in 'listgreater'
4: ans[order(meta$QUERYID, meta$TXID, meta$CDSID, meta$GENEID),
]
3: .local(query, subject, region, ...)
2: locateVariants(gr, txdb, AllVariants(intergenic = IntergenicVariants(0,
0)))
1: locateVariants(gr, txdb, AllVariants(intergenic = IntergenicVariants(0,
0)))
sessionInfo()
R Under development (unstable) (2014-10-14 r66765)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF8   LC_NUMERIC=C LC_TIME=en_US.UTF8
LC_COLLATE=en_US.UTF8
  [5] LC_MONETARY=en_US.UTF8LC_MESSAGES=en_US.UTF8
LC_PAPER=en_US.UTF8   LC_NAME=C
  [9] LC_ADDRESS=C  LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF8
LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices utils datasets
  methods   base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 GenomicFeatures_1.19.18
   AnnotationDbi_1.29.17
  [4] Biobase_2.27.1  VariantAnnotation_1.13.29
   Rsamtools_1.19.27
  [7] Biostrings_2.35.8   XVector_0.7.4
   GenomicRanges_1.19.37
[10] GenomeInfoDb_1.3.13 IRanges_2.1.38
S4Vectors_0.5.19
[13] BiocGenerics_0.13.4 vimcom_1.0-0
setwidth_1.0-3
[16] colorout_1.0-3

loaded via a namespace (and not attached):
  [1] base64enc_0.1-2  BatchJobs_1.5BBmisc_1.9
  BiocParallel_1.1.13
  [5] biomaRt_2.23.5   bitops_1.0-6 brew_1.0-6
  BSgenome_1.35.17
  [9] checkmate_1.5.1  codetools_0.2-10 DBI_0.3.1
  digest_0.6.8
[13] fail_1.2 foreach_1.4.2 GenomicAlignments_1.3.27
iterators_1.0.7
[17] RCurl_1.95-4.5   RSQLite_1.0.0 rtracklayer_1.27.7
sendmailR_1.2-1
[21] stringr_0.6.2tools_3.2.0  XML_3.98-1.1
 zlibbioc_1.13.1



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fredhutch.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] VariantAnnotation::isDelins() ??

2015-02-11 Thread Valerie Obenchain

Thanks Robert! It's been added to 1.13.29.

Valerie

On 02/11/2015 02:36 AM, Robert Castelo wrote:

sure, i'm attaching the patch created from a fresh checkout of the trunk
this morning. in principle, all required bits are there and it builds
and checks without errors and warnings.

cheers,

robert.

On 02/10/2015 07:37 PM, Valerie Obenchain wrote:

Hi Robert,

This sounds like a good addition. I'll put it on the TODO. If you need
this immediately I'd be happy to accept a patch (with unit tests).

Valerie



On 02/10/2015 06:29 AM, Robert Castelo wrote:

hi,

in the VariantAnnotation package, the help of the functions for
identifying variant types such as SNVs, insertions,
deletions, transitions, and structural rearrangements gives the
following definitions:


• isSNV: Reference and alternate alleles are both a single
nucleotide long.

• isInsertion: Reference allele is a single nucleotide and the
alternate allele is greater (longer) than a single nucleotide
and the first nucleotide of the alternate allele matches the
reference.

• isDeletion: Alternate allele is a single nucleotide and the
reference allele is greater (longer) than a single nucleotide
and the first nucleotide of the reference allele matches the
alternate.

• isIndel: The variant is either a deletion or insertion as
determined by ‘isDeletion’ and ‘isInsertion’.

 • isSubstitution: Reference and alternate alleles are the same
length (1 or more nucleotides long).

• isTransition: Reference and alternate alleles are both a
single nucleotide long. The reference-alternate pair
interchange is of either two-ring purines (A <-> G) or
one-ring pyrimidines (C <-> T).


however, unless I'm missing something here, these definitions do not
cover indels whose insertion or deletion involves more than one
reference or alternate nucleotide, respectively. this
could be an example of what i'm trying to say:

library(VariantAnnotation)

vr <- VRanges(seqnames = rep("chr1", times=5),
ranges = IRanges(seq(1, 10, by=20),
seq(1, 10, by=20)+c(1, 1, 2, 2, 3)),
ref = c("T", "A", "A", "AC", "AC"),
alt = c("C", "T", "AC", "AT", "ACC"),
refDepth = c(5, 10, 5, 10, 5),
altDepth = c(7, 6, 7, 6, 7),
totalDepth = c(12, 17, 12, 17, 12),
sampleNames = letters[1:5])

isSNV(vr)
## [1] TRUE TRUE FALSE FALSE FALSE
isIndel(vr)
## [1] FALSE FALSE TRUE FALSE FALSE
isSubstitution(vr)
## [1] TRUE TRUE FALSE TRUE FALSE

note that the last variant does not evaluate as true for any of the
three possibilities. after looking for variant definitions, i have found
that the Human Genome Variation Society (HGVS) describes this as a
deletion followed by an insertion and calls it "indel" or "delins" (it's
unclear to me whether they use that interchangeably), see the link here:

http://www.hgvs.org/mutnomen/recs-DNA.html#indel

the only other site I could quickly find with Google, where some
specific definition is given is the site of the software SnpEff, which
calls it "MIXED", a "Multiple-nucleotide and an InDel":

http://snpeff.sourceforge.net/SnpEff_manual.html

I would suggest that VariantAnnotation should try to identify this type
of variant. following the HGVS recommendations, could we maybe have a
function for it called isDelins() ??



cheers,

robert.
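
A sketch of what such a predicate could look like, applied to the 'vr'
object above (illustrative only; the definition eventually shipped may
differ):

## delins: ref and alt differ in length, but the variant is neither a
## simple insertion nor a simple deletion as defined above
isDelins <- function(x) {
    nchar(ref(x)) != nchar(alt(x)) &
        !isInsertion(x) & !isDeletion(x)
}
isDelins(vr)
## [1] FALSE FALSE FALSE FALSE  TRUE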

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel







___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] VariantAnnotation::isDelins() ??

2015-02-10 Thread Valerie Obenchain

Hi Robert,

This sounds like a good addition. I'll put it on the TODO. If you need 
this immediately I'd be happy to accept a patch (with unit tests).


Valerie



On 02/10/2015 06:29 AM, Robert Castelo wrote:

hi,

in the VariantAnnotation package, the help of the functions for
identifying variant types such as SNVs, insertions,
deletions, transitions, and structural rearrangements gives the
following definitions:


 • isSNV: Reference and alternate alleles are both a single
   nucleotide long.

 • isInsertion: Reference allele is a single nucleotide and the
   alternate allele is greater (longer) than a single nucleotide
   and the first nucleotide of the alternate allele matches the
   reference.

 • isDeletion: Alternate allele is a single nucleotide and the
   reference allele is greater (longer) than a single nucleotide
   and the first nucleotide of the reference allele matches the
   alternate.

 • isIndel: The variant is either a deletion or insertion as
   determined by ‘isDeletion’ and ‘isInsertion’.

 • isSubstitution: Reference and alternate alleles are the same
   length (1 or more nucleotides long).

 • isTransition: Reference and alternate alleles are both a
   single nucleotide long.  The reference-alternate pair
   interchange is of either two-ring purines (A <-> G) or
   one-ring pyrimidines (C <-> T).


however, unless I'm missing something here, these definitions do not
cover indels whose insertion or deletion involves more than one
reference or alternate nucleotide, respectively. this
could be an example of what i'm trying to say:

library(VariantAnnotation)

vr <- VRanges(seqnames = rep("chr1", times=5),
   ranges = IRanges(seq(1, 10, by=20),
seq(1, 10, by=20)+c(1, 1, 2, 2, 3)),
   ref = c("T", "A",  "A", "AC",  "AC"),
   alt = c("C", "T", "AC", "AT", "ACC"),
   refDepth = c(5, 10, 5, 10, 5),
   altDepth = c(7, 6, 7, 6, 7),
   totalDepth = c(12, 17, 12, 17, 12),
   sampleNames = letters[1:5])

isSNV(vr)
## [1]  TRUE  TRUE FALSE FALSE FALSE
isIndel(vr)
## [1] FALSE FALSE  TRUE FALSE FALSE
isSubstitution(vr)
## [1]  TRUE  TRUE FALSE  TRUE FALSE

note that the last variant does not evaluate as true for any of the
three possibilities. after looking for variant definitions, i have found
that the Human Genome Variation Society (HGVS) describes this as a
deletion followed by an insertion and calls it "indel" or "delins" (it's
unclear to me whether they use that interchangeably), see the link here:

http://www.hgvs.org/mutnomen/recs-DNA.html#indel

the only other site I could quickly find with Google, where some
specific definition is given is the site of the software SnpEff, which
calls it "MIXED", a "Multiple-nucleotide and an InDel":

http://snpeff.sourceforge.net/SnpEff_manual.html

I would suggest that VariantAnnotation should try to identify this type
of variant. following the HGVS recommendations, could we maybe have a
function for it called isDelins() ??



cheers,

robert.

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] subtle error with VariantAnnotation::locateVariants()

2015-02-03 Thread Valerie Obenchain

Hi Robert,

Thanks for reporting this. Now fixed in 1.13.27.

I introduced this bug when switching the coordinate mapping over from 
mapCoords() to the new mapToTranscripts() family. The output from the 
new mapper is cleaner and allowed me to tidy some old code. In the 
process I realized there could be cases with >1 CDSID per row and 
changed the data type from 'integer' to 'IntegerList'. I thought I 
changed the default in all places but obviously missed a couple. The 
specific bug you hit was complaining that GRanges with different types 
for CDSID could not be combined.
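The type clash is easy to reproduce in isolation; a small sketch based
on the traceback below (the exact classes involved are an assumption
drawn from the error message):

library(IRanges)

## combining a CompressedList column with a plain NA integer fails,
## which is what happens when mcols with mixed CDSID types are combined
c(IntegerList(1:2), NA_integer_)
## Error: all arguments in '...' must be CompressedList objects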


Valerie

On 02/03/2015 05:57 AM, Robert Castelo wrote:

hi,

VariantAnnotation::locateVariants() is breaking in a very specific way
in its current devel version (1.13.26). Since I'm using it while working
on VariantFiltering, I have reverted my copy of VariantAnnotation to
version 1.13.24 to be able to continue working.

so at the moment this is not much of a problem for me, but I thought you
might be interested in knowing about it, just in case you were not aware
of it. This is the minimal example I've come up reproducing it:

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

gr <- GRanges(rep("chr20", 1), IRanges(start=44501458, width=1))

loc <- locateVariants(gr, txdb,
   AllVariants(intergenic=IntergenicVariants(0, 0)))
Error in .Primitive("c")(,  :
   all arguments in '...' must be CompressedList objects
traceback()
23: stop("all arguments in '...' must be CompressedList objects")
22: .Primitive("c")(,
 NA_integer_)
21: .Primitive("c")(,
 NA_integer_)
20: do.call(c, unname(cols))
19: do.call(c, unname(cols))
18: FUN(c("LOCATION", "LOCSTART", "LOCEND", "QUERYID", "TXID", "CDSID",
 "GENEID", "PRECEDEID", "FOLLOWID")[[6L]], ...)
17: lapply(colnames(df), function(cn) {
 cols <- lapply(args, `[[`, cn)
 isRle <- vapply(cols, is, logical(1L), "Rle")
 if (any(isRle) && !all(isRle)) {
 cols[isRle] <- lapply(cols[isRle], S4Vectors:::decodeRle)
 }
 isFactor <- vapply(cols, is.factor, logical(1L))
 if (any(isFactor)) {
 cols <- lapply(cols, as.factor)
 levs <- unique(unlist(lapply(cols, levels), use.names =
FALSE))
 cols <- lapply(cols, factor, levs)
 }
 rectangular <- length(dim(cols[[1]])) == 2L
 if (rectangular) {
 combined <- do.call(rbind, unname(cols))
 }
 else {
 combined <- do.call(c, unname(cols))
 }
 if (any(isFactor))
 combined <- structure(combined, class = "factor", levels =
levs)
 combined
 })
16: lapply(colnames(df), function(cn) {
 cols <- lapply(args, `[[`, cn)
 isRle <- vapply(cols, is, logical(1L), "Rle")
 if (any(isRle) && !all(isRle)) {
 cols[isRle] <- lapply(cols[isRle], S4Vectors:::decodeRle)
 }
 isFactor <- vapply(cols, is.factor, logical(1L))
 if (any(isFactor)) {
 cols <- lapply(cols, as.factor)
 levs <- unique(unlist(lapply(cols, levels), use.names =
FALSE))
 cols <- lapply(cols, factor, levs)
 }
 rectangular <- length(dim(cols[[1]])) == 2L
 if (rectangular) {
 combined <- do.call(rbind, unname(cols))
 }
 else {
 combined <- do.call(c, unname(cols))
 }
 if (any(isFactor))
 combined <- structure(combined, class = "factor", levels =
levs)
 combined
 })
15: .Method(..., deparse.level = deparse.level)
14: eval(expr, envir, enclos)
13: eval(.dotsCall, env)
12: eval(.dotsCall, env)
11: standardGeneric("rbind")
10: (function (..., deparse.level = 1)
 standardGeneric("rbind"))(, ,
 , ,
 , ,
 )
9: do.call(rbind, lapply(x, mcols, FALSE))
8: do.call(rbind, lapply(x, mcols, FALSE))
7: .unlist_list_of_GenomicRanges(args, ignore.mcols = ignore.mcols)
6: .local(x, ..., recursive = recursive)
5: c(coding, intron, fiveUTR, threeUTR, splice, promoter, intergenic)
4: c(coding, intron, fiveUTR, threeUTR, splice, promoter, intergenic)
3: .local(query, subject, region, ...)
2: locateVariants(gr, txdb, AllVariants(intergenic = IntergenicVariants(0,
0)))
1: locateVariants(gr, txdb, AllVariants(intergenic = IntergenicVariants(0,
0)))

interestingly, if i replace the chromosome name in the GRanges object of
'chr20' by 'chr2', then it works:

gr <- GRanges(rep("chr2", 1), IRanges(start=44501458, width=1))
loc <- locateVariants(gr, txdb,
   AllVariants(intergenic=IntergenicVariants(0, 0)))

or if i replace the start of the ranges of 44501458 by 1, then it also
works:

gr <- GRanges(rep("chr20", 1), IRanges(start=1, width=1))
loc <- locateVariants(gr, txdb,
   AllVariants(intergenic=IntergenicVariants(0, 0)))

here is the sessionInfo():

R Under dev

[Bioc-devel] January 2015 newsletter

2015-01-05 Thread Valerie Obenchain

Hi all,

Happy 2015. The January 2015 Newsletter is now available:

http://www.bioconductor.org/help/newsletters/2015_January/

Highlights include the (experimental) work being done with Docker 
containers, coordinate mapping, the algorithmic changes behind the 
overlaps operations and a review of the 'csaw' package.



Valerie

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] bamFlagAsBitMatrix error (?bug)

2014-12-14 Thread Valerie Obenchain

This should be fixed now (thanks Martin).

We recently deprecated the flag 'isNotPrimaryRead' in favor of 
'isNotPrimaryAlignment' (Rsamtools devel). This affected 
GenomicAlignments and others ...


~/b/Rpacks$ grep -lr isNotPrimaryRead *
CoverageView/R/cov.matrix.R
CoverageView/R/cov.interval.R
gage/vignettes/RNA-seqWorkflow.Rnw
hiReadsProcessor/R/hiReadsProcessor.R
QDNAseq/man/binReadCounts.Rd
QDNAseq/R/binReadCounts.R
Rsamtools/NEWS
Rsamtools/man/ScanBamParam-class.Rd
Rsamtools/R/methods-ScanBamParam.R
SplicingGraphs/inst/unitTests/test_countReads-methods.R
SplicingGraphs/inst/scripts/TSPC-utils.R
SplicingGraphs/vignettes/SplicingGraphs.Rnw
SplicingGraphs/man/assignReads.Rd
SplicingGraphs/man/txpath-methods.Rd
systemPipeR/R/utilities.R


These packages should be clear (of this error) on tomorrow's builds.
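For maintainers of the affected packages, the fix is a one-argument
rename when building the flag; a minimal sketch (argument names taken
from the deprecation note above):

library(Rsamtools)

## deprecated spelling:
## flag <- scanBamFlag(isNotPrimaryRead = FALSE)

## current spelling in Rsamtools devel:
flag <- scanBamFlag(isNotPrimaryAlignment = FALSE)
param <- ScanBamParam(flag = flag)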

Thanks.
Val

On 12/14/2014 09:15 AM, Valerie Obenchain wrote:

Hi Sean,

Yes, we are aware of the problem, thanks. Hopefully this will be
resolved for tomorrows builds ... we'll post back when it's fixed.

Val

On 12/14/2014 08:33 AM, Sean Davis wrote:

Hi, Martin, Val, and Herve.

This looks like a little problem with the bitnames in
Rsamtools/GenomicAlignments.  Perhaps this is related to some bitnames
being deprecated?

Thanks,
Sean



library(GenomicAlignments)
sbp = ScanBamParam(which=cds[1:100],flag=scanBamFlag(isDuplicate =
FALSE))
x = readGAlignmentPairs(LOCALBAMS[180],param=sbp)

Error in bamFlagAsBitMatrix(flag1, bitnames = "isNotPrimaryRead") :
   invalid bitname(s): isNotPrimaryRead

traceback()

8: stop("invalid bitname(s): ", in1string)
7: bamFlagAsBitMatrix(flag1, bitnames = "isNotPrimaryRead")
6: .make_GAlignmentPairs_from_GAlignments(gal, use.mcols = use.mcols)
5: readGAlignmentPairsFromBam(bam, character(), use.names = use.names,
param = param, with.which_label = with.which_label)
4: readGAlignmentPairsFromBam(bam, character(), use.names = use.names,
param = param, with.which_label = with.which_label)
3: readGAlignmentPairsFromBam(file = file, use.names = use.names,
...)
2: readGAlignmentPairsFromBam(file = file, use.names = use.names,
...)
1: readGAlignmentPairs(LOCALBAMS[180], param = sbp)

sessionInfo()

R Under development (unstable) (2014-11-18 r66997)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils
[8] methods   base

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 GenomicFeatures_1.19.7    AnnotationDbi_1.29.11
 [4] GenomicAlignments_1.3.14                stringr_0.6.2             plotrix_3.5-10
 [7] pd.huex.1.0.st.v2_3.10.0                RSQLite_1.0.0             DBI_0.3.1
[10] limma_3.23.2                            oligo_1.31.0              Biobase_2.27.0
[13] oligoClasses_1.29.3                     knitr_1.8                 VariantAnnotation_1.13.19
[16] Rsamtools_1.19.15                       Biostrings_2.35.7         XVector_0.7.3
[19] GenomicRanges_1.19.21                   GenomeInfoDb_1.3.7        IRanges_2.1.33
[22] S4Vectors_0.5.14                        BiocGenerics_0.13.3       BiocInstaller_1.17.1

loaded via a namespace (and not attached):
 [1] BBmisc_1.8            BSgenome_1.35.8       BatchJobs_1.5         BiocParallel_1.1.9    RCurl_1.95-4.5        XML_3.98-1.1
 [7] affxparser_1.39.3-1   affyio_1.35.0         base64enc_0.1-2       biomaRt_2.23.5        bit_1.1-12            bitops_1.0-6
[13] brew_1.0-6            checkmate_1.5.1       codetools_0.2-9       digest_0.6.6          evaluate_0.5.5        fail_1.2
[19] ff_2.2-13             foreach_1.4.2         formatR_1.0           htmltools_0.2.6       iterators_1.0.7       knitrBootstrap_1.0.0
[25] markdown_0.7.4        preprocessCore_1.29.0 rmarkdown_0.3.3       rtracklayer_1.27.6    sendmailR_1.2-1       splines_3.2.0
[31] tools_3.2.0           yaml_2.1.13           zlibbioc_1.13.0

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] bamFlagAsBitMatrix error (?bug)

2014-12-14 Thread Valerie Obenchain

Hi Sean,

Yes, we are aware of the problem, thanks. Hopefully this will be 
resolved for tomorrows builds ... we'll post back when it's fixed.


Val

On 12/14/2014 08:33 AM, Sean Davis wrote:

Hi, Martin, Val, and Herve.

This looks like a little problem with the bitnames in
Rsamtools/GenomicAlignments.  Perhaps this is related to some bitnames
being deprecated?

Thanks,
Sean



library(GenomicAlignments)
sbp = ScanBamParam(which=cds[1:100],flag=scanBamFlag(isDuplicate = FALSE))
x = readGAlignmentPairs(LOCALBAMS[180],param=sbp)

Error in bamFlagAsBitMatrix(flag1, bitnames = "isNotPrimaryRead") :
   invalid bitname(s): isNotPrimaryRead

traceback()

8: stop("invalid bitname(s): ", in1string)
7: bamFlagAsBitMatrix(flag1, bitnames = "isNotPrimaryRead")
6: .make_GAlignmentPairs_from_GAlignments(gal, use.mcols = use.mcols)
5: readGAlignmentPairsFromBam(bam, character(), use.names = use.names,
param = param, with.which_label = with.which_label)
4: readGAlignmentPairsFromBam(bam, character(), use.names = use.names,
param = param, with.which_label = with.which_label)
3: readGAlignmentPairsFromBam(file = file, use.names = use.names,
...)
2: readGAlignmentPairsFromBam(file = file, use.names = use.names,
...)
1: readGAlignmentPairs(LOCALBAMS[180], param = sbp)

sessionInfo()

R Under development (unstable) (2014-11-18 r66997)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils
[8] methods   base

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0 GenomicFeatures_1.19.7    AnnotationDbi_1.29.11
 [4] GenomicAlignments_1.3.14                stringr_0.6.2             plotrix_3.5-10
 [7] pd.huex.1.0.st.v2_3.10.0                RSQLite_1.0.0             DBI_0.3.1
[10] limma_3.23.2                            oligo_1.31.0              Biobase_2.27.0
[13] oligoClasses_1.29.3                     knitr_1.8                 VariantAnnotation_1.13.19
[16] Rsamtools_1.19.15                       Biostrings_2.35.7         XVector_0.7.3
[19] GenomicRanges_1.19.21                   GenomeInfoDb_1.3.7        IRanges_2.1.33
[22] S4Vectors_0.5.14                        BiocGenerics_0.13.3       BiocInstaller_1.17.1

loaded via a namespace (and not attached):
 [1] BBmisc_1.8            BSgenome_1.35.8       BatchJobs_1.5         BiocParallel_1.1.9    RCurl_1.95-4.5        XML_3.98-1.1
 [7] affxparser_1.39.3-1   affyio_1.35.0         base64enc_0.1-2       biomaRt_2.23.5        bit_1.1-12            bitops_1.0-6
[13] brew_1.0-6            checkmate_1.5.1       codetools_0.2-9       digest_0.6.6          evaluate_0.5.5        fail_1.2
[19] ff_2.2-13             foreach_1.4.2         formatR_1.0           htmltools_0.2.6       iterators_1.0.7       knitrBootstrap_1.0.0
[25] markdown_0.7.4        preprocessCore_1.29.0 rmarkdown_0.3.3       rtracklayer_1.27.6    sendmailR_1.2-1       splines_3.2.0
[31] tools_3.2.0           yaml_2.1.13           zlibbioc_1.13.0

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] VariantAnnotation: VCF to VRanges with multiple INFO values

2014-12-11 Thread Valerie Obenchain

Now in IRanges 2.0.1 and VariantAnnotation 1.12.7.

Valerie


On 12/11/2014 09:32 AM, Julian Gehring wrote:

Can you backport the fixes to bioc-release which is also affected?

Best
Julian


Valerie Obenchain (12/09/14 03:24):


Thanks to Michael and Julian for taking care of this. Fixes are in
devel, >= 1.13.15.

Valerie


On 12/03/14 08:44, Michael Lawrence wrote:

Looks like an issue when expand()ing the VCF. Maybe Val could take a look?

On Wed, Dec 3, 2014 at 7:39 AM, Julian Gehring 
wrote:


Hi,

The conversion from a 'VCF' to 'VRanges' object fails if an INFO field
with multiple values for different ALT alleles is present:

Here an example VCF entry for which this fails (line 71151250 in
'ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz'
, taken from

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
):

17  10001541  rs12451372  C   G,T 100 PASS

AC=700,298;AF=0.139776,0.0595048;AN=5008;NS=2504;DP=17289;EAS_AF=0.2421,0;AMR_AF=0.1801,0.0115;AFR_AF=0.0749,0.2194;EUR_AF=0.0915,0;SAS_AF=0.1431,0;AA=T|||

The respective code to reproduce this:

library(VariantAnnotation)
roi = GRanges("17", IRanges(1e7+1541, width = 1))
vcf = readVcf(path, "GRCh37", ScanVcfParam(which = roi, info = "AF"))
## 'info = character()' and other versions also cause the error

vrc = as(vcf, "VRanges") ## error

fails with

Error in colSums(ielt) : 'x' must be an array of at least two dimensions

This occurs both with the latest version of VariantAnnotation in
bioc-release and bioc-devel.

Best
Julian

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] VariantAnnotation: VCF to VRanges with multiple INFO values

2014-12-08 Thread Valerie Obenchain
Thanks to Michael and Julian for taking care of this. Fixes are in 
devel, >= 1.13.15.


Valerie


On 12/03/14 08:44, Michael Lawrence wrote:

Looks like an issue when expand()ing the VCF. Maybe Val could take a look?

On Wed, Dec 3, 2014 at 7:39 AM, Julian Gehring 
wrote:


Hi,

The conversion from a 'VCF' to 'VRanges' object fails if an INFO field
with multiple values for different ALT alleles is present:

Here an example VCF entry for which this fails (line 71151250 in
'ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz'
, taken from

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf.gz
):

   17  10001541  rs12451372  C   G,T 100 PASS

AC=700,298;AF=0.139776,0.0595048;AN=5008;NS=2504;DP=17289;EAS_AF=0.2421,0;AMR_AF=0.1801,0.0115;AFR_AF=0.0749,0.2194;EUR_AF=0.0915,0;SAS_AF=0.1431,0;AA=T|||

The respective code to reproduce this:

   library(VariantAnnotation)
   roi = GRanges("17", IRanges(1e7+1541, width = 1))
   vcf = readVcf(path, "GRCh37", ScanVcfParam(which = roi, info = "AF"))
   ## 'info = character()' and other versions also cause the error

   vrc = as(vcf, "VRanges") ## error

fails with

   Error in colSums(ielt) : 'x' must be an array of at least two dimensions

This occurs both with the latest version of VariantAnnotation in
bioc-release and bioc-devel.

Best
Julian

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] VariantAnnotation: Harmonize default readVcf params

2014-12-08 Thread Valerie Obenchain

(Resend - last message didn't post to the list.)

OK, sounds good. In 1.13.18 I've deprecated VRangesScanVcfParam() and 
changed the signature of readVcfAsVRanges() to use ScanVcfParam().
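A sketch of the revised call, assuming VariantAnnotation >= 1.13.18 (the
genome label here is an arbitrary example):

library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
## one ScanVcfParam now drives both readVcf() and readVcfAsVRanges();
## 'info = NA' means read no info fields
param <- ScanVcfParam(fixed = "ALT", info = NA, geno = "AD")
vr <- readVcfAsVRanges(fl, "hg19", param = param)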


If at some point we want to revise the defaults for ScanVcfParam() to a 
minimal subset I'm fine with that.


Valerie



On 12/08/14 13:32, Michael Lawrence wrote:

The reason why 'y' and 'z' are different is the same reason why
readVcfAsVRanges exists. It was meant to be a convenience so that users
could get the minimal information needed from the VCF. But maybe that
was just being too helpful, and the user ends up confused. So I agree
with Julian that we should just drop VRangesScanVcfParam and have
readVcfAsVRanges just be an alternative syntax to as(readVcf(), "VRanges").

Michael

On Mon, Dec 8, 2014 at 10:48 AM, Valerie Obenchain
<voben...@fredhutch.org> wrote:

Michael, how would you feel about dropping VRangesScanVcfParam? I'm
open to changing the defaults in ScanVcfParam; the current 'read all
fields' default is probably not the best approach.

Valerie



On 12/04/2014 02:47 AM, Julian Gehring wrote:

Hi,

Can we harmonize the default parameters for =ScanVcfParam= and
=VRangesScanVcfParam=?  It even seems that we could drop
=VRangesScanVcfParam= since it is mainly a wrapper for
=ScanVcfParam=.

Currently, the defaults for importing fields from a VCF are:

ScanVcfParam: fixed = character(), info = character(), geno =
character()

VRangesScanVcfParam: fixed = "ALT", info = NA, geno = "AD"

When using

readVcfAsVRanges(vcf_path, genome_name)

with default parameters, that yields a VRanges object with only the 'AD'
metadata column.  If 'AD' is not present in the VCF file (which is
perfectly fine because it is not essential), it throws a warning.

My main motivation behind all of this is that I would expect

x = readVcf(vcf_path, genome_name)
y = as(x, "VRanges")

and

z = readVcfAsVRanges(vcf_path, genome_name)

to give an equal object.  I added some code below to make the
case more
concrete:

library(VariantAnnotation)

vcf_path = system.file("extdata", "ex2.vcf",
package="VariantAnnotation")

## read VRanges (implicit conversion)
z = readVcfAsVRanges(vcf_path, "ncbi37")

## read VCF, convert to VRanges (explicitly)
x = readVcf(vcf_path, "ncbi37")
y = as(x, "VRanges")

## harmonize it
vr_param = VRangesScanVcfParam(fixed = character(), info =
character(), geno = character())

z2 = readVcfAsVRanges(vcf_path, "ncbi37", param = vr_param)

all.equal(unname(y), unname(z2))


Best
Julian

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel





___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] VariantAnnotation: Same locus, multiple samples

2014-12-08 Thread Valerie Obenchain

(Resending - the last message didn't post to the list.)

I was thinking the absence of a header in VRanges would make collapsing 
difficult and with your comments it's clear this isn't a good idea.


I like the description you gave of the differences in class content and 
geometry and have added them to the VRanges man page.


Valerie


On 12/08/14 13:25, Michael Lawrence wrote:

I don't see how this can be fixed. The two data structures are
semantically incompatible; they encode different types of information,
so information is lost in both directions. Even if we collapsed the
alts, there is no way (as far as I know) to say that data for one
individual + alt combination is absent. We could put NA (".") for every
value concerning that alt, but it seems too big of an assumption to say
that all(is.na()) implies omission of the VRanges
element. In other words, VCF is rectangular and VRanges is ragged, and
there is no established way to encode the raggedness in the VCF.



On Mon, Dec 8, 2014 at 11:27 AM, Valerie Obenchain
<voben...@fredhutch.org> wrote:

This could be fixed in the VRanges -> VCF coercion or in VCF -> VRanges.

Currently VRanges -> VCF creates a VCF with >1 row per position (ie,
does not collapse ALT values). I'm not sure this is technically
valid as per the specs, however, it may have been by design to meet
another need. If we are ok with >1 row per position the change can
be made in VCF -> VRanges.

Opinions?

Valerie



On 12/05/2014 01:18 AM, Julian Gehring wrote:

Hi,

Assume that we have two variants from two samples at the same locus,
stored in a 'VRanges' or 'VCF' object:

library(VariantAnnotation)

vr = VRanges("1", IRanges(c(10, 10), width = 1),
  ref = c("C", "C"), alt = c("A", "G"),
  sampleNames = c("S1", "S2"))
vcf = as(vr, "VCF")

If we convert the VCF to a VRanges, we now get each variant in each
patient:

vr2 = as(vcf, "VRanges")

length(vr) ## 2
length(vr2) ## 4

It seems that the VCF object does not store the information of the
'sampleNames' in the first conversion.

Best wishes
Julian

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel





___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] GenomicFiles: chunking

2014-10-27 Thread Valerie Obenchain
Sounds great. I think GenomicFiles is a good place for such a function - 
it's along the lines of what we wanted to accomplish with pack / unpack.


Maybe your new function can be used by pack once finished. There's 
definitely room for expanding that functionality.


Valerie



On 10/27/2014 12:07 PM, Michael Love wrote:

hi Valerie,

this sounds good to me.

I am thinking of working on a function (here or elsewhere) that helps
decide, for reduce by range, how to optimally chunk GRanges into a
GRangesList. Practically, this could involve sampling the size of the
imported data for a subset of cells in the (ranges, files) matrix, and
asking/estimating the amount of memory available for each worker.

best,

Mike


On Mon, Oct 27, 2014 at 2:35 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:

Hi Kasper and Mike,

I've added 2 new functions to GenomicFiles and deprecated the old
classes. The vignette now has a graphic which (hopefully) clarifies
the MAP / REDUCE mechanics of the different functions.

Below is some performance testing for the new functions and answers
to leftover questions from previous emails.


Major changes to GenomicFiles (in devel):

- *FileViews classes have been deprecated:

The idea is to use the GenomicFiles class to hold any type of file
be it BAM, BigWig, character vector etc. instead of having
name-specific classes like BigWigFileViews. Currently GenomicFiles
does not inherit from SummarizedExperiment but it may in the future.

- Add reduceFiles() and reduceRanges():

These functions pass all ranges or files to MAP vs the lapply
approach taken in reduceByFile() and reduceByRange().


(1) Performance:

When testing with reduceByFile() you noted "GenomicFiles is 10-20x
slower than the straightforward approach". You also noted this was
probably because of the lapply over all ranges - true. (Most likely
there was overhead in creating the SE object as well.) With the new
reduceFiles(), passing all ranges at once, we see performance very
similar to that of the 'by hand' approach.
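The call shape, as a sketch (the MAP body is a hypothetical counting
function, and the argument order follows the reduceByFile() family:
ranges first, then files):

library(GenomicFiles)
library(Rsamtools)

MAP <- function(ranges, file, ...) {
    ## one query per file, covering all ranges at once
    countBam(file, param = ScanBamParam(which = ranges))
}
res <- reduceFiles(grs, fls, MAP)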

In the test code I've used Bam instead of BigWig. Both test
functions output lists, have comparable MAP and REDUCE steps etc.

I used 5 files ('fls') and a GRanges ('grs') of length 100.
 > length(grs)
[1] 100

 > sum(width(grs))
[1] 100

FUN1 is the 'by hand' version. These results are similar to what you
saw, not quite a 4x difference between 10 and 100 ranges.

 >> microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
 > Unit: seconds
 >                   expr      min       lq     mean   median       uq      max neval
 >   FUN1(grs[1:10], fls) 1.177858 1.190239 1.206127 1.201331     1.98 1.256741    10
 >  FUN1(grs[1:100], fls) 4.145503 4.163404 4.249619 4.208486 4.278463 4.533846    10

FUN2 is the reduceFiles() approach and the results are very similar
to FUN1.

 >> microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
 > Unit: seconds
 >                   expr      min       lq     mean   median       uq      max neval
 >   FUN2(grs[1:10], fls) 1.242767 1.251188 1.257531 1.253154 1.267655 1.275698    10
 >  FUN2(grs[1:100], fls) 4.251010 4.340061 4.390290 4.361007 4.384064 4.676068    10


(2) Request for "chunking of the mapping of ranges":

For now we decided not to add a 'yieldSize' argument for chunking.
There are 2 approaches to chunking through ranges *within* the same
file. In both cases the user splits the ranges, either before calling
the function or in the MAP step.

i) reduceByFile() with a GRangesList:

The user provides a GRangesList as the 'ranges' arg. On each worker,
lapply applies MAP to the one file and all elements of the GRangesList.

ii) reduceFiles() with a MAP that handles chunking:

The user splits ranges in MAP and uses lapply or another loop to
iterate. For example,

MAP <- function(range, file, ...) {
    lst <- split(range, someFactor)   # chunk the ranges by some factor
    someFUN <- function(rngs, file, ...) {
        ## do something with one chunk of ranges
    }
    lapply(lst, FUN = someFUN, file = file, ...)
}

The same ideas apply for chunking through ranges *across* files with
reduceByRange() and reduceRanges().

iii) reduceByRange() with a GRangesList:

Mike has a good example here:
https://gist.github.com/mikelove/deaff84dc75f125d

iv) reduceRanges():

'ranges' should be a GRangesList. The MAP step will operate on an
element of the GRangesList and all files. Unless you want to operate
on all files at once I'd use reduceByRange() instead.

Re: [Bioc-devel] GenomicFiles: chunking

2014-10-27 Thread Valerie Obenchain

Looks like the test code didn't make it through. Attaching again ...


On 10/27/2014 11:35 AM, Valerie Obenchain wrote:

Hi Kasper and Mike,

I've added 2 new functions to GenomicFiles and deprecated the old
classes. The vignette now has a graphic which (hopefully) clarifies the
MAP / REDUCE mechanics of the different functions.

Below is some performance testing for the new functions and answers to
leftover questions from previous emails.


Major changes to GenomicFiles (in devel):

- *FileViews classes have been deprecated:

The idea is to use the GenomicFiles class to hold any type of file be it
BAM, BigWig, character vector etc. instead of having name-specific
classes like BigWigFileViews. Currently GenomicFiles does not inherit
from SummarizedExperiment but it may in the future.

- Add reduceFiles() and reduceRanges():

These functions pass all ranges or files to MAP vs the lapply approach
taken in reduceByFile() and reduceByRange().


(1) Performance:

When testing with reduceByFile() you noted "GenomicFiles is 10-20x
slower than the straightforward approach". You also noted this was
probably because of the lapply over all ranges - true. (Most likely
there was overhead in creating the SE object as well.) With the new
reduceFiles(), passing all ranges at once, we see performance very
similar to that of the 'by hand' approach.

In the test code I've used Bam instead of BigWig. Both test functions
output lists, have comparable MAP and REDUCE steps etc.

I used 5 files ('fls') and a GRanges ('grs') of length 100.
 > length(grs)
[1] 100

 > sum(width(grs))
[1] 100

FUN1 is the 'by hand' version. These results are similar to what you
saw, not quite a 4x difference between 10 and 100 ranges.

 >> microbenchmark(FUN1(grs[1:10], fls), FUN1(grs[1:100], fls), times=10)
 > Unit: seconds
 >                   expr      min       lq     mean   median       uq      max neval
 >   FUN1(grs[1:10], fls) 1.177858 1.190239 1.206127 1.201331     1.98 1.256741    10
 >  FUN1(grs[1:100], fls) 4.145503 4.163404 4.249619 4.208486 4.278463 4.533846    10

FUN2 is the reduceFiles() approach and the results are very similar to
FUN1.

 >> microbenchmark(FUN2(grs[1:10], fls), FUN2(grs[1:100], fls), times=10)
 > Unit: seconds
 >                   expr      min       lq     mean   median       uq      max neval
 >   FUN2(grs[1:10], fls) 1.242767 1.251188 1.257531 1.253154 1.267655 1.275698    10
 >  FUN2(grs[1:100], fls) 4.251010 4.340061 4.390290 4.361007 4.384064 4.676068    10


(2) Request for "chunking of the mapping of ranges":

For now we decided not to add a 'yieldSize' argument for chunking. There
are 2 approaches to chunking through ranges *within* the same file. In
both cases the user splits the ranges, either before calling the function
or in the MAP step.

i) reduceByFile() with a GRangesList:

The user provides a GRangesList as the 'ranges' arg. On each worker,
lapply applies MAP to the one file and all elements of the GRangesList.

ii) reduceFiles() with a MAP that handles chunking:

The user splits ranges in MAP and uses lapply or another loop to iterate.
For example,

MAP <- function(range, file, ...) {
    lst <- split(range, someFactor)   # chunk the ranges by some factor
    someFUN <- function(rngs, file, ...) {
        ## do something with one chunk of ranges
    }
    lapply(lst, FUN = someFUN, file = file, ...)
}

The same ideas apply for chunking through ranges *across* files with
reduceByRange() and reduceRanges().

iii) reduceByRange() with a GRangesList:

Mike has a good example here:
https://gist.github.com/mikelove/deaff84dc75f125d

iv) reduceRanges():

'ranges' should be a GRangesList. The MAP step will operate on an
element of the GRangesList and all files. Unless you want to operate on
all files at once I'd use reduceByRange() instead.


(3) Return objects have different shape:

Previous question:

"...
Why does the return object of reduceByFile vs reduceByRange (with
summarize = FALSE) different?  I understand why internally you have
different nesting schemes (files and ranges) for the two functions, but
it is not clear to me that it is desirable to have the return object
depend on how the computation was done.
..."

reduceByFile() and reduceFiles() output a list the same length as the
number of files while reduceByRange() and reduceRanges() output a list
the same length as the number of ranges.

Reduction is different depending on which function is chosen; data are
collapsed either within a file or across files. When REDUCE does
something substantial the outputs are not equivalent.

While it's possible to get the same result (REDUCE simply unlists or
isn't used), the two approaches were not intended to be equivalent ways
of arriving at the same end. The idea was that the user had a specific
use case in mind - they either wanted to collapse the data across or
within files.

Re: [Bioc-devel] GenomicFiles: chunking

2014-10-27 Thread Valerie Obenchain
(4) coverage() return value

Previous comment:

"...
coverage(BigWigFileViews) returns a "wrong" assay object in my opinion,
...
Specifically, each (i,j) entry in the object is an RleList with a single
element with a name equal to the seqnames of the i'th entry in the query
GRanges.  To me, this extra nestedness is unnecessary; I would have
expected an Rle instead of an RleList with 1 element.
..."

The return value from coverage(x) is an RleList with one coverage vector 
per seqlevel in 'x'. Even if there is only one seqlevel, the result 
still comes back as an RleList. This is just the default behavior.
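A minimal illustration of that default (any small GRanges will do):

library(GenomicRanges)

gr <- GRanges("chr1", IRanges(c(1, 4), width = 5))
coverage(gr)
## an RleList of length 1, named "chr1", even though
## there is only a single seqlevel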



(5) separate the 'read' function from the MAP step

Previous comment:

"...
Also, something completely different, it seems like it would be 
convenient for stuff like BigWigFileViews to not have to actually parse 
the file in the MAP step.  Somehow I would envision some kind of reading 
function, stored inside the object, which just returns an Rle when I ask 
for a (range, file).  Perhaps this is better left for later.

..."

The current approach for the reduce* functions is for MAP to both 
extract and manipulate data. The idea of separating the extraction step 
is actually implemented in reduceByYield(). (This function used to be 
yieldReduce() in Rsamtools in past releases.) For reduceByYield() the 
user must specify YIELD (a reader function), MAP, REDUCE and DONE 
(criteria to stop iteration).


I'm not sure what is best here. I thought the many-argument approach of 
reduceByYield() was possibly confusing or burdensome and so didn't use 
it in the other GenomicFiles functions. Maybe it's not confusing but 
instead makes the individual steps more clear. What do you think:


- Should the reader function be separate from the MAP? What are the 
advantages?


- Should READER, MAP, REDUCE be stored inside the GenomicFiles object or 
supplied as arguments to the functions?



(6) unnamed assay in SummarizedExperiment

Previous comment:

"...
The return object of reduceByRange / reduceByFile with summarize = TRUE 
is a SummarizedExperiment with an unnamed assay.  I was surprised to see 
that this is even possible.

..."

There is no default assay name for SummarizedExperiment in general. I've named 
the assay 'data' for lack of a better term. We could also go with 
'reducedData' or another suggestion.



Thanks for the feedback.

Valerie



On 10/01/2014 08:30 AM, Michael Love wrote:

hi Kasper,

For a concrete example, I posted a R and Rout file here:

https://gist.github.com/mikelove/deaff84dc75f125d

Things to note: 'ranges' is a GRangesList, I cbind() the numeric
vectors in the REDUCE, and then rbind() the final list to get the
desired matrix.

Other than the weird column name 'init', does this give you what you want?

best,

Mike

On Tue, Sep 30, 2014 at 2:08 PM, Michael Love
 wrote:

hi Kasper and Valerie,

In Kasper's original email:

"I would like to be able to write a MAP function which takes
   ranges, file
instead of just
   range, file
And then chunk over say 1,000s of ranges. I could then have an
argument to reduceByXX called something like rangeSize, which is kind
of yieldSize."

I was thinking, for your BigWig example, we get part of the way just
by splitting the ranges into a GRangesList of your desired chunk size.
Then the parallel iteration is over chunks of GRanges. Within your MAP
function:

import(file, which = ranges, as = "Rle", format = "bw")[ranges]

returns an RleList, and calling mean() on this gives a numeric vector
of the mean of the coverage for each range.

Then the only work needed is how to package the result into something
reasonable. What you would get now is a list (length of GRangesList)
of lists (length of files) of vectors (length of GRanges).

best,

Mike

On Mon, Sep 29, 2014 at 7:09 PM, Valerie Obenchain  wrote:

Hi Kasper,

The reduceBy* functions were intended to combine data across ranges or
files. If we have 10 ranges and 3 files you can think of it as a 10 x 3 grid
where we'll have 30 queries and therefore 30 pieces of information that can
be combined across different dimensions.

The request to 'process ranges in one operation' is a good suggestion and,
as you say, may even be the more common use case.

Some action items and explanations:

1)  treat ranges / files as a group

I'll add functions to treat all ranges / files as a group; essentially no
REDUCER other than concatenation. Chunking (optional) will occur as defined
by 'yieldSize' in the BamFile.

2)  class clean-up

The GenomicFileViews class was the first iteration. It was overly
complicated and the complication didn't gain us much. In my mind the
GenomicFiles class is a stripped-down version that should replace the
*FileViews classes. I plan to deprecate the *FileViews classes.

Ideally the class(s) in GenomicFiles would inherit from
SummarizedExperiment. A stand-alone package for SummarizedExperiment is
in the works for the near future.

Re: [Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem

2014-10-24 Thread Valerie Obenchain
This is a good question. I'm not sure we want seqlevelsStyle() to also 
alter the genome value. I think it's a reasonable request but I'd like 
to open it up to discussion. I've cc'd a few others for input.


Valerie



On 10/24/14 09:05, Robert Castelo wrote:

hi Valerie,

thanks for the quick fix and updating the documentation, i have a
further question about the seqinfo slot and particularly the use of
seqlevelsStyle(). Let me illustrate it with an example again:


==
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

## read again the same VCF file
fl <- file.path(system.file("extdata", package="VariantFiltering"),
"CEUtrio.vcf.bgz")
vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## select the standard chromosomes
vcf <- keepStandardChromosomes(vcf)

## since the input VCF file had NCBI style, let's match
## the style of the TxDb annotations
seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)

## drop the mitochondrial chromosome (b/c of the different lengths
## between b37 and hg19
vcf <- dropSeqlevels(vcf, "chrM")

## try to annotate the location of the variants. it prompts an
## error because the 'genome' slot of the Seqinfo object still
## has b37 after running seqlevelsStyle
vcf_annot <- locateVariants(vcf, txdb, AllVariants())
Error in mergeNamedAtomicVectors(genome(x), genome(y), what =
c("sequence",  :
   sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9,
chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19,
chr20, chr21, chr22, chrX, chrY, chrM have incompatible genomes:
   - in 'x': b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37,
b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37
   - in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
hg19, hg19, hg19

## this can be fixed by setting the 'genome' slot to the values of
## the TxDb object
genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
names(genome(txdb)))]

## now this works
vcf_annot <- locateVariants(vcf, txdb, AllVariants())
=

so my question is, should not seqlevelsStyle() also change the 'genome'
slot of the Seqinfo object in the updated object?

if not, would the solution be updating the 'genome' slot in the way i
did it?

thanks!
robert.


On 10/23/2014 11:14 PM, Valerie Obenchain wrote:

Hi Robert,

Thanks for the bug report and reproducible example. Now fixed in release
1.12.2 and devel 1.13.4.

I've also updated the docs to better explain how the Seqinfo objects are
propagated / merged when supplied as 'genome'.

Valerie


On 10/23/2014 06:45 AM, Robert Castelo wrote:

hi there,

in my package VariantFiltering i have an example VCF file from a Hapmap
CEU trio including three chromosomes only to illustrate its vignette.
i've come across a problem with the function readVcf() in
VariantAnnotation that may be specific of the situation of a VCF file
not having all chromosomes, but which it will be great for me if this
could be addressed.

the problem is reproduced as follows:

===
library(VariantAnnotation)

fl <- file.path(system.file("extdata", package="VariantFiltering"),
"CEUtrio.vcf.bgz")

vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old,
seqlevels(value)) :
when 'new2old' is NULL, the first elements in the
supplied 'seqlevels' must be identical to 'seqlevels(x)'


this is caused because although i'm providing the Seqinfo object derived
from the header of the VCF file itself, at some point the ordering of
the seqlevels between the header and the rest of the VCF file differs
due to the smaller subset of chromosomes in the VCF file.

This can be easily fixed by replacing the line:

if (length(newsi) > length(oldsi)) {

within the .scanVcfToVCF() function in methods-readVcf.R, by

if (length(newsi) >= length(oldsi)) {

this is happening both in release and devel. i'm pasting below my
sessionInfo() for the release.

let me know if you think this fix is feasible or i'm wrongly using the
function readVcf(). i'm basically trying to use readVcf() without having
to figure out the appropriate value for the argument 'genome', i.e.,
without knowing beforehand what version of the genome was used to
produce the VCF file.

thanks!!
robert.


sessionInfo()
R version 3.1.1 Patched (2014-10-13 r66751)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
[5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
[7] LC_PAPER=en_US.UTF8 LC_NAME=C
[9] LC

Re: [Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem

2014-10-23 Thread Valerie Obenchain

Hi Robert,

Thanks for the bug report and reproducible example. Now fixed in release 
1.12.2 and devel 1.13.4.


I've also updated the docs to better explain how the Seqinfo objects are 
propagated / merged when supplied as 'genome'.


Valerie


On 10/23/2014 06:45 AM, Robert Castelo wrote:

hi there,

in my package VariantFiltering i have an example VCF file from a Hapmap
CEU trio including three chromosomes only to illustrate its vignette.
i've come across a problem with the function readVcf() in
VariantAnnotation that may be specific of the situation of a VCF file
not having all chromosomes, but which it will be great for me if this
could be addressed.

the problem is reproduced as follows:

===
library(VariantAnnotation)

fl <- file.path(system.file("extdata", package="VariantFiltering"),
"CEUtrio.vcf.bgz")

vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old,
seqlevels(value)) :
   when 'new2old' is NULL, the first elements in the
   supplied 'seqlevels' must be identical to 'seqlevels(x)'


this is caused because although i'm providing the Seqinfo object derived
from the header of the VCF file itself, at some point the ordering of
the seqlevels between the header and the rest of the VCF file differs
due to the smaller subset of chromosomes in the VCF file.

This can be easily fixed by replacing the line:

 if (length(newsi) > length(oldsi)) {

within the .scanVcfToVCF() function in methods-readVcf.R, by

 if (length(newsi) >= length(oldsi)) {

this is happening both in release and devel. i'm pasting below my
sessionInfo() for the release.

let me know if you think this fix is feasible or i'm wrongly using the
function readVcf(). i'm basically trying to use readVcf() without having
to figure out the appropriate value for the argument 'genome', i.e.,
without knowing beforehand what version of the genome was used to
produce the VCF file.

thanks!!
robert.


sessionInfo()
R version 3.1.1 Patched (2014-10-13 r66751)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF8LC_COLLATE=en_US.UTF8
  [5] LC_MONETARY=en_US.UTF8LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=en_US.UTF8   LC_NAME=C
  [9] LC_ADDRESS=C  LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4parallel  stats graphics  grDevices
[6] utils datasets  methods   base

other attached packages:
  [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0
  [3] Biostrings_2.34.0XVector_0.6.0
  [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0
  [7] IRanges_2.0.0S4Vectors_0.4.0
  [9] BiocGenerics_0.12.0  vimcom_1.0-0
[11] setwidth_1.0-3   colorout_1.0-3

loaded via a namespace (and not attached):
  [1] AnnotationDbi_1.28.0base64enc_0.1-2
  [3] BatchJobs_1.4   BBmisc_1.7
  [5] Biobase_2.26.0  BiocParallel_1.0.0
  [7] biomaRt_2.22.0  bitops_1.0-6
  [9] brew_1.0-6  BSgenome_1.34.0
[11] checkmate_1.4   codetools_0.2-9
[13] DBI_0.3.1   digest_0.6.4
[15] fail_1.2foreach_1.4.2
[17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0
[19] iterators_1.0.7 RCurl_1.95-4.3
[21] RSQLite_0.11.4  rtracklayer_1.26.0
[23] sendmailR_1.2-1 stringr_0.6.2
[25] tools_3.1.1 XML_3.98-1.1
[27] zlibbioc_1.12.0

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] How to update annotation packages

2014-10-16 Thread Valerie Obenchain

Hi Robert,

Annotation packages are generally updated before each release. You 
should have received an email from Marc (cc'd) last month asking if you 
had any updates. Updated packages are uploaded to our ftp server and 
Marc adds them to the repository.


If you need to make a change I think Marc needs to give you access to 
the server or maybe you already have it. He'll be able to fill in the 
details.


Valerie


On 10/15/2014 09:08 AM, Robert Castelo wrote:

Hi,


i submitted one release ago the following annotation packages:

MafDb.ALL.wgs.phase1.release.v3.20101123
MafDb.ESP6500SI.V2.SSA137.dbSNP138
phastCons100way.UCSC.hg19

for which i'm also the "maintainer".

while the functionality to access their annotation resides in another
software package i maintain (VariantFiltering), they contain some
functionality to help updating their annotations themselves. i need to
update those R scripts which are part of the annotation packages,
however, i found nowhere whether this is possible via svn.

how should i proceed with this, should i made the changes local and
submit the new version, or is there some 'svn' way to do it just like
with software or experiment data packages?

thanks,

robert.

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] October Newsletter

2014-10-01 Thread Valerie Obenchain

The fourth quarter Newsletter is now available at

http://www.bioconductor.org/help/newsletters/2014_October/

Topics include support for GRCh38, htslib and an introduction to the 
biocMultiAssay working group.


Valerie

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Cannot interact with a BigWig file on the web with rtracklayer (Windows specific)

2014-09-30 Thread Valerie Obenchain

Hi,

Import and manipulation of BigWig files in rtracklayer make use of the 
Kent C library and are not supported on Windows.


This is documented at

?import.bw
?`BigWigFile-class`

This problem isn't specific to an 'over the network' example. Trying to 
import() a local BigWig on Windows will also fail.


Valerie


On 09/30/2014 08:00 AM, Leonardo Collado Torres wrote:

Hello,

I ran into an issue interacting with BigWig files over the network
with `rtracklayer`. For some reason, the issue is Windows-specific and
I don't understand why.

Basically, I run the short code at
https://gist.github.com/lcolladotor/0ab8ab3d904d21110637 It works on a
Mac, but it fails on Windows. The same is true for using
rtracklayer::import().

Any help will be greatly appreciated.

Thanks,
Leonardo


### Windows info ###


suppressPackageStartupMessages(library('rtracklayer'))

## Attempt with 1 file
bw <- 
BigWigFile('http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/bigwig/HSB97.AMY.bw')
seql <- seqlengths(bw)

Warning message:
In seqinfo(x) : Invalid argument
Can't open 
http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/bigwig/HSB97.AMY.bw
to read
Error in seqlengths(seqinfo(x)) :
   error in evaluating the argument 'x' in selecting a method for
function 'seqlengths': Error in seqinfo(x) : UCSC library operation
failed

traceback()

3: seqlengths(seqinfo(x))
2: seqlengths(bw)
1: seqlengths(bw)


## Then another file
bw2 <- 
BigWigFile('http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/bigwig/HSB114.A1C.bw')
seql2 <- seqlengths(bw2)

Warning message:
In seqinfo(x) : Invalid argument
Can't open 
http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/bigwig/HSB114.A1C.bw
to read
Error in seqlengths(seqinfo(x)) :
   error in evaluating the argument 'x' in selecting a method for
function 'seqlengths': Error in seqinfo(x) : UCSC library operation
failed


## Check
identical(seql, seql2)

Error in identical(seql, seql2) : object 'seql' not found


## Session info
devtools::session_info()

Session info
  setting  value
  version  R version 3.1.1 (2014-07-10)
  system   x86_64, mingw32
  ui   Rgui
  language (EN)
  collate  English_United States.1252
  tz   America/New_York

Packages
  package   * version  date   source
  base64enc   0.1.22014-06-26 CRAN (R 3.1.1)
  BatchJobs   1.4  2014-09-24 CRAN (R 3.1.1)
  BBmisc  1.7  2014-06-21 CRAN (R 3.1.1)
  BiocGenerics  * 0.11.5   2014-09-13 Bioconductor
  BiocParallel0.99.22  2014-09-23 Bioconductor
  Biostrings  2.33.14  2014-09-07 Bioconductor
  bitops  1.0.62013-08-17 CRAN (R 3.1.0)
  brew1.0.62011-04-13 CRAN (R 3.1.0)
  checkmate   1.4  2014-09-03 CRAN (R 3.1.1)
  codetools   0.2.92014-08-21 CRAN (R 3.1.1)
  DBI 0.3.12014-09-24 CRAN (R 3.1.1)
  devtools1.6  2014-09-23 CRAN (R 3.1.1)
  digest  0.6.42013-12-03 CRAN (R 3.1.0)
  fail1.2  2013-09-19 CRAN (R 3.1.0)
  foreach 1.4.22014-04-11 CRAN (R 3.1.0)
  futile.logger   1.3.72014-01-23 CRAN (R 3.1.1)
  futile.options  1.0.02010-04-06 CRAN (R 3.1.1)
  GenomeInfoDb  * 1.1.23   2014-09-28 Bioconductor
  GenomicAlignments   1.1.30   2014-09-23 Bioconductor
  GenomicRanges * 1.17.42  2014-09-23 Bioconductor
  IRanges   * 1.99.28  2014-09-10 Bioconductor
  iterators   1.0.72014-04-11 CRAN (R 3.1.0)
  lambda.r1.1.62014-01-23 CRAN (R 3.1.1)
  RCurl   1.95.4.3 2014-07-29 CRAN (R 3.1.1)
  Rsamtools   1.17.34  2014-09-20 Bioconductor
  RSQLite 0.11.4   2013-05-26 CRAN (R 3.1.0)
  rstudioapi  0.1  2014-03-27 CRAN (R 3.1.1)
  rtracklayer   * 1.25.16  2014-09-10 Bioconductor
  S4Vectors * 0.2.42014-09-14 Bioconductor
  sendmailR   1.2.12014-09-21 CRAN (R 3.1.1)
  stringr 0.6.22012-12-06 CRAN (R 3.1.0)
  XML 3.98.1.1 2013-06-20 CRAN (R 3.1.0)
  XVector 0.5.82014-09-07 Bioconductor
  zlibbioc1.11.1   2014-05-09 Bioconductor










## Mac info 






suppressPackageStartupMessages(library('rtracklayer'))

## Attempt with 1 file
bw <- 
BigWigFile('http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/bigwig/HSB97.AMY.bw')
seql <- seqlengths(bw)

## Then another file
bw2 <- 
BigWigFile('http://download.alleninstitute.org/brainspan/MRF_BigWig_Gencode_v10/bigwig/HSB114.A1C.bw')
seql2 <- seqlengths(bw2)

## Check
identical(seql, seql2)

[1] TRUE


## Session info
devtools::session_info()

Session info

Re: [Bioc-devel] writeVcf performance

2014-09-30 Thread Valerie Obenchain
 digest_0.6.4
[15] fail_1.2   foreach_1.4.2
[17] futile.options_1.0.0   genefilter_1.47.6
[19] geneplotter_1.43.0 GenomicAlignments_1.1.29
[21] genoset_1.19.32                gneDB_0.4.18
[23] grid_3.1.1 iterators_1.0.7
[25] lambda.r_1.1.6 lattice_0.20-29
[27] Matrix_1.1-4   RColorBrewer_1.0-5
[29] RCurl_1.95-4.3 rjson_0.2.14
[31] RSQLite_0.11.4 sendmailR_1.2-1
[33] splines_3.1.1  stringr_0.6.2
[35] survival_2.37-7tools_3.1.1
[37] TxDb.Hsapiens.BioMart.igis_2.3 XML_3.98-1.1
[39] xtable_1.7-4   zlibbioc_1.11.1

On Wed, Sep 17, 2014 at 2:08 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:

Hi Gabe,

Have you had a chance to test writeVcf? The changes made over the
past week have shaved off more time. It now takes ~ 9 minutes to
write the NA12877 example.

dim(vcf)

[1] 51612762        1

gc()

             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157818565 8428.5  298615851 15947.9  261235336 13951.5
Vcells 1109849222 8467.5 1778386307 13568.1 1693553890 12920.8

print(system.time(writeVcf(vcf, tempfile())))

user  system elapsed
555.282   6.700 565.700

gc()

             used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157821990 8428.7  329305975 17586.9  261482807 13964.7
Vcells 1176960717 8979.5 2183277445 16657.1 2171401955 16566.5



In the most recent version (1.11.35) I've added chunking for files
with > 1e5 records. Right now the choice of # records per chunk is
simple, based on total records only. We are still experimenting with
this. You can override default chunking with 'nchunk'. Examples on
the man page.
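A short sketch of the override, assuming VariantAnnotation >= 1.11.35
where 'nchunk' was introduced (the chunk size here is arbitrary):

library(VariantAnnotation)

fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
vcf <- readVcf(fl, "hg19")
## write in chunks of 1e5 records instead of the automatic choice
writeVcf(vcf, tempfile(), nchunk = 1e5)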

Valerie


On 09/08/14 08:43, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain
 <voben...@fhcrc.org> wrote:

 The new writeVcf code is in 1.11.28.

 Using the Illumina file you suggested, geno fields only, writing now
 takes about 17 minutes.

  > hdr
 class: VCFHeader
 samples(1): NA12877
 meta(6): fileformat ApplyRecalibration ... reference source
 fixed(1): FILTER
 info(22): AC AF ... culprit set
 geno(8): GT GQX ... PL VF

  > param = ScanVcfParam(info=NA)
  > vcf = readVcf(fl, "", param=param)
  > dim(vcf)
 [1] 51612762        1

  > system.time(writeVcf(vcf, "out.vcf"))
  user   system  elapsed
    971.032   6.568 1004.593

 In 1.11.28, parsing of geno data was moved to C. If this didn't
 speed things up enough we were planning to implement 'chunking'
 through the VCF and/or move the parsing of info to C,
however, it
 looks like geno was the bottleneck.

 I've tested a number of samples/fields combinations in files with >=
 .5 million rows and the improvement over writeVcf() in release is ~ 90%.

 Valerie




 On 09/04/14 15:28, Valerie Obenchain wrote:

 Thanks Gabe. I should have something for you on Monday.

 Val


 On 09/04/2014 01:56 PM, Gabe Becker wrote:

 Val and Martin,

 Apologies for the delay.

 We realized that the Illumina platinum genome vcf files make a good
 test case, assuming you strip out all the info (info=NA when reading
 it into R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
 took about ~4.2 hrs to write out, and is about 1.5x the size of the
 files we are actually dealing with (~50M ranges vs our ~30M).

 Looking forward to a new vastly improved writeVcf :).

 ~G


 On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence
 <lawrence.mich...@gene.com> wrote:

Re: [Bioc-devel] GenomicFiles: chunking

2014-09-29 Thread Valerie Obenchain

Hi Kasper,

The reduceBy* functions were intended to combine data across ranges or 
files. If we have 10 ranges and 3 files you can think of it as a 10 x 3 
grid where we'll have 30 queries and therefore 30 pieces of information 
that can be combined across different dimensions.


The request to 'process ranges in one operation' is a good suggestion 
and, as you say, may even be the more common use case.


Some action items and explanations:

1)  treat ranges / files as a group

I'll add functions to treat all ranges / files as a group; essentially 
no REDUCER other than concatenation. Chunking (optional) will occur as 
defined by 'yieldSize' in the BamFile.


2)  class clean-up

The GenomicFileViews class was the first iteration. It was overly 
complicated and the complication didn't gain us much. In my mind the 
GenomicFiles class is a stripped-down version that should replace the 
*FileViews classes. I plan to deprecate the *FileViews classes.


Ideally the class(s) in GenomicFiles would inherit from 
SummarizedExperiment. A stand-alone package for SummarizedExperiment is 
in the works for the near future. Until those plans are finalized I'd 
rather not change the inheritance structure in GenomicFiles.


3) pack / unpack

I experimented with this and came to the conclusion that packing / 
unpacking should be left to the user vs being done behind the scenes 
with reduceBy*. The primary difficulty is that you don't have 
pre-knowledge of the output format of the user's MAPPER. If MAPPER 
outputs an Rle, unpacking may be straightforward. If MAPPER outputs a 
NumericList or matrix or vector with no genomic coordinates then things 
are more complicated.


I'm open if others have suggestions / prototypes.

4) reduceByYield

This was intended for chunking through ranges in a single file. You can 
imagine using bplapply() over files where each file is chunked through 
with reduceByYield(). All ranges are chunked by 'yieldSize' defined in 
the BamFile unless a 'param' dictates a subset of ranges.
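A compact sketch of the reduceByYield() pattern (the BAM file name is
hypothetical; YIELD / MAP / REDUCE follow the description above):

library(GenomicFiles)
library(GenomicAlignments)

bf <- BamFile("sample.bam", yieldSize = 100000)
YIELD <- function(x) readGAlignments(x)   # read one chunk per call
MAP <- function(value) length(value)      # work on the current chunk
REDUCE <- `+`                             # fold chunk results together
reduceByYield(bf, YIELD, MAP, REDUCE)     # total alignment count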


5) additional tidy

I'll add a name to the 'assays' when summarize=TRUE, make sure return 
types are consistent when summarize=FALSE, etc.


Thanks for the testing and feedback.


Valerie

On 09/29/2014 07:18 AM, Michael Love wrote:

On Mon, Sep 29, 2014 at 9:09 AM, Kasper Daniel Hansen
 wrote:

I don't fully understand "the use case for reducing by range is when the
entire dataset won't fit into memory".  The basic assumption of these
functions (as far as I can see) is that the output data fits in memory.
What may not fit in memory is various earlier "iterations" of the data.  For
example, in my use case, if I just read in all the data in all the ranges in
all the samples it is basically Rle's across 450MB times 38 files, which is
not small.  What fits in memory is smaller chunks of this; that is true for
every application.



I was unclear. I meant that, in approach1, you have an object,
all.Rle, which contains Rles for every range over every file. Can you
actually run this approach on the full dataset?


Reducing by range (or file) only makes sense when the final output includes
one entity for several ranges/files ... right?  So I don't see how reduce
would help me.



Yes, I think we agree. This is not a good use case for reduce by range
as now implemented.

This is a use case which would benefit from the user-facing function
internally calling pack()/unpack() to reduce the number of import()
calls, and then in the end giving back the mean coverage over the
input ranges. I want this too.

https://github.com/Bioconductor/GenomicFileViews/issues/2#issuecomment-32625456
(link to the old github repo, the new github repo is named GenomicFiles)


As I see the pack()/unpack() paradigm, it just re-orders the query ranges
(which is super nice and matters a lot for speed for some applications).  As
I understand the code (and my understanding is developing) we need an extra
layer to support processing multiple ranges in one operation.

I am happy to help apart from complaining.

Best,
Kasper

On Mon, Sep 29, 2014 at 8:55 AM, Michael Love 
wrote:


Thanks for checking it out and benchmarking. We should be more clear
in the docs that the use case for reducing by range is when the entire
dataset won't fit into memory. Also, we had some discussion and
Valerie had written up methods for packing up the ranges supplied by
the user into a better form for querying files. In your case it would
have packed many ranges together, to reduce the number of import()
calls like your naive approach. See the pack/unpack functions, which
are not in the vignette but are in the man pages. If I remember,
writing code to unpack() the result was not so simple, and development
of these functions was set aside for the moment.

Mike

On Sun, Sep 28, 2014 at 10:49 PM, Kasper Daniel Hansen
 wrote:


I am testing GenomicFiles.

My use case: I have 231k ranges of average width 1.9kb and total width
442 MB.  I also have 38 BigWig files.  I want to compute the av

Re: [Bioc-devel] BiocParallel-devel error

2014-09-29 Thread Valerie Obenchain

Hi Michel,

In BiocParallel 0.99.24 .convertToSimpleError() now checks for NULL and 
converts to NA_character_.
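
A toy sketch of the guard described above (not the actual BiocParallel
internals):

.convertToSimpleError <- function(msg) {
    ## a NULL message from a worker becomes NA rather than breaking
    ## downstream error processing, which expects messages of length >= 1
    if (is.null(msg))
        msg <- NA_character_
    simpleError(msg)
}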


I'm testing with BatchJobs 1.4, BiocParallel 0.99.24 and SLURM. I'm 
still not getting an informative error message:



xx <- bplapply(1:2, FUN)

SubmitJobs |+| 100% (00:00:00)
Waiting [S:2 D:0 E:0 R:0] |+ |   0% (00:00:00)
Error: 2 errors; first error:
  NA

For more information, use bplasterror(). To resume calculation, re-call
  the function and set the argument 'BPRESUME' to TRUE or wrap the
  previous call in bpresume().



Last error:


bplasterror()

0 / 2 partial results stored. First 2 error messages:
[1]: NA
[2]: NA



Valerie


On 09/26/2014 01:44 AM, Michel Lang wrote:

This was a bug in BatchJobs::waitForJobs(). We now throw an error if
jobs "disappear" due to a faulty template file. I'd appreciate it if you
could confirm that this is now correctly caught and handled on your
system. I furthermore suggest replacing NULL with NA_character_ in
.convertToSimpleError().

I will upload a fixed BatchJobs to CRAN ASAP (but last upload was the
day before yesterday, thus we have to wait a few days).

Best,
Michel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BiocParallel-devel error

2014-09-23 Thread Valerie Obenchain

Hi,

Martin and I looked into this a bit. It looks like a problem with 
handling an 'undefined error' returned from a worker (i.e., job did not 
run). When there is a problem executing the tmpl script no error message 
is sent back. The NULL is coerced to simpleError and becomes a problem 
downstream when the error processing is expecting messages of length > 0.


You can reproduce the error by putting a typo in the script. For example 
replace R with something bogus such as MYR in this line:


MYR CMD --no-save --no-restore "<%= rscript %>" /dev/stdout

You said the script worked with release but not devel. Is it possible 
there's a problem with how R devel is being called on the cluster?


Michel Lang (cc'd) implemented BatchJobs in BiocParallel. I'd like to 
get his opinion on how he wants to handle this type of error.
Michel, let me know if you need more details, I can send another example 
off-line.


Valerie



On 09/22/2014 02:58 PM, Valerie Obenchain wrote:

Hi Thomas,

Just wanted to let you know I saw this and am looking into it.

Valerie

On 09/20/2014 02:54 PM, Thomas Girke wrote:

Hi Martin, Michael and Vincent,

If I run the following code, with the release version of BiocParallel
then it
works (took me some time to actually realize that), but with the
development
version I am getting an error shown after the test code below. If I
run the
same test with BatchJobs from the devel branch alone then there is no
problem.
Thus, it seems there is some change in the devel version of BiocParallel
causing this error? The torque.tmpl file I am using on our cluster is the
standard one from BatchJobs here:
https://github.com/tudo-r/BatchJobs/blob/master/examples/cfTorque/simple.tmpl


For my application, I could stick with BatchJobs, but it would be
nicer if I
could get things to work with BiocParallel.

Thanks,

Thomas

###
## Test Code ##
###
FUN <- function(i) system("hostname", intern=TRUE)
library(BiocParallel); library(BatchJobs)
funs <- makeClusterFunctionsTorque("torque.tmpl")
param <- BatchJobsParam(4, resources=list(walltime="48:00:00",
nodes="1:ppn=4", memory="4gb"), cluster.functions=funs)
register(param)
xx <- bplapply(1:4, FUN)

Error: 4 errors; first error:

For more information, use bplasterror(). To resume calculation,
re-call the function and
set the argument 'BPRESUME' to TRUE or wrap the previous call in
bpresume()


bplasterror()

Error in vapply(head(which(is.error), n.print), f, character(1L)) :
   values must be length 1,
  but FUN(X[[1]]) result is length 0


sessionInfo()

R Under development (unstable) (2014-05-05 r65530)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats graphics  utils datasets  grDevices methods   base

other attached packages:
[1] BatchJobs_1.3        BBmisc_1.7           BiocParallel_0.99.19

loaded via a namespace (and not attached):
 [1] BiocGenerics_0.11.4 DBI_0.3.0           RSQLite_0.11.4      brew_1.0-6
     checkmate_1.4       codetools_0.2-9     digest_0.6.4        fail_1.2
     foreach_1.4.2       iterators_1.0.7
[11] parallel_3.2.0      sendmailR_1.1-2     stringr_0.6.2       tools_3.2.0

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] BiocParallel-devel error

2014-09-22 Thread Valerie Obenchain

Hi Thomas,

Just wanted to let you know I saw this and am looking into it.

Valerie

On 09/20/2014 02:54 PM, Thomas Girke wrote:

Hi Martin, Michael and Vincent,

If I run the following code, with the release version of BiocParallel then it
works (took me some time to actually realize that), but with the development
version I am getting an error shown after the test code below. If I run the
same test with BatchJobs from the devel branch alone then there is no problem.
Thus, it seems there is some change in the devel version of BiocParallel
causing this error? The torque.tmpl file I am using on our cluster is the
standard one from BatchJobs here:
https://github.com/tudo-r/BatchJobs/blob/master/examples/cfTorque/simple.tmpl

For my application, I could stick with BatchJobs, but it would be nicer if I
could get things to work with BiocParallel.

Thanks,

Thomas

###
## Test Code ##
###
FUN <- function(i) system("hostname", intern=TRUE)
library(BiocParallel); library(BatchJobs)
funs <- makeClusterFunctionsTorque("torque.tmpl")
param <- BatchJobsParam(4, resources=list(walltime="48:00:00", nodes="1:ppn=4", 
memory="4gb"), cluster.functions=funs)
register(param)
xx <- bplapply(1:4, FUN)

Error: 4 errors; first error:

For more information, use bplasterror(). To resume calculation, re-call the 
function and
set the argument 'BPRESUME' to TRUE or wrap the previous call in bpresume()


bplasterror()

Error in vapply(head(which(is.error), n.print), f, character(1L)) :
   values must be length 1,
  but FUN(X[[1]]) result is length 0


sessionInfo()

R Under development (unstable) (2014-05-05 r65530)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] stats graphics  utils datasets  grDevices methods   base

other attached packages:
[1] BatchJobs_1.3        BBmisc_1.7           BiocParallel_0.99.19

loaded via a namespace (and not attached):
 [1] BiocGenerics_0.11.4 DBI_0.3.0           RSQLite_0.11.4      brew_1.0-6
     checkmate_1.4       codetools_0.2-9     digest_0.6.4        fail_1.2
     foreach_1.4.2       iterators_1.0.7
[11] parallel_3.2.0      sendmailR_1.1-2     stringr_0.6.2       tools_3.2.0

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] thanks

2014-09-19 Thread Valerie Obenchain
Just a note to say thanks to those who worked on (1) the new biocViews 
search capabilities and (2) seqlevelsStyle<-. These are great 
improvements that have made tasks easier / faster time and time again.


Yea!

Val

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] writeVcf performance

2014-09-17 Thread Valerie Obenchain

Hi Gabe,

Have you had a chance to test writeVcf? The changes made over the past 
week have shaved off more time. It now takes ~ 9 minutes to write the 
NA12877 example.



> dim(vcf)
[1] 51612762        1

> gc()
             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157818565  8428.5  298615851 15947.9  261235336 13951.5
Vcells 1109849222  8467.5 1778386307 13568.1 1693553890 12920.8

> print(system.time(writeVcf(vcf, tempfile())))
   user  system elapsed
555.282   6.700 565.700

> gc()
             used    (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells  157821990  8428.7  329305975 17586.9  261482807 13964.7
Vcells 1176960717  8979.5 2183277445 16657.1 2171401955 16566.5



In the most recent version (1.11.35) I've added chunking for files with 
> 1e5 records. Right now the choice of # records per chunk is simple, 
based on total records only. We are still experimenting with this. You 
can override default chunking with 'nchunk'. Examples on the man page.
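
For example (a sketch assuming the 'nchunk' argument as just described;
see the man page for the authoritative examples):

library(VariantAnnotation)
fl <- "big.vcf.gz"                         ## hypothetical input
vcf <- readVcf(fl, "hg19", param = ScanVcfParam(info = NA))
writeVcf(vcf, "out.vcf", nchunk = 1e6)     ## ~1e6 records per chunk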


Valerie


On 09/08/14 08:43, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make
a good test
case, assuming you strip out all the info (info=NA when
reading it into
R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about ~4.2 hrs to write out, and is about 1.5x the size
of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear,
and Gabe has
 been experimenting with a chunk-wise + parallel algorithm.
 Unfortunately there is some frustrating overhead with the
 parallelism. But I'm glad Val is arriving at something
quicker.

 Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email
is for 24 geno
 fields, as I said,
 but our typical usecase involves only ~4-6
fields, and is
 faster but still on
 the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient
 implementation, but...

 I wanted to share my guess that the poor _scaling_
is because
 the garbage collector runs multiple times as the
different
 strings are pasted together, and has to traverse,
in linear
 time, increasing numbers of allocated SEXPs. So
times scale
 approximately quadratically with the number of rows
in the VCF

  

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Valerie Obenchain

Hi Herve,

This unit test passes in VA 1.11.30 (the current version in svn). It was 
related to writeVcf(), not the IRanges/S4Vector stuff. My fault, not yours.


Val

On 09/09/2014 02:47 PM, Hervé Pagès wrote:

Hi Val,

On 09/09/2014 02:12 PM, Valerie Obenchain wrote:

Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the
process of moving SimpleList and DataFrame from IRanges to S4Vectors;
finished up today I think.


I fixed VariantAnnotation's NAMESPACE this morning but 'R CMD check'
failed for me because of a unit test error in test_VRanges_vcf().
Here is how to quickly reproduce:

   library(VariantAnnotation)
   library(RUnit)
   source("path/to/VariantAnnotation/inst/unitTests/test_VRanges-class.R")

   dest <- tempfile()
   vr <- make_TARGET_VRanges_vcf()
   writeVcf(vr, dest)
   vcf <- readVcf(dest, genome = "hg19")
   perm <- c(1, 7, 8, 4, 2, 10)
   vcf.vr <- as(vcf, "VRanges")[perm]
   genome(vr) <- "hg19"
   checkIdenticalVCF(vr, vcf.vr)  # Error in checkIdentical(orig, vcf) : FALSE

Hard for me to tell whether this is related to DataFrame moving from
IRanges to S4Vectors or to a regression in writeVcf(). Do you think
you can have a look? Thanks for the help and sorry for the trouble.

H.


Anyhow, if you get VariantAnnotation from svn
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe
rtracklayer).

I'm working on the 'chunking' approach next. It looks like we can still
gain from adding that.

Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:

fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing
now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files
with >=
.5 million rows and the improvement over writeVcf() in release is
~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files
make
a good test
case, assuming you strip out all the info (info=NA when
reading it into
R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz

took about ~4.2 hrs to write out, and is about 1.5x the
size
of the
files we are actually dealing with (~50M ranges vs our
~30M).

Looking forward a new vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear,
and Gabe has
 been experimenting with a chunk-wise + parallel
algorithm.
 Unfortunately there is some frustrating overhead with
the
 parallelism. But I'm glad Val is arriving at something
quicker.

 Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email
is for 24 geno
 fields, as I said,
 but our typical usecase involves only ~4-6
 

Re: [Bioc-devel] writeVcf performance

2014-09-09 Thread Valerie Obenchain
Writing 'list' data has been fixed in 1.11.30. fyi, Herve is in the 
process of moving SimpleList and DataFrame from IRanges to S4Vectors; 
finished up today I think. Anyhow, if you get VariantAnnotation from svn 
you'll need to update S4Vectors, IRanges and GenomicRanges (and maybe 
rtracklayer).


I'm working on the 'chunking' approach next. It looks like we can still 
gain from adding that.


Valerie

On 09/08/2014 12:00 PM, Valerie Obenchain wrote:

fyi Martin found a bug with the treatment of list data (ie, Number =
'.') in the header. Working on a fix ...

Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is
~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make
a good test
case, assuming you strip out all the info (info=NA when
reading it into
R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about ~4.2 hrs to write out, and is about 1.5x the size
of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear,
and Gabe has
 been experimenting with a chunk-wise + parallel
algorithm.
 Unfortunately there is some frustrating overhead with
the
 parallelism. But I'm glad Val is arriving at something
quicker.

 Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email
is for 24 geno
 fields, as I said,
 but our typical usecase involves only ~4-6
fields, and is
 faster but still on
 the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient
 implementation, but...

 I wanted to share my guess that the poor _scaling_
is because
 the garbage collector runs multiple times as the
different
 strings are pasted together, and has to traverse,
in linear
 time, increasing numbers of allocated SEXPs. So
times scale
 approximately quadratically with the number of rows
in the VCF

 An efficiency is to reduce the number of SEXPs in
play by
 writing out in chunks -- as each chunk is written,
the SEXPs
 become available for collection and are re-used.
Here's my toy
 example

 time.R
 ==
 splitIndic

Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Valerie Obenchain
fyi Martin found a bug with the treatment of list data (ie, Number = 
'.') in the header. Working on a fix ...


Val


On 09/08/2014 08:43 AM, Gabe Becker wrote:

Val,

That is great. I'll check this out and test it on our end.

~G

On Mon, Sep 8, 2014 at 8:38 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now
takes about 17 minutes.

 > hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

 > param = ScanVcfParam(info=NA)
 > vcf = readVcf(fl, "", param=param)
 > dim(vcf)
[1] 51612762        1

 > system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't
speed things up enough we were planning to implement 'chunking'
through the VCF and/or move the parsing of info to C, however, it
looks like geno was the bottleneck.

I've tested a number of samples/fields combinations in files with >=
.5 million rows and the improvement over writeVcf() in release is ~ 90%.

Valerie




On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make
a good test
case, assuming you strip out all the info (info=NA when
reading it into
R) stuff.


ftp://platgene:G3n3s4me@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about ~4.2 hrs to write out, and is about 1.5x the size
of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

 Yes, it's very clear that the scaling is non-linear,
and Gabe has
 been experimenting with a chunk-wise + parallel algorithm.
 Unfortunately there is some frustrating overhead with the
 parallelism. But I'm glad Val is arriving at something
quicker.

 Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

 On 08/27/2014 11:56 AM, Gabe Becker wrote:

 The profiling I attached in my previous email
is for 24 geno
 fields, as I said,
 but our typical usecase involves only ~4-6
fields, and is
 faster but still on
 the order of dozens of minutes.


 I think Val is arriving at a (much) more efficient
 implementation, but...

 I wanted to share my guess that the poor _scaling_
is because
 the garbage collector runs multiple times as the
different
 strings are pasted together, and has to traverse,
in linear
 time, increasing numbers of allocated SEXPs. So
times scale
 approximately quadratically with the number of rows
in the VCF

 An efficiency is to reduce the number of SEXPs in
play by
 writing out in chunks -- as each chunk is written,
the SEXPs
 become available for collection and are re-used.
Here's my toy
 example

 time.R
 ==
 splitIndices <- function (nx, ncl)
 {
  i <- seq_len(nx)
  if (ncl == 0L)
  list()
  else if (ncl == 1L || nx == 1L)
  list(i)
  else {
  fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
        breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
        structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
    }
}

Re: [Bioc-devel] writeVcf performance

2014-09-08 Thread Valerie Obenchain

The new writeVcf code is in 1.11.28.

Using the illumina file you suggested, geno fields only, writing now 
takes about 17 minutes.


> hdr
class: VCFHeader
samples(1): NA12877
meta(6): fileformat ApplyRecalibration ... reference source
fixed(1): FILTER
info(22): AC AF ... culprit set
geno(8): GT GQX ... PL VF

> param = ScanVcfParam(info=NA)
> vcf = readVcf(fl, "", param=param)
> dim(vcf)
[1] 51612762        1

> system.time(writeVcf(vcf, "out.vcf"))
    user  system elapsed
 971.032   6.568 1004.593

In 1.11.28, parsing of geno data was moved to C. If this didn't speed 
things up enough we were planning to implement 'chunking' through the 
VCF and/or move the parsing of info to C, however, it looks like geno 
was the bottleneck.


I've tested a number of samples/fields combinations in files with >= .5 
million rows and the improvement over writeVcf() in release is ~ 90%.


Valerie



On 09/04/14 15:28, Valerie Obenchain wrote:

Thanks Gabe. I should have something for you on Monday.

Val


On 09/04/2014 01:56 PM, Gabe Becker wrote:

Val and Martin,

Apologies for the delay.

We realized that the Illumina platinum genome vcf files make a good test
case, assuming you strip out all the info (info=NA when reading it into
R) stuff.

ftp://platgene:g3n3s...@ussd-ftp.illumina.com/NA12877_S1.genome.vcf.gz
took about ~4.2 hrs to write out, and is about 1.5x the size of the
files we are actually dealing with (~50M ranges vs our ~30M).

Looking forward a new vastly improved writeVcf :).

~G


On Tue, Sep 2, 2014 at 1:53 PM, Michael Lawrence <lawrence.mich...@gene.com> wrote:

Yes, it's very clear that the scaling is non-linear, and Gabe has
been experimenting with a chunk-wise + parallel algorithm.
Unfortunately there is some frustrating overhead with the
parallelism. But I'm glad Val is arriving at something quicker.

Michael


On Tue, Sep 2, 2014 at 1:33 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

On 08/27/2014 11:56 AM, Gabe Becker wrote:

The profiling I attached in my previous email is for 24 geno
fields, as I said,
but our typical usecase involves only ~4-6 fields, and is
faster but still on
the order of dozens of minutes.


I think Val is arriving at a (much) more efficient
implementation, but...

I wanted to share my guess that the poor _scaling_ is because
the garbage collector runs multiple times as the different
strings are pasted together, and has to traverse, in linear
time, increasing numbers of allocated SEXPs. So times scale
approximately quadratically with the number of rows in the VCF

An efficiency is to reduce the number of SEXPs in play by
writing out in chunks -- as each chunk is written, the SEXPs
become available for collection and are re-used. Here's my toy
example

time.R
==
splitIndices <- function (nx, ncl)
{
 i <- seq_len(nx)
 if (ncl == 0L)
 list()
 else if (ncl == 1L || nx == 1L)
 list(i)
 else {
 fuzz <- min((nx - 1L)/1000, 0.4 * nx/ncl)
 breaks <- seq(1 - fuzz, nx + fuzz, length = ncl + 1L)
         structure(split(i, cut(i, breaks, labels=FALSE)), names = NULL)
 }
}

x = as.character(seq_len(1e7)); y = sample(x)
if (!is.na(Sys.getenv("SPLIT", NA))) {
 idx <- splitIndices(length(x), 20)
 system.time(for (i in idx) paste(x[i], y[i], sep=":"))
} else {
 system.time(paste(x, y, sep=":"))
}


running under R-devel with $ SPLIT=TRUE R --no-save --quiet -f
time.R the relevant time is

user  system elapsed
  15.320   0.064  15.381

versus with $ R --no-save --quiet -f time.R it is

user  system elapsed
  95.360   0.164  95.511

I think this is likely an overall strategy when dealing with
character data -- processing in independent chunks of moderate
(1M?) size (enabling as a consequence parallel evaluation in
modest memory) that are sufficient to benefit from
vectorization, but that do not entail allocation of large
numbers of in-use SEXPs.

Martin


Sorry for the confusion.
~G


On Wed, Aug 27, 2014 at 11:45 AM, Gabe Becker <becke...@gene.com> wrote:

 Martin and Val.

 I re-ran writeVcf on our (G)VCF data (34790518 ranges, 24 geno fields)
 with profiling enabled. The results of 

Re: [Bioc-devel] writeVcf performance

2014-09-04 Thread Valerie Obenchain
 Gabe is still testing/profiling, but we'll send something randomized
 along eventually.


 On Tue, Aug 26, 2014 at 11:15 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

 I didn't see in the original thread a
reproducible (simulated, I
 guess) example, to be explicit about what the
problem is??

 Martin


 On 08/26/2014 10:47 AM, Michael Lawrence wrote:

 My understanding is that the heap
optimization provided marginal
 gains, and
 that we need to think harder about how to
optimize the all of
 the string
 manipulation in writeVcf. We either need to
reduce it or reduce its
     overhead (i.e., the CHARSXP allocation).
Gabe is doing more tests.


 On Tue, Aug 26, 2014 at 9:43 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

 Hi Gabe,

 Martin responded, and so did Michael,


https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

 It sounded like Michael was ok with
working with/around heap
 initialization.

 Michael, is that right or should we
still consider this on
 the table?


 Val


 On 08/26/2014 09:34 AM, Gabe Becker wrote:

 Val,

 Has there been any movement on
this? This remains a
 substantial
 bottleneck for us when writing very
large VCF files (e.g.
 variants+genotypes for whole genome
NGS samples).

 I was able to see a ~25% speedup
with 4 cores and  an
 "optimal" speedup
 of ~2x with 10-12 cores for a VCF
with 500k rows  using
 a very naive
 parallelization strategy and no
other changes. I suspect
 this could be
 improved on quite a bit, or
possibly made irrelevant
 with judicious use
 of serial C code.

 Did you and Martin make any plans
regarding optimizing
         writeVcf?

 Best
 ~G


  On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain <voben...@fhcrc.org> wrote:

   Hi Michael,

   I'm interested in working on
this. I'll discuss
 with Martin next
   week when we're both back in
the office.

   Val





   On 08/05/14 07:46, Michael
Lawrence wrote:

   Hi guys (Val, Martin, Herve):

   Anyone have an itch for
optimization? The
 writeVcf function is
   currently a
   

Re: [Bioc-devel] writeVcf performance

2014-08-26 Thread Valerie Obenchain

Hi Gabe,

Martin responded, and so did Michael,

https://stat.ethz.ch/pipermail/bioc-devel/2014-August/006082.html

It sounded like Michael was ok with working with/around heap 
initialization.


Michael, is that right or should we still consider this on the table?


Val

On 08/26/2014 09:34 AM, Gabe Becker wrote:

Val,

Has there been any movement on this? This remains a substantial
bottleneck for us when writing very large VCF files (e.g.
variants+genotypes for whole genome NGS samples).

I was able to see a ~25% speedup with 4 cores and  an "optimal" speedup
of ~2x with 10-12 cores for a VCF with 500k rows  using a very naive
parallelization strategy and no other changes. I suspect this could be
improved on quite a bit, or possibly made irrelevant with judicious use
of serial C code.

Did you and Martin make any plans regarding optimizing writeVcf?

Best
~G


On Tue, Aug 5, 2014 at 2:33 PM, Valerie Obenchain mailto:voben...@fhcrc.org>> wrote:

Hi Michael,

I'm interested in working on this. I'll discuss with Martin next
week when we're both back in the office.

Val





On 08/05/14 07:46, Michael Lawrence wrote:

Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is
currently a
bottleneck in our WGS genotyping pipeline. For a typical 50
million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which
is still
too long by my standards (> 0). Only takes 3 minutes to call the
genotypes
(and associated likelihoods etc) from the variant calls (using
80 cores and
450 GB RAM on one node), so the output is an issue. Profiling
suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's
string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50
million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
 user  system elapsed
6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
 user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
 user  system elapsed
   57.224   0.156  57.366

But running it a second time is faster (about what one would
expect?):

system.time(levels <- as.character(end))
 user  system elapsed
   23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of
the string
hash table is not a significant component of the time. So maybe
something
to do with the R heap/gc? No time right now to go deeper. But I
know Martin
likes this sort of thing ;)

Michael

 [[alternative HTML version deleted]]

_
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


_
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biologist
Genentech Research


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Parallel processing of reads in a single fastq file

2014-08-19 Thread Valerie Obenchain

Hi,

bpiterate() has been added to BiocParallel 0.99.11. The current 
implementation is based on sclapply() from HTSeqGenie and is supported 
for the multi-core environment only. Support for other back-ends is in 
progress.


For the current implementation, iterating over multiple files can be 
done by distributing the files over a snow cluster with bplapply() then 
using each cluster node as a master to call bpiterate(). Example on the 
man page.
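
A minimal sketch of the single-file pattern (multi-core only for now;
the input file name is hypothetical and the exact signature may still
change):

library(BiocParallel)
library(ShortRead)

fqstream <- FastqStreamer("reads.fastq.gz", n = 1e6)

ITER <- function() {
    fq <- yield(fqstream)
    if (length(fq) == 0L) NULL else fq     ## NULL signals end of stream
}

FUN <- function(fq) length(fq)             ## summarize one chunk

res <- bpiterate(ITER, FUN, BPPARAM = MulticoreParam(4))
close(fqstream)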


Maintenance of HTSeqGenie has been passed from Greg to Jens Reeder (cc'd 
on message). Jens, the one difference between bpiterate() and sclapply() 
is the absence of the trace function. Instead of having this hard coded 
we want to add a BPTRACE arg that allows tracing/debugging for any 
BiocParallel function. This should be added over the next week.



Valerie

On 08/06/2014 12:18 PM, Valerie Obenchain wrote:

Hi Jeff,

Thanks for the prompt. It looks like bpiterate or bpstream was intended
but didn't quite make it into BiocParallel. I'll discuss with Martin to
see if I'm missing other history / past discussions and then add it in.
Ryan had some ideas for parallel streaming we discussed at Bioc2014 so
this is timely. Both concepts can be revisited and implemented in some
form.


Greg,

Just wanted to confirm it's ok with you that we put an iteration of
sclapply in BiocParallel?


Valerie

On 08/06/2014 07:16 AM, Johnston, Jeffrey wrote:

Hi,

I have been using FastqStreamer() and yield() to process a large fastq
file in chunks, modifying both the read and name and then appending
the output to a new fastq file as each chunk is processed. This works
great, but would benefit greatly from being parallelized.

As far as I can tell, this problem isn’t easily solved with the
existing parallel tools because you can’t determine how many jobs
you’ll need in advance (you just call yield() until it stops returning
reads).

After some digging, I found the sclapply() function in the HTSeqGenie
package by Gregoire Pau, which he describes as a “multicore dispatcher”:

https://stat.ethz.ch/pipermail/bioc-devel/2013-October/004754.html

I wasn’t able to get the package to install from source due to some
dependencies (there are no binaries for Mac), but I did extract the
function and adapt it slightly for my use case. Here’s an example:

processChunk <- function(fq_chunk) {
   # manipulate fastq reads here
}

yieldHelper <- function() {
   fq <- yield(fqstream)
   if(length(fq) == 0) return(NULL)
   fq
}

fqstream <- FastqStreamer(“…”, n=1e6)
sclapply(yieldHelper, processChunk, max.parallel.jobs=4)
close(fqstream)

Based on the discussion linked above, it seems like there was some
interest in integrating this idea into BiocParallel. I would find that
very useful as it improves performance quite a bit and can likely be
applied to numerous stream-based processing tasks.

I will point out that in my case above, the processChunk() function
doesn’t return anything. Instead it appends the modified fastq records
to a new file. I have to use the Unix lockfile command to ensure that
only one child process appends to the output file at a time. I am not
certain if there is a more elegant solution to this (perhaps a queue
that is emptied by a dedicated writer process?).

Thanks,
Jeff




[[alternative HTML version deleted]]



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel






___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Parallel processing of reads in a single fastq file

2014-08-06 Thread Valerie Obenchain

Hi Jeff,

Thanks for the prompt. It looks like bpiterate or bpstream was intended 
but didn't quite make it into BiocParallel. I'll discuss with Martin to 
see if I'm missing other history / past discussions and then add it in. 
Ryan had some ideas for parallel streaming we discussed at Bioc2014 so 
this is timely. Both concepts can be revisited and implemented in some form.



Greg,

Just wanted to confirm it's ok with you that we put an iteration of 
sclapply in BiocParallel?



Valerie

On 08/06/2014 07:16 AM, Johnston, Jeffrey wrote:

Hi,

I have been using FastqStreamer() and yield() to process a large fastq file in 
chunks, modifying both the read and name and then appending the output to a new 
fastq file as each chunk is processed. This works great, but would benefit 
greatly from being parallelized.

As far as I can tell, this problem isn’t easily solved with the existing 
parallel tools because you can’t determine how many jobs you’ll need in advance 
(you just call yield() until it stops returning reads).

After some digging, I found the sclapply() function in the HTSeqGenie package 
by Gregoire Pau, which he describes as a “multicore dispatcher”:

https://stat.ethz.ch/pipermail/bioc-devel/2013-October/004754.html

I wasn’t able to get the package to install from source due to some 
dependencies (there are no binaries for Mac), but I did extract the function 
and adapt it slightly for my use case. Here’s an example:

processChunk <- function(fq_chunk) {
   # manipulate fastq reads here
}

yieldHelper <- function() {
   fq <- yield(fqstream)
   if(length(fq) == 0) return(NULL)
   fq
}

fqstream <- FastqStreamer(“…”, n=1e6)
sclapply(yieldHelper, processChunk, max.parallel.jobs=4)
close(fqstream)

Based on the discussion linked above, it seems like there was some interest in 
integrating this idea into BiocParallel. I would find that very useful as it 
improves performance quite a bit and can likely be applied to numerous 
stream-based processing tasks.

I will point out that in my case above, the processChunk() function doesn’t 
return anything. Instead it appends the modified fastq records to a new file. I 
have to use the Unix lockfile command to ensure that only one child process 
appends to the output file at a time. I am not certain if there is a more 
elegant solution to this (perhaps a queue that is emptied by a dedicated writer 
process?).

Thanks,
Jeff




[[alternative HTML version deleted]]



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fhcrc.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

2014-08-06 Thread Valerie Obenchain
The man page '...' section was updated in GenomicAlignments 1.1.24 in 
devel. I've now also updated it in 1.0.5 in release.


The '...' does not refer only to the fixed set of args listed below the 
dots. The '...' encompasses any argument(s) provided to 
summarizeOverlaps() not explicitly stated in the function signature. For 
example, if you passed FOO=3, then FOO would end up in '...'.


Any function/method called inside summarizeOverlaps() with a '...' will 
pass the arguments down; they continue to be passed down until they are 
explicitly stated in a function signature (e.g., 'width' and 'fix' in 
ResizeReads()).
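
A toy illustration of this pass-through (not GenomicAlignments code):

outer_fun <- function(x, ...) inner_fun(x, ...)   ## '...' passed down
inner_fun <- function(x, width = 10, fix = "start") {
    ## 'width' and 'fix' are only matched here, where they are named
    list(width = width, fix = fix)
}
outer_fun(1, width = 1, fix = "end")   ## width/fix reach inner_fun() via '...'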


Valerie

On 08/06/2014 11:35 AM, Ryan wrote:

Ok, I had a look at the code, and I think I understand now. The help text
for the "..." argument says "Additional arguments for BAM file methods
such as singleEnd, fragments or param that apply to the reading of records
from a file (see below)." But this is actually referring to the fixed
set of individual arguments listed below the dots. It doesn't apply to
the arguments that actually get matched by "..." in a call to
summarizeOverlaps. These actually get passed straight to
preprocess.reads. Perhaps the documentation for "..." should be updated
to reflect this?

On Wed Aug  6 11:21:20 2014, Valerie Obenchain wrote:

Hi Ryan,

On 08/05/2014 05:47 PM, Ryan C. Thompson wrote:

Hi again,

I'm looking at the examples in the summarizeOverlaps help page here:
http://www.bioconductor.org/packages/devel/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf




And the examples for preprocess.reads are a little confusing. One of the
examples passes some additional "..." options to summarizeOverlaps, and
implies that these will be passed along as the "..." arguments to the
proprocess.reads function:

summarizeOverlaps(grl, reads, mode=Union,
preprocess.reads=ResizeReads, width=1, fix="end")

The width and fix arguments are implied to be passed through to
ResizeReads, but I don't see it documented anywhere how this would be
done.


This is standard use of '...'. See the ?'...' and ?dotsMethods man
pages. I've added a sentence in \arguments clarifying that '...' can
encompass arguments called by any subsequent function/method.

The summarizeOverlaps documentation for "..." says "Additional

arguments for BAM file methods such as singleEnd, fragments or param
that apply to the reading of records from a file (see below)." I don't
see anything about passing through to preprocess.reads.

Incidentally, this is why my original example used a function that
constructed a closure with these arguments already bound. I would write
the example like this to ensure no ambiguity in argument passing (pay
attention to parens):


The 'resize.args' variable below captures all variables that exist in
the .dispatchOverlaps() helper, many of which don't need to be passed
to resize(). The 'preprocess.reads' function can be written this way
or it can have default values in the signature as I've done in the man
page example.


Valerie




ResizeReads <- function(mode, ...) {
resize.args <- list(...)
function(reads) {
reads <- as(reads, "GRanges")
## Need strandedness
stopifnot(all(strand(reads) != "*"))
do.call(resize, c(list(x=reads), resize.args))
}
}

## By default ResizeReads() counts reads that overlap on the 5' end:
summarizeOverlaps(grl, reads, mode=Union,
preprocess.reads=ResizeReads())

## Count reads that overlap on the 3' end by passing new values
## for width and fix:
summarizeOverlaps(grl, reads, mode=Union,
preprocess.reads=ResizeReads(width=1, fix="end"))

Anyway, I don't have the devel version of R handy to test this out, so I
don't know if what I've described is a problem in practice. But I think
that either the preprocess.reads function should be required to only
take one argument, or else the method of passing through additional
arguments to it should be documented.

-Ryan

On Tue 05 Aug 2014 05:12:41 PM PDT, Ryan C. Thompson wrote:

Hi Valerie,

I got really busy around May and never got a chance to thank you for
adding this option to summarizeOverlaps! So thank you!

-Ryan

On Thu 01 May 2014 04:25:33 PM PDT, Valerie Obenchain wrote:

GenomicAlignments 1.1.8 has a 'preprocess.reads' argument. This should
be a function where the first argument is 'reads' and the return value
is compatible with 'reads' in the pre-defined count modes.

I've used your ResizeReads() as an example in the man page. I think
the ability to pre-filter and used a pre-defined mode will be useful.
Thanks for the suggestion.

Valerie


On 05/01/2014 02:29 

Re: [Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

2014-08-06 Thread Valerie Obenchain

Hi Ryan,

On 08/05/2014 05:47 PM, Ryan C. Thompson wrote:

Hi again,

I'm looking at the examples in the summarizeOverlaps help page here:
http://www.bioconductor.org/packages/devel/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf


And the examples for preprocess.reads are a little confusing. One of the
examples passes some additional "..." options to summarizeOverlaps, and
implies that these will be passed along as the "..." arguments to the
preprocess.reads function:

summarizeOverlaps(grl, reads, mode=Union,
preprocess.reads=ResizeReads, width=1, fix="end")

The width and fix arguments are implied to be passed through to
ResizeReads, but I don't see it documented anywhere how this would be
done.


This is standard use of '...'. See the ?'...' and ?dotsMethods man 
pages. I've added a sentence in \arguments clarifying that '...' can 
encompass arguments called by any subsequent function/method.


The summarizeOverlaps documentation for "..." says "Additional

arguments for BAM file methods such as singleEnd, fragments or param
that apply to the reading of records from a file (see below)." I don't
see anything about passing through to preprocess.reads.

Incidentally, this is why my original example used a function that
constructed a closure with these arguments already bound. I would write
the example like this to ensure no ambiguity in argument passing (pay
attention to parens):


The 'resize.args' variable below captures all variables that exist in 
the .dispatchOverlaps() helper, many of which don't need to be passed to 
resize(). The 'preprocess.reads' function can be written this way or it 
can have default values in the signature as I've done in the man page 
example.



Valerie




ResizeReads <- function(mode, ...) {
resize.args <- list(...)
function(reads) {
reads <- as(reads, "GRanges")
## Need strandedness
stopifnot(all(strand(reads) != "*"))
do.call(resize, c(list(x=reads), resize.args))
}
}

## By default ResizeReads() counts reads that overlap on the 5' end:
summarizeOverlaps(grl, reads, mode=Union,
preprocess.reads=ResizeReads())

## Count reads that overlap on the 3' end by passing new values
## for width and fix:
summarizeOverlaps(grl, reads, mode=Union,
preprocess.reads=ResizeReads(width=1, fix="end"))

Anyway, I don't have the devel version of R handy to test this out, so I
don't know if what I've described is a problem in practice. But I think
that either the preprocess.reads function should be required to only
take one argument, or else the method of passing through additional
arguments to it should be documented.

-Ryan

On Tue 05 Aug 2014 05:12:41 PM PDT, Ryan C. Thompson wrote:

Hi Valerie,

I got really busy around May and never got a chance to thank you for
adding this option to summarizeOverlaps! So thank you!

-Ryan

On Thu 01 May 2014 04:25:33 PM PDT, Valerie Obenchain wrote:

GenomicAlignments 1.1.8 has a 'preprocess.reads' argument. This should
be a function where the first argument is 'reads' and the return value
is compatible with 'reads' in the pre-defined count modes.

I've used your ResizeReads() as an example in the man page. I think
the ability to pre-filter and use a pre-defined mode will be useful.
Thanks for the suggestion.

Valerie


On 05/01/2014 02:29 PM, Valerie Obenchain wrote:

On 05/01/2014 02:05 PM, Ryan wrote:

Hi Valerie,

On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:


I have some concerns about the *ExtraArgs() functions. Passing
flexible args to findOverlaps in the existing mode functions
fundamentally changes the documented behavior. The modes were created
to capture specific overlap situations pertaining to gene features
which are graphically depicted in the vignette. Changing 'maxgap' or
'minoverlap' will produce a variety of results inconsistent with past
behavior and difficult to document (e.g., under what circumstances
will IntersectionNotEmpty now register a hit).

Well, I wasn't so sure about those functions either. Obviously you can
pass arguments that break things. They were mostly designed to be
constructors for specific counting modes involving the
minoverlap/maxgap
arguments, but I decided I didn't need those modes after all. They're
certainly not designed to be exposed to the user. I haven't carefully
considered the interaction between the counting mode and
maxgap/minoverlap, but I believe that it would be roughly
equivalent to
extending/shrinking the features/reads by the specified amount (with
some differences for e.g. a feature/read smaller than 2*minoverlap).
For
example, with a read length of 100 and a minoverlap of 10 in Union
counting mode, this would be the 

Re: [Bioc-devel] writeVcf performance

2014-08-05 Thread Valerie Obenchain

Hi Michael,

I'm interested in working on this. I'll discuss with Martin next week 
when we're both back in the office.


Val




On 08/05/14 07:46, Michael Lawrence wrote:

Hi guys (Val, Martin, Herve):

Anyone have an itch for optimization? The writeVcf function is currently a
bottleneck in our WGS genotyping pipeline. For a typical 50 million row
gVCF, it was taking 2.25 hours prior to yesterday's improvements
(pasteCollapseRows) that brought it down to about 1 hour, which is still
too long by my standards (> 0). Only takes 3 minutes to call the genotypes
(and associated likelihoods etc) from the variant calls (using 80 cores and
450 GB RAM on one node), so the output is an issue. Profiling suggests that
the running time scales non-linearly in the number of rows.

Digging a little deeper, it seems to be something with R's string/memory
allocation. Below, pasting 1 million strings takes 6 seconds, but 10
million strings takes over 2 minutes. It gets way worse with 50 million. I
suspect it has something to do with R's string hash table.

set.seed(1000)
end <- sample(1e8, 1e6)
system.time(paste0("END", "=", end))
user  system elapsed
   6.396   0.028   6.420

end <- sample(1e8, 1e7)
system.time(paste0("END", "=", end))
user  system elapsed
134.714   0.352 134.978

Indeed, even this takes a long time (in a fresh session):

set.seed(1000)
end <- sample(1e8, 1e6)
end <- sample(1e8, 1e7)
system.time(as.character(end))
user  system elapsed
  57.224   0.156  57.366

But running it a second time is faster (about what one would expect?):

system.time(levels <- as.character(end))
user  system elapsed
  23.582   0.021  23.589

I did some simple profiling of R to find that the resizing of the string
hash table is not a significant component of the time. So maybe something
to do with the R heap/gc? No time right now to go deeper. But I know Martin
likes this sort of thing ;)

Michael

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Bioconductor newsletter - July 2014

2014-07-01 Thread Valerie Obenchain

Now available at

http://www.bioconductor.org/help/newsletters/



--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: voben...@fhcrc.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] GenomicFiles reducer and iterate argument

2014-06-18 Thread Valerie Obenchain

We'll try a single arg to REDUCER and see how it goes.

BTW I'm also going to swap out DataFrame for Vector in the rowData. 
DataFrame has been more difficult than anticipated (storing names, 
subsetting to get ranges out) and doesn't give any clear advantage over 
Vector.


Val



On 06/17/2014 02:59 PM, Michael Lawrence wrote:

I think there are two different use cases here. The first, the one that
I think is driving the design, is that the user writes a function for a
particular problem, where the value of iterate is known. The other use
case is that the user gets a summary function from somewhere else (a
package) and applies it using reduceBy*. In that case, the user would
potentially need to write a wrapper, depending on the formals of the
reusable function. The only way I could make the second use case work
with the current design is to have a higher order function that returns
a universal iterator that detects the value of iterate via nargs() and
behaves appropriately. The higher order function would not need to be
known to the user, just the package developer.



On Tue, Jun 17, 2014 at 1:39 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

Val's out today and I'm at least part of the problem so...


On 06/17/2014 10:13 AM, Michael Lawrence wrote:

On Tue, Jun 17, 2014 at 7:00 AM, Valerie Obenchain <voben...@fhcrc.org> wrote:

Hi Michael, Ryan,

Yes, it would be ideal to have a single signature for both
cases of
'iterate'. We went over the pros/cons again and at the end
of the day
decided to keep things as they are. No perfect solution here.

These were the primary points:

- Disadvantages of defining REDUCER with only '...' is that
'...' can
represent variables other than just the output from MAPPER.


Do you mean that "..." will capture additional arguments? From
where?


reduceBy* takes an argument ... and this is currently available to
both the MAPPER and REDUCER, see below.




- The unappealing aspect of the variadic approach is
introducing a new
check each time REDUCER is called.


What is this check?


- Going the other direction, considering a single arg for
REDUCER instead
two, requires coercing 'last' and 'current' to a list before
pulling them
apart again.


What is the problem with constructing this list? Isn't that one
extremely
fast line of code?


it's not the list construction but the lost convenience of named
arguments, in addition to consistency with Reduce when the data are
presented iteratively -- REDUCER=`+` instead of
REDUCER=function(lst) sum(unlist(lst, use.names=FALSE)).
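For concreteness, a toy illustration of the two calling conventions,
using base R's Reduce() as the analogue (illustrative only, not
GenomicFiles code):

Reduce(`+`, list(1, 2, 3))                        # pairwise reduction: 6
sumReducer <- function(lst) sum(unlist(lst, use.names=FALSE))
sumReducer(list(1, 2, 3))                         # single list argument: 6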



It seems to me simpler to settle on one signature, and my
preference would
be for the single list argument, just because the call is
smaller and
simpler. Then have a convenient adaptor to handle the variadic case.


The variadic adapter concept is easy enough to understand in
context, but would send me for a head scratch at some later time.

Martin





Valerie



On 06/15/14 16:36, Michael Lawrence wrote:

I kind of prefer the adaptor solution, just for the sake
of API
cleanliness
(the MAPPER/REDUCER pair has some elegance), but I think
we agree that the
iterate switch introduces undesirable coupling.




On Sun, Jun 15, 2014 at 3:07 PM, Ryan  wrote:

   What about having two separate reducer arguments, one
for a reducer that

takes two elements at a time and combines them, and
the other for a
reducer
that takes a list and combines all the elements of
the list? Specifying
both at once would be an error. I think it makes
more sense to say "these
two arguments expect different things" than "this
one argument expects a
different thing depending on the value of another
argument".

-Ryan


On Sun Jun 15 11:17:59 2014, Michael Lawrence wrote:

I just thought there is some benefit for the callback to be the same,
regardless of the iterate setting. This would allow generalization across
different data scales.

Re: [Bioc-devel] GenomicFiles reducer and iterate argument

2014-06-17 Thread Valerie Obenchain

Hi Michael, Ryan,

Yes, it would be ideal to have a single signature for both cases of 
'iterate'. We went over the pros/cons again and at the end of the day 
decided to keep things as they are. No perfect solution here.


These were the primary points:

- Disadvantages of defining REDUCER with only '...' is that '...' can 
represent variables other than just the output from MAPPER.


- The unappealing aspect of the variadic approach is introducing a new 
check each time REDUCER is called.


- Going the other direction, considering a single arg for REDUCER 
instead two, requires coercing 'last' and 'current' to a list before 
pulling them apart again.



Valerie


On 06/15/14 16:36, Michael Lawrence wrote:

I kind of prefer the adaptor solution, just for the sake of API cleanliness
(the MAPPER/REDUCER pair has some elegance), but I think we agree that the
iterate switch introduces undesirable coupling.




On Sun, Jun 15, 2014 at 3:07 PM, Ryan  wrote:


What about having two separate reducer arguments, one for a reducer that
takes two elements at a time and combines them, and the other for a reducer
that takes a list and combines all the elements of the list? Specifying
both at once would be an error. I think it makes more sense to say "these
two arguments expect different things" than "this one argument expects a
different thing depending on the value of another argument".

-Ryan


On Sun Jun 15 11:17:59 2014, Michael Lawrence wrote:


I just thought there is some benefit for the callback to be the same,
regardless of the iterate setting. This would allow generalization across
different data scales. Perhaps all that is needed is a constructor for an
adapter closure, one for each direction.

For example, the variadic adapter would look like:

Variadic <- function(FUN) {
    function(x, y) {
        if (missing(y)) {
            do.call(FUN, x)
        } else {
            FUN(x, y)
        }
    }
}

That would make it easy to e.g. adapt rbind into the framework. I wonder
if
there is precedent and better terminology from the functional programming
domain?
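
Hypothetical usage of the Variadic() sketch above -- the same reducer
then handles both the pairwise (iterate=TRUE) and whole-list
(iterate=FALSE) calls:

rbindReducer <- Variadic(rbind)
rbindReducer(data.frame(x = 1), data.frame(x = 2))        # pairwise call
rbindReducer(list(data.frame(x = 1), data.frame(x = 2)))  # list call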

Michael



On Sun, Jun 15, 2014 at 8:38 AM, Martin Morgan  wrote:

  On 06/15/2014 07:34 AM, Michael Lawrence wrote:


  Hi guys,


Was just checking out GenomicFiles and was a little surprised that the
arguments to the REDUCER are different depending on iterate=TRUE vs.
iterate=FALSE. In my often flawed opinion, iteration should not be a
concern of the REDUCER. It should be oblivious to the iteration mode. In
other words, when iterate=TRUE, it is a special case of having two
objects
to combine, instead of multiple.



My 'rationale' was that one would choose iterate=FALSE when one required
all elements to perform the reduction. I thought of the list (rather than
...) as the general R data structure for representing N elements, with a
special case (consistent with Reduce) made for the pairwise reduction of
iterate=TRUE. Either way, the two cases (x, y vs. list(), x, y vs. ...)
seem to require some explaining to the user. Is there a clear better
choice? You're the second person to trip over this, so I guess there's a
crack in the sidewalk...

Martin


  What would be convenient (but unnecessary) is to detect from the formal

arguments whether REDUCER is variadic or list-based. In other words, if
REDUCER is defined like function(...) { } it is called via do.call(),
otherwise it is passed the list.

Thoughts? Maybe I'm totally confused?

Michael

  [[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



 [[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel





[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

2014-05-01 Thread Valerie Obenchain
GenomicAlignments 1.1.8 has a 'preprocess.reads' argument. This should 
be a function where the first argument is 'reads' and the return value 
is compatible with 'reads' in the pre-defined count modes.


I've used your ResizeReads() as an example in the man page. I think the 
ability to pre-filter and use a pre-defined mode will be useful. Thanks 
for the suggestion.
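
A minimal sketch of such a preprocessing function, modeled on the
ResizeReads() example mentioned above ('features' and 'reads' in the
commented call are placeholders):

library(GenomicAlignments)
ResizeReads <- function(reads, width=150, fix="start", ...) {
    reads <- as(reads, "GRanges")          # resize() is defined on GRanges
    stopifnot(all(strand(reads) != "*"))   # directional extension needs strand
    resize(reads, width=width, fix=fix, ...)
}
## then, with one of the pre-defined counting modes:
## summarizeOverlaps(features, reads, mode=Union, preprocess.reads=ResizeReads)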


Valerie


On 05/01/2014 02:29 PM, Valerie Obenchain wrote:

On 05/01/2014 02:05 PM, Ryan wrote:

Hi Valerie,

On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:


I have some concerns about the *ExtraArgs() functions. Passing
flexible args to findOverlaps in the existing mode functions
fundamentally changes the documented behavior. The modes were created
to capture specific overlap situations pertaining to gene features
which are graphically depicted in the vignette. Changing 'maxgap' or
'minoverlap' will produce a variety of results inconsistent with past
behavior and difficult to document (e.g., under what circumstances
will IntersectionNotEmpty now register a hit).

Well, I wasn't so sure about those functions either. Obviously you can
pass arguments that break things. They were mostly designed to be
constructors for specific counting modes involving the minoverlap/maxgap
arguments, but I decided I didn't need those modes after all. They're
certainly not designed to be exposed to the user. I haven't carefully
considered the interaction between the counting mode and
maxgap/minoverlap, but I believe that it would be roughly equivalent to
extending/shrinking the features/reads by the specified amount (with
some differences for e.g. a feature/read smaller than 2*minoverlap). For
example, with a read length of 100 and a minoverlap of 10 in Union
counting mode, this would be the same as truncating the first and last
10 (or maybe 9?) bases and operating in normal Union mode. As I said,
though, there may be edge cases that I haven't thought of where
unexpected things happen.


I agree that controlling the overlap args is appealing and I like the
added ability to resize. I've created a 'chipseq' mode that combines
these ideas and gives what ResizeReads() was doing but now in 'mode'
form. If this implementation gives you the flexibility you were
looking for I'll check it into devel.


This sounds nice, but if I use the 'chipseq' mode, how do I specify
whether I want Union, IntersectionNotEmpty, or IntersectionStrict? It
looks like it just does Union? IntersectionStrict would be useful for
specifying that the read has to occur entirely within the bounds of a
called peak, for example. This is why I implemented it as a "wrapper"
that takes another mode as an argument, so that the resizing logic and
the counting logic were independent.


'chipseq' didn't implement the standard modes b/c I wanted to avoid the
clash of passing unconventional findOverlaps() args to those modes. The
assumption was that if the user specified a mode they would expect a
certain behavior ...

Maybe summarizeOverlaps could

accept an optional "read modification function", and if this is
provided, it will pass the reads through this before passing them to the
counting function. The read modification function would have to take any
valid reads argument and return another valid reads argument. It could
be used for modifying the reads as well as filtering them. This would
allow resizing without the awkward nesting method that I've used.


Good idea. Maybe a 'preprocess' or 'prefilter' arg to allow massaging
before counting. I'll post back when it's done.

Valerie



A couple of questions:

- Do you want to handle paired-end reads? You coerce to a GRanges to
resize but don't coerce back.

For paired end reads, there is no need to estimate the fragment length,
because the pair gives you both ends of the fragment. So if I had
paired-end ChIP-Seq data, I would use it as is with no resizing. I can't
personally think of a reason to resize a paired-end fragment, but I
don't know if others might need that.

I coerce to GRanges because I know how GRanges work, but I'm not as
familiar with GAlignments so I don't know how the resize function works
on GAlignments and other classes. I'm sure you know better than I do how
these work. If the coercion is superfluous, feel free to eliminate it.


- Do you want to require strand info for all reads? Is this because of
how resize() anchors "*" to 'start'?

Yes, I require strand info for all reads because the reads must be
directionally extended, which requires strand info. Ditto for counting
the 5-prime and 3-prime ends.

-Ryan





chipseq <- function(features, reads, ignore.strand=FALSE,
inter.feature=TRUE,
type="any", maxgap=0L, minoverlap=1L,
width=NULL, fix="start", use.names=TRUE)

Re: [Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

2014-05-01 Thread Valerie Obenchain

On 05/01/2014 02:05 PM, Ryan wrote:

Hi Valerie,

On Thu May 1 13:27:16 2014, Valerie Obenchain wrote:


I have some concerns about the *ExtraArgs() functions. Passing
flexible args to findOverlaps in the existing mode functions
fundamentally changes the documented behavior. The modes were created
to capture specific overlap situations pertaining to gene features
which are graphically depicted in the vignette. Changing 'maxgap' or
'minoverlap' will produce a variety of results inconsistent with past
behavior and difficult to document (e.g., under what circumstances
will IntersectionNotEmpty now register a hit).

Well, I wasn't so sure about those functions either. Obviously you can
pass arguments that break things. They were mostly designed to be
constructors for specific counting modes involving the minoverlap/maxgap
arguments, but I decided I didn't need those modes after all. They're
certainly not designed to be exposed to the user. I haven't carefully
considered the interaction between the counting mode and
maxgap/minoverlap, but I believe that it would be roughly equivalent to
extending/shrinking the features/reads by the specified amount (with
some differences for e.g. a feature/read smaller than 2*minoverlap). For
example, with a read length of 100 and a minoverlap of 10 in Union
counting mode, this would be the same as truncating the first and last
10 (or maybe 9?) bases and operating in normal Union mode. As I said,
though, there may be edge cases that I haven't thought of where
unexpected things happen.


I agree that controlling the overlap args is appealing and I like the
added ability to resize. I've created a 'chipseq' mode that combines
these ideas and gives what ResizeReads() was doing but now in 'mode'
form. If this implementation gives you the flexibility you were
looking for I'll check it into devel.


This sounds nice, but if I use the 'chipseq' mode, how do I specify
whether I want Union, IntersectionNotEmpty, or IntersectionStrict? It
looks like it just does Union? IntersectionStrict would be useful for
specifying that the read has to occur entirely within the bounds of a
called peak, for example. This is why I implemented it as a "wrapper"
that takes another mode as an argument, so that the resizing logic and
the counting logic were independent.


'chipseq' didn't implement the standard modes b/c I wanted to avoid the 
clash of passing unconventional findOverlaps() args to those modes. The 
assumption was that if the user specified a mode they would expect a 
certain behavior ...


Maybe summarizeOverlaps could

accept an optional "read modification function", and if this is
provided, it will pass the reads through this before passing them to the
counting function. The read modification function would have to take any
valid reads argument and return another valid reads argument. It could
be used for modifying the reads as well as filtering them. This would
allow resizing without the awkward nesting method that I've used.


Good idea. Maybe a 'preprocess' or 'prefilter' arg to allow massaging 
before counting. I'll post back when it's done.


Valerie



A couple of questions:

- Do you want to handle paired-end reads? You coerce to a GRanges to
resize but don't coerce back.

For paired end reads, there is no need to estimate the fragment length,
because the pair gives you both ends of the fragment. So if I had
paired-end ChIP-Seq data, I would use it as is with no resizing. I can't
personally think of a reason to resize a paired-end fragment, but I
don't know if others might need that.

I coerce to GRanges because I know how GRanges work, but I'm not as
familiar with GAlignments so I don't know how the resize function works
on GAlignments and other classes. I'm sure you know better than I do how
these work. If the coercion is superfluous, feel free to eliminate it.


- Do you want to require strand info for all reads? Is this because of
how resize() anchors "*" to 'start'?

Yes, I require strand info for all reads because the reads must be
directionally extended, which requires strand info. Ditto for counting
the 5-prime and 3-prime ends.

-Ryan





chipseq <- function(features, reads, ignore.strand=FALSE,
inter.feature=TRUE,
type="any", maxgap=0L, minoverlap=1L,
width=NULL, fix="start", use.names=TRUE)
{
reads <- as(reads, "GRanges")
if (any(strand(reads) == "*"))
stop("all reads must have strand")
if (!is.null(width))
reads <- resize(reads, width, fix=fix, use.names=use.names,
ignore.strand=ignore.strand)

ov <- findOverlaps(features, reads, type=type,
ignore.strand=ignore.strand,
maxgap=maxgap, minoverlap=minoverlap)
if (inter.feature) {
## Remove reads that overlap multiple features.
reads_to_keep <- which(countSubjectHits(ov) == 1L)
ov <- ov[subjectHits(ov) %in% reads_to_keep]
}
countQueryHits(ov)
}

Re: [Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

2014-05-01 Thread Valerie Obenchain

Thanks.

I have some concerns about the *ExtraArgs() functions. Passing flexible 
args to findOverlaps in the existing mode functions fundamentally 
changes the documented behavior. The modes were created to capture 
specific overlap situations pertaining to gene features which are 
graphically depicted in the vignette. Changing 'maxgap' or 'minoverlap' 
will produce a variety of results inconsistent with past behavior and 
difficult to document (e.g., under what circumstances will 
IntersectionNotEmpty now register a hit).


I agree that controlling the overlap args is appealing and I like the 
added ability to resize. I've created a 'chipseq' mode that combines 
these ideas and gives what ResizeReads() was doing but now in 'mode' 
form. If this implementation gives you the flexibility you were looking 
for I'll check it into devel.


A couple of questions:

- Do you want to handle paired-end reads? You coerce to a GRanges to 
resize but don't coerce back.


- Do you want to require strand info for all reads? Is this because of 
how resize() anchors "*" to 'start'?




chipseq <- function(features, reads, ignore.strand=FALSE, inter.feature=TRUE,
type="any", maxgap=0L, minoverlap=1L,
width=NULL, fix="start", use.names=TRUE)
{
reads <- as(reads, "GRanges")
if (any(strand(reads) == "*"))
stop("all reads must have strand")
if (!is.null(width))
reads <- resize(reads, width, fix=fix, use.names=use.names,
ignore.strand=ignore.strand)

ov <- findOverlaps(features, reads, type=type, ignore.strand=ignore.strand,
   maxgap=maxgap, minoverlap=minoverlap)
if (inter.feature) {
## Remove reads that overlap multiple features.
reads_to_keep <- which(countSubjectHits(ov) == 1L)
ov <- ov[subjectHits(ov) %in% reads_to_keep]
}
countQueryHits(ov)
}



To count the overlaps of 5' and 3' ends:

summarizeOverlaps(features, reads, mode=chipseq, fix="start", width=1)
summarizeOverlaps(features, reads, mode=chipseq, fix="end", width=1)


Valerie

On 04/30/2014 02:41 PM, Ryan C. Thompson wrote:

No, I forgot to attach the file. Here is the link:

https://www.dropbox.com/s/7qghtksl3mbvlsl/counting-modes.R

On Wed 30 Apr 2014 02:18:28 PM PDT, Valerie Obenchain wrote:

Hi Ryan,

These sound like great contributions. I didn't get an attachment - did
you send one?

Thanks.
Valerie

On 04/30/2014 01:06 PM, Ryan C. Thompson wrote:

Hi all,

I recently asked about ways to do non-standard read counting in
summarizeOverlaps, and Martin Morgan directed me toward writing a custom
function to pass as the "mode" parameter. I have now written the custom
modes that I require for counting my ChIP-Seq reads, and I figured I
would contribute them back in case there was interest in merging them.

The main three functions are "ResizeReads", "FivePrimeEnd", and
"ThreePrimeEnd". The first allows you to directionally extend or shorten
each read to the effective fragment length for the purpose of
determining overlaps. For example, if each read represents the 5-prime
end of a 150-bp fragment and you want to count these fragments using the
Union mode, you could do:

 summarizeOverlaps(mode=ResizeReads(mode=Union, width=150,
fix="start"), ...)

Note that ResizeReads takes a mode argument. It returns a function (with
a closure storing the passed arguments) that performs the resizing (by
coercing reads to GRanges and calling "resize") and then dispatches to
the provided mode. (It probably needs to add a call to "match.fun"
somewhere.)

The other two functions are designed to count overlaps of only the read
ends. They are implemented internally using "ResizeReads" with width=1.

The other three counting modes (the "*ExtraArgs" functions) are meant to
be used to easily construct new counting modes. Each function takes any
number of arguments and returns a counting mode that works like the
standard one of the same name, except that those arguments are passed as
extra args to "findOverlaps". For example, you could do Union mode with
a requirement for a minimum overlap of 10:

 summarizeOverlaps(mode=UnionExtraArgs(minoverlap=10), ...)

Note that these can be combined or "nested". For instance, you might
want a fragment length of 150 and a min overlap of 10:

 myCountingMode <- ResizeReads(mode=UnionExtraArgs(minoverlap=10),
width=150, fix="start")
 summarizeOverlaps(mode=myCountingMode, ...)

Anyway, if you think any of these are worthy of inclusion for
BioConductor, feel free to add them in. I'm not so sure about the
"nesting" idea, though. Functions that return functions (with

Re: [Bioc-devel] Additional summarizeOverlaps counting modes for ChIP-Seq

2014-04-30 Thread Valerie Obenchain

Hi Ryan,

These sound like great contributions. I didn't get an attachment - did 
you send one?


Thanks.
Valerie

On 04/30/2014 01:06 PM, Ryan C. Thompson wrote:

Hi all,

I recently asked about ways to do non-standard read counting in
summarizeOverlaps, and Martin Morgan directed me toward writing a custom
function to pass as the "mode" parameter. I have now written the custom
modes that I require for counting my ChIP-Seq reads, and I figured I
would contribute them back in case there was interest in merging them.

The main three functions are "ResizeReads", "FivePrimeEnd", and
"ThreePrimeEnd". The first allows you to directionally extend or shorten
each read to the effective fragment length for the purpose of
determining overlaps. For example, if each read represents the 5-prime
end of a 150-bp fragment and you want to count these fragments using the
Union mode, you could do:

 summarizeOverlaps(mode=ResizeReads(mode=Union, width=150,
fix="start"), ...)

Note that ResizeReads takes a mode argument. It returns a function (with
a closure storing the passed arguments) that performs the resizing (by
coercing reads to GRanges and calling "resize") and then dispatches to
the provided mode. (It probably needs to add a call to "match.fun"
somewhere.)

The other two functions are designed to count overlaps of only the read
ends. They are implemented internally using "ResizeReads" with width=1.

The other three counting modes (the "*ExtraArgs" functions) are meant to
be used to easily construct new counting modes. Each function takes any
number of arguments and returns a counting mode that works like the
standard one of the same name, except that those arguments are passed as
extra args to "findOverlaps". For example, you could do Union mode with
a requirement for a minimum overlap of 10:

 summarizeOverlaps(mode=UnionExtraArgs(minoverlap=10), ...)

Note that these can be combined or "nested". For instance, you might
want a fragment length of 150 and a min overlap of 10:

 myCountingMode <- ResizeReads(mode=UnionExtraArgs(minoverlap=10),
width=150, fix="start")
 summarizeOverlaps(mode=myCountingMode, ...)

Anyway, if you think any of these are worthy of inclusion for
BioConductor, feel free to add them in. I'm not so sure about the
"nesting" idea, though. Functions that return functions (with states
saved in closures, which are then passed into another function) are
confusing for people who are not programmers by trade. Maybe
summarizeOverlaps should just gain an argument to pass args to
findOverlaps.

-Ryan Thompson

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Bug in as.data.frame method for RangedData?

2014-04-28 Thread Valerie Obenchain

Hi Kevin,

Thanks for the report. The old behavior of as.data.frame on 'RangedData' 
objects has been re-instated in IRanges 1.99.7.


Last week we added a new as.data.frame,List method that is now used by 
all 'List' objects. This was done for a variety of reasons that I'll 
include in an announcement on the mailing list. The long and the short 
of it is that 2 new columns have been added, 'group' and 'group_name' 
and you'll have the ability to keep outer metadata set on the 'List' object.


I did not intend to change the behavior of as.data.frame,RangedData. 
What happened was the method calls a couple of as.data.frame() methods 
on 'List' objects. Once I removed the old 'List' methods, 
as.data.frame,RangedData went through the new 'List' methods which 
created the new columns you were seeing.


Eventually RangedData objects will be phased out so we've decided to 
keep the legacy behavior instead of updating them to include the new 
columns.


Sorry for the inconvenience.

Valerie


On 04/28/2014 01:02 PM, Kevin Ushey wrote:

Hi,

With the following code and IRanges 1.99.6:

   ranges <- IRanges(c(1,2,3),c(4,5,6))
   rd <- RangedData(ranges)
   as.data.frame(rd)

I get


as.data.frame(rd)

  group group_name start end width group_name.1
1     1          1     1   4     4            1
2     1          1     2   5     4            1
3     1          1     3   6     4            1

With the current release version (IRanges 1.22.4), I get


as.data.frame(rd)

  space start end width
1     1     1   4     4
2     1     2   5     4
3     1     3   6     4

This seems like a bug.

Thanks,
Kevin


sessionInfo()


R Under development (unstable) (2014-04-05 r65382)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] restrictToSNV for VCF

2014-04-09 Thread Valerie Obenchain

Update on these tasks.

1) XStringSetList now has an nchar() method (as of Biostrings 2.31.17)

2) restrictToSNV() was removed from VariantAnnotation

3) The following generics and methods for VCF and VRanges have been 
added to VariantAnnotation 1.9.50:


isSNV()
isInsertion()
isDeletion()
isIndel()
isSubstitution()
isTransition()

I've held off on adding

isSV()
isSVPrecise()

until we have a way to distinguish structural vs non-structural ALT. 
Currently if any of the ALT values are structural, all are coerced to 
character. It would be good to have a way to distinguish a mixture of 
ALT values so we can compute on the nucleotides and do whatever else on 
the structural variants. This may be a project for the next dev cycle.
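
A short usage sketch of the new predicates (assumes VariantAnnotation >=
1.9.50 and the chr22 example file shipped with the package):

library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
snvs <- vcf[isSNV(vcf), ]      # the restrictToSNV() replacement: x[isSNV(x), ]
indels <- vcf[isIndel(vcf), ]  # insertions and deletions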


Valerie


On 03/19/2014 03:29 PM, Michael Lawrence wrote:

Thanks Sean. Probably also need an "isSubstitution" for any
substitution, either SNV or complex.


On Wed, Mar 19, 2014 at 3:20 PM, Sean Davis  wrote:



On Wed, Mar 19, 2014 at 4:26 PM, Valerie Obenchain  wrote:

Thanks for the feedback.

I'll look into nchar for XStringSetList.

I'm in favor of supporting isDeletion(), isInsertion(),
isIndel() and isSNV() for the VCF classes and removing
restrictToSNV(). I could add an argument 'all_alt' or
'all_alt_agreement' to be used with CollapsedVCF in the case
where not all alternate alleles meet the criteria.

Here are the current definitions:

isDeletion <- function(x) {
   nchar(alt(x)) == 1L & nchar(ref(x)) > 1L &
substring(ref(x), 1, 1) == alt(x)
}

isInsertion <- function(x) {
   nchar(ref(x)) == 1L & nchar(alt(x)) > 1L &
substring(alt(x), 1, 1) == ref(x)
}

isIndel <- function(x) {
   isDeletion(x) | isInsertion(x)
}

isSNV <- function(x) {
   nchar(alt(x)) == 1L & nchar(ref(x)) == 1L
}



To be thorough:

isTransition()

isSV()

isSVPrecise()

We haven't been using VCF for SVs much yet, but there are probably
some fun things to be done on that front.

Sean



Valerie



On 03/19/2014 01:07 PM, Vincent Carey wrote:




On Wed, Mar 19, 2014 at 4:00 PM, Michael Lawrence  wrote:

 It would be nice to have functions like isSNV, isIndel,
isDeletion,
 etc that at least provide precise definitions of the
terminology.
 I've added these, but they're designed only for
VRanges. Should work
 for ExpandedVCF.

 Also, it would be nice if restrictToSNV just assumed
that alt(x)
 must be something with nchar() support (with special
handling for
 any List), so that the 'character' vector of
alt,VRanges would work
 immediately. Basically restrictToSNV should just be
x[isSNV(x)]. Is
 there even a use-case for the restrictToSNV abstraction
if we did that?


for VCF instance it would be x[isSNV(x),] and indeed I think
that would
be sufficient.  i like the idea of having this family of
predicates for
    variant classes to allow such selections

 Michael



On Tue, Mar 18, 2014 at 10:36 AM, Valerie Obenchain  wrote:

 Hi,

 I've added a restrictToSNV() function to
VariantAnnotation
 (1.9.46). The return value is a subset VCF object
containing
 SNVs only. The function operates on CollapsedVCF or
ExapandedVCF
 and the alt(VCF) value must be nucleotides (i.e.,
no structural
 variants).

 A variant is considered a SNV if the nucleotide
sequences in
 both ref(vcf) and alt(x) are of length 1. I have a
question
 about how variants with multiple 'ALT' values
should be handled.

 Should we consider row 4 a SNV? One 'ALT' is length
1, the other
 is not.

ALT <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))

Re: [Bioc-devel] Bioconductor newsletter: April 2014

2014-04-01 Thread Valerie Obenchain

On 04/01/14 09:21, Cook, Malcolm wrote:

This is great.

one quick Errata - The link to the RefNet.db package is dead.


RefNet / RefNet.db won't be available until the April 14 release. Thanks 
for pointing that out - I'll add a note.


Valerie



!Malcolm

  >-Original Message-
  >From: bioc-devel-boun...@r-project.org 
[mailto:bioc-devel-boun...@r-project.org] On Behalf Of Valerie Obenchain
  >Sent: Tuesday, April 01, 2014 9:36 AM
  >To: bioconduc...@r-project.org; bioc-devel@r-project.org
  >Subject: [Bioc-devel] Bioconductor newsletter: April 2014
  >
  >We are pleased to announce the first issue of the Bioconductor
  >newsletter. The aim is to produce this on a quarterly schedule with
  >updates on core development projects and BioC-related events.
  >
  >Input on current publications, research updates and items of interest
  >are welcome (send to voben...@fhcrc.org). Have a look and let us know
  >what you think.
  >
  >http://www.bioconductor.org/help/newsletters/2014_April/
  >
  >
  >Thanks.
  >Valerie
  >
  >--
  >Program in Computational Biology
  >Fred Hutchinson Cancer Research Center
  >1100 Fairview Ave. N, M1-B155
  >Seattle, WA 98109-1024
  >
  >E-mail: voben...@fhcrc.org
  >Phone: (206) 667-3158
  >
  >___
  >Bioc-devel@r-project.org mailing list
  >https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Bioconductor newsletter: April 2014

2014-04-01 Thread Valerie Obenchain
We are pleased to announce the first issue of the Bioconductor 
newsletter. The aim is to produce this on a quarterly schedule with 
updates on core development projects and BioC-related events.


Input on current publications, research updates and items of interest 
are welcome (send to voben...@fhcrc.org). Have a look and let us know 
what you think.


http://www.bioconductor.org/help/newsletters/2014_April/


Thanks.
Valerie

--
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone: (206) 667-3158

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

2014-03-28 Thread Valerie Obenchain

Hi Herve,

I must retract my previous statement about 'yieldSize' and 'which'. As 
of Rsamtools 1.15.0, scanBam() (and functions that build on it) does 
handle the case where both are supplied. This is true for the non-mate 
and mate-pairing code.



From ?BamFile:



 yieldSize: Number of records to yield each time the file is read
  from using ‘scanBam’ or, when ‘length(bamWhich()) != 0’, a
  threshold which yields records in complete ranges whose sum
  first exceeds ‘yieldSize’. Setting ‘yieldSize’ on a
  ‘BamFileList’ does not alter existing yield sizes set on the
  individual ‘BamFile’ instances.



fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
bf <- BamFile(fl, yieldSize=1000)
which <- tileGenome(seqlengths(bf),
 tilewidth=500, cut.last.tile.in.chrom=TRUE)
param <- ScanBamParam(which=which, what="qname")
FUN <- function(elt) length(elt[[1]])

Here we have both 'yieldSize' and a 'which' in the param. We ask for a 
yield of 1000 records. The first range has only 394 records, the second 
has 570 and the third another 570. As explained in the man page snippet 
above, records are yielded in complete ranges whose cumulative sum first 
exceeds 'yieldSize'. With range 3 the sum exceeds 1000, so we get all of 
range 3 and then stop.


sapply(scanBam(bf, param=param), FUN)

sapply(scanBam(bf, param=param), FUN)

    seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
           394            570            570              0              0
 seq2:501-1000 seq2:1001-1500 seq2:1501-1584
             0              0              0


We can open the file and yield through all records:

bf <- open(BamFile(fl, yieldSize=1000))
sapply(scanBam(bf, param=param), FUN)


sapply(scanBam(bf, param=param), FUN)

    seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
           394            570            570              0              0
 seq2:501-1000 seq2:1001-1500 seq2:1501-1584
             0              0              0

sapply(scanBam(bf, param=param), FUN)

    seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
             0              0              0             82            562
 seq2:501-1000 seq2:1001-1500 seq2:1501-1584
           709              0              0

sapply(scanBam(bf, param=param), FUN)

    seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
             0              0              0              0              0
 seq2:501-1000 seq2:1001-1500 seq2:1501-1584
             0            597             60

sapply(scanBam(bf, param=param), FUN)

    seq1:1-500  seq1:501-1000 seq1:1001-1500 seq1:1501-1575     seq2:1-500
             0              0              0              0              0
 seq2:501-1000 seq2:1001-1500 seq2:1501-1584
             0              0              0



I've removed the misinformation from the man pages I altered. Also added 
a unit test for the mates code with 'yieldSize' and 'which' in Rsamtools.


Val

On 03/27/2014 11:36 AM, Hervé Pagès wrote:

Hi Val,

On 03/27/2014 09:13 AM, Valerie Obenchain wrote:

I should also mention that when both 'yieldSize' in the BamFile and
'which' in ScanBamParam are set the 'which' gets priority. The point of
'yieldSize' is to provide a reasonable chunk for processing the file.
When 'which' is provided it's assumed that range is of reasonable chunk
size so the yield is ignored.


Note that more than 1 range can be specified thru 'which'.

What about emitting a warning when 'yieldSize' is ignored?



I've added this info to the 'summarizeOverlaps' and 'readGAlignments'
man pages in GenomicAlignments.


Is this a property of scanBam()? If so then maybe it should be
documented in the man page for scanBam(). I just had a quick look
and was not able to find it, sorry if I missed it.

summarizeOverlaps(), readGAlignments(), and probably other tools,
just rely on scanBam().

Thanks,
H.



Valerie

On 03/27/14 08:30, Valerie Obenchain wrote:

Hi Mike,

This is fixed in Rsamtools 1.15.35.

The bug was related to when the mate-pairing was performed wrt meeting
the 'yieldSize' requirement. Thanks for sending the file and
reproducible example.

The file has ~115 million records:

fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"

countBam(fl)$records

 [1] 114943975


To process the complete file with a yield size of 1e6 took ~ 18 GIG and
25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)

bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
SO <- function(x)
 summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)


system.time(SO(bf))

 user   system  elapsed
 1545.684   12.412 1558.498

Re: [Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

2014-03-27 Thread Valerie Obenchain
I should also mention that when both 'yieldSize' in the BamFile and 
'which' in ScanBamParam are set the 'which' gets priority. The point of 
'yieldSize' is to provide a reasonable chunk for processing the file. 
When 'which' is provided it's assumed that range is of reasonable chunk 
size so the yield is ignored.


I've added this info to the 'summarizeOverlaps' and 'readGAlignments' 
man pages in GenomicAlignments.


Valerie

On 03/27/14 08:30, Valerie Obenchain wrote:

Hi Mike,

This is fixed in Rsamtools 1.15.35.

The bug was related to when the mate-pairing was performed wrt meeting
the 'yieldSize' requirement. Thanks for sending the file and
reproducible example.

The file has ~115 million records:

fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"

countBam(fl)$records

 [1] 114943975


To process the complete file with a yield size of 1e6 took ~ 18 GIG and
25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)

bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
SO <- function(x)
 summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)


system.time(SO(bf))

 user   system  elapsed
 1545.684   12.412 1558.498


Thanks for reporting the bug.

Valerie



On 03/21/14 13:55, Michael Love wrote:

hi Valerie,

Thanks. I'm trying now to make use of the new mate pairing algorithm but
keeping running out of memory (i requested 50 Gb) and getting my job
killed. I wonder if you could try this code/example below?

If the new C code is faster for paired-end than the
pre-sorting/obeyQname method that would be great (eliminates the need to
have the extra qname-sorted Bam), but it seems to me that with the old
method, it was easier to specify hard limits on memory. Maybe I am
missing something though. :)

Here's an RNA-Seq sample from Encode, and then I run samtools index
locally.

from:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/


I download the hg19 paired end reads:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam


library("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
grl <- exonsBy(txdb, by="gene")
library("Rsamtools")
bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
library("GenomicAlignments")
system.time({so <- summarizeOverlaps(grl, bf,
  ignore.strand=TRUE,
  singleEnd=FALSE)})



 > sessionInfo()
R Under development (unstable) (2014-03-18 r65213)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
  [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats graphics  grDevices utils datasets  methods
[8] base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
  [2] GenomicFeatures_1.15.11
  [3] AnnotationDbi_1.25.15
  [4] Biobase_2.23.6
  [5] GenomicAlignments_0.99.32
  [6] BSgenome_1.31.12
  [7] Rsamtools_1.15.33
  [8] Biostrings_2.31.14
  [9] XVector_0.3.7
[10] GenomicRanges_1.15.39
[11] GenomeInfoDb_0.99.19
[12] IRanges_1.21.34
[13] BiocGenerics_0.9.3
[14] Defaults_1.1-1
[15] devtools_1.4.1
[16] knitr_1.5
[17] BiocInstaller_1.13.3

loaded via a namespace (and not attached):
  [1] BatchJobs_1.2   BBmisc_1.5  BiocParallel_0.5.17
  [4] biomaRt_2.19.3  bitops_1.0-6brew_1.0-6
  [7] codetools_0.2-8 DBI_0.2-7   digest_0.6.4
[10] evaluate_0.5.1  fail_1.2foreach_1.4.1
[13] formatR_0.10httr_0.2iterators_1.0.6
[16] memoise_0.1 plyr_1.8.1  Rcpp_0.11.1
[19] RCurl_1.95-4.1  RSQLite_0.11.4  rtracklayer_1.23.18
[22] sendmailR_1.1-2 stats4_3.2.0stringr_0.6.2
[25] tools_3.2.0 whisker_0.3-2   XML_3.98-1.1
[28] zlibbioc_1.9.0
 >





On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain  wrote:

On 03/19/14 10:24, Michael Love wrote:

hi Valerie,

If the Bam is not sorted by name, isn't it possible that
readGAlignment*
will load > yieldSize number of reads in order to find the mate?


Sorry, our emails keep criss-crossing.

Because the mate-pairing is now done in C, yieldSize is no longer a
constraint.

When yieldSize is given, say 100, then 100 mates are returned. The algo
moves through the file until 100 records are successfully paired.

Re: [Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

2014-03-27 Thread Valerie Obenchain

Hi Mike,

This is fixed in Rsamtools 1.15.35.

The bug was related to when the mate-pairing was performed wrt meeting 
the 'yieldSize' requirement. Thanks for sending the file and 
reproducible example.


The file has ~115 million records:

fl <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"

countBam(fl)$records

 [1] 114943975


To process the complete file with a yield size of 1e6 took ~ 18 GIG and 
25 minutes. (ubuntu server, 16 processors, 387 GIG of ram)


bf <- BamFile(fl, yieldSize=1000000, asMates=TRUE)
grl <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, by="gene")
SO <- function(x)
summarizeOverlaps(grl, x, ignore.strand=TRUE, singleEnd=FALSE)


system.time(SO(bf))

 user   system  elapsed
 1545.684   12.412 1558.498


Thanks for reporting the bug.

Valerie



On 03/21/14 13:55, Michael Love wrote:

hi Valerie,

Thanks. I'm trying now to make use of the new mate pairing algorithm but
keeping running out of memory (i requested 50 Gb) and getting my job
killed. I wonder if you could try this code/example below?

If the new C code is faster for paired-end than the
pre-sorting/obeyQname method that would be great (eliminates the need to
have the extra qname-sorted Bam), but it seems to me that with the old
method, it was easier to specify hard limits on memory. Maybe I am
missing something though. :)

Here's an RNA-Seq sample from Encode, and then I run samtools index locally.

from:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/

I download the hg19 paired end reads:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeCaltechRnaSeq/wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam

library("GenomicFeatures")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
grl <- exonsBy(txdb, by="gene")
library("Rsamtools")
bamFile <- "wgEncodeCaltechRnaSeqGm12878R2x75Il200AlignsRep1V2.bam"
bf <- BamFile(bamFile, yieldSize=1000000, asMates=TRUE)
library("GenomicAlignments")
system.time({so <- summarizeOverlaps(grl, bf,
  ignore.strand=TRUE,
  singleEnd=FALSE)})



 > sessionInfo()
R Under development (unstable) (2014-03-18 r65213)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8   LC_NAME=C
  [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats graphics  grDevices utils datasets  methods
[8] base

other attached packages:
  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
  [2] GenomicFeatures_1.15.11
  [3] AnnotationDbi_1.25.15
  [4] Biobase_2.23.6
  [5] GenomicAlignments_0.99.32
  [6] BSgenome_1.31.12
  [7] Rsamtools_1.15.33
  [8] Biostrings_2.31.14
  [9] XVector_0.3.7
[10] GenomicRanges_1.15.39
[11] GenomeInfoDb_0.99.19
[12] IRanges_1.21.34
[13] BiocGenerics_0.9.3
[14] Defaults_1.1-1
[15] devtools_1.4.1
[16] knitr_1.5
[17] BiocInstaller_1.13.3

loaded via a namespace (and not attached):
  [1] BatchJobs_1.2   BBmisc_1.5  BiocParallel_0.5.17
  [4] biomaRt_2.19.3  bitops_1.0-6brew_1.0-6
  [7] codetools_0.2-8 DBI_0.2-7   digest_0.6.4
[10] evaluate_0.5.1  fail_1.2foreach_1.4.1
[13] formatR_0.10httr_0.2iterators_1.0.6
[16] memoise_0.1 plyr_1.8.1  Rcpp_0.11.1
[19] RCurl_1.95-4.1  RSQLite_0.11.4  rtracklayer_1.23.18
[22] sendmailR_1.1-2 stats4_3.2.0stringr_0.6.2
[25] tools_3.2.0 whisker_0.3-2   XML_3.98-1.1
[28] zlibbioc_1.9.0
 >





On Wed, Mar 19, 2014 at 2:00 PM, Valerie Obenchain  wrote:

On 03/19/14 10:24, Michael Love wrote:

hi Valerie,

If the Bam is not sorted by name, isn't it possible that
readGAlignment*
will load > yieldSize number of reads in order to find the mate?


Sorry, our emails keep criss-crossing.

Because the mate-pairing is now done in C, yieldSize is no longer a
constraint.

When yieldSize is given, say 100, then 100 mates are returned. The
algo moves through the file until 100 records are successfully
paired. These 100 are yielded to the user and the code picks up
where it left off. A related situation is the 'which' in a param. In
this case you want the mates in a particular range. The algo moves
through the range and pairs what it can. If it's left with unmated
records it goes outside the range looking for them.
readGAlignmentsList() will return all records in this range, mated
or not. The metadata column of 'mate_status' indicates the different
groupings.

Re: [Bioc-devel] file registry - feedback

2014-03-25 Thread Valerie Obenchain

Hi,

This discussion went off-line and I wanted to give a summary of what we 
decided to go with.


We'll create a new package, BiocFile, that has a minimal API.

API:
- 'File' class (virtual, reference class) and constructor
- close / open / isOpen
- import / export
- file registry

We won't require existing *File classes to implement yield but would 
'recommend' that new *File classes do. By getting this structure in 
place we can guide future *File developments in a consistent direction 
even if we can't harmonize all current classes. I'll start work on this 
after the release.
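
Purely for illustration, a sketch of what the registry piece might look
like (the BiocFile package does not exist yet, so every name below is an
assumption):

library(methods)
setClass("BiocFile", representation("VIRTUAL", resource="character"))

.fileRegistry <- new.env(parent=emptyenv())

registerFileType <- function(extension, class)
    assign(tolower(extension), class, envir=.fileRegistry)

## General constructor: pick the *File subclass from the file extension.
File <- function(path) {
    cls <- get0(tolower(tools::file_ext(path)), envir=.fileRegistry)
    if (is.null(cls))
        stop("no File class registered for '", path, "'")
    new(cls, resource=path)
}

## Toy concrete subclass standing in for, e.g., a BAM file class:
setClass("ToyBamFile", contains="BiocFile")
registerFileType("bam", "ToyBamFile")
f <- File("reads.bam")   # dispatches to ToyBamFile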


Thanks again for the input.

Valerie

On 03/11/2014 10:23 PM, Michael Lawrence wrote:




On Tue, Mar 11, 2014 at 3:33 PM, Hervé Pagès  wrote:

On 03/11/2014 02:52 PM, Hervé Pagès wrote:

    On 03/11/2014 09:57 AM, Valerie Obenchain wrote:

Hi Herve,

On 03/10/2014 10:31 PM, Hervé Pagès wrote:

Hi Val,

I think it would help understand the motivations behind
this proposal
if you could give an example of a method where the user
cannot supply
a file name but has to create a 'File' (or 'FileList')
object first.
And how the file registry proposal below would help.
It looks like you have such an example in the
GenomicFileViews package.
Do you think you could give more details?


The most recent motivating use case was in creating
subclasses of
GenomicFileViews objects (BamFileViews, BigWigFileViews,
etc.) We wanted
to have a general constructor, something like
GenomicFileViews(), that
would create the appropriate subclass. However to create the
correct
subclass we needed to know if the files were bam, bw, fasta etc.
Recognition of the file type by extension would allow us to
do this with
no further input from the user.


That helps, thanks!

Having this kind of general constructor sounds like it could
indeed be
useful. Would be an opportunity to put all these *File classes
(the 22
RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile
subclasses defined in Rsamtools) under the same umbrella (i.e. a
parent
virtual class) and use the name of this virtual class (e.g.
File) for
the general constructor.

Allowing a registration mechanism to extend the knowledge of
this File()
constructor is an implementation detail. I don't see a lot of
benefit to
it. Only a package that implements a concrete File subclass would
actually need to register the new subclass. Sounds easy enough
to ask
to whoever has commit access to the File() code to modify it. This
kind of update might also require adding the name of the package
where
the new File subclass is implemented to the Depends/Imports/Suggests
of the package where File() lives, which is something that cannot be
done via a registration mechanism.


This clean-up of the *File jungle would also be a good opportunity to:

   - Choose what we want to do with reference classes: use them for all
 the *File classes or for none of them. (Right now, those defined
 in Rsamtools are reference classes, and those defined in
 rtracklayer are not.)

   - Move the I/O functionality currently in rtracklayer to a
 separate package. Based on the number of contributed packages I
 reviewed so far that were trying to reinvent the wheel because
 they had no idea that the I/O function they needed was actually
 in rtracklayer, I'd like to advocate for using a package name
 that makes it very clear that it's all about I/O.



I can see some benefit in renaming/reorganizing, but if they weren't
able to perform a simple google search for functionality, I don't think
the name of the package was the problem. "read gff bioconductor" returns
rtracklayer as the top hit.


H.



H.



        Val


    Thanks,
H.


On 03/10/2014 08:46 PM, Valerie Obenchain wrote:

Hi all,

I'm soliciting feedback on the idea of a general
file 'registry' that
would identify file types by their extensions. This
is similar in
spirit
to FileForformat() in rtracklayer but a more general
abstraction that
could be used across packages. The goal

Re: [Bioc-devel] Cytogenetic bands

2014-03-25 Thread Valerie Obenchain

Hi Ilari,

org.Hs.eg.db is one of the packages included in Homo.sapiens and it's 
the origin of 'MAP'. This variable maps between entrez gene ids and 
cytoband names, not genomic coordinates (as you've discovered). It 
includes bands and sub-bands provided by Entrez Gene downloaded from here:


  ftp://ftp.ncbi.nlm.nih.gov/gene/DATA

To see a full description of 'MAP':

  library(org.Hs.eg.db)
  ?org.Hs.egMAP

We don't have an annotation package with cytoband coordinates but you 
can download them using rtracklayer:


library(rtracklayer)
session <- browserSession()
genome(session) <- "hg19"
query <- ucscTableQuery(session, "cytoBandIdeo")
tbl <- getTable(query)


dim(tbl)

[1] 931   5



head(tbl)

  chrom chromStart chromEnd   name gieStain
1  chr1          0  2300000 p36.33     gneg
2  chr1    2300000  5400000 p36.32   gpos25
3  chr1    5400000  7200000 p36.31     gneg
4  chr1    7200000  9200000 p36.23   gpos25
5  chr1    9200000 12700000 p36.22     gneg
6  chr1   12700000 16200000 p36.21   gpos50

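A sketch of the follow-up step Ilari asked about (mapping arbitrary
positions to the band containing them) with findOverlaps(); column names
are taken from the UCSC table above:

library(GenomicRanges)
bands <- GRanges(tbl$chrom,
                 IRanges(tbl$chromStart + 1, tbl$chromEnd),  # UCSC starts are 0-based
                 band=paste0(sub("chr", "", tbl$chrom), tbl$name))
pos <- GRanges("chr1", IRanges(c(1000000, 15000000), width=1))
hits <- findOverlaps(pos, bands)
bands$band[subjectHits(hits)]   # "1p36.33" "1p36.21"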

Valerie


On 03/22/2014 10:12 AM, Ilari Scheinin wrote:

Hi,

I would like to obtain the boundaries of cytogenetic bands for human (hg19) as 
I need to map arbitrary genomic positions to the band containing them. I 
figured these would be available via the Homo.sapiens annotation package, so I 
took a look at the available keytypes. MAP looked promising:


library(Homo.sapiens)
head(keys(Homo.sapiens, keytype="MAP"))

[1] "19q13.4"  "12p13.31" "8p22" "14q32.1"  "3q25.1"   “2q35”

However, upon a closer look, these don’t appear to be the actual bands themselves, 
but are instead the matching bands for some other level of data, as it contains 
entries such as "19q13-qter”. (And there are 2,446 of these entries whereas 
there are 862 bands.)

A bit of searching returned two software packages that do contain this 
information: idiogram (data(Hs.cytoband)) and OmicCircos (data(UCSC.hg19.chr)). 
The first one seems to be from genome build hg17, but the second one has hg18 
and hg19. However, using a software package instead of an annotation one to 
obtain this information seems wrong, and makes me worry if it will be kept 
up-to-date in the future (c.f. idiogram).

So, are the coordinates of the cytogenetic bands contained somewhere in the 
annotation packages?

Thanks,
Ilari

_______
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone:  (206) 667-3158
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] restrictToSNV for VCF

2014-03-19 Thread Valerie Obenchain

Thanks for the feedback.

I'll look into nchar for XStringSetList.

I'm in favor of supporting isDeletion(), isInsertion(), isIndel() and 
isSNV() for the VCF classes and removing restrictToSNV(). I could add an 
argument 'all_alt' or 'all_alt_agreement' to be used with CollapsedVCF 
in the case where not all alternate alleles meet the criteria.


Here are the current definitions:


isDeletion <- function(x) {
  nchar(alt(x)) == 1L & nchar(ref(x)) > 1L & substring(ref(x), 1, 1) == alt(x)
}

isInsertion <- function(x) {
  nchar(ref(x)) == 1L & nchar(alt(x)) > 1L & substring(alt(x), 1, 1) == ref(x)
}

isIndel <- function(x) {
  isDeletion(x) | isInsertion(x)
}

isSNV <- function(x) {
  nchar(alt(x)) == 1L & nchar(ref(x)) == 1L
}
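
For intuition, a quick check of these definitions on plain character
vectors (a simplified stand-in for the ref(x)/alt(x) DNAStringSet case):

ref <- c("G", "AA", "T")
alt <- c("A", "A",  "TG")
nchar(alt) == 1L & nchar(ref) == 1L                               # SNV:       TRUE FALSE FALSE
nchar(alt) == 1L & nchar(ref) > 1L & substring(ref, 1, 1) == alt  # deletion:  FALSE TRUE  FALSE
nchar(ref) == 1L & nchar(alt) > 1L & substring(alt, 1, 1) == ref  # insertion: FALSE FALSE TRUE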



Valerie


On 03/19/2014 01:07 PM, Vincent Carey wrote:




On Wed, Mar 19, 2014 at 4:00 PM, Michael Lawrence  wrote:

It would be nice to have functions like isSNV, isIndel, isDeletion,
etc that at least provide precise definitions of the terminology.
I've added these, but they're designed only for VRanges. Should work
for ExpandedVCF.

Also, it would be nice if restrictToSNV just assumed that alt(x)
must be something with nchar() support (with special handling for
any List), so that the 'character' vector of alt,VRanges would work
immediately. Basically restrictToSNV should just be x[isSNV(x)]. Is
there even a use-case for the restrictToSNV abstraction if we did that?


for VCF instance it would be x[isSNV(x),] and indeed I think that would
be sufficient.  i like the idea of having this family of predicates for
variant classes to allow such selections

Michael



On Tue, Mar 18, 2014 at 10:36 AM, Valerie Obenchain  wrote:

Hi,

I've added a restrictToSNV() function to VariantAnnotation
(1.9.46). The return value is a subset VCF object containing
SNVs only. The function operates on CollapsedVCF or ExapandedVCF
and the alt(VCF) value must be nucleotides (i.e., no structural
variants).

A variant is considered a SNV if the nucleotide sequences in
both ref(vcf) and alt(x) are of length 1. I have a question
about how variants with multiple 'ALT' values should be handled.

Should we consider row 4 a SNV? One 'ALT' is length 1, the other
is not.

ALT <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
REF <- DNAStringSet(c("G", c("AA"), "T", "G"))

DataFrame(REF, ALT)

DataFrame with 4 rows and 2 columns
   REF  ALT
1    G    A
2   AA   TT
3    T  G,A
4    G TT,C



Thanks.
Valerie

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel






--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone:  (206) 667-3158
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

2014-03-19 Thread Valerie Obenchain

On 03/19/14 10:24, Michael Love wrote:

hi Valerie,

If the Bam is not sorted by name, isn't it possible that readGAlignment*
will load > yieldSize number of reads in order to find the mate?


Sorry, our emails keep criss-crossing.

Because the mate-pairing is now done in C, yieldSize is no longer a 
constraint.


When yieldSize is given, say 100, then 100 mates are returned. The algo 
moves through the file until 100 records are successfully paired. These 
100 are yielded to the user and the code picks up where it left off. A 
related situation is the 'which' in a param. In this case you want the 
mates in a particular range. The algo moves through the range and pairs 
what it can. If it's left with unmated records it goes outside the range 
looking for them. readGAlignmentsList() will return all records in this 
range, mated or not. The metadata column of 'mate_status' indicates the 
different groupings.
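
A short sketch of inspecting 'mate_status' (uses the paired-end file
shipped with the pasillaBamSubset package):

library(GenomicAlignments)
library(pasillaBamSubset)
bf <- BamFile(untreated3_chr4(), asMates=TRUE)
galist <- readGAlignmentsList(bf)
table(mcols(galist)$mate_status)   # counts of mated / ambiguous / unmated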


Valerie




Mike


On Wed, Mar 19, 2014 at 1:04 PM, Valerie Obenchain  wrote:

Hi Mike,

You no longer need to sort Bam files to use the pairing algo or
yieldSize. The readGAlignment* functions now work with both
constraints out of the box.

Create a BamFile with yieldSize and indicate you want mates.
bf <- BamFile(fl, yieldSize=1, asMates=TRUE)

Maybe set some specifications in a param:
param <- ScanBamParam(what = c("qname", "flag"))

Then call either readGAlignment* method that handles pairs:
readGAlignmentsList(bf, param=param)
readGAlignmentPairs(bf, param=param)

For summarizeOverlaps():
summarizeOverlaps(annotation, bf, param=param, singleEnd=FALSE)

We've considered removing the 'obeyQname' arg and documentation but
thought the concept may be useful in another application. I'll
revisit the summarizeOverlaps() documentation to make sure
'obeyQname' is downplayed and 'asMates' is encouraged.

Valerie




On 03/19/14 07:39, Michael Love wrote:

hi,

 From last year, in order to use yieldSize with paired-end
BAMs, I

should sort the BAMs by qname and then use the following call to
BamFile:

library(pasillaBamSubset)
fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE)

https://stat.ethz.ch/pipermail/bioconductor/2013-March/051490.html

If I want to use GenomicAlignments::readGAlignmentsList with
asMates=TRUE and respecting the yieldSize, what is the proper
construction? (in the end, I want to use summarizeOverlaps on
paired-end BAMs while respecting the yieldSize)

library(pasillaBamSubset)
fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
bf <- BamFile(fl, index=character(0), yieldSize=3,
obeyQname=TRUE, asMates=TRUE)
x <- readGAlignmentsList(bf)
Warning message:
In scanBam(bamfile, ..., param = param) :
  'obeyQname=TRUE' ignored when 'asMates=TRUE'
Calls: readGAlignmentsList ... .matesFromBam ->
  .load_bamcols_from_bamfile -> scanBam -> scanBam

I see in the man pages for summarizeOverlaps it has:

"In Bioconductor > 2.12 it is not
necessary to sort paired-end BAM files by ‘qname’. When
counting with ‘summarizeOverlaps’, setting ‘singleEnd=FALSE’
will trigger paired-end reading and counting."

but I don't see how this can respect the specified yieldSize,
because
readGAlignmentsList has to read in as many reads as necessary to
find
the mate.

Sorry in advance if I am missing something in the documentation!

Mike

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel





___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

2014-03-19 Thread Valerie Obenchain

Hi Mike,

You no longer need to sort Bam files to use the pairing algo or 
yieldSize. The readGAlignment* functions now work with both constraints 
out of the box.


Create a BamFile with yieldSize and indicate you want mates.
bf <- BamFile(fl, yieldSize=1, asMates=TRUE)

Maybe set some specifications in a param:
param <- ScanBamParam(what = c("qname", "flag"))

Then call either readGAlignment* method that handles pairs:
readGAlignmentsList(bf, param=param)
readGAlignmentPairs(bf, param=param)

For summarizeOverlaps():
summarizeOverlaps(annotation, bf, param=param, singleEnd=FALSE)

We've considered removing the 'obeyQname' arg and documentation but 
thought the concept may be useful in another application. I'll revisit 
the summarizeOverlaps() documentation to make sure 'obeyQname' is 
downplayed and 'asMates' is encouraged.


Valerie



On 03/19/14 07:39, Michael Love wrote:

hi,


From last year, in order to use yieldSize with paired-end BAMs, I

should sort the BAMs by qname and then use the following call to
BamFile:

library(pasillaBamSubset)
fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE)

https://stat.ethz.ch/pipermail/bioconductor/2013-March/051490.html

If I want to use GenomicAlignments::readGAlignmentsList with
asMates=TRUE and respecting the yieldSize, what is the proper
construction? (in the end, I want to use summarizeOverlaps on
paired-end BAMs while respecting the yieldSize)

library(pasillaBamSubset)
fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE, asMates=TRUE)
x <- readGAlignmentsList(bf)
Warning message:
In scanBam(bamfile, ..., param = param) :
  'obeyQname=TRUE' ignored when 'asMates=TRUE'
Calls: readGAlignmentsList ... .matesFromBam ->
  .load_bamcols_from_bamfile -> scanBam -> scanBam

I see in the man pages for summarizeOverlaps it has:

"In Bioconductor > 2.12 it is not
necessary to sort paired-end BAM files by ‘qname’. When
counting with ‘summarizeOverlaps’, setting ‘singleEnd=FALSE’
will trigger paired-end reading and counting."

but I don't see how this can respect the specified yieldSize, because
readGAlignmentsList has to read in as many reads as necessary to find
the mate.

Sorry in advance if I am missing something in the documentation!

Mike

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] GenomicAlignments: using asMates=TRUE and yieldSize with paired-end BAM files

2014-03-19 Thread Valerie Obenchain

On 03/19/14 07:39, Michael Love wrote:

hi,


From last year, in order to use yieldSize with paired-end BAMs, I

should sort the BAMs by qname and then use the following call to
BamFile:

library(pasillaBamSubset)
fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE)

https://stat.ethz.ch/pipermail/bioconductor/2013-March/051490.html

If I want to use GenomicAlignments::readGAlignmentsList with
asMates=TRUE and respecting the yieldSize, what is the proper
construction? (in the end, I want to use summarizeOverlaps on
paired-end BAMs while respecting the yieldSize)

library(pasillaBamSubset)
fl <- sortBam(untreated3_chr4(), tempfile(), byQname=TRUE)
bf <- BamFile(fl, index=character(0), yieldSize=3, obeyQname=TRUE, asMates=TRUE)
x <- readGAlignmentsList(bf)
Warning message:
In scanBam(bamfile, ..., param = param) :
  'obeyQname=TRUE' ignored when 'asMates=TRUE'
Calls: readGAlignmentsList ... .matesFromBam ->
  .load_bamcols_from_bamfile -> scanBam -> scanBam

I see in the man pages for summarizeOverlaps it has:

"In Bioconductor > 2.12 it is not
necessary to sort paired-end BAM files by ‘qname’. When
counting with ‘summarizeOverlaps’, setting ‘singleEnd=FALSE’
will trigger paired-end reading and counting."

but I don't see how this can respect the specified yieldSize, because
readGAlignmentsList has to read in as many reads as necessary to find
the mate.



I didn't specifically answer this in my last email. The reason yieldSize 
is no longer a problem is that we've rewritten the mate pairing 
algorithm in C so pairing is no longer done in R. Both 
readGAlignmentsList() and readGAlignmentPairs() use the same code. The 
*List container is more flexible in what it can hold (singletons, reads 
with unmapped mates etc.) and the *Pairs container holds mated-pairs in 
a left-right organization.


Valerie



Sorry in advance if I am missing something in the documentation!

Mike

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] restrictToSNV for VCF

2014-03-18 Thread Valerie Obenchain

Hi,

I've added a restrictToSNV() function to VariantAnnotation (1.9.46). The 
return value is a subset VCF object containing SNVs only. The function 
operates on CollapsedVCF or ExpandedVCF and the alt(VCF) value must be 
nucleotides (i.e., no structural variants).


A variant is considered a SNV if the nucleotide sequences in both 
ref(x) and alt(x) are of length 1. I have a question about how 
variants with multiple 'ALT' values should be handled.


Should we consider row 4 a SNV? One 'ALT' is length 1, the other is not.

ALT <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
REF <- DNAStringSet(c("G", c("AA"), "T", "G"))

DataFrame(REF, ALT)

DataFrame with 4 rows and 2 columns
             REF                ALT
  <DNAStringSet> <DNAStringSetList>
1              G                  A
2             AA                 TT
3              T                G,A
4              G               TT,C
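
One hedged way to handle row 4 is to flatten multi-ALT rows to one row per
allele and test each allele on its own (a sketch; the variable names are
illustrative):

library(Biostrings)

ALT <- DNAStringSetList("A", "TT", c("G", "A"), c("TT", "C"))
REF <- DNAStringSet(c("G", "AA", "T", "G"))

row     <- rep(seq_along(REF), elementNROWS(ALT))  # originating row index
flatALT <- unlist(ALT)
snv     <- width(REF)[row] == 1L & width(flatALT) == 1L
split(snv, row)   # row 4 gives FALSE ("TT") and TRUE ("C")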



Thanks.
Valerie

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] concat problem with CharacterList in mcols of GRanges

2014-03-17 Thread Valerie Obenchain
This was my oversight. I didn't think of using a BStringSet for the 
structural variants.


Taking a quick look at how this might work.

> fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> alt(vcf)
CharacterList of length 7
[[1]] <DEL>
[[2]] C
[[3]] <DEL>
[[4]] <DUP:TANDEM>
[[5]] <INS:ME:L1>
[[6]] <DEL>
[[7]] <DUP:TANDEM>

Looks like we need to add a List constructor:

> BStringSetList(alt(vcf))
Error: could not find function "BStringSetList"

But once we do, the non-nucleotide characters are handled nicely:

> BStringSet(unlist(alt(vcf)))
  A BStringSet instance of length 7
    width seq
[1]     5 <DEL>
[2]     1 C
[3]     5 <DEL>
[4]    12 <DUP:TANDEM>
[5]    11 <INS:ME:L1>
[6]     5 <DEL>
[7]    12 <DUP:TANDEM>

With a List constructor and the ability to combine BStringSet with other 
XStringSet objects we would be set. Good suggestion. Thanks.
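
Until the constructor exists, relist() may offer a stopgap (a sketch,
assuming relist() support for XStringSet objects with a List skeleton):

## rebuild alt(vcf)'s list structure on top of the flat BStringSet,
## approximating the missing BStringSetList(alt(vcf))
alt_b <- relist(BStringSet(unlist(alt(vcf))), alt(vcf))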


Val


On 03/17/2014 12:47 PM, Hervé Pagès wrote:

Hi Vince,

On 03/16/2014 06:11 PM, Vincent Carey wrote:

It seems that there is diversity in the classes assigned for ALT in
results
of readVcf, and there was some discussion of this in 1/2013.


Was this discussion on the mailing list? I can't find it.

If using diverse/unpredictable classes for ALT cannot be avoided, have
you considered using a BStringSetList instead of a CharacterList when
the variants are "structural"?

There is a big divide between DNAStringSetList and CharacterList in
terms of internal representation. But not so much between
DNAStringSetList and BStringSetList. So using BStringSetList instead
of CharacterList would help smooth out the kind of issues you're
facing here. In particular, even though combining DNAStringSetList
and BStringSetList objects doesn't work right now, that's something
we should definitely support (it would be easy to add).

Cheers,
H.


 So it looks
like this is predictable and solvable with some upstream work after the
read.


On Sun, Mar 16, 2014 at 7:43 PM, Vincent Carey
wrote:


c(x[[1]][1:3,1:2], x[[3]][1:3,1:2])  # this works

GRanges with 6 ranges and 2 metadata columns:
      seqnames           ranges strand |    paramRangeID            REF
         <Rle>        <IRanges>  <Rle> |        <factor> <DNAStringSet>
  [1]        1 [ 10583,  10583]      * |  dhs_chr1_10402              G
  [2]        1 [ 10611,  10611]      * |  dhs_chr1_10402              C
  [3]        1 [ 10583,  10583]      * |  dhs_chr1_10502              G
  [4]        1 [832178, 832178]      * | dhs_chr1_833139              A
  [5]        1 [832266, 832266]      * | dhs_chr1_833139              G
  [6]        1 [832297, 832299]      * | dhs_chr1_833139            CTG
  ---
  seqlengths:
    1
   NA

x[[1]][1:3,1:3]

GRanges with 3 ranges and 3 metadata columns:
      seqnames         ranges strand |   paramRangeID            REF
         <Rle>      <IRanges>  <Rle> |       <factor> <DNAStringSet>
  [1]        1 [10583, 10583]      * | dhs_chr1_10402              G
  [2]        1 [10611, 10611]      * | dhs_chr1_10402              C
  [3]        1 [10583, 10583]      * | dhs_chr1_10502              G
                 ALT
     <CharacterList>
  [1]              A
  [2]              G
  [3]              A
  ---
  seqlengths:
    1
   NA

c(x[[1]][1:3,1:3], x[[3]][1:3,1:3])  # if i try to concatenate while ALT is included
Error in .Primitive("c")(<S4 object>, <S4 object>) :
  all arguments in '...' must have an element class that extends that of
  the first argument

Enter a frame number, or 0 to exit

  1: c(x[[1]][1:3, 1:3], x[[3]][1:3, 1:3])
  2: c(x[[1]][1:3, 1:3], x[[3]][1:3, 1:3])
  3: .local(x, ..., recursive = recursive)
  4: .unlist_list_of_GenomicRanges(args, ignore.mcols = ignore.mcols)
  5: do.call(rbind, lapply(x, mcols, FALSE))
  6: do.call(rbind, lapply(x, mcols, FALSE))
  7: (function (..., deparse.level = 1) standardGeneric("rbind"))(<S4 object>, <S4 object>, ...

> sessionInfo()

R Under development (unstable) (2014-03-15 r65199)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats graphics  grDevices datasets  utils tools
[8] methods   base

other attached packages:
 [1] Biostrings_2.31.14    XVector_0.3.7         GenomicRanges_1.15.39
 [4] GenomeInfoDb_0.99.19  IRanges_1.21.34       BiocGenerics_0.9.3
 [7] BatchJobs_1.2         BBmisc_1.5            weaver_1.29.1
[10] codetools_0.2-8       digest_0.6.4          BiocInstaller_1.13.3

loaded via a namespace (and not attached):
[1] DBI_0.2-7   RSQLite_0.11.4  Rcpp_0.11.1 brew_1.0-6
[5] fail_1.2plyr_1.8.1  sendmailR_1.1-2 stats4_3.2.0
[9] stringr_0.6.2






[[alternative HTML version deleted]]

_______
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel






--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone:  (206) 667-3158
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Unreproducible build check warning

2014-03-17 Thread Valerie Obenchain

Hi Antti,

It's looking for

\alias{[,scoreList,ANY-method}


The generic '[' can dispatch on arguments 'x', 'i' and 'j'.


getGeneric("[")

standardGeneric for "[" defined from package "base"

function (x, i, j, ..., drop = TRUE)
standardGeneric("[", .Primitive("["))


Methods may be defined for arguments: x, i, j, drop
Use  showMethods("[")  for currently available ones.


The method you wrote for scoreList dispatches on 'x' as a scoreList 
object but doesn't specify 'i' or 'j'. One of these indices must be 
present in order for subsetting to happen. In this case (I believe) the 
default is assuming 'i' as ANY and 'j' as missing.


For example, with the VCF class I've specified the method for ANY, ANY:

setMethod("[", c("VCF", "ANY", "ANY"),
function(x, i, j, ..., drop=TRUE)
{
...

To see more examples,

showMethods('[')
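
For scoreList itself, a hedged sketch of an explicit signature that matches
the siglist in the warning (the slot names below are hypothetical, not
tigre's actual scoreList layout):

setMethod("[", c("scoreList", "ANY"),
    function(x, i, j, ..., drop = TRUE) {
        ## subset hypothetical per-model slots by 'i'; 'j' is unused
        initialize(x, params = x@params[i], genes = x@genes[i])
    })

## documented with:  \alias{[,scoreList,ANY-method}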


Valerie

On 03/17/2014 08:13 AM, Antti Honkela wrote:

Hi all,

The latest build check report shows one warning for 'tigre':
---
* checking for missing documentation entries ... WARNING
Undocumented S4 methods:
   generic '[' and siglist 'scoreList,ANY'
---

As far as I can tell the method in question should be documented, as one
of the .Rd files contains an alias:
\alias{[,scoreList-method}

Furthermore I cannot reproduce it on my own using the latest R-alpha (or
R-devel from last week): R CMD check on the source tar-ball downloaded
directly from Bioconductor runs cleanly.

Can someone please help in figuring out what is going on?


Antti




--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone:  (206) 667-3158
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] concat problem with CharacterList in mcols of GRanges

2014-03-17 Thread Valerie Obenchain

Hi Vince,

ALT is a CharacterList when the variants are 'structural', i.e., when 
the ALT field has non-nucleotide characters. Otherwise, ALT is a 
DNAStringSetList.


It looks like ALT is a CharacterList in x[[1]] and is probably a 
DNAStringSetList in x[[3]]. So you are trying to combine GRanges with 
these two different types in ALT:


gr1 <- GRanges("1", IRanges(1:3, width=1),
               ALT = CharacterList(list("A", "G", "A")))

gr2 <- GRanges("1", IRanges(5:7, width=1),
               ALT = DNAStringSetList(list("T", "A", "C")))


c(gr1, gr2)
Error in .Primitive("c")(<S4 object>, <S4 object>) :
  all arguments in '...' must have an element class that extends that of the
  first argument


The ALT values in the subset of x[[1]] are all nucleotides but not all 
ALT values in x[[1]] are which is why readVcf() made it a CharacterList. 
To combine these GRanges you need to either coerce all of x[[3]] ALT to 
CharacterList or coerce the subset of x[[1]] to DNAStringSetList.


Coerce the subset of x[[1]] to DNAStringSetList:


gr1$ALT <- DNAStringSetList(gr1$ALT)
gr1

GRanges with 3 ranges and 1 metadata column:
      seqnames    ranges strand |                ALT
         <Rle> <IRanges>  <Rle> | <DNAStringSetList>
  [1]        1    [1, 1]      * |                  A
  [2]        1    [2, 2]      * |                  G
  [3]        1    [3, 3]      * |                  A
  ---


Combine:


c(gr1, gr2)

GRanges with 6 ranges and 1 metadata column:
      seqnames    ranges strand |                ALT
         <Rle> <IRanges>  <Rle> | <DNAStringSetList>
  [1]        1    [1, 1]      * |                  A
  [2]        1    [2, 2]      * |                  G
  [3]        1    [3, 3]      * |                  A
  [4]        1    [5, 5]      * |                  T
  [5]        1    [6, 6]      * |                  A
  [6]        1    [7, 7]      * |                  C
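
The opposite coercion also unblocks c() (a hedged sketch): downgrade gr2's
ALT to a CharacterList so it matches an unmodified gr1.

gr2$ALT <- CharacterList(lapply(gr2$ALT, as.character))
## c(gr1, gr2) then combines, with ALT staying a CharacterList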



Val


On 03/16/2014 06:11 PM, Vincent Carey wrote:

It seems that there is diversity in the classes assigned for ALT in results
of readVcf, and there was some discussion of this in 1/2013.  So it looks
like this is predictable and solvable with some upstream work after the
read.


On Sun, Mar 16, 2014 at 7:43 PM, Vincent Carey
wrote:


c(x[[1]][1:3,1:2], x[[3]][1:3,1:2])  # this works

GRanges with 6 ranges and 2 metadata columns:
      seqnames           ranges strand |    paramRangeID            REF
         <Rle>        <IRanges>  <Rle> |        <factor> <DNAStringSet>
  [1]        1 [ 10583,  10583]      * |  dhs_chr1_10402              G
  [2]        1 [ 10611,  10611]      * |  dhs_chr1_10402              C
  [3]        1 [ 10583,  10583]      * |  dhs_chr1_10502              G
  [4]        1 [832178, 832178]      * | dhs_chr1_833139              A
  [5]        1 [832266, 832266]      * | dhs_chr1_833139              G
  [6]        1 [832297, 832299]      * | dhs_chr1_833139            CTG
  ---
  seqlengths:
    1
   NA

x[[1]][1:3,1:3]

GRanges with 3 ranges and 3 metadata columns:
      seqnames         ranges strand |   paramRangeID            REF
         <Rle>      <IRanges>  <Rle> |       <factor> <DNAStringSet>
  [1]        1 [10583, 10583]      * | dhs_chr1_10402              G
  [2]        1 [10611, 10611]      * | dhs_chr1_10402              C
  [3]        1 [10583, 10583]      * | dhs_chr1_10502              G
                 ALT
     <CharacterList>
  [1]              A
  [2]              G
  [3]              A
  ---
  seqlengths:
    1
   NA

c(x[[1]][1:3,1:3], x[[3]][1:3,1:3])  # if i try to concatenate while ALT is included
Error in .Primitive("c")(<S4 object>, <S4 object>) :
  all arguments in '...' must have an element class that extends that of
  the first argument

Enter a frame number, or 0 to exit

  1: c(x[[1]][1:3, 1:3], x[[3]][1:3, 1:3])
  2: c(x[[1]][1:3, 1:3], x[[3]][1:3, 1:3])
  3: .local(x, ..., recursive = recursive)
  4: .unlist_list_of_GenomicRanges(args, ignore.mcols = ignore.mcols)
  5: do.call(rbind, lapply(x, mcols, FALSE))
  6: do.call(rbind, lapply(x, mcols, FALSE))
  7: (function (..., deparse.level = 1) standardGeneric("rbind"))(<S4 object>, <S4 object>, ...

> sessionInfo()

R Under development (unstable) (2014-03-15 r65199)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] C

attached base packages:
[1] parallel  stats graphics  grDevices datasets  utils tools
[8] methods   base

other attached packages:
 [1] Biostrings_2.31.14    XVector_0.3.7         GenomicRanges_1.15.39
 [4] GenomeInfoDb_0.99.19  IRanges_1.21.34       BiocGenerics_0.9.3
 [7] BatchJobs_1.2         BBmisc_1.5            weaver_1.29.1
[10] codetools_0.2-8       digest_0.6.4          BiocInstaller_1.13.3

loaded via a namespace (and not attached):
[1] DBI_0.2-7   RSQLite_0.11.4  Rcpp_0.11.1 brew_1.0-6
[5] fail_1.2plyr_1.8.1  sendmailR_1.1-2 stats4_3.2.0
[9] stringr_0.6.2






[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] file registry - feedback

2014-03-11 Thread Valerie Obenchain

Hi,

On 03/11/14 15:33, Hervé Pagès wrote:

On 03/11/2014 02:52 PM, Hervé Pagès wrote:

On 03/11/2014 09:57 AM, Valerie Obenchain wrote:

Hi Herve,

On 03/10/2014 10:31 PM, Hervé Pagès wrote:

Hi Val,

I think it would help understand the motivations behind this proposal
if you could give an example of a method where the user cannot supply
a file name but has to create a 'File' (or 'FileList') object first.
And how the file registry proposal below would help.
It looks like you have such an example in the GenomicFileViews package.
Do you think you could give more details?


The most recent motivating use case was in creating subclasses of
GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted
to have a general constructor, something like GenomicFileViews(), that
would create the appropriate subclass. However to create the correct
subclass we needed to know if the files were bam, bw, fasta etc.
Recognition of the file type by extension would allow us to do this with
no further input from the user.


That helps, thanks!

Having this kind of general constructor sounds like it could indeed be
useful. Would be an opportunity to put all these *File classes (the 22
RTLFile subclasses defined in rtracklayer and the 5 RsamtoolsFile
subclasses defined in Rsamtools) under the same umbrella (i.e. a parent
virtual class) and use the name of this virtual class (e.g. File) for
the general constructor.

Allowing a registration mechanism to extend the knowledge of this File()
constructor is an implementation detail. I don't see a lot of benefit to
it. Only a package that implements a concrete File subclass would
actually need to register the new subclass. Sounds easy enough to ask
to whoever has commit access to the File() code to modify it. This
kind of update might also require adding the name of the package where
the new File subclass is implemented to the Depends/Imports/Suggests
of the package where File() lives, which is something that cannot be
done via a registration mechanism.


This clean-up of the *File jungle would also be a good opportunity to:

   - Choose what we want to do with reference classes: use them for all
 the *File classes or for none of them. (Right now, those defined
 in Rsamtools are reference classes, and those defined in
 rtracklayer are not.)

   - Move the I/O functionality currently in rtracklayer to a
 separate package. Based on the number of contributed packages I
 reviewed so far that were trying to reinvent the wheel because
 they had no idea that the I/O function they needed was actually
 in rtracklayer, I'd like to advocate for using a package name
 that makes it very clear that it's all about I/O.


Thanks for the suggestions. This re-org sounds good to me. As you say, 
unifying the *File classes in a single package would make them more 
visible to other developers and enforce consistent behavior.


If you aren't in favor of a registration mechanism for 'discovery', how 
should a function with methods for many *File classes (e.g., import()) 
handle a character file name? import() uses FileForFormat() to discover 
the file type, makes the *File class and dispatches to the appropriate 
*File method. The registry was an attempt at generalizing this concept.


What do you think about the use of a registry for Vince's idea of 
holding a digest/path reference to large data but not installing it 
until it's used? Other ideas of how / where this could be stored?


Val




H.




H.




Val



Thanks,
H.


On 03/10/2014 08:46 PM, Valerie Obenchain wrote:

Hi all,

I'm soliciting feedback on the idea of a general file 'registry' that
would identify file types by their extensions. This is similar in spirit
to FileForFormat() in rtracklayer but a more general abstraction that
could be used across packages. The goal is to allow a user to supply
only file name(s) to a method instead of first creating a 'File' class
such as BamFile, FaFile, BigWigFile etc.

A first attempt at this is in the GenomicFileViews package
(https://github.com/Bioconductor/GenomicFileViews). A registry
(lookup)
is created as an environment at load time:

.fileTypeRegistry <- new.env(parent=emptyenv())

Files are registered with an information triplet consisting of class,
package and regular expression to identify the extension. In
GenomicFileViews we register FaFileList, BamFileList and
BigWigFileList
but any 'File' class can be registered that has a constructor of the
same name.

.onLoad <- function(libname, pkgname)
{
 registerFileType("FaFileList", "Rsamtools", "\\.fa$")
 registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
 registerFileType("BamFileList", "Rsamtools", "\\.bam$")
 registerFileType("BigWigFileList", "rtracklayer", "\\.bw$

Re: [Bioc-devel] file registry - feedback

2014-03-11 Thread Valerie Obenchain

Hi,

On 03/11/2014 09:47 AM, Michael Lawrence wrote:

Except for the checksum, the existing File classes should support this,
where the package provides a dataset via data() that is just the serialized
File object (path). One could create a FileWithChecksum class that
decorates a File object with a checksum. Any attempts to read the file are
intercepted by the decorator, which verifies the checksum, and then
delegates.


Neat. Sounds like this is worth pursuing.
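
A hedged sketch of that decorator idea; 'FileWithChecksum' and
'verifiedPath' are hypothetical names, with tools::md5sum() standing in for
whatever digest gets chosen:

setClass("FileWithChecksum",
         representation(path = "character", checksum = "character"))

FileWithChecksum <- function(path)
    new("FileWithChecksum", path = path,
        checksum = unname(tools::md5sum(path)))

verifiedPath <- function(x) {
    ## recompute and compare before delegating any read to the path
    if (!identical(unname(tools::md5sum(x@path)), x@checksum))
        stop("checksum mismatch for ", x@path)
    x@path
}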



Michael


On Tue, Mar 11, 2014 at 8:53 AM, Vincent Carey
wrote:


I'm going to suggest a use case that may motivate this type of development.

Up to 2010 or so, data packages generally made sense.  You have about
100-500MB of serialized or pre-serialized stuff.  Installing it in an R
package is unpleasant from a resource consumption perspective but it works,
you can use data/extdata and work with data with programmatic access,
documentation and checkability.

More recently, it is easy to come across data resources that we'd like to
have package-like control over/access to, but installing such packages
makes no sense.  The volume is too big, and you want to work with the
resource with non-R tools as well from time to time.  You don't want to
move the data.

We should have a protocol for "packaging" data without installing it.  A
digest of the raw data resource should be computed and kept in the
registry.  A registered file can be part of a package that can be checked
and installed, but the data themselves do not move.  Genomic data in S3
buckets should provide a basic use case.

The digest is recomputed whenever we want to start working with the
registry/package to verify that we are working with the intended artifact.


On Tue, Mar 11, 2014 at 11:11 AM, Gabriel Becker wrote:


Would it be better to let the user (registerer) specify a function, which
could be a simple class constructor or something more complex in cases
where that would be useful?


Yes, good suggestion.



This could allow the concept to generalize to other things, such as
databases that might need some startup machinery called before they are
actually useful to the user.



The intent of the registry was to provide a way to lookup files by their 
extension. I'm not sure how this applies to the database example. Do you 
envision creating multiple databases throughout an R session (vs a 
single set up at load time)? For example if the file has type 'X' 
extension it becomes a type 'X' database etc.?




This would also deal with Michael's point about package/where since
functions have their own "where" information. Unless I'm missing some
other
intent for specifying a specific package?

~G


On Tue, Mar 11, 2014 at 5:59 AM, Michael Lawrence <
lawrence.mich...@gene.com

wrote:



rtracklayer essentially has this, although registration is implicit through
extension of RTLFile or RsamtoolsFile, and the extension is taken from the
class name. There is a BigWigFile, corresponding to ".bigwig", and that is
extended by BWFile to support the ".bw" extension. The expectation is that
other packages would extend RTLFile to implicitly register handlers. I'm
not sure there is a use case for generalization, but this proposal makes
registration more explicit, which is probably a good thing. rtracklayer was
just piggy backing on S4 registration.

I'm a little bit confused by the use of Lists rather than individual File
objects. Are you also proposing that all RTLFiles would need a
corresponding List, and that there would need to be an RTLFileList method
for the various generics?


No, I don't want to force the 'List' route. I was using them in 
GenomicFileViews so that's what I registered. The 'class' should be any 
class that has a constructor of the same name. Thinking about this more 
the 'class' probably should be the individual File object instead of the 
List object. Coercion to List can be done inside the helper.




It may not be necessary to specify the package name. There should be an
environment (where) argument that defaults to topenv(parent.frame()),

and

that should suffice.


I'll look into this.


Any comments on whether this should be it's own package or in an 
existing one?



Thanks for the input.
Valerie




Michael


On Mon, Mar 10, 2014 at 8:46 PM, Valerie Obenchain 
wrote:



Hi all,

I'm soliciting feedback on the idea of a general file 'registry' that
would identify file types by their extensions. This is similar in spirit
to FileForFormat() in rtracklayer but a more general abstraction that
could be used across packages. The goal is to allow a user to supply only
file name(s) to a method instead of first creating a 'File' class such as
BamFile, FaFile, BigWigFile etc.

A first attempt at this is in the GenomicFileViews package
(https://github.com/Bioconductor/GenomicFileViews).

Re: [Bioc-devel] file registry - feedback

2014-03-11 Thread Valerie Obenchain

Hi Herve,

On 03/10/2014 10:31 PM, Hervé Pagès wrote:

Hi Val,

I think it would help understand the motivations behind this proposal
if you could give an example of a method where the user cannot supply
a file name but has to create a 'File' (or 'FileList') object first.
And how the file registry proposal below would help.
It looks like you have such an example in the GenomicFileViews package.
Do you think you could give more details?


The most recent motivating use case was in creating subclasses of 
GenomicFileViews objects (BamFileViews, BigWigFileViews, etc.) We wanted 
to have a general constructor, something like GenomicFileViews(), that 
would create the appropriate subclass. However to create the correct 
subclass we needed to know if the files were bam, bw, fasta etc. 
Recognition of the file type by extension would allow us to do this with 
no further input from the user.


Val



Thanks,
H.


On 03/10/2014 08:46 PM, Valerie Obenchain wrote:

Hi all,

I'm soliciting feedback on the idea of a general file 'registry' that
would identify file types by their extensions. This is similar in spirit
to FileForFormat() in rtracklayer but a more general abstraction that
could be used across packages. The goal is to allow a user to supply
only file name(s) to a method instead of first creating a 'File' class
such as BamFile, FaFile, BigWigFile etc.

A first attempt at this is in the GenomicFileViews package
(https://github.com/Bioconductor/GenomicFileViews). A registry (lookup)
is created as an environment at load time:

.fileTypeRegistry <- new.env(parent=emptyenv())

Files are registered with an information triplet consisting of class,
package and regular expression to identify the extension. In
GenomicFileViews we register FaFileList, BamFileList and BigWigFileList
but any 'File' class can be registered that has a constructor of the
same name.

.onLoad <- function(libname, pkgname)
{
 registerFileType("FaFileList", "Rsamtools", "\\.fa$")
 registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
 registerFileType("BamFileList", "Rsamtools", "\\.bam$")
 registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
}

The makeFileType() helper creates the appropriate class. This function
is used behind the scenes to do the lookup and coerce to the correct
'File' class.

 > makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2
names(2): foo.bam bar.bam

New types can be added at any time with registerFileType():

registerFileType(NewClass, NewPackage, "\\.NewExtension$")


Thoughts:

(1) If this sounds generally useful where should it live? rtracklayer,
GenomicFileViews or other? Alternatively it could be its own lightweight
package (FileRegister) that creates the registry and provides the
helpers. It would be up to the package authors that depend on
FileRegister to register their own files types at load time.

(2) To avoid potential ambiguities maybe searching should be by regex
and package name. Still a work in progress.


Valerie

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel





--
Valerie Obenchain
Program in Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone:  (206) 667-3158
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] file registry - feedback

2014-03-10 Thread Valerie Obenchain

Hi all,

I'm soliciting feedback on the idea of a general file 'registry' that 
would identify file types by their extensions. This is similar in spirit 
to FileForFormat() in rtracklayer but a more general abstraction that 
could be used across packages. The goal is to allow a user to supply 
only file name(s) to a method instead of first creating a 'File' class 
such as BamFile, FaFile, BigWigFile etc.


A first attempt at this is in the GenomicFileViews package 
(https://github.com/Bioconductor/GenomicFileViews). A registry (lookup) 
is created as an environment at load time:


.fileTypeRegistry <- new.env(parent=emptyenv())

Files are registered with an information triplet consisting of class, 
package and regular expression to identify the extension. In 
GenomicFileViews we register FaFileList, BamFileList and BigWigFileList 
but any 'File' class can be registered that has a constructor of the 
same name.


.onLoad <- function(libname, pkgname)
{
registerFileType("FaFileList", "Rsamtools", "\\.fa$")
registerFileType("FaFileList", "Rsamtools", "\\.fasta$")
registerFileType("BamFileList", "Rsamtools", "\\.bam$")
registerFileType("BigWigFileList", "rtracklayer", "\\.bw$")
}

The makeFileType() helper creates the appropriate class. This function 
is used behind the scenes to do the lookup and coerce to the correct 
'File' class.


> makeFileType(c("foo.bam", "bar.bam"))
BamFileList of length 2
names(2): foo.bam bar.bam

New types can be added at any time with registerFileType():

registerFileType(NewClass, NewPackage, "\\.NewExtension$")
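
For concreteness, a hedged sketch of how the registry internals might look
(the bodies are illustrative, not the actual GenomicFileViews code):

.fileTypeRegistry <- new.env(parent = emptyenv())

registerFileType <- function(class, package, regex)
    assign(regex, list(class = class, package = package),
           envir = .fileTypeRegistry)

makeFileType <- function(files) {
    for (regex in ls(.fileTypeRegistry)) {
        if (all(grepl(regex, files))) {
            entry <- .fileTypeRegistry[[regex]]
            ## look up the constructor by name in the registered package
            ctor <- getExportedValue(entry$package, entry$class)
            return(ctor(files))
        }
    }
    stop("no registered file type matches: ", paste(files, collapse = ", "))
}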


Thoughts:

(1) If this sounds generally useful where should it live? rtracklayer, 
GenomicFileViews or other? Alternatively it could be its own lightweight 
package (FileRegister) that creates the registry and provides the 
helpers. It would be up to the package authors that depend on 
FileRegister to register their own files types at load time.


(2) To avoid potential ambiguities maybe searching should be by regex 
and package name. Still a work in progress.



Valerie

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] request: high-level seqlevel utilities

2013-12-27 Thread Valerie Obenchain
d be stored with the Seqinfo. It could be imputed
(along with the isCircular I think) via th


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[[alternative HTML version deleted]]

_______
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




--
Valerie Obenchain

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: voben...@fhcrc.org
Phone:  (206) 667-3158
Fax:(206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

