On 09/30/2011 07:48 AM, Ivan Gregoretti wrote:
Following Janet's example, I would also like to propose an upgrade to
ScanBamParam:

It would be great if we could tell ScanBamPram that we want to load
only the reads that passed the vendor's quality filter.

In other words, the functionality I am suggesting is analogous to the
filter in readAligned() from the ShortRead library.

Hi Ivan --

in principle, flag=scanBamFlag(isValidVendorRead=TRUE) will do this; it requires that the flag is set in the BAM file.



With the new release of Illumina sequencing reagents (version 3) you
get 200 million reads per lane from the HiSeq 2000. In my view, with
samples that big becoming popular, any investment in "read in"
efficiency is a good investment. I would be happy to provide a sample
BAM for those interested in addressing this suggestion.

It is also my humble opinion that we should start considering
parallelisation for reading in. I hope that I am not just wishing too
much.

I'm actually revising the I/O a little at the moment; I'll implement a better strategy for reading in the data. When I look at 'top' on my system, I see the CPU running at say 50% which implies disk input is the bottleneck; probably this is on our system administration end, where the large storage required for BAM files doesn't have completely adequate performance. This I/O is tricky to guage, because the next time through the BAM file input _is_ CPU limited and much faster -- the disk system has done some clever buffering. But in real use cases I wouldn't see the benefit of that buffering since I wouldn't be revisiting the file.

In terms of parallel throughput it might often be appropriate to parallelize at a higher level, e.g., iterating over regions of interest (e.g., GRanges defining chromosomes, with the iteration via lapply) and a function FUN tasked with input of data in a subset of the GRanges followed by processing (e.g., counting overlaps in an RNASeq experiment) that typically leads to a large reduction in data size. To parallelize, replace lapply with mclapply (currently in the multicore package, but in devel in the 'parallel' package distributed with base R). Use BamFile and BamFileList to avoid re-loading the index on each file access. I'm not sure that just inputting large amounts of data in parallel and then pasting together to operate on as one large object is a real win -- the data is big, and the processing is on a single processor (unless it is split again in mclapply...).

I'm open to discussion on this...

Martin



Thank you,

Ivan

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to