Re: [Bioc-sig-seq] Another ScanBamParam suggestion

Martin Morgan Fri, 30 Sep 2011 09:11:41 -0700

On 09/30/2011 07:48 AM, Ivan Gregoretti wrote:

Following Janet's example, I would also like to propose an upgrade to
ScanBamParam:


It would be great if we could tell ScanBamParam that we want to load
only the reads that passed the vendor's quality filter.

In other words, the functionality I am suggesting is analogous to the
filter in readAligned() from the ShortRead library.


Hi Ivan --

In principle, flag=scanBamFlag(isValidVendorRead=TRUE) will do this; it
requires that the flag is set in the BAM file.
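A minimal sketch of that flag filter, assuming the small example BAM that ships with Rsamtools stands in for a real file. The argument name isValidVendorRead is the one given above; the fallback to isNotPassingQualityControls (with the sense inverted) is an assumption about later Rsamtools releases and should be checked against the installed version.

```r
library(Rsamtools)

# Example BAM shipped with Rsamtools; substitute your own file.
bamPath <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Pick whichever spelling of the vendor-QC flag this Rsamtools offers.
# Assumption: newer releases call it isNotPassingQualityControls,
# so "passed the filter" becomes FALSE rather than TRUE.
flagArgs <- names(formals(scanBamFlag))
flag <- if ("isNotPassingQualityControls" %in% flagArgs) {
    scanBamFlag(isNotPassingQualityControls = FALSE)
} else {
    scanBamFlag(isValidVendorRead = TRUE)
}

# Load only the fields needed, and only reads that pass the filter.
param <- ScanBamParam(flag = flag, what = c("qname", "flag"))
res <- scanBam(bamPath, param = param)[[1]]

# Bit 0x200 (512) marks reads failing QC; none should remain.
failing <- sum(bitwAnd(res$flag, 512L) != 0L)
```

The filter only has an effect when the aligner or vendor pipeline actually set bit 0x200 in the file.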



With the new release of Illumina sequencing reagents (version 3) you
get 200 million reads per lane from the HiSeq 2000. In my view, with
samples that big becoming popular, any investment in "read in"
efficiency is a good investment. I would be happy to provide a sample
BAM for those interested in addressing this suggestion.

It is also my humble opinion that we should start considering
parallelisation for reading in. I hope that I am not just wishing too
much.

I'm actually revising the I/O a little at the moment; I'll implement a
better strategy for reading in the data. When I look at 'top' on my
system, I see the CPU running at, say, 50%, which implies disk input is
the bottleneck; probably this is on our system administration end, where
the large storage required for BAM files doesn't have completely adequate
performance. This I/O is tricky to gauge, because the next time through
the BAM file input _is_ CPU limited and much faster -- the disk system
has done some clever buffering. But in real use cases I wouldn't see the
benefit of that buffering since I wouldn't be revisiting the file.

In terms of parallel throughput it might often be appropriate to
parallelize at a higher level, e.g., iterating over regions of interest
(e.g., GRanges defining chromosomes, with the iteration via lapply) and
a function FUN tasked with input of data in a subset of the GRanges
followed by processing (e.g., counting overlaps in an RNASeq experiment)
that typically leads to a large reduction in data size. To parallelize,
replace lapply with mclapply (currently in the multicore package, but in
devel in the 'parallel' package distributed with base R). Use BamFile
and BamFileList to avoid re-loading the index on each file access. I'm
not sure that just inputting large amounts of data in parallel and then
pasting together to operate on as one large object is a real win -- the
data is big, and the processing is on a single processor (unless it is
split again in mclapply...).
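The region-wise pattern described above might look roughly like the following. This is a sketch under assumptions, not a tuned implementation: the example BAM from Rsamtools stands in for real data, countRegion is a hypothetical FUN that simply counts records per chromosome, and mclapply is taken from the 'parallel' package.

```r
library(Rsamtools)      # BamFile, ScanBamParam, countBam
library(GenomicRanges)  # GRanges
library(parallel)       # mclapply

# Example BAM shipped with Rsamtools; substitute your own indexed file.
bamPath <- system.file("extdata", "ex1.bam", package = "Rsamtools")
bf <- BamFile(bamPath)

# One range per reference sequence, built from the BAM header.
sl <- seqlengths(bf)
regions <- GRanges(names(sl), IRanges(start = 1, end = sl))

# Hypothetical FUN: input for one region, then reduce to a single count.
countRegion <- function(i) {
    param <- ScanBamParam(which = regions[i])
    countBam(bf, param = param)$records
}

idx <- seq_along(regions)

# Serial version.
serial <- lapply(idx, countRegion)

# Parallel version: same call with mclapply. Forking is not available
# on Windows, so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
forked <- mclapply(idx, countRegion, mc.cores = cores)
```

Each worker returns only a small summary, so little data travels back to the parent process. The BamFile is left unopened here so that every call opens its own connection; whether an already-open BamFile can safely be shared across forked workers is not something this sketch tests.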


I'm open to discussion on this...

Martin


Thank you,

Ivan

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
