Hi Martin,

Thanks for flag=scanBamFlag(isValidVendorRead=TRUE). I didn't know it existed.
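A minimal sketch of how that filter might be applied, in case it helps others on the list. The example BAM is the small file shipped with Rsamtools, standing in for a real data set. The fallback to isNotPassingQualityControls is an assumption about how other Rsamtools releases name the same flag bit, so please check ?scanBamFlag for the installed version.

```r
library(Rsamtools)  # scanBamFlag, ScanBamParam, countBam

# Example BAM shipped with Rsamtools; substitute your own file.
bam <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# The thread gives the argument as isValidVendorRead=TRUE. Assumption: some
# Rsamtools releases expose the same SAM flag bit (0x200) under the name
# isNotPassingQualityControls, with the opposite sense, so use whichever
# argument the installed version offers.
flag_args <- names(formals(scanBamFlag))
qc_flag <- if ("isValidVendorRead" %in% flag_args) {
  scanBamFlag(isValidVendorRead = TRUE)
} else {
  scanBamFlag(isNotPassingQualityControls = FALSE)
}

param <- ScanBamParam(flag = qc_flag)

n_all  <- countBam(bam)$records                 # every record in the file
n_pass <- countBam(bam, param = param)$records  # records kept by the filter
```

As Martin notes in the quoted message, the filter only has an effect if the aligner or vendor pipeline actually set the flag in the BAM file; otherwise both counts will be the same.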
Regarding parallelisation of I/O, I completely understand the challenge. I can only add that in my experience with R and big sequencing files, the bottleneck has invariably been the CPU and not the disk. I work on a 16-core Intel Xeon X5570 2.93GHz machine with 144 GB of RAM. The disk is accessed over a gigabit network.

I hope we get to hear some ideas from the community.

Thank you,

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878

On Fri, Sep 30, 2011 at 12:11 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:

> On 09/30/2011 07:48 AM, Ivan Gregoretti wrote:
>
>> Following Janet's example, I would also like to propose an upgrade to
>> ScanBamParam:
>>
>> It would be great if we could tell ScanBamParam that we want to load
>> only the reads that passed the vendor's quality filter.
>>
>> In other words, the functionality I am suggesting is analogous to the
>> filter in readAligned() from the ShortRead library.
>>
>
> Hi Ivan --
>
> in principle, flag=scanBamFlag(isValidVendorRead=TRUE) will do this; it
> requires that the flag is set in the BAM file.
>
>> With the new release of Illumina sequencing reagents (version 3) you
>> get 200 million reads per lane from the HiSeq 2000. In my view, with
>> samples that big becoming popular, any investment in "read in"
>> efficiency is a good investment. I would be happy to provide a sample
>> BAM for those interested in addressing this suggestion.
>>
>> It is also my humble opinion that we should start considering
>> parallelisation for reading in. I hope that I am not just wishing too
>> much.
>>
>
> I'm actually revising the I/O a little at the moment; I'll implement a
> better strategy for reading in the data.
> When I look at 'top' on my system, I see the CPU running at, say, 50%,
> which implies disk input is the bottleneck; probably this is on our
> system administration end, where the large storage required for BAM
> files doesn't have completely adequate performance. This I/O is tricky
> to gauge, because the next time through the BAM file input _is_ CPU
> limited and much faster -- the disk system has done some clever
> buffering. But in real use cases I wouldn't see the benefit of that
> buffering since I wouldn't be revisiting the file.
>
> In terms of parallel throughput it might often be appropriate to
> parallelize at a higher level, e.g., iterating over regions of interest
> (e.g., GRanges defining chromosomes, with the iteration via lapply) and a
> function FUN tasked with input of data in a subset of the GRanges followed
> by processing (e.g., counting overlaps in an RNASeq experiment) that
> typically leads to a large reduction in data size. To parallelize, replace
> lapply with mclapply (currently in the multicore package, but in devel in
> the 'parallel' package distributed with base R). Use BamFile and
> BamFileList to avoid re-loading the index on each file access. I'm not sure
> that just inputting large amounts of data in parallel and then pasting
> together to operate on as one large object is a real win -- the data is big,
> and the processing is on a single processor (unless it is split again in
> mclapply...).
>
> I'm open to discussion on this...
>
> Martin
>
>> Thank you,
>>
>> Ivan
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> Bioc-sig-sequencing@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793
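A minimal sketch of the region-wise strategy described in the quoted message above: iterate over regions with a function that reads one region and reduces it to a small result, then swap lapply for mclapply. This is an illustration only, not Martin's code. The example BAM is the small indexed file shipped with Rsamtools, count_region is a hypothetical helper name, and counting records per reference sequence stands in for real processing such as counting overlaps.

```r
library(Rsamtools)      # BamFile, ScanBamParam, scanBamHeader, countBam
library(GenomicRanges)  # GRanges, IRanges
library(parallel)       # mclapply (the 'parallel' package ships with R)

# Small indexed example BAM from Rsamtools; substitute your own indexed file.
fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")
bf <- BamFile(fl)  # file reference; the index is expected at <fl>.bai

# One region per reference sequence, taken from the BAM header.
targets <- scanBamHeader(fl)[[1]]$targets
regions <- GRanges(names(targets),
                   IRanges(start = 1, end = unname(targets)))

# Hypothetical helper: read only region i and reduce it to a single number.
count_region <- function(i) {
  param <- ScanBamParam(which = regions[i])
  countBam(bf, param = param)$records
}

# mclapply forks worker processes, which is not available on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Serial version: lapply(seq_along(regions), count_region)
counts <- mclapply(seq_along(regions), count_region, mc.cores = n_cores)
names(counts) <- as.character(seqnames(regions))
```

Each worker returns only a count, so the data brought back to the parent process stays small, which is the point of parallelising at this level rather than reading everything in parallel and pasting it together.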