Hello Martin,

I am currently reading in data with scanBam and making use of the
ScanBamParam function.

My bam data looks like this:

> names(bam1)
[1] "chr1:1-197195432" "chr2:1-181748087" "chr3:1-159599783"
> class(bam1)
[1] "list"
> class(bam1$"chr1:1-197195432")
[1] "list"
> class(bam1$"chr1:1-197195432"$qname)
[1] "character"

How do we consolidate all three lists (chr1, chr2, chr3) into a single
list? I read the Rsamtools documentation but the examples do not quite
apply to this case.

Thank you,

Ivan

On Wed, Jul 13, 2011 at 5:25 PM, Ivan Gregoretti <ivang...@gmail.com> wrote:
> Hi Martin,
>
> I will have to play a little bit with both the counter function and
> simplify2array. It's not evident to me what they are intended to do.
>
> I can answer your question about BAM size. Our samples are all from
> Illumina HiSeq now, so we are talking about no less than 100 million
> good-quality tags per lane.
>
> On different occasions I have run into the 2^31 - 1 limit, and I am
> always thinking about ways to process large amounts of information in
> chunks and in parallel. With time, sequencers will only output more
> tags, so we will run into these problems ever more often.
>
> The suggestion of using ScanBamParam is good, thank you, but at this
> early stage of sample analysis I need to load the whole BAM. Some
> sample quality assessments are in order.
>
> Thank you,
>
> Ivan
>
>
> On Wed, Jul 13, 2011 at 5:04 PM, Martin Morgan <mtmor...@fhcrc.org> wrote:
>> On 07/13/2011 01:57 PM, Martin Morgan wrote:
>>>
>>> On 07/13/2011 01:36 PM, Ivan Gregoretti wrote:
>>>>
>>>> Hi everybody,
>>>>
>>>> As I wait for my large BAM to be read in by scanBam, I can't help but
>>>> wonder:
>>>>
>>>> Has anybody tried combining scanBam with multicore to load the
>>>> chromosomes in parallel?
>>>>
>>>> That would require
>>>>
>>>> 1) merging the chunks at the end, and
>>>>
>>>> 2) the original BAM to be indexed.
>>>>
>>>> Does anybody have any experience to share?
>>>
>>> I was wondering how large and how long we're talking about?
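As for consolidating the per-chromosome lists returned by scanBam into a
single list: one approach is to concatenate them field by field. A minimal
sketch, assuming every element of bam1 has the same components in the same
order (as scanBam returns for a single ScanBamParam), and noting that c() on
factor components such as rname or strand returns integer codes, so those
would need as.character() first:

    ## equivalent to Map(c, bam1[[1]], bam1[[2]], bam1[[3]]),
    ## written generically for any number of chromosomes
    bam <- do.call(Map, c(c, unname(bam1)))
    length(bam$qname)  ## total reads across chr1, chr2, chr3

This keeps the familiar list-of-fields shape (bam$qname, bam$pos, ...), just
with the per-chromosome pieces joined end to end.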
>>>
>>> Use of ScanBamParam(what=...) can help.
>>>
>>> For some tasks I'd think of a coarser granularity, e.g., in the context
>>> of multiple bam files, so that the data reduction (to a vector of
>>> 10,000's of counts) occurs on each core:
>>>
>>> counter = function(fl, genes) {
>>>     aln = readGappedAlignments(fl)
>>>     strand(aln) = "*"
>>>     hits = countOverlaps(aln, genes)
>>>     countOverlaps(genes, aln[hits == 1])
>>> }
>>> simplify2array(mclapply(bamFiles, counter, genes))
>>>
>>> One issue I understand people have is that mclapply uses 'serialize()'
>>> to convert the return value of each function to a raw vector. Raw
>>> vectors have the same total length limit as any other vector (2^31 - 1
>>> elements), and this places a limit on the size of the chunk returned by
>>> each core. Also, I believe that exceeding the limit can silently corrupt
>>> the data (i.e., a bug). This is second-hand information.
>>
>> I should also have mentioned that casting a GRanges object to RangesList
>> provides the appropriate list to iterate over chromosomes, and that
>> ScanBamParam will accept a 'which' of class RangesList.
>>
>> Martin
>>
>>>
>>> Martin
>>>
>>>>
>>>> Thank you,
>>>>
>>>> Ivan
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing@r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
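Putting Martin's RangesList suggestion together with mclapply, per-chromosome
parallel loading might look like the sketch below. The file name and the
'what' fields are illustrative assumptions (not from the thread), and the BAM
must have a .bai index:

    library(Rsamtools)
    library(multicore)  ## provides mclapply; 'parallel' in later R

    fl <- "sample.bam"  ## assumed name; sample.bam.bai must exist
    ## named integer vector of chromosome lengths from the BAM header
    targets <- scanBamHeader(fl)[[1]]$targets
    gr <- GRanges(names(targets), IRanges(1L, targets))
    which <- as(gr, "RangesList")  ## ScanBamParam accepts this as 'which'

    chunks <- mclapply(seq_len(length(which)), function(i) {
        p <- ScanBamParam(which = which[i], what = c("qname", "pos"))
        scanBam(fl, param = p)[[1]]
    })
    ## merge the per-chromosome chunks field by field
    bam <- do.call(Map, c(c, chunks))

Note that each chunk still passes through serialize() on its way back from
the worker, so the 2^31 - 1 element limit discussed above applies per
chromosome rather than per file, which is usually a comfortable margin.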