Hej Bioc Core! There was some discussion last year about implementing a BamStreamer (à la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){ ## create a stream stopifnot(is(bamFile,"BamFile")) ## set the yieldSize if it is not set already if(is.na(yieldSize(bamFile))){ yieldSize(bamFile) <- yieldSize } ## open it open(bamFile) ## verb if(verbose){ message(paste("Streaming",basename(path(bamFile)))) } ## create the output out <- GappedAlignments() ## process it while(length(chunk <- readBamGappedAlignments(bamFile))){ if(verbose){ message(paste("Processed",length(chunk),"reads")) } out <- c(out,chunk) } ## close close(bamFile) ## return return(out) } In the method above, the first iteration of combining the GappedAlignments: out <- c(out,chunk) takes: system.time(append(out,chunk)) user system elapsed 123.704 0.060 124.011 whereas the second iteration (faked here) takes only (still long): system.time(append(chunk,chunk)) user system elapsed 2.708 0.044 2.758 I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below: out <- GappedAlignments(seqlengths=seqlengths(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk)) : 'names(seqlengths)' incompatible with 'levels(seqnames)' out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk)) Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) : 'strand' must be specified when 'seqnames' is not empty out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+") Error in validObject(.Object) : invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle" invalid class “GappedAlignments” object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x' I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem. My sessionInfo: R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12 [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15 [7] BiocGenerics_0.7.2 loaded via a namespace (and not attached): [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0 Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delho...@embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel