Thank you for the information. I guess I'll try to stick to the R-level parallelization whenever possible.
Best,
Oleksii

On Wed, 26 May 2021 at 13:47, Martin Morgan <mtmorgan.b...@gmail.com> wrote:

> The best way to process large files is in chunks using BamFile(...,
> yieldSize = ...) and by using ScanBamParam() to select just the components
> of the BAM files of interest. The number of cores is basically irrelevant
> for input -- you'll be using just one, so choose yieldSize to use modest
> amounts of memory for primary data, e.g., 4 GB per file, and process each
> file separately.
>
> Figure out the iteration-by-chunk solution for one file; the simplest
> example is in ?Rsamtools::BamFile
>
>     ## Use 'yieldSize' to iterate through a file in chunks.
>     bf <- open(BamFile(fl, yieldSize=1000))
>     while (nrec <- length(scanBam(bf)[[1]][[1]]))
>         cat("records:", nrec, "\n")
>     close(bf)
>
> but you'd likely want the convenience of
> GenomicAlignments::readGAlignments() / readGAlignmentPairs().
>
> Once this is working, write it as a proper function, specifying all
> packages required for the function to complete, e.g.,
>
>     fun <- function(fl, yieldSize) {
>         library(Rsamtools)
>         nrec <- 0L
>         bf <- open(BamFile(fl, yieldSize=yieldSize))
>         repeat {
>             len <- length(scanBam(bf)[[1]][[1]])
>             if (len == 0L)
>                 break
>             nrec <- nrec + len
>         }
>         close(bf)
>         nrec
>     }
>
> Try to minimize the size of the inputs (here just the file name) and the
> outputs (nrec, a single integer), perhaps using the file system to
> temporarily store large results. Use BiocParallel::bplapply() to apply
> this to all files
>
>     bplapply(fls, fun, yieldSize = 1000000)
>
> I would actually recommend BiocParallel::SnowParam() (separate processes)
> because (a) it enforces the discipline that the function does not rely
> implicitly on the state of the parent process, and (b) it ensures
> operation across all OSs and an easier transition to, e.g., an HPC
> cluster. The fixed cost of starting a separate process for each file is
> outweighed by the time spent processing the file in that process.
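[Editor's note: the SnowParam() advice above can be sketched as follows. This is a minimal sketch, not from the original thread; it assumes `fls` is a character vector of BAM file paths and `fun` is the counting function defined above, and the worker count of 4 is arbitrary.]

```r
library(BiocParallel)

## Separate worker processes (not forks): workers inherit nothing
## implicitly from the parent session, so fun() must load its own
## packages, as in the definition above.
param <- SnowParam(workers = 4)

## Apply fun() to each BAM file; one file per worker task. Inputs
## (a file name) and outputs (a single integer) are kept small.
res <- bplapply(fls, fun, yieldSize = 1000000, BPPARAM = param)
```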
>
> GenomicFiles::reduceByYield() or reduceByFile() might be relevant.
>
> I am not totally current (others on this list probably know more) but I
> don't think OpenMP is supported on macOS
> (https://mac.r-project.org/openmp/), so it would be a poor choice at the
> C level if cross-platform utility were important. If it were me, and
> again I do not have enough recent experience, I might aim for Intel
> Threading Building Blocks, using RcppParallel for inspiration.
>
> Martin
>
> From: Oleksii Nikolaienko <oleksii.nikolaie...@gmail.com>
> Date: Tuesday, May 25, 2021 at 6:28 PM
> To: Martin Morgan <mtmorgan.b...@gmail.com>
> Cc: "bioc-devel@r-project.org" <bioc-devel@r-project.org>
> Subject: Re: [Bioc-devel] C++ parallel computing
>
> Hi Martin,
> thanks for your answer. The goal is to speed up my package (epialleleR),
> where most of the functions are already written in C++, but the code is
> single-threaded. Tasks include: applying an analog of
> GenomicAlignments::sequenceLayer to SEQ, QUAL and XM strings, calculating
> per-read methylation beta values, and creating cytosine methylation
> reports with prefiltering of sequence reads. I could probably parallelize
> all of them at the R level, but even in that case I'd maybe like to use
> OpenMP SIMD directives.
> And yes, the plan is to use Rhtslib. The current backend for reading BAM
> files is Rsamtools, but I believe I could speed things up significantly
> by avoiding unnecessary type conversions and cutting other corners. It
> doesn't hurt much when the BAM file is smaller than 1 GB, but for a
> 20-40 GB file loading takes more than an hour (24-core, 378 GB RAM
> workstation).
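[Editor's note: the GenomicFiles::reduceByYield() pattern mentioned above might look roughly like the sketch below. This is not from the original thread; `fl` (a BAM file path) and the per-chunk summary are placeholders, and the MAP/REDUCE steps would be replaced by the actual per-chunk computation.]

```r
library(GenomicFiles)
library(GenomicAlignments)

bf <- BamFile(fl, yieldSize = 1000000)

## YIELD reads one chunk of alignments, MAP summarizes it (here,
## just a record count), and REDUCE combines successive summaries.
nrec <- reduceByYield(bf,
    YIELD  = function(x) readGAlignments(x),
    MAP    = function(aln) length(aln),
    REDUCE = `+`)
```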
>
> Best,
> Oleksii
>
> On Tue, 25 May 2021 at 19:39, Martin Morgan <mtmorgan.b...@gmail.com>
> wrote:
>
> If the BAM files are each processed independently, and each processing
> task takes a while, then it is probably 'good enough' to use R-level
> parallel evaluation with BiocParallel (currently the recommendation for
> Bioconductor packages) or another evaluation framework. Also, presumably
> you will use Rhtslib, which provides C-level access to the hts library.
> This will require writing C / C++ code to interface between R and the
> hts library, and will of course be a significant undertaking.
>
> It might be worth outlining in a bit more detail what your task is and
> how (not too much detail!) you've tried to implement it in Rsamtools.
>
> Martin Morgan
>
> On 5/24/21, 10:01 AM, "Bioc-devel on behalf of Oleksii Nikolaienko"
> <bioc-devel-boun...@r-project.org on behalf of
> oleksii.nikolaie...@gmail.com> wrote:
>
> Dear Bioc team,
> I'd like to ask for your advice on parallelization within a Bioc
> package. Please point me to a better place if this mailing list is not
> appropriate.
> After a bit of thinking I decided that I'd like to parallelize
> processing at the level of C++ code. Would you strongly recommend
> against that, in favor of an R approach (e.g. "future")?
> If parallel C++ is ok, what would be the best solution for all major
> OSs? My initial choice was OpenMP, but it seems that Apple has
> something against it (https://mac.r-project.org/openmp/). My own dev
> environment is mostly Big Sur/ARM64, but I wouldn't want to drop its
> support anyway.
>
> (On the actual task: loading and specific processing of very large BAM
> files, ideally significantly faster than by means of Rsamtools as a
> backend.)
>
> Best,
> Oleksii Nikolaienko

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel