[Bioc-devel] Reducing memory footprint of large object

Christian Arnold Thu, 05 Nov 2015 07:29:32 -0800


Hi all,

I wanted to ask the experts on this list whether any of you have advice on the following problem:

I got a large SNPhood object from someone (package SNPhood, which I developed) from an analysis of about 200,000 SNPs. It stores lots of read counts and the positions of overlapping reads in general. In total, the object is 2 GB in size. I examined the object and identified the slots that need the most memory. The largest slot stores a nested list that holds the read start positions of all overlapping reads for each SNP region.

For example, for one individual, this is a list of length 120,049 holding integer vectors, with 20,853,838 elements across all vectors in total:

> format(object.size(SNPhood.o@internal$readStartPos$ambiguous$GM12878), units = "Mb")

[1] "86 Mb"

Unsurprisingly, unlisting improves this only slightly:

> format(object.size(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878)), units = "Mb")

"79.6 Mb"

> length(SNPhood.o@internal$readStartPos$ambiguous$GM12878)
[1] 120049
> length(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878))
[1] 20853838
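
A rough sanity check on these numbers, assuming the usual 4 bytes per R integer: the raw data alone account for almost all of the 79.6 Mb, and the gap to 86 Mb works out to roughly 56 bytes per list element, which is plausible as per-vector overhead. The list-versus-vector comparison at the end uses simulated data (the names `x` and `as_list` are illustrative only), not the SNPhood object.

```r
# Numbers taken from the console output above
n_vectors  <- 120049L     # length of the list
n_elements <- 20853838L   # total number of integers across all vectors

# Assumption: 4 bytes per integer; "Mb" in object.size() means 1024^2 bytes
data_mb <- as.numeric(n_elements) * 4 / 1024^2
round(data_mb, 1)

# Remaining size spread over the list elements, in bytes per vector
overhead_bytes <- (86 - data_mb) * 1024^2 / n_vectors
round(overhead_bytes)

# Simulated illustration: the same integers cost more when split into
# many short vectors, because every vector carries its own header
set.seed(1)
x <- sample.int(1e6, 1e5)
as_list <- unname(split(x, rep(seq_len(1e4), each = 10)))
object.size(as_list) > object.size(x)
```

So unlisting can only remove the per-vector overhead; the integers themselves set the lower bound for this representation.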

The vector of read start positions may look like this:

head(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878),50)

 [1] 714086 714087 714088 714089 714099 714100 714106 714108 714110 714114
[11] 714114 714123 714123 714123 714125 714125 714128 714130 714138 714139
[21] 714145 714148 714149 714150 714151 714152 714154 714164 714164 714172
[31] 714173 714184 714186 714187 714188 714189 714192 714194 714198 714204
[41] 714206 714209 714209 714212 714216 714219 714219 714223 714224 714224

So there are a few reads with identical start sites, but this does not occur very often. I do need all of this information for further processing.

Do you have any idea how I could store this information more efficiently so that the overall object size is reduced? I could try an Rle, but the structure of the data does not seem ideal for this...
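
To illustrate the Rle concern, here is a minimal sketch using base R's `rle()` as a stand-in for the S4Vectors `Rle` class (an assumption made to keep the example dependency-free), applied to the 50 start positions shown above. With so few repeated values, run-length encoding needs more integers than the plain vector.

```r
# The 50 read start positions from the head() output above
x <- c(714086L, 714087L, 714088L, 714089L, 714099L, 714100L, 714106L,
       714108L, 714110L, 714114L, 714114L, 714123L, 714123L, 714123L,
       714125L, 714125L, 714128L, 714130L, 714138L, 714139L, 714145L,
       714148L, 714149L, 714150L, 714151L, 714152L, 714154L, 714164L,
       714164L, 714172L, 714173L, 714184L, 714186L, 714187L, 714188L,
       714189L, 714192L, 714194L, 714198L, 714204L, 714206L, 714209L,
       714209L, 714212L, 714216L, 714219L, 714219L, 714223L, 714224L,
       714224L)

# Run-length encoding stores one value and one run length per run
r <- rle(x)
n_runs <- length(r$values)

# Integers needed: plain vector versus run-length encoding
n_plain <- length(x)     # 50
n_rle   <- 2L * n_runs   # 42 runs, so 84 integers

# The encoding is lossless, but larger for this kind of data
identical(inverse.rle(r), x)
n_rle > n_plain
```

Run-length encoding only pays off when runs are long on average, which is not the case for these mostly distinct positions.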


Any tips are very much appreciated!

Thanks,
Christian

--
—————————————————————————
Christian Arnold, PhD
Staff Bioinformatician

SCB Unit - Computational Biology
Joint appointment Genome Biology
Joint appointment European Bioinformatics Institute (EMBL-EBI)

European Molecular Biology Laboratory (EMBL)
Meyerhofstrasse 1; 69117, Heidelberg, Germany

Email: christian.arn...@embl.de
Phone: +49(0)6221-387-8472
Web: http://www.zaugg.embl.de/

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
