Hi all,

I wanted to ask this list full of experts whether any of you have advice on the following problem:

I received a large SNPhood object from someone (from the SNPhood package, which I developed). It comes from an analysis of roughly 200,000 SNPs and stores many read counts as well as the positions of the overlapping reads. In total, the object is about 2 GB in size. I examined the object and identified the slots that need the most memory. The slot shown below holds a nested list that stores the read start positions of all overlapping reads for each SNP region.

For example, for a single individual it is a list of 120,049 integer vectors that together contain 20,853,838 elements:

> format(object.size(SNPhood.o@internal$readStartPos$ambiguous$GM12878), units = "Mb")
[1] "86 Mb"

Unsurprisingly, unlisting reduces the size only slightly:

> format(object.size(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878)), units = "Mb")
[1] "79.6 Mb"

> length(SNPhood.o@internal$readStartPos$ambiguous$GM12878)
[1] 120049
> length(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878))
[1] 20853838
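
To make the comparison above concrete, here is a minimal sketch with simulated data (not the real object; names such as readStarts are made up for illustration and are not part of SNPhood). It assumes that the saving from unlisting comes mostly from the per-vector overhead of the list (as far as I know about 48 bytes of header per vector plus an 8-byte pointer on a 64-bit build, which would be roughly consistent with the ~6 Mb difference above). It also shows that a flat vector plus the per-region lengths is enough to rebuild the original list:

```r
# Simulated stand-in for one individual's read start positions
# (hypothetical data; 'readStarts' is not a SNPhood name).
set.seed(1)
nRegions <- 1000L
readStarts <- lapply(seq_len(nRegions), function(i) {
  # 0 to 300 sorted integer positions per region
  sort(sample.int(1000000L, sample(0:300, 1L)))
})

# Flat representation: all positions in one vector plus one length per region
flat <- unlist(readStarts, use.names = FALSE)
lens <- lengths(readStarts)

sizeList <- as.numeric(object.size(readStarts))
sizeFlat <- as.numeric(object.size(flat)) + as.numeric(object.size(lens))
cat("list:", sizeList, "bytes; flat + lengths:", sizeFlat, "bytes\n")

# Rebuild the list from the flat vector and the lengths;
# the explicit factor levels keep regions without any reads
regionId <- factor(rep.int(seq_along(lens), lens), levels = seq_along(lens))
rebuilt <- unname(split(flat, regionId))
```

The saving is only the list overhead, so the positions themselves still take 4 bytes each in either representation.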

The vector of read start positions may look like this:

> head(unlist(SNPhood.o@internal$readStartPos$ambiguous$GM12878), 50)
[1] 714086 714087 714088 714089 714099 714100 714106 714108 714110 714114 714114 714123 714123 714123 714125 714125 714128 714130 714138 714139 714145 714148 714149 714150 714151 714152 714154 714164 714164 714172 714173 714184 714186 714187 714188 714189 714192 714194 714198 714204 714206 714209 714209 714212 714216 714219 714219 714223 714224 714224

So a few reads share the same start site, but this does not happen very often. I do need all of this information for further processing.

Do you have any idea how I could store this information more efficiently so that the overall object size is reduced? I could try an Rle, but the structure of the data does not seem ideal for that...
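
To illustrate the Rle concern, here is a small sketch on the 50 positions shown above. It uses base R's rle() as a stand-in for a Bioconductor Rle (that substitution is my assumption; the storage idea of values plus run lengths is the same). Because only a few positions repeat, the run-length form has to store more integers than the plain vector:

```r
# The 50 read start positions shown earlier in this mail
pos <- c(714086L, 714087L, 714088L, 714089L, 714099L, 714100L, 714106L,
         714108L, 714110L, 714114L, 714114L, 714123L, 714123L, 714123L,
         714125L, 714125L, 714128L, 714130L, 714138L, 714139L, 714145L,
         714148L, 714149L, 714150L, 714151L, 714152L, 714154L, 714164L,
         714164L, 714172L, 714173L, 714184L, 714186L, 714187L, 714188L,
         714189L, 714192L, 714194L, 714198L, 714204L, 714206L, 714209L,
         714209L, 714212L, 714216L, 714219L, 714219L, 714223L, 714224L,
         714224L)

# Run-length encoding: one value and one run length per run
r <- rle(pos)
nRuns <- length(r$values)

# Integers stored: plain vector vs. values + lengths of the encoding
storedPlain <- length(pos)
storedRle <- length(r$values) + length(r$lengths)
cat("runs:", nRuns, "; plain:", storedPlain, "; rle:", storedRle, "\n")

# The encoding is lossless, it just does not pay off here
decoded <- inverse.rle(r)
```

For these 50 values there are 42 runs, so the encoding holds 84 integers instead of 50.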

Any tips are very much appreciated!

Thanks,
Christian

--
—————————————————————————
Christian Arnold, PhD
Staff Bioinformatician

SCB Unit - Computational Biology
Joint appointment Genome Biology
Joint appointment European Bioinformatics Institute (EMBL-EBI)

European Molecular Biology Laboratory (EMBL)
Meyerhofstrasse 1; 69117, Heidelberg, Germany

Email: christian.arn...@embl.de
Phone: +49(0)6221-387-8472
Web: http://www.zaugg.embl.de/

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel