Hi, (probably hello to you, Martin) I'm looking at some Illumina seq data, and trying to be more rigorous than I have been in the past about memory usage and tidying up unused variables. I'm a little mystified by something - I wonder if you can help me understand?
I'm starting with a big AlignedRead object (one full lane of seq data) and then I've been using [] on AlignedRead objects to take various subsets of the data (and then looking at quality scores, map positions, etc). I'm also taking some very small subsets (e.g. just the first 100 reads) to test and optimize some functions I'm writing. My confusion comes because even though I'm cutting down the number of seq reads by a lot (e.g. from 18 million to just 100 reads), the new AlignedRead object still takes up a lot of memory. Two examples are given below - in both cases the small object takes about half as much memory as the original, even though the number of reads is now very much smaller. Do you have any suggestions as to how I might reduce the memory footprint of the subsetted AlignedRead object? Is this an expected behavior? thanks very much, Janet library(ShortRead) exptPath <- system.file("extdata", package = "ShortRead") sp <- SolexaPath(exptPath) aln <- readAligned(sp, "s_2_export.txt") aln ## aln has 1000 reads aln_small <- aln[1:2] ### aln 2 has 2 reads object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes as.numeric(object.size(aln_small)) / as.numeric(object.size(aln)) #### [1] 0.4978324 read2Dir <- "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2" my_reads <- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport") my_reads_verysmall <- my_reads[1:100] length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100 object.size(my_reads) # 3190125528 bytes object.size(my_reads_verysmall) # 1753653496 bytes as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads)) # [1] 0.549713 sessionInfo() R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.10.0 Rsamtools_1.4.1 lattice_0.19-26 Biostrings_2.20.0 [5] GenomicRanges_1.4.3 IRanges_1.10.0 loaded via a namespace (and not attached): [1] Biobase_2.12.1 grid_2.13.0 hwriter_1.3 _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing