Hi, (probably hello to you, Martin)

I'm looking at some Illumina seq data, and trying to be more rigorous than I 
have been in the past about memory usage and tidying up unused variables. I'm a 
little mystified by something - I wonder if you can help me understand?  

I'm starting with a big AlignedRead object (one full lane of seq data) and then 
I've been using [] on AlignedRead objects to take various subsets of the data 
(and then looking at quality scores, map positions, etc).   I'm also taking 
some very small subsets (e.g. just the first 100 reads) to test and optimize 
some functions I'm writing.

My confusion comes because even though I'm cutting down the number of seq reads 
by a lot (e.g. from 18 million to just 100 reads), the new AlignedRead object 
still takes up a lot of memory.   

Two examples are given below - in both cases the small object takes about half 
as much memory as the original, even though the number of reads is now very 
much smaller.

Do you have any suggestions as to how I might reduce the memory footprint of 
the subsetted AlignedRead object?  Is this an expected behavior?

thanks very much,

Janet


library(ShortRead)

exptPath <- system.file("extdata", package = "ShortRead")
sp <- SolexaPath(exptPath)
aln <- readAligned(sp, "s_2_export.txt")

aln  ## aln has 1000 reads
aln_small <- aln[1:2]   ## aln_small has 2 reads

object.size(aln)
# 165156 bytes
object.size(aln_small)
# 82220 bytes

as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
#### [1] 0.4978324

read2Dir <- "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
my_reads <- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
my_reads_verysmall <- my_reads[1:100]

length(my_reads)
# [1] 17894091
length(my_reads_verysmall)
# [1] 100

object.size(my_reads)
# 3190125528 bytes
object.size(my_reads_verysmall)
# 1753653496 bytes

as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads))
# [1] 0.549713
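
To narrow down which part of the object holds the memory, one option might be to measure each slot separately rather than the object as a whole. This is only a minimal sketch using base R: slot_sizes() is a hypothetical helper (it is not part of ShortRead), and the ToyReads class is an illustrative stand-in so the example runs without any sequencing data.

```r
library(methods)

# Hypothetical helper (not part of ShortRead): report the object.size()
# of every slot of an S4 object, largest first.
slot_sizes <- function(x) {
  stopifnot(isS4(x))
  nms <- slotNames(x)
  sizes <- vapply(nms,
                  function(s) as.numeric(object.size(slot(x, s))),
                  numeric(1))
  sort(sizes, decreasing = TRUE)
}

# Toy S4 class standing in for a real object (assumption, for illustration only)
setClass("ToyReads", representation(id = "character", position = "integer"))
toy <- new("ToyReads", id = as.character(1:1000), position = 1:1000)

# Named numeric vector of bytes per slot
slot_sizes(toy)
```

On the objects above the same call would be slot_sizes(aln) and slot_sizes(aln_small); comparing the two should show which slots fail to shrink after subsetting.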



sessionInfo()

R version 2.13.0 (2011-04-13)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ShortRead_1.10.0    Rsamtools_1.4.1     lattice_0.19-26     Biostrings_2.20.0  
[5] GenomicRanges_1.4.3 IRanges_1.10.0     

loaded via a namespace (and not attached):
[1] Biobase_2.12.1 grid_2.13.0    hwriter_1.3   

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
