On 05/10/2011 02:47 PM, Janet Young wrote:
Hi, (probably hello to you, Martin)

Hi (or is that hello?) Janet --

I'm looking at some Illumina seq data, and trying to be more rigorous
than I have been in the past about memory usage and tidying up unused
variables. I'm a little mystified by something - I wonder if you can
help me understand?

I'm starting with a big AlignedRead object (one full lane of seq
data) and then I've been using [] on AlignedRead objects to take
various subsets of the data (and then looking at quality scores, map
positions, etc).   I'm also taking some very small subsets (e.g. just
the first 100 reads) to test and optimize some functions I'm
writing.

My confusion comes because even though I'm cutting down the number of
seq reads by a lot (e.g. from 18 million to just 100 reads), the new
AlignedRead object still takes up a lot of memory.

XStringSet (including DNAStringSet and BStringSet, which are used to hold AlignedRead sequence and quality information) are actually 'views' (start and end coordinates) on an underlying XString; when you subset a DNAStringSet, you subset the view but not the underlying DNAString. This might sound bad, but actually you have just one instance of the DNAString, and two light-weight views (the full view, and the subset view). If you're done with the full DNAString, you can force it to be compacated with compact(), e.g., after your first example

> object.size(s1)
50840 bytes
> object.size(s1[1:10]) # mostly smaller 'view'
38984 bytes
> object.size(s1[1]) # a little smaller 'view', same big DNAString
38864 bytes
> object.size(compact(s1[1]))
3912 bytes

or

> object.size(aln)
175968 bytes
> object.size(aln[1])
91488 bytes
> object.size(compact(aln[1]))
21584 bytes

(I think Herve is working on your weighted coverage question)

Martin


Two examples are given below - in both cases the small object takes
about half as much memory as the original, even though the number of
reads is now very much smaller.

Do you have any suggestions as to how I might reduce the memory
footprint of the subsetted AlignedRead object?  Is this an expected
behavior?

thanks very much,

Janet


library(ShortRead)

exptPath<- system.file("extdata", package = "ShortRead") sp<-
SolexaPath(exptPath) aln<- readAligned(sp, "s_2_export.txt")

aln  ## aln has 1000 reads aln_small<- aln[1:2]   ### aln 2 has 2
reads

object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes

as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
#### [1] 0.4978324

read2Dir<-
"data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"


my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
my_reads_verysmall<- my_reads[1:100]

length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100

object.size(my_reads) # 3190125528 bytes
object.size(my_reads_verysmall) # 1753653496 bytes

as.numeric(object.size(my_reads_verysmall)) /
as.numeric(object.size(my_reads)) # [1] 0.549713



sessionInfo()

R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386
(32-bit)

locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats     graphics  grDevices utils
datasets  methods   base

other attached packages: [1] ShortRead_1.10.0    Rsamtools_1.4.1
lattice_0.19-26     Biostrings_2.20.0 [5] GenomicRanges_1.4.3
IRanges_1.10.0

loaded via a namespace (and not attached): [1] Biobase_2.12.1
grid_2.13.0    hwriter_1.3

_______________________________________________ Bioc-sig-sequencing
mailing list Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to