On 05/10/2011 02:47 PM, Janet Young wrote:
Hi, (probably hello to you, Martin)
Hi (or is that hello?) Janet --
I'm looking at some Illumina seq data, and trying to be more rigorous
than I have been in the past about memory usage and tidying up unused
variables. I'm a little mystified by something - I wonder if you can
help me understand?
I'm starting with a big AlignedRead object (one full lane of seq
data) and then I've been using [] on AlignedRead objects to take
various subsets of the data (and then looking at quality scores, map
positions, etc). I'm also taking some very small subsets (e.g. just
the first 100 reads) to test and optimize some functions I'm
writing.
My confusion comes because even though I'm cutting down the number of
seq reads by a lot (e.g. from 18 million to just 100 reads), the new
AlignedRead object still takes up a lot of memory.
XStringSet (including DNAStringSet and BStringSet, which are used to
hold AlignedRead sequence and quality information) are actually 'views'
(start and end coordinates) on an underlying XString; when you subset a
DNAStringSet, you subset the view but not the underlying DNAString. This
might sound bad, but actually you have just one instance of the
DNAString, and two light-weight views (the full view, and the subset
view). If you're done with the full DNAString, you can force it to be
compacated with compact(), e.g., after your first example
> object.size(s1)
50840 bytes
> object.size(s1[1:10]) # mostly smaller 'view'
38984 bytes
> object.size(s1[1]) # a little smaller 'view', same big DNAString
38864 bytes
> object.size(compact(s1[1]))
3912 bytes
or
> object.size(aln)
175968 bytes
> object.size(aln[1])
91488 bytes
> object.size(compact(aln[1]))
21584 bytes
(I think Herve is working on your weighted coverage question)
Martin
Two examples are given below - in both cases the small object takes
about half as much memory as the original, even though the number of
reads is now very much smaller.
Do you have any suggestions as to how I might reduce the memory
footprint of the subsetted AlignedRead object? Is this an expected
behavior?
thanks very much,
Janet
library(ShortRead)
exptPath<- system.file("extdata", package = "ShortRead") sp<-
SolexaPath(exptPath) aln<- readAligned(sp, "s_2_export.txt")
aln ## aln has 1000 reads aln_small<- aln[1:2] ### aln 2 has 2
reads
object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes
as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
#### [1] 0.4978324
read2Dir<-
"data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
my_reads<- readAligned(read2Dir, pattern="s_1_export.txt",
type="SolexaExport")
my_reads_verysmall<- my_reads[1:100]
length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100
object.size(my_reads) # 3190125528 bytes
object.size(my_reads_verysmall) # 1753653496 bytes
as.numeric(object.size(my_reads_verysmall)) /
as.numeric(object.size(my_reads)) # [1] 0.549713
sessionInfo()
R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386
(32-bit)
locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages: [1] stats graphics grDevices utils
datasets methods base
other attached packages: [1] ShortRead_1.10.0 Rsamtools_1.4.1
lattice_0.19-26 Biostrings_2.20.0 [5] GenomicRanges_1.4.3
IRanges_1.10.0
loaded via a namespace (and not attached): [1] Biobase_2.12.1
grid_2.13.0 hwriter_1.3
_______________________________________________ Bioc-sig-sequencing
mailing list Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing