Hi again, I've been looking into this question a bit more, now using compact(). Things seemed to go well with the example dataset provided with ShortRead (as far as I can tell). But when I use our own much larger dataset the compaction factor is not very good (my small object is the first 100 reads, the full set is ~18 million reads, but the small object is 40% the size in memory of the full object). I think it looks like quality and sreads haven't really been compacted nearly as much as I would expect. The code is below.
Martin - if you would like access to this dataset I can email you about that privately, but I'm guessing the same would happen with any big dataset? Janet library(ShortRead) ########## a big dataset (our data) read2Dir <- "solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2" my_reads <- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport") my_reads_verysmall <- my_reads[1:100] my_reads_verysmall_compact <- compact(my_reads[1:100]) length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100 length(my_reads_verysmall_compact) # [1] 100 object.size(my_reads) # 3190125528 bytes object.size(my_reads_verysmall) # 1753653496 bytes object.size(my_reads_verysmall_compact) # 1253663384 bytes as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads)) # [1] 0.549713 as.numeric(object.size(my_reads_verysmall_compact)) / as.numeric(object.size(my_reads)) # [1] 0.3929825 ##### where is most of that memory? object.size(position(my_reads_verysmall_compact)) # 440 bytes object.size(chromosome(my_reads_verysmall_compact)) # 3040 bytes object.size(sread(my_reads_verysmall_compact)) # 626820976 bytes object.size(quality(my_reads_verysmall_compact)) # 626821568 bytes myDNA2 <- DNAStringSet(as.character(sread(my_reads_verysmall_compact))) object.size(myDNA2) # 9944 bytes ###### and ShortRead's example dataset, 1000 reads exptPath <- system.file("extdata", package = "ShortRead") sp <- SolexaPath(exptPath) aln <- readAligned(sp, "s_2_export.txt") aln ## aln has 1000 reads aln_small <- aln[1:2] ### aln 2 has 2 reads aln_small_compact <- compact(aln[1:2]) object.size(aln) # 175968 bytes object.size(aln_small) # 91488 bytes object.size(aln_small_compact) # 21744 bytes as.numeric(object.size(aln_small)) / as.numeric(object.size(aln)) # [1] 0.5199127 as.numeric(object.size(aln_small_compact)) / as.numeric(object.size(aln)) # [1] 0.1235679 myDNA <- DNAStringSet( as.character(sread(aln_small_compact)) ) identical(myDNA,sread(aln_small_compact)) [1] TRUE aln_medium_compact <- compact(aln[1:300]) myDNA_medium <- DNAStringSet( as.character(sread(aln_medium_compact)) ) identical(myDNA,sread(aln_medium_compact)) # [1] FALSE aln_10_compact <- compact(aln[1:10]) myDNA_10 <- DNAStringSet( as.character(sread(aln_10_compact)) ) identical(myDNA,sread(aln_10_compact)) # [1] FALSE as.numeric(object.size(aln_10_compact)) / as.numeric(object.size(aln)) # [1] 0.1317512 ######### sessionInfo() R version 2.13.0 (2011-04-13) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.10.0 Rsamtools_1.4.0 lattice_0.19-23 [4] Biostrings_2.20.0 GenomicRanges_1.4.0 IRanges_1.10.0 loaded via a namespace (and not attached): [1] Biobase_2.12.0 grid_2.13.0 hwriter_1.3 On May 10, 2011, at 4:25 PM, Martin Morgan wrote: > On 05/10/2011 02:47 PM, Janet Young wrote: >> Hi, (probably hello to you, Martin) > > Hi (or is that hello?) Janet -- > >> I'm looking at some Illumina seq data, and trying to be more rigorous >> than I have been in the past about memory usage and tidying up unused >> variables. I'm a little mystified by something - I wonder if you can >> help me understand? >> >> I'm starting with a big AlignedRead object (one full lane of seq >> data) and then I've been using [] on AlignedRead objects to take >> various subsets of the data (and then looking at quality scores, map >> positions, etc). I'm also taking some very small subsets (e.g. just >> the first 100 reads) to test and optimize some functions I'm >> writing. >> >> My confusion comes because even though I'm cutting down the number of >> seq reads by a lot (e.g. from 18 million to just 100 reads), the new >> AlignedRead object still takes up a lot of memory. > > XStringSet (including DNAStringSet and BStringSet, which are used to hold > AlignedRead sequence and quality information) are actually 'views' (start and > end coordinates) on an underlying XString; when you subset a DNAStringSet, > you subset the view but not the underlying DNAString. This might sound bad, > but actually you have just one instance of the DNAString, and two > light-weight views (the full view, and the subset view). If you're done with > the full DNAString, you can force it to be compacated with compact(), e.g., > after your first example > > > object.size(s1) > 50840 bytes > > object.size(s1[1:10]) # mostly smaller 'view' > 38984 bytes > > object.size(s1[1]) # a little smaller 'view', same big DNAString > 38864 bytes > > object.size(compact(s1[1])) > 3912 bytes > > or > > > object.size(aln) > 175968 bytes > > object.size(aln[1]) > 91488 bytes > > object.size(compact(aln[1])) > 21584 bytes > > (I think Herve is working on your weighted coverage question) > > Martin > >> >> Two examples are given below - in both cases the small object takes >> about half as much memory as the original, even though the number of >> reads is now very much smaller. >> >> Do you have any suggestions as to how I might reduce the memory >> footprint of the subsetted AlignedRead object? Is this an expected >> behavior? >> >> thanks very much, >> >> Janet >> >> >> library(ShortRead) >> >> exptPath<- system.file("extdata", package = "ShortRead") sp<- >> SolexaPath(exptPath) aln<- readAligned(sp, "s_2_export.txt") >> >> aln ## aln has 1000 reads aln_small<- aln[1:2] ### aln 2 has 2 >> reads >> >> object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes >> >> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln)) >> #### [1] 0.4978324 >> >> read2Dir<- >> "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2" >> >> > my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", > type="SolexaExport") >> my_reads_verysmall<- my_reads[1:100] >> >> length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100 >> >> object.size(my_reads) # 3190125528 bytes >> object.size(my_reads_verysmall) # 1753653496 bytes >> >> as.numeric(object.size(my_reads_verysmall)) / >> as.numeric(object.size(my_reads)) # [1] 0.549713 >> >> >> >> sessionInfo() >> >> R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386 >> (32-bit) >> >> locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >> >> attached base packages: [1] stats graphics grDevices utils >> datasets methods base >> >> other attached packages: [1] ShortRead_1.10.0 Rsamtools_1.4.1 >> lattice_0.19-26 Biostrings_2.20.0 [5] GenomicRanges_1.4.3 >> IRanges_1.10.0 >> >> loaded via a namespace (and not attached): [1] Biobase_2.12.1 >> grid_2.13.0 hwriter_1.3 >> >> _______________________________________________ Bioc-sig-sequencing >> mailing list Bioc-sig-sequencing@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing