Hi again, Thank you, Herve! That made perfect sense and was very helpful (it explains some issues I had a while ago too). I'm slowly learning, with help from you all.
I'll need to think a little more about how I monitor my memory usage now that I understand what you've told me about object.size and sharing. I started getting into this whole question because I noticed I'd done something to clog up the memory and swap in a large-memory machine, so my first thought was to look at what big objects were around in my session. Are there any other tools I should look into to help me figure out where all my memory is being used? (I suspect there was something wrong in that particular session - I haven't had the same trouble since doing very similar things). Have a good weekend, Janet On May 11, 2011, at 3:11 PM, Hervé Pagès wrote: > Hi Janet, > > You might have found an issue with the "compact" method for AlignedRead > objects (not sure, I'll look more into this), however I have a long > story for you about why you might not even need to use compact(). > > First thing to understand is that seeing a big "object size" is not > necessarily a problem. This is because object.size() is not doing a > very good job on objects that contain external pointers (e.g. > DNAString, DNAStringSet) or environments. > > Here is an example: > > library(BSgenome.Celegans.UCSC.ce2) > chrI <- Celegans$chrI > > Then: > > > chrI > 15080483-letter "DNAString" instance > seq: > GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA...TTAGGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC > > > object.size(chrI) > 15082800 bytes > > > gc() > used (Mb) gc trigger (Mb) max used (Mb) > Ncells 1129166 60.4 1590760 85.0 1368491 73.1 > Vcells 2386831 18.3 2968425 22.7 2397912 18.3 > > > object.size(list(chrI, chrI)) > 30165656 bytes > > > x <- rep.int(list(chrI), 100000) > > > object.size(x) > 1508280800040 bytes > > Hey, how is this possible? I don't have that amount of memory on my > laptop! > > Also: > > > gc() > used (Mb) gc trigger (Mb) max used (Mb) > Ncells 1129383 60.4 1710298 91.4 1590760 85.0 > Vcells 2486879 19.0 3196846 24.4 2948256 22.5 > > Clearly object.size() is not telling the truth. > > What happens is that, strictly speaking, a DNAString object doesn't > "contain" the sequence data, but it contains a pointer to a memory > location where the sequence data is stored. This allows "sharing" > the sequence data by several objects. For example, other DNAString > instances can point to the same data as chrI: > > > subject <- chrI > > > subject@shared > SharedRaw of length 15080483 (data starting at address 0x7fd30293e038) > > > chrI@shared > SharedRaw of length 15080483 (data starting at address 0x7fd30293e038) > > > exon <- subseq(chrI, start=1001, width=200) > > > exon > 200-letter "DNAString" instance > seq: > TTTTTCGGGTTTTTTGAAATGAATATCGTAGCTACA...CAAAATTTTTGGAAATTAGTTTAAAAATCTCACATT > > > exon@shared > SharedRaw of length 15080483 (data starting at address 0x7fd30293e038) > > Note that even if 'exon' is only "looking" at a small portion of the > shared data, it's still pointing to the same memory location (i.e. same > starting address). The bookkeeping of what part of the whole shared data > needs to be looked at exactly is handled via the "offset" and "length" > slots: > > > str(chrI) > Formal class 'DNAString' [package "Biostrings"] with 5 slots > ..@ shared :Formal class 'SharedRaw' [package "IRanges"] with 2 > slots > .. .. ..@ xp :<externalptr> > .. .. ..@ .link_to_cached_object:<environment: 0x5da1de8> > ..@ offset : int 0 > ..@ length : int 15080483 > ..@ elementMetadata: NULL > ..@ metadata : list() > > > str(exon) > Formal class 'DNAString' [package "Biostrings"] with 5 slots > ..@ shared :Formal class 'SharedRaw' [package "IRanges"] with 2 > slots > .. .. ..@ xp :<externalptr> > .. .. ..@ .link_to_cached_object:<environment: 0x5da1de8> > ..@ offset : int 1000 > ..@ length : int 200 > ..@ elementMetadata: NULL > ..@ metadata : list() > > The sequence data can also be shared by other types of objects. For example: > > > rna <- RNAString(exon) > > > rna > 200-letter "RNAString" instance > seq: > UUUUUCGGGUUUUUUGAAAUGAAUAUCGUAGCUACA...CAAAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU > > > rna@shared > SharedRaw of length 15080483 (data starting at address 0x7fd30293e038) > > > m <- matchPattern("UA", rna) > > > m > Views on a 200-letter RNAString subject > subject: > UUUUUCGGGUUUUUUGAAAUGAAUAUCGUAGCUA...AAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU > views: > start end width > [1] 24 25 2 [UA] > [2] 29 30 2 [UA] > [3] 33 34 2 [UA] > [4] 151 152 2 [UA] > [5] 154 155 2 [UA] > [6] 181 182 2 [UA] > [7] 186 187 2 [UA] > > > subject(m)@shared > SharedRaw of length 15080483 (data starting at address 0x7fd30293e038) > > > m[[4]] > 2-letter "RNAString" instance > seq: UA > > > m[[4]]@shared > SharedRaw of length 15080483 (data starting at address 0x7fd30293e038) > > > dnaset <- DNAStringSet(m) > > > dnaset > A DNAStringSet instance of length 7 > width seq > [1] 2 TA > [2] 2 TA > [3] 2 TA > [4] 2 TA > [5] 2 TA > [6] 2 TA > [7] 2 TA > > > dnaset@pool > SharedRaw_Pool of length 1 > 1: SharedRaw of length 15080483 (data starting at address 0x7fd30293e038) > > As you can see, all these operations happen without copying the original > sequence data (chromosome I). Hence they are very fast, and, most of the > times (but not always, more on this below) very memory efficient too, > despite what's reported by object.size(). If the original sequence data > needs to be modified, then it will be copied before modification: > > > subseq(rna, start=7, width=3) <- RNAString("AA") > > > rna > 199-letter "RNAString" instance > seq: > UUUUUCAAUUUUUUGAAAUGAAUAUCGUAGCUACAG...CAAAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU > > > rna@shared > SharedRaw of length 199 (data starting at address 0x33a3b58) > > Concatenation too will trigger a copy: > > > uaua <- c(m[[4]], m[[1]]) > > > uaua > 4-letter "RNAString" instance > seq: UAUA > > > uaua@shared > SharedRaw of length 4 (data starting at address 0x53a1120) > > For DNAStringSet objects (and XStringSet objects in general) the > mechanism used for sharing sequence data is a little bit more > complicated than for XString objects: instead of a single pointer, > they use a list of pointers (stored in the "pool" slot) so the > sequence data can be fragmented into several memory chunks (each > chunk can be shared with other objects). For example: > > > dnaset2 <- c(DNAStringSet(rna), DNAStringSet(uaua)) > > > dnaset2 > A DNAStringSet instance of length 2 > width seq > [1] 199 > TTTTTCAATTTTTTGAAATGAATATCGTAGCTAC...AATTTTTGGAAATTAGTTTAAAAATCTCACATT > [2] 4 TATA > > > dnaset2@pool > SharedRaw_Pool of length 2 > 1: SharedRaw of length 199 (data starting at address 0x33a3b58) > 2: SharedRaw of length 4 (data starting at address 0x53a1120) > > To summarize, all the XString, XStringSet, XStringViews, MaskedXString > objects use SharedRaw objects internally to store the sequence data > (the SharedRaw class is an internal class defined in IRanges). > You can think of those SharedRaw objects as the basic storage units > for storing sequence data i.e. each of them is a chunk of memory > containing sequence data. It can be shared (i.e. more than 1 object > is "using it") or not (only 1 object is "using it"). If no object > is using it, then it's a candidate for garbbage collection. > > For example, at this point, the SharedRaw object containing > chromosome I is used by the following objects: 'chrI', 'x', > 'subject', 'exon', 'm' and 'dnaset'. So all of them would need > to be removed (with rm()) for the SharedRaw object to be garbbage > collected. > > Note that IRanges/Biostrings don't contain any code for handling > this, it just relies on the standard behaviour of external pointers > in R (a SharedRaw object is basically an R external pointer pointing > to a raw vector) and how the garbbage collector handles them. > > OK, now back to your original question. When you subset your > AlignedRead object, the "small" AlignedRead object is still > pointing to the original sequence data: > > > aln0 > class: AlignedRead > length: 1000 reads; width: 35 cycles > chromosome: NM NM ... chr5.fa 29:255:255 > position: NA NA ... 71805980 NA > strand: NA NA ... + NA > alignQuality: NumericQuality > alignData varLabels: run lane ... filtering contig > > > sread(aln0)@pool > SharedRaw_Pool of length 1 > 1: SharedRaw of length 35000 (data starting at address 0x810f418) > > > sread(aln0[1:10])@pool > SharedRaw_Pool of length 1 > 1: SharedRaw of length 35000 (data starting at address 0x810f418) > > Hence the strange result reported by object.size(): > > > object.size(aln0) > 175968 bytes > > > object.size(aln0[1:10]) > 92480 bytes > > 'aln0[1:10]' has a smaller size, but not as small as one would expect. > > Does it matter? It depends. > > If you want to save the small object to a file (serialization), then > yes (the whole SharedRaw object will be saved, which is not good). > In that case, you want to "compact" the object first. compact() was > specifically introduced for this use case: it creates a "full" copy > of the original object ("full" here means that even the sequence data > are copied), but only the sequence data that are effectively "being > looked at" are copied. The resulting object is semantically equivalent > to the original but its internal representation and size are different: > > > dnaset > A DNAStringSet instance of length 7 > width seq > [1] 2 TA > [2] 2 TA > [3] 2 TA > [4] 2 TA > [5] 2 TA > [6] 2 TA > [7] 2 TA > > > dnaset@pool > SharedRaw_Pool of length 1 > 1: SharedRaw of length 15080483 (data starting at address 0x7fd30293e038) > > > object.size(dnaset) > 15084424 bytes > > > compact(dnaset) > A DNAStringSet instance of length 7 > width seq > [1] 2 TA > [2] 2 TA > [3] 2 TA > [4] 2 TA > [5] 2 TA > [6] 2 TA > [7] 2 TA > > > compact(dnaset)@pool > SharedRaw_Pool of length 1 > 1: SharedRaw of length 14 (data starting at address 0x80ed740) > > > object.size(compact(dnaset)) > 3952 bytes > > Note that, in this particular situation, compact() could take > advantage of the fact that the elements in 'dnaset' are repeated > (from a sequence content point of view, not from a memory location > point of view) to achieve greater size reduction... but that's > another story. > > Also if you work one chromosome at a time in a loop, and if the > objects you create within the loop are pointing to the chromosome > sequence, and if those references are going to persist after you > are done with the chromosome, then yes, you want to compact those > objects. By compacting them you remove the references to the > chromosome sequence and therefore allow it to be garbbage collected. > Otherwise you will end up with all the chromosome sequences loaded > into memory! > > For example, you wouldn't want to do something like this: > > lapply(seqnames(Celegans), > function(seqname) > matchPattern("CAACTTAC", Celegans[[seqname]])) > > because the result would be a list of XStringViews objects, one > per chromosome, each of them pointing to the chromosome sequence. > That will work for Celegans but for Hsapiens you will end up with > all the chromosomes in memory! > > For your use case, the benefit of using compact() on the subsetted > AlignedRead objects depends on what you will do with it. If those > small objects are just temporary (e.g. you will do coverage() on > them, and you won't care about the subsetted AlignedRead objects > anymore), then I would recommend that you do not use compact(): it > will slow down things and you will end up using more memory than > you need (because compact() creates a copy). But if you want to keep > them around, but also want to be able to remove the big AlignedRead > object in order to save memory, then you might want to compact them. > > One thing to remember is that object.size() should tell the truth > on a compacted object. And also summing the individual object.size() > for a collection of compacted objects tells the truth about the > memory effectively used by the collection. But summing the > individual object.size() in general does not tell the truth: > it might overestimate the amount of memory effectively used, > because it might count the memory used by a SharedRaw objects > several times (as many times as the SharedRaw object is used). > > Hope the story was helpful and not too long ;-) > > H. > > > On 11-05-10 05:16 PM, Janet Young wrote: >> Hi again, >> >> I've been looking into this question a bit more, now using compact(). Things >> seemed to go well with the example dataset provided with ShortRead (as far >> as I can tell). But when I use our own much larger dataset the compaction >> factor is not very good (my small object is the first 100 reads, the full >> set is ~18 million reads, but the small object is 40% the size in memory of >> the full object). I think it looks like quality and sreads haven't really >> been compacted nearly as much as I would expect. The code is below. > > >> >> Martin - if you would like access to this dataset I can email you about that >> privately, but I'm guessing the same would happen with any big dataset? >> >> Janet >> >> >> >> library(ShortRead) >> >> ########## a big dataset (our data) >> >> read2Dir<- >> "solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2" >> >> my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", >> type="SolexaExport") >> my_reads_verysmall<- my_reads[1:100] >> my_reads_verysmall_compact<- compact(my_reads[1:100]) >> >> length(my_reads) >> # [1] 17894091 >> length(my_reads_verysmall) >> # [1] 100 >> length(my_reads_verysmall_compact) >> # [1] 100 >> >> object.size(my_reads) >> # 3190125528 bytes >> object.size(my_reads_verysmall) >> # 1753653496 bytes >> object.size(my_reads_verysmall_compact) >> # 1253663384 bytes >> >> as.numeric(object.size(my_reads_verysmall)) / >> as.numeric(object.size(my_reads)) >> # [1] 0.549713 >> >> as.numeric(object.size(my_reads_verysmall_compact)) / >> as.numeric(object.size(my_reads)) >> # [1] 0.3929825 >> >> ##### where is most of that memory? >> object.size(position(my_reads_verysmall_compact)) >> # 440 bytes >> object.size(chromosome(my_reads_verysmall_compact)) >> # 3040 bytes >> object.size(sread(my_reads_verysmall_compact)) >> # 626820976 bytes >> object.size(quality(my_reads_verysmall_compact)) >> # 626821568 bytes >> >> myDNA2<- DNAStringSet(as.character(sread(my_reads_verysmall_compact))) >> object.size(myDNA2) >> # 9944 bytes >> >> >> ###### and ShortRead's example dataset, 1000 reads >> >> exptPath<- system.file("extdata", package = "ShortRead") >> sp<- SolexaPath(exptPath) >> aln<- readAligned(sp, "s_2_export.txt") >> >> aln ## aln has 1000 reads >> aln_small<- aln[1:2] ### aln 2 has 2 reads >> aln_small_compact<- compact(aln[1:2]) >> >> object.size(aln) >> # 175968 bytes >> object.size(aln_small) >> # 91488 bytes >> object.size(aln_small_compact) >> # 21744 bytes >> >> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln)) >> # [1] 0.5199127 >> >> as.numeric(object.size(aln_small_compact)) / as.numeric(object.size(aln)) >> # [1] 0.1235679 >> >> myDNA<- DNAStringSet( as.character(sread(aln_small_compact)) ) >> identical(myDNA,sread(aln_small_compact)) >> [1] TRUE >> >> aln_medium_compact<- compact(aln[1:300]) >> myDNA_medium<- DNAStringSet( as.character(sread(aln_medium_compact)) ) >> identical(myDNA,sread(aln_medium_compact)) >> # [1] FALSE >> >> aln_10_compact<- compact(aln[1:10]) >> myDNA_10<- DNAStringSet( as.character(sread(aln_10_compact)) ) >> identical(myDNA,sread(aln_10_compact)) >> # [1] FALSE >> >> as.numeric(object.size(aln_10_compact)) / as.numeric(object.size(aln)) >> # [1] 0.1317512 >> >> ######### >> >> >> sessionInfo() >> >> R version 2.13.0 (2011-04-13) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] ShortRead_1.10.0 Rsamtools_1.4.0 lattice_0.19-23 >> [4] Biostrings_2.20.0 GenomicRanges_1.4.0 IRanges_1.10.0 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.12.0 grid_2.13.0 hwriter_1.3 >> >> >> >> >> On May 10, 2011, at 4:25 PM, Martin Morgan wrote: >> >>> On 05/10/2011 02:47 PM, Janet Young wrote: >>>> Hi, (probably hello to you, Martin) >>> >>> Hi (or is that hello?) Janet -- >>> >>>> I'm looking at some Illumina seq data, and trying to be more rigorous >>>> than I have been in the past about memory usage and tidying up unused >>>> variables. I'm a little mystified by something - I wonder if you can >>>> help me understand? >>>> >>>> I'm starting with a big AlignedRead object (one full lane of seq >>>> data) and then I've been using [] on AlignedRead objects to take >>>> various subsets of the data (and then looking at quality scores, map >>>> positions, etc). I'm also taking some very small subsets (e.g. just >>>> the first 100 reads) to test and optimize some functions I'm >>>> writing. >>>> >>>> My confusion comes because even though I'm cutting down the number of >>>> seq reads by a lot (e.g. from 18 million to just 100 reads), the new >>>> AlignedRead object still takes up a lot of memory. >>> >>> XStringSet (including DNAStringSet and BStringSet, which are used to hold >>> AlignedRead sequence and quality information) are actually 'views' (start >>> and end coordinates) on an underlying XString; when you subset a >>> DNAStringSet, you subset the view but not the underlying DNAString. This >>> might sound bad, but actually you have just one instance of the DNAString, >>> and two light-weight views (the full view, and the subset view). If you're >>> done with the full DNAString, you can force it to be compacated with >>> compact(), e.g., after your first example >>> >>>> object.size(s1) >>> 50840 bytes >>>> object.size(s1[1:10]) # mostly smaller 'view' >>> 38984 bytes >>>> object.size(s1[1]) # a little smaller 'view', same big DNAString >>> 38864 bytes >>>> object.size(compact(s1[1])) >>> 3912 bytes >>> >>> or >>> >>>> object.size(aln) >>> 175968 bytes >>>> object.size(aln[1]) >>> 91488 bytes >>>> object.size(compact(aln[1])) >>> 21584 bytes >>> >>> (I think Herve is working on your weighted coverage question) >>> >>> Martin >>> >>>> >>>> Two examples are given below - in both cases the small object takes >>>> about half as much memory as the original, even though the number of >>>> reads is now very much smaller. >>>> >>>> Do you have any suggestions as to how I might reduce the memory >>>> footprint of the subsetted AlignedRead object? Is this an expected >>>> behavior? >>>> >>>> thanks very much, >>>> >>>> Janet >>>> >>>> >>>> library(ShortRead) >>>> >>>> exptPath<- system.file("extdata", package = "ShortRead") sp<- >>>> SolexaPath(exptPath) aln<- readAligned(sp, "s_2_export.txt") >>>> >>>> aln ## aln has 1000 reads aln_small<- aln[1:2] ### aln 2 has 2 >>>> reads >>>> >>>> object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes >>>> >>>> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln)) >>>> #### [1] 0.4978324 >>>> >>>> read2Dir<- >>>> "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2" >>>> >>>> >>> my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", >>> type="SolexaExport") >>>> my_reads_verysmall<- my_reads[1:100] >>>> >>>> length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100 >>>> >>>> object.size(my_reads) # 3190125528 bytes >>>> object.size(my_reads_verysmall) # 1753653496 bytes >>>> >>>> as.numeric(object.size(my_reads_verysmall)) / >>>> as.numeric(object.size(my_reads)) # [1] 0.549713 >>>> >>>> >>>> >>>> sessionInfo() >>>> >>>> R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386 >>>> (32-bit) >>>> >>>> locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 >>>> >>>> attached base packages: [1] stats graphics grDevices utils >>>> datasets methods base >>>> >>>> other attached packages: [1] ShortRead_1.10.0 Rsamtools_1.4.1 >>>> lattice_0.19-26 Biostrings_2.20.0 [5] GenomicRanges_1.4.3 >>>> IRanges_1.10.0 >>>> >>>> loaded via a namespace (and not attached): [1] Biobase_2.12.1 >>>> grid_2.13.0 hwriter_1.3 >>>> >>>> _______________________________________________ Bioc-sig-sequencing >>>> mailing list Bioc-sig-sequencing@r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>> >>> >>> -- >>> Computational Biology >>> Fred Hutchinson Cancer Research Center >>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>> >>> Location: M1-B861 >>> Telephone: 206 667-2793 >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> Bioc-sig-sequencing@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing