Hi Janet,

You might have found an issue with the "compact" method for AlignedRead
objects (not sure, I'll look more into this), however I have a long
story for you about why you might not even need to use compact().

First thing to understand is that seeing a big "object size" is not
necessarily a problem. This is because object.size() is not doing a
very good job on objects that contain external pointers (e.g.
DNAString, DNAStringSet) or environments.

Here is an example:

  library(BSgenome.Celegans.UCSC.ce2)
  chrI <- Celegans$chrI

Then:

  > chrI
    15080483-letter "DNAString" instance
seq: GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA...TTAGGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC

  > object.size(chrI)
  15082800 bytes

  > gc()
            used (Mb) gc trigger (Mb) max used (Mb)
  Ncells 1129166 60.4    1590760 85.0  1368491 73.1
  Vcells 2386831 18.3    2968425 22.7  2397912 18.3

  > object.size(list(chrI, chrI))
  30165656 bytes

  > x <- rep.int(list(chrI), 100000)

  > object.size(x)
  1508280800040 bytes

Hey, how is this possible? I don't have that amount of memory on my
laptop!

Also:

  > gc()
            used (Mb) gc trigger (Mb) max used (Mb)
  Ncells 1129383 60.4    1710298 91.4  1590760 85.0
  Vcells 2486879 19.0    3196846 24.4  2948256 22.5

Clearly object.size() is not telling the truth.

What happens is that, strictly speaking, a DNAString object doesn't
"contain" the sequence data, but it contains a pointer to a memory
location where the sequence data is stored. This allows "sharing"
the sequence data by several objects. For example, other DNAString
instances can point to the same data as chrI:

  > subject <- chrI

  > subject@shared
  SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

  > chrI@shared
  SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

  > exon <- subseq(chrI, start=1001, width=200)

  > exon
    200-letter "DNAString" instance
seq: TTTTTCGGGTTTTTTGAAATGAATATCGTAGCTACA...CAAAATTTTTGGAAATTAGTTTAAAAATCTCACATT

  > exon@shared
  SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

Note that even if 'exon' is only "looking" at a small portion of the
shared data, it's still pointing to the same memory location (i.e. same
starting address). The bookkeeping of what part of the whole shared data
needs to be looked at exactly is handled via the "offset" and "length"
slots:

  > str(chrI)
  Formal class 'DNAString' [package "Biostrings"] with 5 slots
..@ shared :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
    .. .. ..@ xp                    :<externalptr>
    .. .. ..@ .link_to_cached_object:<environment: 0x5da1de8>
    ..@ offset         : int 0
    ..@ length         : int 15080483
    ..@ elementMetadata: NULL
    ..@ metadata       : list()

  > str(exon)
  Formal class 'DNAString' [package "Biostrings"] with 5 slots
..@ shared :Formal class 'SharedRaw' [package "IRanges"] with 2 slots
    .. .. ..@ xp                    :<externalptr>
    .. .. ..@ .link_to_cached_object:<environment: 0x5da1de8>
    ..@ offset         : int 1000
    ..@ length         : int 200
    ..@ elementMetadata: NULL
    ..@ metadata       : list()

The sequence data can also be shared by other types of objects. For example:

  > rna <- RNAString(exon)

  > rna
    200-letter "RNAString" instance
seq: UUUUUCGGGUUUUUUGAAAUGAAUAUCGUAGCUACA...CAAAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU

  > rna@shared
  SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

  > m <- matchPattern("UA", rna)

  > m
    Views on a 200-letter RNAString subject
subject: UUUUUCGGGUUUUUUGAAAUGAAUAUCGUAGCUA...AAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU
  views:
      start end width
  [1]    24  25     2 [UA]
  [2]    29  30     2 [UA]
  [3]    33  34     2 [UA]
  [4]   151 152     2 [UA]
  [5]   154 155     2 [UA]
  [6]   181 182     2 [UA]
  [7]   186 187     2 [UA]

  > subject(m)@shared
  SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

  > m[[4]]
    2-letter "RNAString" instance
  seq: UA

  > m[[4]]@shared
  SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

  > dnaset <- DNAStringSet(m)

  > dnaset
    A DNAStringSet instance of length 7
      width seq
  [1]     2 TA
  [2]     2 TA
  [3]     2 TA
  [4]     2 TA
  [5]     2 TA
  [6]     2 TA
  [7]     2 TA

  > dnaset@pool
  SharedRaw_Pool of length 1
  1: SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

As you can see, all these operations happen without copying the original
sequence data (chromosome I). Hence they are very fast, and, most of the
times (but not always, more on this below) very memory efficient too,
despite what's reported by object.size(). If the original sequence data
needs to be modified, then it will be copied before modification:

  > subseq(rna, start=7, width=3) <- RNAString("AA")

  > rna
    199-letter "RNAString" instance
seq: UUUUUCAAUUUUUUGAAAUGAAUAUCGUAGCUACAG...CAAAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU

  > rna@shared
  SharedRaw of length 199 (data starting at address 0x33a3b58)

Concatenation too will trigger a copy:

  > uaua <- c(m[[4]], m[[1]])

  > uaua
    4-letter "RNAString" instance
  seq: UAUA

  > uaua@shared
  SharedRaw of length 4 (data starting at address 0x53a1120)

For DNAStringSet objects (and XStringSet objects in general) the
mechanism used for sharing sequence data is a little bit more
complicated than for XString objects: instead of a single pointer,
they use a list of pointers (stored in the "pool" slot) so the
sequence data can be fragmented into several memory chunks (each
chunk can be shared with other objects). For example:

  > dnaset2 <- c(DNAStringSet(rna), DNAStringSet(uaua))

  > dnaset2
    A DNAStringSet instance of length 2
      width seq
[1] 199 TTTTTCAATTTTTTGAAATGAATATCGTAGCTAC...AATTTTTGGAAATTAGTTTAAAAATCTCACATT
  [2]     4 TATA

  > dnaset2@pool
  SharedRaw_Pool of length 2
  1: SharedRaw of length 199 (data starting at address 0x33a3b58)
  2: SharedRaw of length 4 (data starting at address 0x53a1120)

To summarize, all the XString, XStringSet, XStringViews, MaskedXString
objects use SharedRaw objects internally to store the sequence data
(the SharedRaw class is an internal class defined in IRanges).
You can think of those SharedRaw objects as the basic storage units
for storing sequence data i.e. each of them is a chunk of memory
containing sequence data. It can be shared (i.e. more than 1 object
is "using it") or not (only 1 object is "using it"). If no object
is using it, then it's a candidate for garbbage collection.

For example, at this point, the SharedRaw object containing
chromosome I is used by the following objects: 'chrI', 'x',
'subject', 'exon', 'm' and 'dnaset'. So all of them would need
to be removed (with rm()) for the SharedRaw object to be garbbage
collected.

Note that IRanges/Biostrings don't contain any code for handling
this, it just relies on the standard behaviour of external pointers
in R (a SharedRaw object is basically an R external pointer pointing
to a raw vector) and how the garbbage collector handles them.

OK, now back to your original question. When you subset your
AlignedRead object, the "small" AlignedRead object is still
pointing to the original sequence data:

  > aln0
  class: AlignedRead
  length: 1000 reads; width: 35 cycles
  chromosome: NM NM ... chr5.fa 29:255:255
  position: NA NA ... 71805980 NA
  strand: NA NA ... + NA
  alignQuality: NumericQuality
  alignData varLabels: run lane ... filtering contig

  > sread(aln0)@pool
  SharedRaw_Pool of length 1
  1: SharedRaw of length 35000 (data starting at address 0x810f418)

  > sread(aln0[1:10])@pool
  SharedRaw_Pool of length 1
  1: SharedRaw of length 35000 (data starting at address 0x810f418)

Hence the strange result reported by object.size():

  > object.size(aln0)
  175968 bytes

  > object.size(aln0[1:10])
  92480 bytes

'aln0[1:10]' has a smaller size, but not as small as one would expect.

Does it matter? It depends.

If you want to save the small object to a file (serialization), then
yes (the whole SharedRaw object will be saved, which is not good).
In that case, you want to "compact" the object first. compact() was
specifically introduced for this use case: it creates a "full" copy
of the original object ("full" here means that even the sequence data
are copied), but only the sequence data that are effectively "being
looked at" are copied. The resulting object is semantically equivalent
to the original but its internal representation and size are different:

  > dnaset
    A DNAStringSet instance of length 7
      width seq
  [1]     2 TA
  [2]     2 TA
  [3]     2 TA
  [4]     2 TA
  [5]     2 TA
  [6]     2 TA
  [7]     2 TA

  > dnaset@pool
  SharedRaw_Pool of length 1
  1: SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

  > object.size(dnaset)
  15084424 bytes

  > compact(dnaset)
    A DNAStringSet instance of length 7
      width seq
  [1]     2 TA
  [2]     2 TA
  [3]     2 TA
  [4]     2 TA
  [5]     2 TA
  [6]     2 TA
  [7]     2 TA

  > compact(dnaset)@pool
  SharedRaw_Pool of length 1
  1: SharedRaw of length 14 (data starting at address 0x80ed740)

  > object.size(compact(dnaset))
  3952 bytes

Note that, in this particular situation, compact() could take
advantage of the fact that the elements in 'dnaset' are repeated
(from a sequence content point of view, not from a memory location
point of view) to achieve greater size reduction... but that's
another story.

Also if you work one chromosome at a time in a loop, and if the
objects you create within the loop are pointing to the chromosome
sequence, and if those references are going to persist after you
are done with the chromosome, then yes, you want to compact those
objects. By compacting them you remove the references to the
chromosome sequence and therefore allow it to be garbbage collected.
Otherwise you will end up with all the chromosome sequences loaded
into memory!

For example, you wouldn't want to do something like this:

  lapply(seqnames(Celegans),
         function(seqname)
             matchPattern("CAACTTAC", Celegans[[seqname]]))

because the result would be a list of XStringViews objects, one
per chromosome, each of them pointing to the chromosome sequence.
That will work for Celegans but for Hsapiens you will end up with
all the chromosomes in memory!

For your use case, the benefit of using compact() on the subsetted
AlignedRead objects depends on what you will do with it. If those
small objects are just temporary (e.g. you will do coverage() on
them, and you won't care about the subsetted AlignedRead objects
anymore), then I would recommend that you do not use compact(): it
will slow down things and you will end up using more memory than
you need (because compact() creates a copy). But if you want to keep
them around, but also want to be able to remove the big AlignedRead
object in order to save memory, then you might want to compact them.

One thing to remember is that object.size() should tell the truth
on a compacted object. And also summing the individual object.size()
for a collection of compacted objects tells the truth about the
memory effectively used by the collection. But summing the
individual object.size() in general does not tell the truth:
it might overestimate the amount of memory effectively used,
because it might count the memory used by a SharedRaw objects
several times (as many times as the SharedRaw object is used).

Hope the story was helpful and not too long ;-)

H.


On 11-05-10 05:16 PM, Janet Young wrote:
Hi again,

I've been looking into this question a bit more, now using compact(). Things 
seemed to go well with the example dataset provided with ShortRead (as far as I 
can tell). But when I use our own much larger dataset the compaction factor is 
not very good (my small object is the first 100 reads, the full set is ~18 
million reads, but the small object is 40% the size in memory of the full 
object).    I think it looks like quality and sreads haven't really been 
compacted nearly as much as I would expect. The code is below.



Martin - if you would like access to this dataset I can email you about that 
privately, but I'm guessing the same would happen with any big dataset?

Janet



library(ShortRead)

########## a big dataset (our data)

read2Dir<- 
"solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"

my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
my_reads_verysmall<- my_reads[1:100]
my_reads_verysmall_compact<- compact(my_reads[1:100])

length(my_reads)
# [1] 17894091
length(my_reads_verysmall)
# [1] 100
length(my_reads_verysmall_compact)
# [1] 100

object.size(my_reads)
# 3190125528 bytes
object.size(my_reads_verysmall)
# 1753653496 bytes
object.size(my_reads_verysmall_compact)
# 1253663384 bytes

as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads))
# [1] 0.549713

as.numeric(object.size(my_reads_verysmall_compact)) / 
as.numeric(object.size(my_reads))
# [1] 0.3929825

##### where is most of that memory?
object.size(position(my_reads_verysmall_compact))
# 440 bytes
object.size(chromosome(my_reads_verysmall_compact))
# 3040 bytes
object.size(sread(my_reads_verysmall_compact))
# 626820976 bytes
object.size(quality(my_reads_verysmall_compact))
# 626821568 bytes

myDNA2<-  DNAStringSet(as.character(sread(my_reads_verysmall_compact)))
object.size(myDNA2)
# 9944 bytes


###### and ShortRead's example dataset, 1000 reads

exptPath<- system.file("extdata", package = "ShortRead")
sp<- SolexaPath(exptPath)
aln<- readAligned(sp, "s_2_export.txt")

aln  ## aln has 1000 reads
aln_small<- aln[1:2]   ### aln 2 has 2 reads
aln_small_compact<- compact(aln[1:2])

object.size(aln)
# 175968 bytes
object.size(aln_small)
# 91488 bytes
object.size(aln_small_compact)
# 21744 bytes

as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
# [1] 0.5199127

as.numeric(object.size(aln_small_compact)) / as.numeric(object.size(aln))
# [1] 0.1235679

myDNA<- DNAStringSet( as.character(sread(aln_small_compact)) )
identical(myDNA,sread(aln_small_compact))
[1] TRUE

aln_medium_compact<- compact(aln[1:300])
myDNA_medium<- DNAStringSet( as.character(sread(aln_medium_compact)) )
identical(myDNA,sread(aln_medium_compact))
# [1] FALSE

aln_10_compact<- compact(aln[1:10])
myDNA_10<- DNAStringSet( as.character(sread(aln_10_compact)) )
identical(myDNA,sread(aln_10_compact))
# [1] FALSE

as.numeric(object.size(aln_10_compact)) / as.numeric(object.size(aln))
# [1] 0.1317512

#########


sessionInfo()

R version 2.13.0 (2011-04-13)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ShortRead_1.10.0    Rsamtools_1.4.0     lattice_0.19-23
[4] Biostrings_2.20.0   GenomicRanges_1.4.0 IRanges_1.10.0

loaded via a namespace (and not attached):
[1] Biobase_2.12.0 grid_2.13.0    hwriter_1.3




On May 10, 2011, at 4:25 PM, Martin Morgan wrote:

On 05/10/2011 02:47 PM, Janet Young wrote:
Hi, (probably hello to you, Martin)

Hi (or is that hello?) Janet --

I'm looking at some Illumina seq data, and trying to be more rigorous
than I have been in the past about memory usage and tidying up unused
variables. I'm a little mystified by something - I wonder if you can
help me understand?

I'm starting with a big AlignedRead object (one full lane of seq
data) and then I've been using [] on AlignedRead objects to take
various subsets of the data (and then looking at quality scores, map
positions, etc).   I'm also taking some very small subsets (e.g. just
the first 100 reads) to test and optimize some functions I'm
writing.

My confusion comes because even though I'm cutting down the number of
seq reads by a lot (e.g. from 18 million to just 100 reads), the new
AlignedRead object still takes up a lot of memory.

XStringSet (including DNAStringSet and BStringSet, which are used to hold 
AlignedRead sequence and quality information) are actually 'views' (start and 
end coordinates) on an underlying XString; when you subset a DNAStringSet, you 
subset the view but not the underlying DNAString. This might sound bad, but 
actually you have just one instance of the DNAString, and two light-weight 
views (the full view, and the subset view). If you're done with the full 
DNAString, you can force it to be compacated with compact(), e.g., after your 
first example

object.size(s1)
50840 bytes
object.size(s1[1:10]) # mostly smaller 'view'
38984 bytes
object.size(s1[1]) # a little smaller 'view', same big DNAString
38864 bytes
object.size(compact(s1[1]))
3912 bytes

or

object.size(aln)
175968 bytes
object.size(aln[1])
91488 bytes
object.size(compact(aln[1]))
21584 bytes

(I think Herve is working on your weighted coverage question)

Martin


Two examples are given below - in both cases the small object takes
about half as much memory as the original, even though the number of
reads is now very much smaller.

Do you have any suggestions as to how I might reduce the memory
footprint of the subsetted AlignedRead object?  Is this an expected
behavior?

thanks very much,

Janet


library(ShortRead)

exptPath<- system.file("extdata", package = "ShortRead") sp<-
SolexaPath(exptPath) aln<- readAligned(sp, "s_2_export.txt")

aln  ## aln has 1000 reads aln_small<- aln[1:2]   ### aln 2 has 2
reads

object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes

as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
#### [1] 0.4978324

read2Dir<-
"data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"


my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
my_reads_verysmall<- my_reads[1:100]

length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100

object.size(my_reads) # 3190125528 bytes
object.size(my_reads_verysmall) # 1753653496 bytes

as.numeric(object.size(my_reads_verysmall)) /
as.numeric(object.size(my_reads)) # [1] 0.549713



sessionInfo()

R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386
(32-bit)

locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats     graphics  grDevices utils
datasets  methods   base

other attached packages: [1] ShortRead_1.10.0    Rsamtools_1.4.1
lattice_0.19-26     Biostrings_2.20.0 [5] GenomicRanges_1.4.3
IRanges_1.10.0

loaded via a namespace (and not attached): [1] Biobase_2.12.1
grid_2.13.0    hwriter_1.3

_______________________________________________ Bioc-sig-sequencing
mailing list Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to