First, a very easy question: What is the difference between using
what="character" and what=character() in scan()? What is the reason for
the character() syntax?
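To make the question concrete, here is a small check I put together. It uses a made-up inline string instead of my real file, and it relies on my reading of ?scan that only the type of `what` matters, not its value. If that reading is right, the two forms agree for character data, but the quoted form would be misleading for other types:

```r
# Made-up input standing in for a file of marker names (not my real data).
txt <- "rs123 rs456 rs789"

# Both calls pass a character vector as `what`; one has length 1, one length 0.
a <- scan(text = txt, what = "character", quiet = TRUE)
b <- scan(text = txt, what = character(), quiet = TRUE)
same_result <- identical(a, b)

# The quoted form does not name a type: "integer" is itself a character
# string, so the items are read as character, not as integers.
as_string <- scan(text = "1 2 3", what = "integer", quiet = TRUE)
as_int    <- scan(text = "1 2 3", what = integer(), quiet = TRUE)

print(same_result)
print(typeof(as_string))
print(typeof(as_int))
```

If this is the whole story, then character() is just a zero-length template of the right type, in the same way integer() is.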
I am working with some character vectors that are up to about 27.5 million
elements long. The elements are always unique. Specifically, these are
names of genetic markers. This is how much memory those names take up:
> snps <- scan("SNPs.txt", what=character())
Read 27446736 items
> object.size(snps)
1756363648 bytes
> object.size(snps)/length(snps)
63.9917128215173 bytes
As you can see, that's about 1.76 GB of memory for the vector at an
average of 64 bytes per element. The longest string is only 14 bytes,
though. The file takes up 313 MB.
Using 64 bytes per element instead of 14 bytes per element is costing me a
total of 1,372,336,800 bytes. In a different example where the longest
string is 4 characters, the elements each use 8 bytes. So it looks like
I'm stuck with either 8 bytes or 64 bytes. Is that true? There is no way
to modify that?
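One thing I have not ruled out is whether the 8-byte figure comes from repeated values rather than from shorter strings. The sketch below is a way to test that on a small scale. The vectors are made up, and the guess that duplicates are stored only once is an assumption on my part, not something I have confirmed:

```r
# Compare per-element size for all-unique names versus one repeated name.
# Vector contents are invented for illustration.
n <- 1000L
unique_names   <- paste("rs", seq_len(n), sep = "")  # every element distinct
repeated_names <- rep("rs1", n)                      # one string, n times

per_unique   <- as.numeric(utils::object.size(unique_names)) / n
per_repeated <- as.numeric(utils::object.size(repeated_names)) / n

cat("bytes per element, unique:  ", per_unique, "\n")
cat("bytes per element, repeated:", per_repeated, "\n")
```

If the repeated case comes out near 8 bytes per element, then the two sizes I have seen would be explained by duplication rather than by string length.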
By the way...
It turns out that 99.72% of those character strings are of the form
paste("rs", Int, sep="") where Int is an integer of no more than 9 digits.
So if I use only those markers, drop the "rs" off, and load them as
integers, I see a huge improvement:
> snps <- scan("SNPs_rs.txt", what=integer())
Read 27369706 items
> object.size(snps)
109478864 bytes
> object.size(snps)/length(snps)
4.00000146146985 bytes
That saves 93.8% of the memory by dropping 0.28% of the markers and
encoding the rest as integers instead of strings. I might end up keeping
all of the markers and encoding the remaining non-"rs" names as negative
integers.
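Here is a rough sketch of what I have in mind for the negative-integer idea. The function names encode_snps and decode_snps and the example marker names are hypothetical, and I am assuming that an "rs" name with leading zeros in the number should be treated as one of the "other" names so that it decodes back to the exact same string:

```r
# Hypothetical helper: map marker names to integers.
# "rs<digits>" (1 to 9 digits, no leading zero) becomes the positive integer.
# Any other name becomes -k, where k indexes a lookup table of leftover names.
encode_snps <- function(names) {
  is_rs <- grepl("^rs[1-9][0-9]{0,8}$", names)
  other <- unique(names[!is_rs])
  codes <- integer(length(names))
  codes[is_rs]  <- as.integer(sub("^rs", "", names[is_rs]))
  codes[!is_rs] <- -match(names[!is_rs], other)
  list(codes = codes, other = other)
}

# Hypothetical inverse: rebuild the original names from the codes.
decode_snps <- function(enc) {
  out <- character(length(enc$codes))
  pos <- enc$codes > 0L
  out[pos]  <- paste("rs", enc$codes[pos], sep = "")
  out[!pos] <- enc$other[-enc$codes[!pos]]
  out
}

# Made-up marker names for a round-trip check.
x   <- c("rs123", "AFFX-SNP_1", "rs999999999", "rs007", "AFFX-SNP_1")
enc <- encode_snps(x)
print(enc$codes)
print(identical(decode_snps(enc), x))
```

The lookup table only holds the distinct non-"rs" names, so for my data it should stay small next to the integer vector.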
Mike