First, a very easy question: What is the difference between using what="character" and what=character() in scan()? What is the reason for the character() syntax?

I am working with some character vectors that are up to about 27.5 million elements long. The elements are always unique. Specifically, these are names of genetic markers. This is how much memory those names take up:

snps <- scan("SNPs.txt", what=character())
Read 27446736 items
object.size(snps)
1756363648 bytes
object.size(snps)/length(snps)
63.9917128215173 bytes

As you can see, that's about 1.76 GB of memory for the vector at an average of 64 bytes per element. The longest string is only 14 bytes, though. The file takes up 313 MB.

Using 64 bytes per element instead of 14 bytes per element is costing me a total of 1,372,336,800 bytes. In a different example where the longest string is 4 characters, the elements each use 8 bytes. So it looks like I'm stuck with either 8 bytes or 64 bytes. Is that true? There is no way to modify that?
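For reference, the arithmetic behind that figure can be checked in R. This is only a sketch using the numbers reported above; the variable names are made up for illustration.

```r
# Numbers copied from the scan() / object.size() output above
n_items     <- 27446736     # items read by scan()
total_bytes <- 1756363648   # object.size(snps)
longest     <- 14           # longest marker name, in bytes

# Average bytes per element, roughly 64
per_element <- total_bytes / n_items

# Extra bytes compared with a flat 14 bytes per element
overhead <- n_items * (64 - longest)
print(overhead)
```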

By the way...

It turns out that 99.72% of those character strings are of the form paste0("rs", Int), where Int is an integer of no more than 9 digits. So if I keep only those markers, drop the "rs" prefix, and load them as integers, I see a huge improvement:

snps <- scan("SNPs_rs.txt", what=integer())
Read 27369706 items
object.size(snps)
109478864 bytes
object.size(snps)/length(snps)
4.00000146146985 bytes

That saves 93.8% of the memory by dropping 0.28% of the markers and encoding the rest as integers instead of strings. I might end up doing this and encoding the remaining marker names as negative integers.
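
A minimal sketch of what that encoding might look like, assuming the "rs" names have no leading zeros and at most 9 digits (so they fit in a 32-bit integer). The functions encode_snps() and decode_snps() and the example marker names are hypothetical, not from any package. Non-"rs" names get negative codes that index into a small lookup table.

```r
# Hypothetical helper: "rs<digits>" names become positive integers,
# all other names become negative indices into a lookup table.
encode_snps <- function(x) {
  # Only names that round-trip exactly: no leading zeros, at most 9 digits
  is_rs <- grepl("^rs[1-9][0-9]{0,8}$", x)
  other <- unique(x[!is_rs])
  codes <- integer(length(x))
  codes[is_rs]  <- as.integer(substring(x[is_rs], 3))
  codes[!is_rs] <- -match(x[!is_rs], other)
  list(codes = codes, other = other)
}

# Hypothetical inverse: rebuild the original character vector
decode_snps <- function(enc) {
  out <- character(length(enc$codes))
  pos <- enc$codes > 0L
  out[pos]  <- sprintf("rs%d", enc$codes[pos])
  out[!pos] <- enc$other[-enc$codes[!pos]]
  out
}

# Made-up marker names, for illustration only
x   <- c("rs123", "chr1:1000", "rs999999999", "exm42")
enc <- encode_snps(x)
stopifnot(identical(decode_snps(enc), x))
```

The integer vector costs 4 bytes per element, and only the 0.28% of names that are not of the "rs" form need to be kept as strings in the lookup table.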

Mike
