On Aug 9, 2012, at 5:29 PM, Sean Ruddy wrote:
Hi,
First, thanks in advance. Some useful info:
version
platform x86_64-unknown-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
version.string R version 2.15.1 (2012-06-22)
I'm trying to use the table() function on a 2 column matrix that has
711 million rows (see below). However, it freezes. If I subset the
matrix to have at most 2^29 rows (500+ million), then the table()
function finishes in minutes. As soon as I go larger than that,
beginning with 2^29+1 rows, it gets stuck, i.e. nothing happens even
after hours of running. I assume it has something to do with memory,
since I believe that's the 32 bit limit, but I'm running on a 64 bit
machine.
The maximum size of a vector or matrix (= nrow x ncol) is the same on
32 and 64 bit machines: 2^31-1
Here's the matrix:
head(DRI.mtx)
POSITION BP
38076904 C
38076905 C
38076906 A
38076907 T
38076908 C
38076909 C
The result from table (if the matrix has less than 2^29 rows) is
head(table(DRI.mtx))
           BP
POSITION     A  C  G  N  T
  115247036 17  0  0  0  0
  115247037 31  0  0  0  0
  115247038 46  0  0  0  0
  115247039  0  0 54  0  0
  115247040  0  0  1  0 66
  115247041  0  0  0  0 78
I've tracked the problem down to the C file "unique.c". table() calls
factor(), which calls unique(), which I believe calls into "unique.c".
Browsing through the C file, I found an if statement that checks
whether the size of the vector is larger than 2^30. If TRUE, it gives
the error message "too large for hashing". I do not get any error
message when I run table() on the full matrix, but I wonder if maybe I
should, and if the limit of 2^30 is too high and should be lowered.
Maybe it's just my setup, or maybe it has nothing to do with unique.c.
I don't know.
Here's the part of unique.c I was referring to:
/*
Choose M to be the smallest power of 2
not less than 2*n and set K = log2(M).
Need K >= 1 and hence M >= 2, and 2^M <= 2^31 -1, hence n <= 2^30.
Dec 2004: modified from 4*n to 2*n, since in the worst case we have
a 50% full table, and that is still rather efficient -- see
R. Sedgewick (1998) Algorithms in C++ 3rd edition p.606.
*/
static void MKsetup(int n, HashData *d)
{
    int n2 = 2 * n;
    if(n < 0 || n > 1073741824) /* protect against overflow to -ve */
        error(_("length %d is too large for hashing"), n);
    d->M = 2;
    d->K = 1;
    while (d->M < n2) {
        d->M *= 2;
        d->K += 1;
    }
}
"n" I presume is the number of rows of the matrix so I don't see why
this wouldn't run properly. I'm not sure what is causing the problem
in the unique.c file, and I have no idea how to troubleshoot it.
I have a workaround that reads in chunks at a time, but I'm very
interested in why there appears to be a limit at 2^29 when according
to the unique.c file it should be twice that.
Matrices are stored as vectors, so the maximum number of rows of a two
column matrix _should_ be half of the maximum length of a vector.
Issues with reaching the limits for matrix or vector sizes come up
from time to time but this is the first in my memory for size of
factor objects.
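
One possible explanation for the 2^29 threshold, offered as a sketch
rather than a confirmed diagnosis: in the MKsetup() loop quoted above,
M is a 32-bit int that is doubled until it reaches n2 = 2*n. For
n > 2^29, n2 exceeds 2^30, so M would have to reach 2^31, which does
not fit in a signed 32-bit int. The snippet below replays that loop
with explicit wraparound. It assumes int is 32 bits and that the
overflow wraps in two's complement fashion (in C it is formally
undefined behavior); wrap32() and mksetup_steps() are hypothetical
helper names and are not part of R.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reduce x modulo 2^32 into the int32_t range,
   mimicking two's complement wraparound without relying on undefined
   or implementation-defined behavior. */
static int32_t wrap32(int64_t x)
{
    uint32_t u = (uint32_t)x;            /* well-defined: modulo 2^32 */
    if (u <= (uint32_t)INT32_MAX)
        return (int32_t)u;
    return (int32_t)((int64_t)u - 4294967296LL);
}

/* Hypothetical helper: replay the doubling loop from MKsetup().
   Returns the number of doublings performed, or -1 if M wraps to a
   value <= 0. Once that happens M becomes 0 on the next doubling and
   stays there, so the real loop would never reach n2. */
static int mksetup_steps(int32_t n)
{
    assert(n >= 0 && n <= 1073741824);   /* same range MKsetup() accepts */

    int32_t n2 = wrap32(2 * (int64_t)n);
    int32_t M = 2;
    int steps = 0;

    while (M < n2) {
        M = wrap32(2 * (int64_t)M);
        steps++;
        if (M <= 0)
            return -1;                   /* wrapped: loop cannot terminate */
    }
    return steps;
}
```

Under these assumptions, mksetup_steps(536870912) (n = 2^29) finishes
after 29 doublings, while mksetup_steps(536870913) and
mksetup_steps(711000000) return -1. That would match a table() call
that hangs without ever reaching the "too large for hashing" error.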
David Winsemius, MD
Alameda, CA, USA
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.