Part of that decision may depend on how big the dataset is and what you intend to do with the IDs:
> object.size(1011001001001)
[1] 36
> object.size("1011001001001")
[1] 52
> object.size(factor("1011001001001"))
[1] 244
They will by default, as Andy indicates, be read and stored as doubles. They are too large for integers, at least on my system:
> .Machine$integer.max
[1] 2147483647
Converting to character might make sense, with only a small memory penalty (52 bytes versus 36 for the single value above). However, using a factor carries a notable memory penalty (244 bytes for the same value) if the attributes of a factor are not needed.
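As a minimal sketch of the character route, assuming a whitespace-delimited file with an ID column and one numeric column (the inline `txt` data and the column names `id` and `value` are made up for illustration), the `colClasses` argument of read.table can keep the IDs as strings on input:

```r
# Hypothetical input standing in for the real file
txt <- "id value\n1001001001001 3.5\n1011001001001 4.2\n"

# Default behaviour: the id column is too large for integer, so it is read as double
as_num <- read.table(text = txt, header = TRUE)

# colClasses (one entry per column) forces the id column to character
as_chr <- read.table(text = txt, header = TRUE,
                     colClasses = c("character", "numeric"))

class(as_num$id)   # "numeric"
class(as_chr$id)   # "character"
```

Reading the IDs as character also preserves any leading zeros, which a numeric column would drop.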
That depends on how long the vectors are. The memory overhead for factors is per vector, with only 4 bytes used for each additional element (if the level already appears). The memory overhead for character data is per element -- there is no amortization for repeated values.
> object.size(factor("1011001001001"))
[1] 244
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
[1] 308
> # bytes per element in factor, for length 4:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
[1] 77
> # bytes per element in factor, for length 1000:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
[1] 4.292
> # bytes per element in character data, for length 1000:
> object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
[1] 20.028
>
So, for long vectors with relatively few different values, storage as factors is far more memory efficient (this is because the character data is stored only once per level, and each element is stored as a 4-byte integer). (The above was done on Windows 2000).
-- Tony Plate
If any mathematical operations are to be performed with the IDs, then leaving them as doubles makes the most sense.
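A short sketch of why doubles are safe for IDs of this size: a double represents every integer up to 2^53 exactly, and the largest ID in the thread is far below that limit. The variable name `id` is just for illustration.

```r
id <- 1011001001001                   # numeric literal, stored as a double

is.double(id)                         # TRUE
id < 2^53                             # TRUE: inside the range where doubles hold integers exactly
id + 1 == 1011001001002               # TRUE: arithmetic on the ID stays exact

# Coercing to a 32-bit integer fails, since the value exceeds .Machine$integer.max
suppressWarnings(as.integer(id))      # NA

# Print the ID without scientific notation
format(id, scientific = FALSE)
```

Note that the default print method may show large doubles in scientific notation, so `format(..., scientific = FALSE)` is worth using when the IDs have to be displayed or written out.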
Dan, more information on the numerical characteristics of your system can be found by using:
.Machine
See ?.Machine and ?object.size for more information.
HTH,
Marc Schwartz
On Fri, 2004-08-13 at 21:02, Liaw, Andy wrote:
> If I'm not mistaken, numerics are read in as doubles, so that shouldn't be a
> problem. However, I'd try using factor or character.
>
> Andy
>
> > From: Dan Bolser
> >
> > I store an id as a big number, could this be a problem?
> >
> > Should I convert to a string when I use read.table(...
> >
> > example id's
> >
> > 1001001001001
> > 1001001001002
> > ...
> > 1002001002005
> >
> >
> > Biggest is probably
> >
> > 1011001001001
> >
> > Ta,
> > Dan.
> >
______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html