At Friday 08:41 PM 8/13/2004, Marc Schwartz wrote:
Part of that decision may depend upon how big the dataset is and what is
intended to be done with the ID's:

> object.size(1011001001001)
[1] 36

> object.size("1011001001001")
[1] 52

> object.size(factor("1011001001001"))
[1] 244


They will by default, as Andy indicates, be read and stored as doubles. They are too large for integers, at least on my system:

> .Machine$integer.max
[1] 2147483647
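
As a rough check of why doubles are still safe for IDs of this size, the sketch below assumes only base R and uses the largest ID mentioned in the thread. The variable names are illustrative, not from the original posts. It relies on the fact that a double holds every whole number exactly up to 2^53.

```r
# Largest ID mentioned in the thread; too big for an R integer
id <- 1011001001001

# Doubles represent every whole number exactly up to 2^53
limit <- 2^53
is_exact <- id < limit            # the ID is well inside the exact range

# Adjacent IDs stay distinguishable when stored as doubles
distinct <- (id + 1) != id

# Coercion to integer overflows and yields NA (normally with a warning)
as_int <- suppressWarnings(as.integer(id))

# Fixed notation avoids the ID being printed as 1.011001e+12
printed <- format(id, scientific = FALSE)
```

So the main practical risk with doubles is how the IDs are printed or written out, not loss of precision.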

Converting to a character might make sense, with only a minimal memory
penalty. However, using a factor results in a notable memory penalty, if
the attributes of a factor are not needed.

That depends on how long the vectors are. The memory overhead for factors is per vector, with only 4 bytes used for each additional element (if the level already appears). The memory overhead for character data is per element -- there is no amortization for repeated values.


> object.size(factor("1011001001001"))
[1] 244
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
[1] 308
> # bytes per element in factor, for length 4:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
[1] 77
> # bytes per element in factor, for length 1000:
> object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
[1] 4.292
> # bytes per element in character data, for length 1000:
> object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
[1] 20.028
>


So, for long vectors with relatively few different values, storage as factors is far more memory efficient (this is because the character data is stored only once per level, and each element is stored as a 4-byte integer). (The above was done on Windows 2000).
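
The comparison above can be repeated for several vector lengths with a small helper. This is only a sketch: `bytes_per_element` is a name made up here, and the absolute numbers depend on the R version and platform (newer versions of R may account for repeated strings differently from the 2004 figures shown above), so no expected output is given.

```r
# The four ID strings used above, treated as the only distinct values
ids <- c("1011001001001", "111001001001", "001001001001", "011001001001")

# Bytes per element for factor and character storage at a given repeat count
bytes_per_element <- function(times) {
  x <- rep(ids, times)
  n <- length(x)
  c(n = n,
    factor = as.numeric(object.size(factor(x))) / n,
    character = as.numeric(object.size(x)) / n)
}

# One row per vector length: 4, 1000 and 100000 elements
sizes <- t(sapply(c(1, 250, 25000), bytes_per_element))
print(sizes)
```

The per-element cost of the factor should fall as the vector grows, since the fixed cost of the levels is spread over more elements.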

-- Tony Plate

If any mathematical operations are to be performed with the ID's then
leaving them as doubles makes most sense.
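
If instead the IDs are only labels, one option is to fix the column type at import time with the colClasses argument of read.table. The sketch below uses a made-up two-column input; the column names `id` and `value` and the numbers in the second column are assumptions for illustration only.

```r
# Hypothetical two-column input; column names and values are assumptions
txt <- "id value
1001001001001 0.5
1001001001002 1.5
1011001001001 2.5"

# Default behaviour: the id column is read as numeric (double)
as_double <- read.table(text = txt, header = TRUE)

# colClasses keeps the id column as character strings
as_char <- read.table(text = txt, header = TRUE,
                      colClasses = c(id = "character", value = "numeric"))

class(as_double$id)
class(as_char$id)
```

Using "factor" in place of "character" in colClasses would read the IDs as a factor instead.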

Dan, more information on the numerical characteristics of your system
can be found by using:

.Machine

See ?.Machine and ?object.size for more information.

HTH,

Marc Schwartz


On Fri, 2004-08-13 at 21:02, Liaw, Andy wrote:
> If I'm not mistaken, numerics are read in as doubles, so that shouldn't be a
> problem. However, I'd try using factor or character.
>
> Andy
>
> > From: Dan Bolser
> >
> > I store an id as a big number, could this be a problem?
> >
> > Should I convert to a string when I use read.table(...
> >
> > example id's
> >
> > 1001001001001
> > 1001001001002
> > ...
> > 1002001002005
> >
> >
> > Biggest is probably
> >
> > 1011001001001
> >
> > Ta,
> > Dan.
> >


______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
