On Mon, 14 Feb 2005, Berton Gunter wrote:


read all 200 million rows a pipe dream no matter what
platform I'm using?

In principle R can handle this with enough memory. However, 200 million
rows and three columns is 4.8Gb of storage, and R usually needs a few
times the size of the data for working space.
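
The 4.8Gb figure is consistent with storing every value as an 8-byte
double, which is how R holds numeric data. A quick check of the arithmetic,
written in Python purely for illustration; the multiplier of 3 is an assumed
stand-in for "a few times", which the message does not quantify:

```python
# Back-of-envelope memory estimate for a 200 million x 3 numeric table.
ROWS = 200_000_000
COLS = 3
BYTES_PER_DOUBLE = 8  # numeric values held as 8-byte doubles

raw_bytes = ROWS * COLS * BYTES_PER_DOUBLE
raw_gb = raw_bytes / 1e9  # decimal gigabytes

# ASSUMPTION: "a few times" is taken as 3x here, only to get a rough figure.
WORKING_FACTOR = 3
working_gb = raw_gb * WORKING_FACTOR

print(f"raw data: {raw_gb:.1f} GB")
print(f"with working space (assumed x{WORKING_FACTOR}): {working_gb:.1f} GB")
```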

You would likely be better off not reading the whole data set at once, but
loading sections of it from Oracle as needed.
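
One way to picture "loading sections as needed" is to fetch a fixed number
of rows at a time and keep only running totals in memory. The sketch below is
a hedged illustration, not the method used in this thread: it uses Python's
built-in sqlite3 module as a stand-in for Oracle, and the table name `points`,
the columns `x`, `y`, `z`, and the helper `column_means_in_chunks` are all
hypothetical.

```python
import sqlite3


def column_means_in_chunks(conn, table, columns, chunk_size=10_000):
    """Compute column means while holding at most chunk_size rows in memory."""
    cur = conn.cursor()
    # Table and column names are trusted here; this is a sketch, not production SQL.
    cur.execute(f"SELECT {', '.join(columns)} FROM {table}")
    totals = [0.0] * len(columns)
    n = 0
    while True:
        rows = cur.fetchmany(chunk_size)  # next section of the result set
        if not rows:
            break
        for row in rows:
            for i, value in enumerate(row):
                totals[i] += value
        n += len(rows)
    if n == 0:
        return [float("nan")] * len(columns)
    return [t / n for t in totals]


# Small in-memory demonstration (stand-in for the real database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL, z REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [(i, 2 * i, 1.0) for i in range(1000)],
)
means = column_means_in_chunks(conn, "points", ["x", "y", "z"], chunk_size=100)
print(means)
```

Anything that can be accumulated this way (sums, counts, cross-products)
never needs the full table in memory at once.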


-thomas


Thomas's comment raises a question:

Can someone give me an example (perhaps in a private response, since I'm off
topic here) where one actually needs all cases in a large data set ("large"
being > 1e6, say) to do a STATISTICAL analysis? By "statistical" I exclude,
say, searching for some particular characteristic like an adverse event in a
medical or customer repair database, etc. Maybe a definition of
"statistical" is: anything that cannot be routinely done in a single pass
database query.

The reason I ask this is that it seems to me that with millions of cases,
careful sampling (perhaps stratified, or otherwise not completely at random)
should always suffice to reduce a dataset to a manageable size
sufficient for the data analysis needs at hand. But my ignorance and naivete
probably show here.
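
For concreteness, a proportional stratified sample of the kind alluded to
above might be drawn as follows. This is a minimal sketch under stated
assumptions: the records, the `region` field, and the helper
`stratified_sample` are hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from each stratum, with at least one record per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # sampling without replacement
    return sample


# Hypothetical data: 100,000 records, one in ten from the "south" stratum.
records = [
    {"region": "south" if i % 10 == 0 else "north", "value": i}
    for i in range(100_000)
]
sample = stratified_sample(records, key=lambda r: r["region"], fraction=0.01)
n_south = sum(1 for r in sample if r["region"] == "south")
print(len(sample), n_south)
```

Sampling within strata keeps small subgroups represented in proportion to
their size, which a simple random sample of the same size does not guarantee.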

I think they are very rare. I have seen just one, a Poisson glm for an uncommon event (so ca 70% of the counts were zero) with nine categorical predictors. There were about 0.7m records, and a 10% sample was not sufficient to select a model (we got quite different answers from different samples) that predicted accurately on fairly small (but still 10,000 or more) subgroups. Homogeneity is always suspect in large datasets, and in retrospect we could have used several sub-models, splitting on some of the variables. But a log-linear model is one of the few ways I know to effectively summarize such large categorical datasets.


In the original problem (which was spatial) I suspect that a local (in space) model would suffice.

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
