May I offer a perhaps contrary perspective on this? Statistical **theory** tells us that the precision of estimates improves as sample size increases. In practice, however, this is not always the case, because it takes time to collect the extra data, and things change over time. The very definition of what one is measuring, the technology by which it is measured (think of estimating tumor size, disease incidence, or underemployment, for example), the presence or absence of known or unknown large systematic effects, and so forth may all change in unknown ways. This defeats, or at least complicates, the fundamental assumption that one is sampling from a fixed population or a stable (e.g. homogeneous, stationary) process, so it is no wonder that all statistical bets are off. Of course, sometimes the information needed to account for these issues is available, and appropriate (but often complex) statistical analyses can be performed. But not always.
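As a rough illustration of the point above, here is a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the original discussion: the function name `pooled_mean_error`, the unit-variance Gaussian noise, and the linear drift model in which the process mean moves by a fixed amount with each new observation. It compares how well the pooled sample mean estimates the current process mean (the mean at the end of data collection) for a stable process and for a slowly drifting one.

```python
import random
import statistics


def pooled_mean_error(n, drift_per_obs, trials=50, seed=0):
    """RMS error of the pooled sample mean, taken as an estimate of the
    process mean at the END of data collection (hypothetical setup)."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(trials):
        # Observation i comes from a process whose mean has drifted
        # to i * drift_per_obs; the noise has standard deviation 1.
        sample = [rng.gauss(i * drift_per_obs, 1.0) for i in range(n)]
        current_mean = (n - 1) * drift_per_obs
        squared_errors.append((statistics.fmean(sample) - current_mean) ** 2)
    return statistics.fmean(squared_errors) ** 0.5


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        stable = pooled_mean_error(n, drift_per_obs=0.0)
        drifting = pooled_mean_error(n, drift_per_obs=0.001)
        print(f"n={n:>6}  stable RMS error={stable:.4f}  "
              f"drifting RMS error={drifting:.4f}")
```

Under these assumptions the stable-process error shrinks roughly like 1/sqrt(n), as theory says, while the drifting-process error is dominated by a bias of about (n - 1) * drift_per_obs / 2, which grows with n. Real drift is rarely this tidy or this knowable, which is rather the point.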
Thus, I am suspicious, cynical even, about those who advocate collecting "all the data" and subjecting the whole vast heterogeneous mess to arcane and ever more computer-intensive (and adjustable-parameter-ridden) "data mining" algorithms to "detect trends" or "discover knowledge." To me, it sounds like a prescription for "turning on all the equipment and waiting to see what happens" in the science lab instead of performing careful, well-designed experiments.

I realize, of course, that there are many perfectly legitimate areas of scientific research, from geophysics to evolutionary biology to sociology, where one cannot (easily) perform planned experiments. But my point is that good science demands that in all circumstances, and especially when one accumulates and attempts to aggregate data taken over spans of time and space, one needs to beware of oversimplification, including statistical oversimplification. So interrogate the measurement, be skeptical of stability, expect inconsistency. While "all models are wrong but some are useful" (George Box), the second law of thermodynamics tells us that entropy still rules.

(Needless to say, public or private contrary views are welcome.)

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Weiwei Shi
> Sent: Tuesday, April 25, 2006 12:10 PM
> To: bogdan romocea
> Cc: r-help
> Subject: Re: [R] regression modeling
>
> i believe it is not a question only related to regression modeling. The
> correlation between the sample size and confidence of prediction in data
> mining is not as clear as traditional stat approach. My concern is not in
> that theoretical discussion but more practical, looking for a good algorithm
> when response variable is continuous when large dataset is concerned.
>
> On 4/25/06, bogdan romocea <[EMAIL PROTECTED]> wrote:
> >
> > There is an aspect, worthy of careful consideration, you don't seem to
> > be aware of. I'll ask the question for you: How does the
> > explanatory/predictive potential of a dataset vary as the dataset gets
> > larger and larger?
> >
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On Behalf Of Weiwei Shi
> > > Sent: Monday, April 24, 2006 12:45 PM
> > > To: r-help
> > > Subject: [R] regression modeling
> > >
> > > Hi, there:
> > > I am looking for a regression modeling (like regression trees)
> > > approach for a large-scale industry dataset. Any suggestion on a
> > > package from R or from other sources which has a decent accuracy and
> > > scalability? Any recommendation from experience is highly appreciated.
> > >
> > > Thanks,
> > >
> > > Weiwei
> > >
> > > --
> > > Weiwei Shi, Ph.D
> > >
> > > "Did you always know?"
> > > "No, I did not. But I believed..."
> > > ---Matrix III
> > >
> > > ______________________________________________
> > > R-help@stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html