On 4/11/07, Robert Duval <[EMAIL PROTECTED]> wrote:
> So I guess my question is...
>
> Is there any hope of R being modified at its core in order to handle
> large datasets more gracefully? (You've mentioned SAS and SPSS; I'd
> add Stata to the list.)
>
> Or should we (the users of large datasets) expect to keep on working
> with the present tools for the time to come?
We're certainly aware of the desire of many users to be able to handle large data sets. I have just spent a couple of days working with a student from another department who wanted to work with a very large data set that was poorly structured. Most of my time was spent trying to convince her of the limitations in the structure of her data and of what could realistically be computed from it.

If your purpose is to perform data manipulation and extraction on large data sets, then I think it is not unreasonable to be expected to learn SQL. I find it convenient to use R for data manipulation because I know the language and the support tools well, but I don't expect to do data cleaning on millions of records with it. I am probably too conservative in what I will ask R to handle for me, because I started using S on a VAX-11/750 that had 2 megabytes of memory, and it's hard to break old habits.

I think the trend in working with large data sets in R will be toward a hybrid approach: a database for data storage and retrieval plus R for the model definition and computation. Miguel Manese's SQLiteDF package and some of the work in Bioconductor are steps in this direction. (A sketch of what such a hybrid workflow might look like is given below.)

However, as was mentioned earlier in this thread, there is an underlying assumption in R that the user is thinking about the analysis while doing it. We sometimes see questions of the form "I have a data set with (some large number of) records on several hundred or thousands of variables, and I want to fit a generalized linear model to it." I would be hard pressed to think of a situation where I wanted hundreds of variables in a statistical model unless they were generated from one or more factors that have many levels. And in that case I would want to use random effects rather than fixed effects in the model (a sketch of that contrast is also given below). So just saying that the big challenge is to fit some kind of model with lots of coefficients to a very large number of observations may be missing the point. Defining the model better may be the point.

Let me conclude by saying that these are general observations and not directed at you personally, Robert. I don't know what you want R to do gracefully with large data sets, so my response is more to the general point that there should always be a balance between thinking about the structure of the data and the model, and brute-force computation. One can do data analysis by using the computer as a blunt instrument with which to bludgeon the problem to death, but one can't do elegant data analysis that way.
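To make that hybrid approach concrete, here is a minimal sketch of the division of labour I have in mind, using the DBI and RSQLite packages. The file, table, and column names are all hypothetical; the point is that the database does the filtering and aggregation in SQL, and only the modest summary crosses over into R to be modelled:

```r
library(DBI)
library(RSQLite)

## Connect to a (hypothetical) SQLite file holding millions of records
con <- dbConnect(SQLite(), dbname = "bigdata.sqlite")

## Let the database do the filtering and aggregation; only the
## summarised rows come back into R
dat <- dbGetQuery(con,
    "SELECT site, year, AVG(response) AS mresp, COUNT(*) AS n
       FROM measurements
      WHERE year BETWEEN 2000 AND 2005
      GROUP BY site, year")
dbDisconnect(con)

## Fit the model in R on the much smaller extracted data set,
## weighting each group mean by the number of records behind it
fit <- lm(mresp ~ site + factor(year), data = dat, weights = n)
summary(fit)
```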
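And on the point about factors with many levels, here is a minimal sketch on simulated data (so every number in it is made up) contrasting a fixed-effects fit, which spends one coefficient per level, with a random-effects fit via lmer from the lme4 package, which summarises all the levels with a single variance component:

```r
library(lme4)

## Simulate a response measured within a factor that has 500 levels
set.seed(1)
ngrp <- 500                            # many levels
nper <- 20                             # observations per level
g <- factor(rep(seq_len(ngrp), each = nper))
u <- rnorm(ngrp, sd = 2)               # true level-specific effects
y <- 10 + u[as.integer(g)] + rnorm(ngrp * nper)

## Fixed effects: one coefficient per level of g
fm1 <- lm(y ~ g)
length(coef(fm1))                      # 500 coefficients

## Random effects: one fixed intercept, one variance component for
## the 500 levels, and the residual variance
fm2 <- lmer(y ~ 1 + (1 | g))
summary(fm2)
```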
> robert
>
> On 4/11/07, Marc Schwartz <[EMAIL PROTECTED]> wrote:
> > On Wed, 2007-04-11 at 11:26 -0500, Marc Schwartz wrote:
> > > On Wed, 2007-04-11 at 17:56 +0200, Bi-Info
> > > (http://members.home.nl/bi-info) wrote:
> > > > I certainly have that idea too. SPSS functions in much the same
> > > > way, although it specialises in PC applications. Adding memory to
> > > > a PC is not very expensive these days. On my first AT some extra
> > > > memory cost 300 dollars or more. These days you get extra memory
> > > > with a package of marshmallows or chocolate bars if you need it.
> > > > All computations on a computer are discrete steps in a way, but
> > > > I've heard that SAS computations are split up into strictly
> > > > divided steps. That also makes procedures "attachable", I've been
> > > > told, and interchangeable. Different procedures can use the same
> > > > code, which alternatively is cheaper in memory usage or disk
> > > > usage (the old days...). That, by the way, makes SAS a
> > > > complicated machine to build, because the procedures are split up
> > > > into numerous fragments, which makes for complicated bookkeeping.
> > > > If you do it that way, I've been told, you can do a lot of
> > > > computations with very little memory. One guy actually computed
> > > > quite complicated models with "only 32MB or less", which wasn't
> > > > very much for "his type of calculations". Which means that SAS is
> > > > efficient in memory handling, I think. It's not very efficient in
> > > > dollar handling... I estimate.
> > > >
> > > > Wilfred
> > >
> > > <snip>
> > >
> > > Oh....SAS is quite efficient in dollar handling, at least when it
> > > comes to the annual commercial licenses...along the same lines as
> > > the purported efficiency of the U.S. income tax system:
> > >
> > > "How much money do you have? Send it in..."
> > >
> > > There is a reason why SAS is the largest privately held software
> > > company in the world, and it is not due to the academic licensing
> > > structure, which constitutes only about 12% of their revenue, based
> > > upon their public figures.
> >
> > Hmmm......here is a classic example of the problems of reading pie
> > charts.
> >
> > The figure I quoted above, which is from reading the 2005 SAS Annual
> > Report on their web site (such as it is for a private company), comes
> > from a 3D exploded pie chart (ick...).
> >
> > The pie chart uses 3 shades of grey and 5 shades of blue to
> > differentiate 8 market segments and their percentages of total
> > worldwide revenue.
> >
> > I mis-read the 'shade of grey' allocated to Education as being 12%
> > (actually 11.7%).
> >
> > A re-read of the chart, zooming in close on the pie in a PDF reader,
> > appears to actually show that Education is but 1.8% of their annual
> > worldwide revenue.
> >
> > Government-based installations, which are presumably the other
> > notable market segment in which substantially discounted licenses are
> > provided, are at 14.6%.
> >
> > The report is available here for anyone else curious:
> >
> > http://www.sas.com/corporate/report05/annualreport05.pdf
> >
> > Somebody needs to send SAS a copy of Tufte or Cleveland.
> >
> > I have to go and rest my eyes now... ;-)
> >
> > Regards,
> >
> > Marc
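As an aside on Marc's closing point: a Cleveland-style dot plot avoids exactly this problem, because it puts the values on a common scale instead of encoding them as angles and 3D wedges. A minimal sketch in base R, using only the two percentages quoted above and lumping the remaining six segments together:

```r
## Revenue shares as read from the 2005 SAS annual report pie chart;
## only Education and Government were quoted above, so the other six
## segments are lumped into a single remainder
shares <- c("Other six segments" = 100 - 14.6 - 1.8,
            "Government"         = 14.6,
            "Education"          = 1.8)

## A Cleveland dot plot: values on a common scale, with no angles
## or 3D effects to misjudge
dotchart(shares, xlab = "Percent of worldwide revenue")
```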