On Fri, 6 Jan 2006, Martin Maechler wrote:
> "FrPi" == François Pinard <[EMAIL PROTECTED]>
>     on Thu, 5 Jan 2006 22:41:21 -0500 writes:
>
> FrPi> [Brian Ripley]
> >> I rather thought that using a DBMS was standard practice in the
> >> R community for those using large datasets: it gets discussed
> >> rather often.
>
> FrPi> Indeed. (I tried RMySQL even before speaking of R to my
> FrPi> co-workers.)
>
> >> Another possibility is to make use of the several DBMS interfaces
> >> already available for R. It is very easy to pull in a sample from
> >> one of those, and surely keeping such large data files as ASCII is
> >> not good practice.
>
> FrPi> Selecting a sample is easy. Yet, I'm not aware of any
> FrPi> SQL device for easily selecting a _random_ sample of
> FrPi> the records of a given table. On the other hand, I'm
> FrPi> no SQL specialist; others might know better.
>
> FrPi> We do not have a need yet for samples where I work,
> FrPi> but if we ever need them, they will have to be random,
> FrPi> or else I will always fear biases.
>
> >> One problem with Francois Pinard's suggestion (the credit has got
> >> lost) is that R's I/O is not line-oriented but stream-oriented. So
> >> selecting lines is not particularly easy in R.
>
> FrPi> I understand that you mean random access to lines,
> FrPi> instead of random selection of lines. Once again,
> FrPi> this chat comes out of reading someone else's problem;
> FrPi> this is not a problem I actually have. SPSS was not
> FrPi> randomly accessing lines, as data files could well be
> FrPi> held on magnetic tapes, where random access is not
> FrPi> possible in ordinary practice. SPSS reads (or was
> FrPi> reading) lines sequentially from beginning to end, and
> FrPi> the _random_ sample is built while the reading goes.
>
> FrPi> Suppose the file (or tape) holds N records (N is not
> FrPi> known in advance), from which we want a sample of M
> FrPi> records at most. If N <= M, then we use the whole
> FrPi> file; no sampling is possible nor necessary.
> FrPi> Otherwise, we first initialise M records with the
> FrPi> first M records of the file. Then, for each record in
> FrPi> the file after the M'th, the algorithm has to decide
> FrPi> if the record just read will be discarded or if it
> FrPi> will replace one of the M records already saved, and
> FrPi> in the latter case, which of those records will be
> FrPi> replaced. If the algorithm is carefully designed,
> FrPi> once the last (N'th) record of the file has been
> FrPi> processed this way, we have M records randomly
> FrPi> selected from N records, in such a way that each of
> FrPi> the N records had an equal probability of ending up
> FrPi> in the selection of M records. I can look up the
> FrPi> details if needed.
>
> FrPi> This is my suggestion, or in fact, more a thought than
> FrPi> a suggestion. It might represent something useful
> FrPi> either for flat ASCII files or even for a stream of
> FrPi> records coming out of a database, if those indeed
> FrPi> do not offer ready random sampling devices.
>
> FrPi> P.S. - In the (rather unlikely, I admit) case the gang
> FrPi> I'm part of would have the need described above, and
> FrPi> if I then dared to implement it myself, would it be
> FrPi> welcome?
>
> I think this would be a very interesting tool and I'm also
> intrigued by the details of the algorithm you outline above.
It's called `reservoir sampling' and is described in my simulation book and Knuth and elsewhere.
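
For readers who want the details, here is a minimal Python sketch of one common form of reservoir sampling (the single-pass scheme usually credited as "Algorithm R"). It is only an illustration of the idea described in the quoted text, not code from the book or from any R package; the function name reservoir_sample and its arguments are invented for this example.

```python
import random


def reservoir_sample(records, m, rng=None):
    """Return at most m records drawn uniformly from an iterable.

    The total number of records N need not be known in advance;
    the input is read once, front to back.
    """
    rng = rng or random.Random()
    reservoir = []
    for n, record in enumerate(records, start=1):
        if n <= m:
            # The first m records fill the reservoir unconditionally.
            reservoir.append(record)
        else:
            # Record n is kept with probability m/n; if kept, it
            # overwrites a uniformly chosen slot of the reservoir.
            j = rng.randrange(n)  # uniform integer in 0 .. n-1
            if j < m:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # Hypothetical usage: 5 records out of a stream of 1000.
    print(reservoir_sample(range(1000), 5))
```

Keeping record n with probability m/n and evicting a uniformly chosen slot is what gives every one of the N records the same chance, m/N, of being in the final selection. If N <= M the whole input is returned, as in the description above.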
> If it could be made to work on all kinds of read.table()-readable
> files (i.e., of course, including *.csv), that might be a valuable
> tool for all those -- and there are many -- for whom working with
> DBMSs is too daunting initially.
It would be better (for the reasons I gave) to do this in a separate file preprocessor: read.table reads from a connection not a file, of course.
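
One way such a separate preprocessor might look, again as a hedged Python sketch: it copies the header line through, keeps a random selection of at most m data lines in a single pass, and writes them out in their original order so the result can be read with read.table() or read.csv() afterwards. The name sample_lines, its parameters, and the file names in the note below are assumptions made for this example only.

```python
import io
import random


def sample_lines(src, dst, m, header=True, rng=None):
    """Write the header (if any) plus a uniform random sample of at
    most m data lines from text stream src to text stream dst.

    Returns the number of data lines written.
    """
    rng = rng or random.Random()
    if header:
        dst.write(src.readline())  # pass the header through untouched
    kept = []  # (line_number, line) pairs currently in the sample
    for n, line in enumerate(src, start=1):
        if n <= m:
            kept.append((n, line))
        else:
            j = rng.randrange(n)  # uniform integer in 0 .. n-1
            if j < m:
                kept[j] = (n, line)
    # Restore file order so the sample looks like a thinned-out copy.
    for _, line in sorted(kept):
        dst.write(line)
    return len(kept)


if __name__ == "__main__":
    # Small in-memory demonstration with made-up data.
    src = io.StringIO("id,x\n" + "".join(f"{i},{i * i}\n" for i in range(100)))
    dst = io.StringIO()
    sample_lines(src, dst, 5)
    print(dst.getvalue(), end="")
```

With real files one would pass open file objects instead of the StringIO buffers, for example open("big.csv") and open("sample.csv", "w") (both names hypothetical). The sketch treats each physical line as one record, so it would need more care for quoted fields that contain embedded newlines.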
-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html