Re: [R] Suggestion for big files [was: Re: A comment about R:]

Prof Brian Ripley Fri, 06 Jan 2006 03:17:51 -0800

On Fri, 6 Jan 2006, Martin Maechler wrote:

"FrPi" == François Pinard <[EMAIL PROTECTED]>
    on Thu, 5 Jan 2006 22:41:21 -0500 writes:


   FrPi> [Brian Ripley]
   >> I rather thought that using a DBMS was standard practice in the
   >> R community for those using large datasets: it gets discussed rather
   >> often.

   FrPi> Indeed.  (I tried RMySQL even before speaking of R to my co-workers.)

   >> Another possibility is to make use of the several DBMS interfaces already
   >> available for R.  It is very easy to pull in a sample from one of those,
   >> and surely keeping such large data files as ASCII is not good practice.

   FrPi> Selecting a sample is easy.  Yet, I'm not aware of any
   FrPi> SQL device for easily selecting a _random_ sample of
   FrPi> the records of a given table.  On the other hand, I'm
   FrPi> no SQL specialist, others might know better.

   FrPi> We do not have a need yet for samples where I work,
   FrPi> but if we ever need such, they will have to be random,
   FrPi> or else, I will always fear biases.

   >> One problem with Francois Pinard's suggestion (the credit has got lost)
   >> is that R's I/O is not line-oriented but stream-oriented.  So selecting
   >> lines is not particularly easy in R.

   FrPi> I understand that you mean random access to lines,
   FrPi> instead of random selection of lines.  Once again,
   FrPi> this chat comes out of reading someone else's problem,
   FrPi> this is not a problem I actually have.  SPSS was not
   FrPi> randomly accessing lines, as data files could well be
   FrPi> held on magnetic tapes, where random access is not
   FrPi> possible in ordinary practice.  SPSS reads (or was
   FrPi> reading) lines sequentially from beginning to end, and
   FrPi> the _random_ sample is built while the reading goes.

   FrPi> Suppose the file (or tape) holds N records (N is not
   FrPi> known in advance), from which we want a sample of M
   FrPi> records at most.  If N <= M, then we use the whole
   FrPi> file; sampling is neither possible nor necessary.
   FrPi> Otherwise, we first initialise M records with the
   FrPi> first M records of the file.  Then, for each record in
   FrPi> the file after the M'th, the algorithm has to decide
   FrPi> if the record just read will be discarded or if it
   FrPi> will replace one of the M records already saved, and
   FrPi> in the latter case, which of those records will be
   FrPi> replaced.  If the algorithm is carefully designed,
   FrPi> once the last (N'th) record of the file has been
   FrPi> processed this way, we have M records randomly
   FrPi> selected from the N records, in such a way that
   FrPi> each of the N records had an equal probability of
   FrPi> ending up in the selection of M records.  I may look
   FrPi> up the details if needed.

   FrPi> This is my suggestion, or in fact, more a thought than
   FrPi> a suggestion.  It might represent something useful
   FrPi> either for flat ASCII files or even for a stream of
   FrPi> records coming out of a database, if those indeed
   FrPi> do not offer ready random sampling devices.


   FrPi> P.S. - In the (rather unlikely, I admit) case that the
   FrPi> gang I'm part of came to have the need described above,
   FrPi> and I then dared to implement it myself, would it be welcome?

I think this would be a very interesting tool and
I'm also intrigued about the details of the algorithm you
outline above.

It's called `reservoir sampling' and is described in my simulation book and Knuth and elsewhere.
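
For readers who want to see the idea concretely, here is a minimal sketch of the single-pass scheme outlined above, in the simple form usually credited as Algorithm R.  It is written in Python purely for illustration; the name reservoir_sample and its arguments are invented for this example and are not an existing R or SPSS facility.

```python
import random


def reservoir_sample(records, m, rng=None):
    """Return up to m records drawn uniformly at random from an iterable.

    The input is read once, front to back, and its length N need not be
    known in advance.  (Hypothetical helper, for illustration only.)
    """
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < m:
            # The first m records fill the reservoir unconditionally.
            reservoir.append(record)
        else:
            # Record number i+1 is kept with probability m / (i + 1):
            # draw j uniformly from 0..i and keep the record only if j
            # falls on one of the m reservoir slots, which it then replaces.
            j = rng.randint(0, i)
            if j < m:
                reservoir[j] = record
    return reservoir
```

If the input holds M records or fewer, the whole input comes back unchanged, which matches the N <= M case in the description.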

If it could be made to work on all kinds of read.table()-readable
files (including, of course, *.csv), that might be a valuable
tool for all those -- and there are many -- for whom working
with a DBMS is too daunting initially.

It would be better (for the reasons I gave) to do this in a separate file preprocessor: read.table reads from a connection not a file, of course.
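
One way such a separate preprocessor might look is sketched below.  It is only an illustration under stated assumptions: the input is line-oriented text with one record per line and an optional header line, and the function name sample_lines is hypothetical.  The script reads standard input once, keeps the header, reservoir-samples the data lines, and writes the sample in original file order.

```python
import random
import sys


def sample_lines(lines, m, header=True, rng=None):
    """Keep the header line (if any) plus a uniform random sample of at
    most m data lines, returned in their original order.
    (Hypothetical helper, for illustration only.)"""
    rng = rng or random.Random()
    it = iter(lines)
    out = []
    if header:
        first = next(it, None)
        if first is None:
            return out  # empty input
        out.append(first)
    reservoir = []
    for i, line in enumerate(it):
        if i < m:
            reservoir.append((i, line))
        else:
            j = rng.randint(0, i)
            if j < m:
                reservoir[j] = (i, line)
    # Line indices are unique, so sorting restores file order.
    reservoir.sort()
    out.extend(line for _, line in reservoir)
    return out


if __name__ == "__main__" and len(sys.argv) > 1:
    # The sample size M is the first command-line argument.
    sys.stdout.writelines(sample_lines(sys.stdin, int(sys.argv[1])))
```

Assuming the script were saved as sample_lines.py (a made-up name), its output could be read on the R side through a pipe connection, along the lines of read.csv(pipe("python sample_lines.py 1000 < big.csv")).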


--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
