randomly pick rows from data files

John Clarke Mon, 06 Sep 2010 05:25:21 -0700

Hi,

I have a few large text files, about 3 GB in total, containing millions of
rows. Each row holds a single value.


I want to randomly pick 20000 lines and output these as the result.

My first thought was to have many mappers and one reducer: each mapper
assigns a random number as the key, and the sorter sorts on this key. The
reducer would then output the first X rows (20k in this case) and exit.
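
To make the idea concrete, here is a rough single-process sketch in Python.
It is not real Hadoop code; the names map_phase, reduce_phase and
sample_rows are made up for illustration, and heapq.nsmallest stands in for
the framework's sort step.

```python
import heapq
import random


def map_phase(lines, rng):
    """Mapper stand-in: tag every row with a random key."""
    for line in lines:
        yield rng.random(), line


def reduce_phase(keyed_rows, x):
    """Sort + reducer stand-in: keep the x rows with the smallest keys."""
    # nsmallest avoids holding a fully sorted copy of all rows in memory;
    # a real job would rely on the framework's shuffle/sort instead.
    smallest = heapq.nsmallest(x, keyed_rows, key=lambda kv: kv[0])
    return [row for _, row in smallest]


def sample_rows(lines, x, seed=None):
    """Pick up to x rows at random (hypothetical helper)."""
    rng = random.Random(seed)
    return reduce_phase(map_phase(lines, rng), x)


if __name__ == "__main__":
    rows = (str(i) for i in range(1_000_000))
    picked = sample_rows(rows, 20000)
    print(len(picked))
```

Since every row gets an independent random key, taking the X smallest keys
should give a uniform sample without replacement, but every row still has
to go through the sort.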

Is there a better way? I believe the above will work but it seems quite
inefficient.

Thanks,
John
