Hi,

I have a few large text files, about 3 GB in total, containing millions of
rows. Each row holds a single value.

I want to pick 20,000 lines at random and output them as the result.

My first thought was to use many mappers and a single reducer: each mapper
assigns a random number as the key, and the sort phase orders the records by
that key. The reducer would then output the first X records (20,000 in this
case) and exit.
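
To make the idea concrete, here is a rough sketch in plain Python rather than real MapReduce code. The function names (mapper, reducer, sample_lines) and the in-memory "splits" are made up for illustration only. It assumes that the k records with the smallest random keys are the same ones a full sort would put first.

```python
import heapq
import random


def mapper(lines, rng):
    # Tag every input line with a random key (hypothetical mapper step).
    for line in lines:
        yield rng.random(), line


def reducer(keyed_lines, k):
    # Keeping the k smallest keys gives the same records as sorting
    # everything by key and reading the first k.
    smallest = heapq.nsmallest(k, keyed_lines, key=lambda kv: kv[0])
    return [line for _, line in smallest]


def sample_lines(splits, k, seed=None):
    # `splits` stands in for the input files; each split is an iterable of lines.
    rng = random.Random(seed)
    keyed = (pair for split in splits for pair in mapper(split, rng))
    return reducer(keyed, k)


if __name__ == "__main__":
    splits = [[f"row-{i}" for i in range(1000)],
              [f"row-{i}" for i in range(1000, 2500)]]
    picked = sample_lines(splits, k=20, seed=1)
    print(len(picked))
```

Every row still gets a key and passes through the single reducer, which is the part that seems inefficient to me.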

Is there a better way? I believe the above will work but it seems quite
inefficient.

Thanks,
John
