Hi, I have a few large text files, about 3 GB in total, with millions of rows. Each row holds a single value.
I want to randomly pick 20,000 lines and output them as the result. My first thought was to use many mappers and a single reducer: each mapper assigns a random number to every line as its key, and the framework's sort orders the records by that key. The reducer would then output the first X records (20k in this case) and exit. Is there a better way? I believe the above will work, but it seems quite inefficient, since every row has to be sorted and sent to the one reducer just to keep 20k of them.
Thanks,
John
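
P.S. To make the idea concrete, here is a rough single-machine sketch in Python of what I have in mind. It is not actual Hadoop code: the function names (map_with_random_key, reduce_first_n, sample_lines) are made up for illustration, and the in-memory sorted() call only stands in for the framework's shuffle/sort phase.

```python
import random
from itertools import islice


def map_with_random_key(lines, rng):
    """Mapper stand-in: tag every input line with a random key."""
    for line in lines:
        yield rng.random(), line


def reduce_first_n(sorted_pairs, n):
    """Reducer stand-in: emit the first n values, then stop reading."""
    return [line for _key, line in islice(sorted_pairs, n)]


def sample_lines(lines, n, seed=None):
    """Pick n lines at random by sorting on a random key (hypothetical helper)."""
    rng = random.Random(seed)
    keyed = map_with_random_key(lines, rng)
    # Sorting by the random key plays the role of the shuffle/sort phase.
    shuffled = sorted(keyed, key=lambda pair: pair[0])
    return reduce_first_n(shuffled, n)


if __name__ == "__main__":
    rows = [str(i) for i in range(1000)]  # made-up input, one value per row
    picked = sample_lines(rows, 20, seed=1)
    print(len(picked))
```

Because the whole input is ordered by a random key, the first n records should be a uniform sample without replacement. The cost I am worried about is visible here too: every row gets sorted even though only n are kept.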
