It should be pretty easy to down-sample the data to no more than 1,000-10,000 keys. Sort those and take every n-th key, omitting the first and last keys. This last step can probably best be done as a conventional script once you have knocked the data down to a small size.
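Something like this minimal Python sketch would do the sort-and-select step (the reservoir down-sampling, the MAX_KEYS/N_SPLITS values, and the script name are my own assumptions, not anything from your job):

#!/usr/bin/env python
# pick_splits.py (hypothetical name): read one key per line on stdin,
# down-sample to at most MAX_KEYS via reservoir sampling, sort, and
# print N_SPLITS - 1 evenly spaced split keys, never the first or last.
import random
import sys

MAX_KEYS = 10000  # cap in the 1,000-10,000 range suggested above
N_SPLITS = 30     # assumed split count; set to whatever N you need

# Reservoir sample (Algorithm R) so memory stays bounded regardless of
# how many keys come in.
sample = []
for i, line in enumerate(sys.stdin):
    key = line.rstrip("\n")
    if len(sample) < MAX_KEYS:
        sample.append(key)
    else:
        j = random.randint(0, i)
        if j < MAX_KEYS:
            sample[j] = key

if len(sample) <= N_SPLITS:
    sys.exit("not enough keys sampled to pick split points")

sample.sort()

# Emit N_SPLITS - 1 evenly spaced keys, skipping the first and last.
for k in range(1, N_SPLITS):
    print(sample[k * len(sample) // N_SPLITS])

You'd run it over whatever key output Pig leaves behind, along the lines of (paths hypothetical):

hadoop fs -cat /tmp/keys/part-* | python pick_splits.py > split.keys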
Note that most of your joins can just go away since all you want are the keys.

On Tue, Mar 29, 2011 at 2:15 PM, Bill Graham <billgra...@gmail.com> wrote:
> 1. use Pig to read in our datasets, join/filter/transform/etc before
> writing the output back to HDFS with N reducers ordered by key, where
> N is the number of splits we'll create.
> 2. Manually plucking out the first key of each reducer output file to
> make a list of split keys.
>
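For reference, the "plucking" in your step 2 only needs the first line of each part file, so it can stay outside MapReduce entirely. A rough sketch (the output path, the tab-delimited key assumption, and driving the hadoop CLI via subprocess are all guesses on my part):

import subprocess

# Hypothetical path to the sorted reducer outputs; adjust to your job.
OUTPUT_DIR = "/user/billgraham/sorted-output"

# List the part files, skipping the "Found N items" header line.
ls = subprocess.check_output(["hadoop", "fs", "-ls", OUTPUT_DIR]).decode()
part_files = [l.split()[-1] for l in ls.splitlines() if "/part-" in l]

for path in sorted(part_files):
    # Stream each file but read only its first line, then stop.
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path],
                           stdout=subprocess.PIPE)
    first_line = cat.stdout.readline().decode().rstrip("\n")
    cat.kill()
    cat.wait()
    # Assumes tab-separated key\tvalue lines; keep just the key.
    print(first_line.split("\t", 1)[0])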