Yeah, the split command comes to mind.

> Note that most of your joins can just go away since all you want are the
> keys.

Sure, but you need to make sure that you're still producing a RowKey for
each key/value, as opposed to a distinct set of RowKeys, right? This is to
make sure you still take into account an uneven distribution of cells as
well as RowKeys.

On Tue, Mar 29, 2011 at 2:56 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> It should be pretty easy to down-sample the data to have no more than
> 1000-10,000 keys. Sort those and take every n-th key omitting the first and
> last key. This last can probably best be done as a conventional script
> after you have knocked down the data to small size.
>
> Note that most of your joins can just go away since all you want are the
> keys.
>
> On Tue, Mar 29, 2011 at 2:15 PM, Bill Graham <billgra...@gmail.com> wrote:
>>
>> 1. use Pig to read in our datasets, join/filter/transform/etc before
>> writing the output back to HDFS with N reducers ordered by key, where
>> N is the number of splits we'll create.
>> 2. Manually plucking out the first key of each reducer output file to
>> make a list of split keys.
>
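For concreteness, here is a minimal sketch of the "conventional script"
step Ted describes: sort the down-sampled keys and take every n-th one,
omitting the first and last key. It assumes the sampled keys have already
been pulled out of HDFS as plain text, one key per line on stdin; the
script name and split count are illustrative, not from the thread.

#!/usr/bin/env python
# Sketch of the "every n-th key" step: reads sampled keys (one per line)
# from stdin, sorts them, and prints num_splits - 1 evenly spaced split
# keys, omitting the first and last key as suggested above.
# Usage (illustrative):
#   hadoop fs -cat sampled/part-* | python pick_splits.py 10 > split.keys
import sys

def pick_split_keys(keys, num_splits):
    keys = sorted(keys)
    step = len(keys) // num_splits
    if step == 0:
        return []
    # Every step-th key, starting at index step (so the first key is
    # skipped) and stopping before the last index (so the last key is
    # never emitted); truncate in case the division left a remainder.
    boundaries = [keys[i] for i in range(step, len(keys) - 1, step)]
    return boundaries[:num_splits - 1]

if __name__ == "__main__":
    num_splits = int(sys.argv[1])
    sampled = [line.rstrip("\n") for line in sys.stdin if line.strip()]
    for key in pick_split_keys(sampled, num_splits):
        print(key)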
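And a similar sketch of step 2 in Bill's outline, plucking the first key
out of each ordered reducer output file to build the split-key list. It
assumes the part-* files have been copied to a local directory and that
the key is the first tab-separated field on each line; the glob pattern
and script name are assumptions, not from the thread.

#!/usr/bin/env python
# Sketch of "plucking out the first key of each reducer output file".
# Assumes local part-* files with the key as the first tab-separated
# field; both layout and paths are illustrative assumptions.
# Usage (illustrative): python first_keys.py 'job_output/part-*'
import glob
import sys

def first_keys(pattern):
    keys = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            line = f.readline()
            if line:
                keys.append(line.split("\t", 1)[0])
    return keys

if __name__ == "__main__":
    # The first reducer's first key is the start of the keyspace, not a
    # split point, so drop it.
    for key in first_keys(sys.argv[1])[1:]:
        print(key)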