> last key.  This last can probably best be done as a conventional script
> after you have knocked down the data to small size.

Yeah, the split command comes to mind.
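
For the "take every n-th key" step, a minimal Python sketch of the kind
of conventional script Ted describes (the file name, key format, and
split count here are illustrative assumptions, not from this thread):

    # Assumes the down-sampled keys sit in a small local file, one key
    # per line; "sampled_keys.txt" is a made-up name.
    N_SPLITS = 10  # illustrative number of splits

    with open("sampled_keys.txt") as f:
        keys = sorted(line.rstrip("\n") for line in f if line.strip())

    # Take every n-th key, omitting the first and last, so the
    # boundaries divide the sampled keys roughly evenly.
    step = max(1, len(keys) // (N_SPLITS + 1))
    split_keys = keys[step : len(keys) - 1 : step][:N_SPLITS]

    for k in split_keys:
        print(k)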

> Note that most of your joins can just go away since all you want are the
> keys.

Sure, but you need to make sure that you're still producing a RowKey
for each key/value, as opposed to a distinct set of RowKeys, right?
That way the sample still accounts for an uneven distribution of
cells as well as of RowKeys.
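
To make that concrete, here's a toy Python sketch (the data and sample
size are invented for illustration) of why sampling one entry per cell,
rather than per distinct RowKey, preserves the skew:

    import random
    random.seed(0)

    # Invented toy data: "hot" row-a has 9x the cells of row-b.
    cells = [("row-a", v) for v in range(900)] + \
            [("row-b", v) for v in range(100)]

    # One sample entry per cell keeps the skew: most samples hit row-a...
    per_cell = [k for k, _ in random.sample(cells, 100)]
    print(per_cell.count("row-a"))   # roughly 90

    # ...while a distinct set of RowKeys treats row-a and row-b as
    # equals, so split keys derived from it ignore the cell skew.
    distinct = sorted({k for k, _ in cells})
    print(distinct)                  # ['row-a', 'row-b']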


On Tue, Mar 29, 2011 at 2:56 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> It should be pretty easy to down-sample the data to have no more than
> 1000-10,000 keys.  Sort those and take every n-th key omitting the first and
> last key.  This last can probably best be done as a conventional script
> after you have knocked down the data to small size.
> Note that most of your joins can just go away since all you want are the
> keys.
>
> On Tue, Mar 29, 2011 at 2:15 PM, Bill Graham <billgra...@gmail.com> wrote:
>>
>> 1. Use Pig to read in our datasets, join/filter/transform/etc. before
>> writing the output back to HDFS with N reducers ordered by key, where
>> N is the number of splits we'll create.
>> 2. Manually pluck out the first key of each reducer output file to
>> make a list of split keys.
>
