Your mileage may vary.

If you are grouping records so that all the results are of equal size, then
deduplicating the keys before sampling is good.  On the other hand, if
repeated keys turn into larger data items because of the grouping, then it
is good for the regionservers holding those big keys to get fewer keys,
which is what sampling without deduplicating first gives you.  You can also
deduplicate the keys after a rough sampling pass.  This gives you results
in between the two approaches.  Keep in mind that this is just an
initialization pass.  It doesn't have to be perfect.
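
To make the three options concrete, here is a rough sketch in plain Python.
It is not HBase or Pig code; the function name, the dedupe parameter and
the toy keys are all made up for illustration, and a real job would sample
from the input rather than hold every key in memory.

```python
import random


def split_points(keys, num_regions, dedupe="none", sample_size=1000, seed=0):
    """Pick num_regions - 1 region start keys from a sample of row keys.

    dedupe (hypothetical option names):
      "none"   - sample with repeats, so heavy keys pull boundaries toward them
      "before" - deduplicate the keys, then sample
      "after"  - sample roughly first, then deduplicate the sample
    """
    rng = random.Random(seed)
    pool = sorted(set(keys)) if dedupe == "before" else list(keys)
    sample = rng.sample(pool, min(sample_size, len(pool)))
    if dedupe == "after":
        sample = list(set(sample))
    sample.sort()
    # Evenly spaced quantiles of the sorted sample become the boundaries.
    points = [sample[i * len(sample) // num_regions]
              for i in range(1, num_regions)]
    # Identical neighbouring boundaries collapse into one.
    return sorted(set(points))


# Toy input: key "a" has four cells, the other keys have one each.
keys = ["a"] * 4 + ["b", "c", "d", "e"]
by_cells = split_points(keys, 2, dedupe="none")
by_rowkeys = split_points(keys, 2, dedupe="before")
```

With the toy input, sampling with repeats puts the boundary at "b", so each
region gets four cells, while deduplicating first puts it at "c", so the
region holding "a" gets five cells and the other gets three.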

On Tue, Mar 29, 2011 at 3:11 PM, Bill Graham <billgra...@gmail.com> wrote:

> > Note that most of your joins can just go away since all you want are the
> > keys.
>
> Sure, but you need to make sure that you're still producing a RowKey
> for each key/value, as opposed to a distinct set of RowKeys, right?
> This is to make sure you still take into account an uneven
> distribution of cells as well as RowKeys.
