I'm not very familiar with Pig, but assuming the reducer output files
are SequenceFiles, steps 2 and 3 can be automated.
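Something along these lines, for example — a rough sketch, not tested against
your setup. It assumes Text keys, an output path, table name and column family
that are all placeholders, and relies on globStatus() returning part files in
sorted (reducer) order:

```java
// Sketch: automate steps 2-3 by reading the first key of each sorted
// reducer output SequenceFile and using those keys as region splits.
// Paths, table name "mytable" and family "cf" are illustrative only.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SplitKeyExtractor {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    List<byte[]> splits = new ArrayList<byte[]>();

    // One sorted SequenceFile per reducer; globStatus returns them in
    // sorted order, matching reducer numbering.
    FileStatus[] parts = fs.globStatus(new Path("/output/dir/part-*"));
    for (int i = 0; i < parts.length; i++) {
      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, parts[i].getPath(), conf);
      try {
        Text key = new Text();
        // Skip file 0: the first region implicitly starts at the
        // empty key, so only files 1..N-1 contribute split points.
        if (reader.next(key) && i > 0) {
          splits.add(Arrays.copyOf(key.getBytes(), key.getLength()));
        }
      } finally {
        reader.close();
      }
    }

    // Step 3: create the table pre-split at those keys.
    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));
    new HBaseAdmin(conf).createTable(desc, splits.toArray(new byte[0][]));
  }
}
```

You could also just print the split keys to a file in step 2 and feed them to
a separate table-creation step, if you'd rather keep the two decoupled.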

On Tue, Mar 29, 2011 at 2:15 PM, Bill Graham <billgra...@gmail.com> wrote:

> I've been thinking about this topic lately so I'll fork from another
> discussion to ask if anyone has a good approach to determining keys
> for pre-splitting from a known dataset. We have a key scenario similar
> to what Ted describes below.
>
> We periodically run MR jobs to transform and bulk load data from HDFS
> into HBase using Pig. The approach I've used to find the best keys for
> the splits is very manual and clunky so I'm wondering if others have a
> better approach, perhaps one that could even lead to automation. :)
>
> Here's what I've done:
>
> 1. Use Pig to read in our datasets, join/filter/transform/etc. before
> writing the output back to HDFS with N reducers ordered by key, where
> N is the number of splits we'll create.
> 2. Manually pluck out the first key of each reducer output file to
> make a list of split keys.
> 3. Create the HBase table with the keys from step 2.
> 4. Re-run step 1, this time removing the 'ORDER BY key' and writing
> to HBase.
>
> The pre-created splits are guaranteed to be evenly distributed, but
> the process of determining the keys to split on isn't ideal. Is there
> a better technique for steps 1-2, one where the split keys can simply
> be written out to a file?
>
> Suggestions?
>
> ---------- Forwarded message ----------
> From: Ted Dunning <tdunn...@maprtech.com>
> Date: Tue, Mar 29, 2011 at 11:38 AM
> Subject: Re: Performance test results
> To: user@hbase.apache.org
> Cc: Jean-Daniel Cryans <jdcry...@apache.org>, Eran Kutner
> <e...@gigya.com>, Stack <st...@duboce.net>
>
>
> Watch out when pre-splitting.  Your key distribution may not be as
> uniform as you might think.  This happens particularly when keys are
> represented in some printable form: Base64, for instance, only
> populates a small fraction of the base-256 key space.
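To make Ted's point concrete: every byte of a Base64-encoded key is one of
only 64 characters, all below 0x80, so splits computed uniformly over the
0–255 byte range would leave roughly half the regions permanently empty. A
small stand-alone illustration:

```java
// Illustration of the Base64 caveat: encoded keys use only 64 of the
// 256 possible byte values, so uniform byte-range splits miss the data.
import java.util.HashSet;
import java.util.Set;

public class Base64KeySpace {
  static final String ALPHABET =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

  // Every byte value that can appear in a Base64-encoded key.
  static Set<Integer> usedLeadingBytes() {
    Set<Integer> bytes = new HashSet<Integer>();
    for (char c : ALPHABET.toCharArray()) {
      bytes.add((int) c);
    }
    return bytes;
  }

  public static void main(String[] args) {
    Set<Integer> used = usedLeadingBytes();
    // All 64 values fall between '+' (43) and 'z' (122): the entire top
    // half of the byte space [0x80, 0xFF] never occurs, so uniform
    // byte-range split points above 0x7A bound regions that hold nothing.
    System.out.println(used.size() + " of 256 leading byte values used");
  }
}
```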
>
> On Tue, Mar 29, 2011 at 10:54 AM, Jean-Daniel Cryans
> <jdcry...@apache.org> wrote:
>
> > - Inserting into a new table without pre-splitting it is bound to be a
> > red herring of bad performance. Please pre-split it with methods such
> > as
> >
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor,%20byte[][])
> >
>
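For reference, using that overload looks roughly like this — table name,
family, and split keys here are purely illustrative placeholders. Region
boundaries end up as [start, k1), [k1, k2), ..., [kN, end):

```java
// Minimal pre-split table creation via createTable(desc, byte[][]).
// "mytable", "cf" and the split keys are placeholders for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf"));

    // Three explicit split keys -> four regions. Per Ted's caveat,
    // these should come from the actual key distribution, not from
    // slicing the byte space uniformly.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("u")
    };
    admin.createTable(desc, splits);
  }
}
```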
