I am not very familiar with Pig, but assuming each reducer output file is a SequenceFile, steps 2 and 3 can be automated.
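A rough sketch of what that automation could look like. For simplicity this simulates the reducer outputs as sorted tab-delimited text files; in a real pipeline they would be SequenceFiles read with Hadoop's SequenceFile.Reader, and the helper names here are made up for illustration.

```python
import os
import tempfile

def first_keys(part_files):
    """Pluck the first (smallest) key from each sorted reducer output file."""
    keys = []
    for path in part_files:
        with open(path) as f:
            line = f.readline()
            if line:
                # Pig's default field delimiter is tab; the key is the first field.
                keys.append(line.rstrip("\n").split("\t")[0])
    return sorted(keys)

def split_keys(part_files):
    # The very first key is not needed as a split point: HBase's first
    # region is implicitly unbounded below, so drop it.
    return first_keys(part_files)[1:]

# Simulate N=4 reducer outputs, each internally sorted by key.
tmpdir = tempfile.mkdtemp()
parts = []
for i, rows in enumerate([["aaa\t1", "abc\t2"], ["bbb\t3"], ["ccc\t4"], ["ddd\t5"]]):
    path = os.path.join(tmpdir, "part-r-%05d" % i)
    with open(path, "w") as f:
        f.write("\n".join(rows) + "\n")
    parts.append(path)

print(split_keys(parts))  # ['bbb', 'ccc', 'ddd']
```

The resulting list could be written to a file and fed straight into table creation, replacing the manual step 2 entirely.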
On Tue, Mar 29, 2011 at 2:15 PM, Bill Graham <billgra...@gmail.com> wrote:
> I've been thinking about this topic lately, so I'll fork from another
> discussion to ask if anyone has a good approach to determining keys
> for pre-splitting from a known dataset. We have a key scenario similar
> to what Ted describes below.
>
> We periodically run MR jobs to transform and bulk load data from HDFS
> into HBase using Pig. The approach I've used to find the best keys for
> the splits is very manual and clunky, so I'm wondering if others have a
> better approach, perhaps one that could even lead to automation. :)
>
> Here's what I've done:
>
> 1. Use Pig to read in our datasets, join/filter/transform/etc. before
> writing the output back to HDFS with N reducers ordered by key, where
> N is the number of splits we'll create.
> 2. Manually pluck out the first key of each reducer output file to
> make a list of split keys.
> 3. Create the HBase table with the keys from step 2.
> 4. Re-run step 1, this time removing the 'ORDER BY key' and
> writing to HBase.
>
> The pre-created splits are guaranteed to be evenly distributed, but
> the process of determining the keys to split on isn't ideal. Is there
> a better technique to do steps 1-2 in a way where the split keys can
> just be output to a file?
>
> Suggestions?
>
> ---------- Forwarded message ----------
> From: Ted Dunning <tdunn...@maprtech.com>
> Date: Tue, Mar 29, 2011 at 11:38 AM
> Subject: Re: Performance test results
> To: user@hbase.apache.org
> Cc: Jean-Daniel Cryans <jdcry...@apache.org>, Eran Kutner
> <e...@gigya.com>, Stack <st...@duboce.net>
>
> Watch out when pre-splitting. Your key distribution may not be as uniform
> as you might think. This particularly happens when keys are represented in
> some printable form. Base 64, for instance, only populates a small fraction
> of the base 256 key space.
> On Tue, Mar 29, 2011 at 10:54 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>
> > - Inserting into a new table without pre-splitting it is bound to be a
> > red herring of bad performance. Please pre-split it with methods such as
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
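The createTable(HTableDescriptor, byte[][]) overload JD links to takes the split keys directly in Java; the HBase shell exposes the same thing through a SPLITS list. A hypothetical helper that turns the keys plucked in step 2 into a shell statement (the table and column family names here are made up):

```python
def create_statement(table, family, split_keys):
    """Render an HBase shell `create` with explicit split points."""
    splits = ", ".join("'%s'" % k for k in split_keys)
    return "create '%s', '%s', SPLITS => [%s]" % (table, family, splits)

stmt = create_statement("mytable", "cf", ["bbb", "ccc", "ddd"])
print(stmt)
# create 'mytable', 'cf', SPLITS => ['bbb', 'ccc', 'ddd']
```

Pasting that into the shell creates the table with four regions, one per pre-computed split, before any load starts.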