The output is a text file. I'm sure I could write something against the
HDFS Java API to pull the first line of each file, but I'm looking for
a way to extract these keys entirely via MR, if possible.
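For context, here's roughly what the HDFS-API version would look like: open
each sorted reducer part file, read its first line, and take the text before
the first tab (TextOutputFormat's default key/value separator). This sketch
runs against the local filesystem as a stand-in; on HDFS you'd open each path
with FileSystem.open() instead, and the class/method names are illustrative,
not from any existing tool.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class FirstKeys {
    // Read the first line (the lowest key, since each reducer's output is
    // sorted) of every part-* file under dir, in part-file order.
    static List<String> firstLines(Path dir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "part-*")) {
            ds.forEach(parts::add);
        }
        parts.sort(null); // part-00000, part-00001, ... in natural order

        List<String> keys = new ArrayList<>();
        for (Path p : parts) {
            try (BufferedReader r = Files.newBufferedReader(p)) {
                String line = r.readLine();
                if (line != null) {
                    // Key is the text up to the first tab
                    // (TextOutputFormat's default separator).
                    int tab = line.indexOf('\t');
                    keys.add(tab >= 0 ? line.substring(0, tab) : line);
                }
            }
        }
        return keys;
    }
}
```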


On Tue, Mar 29, 2011 at 2:33 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> I am not very familiar with Pig.
> Assuming reducer output file is SequenceFile, steps 2 and 3 can be
> automated.
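If the reducer output is a SequenceFile, step 2 becomes a loop over
SequenceFile.Reader instances pulling the first key of each part file, and
step 3 is just handing those keys to createTable. One subtlety worth noting:
the first file's first key should be dropped, since the split array defines
region *boundaries* and the first region is already unbounded below. A sketch
of that glue (the pure conversion is real Java; the Hadoop/HBase calls in the
comment assume the 0.90-era API and are not tested here):

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SplitKeys {
    // Convert the sorted first keys of N reducer part files into the
    // (N-1)-element split array for HBaseAdmin.createTable(desc, splits).
    // firstKeys.get(0) is dropped: the first region covers everything
    // below splits[0], so the lowest key must not be a boundary.
    static byte[][] toSplits(List<String> firstKeys) {
        byte[][] splits = new byte[firstKeys.size() - 1][];
        for (int i = 1; i < firstKeys.size(); i++) {
            splits[i - 1] = firstKeys.get(i).getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }

    // Then, roughly (HBase 0.90-era API, assumed):
    //   HBaseAdmin admin = new HBaseAdmin(conf);
    //   admin.createTable(new HTableDescriptor("mytable"), toSplits(keys));
}
```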
>
> On Tue, Mar 29, 2011 at 2:15 PM, Bill Graham <billgra...@gmail.com> wrote:
>>
>> I've been thinking about this topic lately so I'll fork from another
>> discussion to ask if anyone has a good approach to determining keys
>> for pre-splitting from a known dataset. We have a key scenario similar
>> to what Ted describes below.
>>
>> We periodically run MR jobs to transform and bulk load data from HDFS
>> into HBase using Pig. The approach I've used to find the best keys for
>> the splits is very manual and clunky so I'm wondering if others have a
>> better approach, perhaps one that could even lead to automation. :)
>>
>> Here's what I've done:
>>
>> 1. Use Pig to read in our datasets and join/filter/transform/etc.
>> before writing the output back to HDFS with N reducers, ordered by
>> key, where N is the number of splits we'll create.
>> 2. Manually pluck out the first key of each reducer output file to
>> make a list of split keys.
>> 3. Create the HBase table with the keys from step 2.
>> 4. Re-run step 1, this time removing the 'ORDER BY key' and writing
>> to HBase.
>>
>> The pre-created splits are guaranteed to be evenly distributed, but
>> the process of determining the keys to split on isn't ideal. Is there
>> a better technique for steps 1-2 that writes the split keys directly
>> to a file?
>>
>> Suggestions?
>>
>> ---------- Forwarded message ----------
>> From: Ted Dunning <tdunn...@maprtech.com>
>> Date: Tue, Mar 29, 2011 at 11:38 AM
>> Subject: Re: Performance test results
>> To: user@hbase.apache.org
>> Cc: Jean-Daniel Cryans <jdcry...@apache.org>, Eran Kutner
>> <e...@gigya.com>, Stack <st...@duboce.net>
>>
>>
>> Watch out when pre-splitting.  Your key distribution may not be as
>> uniform as you might think.  This particularly happens when keys are
>> represented in some printable form.  Base 64, for instance, only
>> populates a small fraction of the base-256 key space.
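Ted's Base 64 example is easy to quantify: the 64-character alphabet covers
only a quarter of the 256 possible first bytes, so evenly spaced byte-boundary
splits mostly land in ranges no key can ever start with. A quick self-contained
illustration (the 16-region figure is just an example):

```java
public class Base64Skew {
    static final String B64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Fraction of the byte space a Base64-encoded key's first byte
    // can actually occupy: 64/256.
    static double coverage() {
        return B64.length() / 256.0;
    }

    // Count how many of n evenly spaced byte-boundary split points fall
    // on a byte that no Base64 key can start with, i.e. splits that
    // create permanently empty regions.
    static int deadSplits(int n) {
        int dead = 0;
        for (int i = 1; i < n; i++) {
            int b = i * 256 / n;
            if (B64.indexOf((char) b) < 0) dead++;
        }
        return dead;
    }
}
```

With 16 regions, 12 of the 15 byte-boundary split points fall on bytes outside
the Base64 alphabet — which is why sampling real keys (as in Bill's steps 1-2)
beats splitting the raw byte space.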
>>
>> On Tue, Mar 29, 2011 at 10:54 AM, Jean-Daniel Cryans
>> <jdcry...@apache.org> wrote:
>>
>> > - Inserting into a new table without pre-splitting it is bound to be a
>> > red herring of bad performance. Please pre-split it with methods such as
>> >
>> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
>> >
>
>
