Jim,

This looks like a sane way to do what you want.  Is your question strictly
about other ways to get the same data layout into HBase from the MR job, or
also about the choice of structure?

As for other ways to use HBase as a data sink, you can make use of
TableOutputFormat (TOF).  In my experience it has been faster to use the
API directly, as you are doing right now, but TOF can batch your
BatchUpdates for you automatically.

http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/mapred/TableOutputFormat.html

One thing that will really help with performance is to not create a new
HTable on every reduce() call.  There was a post in the past day or two
regarding this.  Pull it out into a member of the class, initialize it once
when the task is set up (e.g. in configure()), and just reuse the same one
for every call in that reducer task.
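
As a rough sketch of that pattern: the classes below are plain Java with no
Hadoop or HBase dependencies, so they are not real HBase code.  FakeTable is
a made-up stand-in for HTable that only counts how many handles get created,
and the no-argument configure() stands in for the per-task setup hook
(configure(JobConf) in the old mapred API).  The names are assumptions for
illustration only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an HBase table handle; NOT the real HTable API.
// It counts how often a handle is created so the effect of reuse is visible.
class FakeTable {
    static int opened = 0;                        // number of handles created
    final List<String> rows = new ArrayList<>();  // committed "row=value" pairs

    FakeTable(String name) {
        opened++;  // in real code, this is the expensive step
    }

    void commit(String row, int value) {
        rows.add(row + "=" + value);
    }
}

// Reducer-shaped class: the table is a member, created once per task.
class PhraseCountReducer {
    private FakeTable table;  // reused by every reduce() call

    // Stands in for the per-task setup hook; runs once before any reduce().
    void configure() {
        table = new FakeTable("phrases");
    }

    // Called once per key: sums the counts and writes a single row.
    void reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        table.commit(key, sum);
    }

    List<String> committedRows() {
        return table.rows;
    }
}
```

In your actual reducer the stand-in would be your HTable and BatchUpdate
calls, with the HTable constructed once in the setup hook instead of inside
reduce().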

JG

> -----Original Message-----
> From: Jim Twensky [mailto:jim.twen...@gmail.com]
> Sent: Monday, December 22, 2008 12:38 PM
> To: hbase-user@hadoop.apache.org
> Subject: Using Hbase as data sink
> 
> Hello,
> I have an application which is similar to the word count example given on
> the Hadoop Map/Reduce tutorial. Instead of counting the words, however, I
> count the phrases. That is, for a sentence like:
> 
> "How are you"
> 
> I emit the following phrases inside my mapper:
> 
> How 1
> How are 1
> How are you 1
> are 1
> are you 1
> you 1
> 
> and then inside the reducer, I aggregate the same keys and send them to an
> output file.
> 
> However, I want to load these (phrase, count) pairs into HBase instead of
> storing them in a file. I've already written the code and it works, but I
> have some concerns about its performance and I'm not sure if this is the
> right way to do it. Here is how my reducer looks:
> 
> public class Reduce extends MapReduceBase
>         implements Reducer<Text, IntWritable, Text, IntWritable> {
> 
>     public void reduce(Text key, Iterator<IntWritable> values,
>             OutputCollector<Text, IntWritable> output, Reporter reporter)
>             throws IOException {
>         int sum = 0;
>         while (values.hasNext()) {
>             sum += values.next().get();
>         }
> 
>         HBaseConfiguration conf = new HBaseConfiguration();
>         HTable table = new HTable(conf, "phrases");
> 
>         String row = key.toString();
>         BatchUpdate update = new BatchUpdate(row);
>         update.put("counters:value", Bytes.toBytes(sum));
>         table.commit(update);
> 
>     }
> }
> 
> Now as you can see, my reducer has an output collector of type
> <Text,IntWritable> but I don't call the output collector at all. Instead I
> load the data into HBase via table.commit.
> I also use NullOutputFormat to avoid getting any empty output files.
> 
> This code works and does what I want, but I'm not convinced that this is
> the right way to do it. I tried to go over the example code like
> BuildTableIndex.java and the others, but all of them already had reducers
> of the following form:
> 
> reduce(ImmutableBytesWritable key, Iterator<RowResult> values,
>       OutputCollector<ImmutableBytesWritable, LuceneDocumentWrapper> output,
>       @SuppressWarnings("unused") Reporter reporter)
> 
> which is not how I get the intermediate key,value pairs from the mapper
> into the reducer.
> 
> Can you give me some advice and a few lines of sample source code if there
> is a way to load the data using an output collector? Basically, I'm
> confused about how to specify the reducer input parameters and which class
> to subclass. I posted this to the mailing list because I couldn't find any
> more examples anywhere else.
> 
> Thanks in advance,
> Jim
