Thanks Ryan, Kevin and Stack for your helpful answers and recommendations!

On Oct 20, 2009, at 1:58 AM, Ryan Rawson wrote:

I have to recommend doing the puts via the API straight in the mapper.
Passing all your data through the shuffle is not necessary, since
inserting into HBase is a form of sorting. Besides, let's not copy a
100 GB import more times than we have to, right?
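
For concreteness, a rough sketch of that direct-to-HBase route with the 0.20 client API in a map-only job. The table name "mytable", the family/qualifier names, and the row-key derivation below are placeholders, not anything from this thread:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only import: each map task writes straight to HBase, so nothing
// goes through the shuffle.  Output types are NullWritable because we
// never call context.write().
public class DirectPutMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // Assumes the HBase config (ZooKeeper quorum etc.) is visible to the tasks.
    table = new HTable(new HBaseConfiguration(context.getConfiguration()), "mytable");
  }

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException {
    byte[] rowId = Bytes.toBytes(line.toString());            // placeholder row key
    Put put = new Put(rowId);
    put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"),   // placeholder family/qualifier
        Bytes.toBytes(line.toString()));
    table.put(put);   // write directly; no context.write() needed
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();   // push anything still buffered on the client
  }
}

In the driver you would also call job.setNumReduceTasks(0), so the data is never copied through a shuffle at all.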

On Mon, Oct 19, 2009 at 11:41 PM, Kevin Peterson <[email protected]> wrote:
On Mon, Oct 19, 2009 at 7:40 PM, yz5od2 <woods5242- [email protected]> wrote:

OK, so what you are saying is that my mapper should talk directly to HBase to write the data into it? Or should I define my Mapper implementation class like

Mapper<LongWritable,Text,Text,byte[]>


Your Mapper must output a Hadoop Writable. You have two options:

1. Handle HBase all yourself, and just use Hadoop as a way to distribute your load and data across your cluster. Then you can just use NullWritables and not call output.collect (0.19 API) or context.write (0.20 API) at all.

2. Output HBase Puts and Deletes from the Mapper and use TableOutputFormat. Put and Delete implement Writable but don't share a more specific superclass, so the signature for the Mapper is the somewhat confusing <K1, V1, K2, Writable>, where K1 and V1 are whatever is needed for your input, and K2 is completely ignored.

The second one would involve writing less code. You would do something like this:

byte[] rowId = ...;                  // row key for this record
byte[] content = pojo.serialize();   // your serialized POJO (e.g. the Thrift bytes)
Put put = new Put(rowId);
// family "content", qualifier "thrift-thingie", value = the serialized bytes
put.add(Bytes.toBytes("content"), Bytes.toBytes("thrift-thingie"), content);
context.write(NullWritable.get(), put);   // the key is ignored by TableOutputFormat

As Ryan says, you don't want to use Hadoop writables as your serialization scheme, but they are part of the API to pass data to an output format.
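
For what it's worth, the driver-side wiring for that route looks roughly like this, assuming the org.apache.hadoop.hbase.mapreduce.TableOutputFormat from 0.20; the mapper class name PutEmittingMapper and the table name are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PutImportDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "hbase import");
    job.setJarByClass(PutImportDriver.class);

    job.setMapperClass(PutEmittingMapper.class);   // your mapper emitting (NullWritable, Put)
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // TableOutputFormat writes each emitted Put to the named table;
    // the output key is ignored, so NullWritable is fine.
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "mytable");
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Put.class);

    job.setNumReduceTasks(0);   // map-only: no reason to sort/shuffle the Puts
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

TableMapReduceUtil in the same package can set up most of this for you as well.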

I don't know if the first has any advantages. Probably flexibility, and
better control over details like when to flush the commits.
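
That flush control looks roughly like this with the direct-HTable approach, inside the mapper sketched earlier; the buffer size is just an arbitrary example:

// In setup(): buffer puts client-side instead of sending an RPC per put(),
// then decide yourself when they go out.
table.setAutoFlush(false);
table.setWriteBufferSize(12 * 1024 * 1024);   // e.g. a 12 MB write buffer

// table.put(put) calls made in map() now accumulate in the buffer.

// In cleanup(): push whatever is still buffered before the task exits.
table.flushCommits();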

