Thanks for the answer, Todd. I realized I was making my life harder by using the low-level record writer directly. Instead I made the mapper output <ImmutableBytesWritable, KeyValue> pairs and set the output format to HFileOutputFormat. It works really well!

I have a follow-up question: after I run the loadtable.rb script, it takes a little while before the table is actually ready to be queried. Is there a way to programmatically test whether the table is "ready"? I am using hbase-0.20.6.
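In case it's useful to anyone else, the map-only job I ended up with looks roughly like the sketch below. The column family, qualifier, and tab-separated "rowkey<TAB>value" input format are just placeholders for my actual schema:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // Map-only bulk load: input lines are "rowkey<TAB>value", already
  // sorted by row key, so no reducer (and no re-sort) is needed.
  public class BulkLoadMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

    private static final byte[] FAMILY = Bytes.toBytes("colfam");  // placeholder
    private static final byte[] QUALIFIER = Bytes.toBytes("q");    // placeholder

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(fields[0]);
      KeyValue kv = new KeyValue(row, FAMILY, QUALIFIER, Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), kv);
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "hfile-bulk-load");
      job.setJarByClass(BulkLoadMapper.class);
      job.setMapperClass(BulkLoadMapper.class);
      job.setNumReduceTasks(0);  // input is pre-sorted
      // With zero reducers the map output goes straight to the output format.
      job.setOutputKeyClass(ImmutableBytesWritable.class);
      job.setOutputValueClass(KeyValue.class);
      job.setOutputFormatClass(HFileOutputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }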
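On the "is the table ready" question, the closest thing I've come up with is polling HBaseAdmin and then forcing region lookups, something like the sketch below. I'm assuming HBaseAdmin.isTableEnabled and HTable.getStartKeys behave this way on 0.20.6, and "mytable" is a placeholder name:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.HTable;

  // Poll until the table reports enabled, then force region lookups
  // by fetching the region start keys from .META.
  public class WaitForTable {
    public static void main(String[] args) throws Exception {
      HBaseConfiguration conf = new HBaseConfiguration();
      HBaseAdmin admin = new HBaseAdmin(conf);
      while (!admin.isTableEnabled("mytable")) {  // "mytable" is a placeholder
        Thread.sleep(1000);
      }
      HTable table = new HTable(conf, "mytable");
      table.getStartKeys();  // does region lookups against .META.
    }
  }

Is something like that reasonable, or is there a more direct signal? Thanks!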
On Wed, Jan 5, 2011 at 6:48 PM, Todd Lipcon <[email protected]> wrote:
> Hi Nanheng,
>
> It sounds like you're on the right path; you're probably just missing
> the "commit" step when using the output format.
>
> The layout of the output dir should look something like:
> output/
> output/colfam/
> output/colfam/234923423
> output/colfam/349593453 <-- these are just unique IDs
>
> Thanks
> -Todd
>
> On Wed, Jan 5, 2011 at 3:54 PM, Nanheng Wu <[email protected]> wrote:
>
>> Hi,
>>
>> I am new to HBase and Hadoop and I am trying to find the best way to
>> bulk load a table from HDFS into HBase. I don't mind creating a new
>> table for each batch, and from what I understand, using
>> HFileOutputFormat directly in a MR job is the most efficient method.
>> My input data set is already in sorted order, so it seems to me that
>> I don't need reducers, which would require me to globally sort
>> already sorted data. I tried to use HFileOutputFormat.getRecordWriter
>> in my mapper with 0 reducers, but the output directory has only a
>> _temporary directory with my outputs in each subdirectory. That
>> doesn't seem to be what the loadtable script expects (a column family
>> directory with HFiles). Can someone tell me whether what I am doing
>> makes sense in general, or how to do this properly? Thanks!
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
