I also recently had this problem, trying to index 6+ billion records into
HBase.  The job would run for about 4 hours before it brought down the entire
cluster, and at that point it was only around 60% complete.

After trying a bunch of things, we switched to bulk loading.  This is actually
pretty easy; the hardest part is that you need to have a table ready
with the region splits you are going to use.  Region splits aside, there
are two steps:

1) Change your job so that, instead of executing your Puts, it just outputs
them using context.write.  Put is Writable.  (We used ImmutableBytesWritable
as the key, representing the rowKey.)
2) Add another job that reads that output and configure it
using HFileOutputFormat.configureIncrementalLoad(Job job, HTable table);
this will add the right reducer.  A rough sketch of both steps follows.
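
Treat the sketch below as an illustration only, not exactly what we ran: the
class, table, and path names are made up, the mapper's input types are
placeholders for whatever your Avro input format gives you, and you should
double-check the signatures against the HBase/Hadoop versions you are running.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

    // Step 1: instead of calling table.put(put), emit (rowKey, Put) pairs.
    // The input key/value types are placeholders -- substitute whatever your
    // Avro input format actually hands you.  Have this first job write its
    // output as a SequenceFile of (ImmutableBytesWritable, Put) so job 2 can
    // read it.
    public static class PutEmittingMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            byte[] rowKey = Bytes.toBytes(value.toString());   // build your real rowKey here
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("info"), Bytes.toBytes("frequency"), Bytes.toBytes("0"));
            // Put is Writable, so it can go straight to the job output.
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    // Step 2: a second job reads the (rowKey, Put) pairs and turns them into
    // HFiles for the (pre-split) target table.
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "puts-to-hfiles");
        job.setJarByClass(BulkLoadSketch.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(Mapper.class);                 // identity mapper: keys/values pass through
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path("/tmp/puts"));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));

        // Hooks up the right reducer/partitioner and matches the number of
        // reducers to the regions of the target table.
        HTable table = new HTable(conf, "my_table");
        HFileOutputFormat.configureIncrementalLoad(job, table);
        job.waitForCompletion(true);
    }
}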

Once those two have run, you can finalize the process using the
completebulkload tool documented at http://hbase.apache.org/bulk-loads.html
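
If you'd rather drive that last step from Java instead of the command line,
LoadIncrementalHFiles should do the same thing programmatically; something
like the fragment below, where the path and table name are placeholders
(check the exact constructor/method against your HBase version).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

// Programmatic equivalent of the completebulkload tool: moves the HFiles
// produced by the second job into the regions of the live table.
// "/tmp/hfiles" and "my_table" are placeholders.
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "my_table");
new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table);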

For the region splits problem, we created another job that sorted all of
the Puts by key (Hadoop does this automatically) and used a single
reducer.  It stepped through all of the Puts, accumulating the total size
until it reached some threshold.  When it did, it recorded that row's byte
array and used it as the start of the next region.  We used the result of
this job to create the new, pre-split table.  There is probably a better
way to do this, but it takes like 20 minutes to write.
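
For what it's worth, that split-calculating reducer boils down to something
like the sketch below.  The threshold, table name, and column family are just
examples, and Put.heapSize() is only a rough stand-in for the real on-disk
size, but it was good enough for us.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SplitCalculator {

    // Single reducer over all (rowKey, Put) pairs; the shuffle sorts them by
    // rowKey for free.  Whenever the running size crosses the threshold, the
    // current rowKey is emitted as the start of the next region.
    public static class SplitPointReducer
            extends Reducer<ImmutableBytesWritable, Put, ImmutableBytesWritable, NullWritable> {

        private static final long REGION_SIZE_THRESHOLD = 10L * 1024 * 1024 * 1024; // ~10 GB, pick your own
        private long runningSize = 0;

        @Override
        protected void reduce(ImmutableBytesWritable key, Iterable<Put> puts, Context context)
                throws IOException, InterruptedException {
            for (Put put : puts) {
                runningSize += put.heapSize();              // rough proxy for the data size
            }
            if (runningSize >= REGION_SIZE_THRESHOLD) {
                context.write(key, NullWritable.get());     // this rowKey starts the next region
                runningSize = 0;
            }
        }
    }

    // Create the target table pre-split on the keys collected above.
    // "my_table" and "info" are placeholders; splitKeys is whatever you read
    // back from the reducer's output.
    public static void createPreSplitTable(Configuration conf, byte[][] splitKeys)
            throws IOException {
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("my_table");
        desc.addFamily(new HColumnDescriptor("info"));
        admin.createTable(desc, splitKeys);
    }
}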

This whole process took less than an hour, with the bulk load part only
taking 15 minutes.  Much better!

On Wed, May 9, 2012 at 11:08 AM, Something Something <
mailinglist...@gmail.com> wrote:

> Hey Oliver,
>
> Thanks a "billion" for the response -:)  I will take any code you can
> provide even if it's a hack!  I will even send you an Amazon gift card -
> not that you care or need it -:)
>
> Can you share some performance statistics?  Thanks again.
>
>
> On Wed, May 9, 2012 at 8:02 AM, Oliver Meyn (GBIF) <om...@gbif.org> wrote:
>
> > Heya Something,
> >
> > I had a similar task recently, and by far the best way to go about this is
> > with bulk loading after pre-splitting your target table.  As you know,
> > ImportTsv doesn't understand Avro files, so I hacked together my own
> > ImportAvro class to create the HFiles that I eventually moved into HBase
> > with completebulkload.  I haven't committed my class anywhere because it's
> > a pretty ugly hack, but I'm happy to share it with you as a starting point.
> > Doing billions of puts will just drive you crazy.
> >
> > Cheers,
> > Oliver
> >
> > On 2012-05-09, at 4:51 PM, Something Something wrote:
> >
> > > I ran the following MR job that reads Avro files & puts them into HBase.
> > > The files have tons of data (billions of records).  We have a fairly
> > > decent-sized cluster.  When I ran this MR job, it brought down HBase.
> > > When I commented out the Puts to HBase, the job completed in 45 seconds
> > > (yes, that's seconds).
> > >
> > > Obviously, my HBase configuration is not ideal.  I am using all the
> > > default HBase configurations that come out of Cloudera's distribution:
> > > 0.90.4+49.
> > >
> > > I am planning to read up on the following two:
> > >
> > > http://hbase.apache.org/book/important_configurations.html
> > > http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/
> > >
> > > But can someone quickly take a look and recommend a list of priorities,
> > > such as "try this first..."?  That would be greatly appreciated.  As
> > > always, thanks for the time.
> > >
> > >
> > > Here's the Mapper. (There's no reducer):
> > >
> > >
> > >
> > > public class AvroProfileMapper extends AvroMapper<GenericData.Record, NullWritable> {
> > >     private static final Logger logger = LoggerFactory.getLogger(AvroProfileMapper.class);
> > >
> > >     final private String SEPARATOR = "*";
> > >
> > >     private HTable table;
> > >
> > >     private String datasetDate;
> > >     private String tableName;
> > >
> > >     @Override
> > >     public void configure(JobConf jobConf) {
> > >         super.configure(jobConf);
> > >         datasetDate = jobConf.get("datasetDate");
> > >         tableName = jobConf.get("tableName");
> > >
> > >         // Open table for writing
> > >         try {
> > >             table = new HTable(jobConf, tableName);
> > >             table.setAutoFlush(false);
> > >             table.setWriteBufferSize(1024 * 1024 * 12);
> > >         } catch (IOException e) {
> > >             throw new RuntimeException("Failed table construction", e);
> > >         }
> > >     }
> > >
> > >     @Override
> > >     public void map(GenericData.Record record, AvroCollector<NullWritable> collector,
> > >                     Reporter reporter) throws IOException {
> > >
> > >         String u1 = record.get("u1").toString();
> > >
> > >         GenericData.Array<GenericData.Record> fields =
> > >                 (GenericData.Array<GenericData.Record>) record.get("bag");
> > >         for (GenericData.Record rec : fields) {
> > >             Integer s1 = (Integer) rec.get("s1");
> > >             Integer n1 = (Integer) rec.get("n1");
> > >             Integer c1 = (Integer) rec.get("c1");
> > >             Integer freq = (Integer) rec.get("freq");
> > >             if (freq == null) {
> > >                 freq = 0;
> > >             }
> > >
> > >             String key = u1 + SEPARATOR + n1 + SEPARATOR + c1 + SEPARATOR + s1;
> > >             Put put = new Put(Bytes.toBytes(key));
> > >             put.setWriteToWAL(false);
> > >             put.add(Bytes.toBytes("info"), Bytes.toBytes("frequency"), Bytes.toBytes(freq.toString()));
> > >             try {
> > >                 table.put(put);
> > >             } catch (IOException e) {
> > >                 throw new RuntimeException("Error while writing to " + table + " table.", e);
> > >             }
> > >         }
> > >         logger.error("------------  Finished processing user: " + u1);
> > >     }
> > >
> > >     @Override
> > >     public void close() throws IOException {
> > >         table.close();
> > >     }
> > > }
> >
> >
> > --
> > Oliver Meyn
> > Software Developer
> > Global Biodiversity Information Facility (GBIF)
> > +45 35 32 15 12
> > http://www.gbif.org
> >
> >
>
