I'm trying to load data into a table from a Hadoop map job. I have a main
table that stores an average of about 2 KB per row, and I want to have two
additional index tables, which index 10-20 byte keys in the primary table. I
have used TableIndexed and it worked beautifully on small scale testing.

When I tried to use it at a larger scale, it seems to just freeze up. I see
the Hadoop jobs get through maybe 2.5 million records at a good pace, and
then they just hang. Eventually Hadoop kills the jobs after they haven't
responded for 40 minutes. I don't see anything in the logs (though I
wouldn't know what to look for).

In comparison, when I remove the TableIndexed region server from
hbase-site.xml, I'm able to easily load my full batch of 12 million records
in an hour.

Details of cluster:
1 node ZooKeeper and HBase Master
4 nodes ZooKeeper, Region Server and DataNode
4 hadoop datanode / tasktrackers with 3 map slots each
1 hadoop namenode and jobtracker

All nodes are EC2 large instances: 2 cores, 8 GB RAM, two local 500 GB disks.

I have not tuned any memory or performance related settings. I turn on
TableIndexed by setting hbase.regionserver.class to
org.apache.hadoop.hbase.ipc.IndexedRegionInterface and
hbase.regionserver.impl to
org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer. I'm
using HBase 0.20.1 RC1, with the transactional jar compiled from 0.20.0 with
HBASE-1885, which includes my index key creator.
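
For reference, those two settings would look roughly like this in
hbase-site.xml. This is a sketch of just the two properties named above; the
rest of the file is omitted, and the comments are my reading of what each
setting does:

```xml
<configuration>
  <!-- RPC interface that the region server exposes to clients -->
  <property>
    <name>hbase.regionserver.class</name>
    <value>org.apache.hadoop.hbase.ipc.IndexedRegionInterface</value>
  </property>
  <!-- Region server implementation that maintains the index tables -->
  <property>
    <name>hbase.regionserver.impl</name>
    <value>org.apache.hadoop.hbase.regionserver.tableindexed.IndexedRegionServer</value>
  </property>
</configuration>
```

Removing both properties is what I mean by turning TableIndexed off for the
comparison run.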

The behavior makes me think it's something like I need to call commit, but I
can't find anything mentioned. Any ideas?
