Hi all,

I am using Hadoop (0.20.2) and HBase to periodically import data (every 15 
minutes). There are a number of import processes, but generally they all create 
a sequence file on HDFS, which is then run through a MapReduce job. The job 
uses the identity mapper (the input is a Hadoop sequence file) and a 
specialized reducer that does the following (a rough code sketch follows the 
list):
- Combine the values for a key into one value
- Do a Get from HBase to retrieve existing values for the same key
- Combine the existing value from HBase and the new one into one value again
- Put the final value into HBase under the same key (thus overwriting the 
existing row; I keep only one version)
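
For context, here is a minimal sketch of what such a reducer looks like. The 
class, table, column family and qualifier names are made up for illustration, 
and the merging logic is reduced to placeholders; the real code lives in the 
net.ripe.inrdb classes visible in the stack trace further down. It uses the 
HBase 0.20 client API:

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class StoreInsertReducerSketch
    extends Reducer<ImmutableBytesWritable, BytesWritable, NullWritable, NullWritable> {

  private static final byte[] FAMILY = Bytes.toBytes("data");      // illustrative family
  private static final byte[] QUALIFIER = Bytes.toBytes("value");  // illustrative qualifier

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // One HTable instance per reduce task (HBase 0.20 client API).
    table = new HTable(new HBaseConfiguration(), "records");
  }

  @Override
  protected void reduce(ImmutableBytesWritable key, Iterable<BytesWritable> values,
                        Context context) throws IOException, InterruptedException {
    byte[] row = key.get();  // row key bytes

    // 1. Combine the incoming values for this key into one value.
    byte[] combined = combine(values);

    // 2. Get the existing value for the same key from HBase.
    //    This is the call that hangs after the upgrade to 0.20.4.
    Result existing = table.get(new Get(row));

    // 3. Combine the existing value (if any) with the new one.
    if (!existing.isEmpty()) {
      combined = merge(existing.value(), combined);
    }

    // 4. Put the final value back under the same row key, overwriting the
    //    existing row (only one version is kept).
    Put put = new Put(row);
    put.add(FAMILY, QUALIFIER, combined);
    table.put(put);
  }

  // Application-specific merging, reduced to placeholders here.
  private byte[] combine(Iterable<BytesWritable> values) { return new byte[0]; }
  private byte[] merge(byte[] existing, byte[] fresh) { return fresh; }
}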

After I upgraded HBase to the 0.20.4 release, the reducers sometimes start 
hanging on a Get. When the jobs start, some reducers run to completion just 
fine, but after a while the last reducers start to hang. Eventually those 
reducers are killed off by Hadoop (after 600 seconds).
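
(As an aside on the 600 seconds: that is simply Hadoop's default task timeout, 
mapred.task.timeout = 600000 ms. Raising it in the job driver only postpones 
the kill; it does not address the hang itself. A hypothetical fragment:

// Fragment from a job driver (names are illustrative); 30 minutes instead of
// the default 10. This only delays the kill of a hung reducer.
Job job = new Job(new Configuration(), "periodic-import");
job.getConfiguration().setLong("mapred.task.timeout", 30 * 60 * 1000L);
)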

I did a thread dump for one of the hanging reducers. It looks like this:
"main" prio=10 tid=0x0000000048083800 nid=0x4c93 in Object.wait() 
[0x00000000420ca000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaaaeb50d70> (a 
org.apache.hadoop.hbase.ipc.HBaseClient$Call)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:721)
        - locked <0x00002aaaaeb50d70> (a 
org.apache.hadoop.hbase.ipc.HBaseClient$Call)
        at 
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
        at $Proxy2.get(Unknown Source)
        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:450)
        at org.apache.hadoop.hbase.client.HTable$4.call(HTable.java:448)
        at 
org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1050)
        at org.apache.hadoop.hbase.client.HTable.get(HTable.java:447)
        at 
net.ripe.inrdb.hbase.accessor.real.HBaseTableAccessor.get(HBaseTableAccessor.java:36)
        at 
net.ripe.inrdb.hbase.store.HBaseStoreUpdater.getExistingRecords(HBaseStoreUpdater.java:101)
        at 
net.ripe.inrdb.hbase.store.HBaseStoreUpdater.mergeTimelinesWithExistingRecords(HBaseStoreUpdater.java:60)
        at 
net.ripe.inrdb.hbase.store.HBaseStoreUpdater.doInsert(HBaseStoreUpdater.java:40)
        at 
net.ripe.inrdb.core.store.SinglePartitionStore$Updater.insert(SinglePartitionStore.java:92)
        at 
net.ripe.inrdb.core.store.CompositeStore$CompositeStoreUpdater.insert(CompositeStore.java:142)
        at 
net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:70)
        at 
net.ripe.inrdb.importer.StoreInsertReducer.reduce(StoreInsertReducer.java:17)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

So the client hangs in a wait() call, waiting on an HBaseClient$Call object. I 
looked at the code: the wait() sits inside a while() loop and has no timeout, 
so it never returns if notify() is never called on that object. I am not sure 
exactly what condition it is waiting for, though.
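
For reference, the blocking pattern in HBaseClient.call() is roughly this 
(paraphrased from memory of the 0.20.x source, not a verbatim copy):

synchronized (call) {
  while (!call.done) {
    try {
      // No timeout: if the connection goes away without the call being
      // marked done, and notify() is never invoked, this waits forever.
      call.wait();
    } catch (InterruptedException ignored) {
      // interrupted: loop and wait again
    }
  }
  // error/return handling of the completed call follows here
}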

Meanwhile, once this has happened, I cannot shut down the master server 
normally; I have to kill -9 it to make it stop. Before this problem occurs, 
the master shuts down just fine. (Sorry, I didn't take a thread dump of the 
master, and I have since downgraded to 0.20.3 again.)

I cannot reproduce this error on my local setup (developer machine). It only 
occurs on our (currently modest) cluster: one machine running the master, 
NameNode and ZooKeeper, plus four DataNodes that also act as TaskTrackers and 
RegionServers. The inputs to the periodic MapReduce jobs are very small 
(ranging from a few KB to several MB) and thus contain relatively few records. 
I know that, because of job startup overhead, doing this in MapReduce is not 
very efficient and direct insertion by the importer process would be faster, 
but we are setting up this architecture of importers and MapReduce insertion 
in anticipation of larger loads (up to 80 million records per day).

Does anyone have a clue about what is happening, or where I should look to 
investigate further?

Thanks a lot!


Cheers,
Friso
