Try less concurrent reducers at a time. The issue happened late in the game? The insert into HBase only starts at the frameworks' 66% mark IIRC. If not that, then there was probably a build up over the life of the insert that began to overwhelm.
St.Ack On Fri, Feb 5, 2010 at 4:12 AM, Victor Hsieh <[email protected]> wrote: > Hi Stack, > > Yes. That machine was heavily loaded, in a development environment. > On each node, we have datanode, tasktracker and regionserver. In the > MR job, the task is more and less like loading data from the disk, > processing them and putting into hbase. We had 4 reducers and 5 > region servers in this case. > > Though the I/O was heavy, the problem didn't occur quickly but very late. > > Besides, I've also observed that with WAL turned on, it failed with > our smaller dataset (about 7G). In contrast, turning off WAL made it > pass, while still failed with larger dataset (about 36G), and happened > lately. > > Thanks, > Victor > > On Fri, Feb 5, 2010 at 2:43 PM, Stack <[email protected]> wrote: >> On Wed, Feb 3, 2010 at 6:37 PM, Victor Hsieh <[email protected]> wrote: >>> Hi, >>> >>> I'm running MapReduce with TableOutputFormat, and seems to have >>> similar trouble as HBASE-1603. After digging to the log, I realized >>> that the time when my reducer failed might possibly related to region >>> "callgraph,0958HGCAKZ,1265213010477" split. Could you guys help to >>> check it out? >> >> Yeah, looks similiar. >> >> This is suspicious: >> >>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on >>> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec >>> 2010-02-04 00:38:35,492 INFO >> >> Can you tell what the above was about? Lots of files to compact or >> was it just that the machine was super loaded? Can you lightening the >> loading on hbase? How many concurrent reducers running at any one >> time? >> >> Here's another indicator of heavy loading: >> >>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush >>> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms, >>> sequence id=55264870, compaction requested=true >> >> Thats 27 seconds doing the flush closing region as part of the split >> operation (see how writes are disabled just above it). Its only >> writing 54M. Shouldn't be taking that long. >> >> We're at 00:39:02 when we we close parent region. >> >> Says split took 2 minutes. Thats a long time. It looks like machines >> are heavily loaded. Are you running tasktrackers beside the >> datanodes. Are they stealing resources from the datanodes? Is your >> mapreduce job i/o heavy? >> >> St.Ack >> >>> >>> Help appreciated, >>> Victor >>> >>> PS. Here is the log of reducer, hbase master and region server around >>> the failure time (00:39:42). >>> >>> ==== Reducer log ==== >>> 2010-02-04 00:39:42,447 WARN org.apache.hadoop.mapred.TaskTracker: >>> Error running child >>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to >>> contact region server Some server, retryOnlyOne=true, index=0, >>> islastrow=false, tries=9, numtries=10, i=0, listsize=4444, >>> region=callgraph,0958HGCAKZ,1265213010477 for region >>> callgraph,0958HGCAKZ,1265213010477, row '0988JHE2YS', but failed after >>> 10 attempts. >>> ... >>> >>> 2010-02-04 00:39:44,769 >>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to >>> contact region server Some server, retryOnlyOne=true, index=0, >>> islastrow=false, tries=9, numtries=10, i=0, listsize=4444, >>> region=callgraph,0958HGCAKZ,1265213010477 for region >>> callgraph,0958HGCAKZ,1265213010477, row '0972DHJAM2', but failed after >>> 10 attempts. >>> Exceptions: >>> >>> at >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166) >>> at >>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247) >>> at >>> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609) >>> at org.apache.hadoop.hbase.client.HTable.put(HTable.java:474) >>> at >>> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:94) >>> at >>> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:55) >>> at >>> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) >>> .... >>> >>> ==== HBase Master log ==== >>> >>> 2010-02-04 00:38:35,492 INFO >>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on >>> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec >>> 2010-02-04 00:38:35,492 INFO >>> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region >>> callgraph,0958HGCAKZ,1265213010477 >>> 2010-02-04 00:38:35,616 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Closing >>> callgraph,0958HGCAKZ,1265213010477: compactions & flushes disabled >>> 2010-02-04 00:38:35,616 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for >>> region, no outstanding scanners on callgraph,0958HGCAKZ,1265213010477 >>> 2010-02-04 00:38:35,616 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: No more row locks >>> outstanding on region callgraph,0958HGCAKZ,1265213010477 >>> 2010-02-04 00:38:35,616 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush >>> for region callgraph,0958HGCAKZ,1265213010477. Current region memstore >>> size 54.3m >>> 2010-02-04 00:39:02,542 DEBUG >>> org.apache.hadoop.hbase.regionserver.Store: Added >>> hdfs://master.hadoop.mydomain.org:9000/hbase/callgraph/1429362021/pointee/2375866396720998113, >>> entries=323359, sequenceid=55264870, memsize=54.3m, filesize=15.7m to >>> callgraph,0958HGCAKZ,1265213010477 >>> 2010-02-04 00:39:02,542 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush >>> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms, >>> sequence id=55264870, compaction requested=true >>> 2010-02-04 00:39:02,543 DEBUG >>> org.apache.hadoop.hbase.regionserver.Store: closed pointee >>> 2010-02-04 00:39:02,544 INFO >>> org.apache.hadoop.hbase.regionserver.HRegion: Closed >>> callgraph,0958HGCAKZ,1265213010477 >>> ... >>> 2010-02-04 00:40:35,144 INFO >>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, >>> META updated, and report to master all successful. Old region=REGION >>> => {NAME => 'callgraph,0958HGCAKZ,1265213010477', STARTKEY => >>> '0958HGCAKZ', ENDKEY => '', ENCODED => 1429362021, OFFLINE => true, >>> SPLIT => true, TABLE => {{NAME => 'callgraph', FAMILIES => [{NAME => >>> 'pointee', COMPRESSION => 'NONE', VERSIONS => '3', TTL => >>> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE >>> => 'true'}]}}, new regions: callgraph,0958HGCAKZ,1265215115595, >>> callgraph,0972EKCRJ1,1265215115595. Split took 1mins, 59sec >>> 2010-02-04 00:40:35,144 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Skipping compaction on >>> callgraph,0958HGCAKZ,1265213010477 because closing/closed >>> >>> ==== HBase Region Server log ==== >>> >>> 2010-02-04 00:38:35,492 INFO >>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on >>> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec >>> 2010-02-04 00:38:35,492 INFO >>> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region >>> callgraph,0958HGCAKZ,1265213010477 >>> 2010-02-04 00:38:35,616 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Closing >>> callgraph,0958HGCAKZ,1265213010477: compactions & flushes disabled >>> 2010-02-04 00:38:35,616 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for >>> region, no outstanding scanners on callgraph,0958HGCAKZ,1265213010477 >>> 2010-02-04 00:38:35,616 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: No more row locks >>> outstanding on region callgraph,0958HGCAKZ,1265213010477 >>> 2010-02-04 00:38:35,616 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush >>> for region callgraph,0958HGCAKZ,1265213010477. Current region memstore >>> size 54.3m >>> 2010-02-04 00:39:02,542 DEBUG >>> org.apache.hadoop.hbase.regionserver.Store: Added >>> hdfs://master.hadoop.mydomain.org:9000/hbase/callgraph/1429362021/pointee/2375866396720998113, >>> entries=323359, sequenceid=55264870, memsize=54.3m, filesize=15.7m to >>> callgraph,0958HGCAKZ,1265213010477 >>> 2010-02-04 00:39:02,542 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush >>> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms, >>> sequence id=55264870, compaction requested=true >>> 2010-02-04 00:39:02,543 DEBUG >>> org.apache.hadoop.hbase.regionserver.Store: closed pointee >>> 2010-02-04 00:39:02,544 INFO >>> org.apache.hadoop.hbase.regionserver.HRegion: Closed >>> callgraph,0958HGCAKZ,1265213010477 >>> 2010-02-04 00:40:35,144 INFO >>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, >>> META updated, and report to master all successful. Old region=REGION >>> => {NAME => 'callgraph,0958HGCAKZ,1265213010477', STARTKEY => >>> '0958HGCAKZ', ENDKEY => '', ENCODED => 1429362021, OFFLINE => true, >>> SPLIT => true, TABLE => {{NAME => 'callgraph', FAMILIES => [{NAME => >>> 'pointee', COMPRESSION => 'NONE', VERSIONS => '3', TTL => >>> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE >>> => 'true'}]}}, new regions: callgraph,0958HGCAKZ,1265215115595, >>> callgraph,0972EKCRJ1,1265215115595. Split took 1mins, 59sec >>> 2010-02-04 00:40:35,144 DEBUG >>> org.apache.hadoop.hbase.regionserver.HRegion: Skipping compaction on >>> callgraph,0958HGCAKZ,1265213010477 because closing/closed >>> >> >
