Try fewer concurrent reducers at a time.  The issue happened late in
the game?  The insert into HBase only starts at the framework's 66%
mark, IIRC.  If not that, then there was probably a build-up over the
life of the insert that began to overwhelm the cluster.
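One way to throttle things (a sketch, assuming the 0.20-era property name; the value here is illustrative, pick one for your cluster) is to cap the reduce slots per tasktracker in mapred-site.xml:

```xml
<!-- mapred-site.xml: cap concurrent reduce slots per tasktracker.
     With 5 nodes, a value of 1 means at most 5 reducers cluster-wide. -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
```

Restart the tasktrackers for this to take effect.  Alternatively, `job.setNumReduceTasks(n)` caps the reducers for a single job without touching cluster config.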

St.Ack
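As an aside on the WAL toggling Victor describes below: in the 0.20-era client API the write-ahead log is skipped per Put.  A hedged sketch (the table, row key, and qualifier names here are hypothetical), not production advice, since edits written this way are lost if the regionserver crashes before flushing:

```java
// Sketch, HBase 0.20 client API: trade durability for write throughput
// by skipping the WAL on each Put.  "pointee" is the family from the
// schema in this thread; row and qualifier are made-up examples.
HTable table = new HTable(conf, "callgraph");
Put put = new Put(Bytes.toBytes("0958HGCAKZ"));
put.setWriteToWAL(false); // edits lost on regionserver crash
put.add(Bytes.toBytes("pointee"), Bytes.toBytes("callee"), value);
table.put(put);
```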

On Fri, Feb 5, 2010 at 4:12 AM, Victor Hsieh <[email protected]> wrote:
> Hi Stack,
>
> Yes.  That machine was heavily loaded, in a development environment.
> On each node, we have a datanode, tasktracker and regionserver.  In the
> MR job, the task is more or less loading data from disk,
> processing it and putting it into HBase.  We had 4 reducers and 5
> region servers in this case.
>
> Though the I/O was heavy, the problem didn't occur quickly; it only
> showed up very late in the job.
>
> Besides, I've also observed that with WAL turned on, the job failed
> even with our smaller dataset (about 7G).  In contrast, turning off
> WAL made it pass, though it still failed with the larger dataset
> (about 36G), again late in the run.
>
> Thanks,
> Victor
>
> On Fri, Feb 5, 2010 at 2:43 PM, Stack <[email protected]> wrote:
>> On Wed, Feb 3, 2010 at 6:37 PM, Victor Hsieh <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm running MapReduce with TableOutputFormat, and I seem to be
>>> hitting trouble similar to HBASE-1603.  After digging into the log,
>>> I realized that the time my reducer failed may be related to the
>>> split of region "callgraph,0958HGCAKZ,1265213010477".  Could you
>>> guys help check it out?
>>
>> Yeah, looks similar.
>>
>> This is suspicious:
>>
>>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
>>> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec
>>> 2010-02-04 00:38:35,492 INFO
>>
>> Can you tell what the above was about?  Lots of files to compact, or
>> was it just that the machine was super loaded?  Can you lighten the
>> load on HBase?  How many concurrent reducers are running at any one
>> time?
>>
>> Here's another indicator of heavy loading:
>>
>>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>>> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms,
>>> sequence id=55264870, compaction requested=true
>>
>> That's 27 seconds spent flushing while closing the region as part of
>> the split operation (see how writes are disabled just above it).  It's
>> only writing 54M.  That shouldn't take so long.
>>
>> We're at 00:39:02 when we close the parent region.
>>
>> The log says the split took 2 minutes.  That's a long time.  It looks
>> like the machines are heavily loaded.  Are you running tasktrackers
>> beside the datanodes?  Are they stealing resources from the
>> datanodes?  Is your mapreduce job I/O-heavy?
>>
>> St.Ack
>>
>>>
>>> Help appreciated,
>>> Victor
>>>
>>> PS. Here is the log of reducer, hbase master and region server around
>>> the failure time (00:39:42).
>>>
>>> ==== Reducer log ====
>>> 2010-02-04 00:39:42,447 WARN org.apache.hadoop.mapred.TaskTracker:
>>> Error running child
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>> contact region server Some server, retryOnlyOne=true, index=0,
>>> islastrow=false, tries=9, numtries=10, i=0, listsize=4444,
>>> region=callgraph,0958HGCAKZ,1265213010477 for region
>>> callgraph,0958HGCAKZ,1265213010477, row '0988JHE2YS', but failed after
>>> 10 attempts.
>>> ...
>>>
>>> 2010-02-04 00:39:44,769
>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>>> contact region server Some server, retryOnlyOne=true, index=0,
>>> islastrow=false, tries=9, numtries=10, i=0, listsize=4444,
>>> region=callgraph,0958HGCAKZ,1265213010477 for region
>>> callgraph,0958HGCAKZ,1265213010477, row '0972DHJAM2', but failed after
>>> 10 attempts.
>>> Exceptions:
>>>
>>>        at 
>>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
>>>        at 
>>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>>        at 
>>> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:474)
>>>        at 
>>> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:94)
>>>        at 
>>> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:55)
>>>        at 
>>> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>        ....
>>>
>>> ==== HBase Master log ====
>>>
>>> 2010-02-04 00:38:35,492 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
>>> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec
>>> 2010-02-04 00:38:35,492 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
>>> callgraph,0958HGCAKZ,1265213010477
>>> 2010-02-04 00:38:35,616 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Closing
>>> callgraph,0958HGCAKZ,1265213010477: compactions & flushes disabled
>>> 2010-02-04 00:38:35,616 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
>>> region, no outstanding scanners on callgraph,0958HGCAKZ,1265213010477
>>> 2010-02-04 00:38:35,616 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: No more row locks
>>> outstanding on region callgraph,0958HGCAKZ,1265213010477
>>> 2010-02-04 00:38:35,616 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
>>> for region callgraph,0958HGCAKZ,1265213010477. Current region memstore
>>> size 54.3m
>>> 2010-02-04 00:39:02,542 DEBUG
>>> org.apache.hadoop.hbase.regionserver.Store: Added
>>> hdfs://master.hadoop.mydomain.org:9000/hbase/callgraph/1429362021/pointee/2375866396720998113,
>>> entries=323359, sequenceid=55264870, memsize=54.3m, filesize=15.7m to
>>> callgraph,0958HGCAKZ,1265213010477
>>> 2010-02-04 00:39:02,542 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>>> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms,
>>> sequence id=55264870, compaction requested=true
>>> 2010-02-04 00:39:02,543 DEBUG
>>> org.apache.hadoop.hbase.regionserver.Store: closed pointee
>>> 2010-02-04 00:39:02,544 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegion: Closed
>>> callgraph,0958HGCAKZ,1265213010477
>>> ...
>>> 2010-02-04 00:40:35,144 INFO
>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
>>> META updated, and report to master all successful. Old region=REGION
>>> => {NAME => 'callgraph,0958HGCAKZ,1265213010477', STARTKEY =>
>>> '0958HGCAKZ', ENDKEY => '', ENCODED => 1429362021, OFFLINE => true,
>>> SPLIT => true, TABLE => {{NAME => 'callgraph', FAMILIES => [{NAME =>
>>> 'pointee', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
>>> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
>>> => 'true'}]}}, new regions: callgraph,0958HGCAKZ,1265215115595,
>>> callgraph,0972EKCRJ1,1265215115595. Split took 1mins, 59sec
>>> 2010-02-04 00:40:35,144 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Skipping compaction on
>>> callgraph,0958HGCAKZ,1265213010477 because closing/closed
>>>
>>> ==== HBase Region Server log ====
>>>
>>> 2010-02-04 00:38:35,492 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
>>> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec
>>> 2010-02-04 00:38:35,492 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
>>> callgraph,0958HGCAKZ,1265213010477
>>> 2010-02-04 00:38:35,616 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Closing
>>> callgraph,0958HGCAKZ,1265213010477: compactions & flushes disabled
>>> 2010-02-04 00:38:35,616 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
>>> region, no outstanding scanners on callgraph,0958HGCAKZ,1265213010477
>>> 2010-02-04 00:38:35,616 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: No more row locks
>>> outstanding on region callgraph,0958HGCAKZ,1265213010477
>>> 2010-02-04 00:38:35,616 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
>>> for region callgraph,0958HGCAKZ,1265213010477. Current region memstore
>>> size 54.3m
>>> 2010-02-04 00:39:02,542 DEBUG
>>> org.apache.hadoop.hbase.regionserver.Store: Added
>>> hdfs://master.hadoop.mydomain.org:9000/hbase/callgraph/1429362021/pointee/2375866396720998113,
>>> entries=323359, sequenceid=55264870, memsize=54.3m, filesize=15.7m to
>>> callgraph,0958HGCAKZ,1265213010477
>>> 2010-02-04 00:39:02,542 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>>> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms,
>>> sequence id=55264870, compaction requested=true
>>> 2010-02-04 00:39:02,543 DEBUG
>>> org.apache.hadoop.hbase.regionserver.Store: closed pointee
>>> 2010-02-04 00:39:02,544 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegion: Closed
>>> callgraph,0958HGCAKZ,1265213010477
>>> 2010-02-04 00:40:35,144 INFO
>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
>>> META updated, and report to master all successful. Old region=REGION
>>> => {NAME => 'callgraph,0958HGCAKZ,1265213010477', STARTKEY =>
>>> '0958HGCAKZ', ENDKEY => '', ENCODED => 1429362021, OFFLINE => true,
>>> SPLIT => true, TABLE => {{NAME => 'callgraph', FAMILIES => [{NAME =>
>>> 'pointee', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
>>> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
>>> => 'true'}]}}, new regions: callgraph,0958HGCAKZ,1265215115595,
>>> callgraph,0972EKCRJ1,1265215115595. Split took 1mins, 59sec
>>> 2010-02-04 00:40:35,144 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Skipping compaction on
>>> callgraph,0958HGCAKZ,1265213010477 because closing/closed
>>>
>>
>
