Hi Stack,

Yes.  That machine was heavily loaded, in a development environment.
On each node, we have datanode, tasktracker and regionserver.  In the
MR job, the task is more and less like loading data from the disk,
processing them and putting into hbase.  We had 4 reducers and 5
region servers in this case.

Though the I/O was heavy, the problem didn't occur quickly but very late.

Besides, I've also observed that with WAL turned on, it failed with
our smaller dataset (about 7G).  In contrast, turning off WAL made it
pass, while still failed with larger dataset (about 36G), and happened
lately.

Thanks,
Victor

On Fri, Feb 5, 2010 at 2:43 PM, Stack <[email protected]> wrote:
> On Wed, Feb 3, 2010 at 6:37 PM, Victor Hsieh <[email protected]> wrote:
>> Hi,
>>
>> I'm running MapReduce with TableOutputFormat, and seems to have
>> similar trouble as HBASE-1603.  After digging to the log, I realized
>> that the time when my reducer failed might possibly related to region
>> "callgraph,0958HGCAKZ,1265213010477" split.  Could you guys help to
>> check it out?
>
> Yeah, looks similiar.
>
> This is suspicious:
>
>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
>> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec
>> 2010-02-04 00:38:35,492 INFO
>
> Can you tell what the above was about?  Lots of files to compact or
> was it just that the machine was super loaded?  Can you lightening the
> loading on hbase?  How many concurrent reducers running at any one
> time?
>
> Here's another indicator of heavy loading:
>
>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms,
>> sequence id=55264870, compaction requested=true
>
> Thats 27 seconds doing the flush closing region as part of the split
> operation (see how writes are disabled just above it).  Its only
> writing 54M.  Shouldn't be taking that long.
>
> We're at 00:39:02 when we we close parent region.
>
> Says split took 2 minutes.  Thats a long time.  It looks like machines
> are heavily loaded.  Are you running tasktrackers beside the
> datanodes.  Are they stealing resources from the datanodes?  Is your
> mapreduce job i/o heavy?
>
> St.Ack
>
>>
>> Help appreciated,
>> Victor
>>
>> PS. Here is the log of reducer, hbase master and region server around
>> the failure time (00:39:42).
>>
>> ==== Reducer log ====
>> 2010-02-04 00:39:42,447 WARN org.apache.hadoop.mapred.TaskTracker:
>> Error running child
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>> contact region server Some server, retryOnlyOne=true, index=0,
>> islastrow=false, tries=9, numtries=10, i=0, listsize=4444,
>> region=callgraph,0958HGCAKZ,1265213010477 for region
>> callgraph,0958HGCAKZ,1265213010477, row '0988JHE2YS', but failed after
>> 10 attempts.
>> ...
>>
>> 2010-02-04 00:39:44,769
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
>> contact region server Some server, retryOnlyOne=true, index=0,
>> islastrow=false, tries=9, numtries=10, i=0, listsize=4444,
>> region=callgraph,0958HGCAKZ,1265213010477 for region
>> callgraph,0958HGCAKZ,1265213010477, row '0972DHJAM2', but failed after
>> 10 attempts.
>> Exceptions:
>>
>>        at 
>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1166)
>>        at 
>> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1247)
>>        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:609)
>>        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:474)
>>        at 
>> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:94)
>>        at 
>> org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:55)
>>        at 
>> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>        ....
>>
>> ==== HBase Master log ====
>>
>> 2010-02-04 00:38:35,492 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
>> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec
>> 2010-02-04 00:38:35,492 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
>> callgraph,0958HGCAKZ,1265213010477
>> 2010-02-04 00:38:35,616 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Closing
>> callgraph,0958HGCAKZ,1265213010477: compactions & flushes disabled
>> 2010-02-04 00:38:35,616 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
>> region, no outstanding scanners on callgraph,0958HGCAKZ,1265213010477
>> 2010-02-04 00:38:35,616 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: No more row locks
>> outstanding on region callgraph,0958HGCAKZ,1265213010477
>> 2010-02-04 00:38:35,616 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
>> for region callgraph,0958HGCAKZ,1265213010477. Current region memstore
>> size 54.3m
>> 2010-02-04 00:39:02,542 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Added
>> hdfs://master.hadoop.mydomain.org:9000/hbase/callgraph/1429362021/pointee/2375866396720998113,
>> entries=323359, sequenceid=55264870, memsize=54.3m, filesize=15.7m to
>> callgraph,0958HGCAKZ,1265213010477
>> 2010-02-04 00:39:02,542 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms,
>> sequence id=55264870, compaction requested=true
>> 2010-02-04 00:39:02,543 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: closed pointee
>> 2010-02-04 00:39:02,544 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Closed
>> callgraph,0958HGCAKZ,1265213010477
>> ...
>> 2010-02-04 00:40:35,144 INFO
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
>> META updated, and report to master all successful. Old region=REGION
>> => {NAME => 'callgraph,0958HGCAKZ,1265213010477', STARTKEY =>
>> '0958HGCAKZ', ENDKEY => '', ENCODED => 1429362021, OFFLINE => true,
>> SPLIT => true, TABLE => {{NAME => 'callgraph', FAMILIES => [{NAME =>
>> 'pointee', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
>> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
>> => 'true'}]}}, new regions: callgraph,0958HGCAKZ,1265215115595,
>> callgraph,0972EKCRJ1,1265215115595. Split took 1mins, 59sec
>> 2010-02-04 00:40:35,144 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Skipping compaction on
>> callgraph,0958HGCAKZ,1265213010477 because closing/closed
>>
>> ==== HBase Region Server log ====
>>
>> 2010-02-04 00:38:35,492 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
>> region callgraph,0958HGCAKZ,1265213010477 in 21mins, 31sec
>> 2010-02-04 00:38:35,492 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
>> callgraph,0958HGCAKZ,1265213010477
>> 2010-02-04 00:38:35,616 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Closing
>> callgraph,0958HGCAKZ,1265213010477: compactions & flushes disabled
>> 2010-02-04 00:38:35,616 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for
>> region, no outstanding scanners on callgraph,0958HGCAKZ,1265213010477
>> 2010-02-04 00:38:35,616 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: No more row locks
>> outstanding on region callgraph,0958HGCAKZ,1265213010477
>> 2010-02-04 00:38:35,616 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush
>> for region callgraph,0958HGCAKZ,1265213010477. Current region memstore
>> size 54.3m
>> 2010-02-04 00:39:02,542 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: Added
>> hdfs://master.hadoop.mydomain.org:9000/hbase/callgraph/1429362021/pointee/2375866396720998113,
>> entries=323359, sequenceid=55264870, memsize=54.3m, filesize=15.7m to
>> callgraph,0958HGCAKZ,1265213010477
>> 2010-02-04 00:39:02,542 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
>> of ~54.3m for region callgraph,0958HGCAKZ,1265213010477 in 26926ms,
>> sequence id=55264870, compaction requested=true
>> 2010-02-04 00:39:02,543 DEBUG
>> org.apache.hadoop.hbase.regionserver.Store: closed pointee
>> 2010-02-04 00:39:02,544 INFO
>> org.apache.hadoop.hbase.regionserver.HRegion: Closed
>> callgraph,0958HGCAKZ,1265213010477
>> 2010-02-04 00:40:35,144 INFO
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
>> META updated, and report to master all successful. Old region=REGION
>> => {NAME => 'callgraph,0958HGCAKZ,1265213010477', STARTKEY =>
>> '0958HGCAKZ', ENDKEY => '', ENCODED => 1429362021, OFFLINE => true,
>> SPLIT => true, TABLE => {{NAME => 'callgraph', FAMILIES => [{NAME =>
>> 'pointee', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>
>> '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE
>> => 'true'}]}}, new regions: callgraph,0958HGCAKZ,1265215115595,
>> callgraph,0972EKCRJ1,1265215115595. Split took 1mins, 59sec
>> 2010-02-04 00:40:35,144 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Skipping compaction on
>> callgraph,0958HGCAKZ,1265213010477 because closing/closed
>>
>

Reply via email to