+Konstantin (there's something weird in append handling)

Some more updates; hope this will help. I had a hunch that I was seeing those weird issues whenever an HDFS DataNode got to around 80% capacity (but nowhere near full!), so I quickly spun up a cluster of 5 DNs with modest (and unbalanced!) amounts of storage. Here's what started happening towards the end of loading 2M records into HBase.
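As an aside, the usage numbers I mean are just DFS-used vs. capacity per DataNode, the same thing hadoop dfsadmin -report shows. Here's a rough way to dump them from the client API (just a sketch, assuming the cluster's config is on the classpath; DnUsage is a made-up name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DnUsage {
  public static void main(String[] args) throws Exception {
    // fs.default.name must point at the NN; it is picked up from the config on the classpath
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    // one line per DN: DFS used vs. raw capacity, i.e. the "80% but nowhere near full" check
    for (DatanodeInfo dn : dfs.getDataNodeStats()) {
      long used = dn.getDfsUsed(), cap = dn.getCapacity();
      System.out.printf("%s  %.1f%% used (%d of %d bytes)%n",
          dn.getName(), 100.0 * used / cap, used, cap);
    }
  }
}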
On the master:
{"statustimems":-1,"status":"Waiting for distributed tasks to finish.
scheduled=4 done=0
error=3","starttimems":1320796207862,"description":"Doing distributed
log split in
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"RUNNING","statetimems":-1},{"statustimems":1320796275317,"status":"Waiting
for distributed tasks to finish. scheduled=4 done=0
error=1","starttimems":1320796206563,"description":"Doing distributed
log split in
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},{"statustimems":1320796275317,"status":"Waiting
for distributed tasks to finish. scheduled=4 done=0
error=2","starttimems":1320796205304,"description":"Doing distributed
log split in
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},{"statustimems":1320796275317,"status":"Waiting
for distributed tasks to finish. scheduled=4 done=0
error=3","starttimems":1320796203957,"description":"Doing distributed
log split in
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317}]
11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=3, state=RUNNING, startTime=1320796203957, completionTime=-1 appears to have been leaked
11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=2, state=RUNNING, startTime=1320796205304, completionTime=-1 appears to have been leaked
11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=1, state=RUNNING, startTime=1320796206563, completionTime=-1 appears to have been leaked
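For what it's worth, each of those scheduled tasks is one WAL file and (if I read the distributed splitting code right) is tracked as a znode under the master's splitlog node in ZooKeeper, /hbase/splitlog with the default layout. So the stuck/errored tasks can be peeked at directly with the plain ZooKeeper client; here's a throwaway sketch, assuming defaults (SplitlogPeek is a made-up name):

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SplitlogPeek {
  public static void main(String[] args) throws Exception {
    // args[0] = hbase.zookeeper.quorum address, e.g. "ip-10-46-114-25.ec2.internal:2181"
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper(args[0], 30000,
        event -> { if (event.getState() == KeeperState.SyncConnected) connected.countDown(); });
    connected.await();
    // one child per WAL being split; the znode name is the URL-encoded log path
    for (String task : zk.getChildren("/hbase/splitlog", false)) {
      byte[] data = zk.getData("/hbase/splitlog/" + task, false, new Stat());
      // the data is the task state (unassigned/owned/err/done) plus the worker holding it
      System.out.println(task + " -> " + new String(data));
    }
    zk.close();
  }
}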
And the behavior on the DNs was even weirder. I'm attaching a log
from one of the DNs. The last exception is a shocker to me:
11/11/08 18:51:07 WARN regionserver.SplitLogWorker: log splitting of hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063 failed, returning error
java.io.IOException: Failed to open hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063 for append
But perhaps it is cascading from some of the earlier ones.
Anyway, take a look at the attached log.
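If it helps narrow things down, that open-for-append step can also be exercised outside of HBase with a few lines against the FileSystem API. This is just a standalone probe, not HBase's actual recovery path, it assumes append is enabled on the cluster, and it should only be pointed at a file nobody is still writing to (the path would be the WAL from the error above):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendProbe {
  public static void main(String[] args) throws IOException {
    // args[0] = full hdfs:// URI of the WAL file to probe
    Path wal = new Path(args[0]);
    FileSystem fs = FileSystem.get(wal.toUri(), new Configuration());
    try {
      // Asking for append makes the NN check the dead writer's lease
      // (and, depending on its state, kick off lease recovery).
      FSDataOutputStream out = fs.append(wal);
      out.close();  // nothing is written; close right away
      System.out.println("append open succeeded, lease looks recoverable");
    } catch (IOException e) {
      // This should surface the same "Failed to open ... for append" flavor of failure.
      System.out.println("append open failed: " + e);
    }
  }
}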
Now, this is a tricky issue to reproduce. Just before it started failing
again I had a completely clean run over here:
http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/33/testReport/
Which makes me believe it is NOT configuration related.
Thanks,
Roman.
Attachment: fail.log.gz (GNU zip compressed data)
