+Konstantin (there's something weird in append handling)

Some more updates. Hope this helps. I had a hunch that I was
seeing those weird issues when the HDFS DNs were at 80%
capacity (but nowhere near full!). So I quickly spun up a cluster
with 5 DNs and a modest (and unbalanced!) amount of storage.
Here's what started happening towards the end of loading 2M
records into HBase:

On the master:

{"statustimems":-1,"status":"Waiting for distributed tasks to finish.
scheduled=4 done=0
error=3","starttimems":1320796207862,"description":"Doing distributed
log split in 
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"RUNNING","statetimems":-1},{"statustimems":1320796275317,"status":"Waiting
for distributed tasks to finish. scheduled=4 done=0
error=1","starttimems":1320796206563,"description":"Doing distributed
log split in 
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},{"statustimems":1320796275317,"status":"Waiting
for distributed tasks to finish. scheduled=4 done=0
error=2","starttimems":1320796205304,"description":"Doing distributed
log split in 
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},{"statustimems":1320796275317,"status":"Waiting
for distributed tasks to finish. scheduled=4 done=0
error=3","starttimems":1320796203957,"description":"Doing distributed
log split in 
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317}]

11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing
distributed log split in
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]:
status=Waiting for distributed tasks to finish.  scheduled=4 done=0
error=3, state=RUNNING, startTime=1320796203957, completionTime=-1
appears to have been leaked
11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing
distributed log split in
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]:
status=Waiting for distributed tasks to finish.  scheduled=4 done=0
error=2, state=RUNNING, startTime=1320796205304, completionTime=-1
appears to have been leaked
11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing
distributed log split in
[hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]:
status=Waiting for distributed tasks to finish.  scheduled=4 done=0
error=1, state=RUNNING, startTime=1320796206563, completionTime=-1
appears to have been leaked
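
In case it helps to see where those scheduled/done/error counters
come from: my understanding is that the master's SplitLogManager
puts one task per WAL file into ZooKeeper and the SplitLogWorkers
on the region servers race to claim and execute them. Very
roughly, the claim-and-execute loop is something like the sketch
below (written against the plain ZooKeeper API just for
illustration; the znode path, state strings and class names are
made up, not the real HBase ones):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustration only: a worker racing to claim split-log tasks from
// ZooKeeper. The real SplitLogManager/SplitLogWorker protocol is more
// involved (heartbeats, resubmits, timeouts).
public class SplitTaskClaimSketch {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) { /* not needed for the sketch */ }
    });
    // One child znode per WAL file that needs splitting (hypothetical path).
    for (String task : zk.getChildren("/hbase/splitlog", false)) {
      String path = "/hbase/splitlog/" + task;
      Stat stat = new Stat();
      String state = new String(zk.getData(path, false, stat));
      if (!state.startsWith("UNASSIGNED")) {
        continue; // some other worker already owns it
      }
      try {
        // Conditional update: only one worker wins the version check.
        zk.setData(path, "OWNED my-worker".getBytes(), stat.getVersion());
      } catch (KeeperException.BadVersionException e) {
        continue; // lost the race
      }
      boolean ok = splitWal(task); // replay the WAL into per-region edit files
      zk.setData(path, (ok ? "DONE" : "ERR").getBytes(), -1);
    }
    zk.close();
  }

  // Placeholder for the actual splitting (HLogSplitter does this in HBase).
  private static boolean splitWal(String walFileName) {
    return false;
  }
}

When a worker claims a task and then fails the way the exception
further down shows, the master side just ticks the error counter,
which would explain the error=1..3 progression above.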

And the behavior on the DNs was even weirder. I'm attaching a log
from one of the DNs. The last exception is a shocker to me:

11/11/08 18:51:07 WARN regionserver.SplitLogWorker: log splitting of
hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063
failed, returning error
java.io.IOException: Failed to open
hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063
for append

But perhaps it is cascading from some of the earlier ones.
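
My (possibly wrong) reading of that append failure: before a
worker can read a WAL that the dead region server still "owns",
it has to take over the HDFS lease, and the way that's done here
is by re-opening the file for append and closing it right away,
retrying until the NN gives up the old lease. Something along
these lines (just a sketch of the idiom with made-up retry
counts, not the actual SplitLogWorker/HLogSplitter code):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of append-based lease recovery on a WAL whose previous writer
// died. Pass the WAL path as the only argument; retry/backoff numbers
// are made up.
public class RecoverWalLeaseSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path wal = new Path(args[0]); // e.g. the ...-splitting/... file from the log above
    FileSystem fs = wal.getFileSystem(conf);
    IOException last = null;
    for (int i = 0; i < 10; i++) {
      try {
        // Opening for append forces the NN to recover the previous writer's
        // lease; on success we close immediately and the file is readable.
        fs.append(wal).close();
        return;
      } catch (IOException e) {
        last = e;           // often AlreadyBeingCreatedException until recovery completes
        Thread.sleep(1000); // back off and retry
      }
    }
    throw new IOException("Failed to open " + wal + " for append", last);
  }
}

If the NN never manages to finish recovering the lease (say,
because the last block's replicas sit on a DN that is wedged or
too full), every retry fails the same way and the whole split
task errors out, which would line up with what the master
reported above.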

Anyway, take a look at the attached log.

Now, this is a tricky issue to reproduce. Just before it started failing
again I had a completely clean run over here:
    
http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/33/testReport/

That makes me believe it is NOT configuration-related.

Thanks,
Roman.

Attachment: fail.log.gz