[ https://issues.apache.org/jira/browse/HBASE-8276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626344#comment-13626344 ]
Hudson commented on HBASE-8276: ------------------------------- Integrated in HBase-0.94-security #133 (See [https://builds.apache.org/job/HBase-0.94-security/133/]) HBASE-8276 Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the distributed log manager" (Jeffrey) (Revision 1465161) Result = SUCCESS tedyu : Files : * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java * /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKSplitLog.java * /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestSplitLogManager.java > Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the > distributed log manager" > ----------------------------------------------------------------------------------------------- > > Key: HBASE-8276 > URL: https://issues.apache.org/jira/browse/HBASE-8276 > Project: HBase > Issue Type: Bug > Reporter: Jeffrey Zhong > Assignee: Jeffrey Zhong > Fix For: 0.94.7 > > Attachments: hbase-8276.patch, hbase-8276-v1.patch > > > In recent tests, we found situations that when some data nodes are down and > file operations are slow depending on underlying hdfs timeout(normally 30 > secs and socket connection timeout maybe around 1 min). While split log task > heart beat time out is only 25 secs, a split log task will be preempted by > SplitLogManager and assign to someone else after the 25 secs. On a small > cluster, you'll see the same task is keeping bounced back & force for a > while. I pasted a snippet of related logs below. You can search "preempted > from" to see a task is preempted. > {code} > 2013-04-01 21:22:08,599 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159, > length=127639653 > 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: > Recovering file > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159 > 2013-04-01 21:22:09,603 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: > Finished lease recover attempt for > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159 > 2013-04-01 21:22:09,629 WARN > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old > edits file. It could be the result of a previous failed split attempt. > Deleting > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/73387f8d327a45bacf069bd631d70b3b/recovered.edits/0000000000003703447.temp, > length=0 > 2013-04-01 21:22:09,629 WARN > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old > edits file. It could be the result of a previous failed split attempt. > Deleting > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b749cbceaaf037c97f70cc2a6f48f2b8/recovered.edits/0000000000003703446.temp, > length=0 > 2013-04-01 21:22:09,630 WARN > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old > edits file. It could be the result of a previous failed split attempt. > Deleting > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/c26b9d4a042d42c1194a8c2f389d33c8/recovered.edits/0000000000003703448.temp, > length=0 > 2013-04-01 21:22:09,666 WARN > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old > edits file. It could be the result of a previous failed split attempt. > Deleting > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/adabdb40ccd52140f09f953ff41fd829/recovered.edits/0000000000003703451.temp, > length=0 > 2013-04-01 21:22:09,722 WARN > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old > edits file. It could be the result of a previous failed split attempt. > Deleting > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/19f463fe74f4365e7df3e5fdb13aecad/recovered.edits/0000000000003703468.temp, > length=0 > 2013-04-01 21:22:09,734 WARN > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old > edits file. It could be the result of a previous failed split attempt. > Deleting > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b3e759a3fc9c4e83064961cc3cd4a911/recovered.edits/0000000000003703459.temp, > length=0 > 2013-04-01 21:22:09,770 WARN > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old > edits file. It could be the result of a previous failed split attempt. > Deleting > hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/6f078553be50897a986734ae043a5889/recovered.edits/0000000000003703454.temp, > length=0 > 2013-04-01 21:22:34,985 INFO > org.apache.hadoop.hbase.regionserver.SplitLogWorker: task > /hbase/splitlog/hdfs%3A%2F%2Fip-10-137-16-140.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-137-20-188.ec2.internal%2C60020%2C1364849530779-splitting%2Fip-10-137-20-188.ec2.internal%252C60020%252C1364849530779.1364865506159 > preempted from ip-10-151-29-196.ec2.internal,60020,1364849530671, current > task state and owner=unassigned > ip-10-137-16-140.ec2.internal,60000,1364849528428 > {code} > The exact same issue is fixed by hbase-6738 in trunk so here comes the > backport. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira