[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653495#comment-13653495 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12582559/hbase-7006-combined-v6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 31 new or modified tests. {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.util.TestHBaseFsck Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5620//console This message is automatically generated. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch, > hbase-7006-combined-v6.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. 
It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652258#comment-13652258 ] stack commented on HBASE-7006: -- I added note to refguide that folks should run w/ newer zks and point to ZOOKEEPER-1277 as a justification. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, > hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652253#comment-13652253 ] stack commented on HBASE-7006: -- [~jeffreyz] Thanks. I asked about zxid. "I think you mean the zxid? That's a 64bit number where the lower 32bits are the xid and the upper 32 bits are the epoch. The xid increases for each write, the epoch increases when there is a leader change. The zxid should always only increase. There was a bug where the lower 32bits could roll over, however that resulted in the epoch number increasing as well (64bits++) - so the constraint was maintained (but the cluster would fail/lockup for another issue, I fixed that in recent releases though.. Now when that is about to happen it forces a new leader election)." Above is from our Patrick Hunt. Says fix is in Apache ZK (3.3.5, 3.4.4). If you look at tail of the below issue, you will see an hbase favorite user running into rollover issue: https://issues.apache.org/jira/browse/ZOOKEEPER-1277 Let me make sure we add to notes that folks should upgrade to these versions of zk. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, > hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
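To make the zxid layout Patrick describes above concrete, here is a minimal standalone Java sketch (not ZooKeeper or HBase source) that splits a zxid into its epoch and counter halves; the sample value is made up for illustration.

{code:java}
// Illustrative only: decompose a ZooKeeper zxid per the description above.
// The upper 32 bits carry the leader epoch, the lower 32 bits the per-write
// counter whose rollover is discussed in ZOOKEEPER-1277.
public final class ZxidParts {
  static long epoch(long zxid) {
    return zxid >>> 32;
  }

  static long counter(long zxid) {
    return zxid & 0xFFFFFFFFL;
  }

  public static void main(String[] args) {
    long zxid = 0x0000000500000003L; // assumed sample: epoch 5, third write of that epoch
    System.out.println("epoch=" + epoch(zxid) + ", counter=" + counter(zxid));
  }
}
{code}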
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652106#comment-13652106 ] Jeffrey Zhong commented on HBASE-7006: -- It seems that I forgot to publish it. You should have it now. Thanks. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, > hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652094#comment-13652094 ] stack commented on HBASE-7006: -- [~jeffreyz] Nice. Good one. Up on rb, you may have missed another set of reviews of mine. Thanks. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, > hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652074#comment-13652074 ] Jeffrey Zhong commented on HBASE-7006: -- {quote} How can you be sure all edits in WALs from crashed server were replicated already? {quote} This is guaranteed by the replication failover logic. Replication waits for log splitting to finish and then resumes replication on the WAL files from the failed RS. The above change just makes sure we don't replicate WAL edits created by the replay command again, because those edits will be replicated from the original WAL file. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, > hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
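As a rough illustration of the point above (replayed edits carry no replication scope, so they are never shipped a second time), here is a hedged Java sketch; the class and method names are invented for this example and are not the HBase internals touched by the patch.

{code:java}
import java.util.Collections;
import java.util.Map;

// Hypothetical helper: mutations arriving through the replay command get an empty
// replication scope, so a replication source would skip them; the original WAL of
// the failed RS (picked up by replication failover) remains the only shipping path.
final class ReplayScopeSketch {
  static Map<byte[], Integer> scopesFor(Map<byte[], Integer> familyScopes, boolean isReplay) {
    if (isReplay) {
      return Collections.<byte[], Integer>emptyMap(); // no scope => not replicated again
    }
    return familyScopes; // normal writes keep their configured scopes
  }
}
{code}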
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652060#comment-13652060 ] stack commented on HBASE-7006: -- bq. 2) Set replayed WAL edits replication scope to null so that WAL edits created by replay command won't be double replicated. How can you be sure all edits in WALs from crashed server were replicated already? > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, > hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651704#comment-13651704 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12582262/hbase-7006-combined-v5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 27 new or modified tests. {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.regionserver.TestAtomicOperation org.apache.hadoop.hbase.security.access.TestAccessController Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5589//console This message is automatically generated. 
> [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, > hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13650365#comment-13650365 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12581779/hbase-7006-combined-v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 24 new or modified tests. {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.master.TestDistributedLogSplitting Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5563//console This message is automatically generated. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, > hbase-7006-combined-v4.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. 
It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649070#comment-13649070 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12581779/hbase-7006-combined-v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 24 new or modified tests. {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5554//console This message is automatically generated. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, > hbase-7006-combined-v4.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. 
It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647328#comment-13647328 ] stack commented on HBASE-7006: -- Thinking on it, flushing after all logs are recovered is a bad idea because it is a special case. Replay mutations, as is, are treated like any other inbound edit. I think this is good. Turning off WALs and flushing at the end and trying to figure out what we failed to write, or writing hfiles directly -- if you could, and I don't think you can, since edits need to be sorted in an hfile -- and by-passing the memstore and then telling the Region to pick up the new hfile when done, all introduce new states that we will have to manage, complicating critical recovery. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647325#comment-13647325 ] Jeffrey Zhong commented on HBASE-7006: -- {quote} This might be very important. Also, we will now allow writes on the recovering region while this replay is happening. These other writes + replays might be doing flushes in between. {quote} This is a valid concern. Let's compare the new way with the old way: old log splitting appends each WAL edit into a recovered.edits file, while the new way writes to disk only when the memstore reaches a certain size. Therefore, even when allowing writes during recovery, the new distributed log replay still has better disk write characteristics (assuming normal situations). Your concern is more relevant when a system is close to its disk IO or other capacity limits; allowing writes could deteriorate the whole system even more. I think a system operator should rate limit at a higher level rather than use the recovery logic to reject traffic, because nodes are expected to go down at any time and we don't want our users affected even while the system is in recovery. That being said, we could provide a config flag to disallow writes during recovery. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
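A back-of-the-envelope sketch of the disk-write comparison made above, using assumed numbers (the edit count, edit size, and flush size are illustrative, not measurements from the patch):

{code:java}
// Old-style splitting does one recovered.edits append per WAL edit; distributed log
// replay buffers edits in the memstore and only hits disk when the flush size is
// reached. All figures below are assumptions for illustration.
public class RecoveryWritePatternSketch {
  public static void main(String[] args) {
    long edits = 1_000_000L;             // WAL edits to recover for one region (assumed)
    long avgEditBytes = 1_024L;          // assumed average edit size
    long memstoreFlushSize = 128L << 20; // assumed 128 MB flush threshold

    long recoveredEditsAppends = edits;  // one append per edit in the old scheme
    long replayFlushes =
        (edits * avgEditBytes + memstoreFlushSize - 1) / memstoreFlushSize; // ceiling division

    System.out.println("old splitting appends: " + recoveredEditsAppends);
    System.out.println("replay-path flushes:   " + replayFlushes);
  }
}
{code}

Under these assumptions the replay path issues on the order of eight flushes where the old path issues a million appends, which is the better disk-write behaviour being claimed.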
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647280#comment-13647280 ] Anoop Sam John commented on HBASE-7006: --- [~yuzhih...@gmail.com] bq.Without multi WAL, the above implies that all regions from one failed region server be assigned to one active region server. Yes with multi WAL only.. I was just saying it for future consideration :) bq.I guess the underlying assumption above is that there are several region groups in multi WAL Yes that is the assumption I have made. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647278#comment-13647278 ] Ted Yu commented on HBASE-7006: --- bq. we can do a HLog to region opening RS collocation Without multi WAL, the above implies that all regions from one failed region server be assigned to one active region server. This negates the performance benefit of distributed log splitting. bq. assigning all regions in one group to a RS I guess the underlying assumption above is that there are several region groups in multi WAL such that we gain parallelism across multiple active region servers. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647273#comment-13647273 ] Anoop Sam John commented on HBASE-7006: --- Do we need a cleaner abstraction layer for RS->RS communication? Maybe later, when we can do an HLog to region-opening-RS collocation (the RS where the region is newly assigned being the only one doing the HLog split), we can do stuff in this layer so as to avoid the RS connection based calls and instead just get the Region ref from the RS and do direct writes. As I mentioned in a comment above, when we have multi WAL and if we go with fixed regions per WAL (we are in fact doing virtual groups of regions in an RS), we can try (as far as possible) assigning all regions in one group to an RS and give the log splitting work for that WAL to the same RS; then there will be 100% locality w.r.t. the replay commands. Sounds sensible? Maybe in such a case the replay can create the HFiles directly, avoiding the memstore writes and subsequent flushes (like the bulk loading way)? Some thoughts coming.. Please correct me if I am going wrong. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
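A toy sketch of the grouping idea floated above: fixed region groups per WAL, with each whole group preferentially assigned to one RS so replay stays local. Everything here (group count, hashing, naming) is assumed for illustration and is not part of the patch.

{code:java}
import java.util.List;

// Illustrative only: map each region to one of N WAL groups, then map each group to a
// preferred region server, so all regions sharing a WAL land on the same server.
final class WalGroupAssignmentSketch {
  private final int numGroups;
  private final List<String> liveServers;

  WalGroupAssignmentSketch(int numGroups, List<String> liveServers) {
    this.numGroups = numGroups;
    this.liveServers = liveServers;
  }

  int groupFor(String encodedRegionName) {
    return Math.abs(encodedRegionName.hashCode() % numGroups);
  }

  String preferredServerFor(String encodedRegionName) {
    // Best effort: every region in the same group gets the same preferred server.
    return liveServers.get(groupFor(encodedRegionName) % liveServers.size());
  }
}
{code}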
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647252#comment-13647252 ] Anoop Sam John commented on HBASE-7006: --- [~jeffreyz] I also had the same question as Stack regarding the WAL. This might be very important. Also, we will now allow writes on the recovering region while this replay is happening. These other writes + replays might be doing flushes in between. Anyway, replays alone might also be doing flushes in between (because of memstore sizes). While these replays are in progress for some regions opened on an RS, the replay requests from the other RS are taking up some handlers. Will this affect the normal functioning of the RS? Maybe we can test this too, IMO: the cluster is functioning normally with reads and writes, then this RS goes down. How, if at all, does it impact the normal read/write throughput? > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647084#comment-13647084 ] Jeffrey Zhong commented on HBASE-7006: -- [~saint@gmail.com] Good comments! Please see my responses, in reverse order of your feedback: {quote} Would there be any advantage NOT writing the WAL on replay and only when done, then flush {quote} This is a very good question. Actually I was thinking of evaluating this, as a possible optimization, after this feature is in. Currently the receiving RS does a WAL sync for each replay batch. In the optimization scenario, we could replay mutations with SKIP_WAL durability and flush at the end. The gain mostly depends on the "sequential" write performance of WAL syncs. I think it's worth a try here. {quote} The two sequenceids are never related right? They are only applied to the logs of the server who passed the particular sequenceid to the master? {quote} No, sequenceIds from different RSs are totally unrelated. Yes. Currently we use the up-to-date flushed sequence id when we open the region, by looking at all the store files as we do today. {quote} + "...check if all WALs of a failed region server have been successfully replayed." How is this done? {quote} We rely on the fact that when log splitting for a failed RS is done, all its WAL files are recovered, so we don't really do the check. {quote} + How will a crashed regionserver ".. and appending itself into the list of...": i.e. append itself to list of crashed servers (am I reading this wrong)? {quote} Master SSH does the work, not the dead RS. {quote} + Is your assumption about out-of-order replay of edits new to this feature? {quote} Yes. I'll amend the design doc based on your other comments. Thanks. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
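For reference, this is roughly what the SKIP_WAL idea would look like at the client API level. It is a sketch only; whether the replay path would actually set durability this way is exactly the open question above, and the Put/Durability calls are taken from the public client API as it stood around 0.95.

{code:java}
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;

// Sketch: build a replayed mutation that skips the receiving server's WAL entirely,
// relying on a flush at the end of recovery instead of per-batch WAL syncs.
public class SkipWalReplaySketch {
  public static Put replayPut(byte[] row, byte[] family, byte[] qualifier, byte[] value) {
    Put put = new Put(row);
    put.add(family, qualifier, value);       // add the recovered cell
    put.setDurability(Durability.SKIP_WAL);  // do not append this edit to the WAL
    return put;
  }
}
{code}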
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647069#comment-13647069 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12581430/hbase-7006-combined-v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 21 new or modified tests. {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.master.TestOpenedRegionHandler org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5527//console This message is automatically generated. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. 
Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646864#comment-13646864 ] stack commented on HBASE-7006: -- Some comments on the design doc: + Nit: Add author, date, and add issue number so can go back to the hosting issue should I trip over the doc w/o any other context. + Is your assumption about out-of-order replay of edits new to this feature? I suppose in the old/current way of log splitting, we do stuff in sequenceid order because we wrote the recovered.edits files named by sequenceid... so they were ordered when the regionserver read them in? We should highlight your assumption more. I think if we move to multiple-WALs we'll want to also take on this assumption doing recovery. + Given the assumption, we should list the problematic scenarios (or point to where we list them already -- I think the 'Current Limitations' section here http://hbase.apache.org/book.html#version.delete should have the list we currently know). + "...check if all WALs of a failed region server have been successfully replayed." How is this done? + How will a crashed regionserver ".. and appending itself into the list of...": i.e. append itself to list of crashed servers (am I reading this wrong)? bq. "For each region per failed region server, we stores the last flushed sequence Id from the region server before it failed." This is the mechanism that has the regionserver telling the master its current sequenceid everytime it flushes to an hfile? So when server crashes, master writes a znode under the recovering-regions with the last reported seq id? if a new regionserver hosting a recovery of regions then crashes, it gets a new znode w/ its current sequenceid? Now we have two crashed servers with (probably) two different sequenceids whose logs we are recovering. The two sequenceids are never related right? They are only applied to the logs of the server who passed the particular sequenceid to the master? Question: So it looks like we replay the WALs of a crashed regionserver by playing them into the new region host servers. There does not seem to be a flush when the replay of the old crashed servers WALs is done. Is your thinking that it is not needed since the old edits are now in the new servers WAL? Would there be any advantage NOT writing the WAL on replay and only when done, then flush (I suppose not, thinking about it, and in fact, it would probably make replay more complicated since we'd have to have this new operation to do; a flush-when-all-WALS-recovered). Good stuff. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
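To pin down the mechanism being asked about in the comment above (region servers reporting their last flushed sequence id so recovery can skip already-persisted edits), here is a minimal sketch; the class and method names are invented for the example and do not mirror the actual master-side data structures.

{code:java}
import java.util.concurrent.ConcurrentHashMap;

// Sketch: track the highest flushed sequence id reported per region; on replay,
// only edits with a higher sequence id still need to be applied.
class LastFlushedSeqIdTrackerSketch {
  private final ConcurrentHashMap<String, Long> lastFlushed =
      new ConcurrentHashMap<String, Long>();

  synchronized void onFlushReport(String encodedRegionName, long flushedSeqId) {
    Long previous = lastFlushed.get(encodedRegionName);
    if (previous == null || flushedSeqId > previous) {
      lastFlushed.put(encodedRegionName, flushedSeqId);
    }
  }

  boolean needsReplay(String encodedRegionName, long editSeqId) {
    Long flushed = lastFlushed.get(encodedRegionName);
    return flushed == null || editSeqId > flushed;
  }
}
{code}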
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645705#comment-13645705 ] Jeffrey Zhong commented on HBASE-7006: -- Thanks [~anoop.hbase] for the review! {quote} For the replay we call the replay interface added in HRS from another HRS. So all the Mutations in that call are replay mutations. {quote} Agreed. In fact, the current implementation works this way. The replay flag is NOT added to the MutationProto protobuf message but to the Mutation class, so a client doesn't need to specify the flag; the receiving region server sets the flag so that the write path code can do the special replay logic. Otherwise I would have to add a new 'replay' flag input argument to all functions along the write path. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
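A sketch of the shape described above: the replay marker lives only on the server-side object, not in the wire protobuf. The names below (markAsReplay, isReplay) are placeholders for illustration, not the actual HBase methods.

{code:java}
// Illustrative only: the replay RPC handler tags every incoming mutation so the
// normal write path can branch on it, without changing MutationProto on the wire.
abstract class MutationSketch {
  private boolean replay = false; // transient server-side state, never serialized

  void markAsReplay() {
    this.replay = true;
  }

  boolean isReplay() {
    return replay;
  }
}

class ReplayHandlerSketch {
  void replay(Iterable<? extends MutationSketch> mutations) {
    for (MutationSketch mutation : mutations) {
      mutation.markAsReplay(); // everything in a replay call is a replay mutation
      // ... hand off to the regular write path, which checks isReplay() ...
    }
  }
}
{code}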
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645431#comment-13645431 ] Anoop Sam John commented on HBASE-7006: --- Added some comments in RB. Not yet completed the review.. Mutation.replay -> Is this new state variable really needed? For the replay we call the replay interface added in HRS from another HRS. So all the Mutations in that call are replay mutations. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644851#comment-13644851 ] Jeffrey Zhong commented on HBASE-7006: -- TestMetaReaderEditor is related and the other three passed locally. I'll include fixes in the next patch. Thanks. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644703#comment-13644703 ] stack commented on HBASE-7006: -- Are some of the above failures because of your patch J? (Reviewing now...) > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting > Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644339#comment-13644339 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12580940/hbase-7006-combined-v3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 21 new or modified tests. {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 4 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings). {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplicationQueueFailover org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort org.apache.hadoop.hbase.backup.TestHFileArchiving org.apache.hadoop.hbase.catalog.TestMetaReaderEditor Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5482//console This message is automatically generated. 
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643549#comment-13643549 ] Jeffrey Zhong commented on HBASE-7006: -- Sure, I'll put the latest combined patch on the review board this weekend.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643485#comment-13643485 ] stack commented on HBASE-7006: -- [~jeffreyz] Yeah, rb it please sir (A few of us were talking about it today... we are all fired up for reviewing more!). Thanks J.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643420#comment-13643420 ] Himanshu Vashishtha commented on HBASE-7006: This has really bloated now. Can you please rb it? Thanks.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642429#comment-13642429 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12580612/hbase-7006-combined-v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 21 new or modified tests. {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:red}-1 release audit{color}. The applied patch generated 1 release audit warnings (more than the trunk's current 0 warnings). {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.backup.TestHFileArchiving org.apache.hadoop.hbase.security.access.TestAccessController Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//testReport/ Release audit warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5459//console This message is automatically generated. 
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642027#comment-13642027 ] Jeffrey Zhong commented on HBASE-7006: -- Hey Ram, thanks for the good questions. Below are the answers:
1)
{code}
catch (KeeperException e) {
+LOG.warn("Cannot get lastFlushedSequenceId from ZooKeeper for server=" + regionServerName
++ "; region=" + encodedRegionName, e);
+ }
{code}
In this scenario we can't get the last flushed sequence Id, so we'll replay all edits in the WAL. There will be some duplicated replay, but it won't affect correctness.
{code}
+} catch (KeeperException e) {
+ LOG.warn("Cannot remove recovering regions from ZooKeeper", e);
+}
{code}
We have another place that does stale-data GC, so after a little while the recovering ZK node should be removed. In SplitLogManager we have the following code:
{code}
// Garbage collect left-over /hbase/recovering-regions/... znode
if (tot == 0 && inflightWorkItems.size() == 0 && tasks.size() == 0) {
  removeRecoveringRegionsFromZK(null);
}
{code}
-Jeffrey
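To make the fallback concrete, here is a minimal sketch of the behaviour Jeffrey describes; the helper name getLastFlushedSequenceId and the -1 sentinel are assumptions for illustration, not necessarily the exact code in the patch:
{code}
// Sketch only: when ZooKeeper cannot be read, treat the last flushed sequence id as unknown
// and replay every edit. Duplicated replay is safe because replaying an edit is idempotent.
long lastFlushedSequenceId = -1L; // -1 = unknown, so nothing is skipped
try {
  lastFlushedSequenceId = getLastFlushedSequenceId(regionServerName, encodedRegionName);
} catch (KeeperException e) {
  LOG.warn("Cannot get lastFlushedSequenceId from ZooKeeper for server=" + regionServerName
      + "; region=" + encodedRegionName, e);
}
// Only edits newer than the last flush actually need to be replayed.
if (edit.getSequenceId() > lastFlushedSequenceId) {
  replayEdit(edit);
}
{code}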
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641981#comment-13641981 ] Jeffrey Zhong commented on HBASE-7006: -- [~anoop.hbase] Are you suggesting adding a req counter at the receiving RS to see how many replays are happening? I think it's a good idea. In addition, I don't see such a counter for each individual command such as put, get, scan, etc. I can add new counters for all client commands in the RS. Thanks.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641977#comment-13641977 ] ramkrishna.s.vasudevan commented on HBASE-7006: --- It is more dependent on ZK now. Will these exceptions cause any problem if they keep happening?
{code}
catch (KeeperException e) {
+LOG.warn("Cannot get lastFlushedSequenceId from ZooKeeper for server=" + regionServerName
++ "; region=" + encodedRegionName, e);
+ }
{code}
{code}
+} catch (KeeperException e) {
+ LOG.warn("Cannot remove recovering regions from ZooKeeper", e);
+}
{code}
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641431#comment-13641431 ] Anoop Sam John commented on HBASE-7006: --- [~jeffreyz] Do we need metrics like req count to be affected by the replay requests?
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641338#comment-13641338 ] ramkrishna.s.vasudevan commented on HBASE-7006: --- Patch looks good at a high level. Will go through the patch in detail.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13639871#comment-13639871 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12580153/hbase-7006-combined-v1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 21 new or modified tests. {color:green}+1 hadoop1.0{color}. The patch compiles against the hadoop 1.0 profile. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5417//console This message is automatically generated. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, > LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. 
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638964#comment-13638964 ] Anoop Sam John commented on HBASE-7006: --- The comparison numbers look promising! So now we make the region available for writes immediately. Have you run the test with clients writing to the region soon after it is opened for writes? Going through the patch...
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638831#comment-13638831 ] stack commented on HBASE-7006: -- [~jeffreyz] Nice numbers in the posted doc. What does the below mean, sir?
{code}
+ // make current mutation as a distributed log replay change
+ protected boolean isReplay = false;
{code}
Why do we have this isReplay in a Mutation? Because these edits get treated differently over on the server side? Suggest calling the data member replay or logReplay or walReplay, and then the accessor is isLogReplay or isWALReplay. isReplay is the name of a method that returns whether the data member replay is true or not.
Does this define belong in this patch?
{code}
+ /** Conf key that specifies region assignment timeout value */
+ public static final String REGION_ASSIGNMENT_TIME_OUT = "hbase.master.region.assignment.time.out";
{code}
Why are we timing out assignments in this patch?
Is it log splitting that is referred to in the metric name below?
{code}
+ void updateMetaSplitTime(long time);
{code}
If so, should it be updateMetaWALSplitTime? And given what this patch is about, should it be WALReplay? Ditto for updateMetaSplitSize. Excuse me if I am not following what is going on w/ the above (because I see later that you have replay metrics going on).
Default is false?
{code}
+distributedLogReplay = this.conf.getBoolean(HConstants.DISTRIBUTED_LOG_REPLAY_KEY, false);
{code}
Should we turn it on in trunk and off in 0.95? (Should we turn it on in 0.95 so it gets a bit of testing?)
Something wrong w/ the license in WALEditsReplaySink.
Skimmed the patch. Let me come back w/ a decent review. Looks good J.
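As a small illustration of the naming stack suggests (a sketch only, not the patch's actual code), the data member carries the state and the accessor carries the is-prefix:
{code}
// Hypothetical rename sketch: the field says what the mutation is, the accessor asks the question.
private boolean walReplay = false;  // true when this Mutation is applied as part of distributed WAL replay

public boolean isWALReplay() {
  return this.walReplay;
}

public void setWALReplay(boolean walReplay) {
  this.walReplay = walReplay;
}
{code}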
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637026#comment-13637026 ] Ted Yu commented on HBASE-7006: --- For ReplicationZookeeper.java:
{code}
+ public static byte[] toByteArray(
+ final long position) {
{code}
Considering the lockToByteArray() method that follows the above, maybe rename the above positionToByteArray().
{code}
+ public static final String REGION_ASSIGNMENT_TIME_OUT = "hbase.master.region.assignment.time.out";
{code}
How about "hbase.master.region.assignment.timeout"?
{code}
+ static final String REPLAY_BATCH_SIZE_DESC = "Number of changes of each replay batch.";
{code}
"Number of changes of each" -> "Number of changes in each"
For AssignmentManager.java:
{code}
+long end = (timeOut <= 0) ? Long.MAX_VALUE : System.currentTimeMillis() + timeOut;
...
+ if (System.currentTimeMillis() > end) {
{code}
Please use EnvironmentEdge.
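For reference, a minimal sketch of the EnvironmentEdge suggestion; it assumes the usual org.apache.hadoop.hbase.util.EnvironmentEdgeManager helper, and the surrounding loop and regionsAssigned() condition are made up for illustration:
{code}
// Go through EnvironmentEdgeManager instead of System.currentTimeMillis() so tests can inject a clock.
long end = (timeOut <= 0) ? Long.MAX_VALUE
    : EnvironmentEdgeManager.currentTimeMillis() + timeOut;
while (!regionsAssigned()) {          // hypothetical condition for the sketch
  if (EnvironmentEdgeManager.currentTimeMillis() > end) {
    break;                            // assignment timed out; stop waiting
  }
  Threads.sleep(100);                 // small back-off between checks
}
{code}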
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637011#comment-13637011 ] Jonathan Hsieh commented on HBASE-7006: --- Lovely. Thanks!
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636984#comment-13636984 ] Jeffrey Zhong commented on HBASE-7006: -- [~jmhsieh] The initial performance numbers are in the attachment 'LogSplitting Comparison.pdf'. Thanks.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636983#comment-13636983 ] Jonathan Hsieh commented on HBASE-7006: --- [~jeffreyz] Do we have any numbers on how much this improves our recovery time?
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636151#comment-13636151 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12579497/hbase-7006-combined.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 15 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.io.TestHeapSize Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5360//console This message is automatically generated. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. 
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636138#comment-13636138 ] Anoop Sam John commented on HBASE-7006: --- Will start reviewing the patch by tomorrow, Jeffrey Zhong. This will be interesting stuff for MTTR.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636134#comment-13636134 ] Hadoop QA commented on HBASE-7006: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12579488/hbase-7006-combined.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 15 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 site{color}. The mvn site goal succeeds with this patch. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.io.TestHeapSize Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/5356//console This message is automatically generated. > [MTTR] Study distributed log splitting to see how we can make it faster > --- > > Key: HBASE-7006 > URL: https://issues.apache.org/jira/browse/HBASE-7006 > Project: HBase > Issue Type: Bug > Components: MTTR >Reporter: stack >Assignee: Jeffrey Zhong >Priority: Critical > Fix For: 0.95.1 > > Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, > ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf > > > Just saw interesting issue where a cluster went down hard and 30 nodes had > 1700 WALs to replay. Replay took almost an hour. It looks like it could run > faster that much of the time is spent zk'ing and nn'ing. > Putting in 0.96 so it gets a look at least. Can always punt. -- This message is automatically generated by JIRA. 
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634456#comment-13634456 ] Jimmy Xiang commented on HBASE-7006: I prefer small patches; otherwise, it is hard to review.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632441#comment-13632441 ] Jeffrey Zhong commented on HBASE-7006: -- [~jxiang] Thanks in advance for reviewing. The assumption documented in the write-up has been verified and relies on the idempotence of HBase. I think it makes sense to review the combined patch to reduce the reviewing effort, but I'll defer to each reviewer's preference.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632359#comment-13632359 ] Jimmy Xiang commented on HBASE-7006: You mentioned that this patch depends on some assumption. Have you verified it? If so, which patch should be reviewed and committed first? Or do you want them all reviewed and committed together?
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632198#comment-13632198 ] Jimmy Xiang commented on HBASE-7006: Cool, that's great!
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632172#comment-13632172 ] Jeffrey Zhong commented on HBASE-7006: --
{quote} it sounds like we trade disk IO for network IO {quote}
No, we cut both the disk IO and the network IO related to creating and deleting recovered.edits files. With distributed replay we replay the WAL directly to the destination region server, while in the old way the destination RS reads the recovered edits back from the underlying HDFS. In terms of network IO the two are the same, because the old way still has to read the recovered-edits files across the wire. The difference is that in distributed replay WAL edits are pushed to the destination RS, while the old way pulls edits from the recovered.edits files (which are intermediate files). In summary, the IO related to recovered.edits files is all gone, without any extra IO. I think this question is a common one and I'll include it in the write-up.
{quote} Suppose a region server fails again in the middle: does a split worker need to split the WAL again? Does this mean a WAL may be read/split multiple times? {quote}
We handle sequential RS failures like a new RS failure and replay the WALs it left behind. We may read a WAL multiple times during sequential failures, but we do not replay edits multiple times once they have been flushed.
{quote} In the attached performance testing, do we have a breakdown of how much time is spent reading the log file and writing to the recovered-edits file? How did you measure the log splitting time? {quote}
I don't have the breakdown, since reading and writing happen at the same time. In normal cases writing finishes a few seconds after reading is done. We have metrics in SplitLogManager that measure the total splitting time, and that's what I used in the testing.
The latest combined patch is attached in 7837.
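To make the push-versus-pull contrast above concrete, here is a rough pseudo-Java sketch of the two recovery paths; every name in it (readWAL, writeToRecoveredEdits, locateRegion, replayEdits, WALEntry) is made up for illustration and is not the patch's actual API:
{code}
// Old path: the split worker writes per-region recovered.edits files; the RS that later opens the
// region pulls those files back from HDFS and replays them (extra intermediate reads and writes).
for (WALEntry entry : readWAL(deadServerWal)) {
  writeToRecoveredEdits(entry.getRegionName(), entry);
}

// Distributed log replay: regions are reopened for writes first (marked "recovering"),
// then the split worker pushes the edits straight to whichever RS now hosts each region.
for (WALEntry entry : readWAL(deadServerWal)) {
  HRegionLocation location = locateRegion(entry.getRegionName());
  replayEdits(location, entry);   // batched replay RPC; no recovered.edits files at all
}
{code}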
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631981#comment-13631981 ] Jimmy Xiang commented on HBASE-7006: I read the proposal and have some questions. At first it sounds like we trade disk IO for network IO, which should have better performance. As for the memstore flush write saving after recovered.edits have been replayed, the proposal needs to do the same, right? You just write them to another WAL file, isn't that true? Suppose a region server fails again in the middle: does a split worker need to split the WAL again? Does this mean a WAL may be read/split multiple times? In the attached performance testing, do we have a breakdown of how much time is spent reading the log file and writing to the recovered-edits file? How did you measure the log splitting time?
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583653#comment-13583653 ] Jeffrey Zhong commented on HBASE-7006: -- Marking it critical so that we can ship this in 0.96. Thanks, -Jeffrey
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576821#comment-13576821 ] Jeffrey Zhong commented on HBASE-7006: -- @Ted, Yes, my first patch will include the major logic for this JIRA; it will be attached to a sub-task JIRA (to be created) and submitted within the next two days. There will be two more sub-JIRAs: one to create a replay command and the other to add metrics for better reporting. Thanks, -Jeffrey
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576339#comment-13576339 ] Ted Yu commented on HBASE-7006: --- @Jeff: Do you plan to publish your patch in a sub-task of this JIRA?
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576304#comment-13576304 ] Enis Soztutar commented on HBASE-7006: -- Agreed that it is the middle ground. On region open, the RS has to do a read on the index, a seek, and a sequential read for each region. However, in your approach, as you reported off-list, we pay for re-locating the regions and the RPC overhead instead of just streaming sequential writes to HDFS. I was just curious which one would be faster, given the current implementation. I am not suggesting that we should prototype that as well, especially given that we can open the regions for writes in 1-2 secs with this.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576244#comment-13576244 ] Jeffrey Zhong commented on HBASE-7006: -- The Bigtable approach is a kind of middle ground between the existing implementation and the proposal in this JIRA. The file-block implementation seems to need more work, though. Each region server has to read all the newly created block files to replay edits, but it cuts writes significantly, so it should be an improvement over the existing approach (not over the new proposal, since it still reads the recovery data twice, once during log splitting and once during the replay phase, and incurs some extra writes). Thanks, -Jeffrey
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576225#comment-13576225 ]
Enis Soztutar commented on HBASE-7006:
--
These are excellent results, especially with a large # of regions. We will also benefit from other improvements to connection management, region discovery, etc., which means those numbers can go even lower. Let's try to get this in with the current set of changes; then, as we debug and learn more, we can do follow-ups.
One thing we did not test is not writing a file per region per WAL file, but doing the bigtable approach instead: for each WAL file, read up to the DFS block size (128 MB), sort the edits per region in memory, and write one file per block. The files carry a simple per-region index. Not sure how we can test that easily, though.
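As a concrete illustration of that description, here is a rough sketch of such a splitter: buffer edits up to roughly one DFS block, sort them per region in memory, and flush each buffer as a single file with a simple (region, offset, length) index. Every class and interface name below is invented for the example and stands in for the real WAL and output abstractions; this is a sketch of the idea, not a proposed implementation.
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: a "bigtable-style" WAL splitter that writes one
 *  region-sorted, indexed file per ~DFS block of edits. */
public class BlockSortingSplitterSketch {

  // Minimal stand-ins for the real WAL and output abstractions (assumptions).
  public interface WalEdit { String regionName(); long approxSize(); }
  public interface WalReader { WalEdit next() throws IOException; }          // null at end of WAL
  public interface BlockFileWriter {
    long position();
    void append(WalEdit edit) throws IOException;
    void addIndexEntry(String region, long offset, long length);
    void writeIndexAndClose() throws IOException;
  }
  public interface BlockFileWriterFactory { BlockFileWriter newBlockFile() throws IOException; }

  private static final long BLOCK_SIZE = 128L * 1024 * 1024;  // assumed DFS block size

  /** Buffer up to ~one block of edits, sorted by region (the TreeMap keeps
   *  regions ordered), then write one output file per block. */
  public void splitWal(WalReader wal, BlockFileWriterFactory writers) throws IOException {
    Map<String, List<WalEdit>> editsByRegion = new TreeMap<>();
    long buffered = 0;
    for (WalEdit edit = wal.next(); edit != null; edit = wal.next()) {
      editsByRegion.computeIfAbsent(edit.regionName(), r -> new ArrayList<>()).add(edit);
      buffered += edit.approxSize();
      if (buffered >= BLOCK_SIZE) {
        flush(editsByRegion, writers.newBlockFile());
        editsByRegion.clear();
        buffered = 0;
      }
    }
    if (!editsByRegion.isEmpty()) {
      flush(editsByRegion, writers.newBlockFile());
    }
  }

  /** Write each region's run of edits contiguously, then a simple
   *  (region, offset, length) index so replay can seek straight to its run. */
  private void flush(Map<String, List<WalEdit>> editsByRegion, BlockFileWriter out)
      throws IOException {
    for (Map.Entry<String, List<WalEdit>> e : editsByRegion.entrySet()) {
      long start = out.position();
      for (WalEdit edit : e.getValue()) {
        out.append(edit);
      }
      out.addIndexEntry(e.getKey(), start, out.position() - start);
    }
    out.writeIndexAndClose();
  }
}
{code}
The writes stay large and sequential, which is the attraction; the cost, as noted in the replies above, is that each recovering region server must read the index of every block file that might contain its edits.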
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576178#comment-13576178 ]
Ted Yu commented on HBASE-7006:
---
This is encouraging. Looking forward to your patch.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550705#comment-13550705 ]
Jeffrey Zhong commented on HBASE-7006:
--
Thanks, Stack, for reviewing the proposal!
{quote}
What if we do multiple WALs per regionserver? That shouldn't change your processing model far as I can see.
{quote}
Yeah, you're right. Multiple WALs per RS won't affect the proposal.
Thanks,
-Jeffrey
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550662#comment-13550662 ]
stack commented on HBASE-7006:
--
Excellent write-up, Jeffrey. I was thinking myself that we might do what Nicolas suggests at the end. It looks like you handle failures properly. The savings will be large, I'd think, and it actually simplifies the log splitting process, I'd say. What if we do multiple WALs per regionserver? That shouldn't change your processing model far as I can see.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479513#comment-13479513 ]
stack commented on HBASE-7006:
--
[~nkeywal] No sir. The limit was 8 WALs, but the write rate overran the limit, so it was almost 40 WALs each.
[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster
[ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478770#comment-13478770 ]
nkeywal commented on HBASE-7006:
Nothing related to HBASE-6738? Isn't there a limit of 32 WALs per node (hence ~900 WALs)? Or did you lose more nodes?
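For readers following the numbers in this exchange, a quick back-of-the-envelope check using only the figures quoted in the comments and the issue description:
\[
30 \ \text{nodes} \times 32 \ \tfrac{\text{WALs}}{\text{node}} = 960 \ \text{WALs},
\qquad
\frac{1700 \ \text{WALs observed}}{30 \ \text{nodes}} \approx 57 \ \tfrac{\text{WALs}}{\text{node}}.
\]
A 32-per-node cap would put the cluster near the roughly 900 WALs nkeywal expected, well short of the 1700 observed, which is what prompted the question about either a higher per-node count or additional lost nodes.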