[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13653495#comment-13653495
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12582559/hbase-7006-combined-v6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 31 new 
or modified tests.

{color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.util.TestHBaseFsck

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5620//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch, 
> hbase-7006-combined-v6.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652258#comment-13652258
 ] 

stack commented on HBASE-7006:
--

I added a note to the refguide that folks should run w/ newer zks, pointing to
ZOOKEEPER-1277 as justification.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, 
> hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652253#comment-13652253
 ] 

stack commented on HBASE-7006:
--

[~jeffreyz] Thanks.

I asked about zxid.

"I think you mean the zxid? That's a 64bit number where the lower
32bits are the xid and the upper 32 bits are the epoch. The xid
increases for each write, the epoch increases when there is a leader
change. The zxid should always only increase. There was a bug where
the lower 32bits could roll over, however that resulted in the epoch
number increasing as well (64bits++) - so the constraint was
maintained (but the cluster would fail/lockup for another issue, I
fixed that in recent releases though.. Now
when that is about to happen it forces a new leader election)."

Above is from our Patrick Hunt.  Says fix is in Apache ZK (3.3.5, 3.4.4).
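
To make the layout concrete, here is a minimal standalone sketch (mine, not
ZooKeeper source) of how the epoch and xid pack into the 64-bit zxid, and why a
rollover of the lower 32 bits bumps the epoch:

{code:java}
// Sketch of the zxid layout described above: upper 32 bits = leader epoch,
// lower 32 bits = per-write counter (xid).
public class ZxidLayout {
  static long epochOf(long zxid) { return zxid >>> 32; }
  static long xidOf(long zxid) { return zxid & 0xFFFFFFFFL; }

  public static void main(String[] args) {
    long zxid = (5L << 32) | 42L;                  // epoch 5, xid 42
    System.out.println(epochOf(zxid));             // 5
    System.out.println(xidOf(zxid));               // 42
    // A rollover of the lower 32 bits carries into the epoch bits under a
    // plain 64-bit increment, so "zxid only increases" still holds -- the
    // behavior the quoted comment refers to.
    long nearRollover = (5L << 32) | 0xFFFFFFFFL;  // xid at its max
    System.out.println(epochOf(nearRollover + 1)); // 6: epoch advanced
  }
}
{code}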

If you look at the tail of the issue below, you will see a favorite hbase user
running into the rollover issue:

https://issues.apache.org/jira/browse/ZOOKEEPER-1277

Let me make sure we add to the notes that folks should upgrade to these
versions of zk.


> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, 
> hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-08 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652106#comment-13652106
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

It seems that I forgot to publish it. You should have it now. Thanks. 

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, 
> hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652094#comment-13652094
 ] 

stack commented on HBASE-7006:
--

[~jeffreyz] Nice.  Good one.

Up on rb, you may have missed another set of reviews of mine.  Thanks.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, 
> hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-08 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652074#comment-13652074
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

{quote}
How can you be sure all edits in WALs from the crashed server were replicated
already?
{quote}
This is guaranteed by the replication failover logic: replication waits for
log splitting to finish and then resumes replication on the wal files from the
failed RS. The above change just makes sure we don't replicate WAL edits
created by the replay command a second time, because those edits will be
replicated from the original wal file.
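
A minimal sketch of that idea (illustrative only, not the patch itself;
WALEdit#setScopes is the accessor as I recall it in this era):

{code:java}
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Sketch: an edit produced by a replay command gets a null replication
// scope, so the replication source skips it; the original wal file of
// the failed RS remains the single source for replication.
public class ReplayScopeSketch {
  static void markReplayEdit(WALEdit edit) {
    edit.setScopes(null); // no scope => not picked up for replication
  }
}
{code}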

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, 
> hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13652060#comment-13652060
 ] 

stack commented on HBASE-7006:
--

bq. 2) Set replayed WAL edits replication scope to null so that WAL edits 
created by replay command won't be double replicated.

How can you be sure all edits in WALs from the crashed server were replicated
already?

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, 
> hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13651704#comment-13651704
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12582262/hbase-7006-combined-v5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 27 new 
or modified tests.

{color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.regionserver.TestAtomicOperation
  org.apache.hadoop.hbase.security.access.TestAccessController

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5589//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, 
> hbase-7006-combined-v5.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13650365#comment-13650365
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12581779/hbase-7006-combined-v4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 24 new 
or modified tests.

{color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.master.TestDistributedLogSplitting

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5563//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, 
> hbase-7006-combined-v4.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13649070#comment-13649070
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12581779/hbase-7006-combined-v4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 24 new 
or modified tests.

{color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5554//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, 
> hbase-7006-combined-v4.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-01 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647328#comment-13647328
 ] 

stack commented on HBASE-7006:
--

Thinking on it, flushing after all logs are recovered is a bad idea because it
is a special case.  Replay mutations, as is, are treated like any other inbound
edit.  I think this is good.

Turning off WALs and flushing at the end, trying to figure out what we failed
to write, or writing hfiles directly -- if you could, and I don't think you
can, since edits need to be sorted in an hfile -- bypassing the memstore and
then telling the Region to pick up the new hfile when done: all of these
introduce new states that we will have to manage, complicating critical
recovery.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-01 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647325#comment-13647325
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

{quote}
This might be very important. Also, we will now allow writes on the recovering
region while the replay is happening. These other writes + replays might be
doing flushes in between.
{quote}
This is a valid concern. Let's compare the new way with the old way: old log
splitting appends each WAL edit to a recovered.edits file, while the new way
writes to disk only when the memstore reaches a certain size. Therefore, even
allowing writes during recovery, the new distributed log replay still has
better disk-write characteristics (assuming normal situations).
Your concern is more relevant when a system is close to its disk IO or other
capacity limits; allowing writes could then deteriorate the whole system even
more. I think a system operator should rate limit at a higher level rather than
use recovery logic to reject traffic, because nodes are expected to go down at
any time and we don't want our users affected even while the system is in
recovery. That said, we could provide a config flag to disallow writes during
recovery.
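
To illustrate the disk-write difference with a toy model (the names and numbers
here are made up for illustration, not taken from the patch):

{code:java}
// Toy model (not HBase code) of the write patterns compared above.
public class WritePatternSketch {
  // Old distributed log splitting: one recovered.edits append per WAL edit.
  static long appendsOldWay(long edits) {
    return edits;
  }

  // Distributed log replay: edits accumulate in the memstore and hit disk
  // only when a flush-size threshold is crossed.
  static long flushesNewWay(long edits, long editBytes, long flushBytes) {
    long totalBytes = edits * editBytes;
    return (totalBytes + flushBytes - 1) / flushBytes; // ceiling division
  }

  public static void main(String[] args) {
    System.out.println(appendsOldWay(100000));                  // 100000 small writes
    System.out.println(flushesNewWay(100000, 200, 128L << 20)); // 1 bulk flush
  }
}
{code}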


> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-01 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647280#comment-13647280
 ] 

Anoop Sam John commented on HBASE-7006:
---

[~yuzhih...@gmail.com]
bq. Without multi WAL, the above implies that all regions from one failed
region server would be assigned to one active region server.
Yes, with multi WAL only.. I was just saying it for future consideration :)
bq. I guess the underlying assumption above is that there are several region
groups in multi WAL.
Yes, that is the assumption I have made.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-01 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647278#comment-13647278
 ] 

Ted Yu commented on HBASE-7006:
---

bq. we can do a HLog to region opening RS collocation
Without multi WAL, the above implies that all regions from one failed region
server would be assigned to one active region server. This negates the
performance benefit of distributed log splitting.

bq. assigning all regions in one group to a RS
I guess the underlying assumption above is that there are several region groups 
in multi WAL such that we gain parallelism across multiple active region 
servers.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-01 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647273#comment-13647273
 ] 

Anoop Sam John commented on HBASE-7006:
---

Do we need a cleaner abstraction layer for RS->RS communication?  Maybe later,
when we can collocate an HLog with the region-opening RS (the RS where the
region is newly assigned is the only one doing the HLog split), we can do stuff
in this layer to avoid the RS connection based calls and just get the Region
ref from the RS and do direct writes.

As I mentioned in a comment above, once we have multi WAL, if we go with fixed
regions per WAL (we are in fact doing virtual groups of regions in an RS), we
can try (best effort) assigning all regions in one group to one RS and give the
log splitting work for those WALs to that RS; then there is 100% locality
w.r.t. the replay commands. Sounds sensible? Maybe in such a case the replay
can create the HFiles directly, avoiding the memstore writes and flushes (like
the bulk loading way)?  Some thoughts coming.. Pls correct me if I am going
wrong.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-01 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647252#comment-13647252
 ] 

Anoop Sam John commented on HBASE-7006:
---

[~jeffreyz] I also had the same question as Stack regarding the WAL. This
might be very important.  Also, we will now allow writes on the recovering
region while the replay is happening. These other writes + replays might be
doing flushes in between.. Anyway, replays alone might also be doing flushes in
between (because of memstore sizes)..  While replays are in progress for some
regions opened on an RS, the replay requests from other RSs take up some
handlers.  Will this affect the normal functioning of the RS?  Maybe we can
test this too, IMO: the cluster is functioning normally with reads and writes,
and then the RS goes down. So whether/how will it impact the normal read/write
throughput?

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-01 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647084#comment-13647084
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

[~saint@gmail.com] Good comments! Please see my responses, in reverse order
of your feedback:

{quote}
Would there be any advantage to NOT writing the WAL on replay and, only when
done, then flushing?
{quote}
This is a very good question. Actually I was thinking of evaluating this, after
the feature is in, as a possible optimization. Currently the receiving RS does
a WAL sync for each replay batch. In the optimization scenario, we could replay
mutations with SKIP_WAL durability and flush at the end. The gain mostly
depends on the "sequential" write performance of wal syncs. I think it's worth
a try here.
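
A rough sketch of that optimization, assuming the 0.95-era APIs (Put,
Durability, and HRegion#put exist; the flushcache call and the wiring around it
are approximations, not the actual patch):

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.regionserver.HRegion;

// Sketch: skip the per-batch WAL sync while replaying, then persist all
// replayed edits with a single flush once the replay completes.
public class SkipWalReplaySketch {
  static void replay(HRegion region, List<Put> replayBatch) throws IOException {
    for (Put p : replayBatch) {
      p.setDurability(Durability.SKIP_WAL); // no WAL write for this edit
      region.put(p);
    }
    region.flushcache(); // one flush covers the whole replayed batch
  }
}
{code}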

{quote}
The two sequenceids are never related, right? They are only applied to the logs
of the server that passed the particular sequenceid to the master?
{quote}
No, sequenceIds from different RSs are totally unrelated. Yes: currently we
use the up-to-date flushed sequence id when we open the region, by looking at
all the store files, as we do today.

{quote}
+ "...check if all WALs of a failed region server have been successfully 
replayed." How is this done?
{quote}
We rely on the fact that when the log split for a failed RS is done, all its
wal files are recovered, so we don't really do the check.
{quote}
+ How will a crashed regionserver ".. and appending itself into the list 
of...": i.e. append itself to list of crashed servers (am I reading this wrong)?
{quote}
Master SSH does the work, not the dead RS.

{quote}
+ Is your assumption about out-of-order replay of edits new to this feature? 
{quote}
Yes. I'll amend the design doc based on your other comments. Thanks.





> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13647069#comment-13647069
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12581430/hbase-7006-combined-v4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 21 new 
or modified tests.

{color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.master.TestOpenedRegionHandler
  
org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5527//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v3.patch, hbase-7006-combined-v4.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-05-01 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646864#comment-13646864
 ] 

stack commented on HBASE-7006:
--

Some comments on the design doc:

+ Nit: Add author, date, and issue number so I can get back to the hosting
issue should I trip over the doc w/o any other context.
+ Is your assumption about out-of-order replay of edits new to this feature?  I
suppose in the old/current way of log splitting, we do stuff in sequenceid
order because we wrote the recovered.edits files named by sequenceid... so they
were ordered when the regionserver read them in? We should highlight your
assumption more.  I think if we move to multiple WALs we'll want to take on
this assumption when doing recovery too.
+ Given the assumption, we should list the problematic scenarios (or point to 
where we list them already -- I think the 'Current Limitations' section here 
http://hbase.apache.org/book.html#version.delete should have the list we 
currently know).
+ "...check if all WALs of a failed region server have been successfully 
replayed."  How is this done?
+ How will a crashed regionserver ".. and appending itself into the list 
of...": i.e. append itself to list of crashed servers (am I reading this wrong)?

bq. "For each region per failed region server, we stores the last flushed 
sequence Id from the region server before it failed."

This is the mechanism that has the regionserver telling the master its current
sequenceid every time it flushes to an hfile?  So when a server crashes, the
master writes a znode under recovering-regions with the last reported seq id?
If a new regionserver hosting a recovery of regions then crashes, it gets a new
znode w/ its current sequenceid?  Now we have two crashed servers with
(probably) two different sequenceids whose logs we are recovering.  The two
sequenceids are never related, right?  They are only applied to the logs of the
server that passed the particular sequenceid to the master?
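
Something like the below is how I picture the layout, for readers following
along (my guess from the doc, not verified against the patch):

{noformat}
/hbase/recovering-regions/
  <encoded-region-name>/                  region still being recovered
    <failed-server-name>  = last flushed sequenceid reported by that server
{noformat}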


Question:  So it looks like we replay the WALs of a crashed regionserver by
playing them into the new region host servers.  There does not seem to be a
flush when the replay of the old crashed server's WALs is done.  Is your
thinking that it is not needed since the old edits are now in the new server's
WAL?  Would there be any advantage to NOT writing the WAL on replay and, only
when done, then flushing?  (I suppose not, thinking about it; in fact, it would
probably make replay more complicated since we'd have to have this new
operation to do: a flush-when-all-WALs-recovered.)

Good stuff.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-30 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645705#comment-13645705
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

Thanks [~anoop.hbase] for the review!
{quote}
For the replay we call the replay interface added in HRS from another HRS, so
all the Mutations in that call are replay mutations.
{quote}
Agree. In fact, the current implementation works this way. The replay flag is
NOT added to the MutationProto protobuf message but to the Mutation class, so
the client doesn't need to specify the flag; the receiving region server sets
it so that the write path code can apply special logic for the replay.
Otherwise I would have to add a new 'replay' flag input argument to all
functions along the write path.
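
Roughly this shape (field and accessor names here are approximations, not the
exact patch; the point is that the marker lives on the server-side object and
never goes on the wire):

{code:java}
// Approximate shape of the approach described above, not the exact patch.
public abstract class Mutation {
  private boolean replay = false; // never serialized into MutationProto

  public void setReplay(boolean replay) { this.replay = replay; }

  public boolean isReplay() { return replay; }
}
// The receiving region server flags each mutation that arrives through the
// replay RPC, e.g. for (Mutation m : batch) { m.setReplay(true); }, so the
// write path can branch on isReplay() without new method parameters.
{code}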

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-30 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645431#comment-13645431
 ] 

Anoop Sam John commented on HBASE-7006:
---

Added some comments in RB. Not yet completed the review..
Mutation.replay -> Is this new state variable really needed? For the replay we
call the replay interface added in HRS from another HRS, so all the Mutations
in that call are replay mutations.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-29 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644851#comment-13644851
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

TestMetaReaderEditor is related and the other three passed locally. I'll 
include fixes in the next patch. Thanks.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-29 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644703#comment-13644703
 ] 

stack commented on HBASE-7006:
--

Are some of the above failures because of your patch J? (Reviewing now...)

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run
> faster; much of the time is spent zk'ing and nn'ing.
> Putting it in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644339#comment-13644339
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12580940/hbase-7006-combined-v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 21 new 
or modified tests.

{color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 4 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 release 
audit warnings (more than the trunk's current 0 warnings).

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.replication.TestReplicationQueueFailover
  
org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort
  org.apache.hadoop.hbase.backup.TestHFileArchiving
  org.apache.hadoop.hbase.catalog.TestMetaReaderEditor

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5482//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, hbase-7006-combined-v3.patch, LogSplitting 
> Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-26 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643549#comment-13643549
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

Sure, I'll put the latest combined patch on the review board this weekend.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-26 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643485#comment-13643485
 ] 

stack commented on HBASE-7006:
--

[~jeffreyz] Yeah, rb it please sir (A few of us were talking about it today... 
we are all fired up for reviewing more!).   Thanks J.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-26 Thread Himanshu Vashishtha (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13643420#comment-13643420
 ] 

Himanshu Vashishtha commented on HBASE-7006:


This has really bloated now. Can you please rb it? Thanks.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642429#comment-13642429
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12580612/hbase-7006-combined-v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 21 new 
or modified tests.

{color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:red}-1 release audit{color}.  The applied patch generated 1 release 
audit warnings (more than the trunk's current 0 warnings).

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.backup.TestHFileArchiving
  org.apache.hadoop.hbase.security.access.TestAccessController

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//testReport/
Release audit warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/patchReleaseAuditProblems.txt
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5459//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> hbase-7006-combined-v2.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-25 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642027#comment-13642027
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

Hey Ram,
Thanks for the good questions. Below are the answers:
1)
{code}
catch (KeeperException e) {
+  LOG.warn("Cannot get lastFlushedSequenceId from ZooKeeper for server="
+      + regionServerName + "; region=" + encodedRegionName, e);
+  }
{code}
In this scenario we can't get the last flushed sequence id, so we'll replay all
edits in the WAL. There will be some duplicated replay, but it won't affect
correctness.
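
To spell out that fallback concretely, a minimal sketch (illustrative only;
readZkNode() and the znode layout are hypothetical stand-ins, not the patch's
actual code; Bytes is org.apache.hadoop.hbase.util.Bytes):
{code}
// If ZooKeeper can't tell us the last flushed sequence id, return -1 so the
// caller replays every edit in the WAL. Duplicate replays are safe because
// WAL edit replay is idempotent.
private long getLastFlushedSequenceId(String regionServerName, String encodedRegionName) {
  try {
    byte[] data = readZkNode(regionServerName, encodedRegionName); // hypothetical helper
    return (data == null) ? -1 : Bytes.toLong(data);
  } catch (KeeperException e) {
    LOG.warn("Cannot get lastFlushedSequenceId from ZooKeeper for server="
        + regionServerName + "; region=" + encodedRegionName, e);
    return -1; // unknown: replay all edits in the WAL
  }
}
{code}
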
2)
{code}
+} catch (KeeperException e) {
+  LOG.warn("Cannot remove recovering regions from ZooKeeper", e);
+}
{code}
We have another place that garbage-collects stale data, so the recovering ZK
znode will still be removed after a little while. In SplitLogManager we have
the following code:
{code}
  // Garbage collect left-over /hbase/recovering-regions/... znode
  if (tot == 0 && inflightWorkItems.size() == 0 && tasks.size() == 0) {
    removeRecoveringRegionsFromZK(null);
  }
{code}
-Jeffrey
  


> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-25 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641981#comment-13641981
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

[~anoop.hbase] Are you suggesting adding a request counter at the receiving RS
to see how many replays are happening? I think it's a good idea. In addition, I
don't see such a counter for each individual command (put, get, scan, etc.), so
I can add new counters for all client commands in the RS. Thanks.
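
To illustrate, a minimal sketch of what such per-command counters could look
like (hypothetical class; the real patch would more likely wire into the
hbase-hadoop-compat metrics sources than use a standalone holder):
{code}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-command request counters on the region server side.
public class RequestCounters {
  private final AtomicLong replayCount = new AtomicLong();
  private final AtomicLong putCount = new AtomicLong();

  public void incReplay() { replayCount.incrementAndGet(); } // bump once per replay RPC
  public void incPut() { putCount.incrementAndGet(); }       // bump once per client put

  public long getReplayCount() { return replayCount.get(); }
  public long getPutCount() { return putCount.get(); }
}
{code}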

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-25 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641977#comment-13641977
 ] 

ramkrishna.s.vasudevan commented on HBASE-7006:
---

It is more dependent on ZK now. Will these exceptions cause any problem if they
happen repeatedly?
{code}
catch (KeeperException e) {
+  LOG.warn("Cannot get lastFlushedSequenceId from ZooKeeper for server="
+      + regionServerName + "; region=" + encodedRegionName, e);
+  }
{code}
{code}
+} catch (KeeperException e) {
+  LOG.warn("Cannot remove recovering regions from ZooKeeper", e);
+}
{code}

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-24 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641431#comment-13641431
 ] 

Anoop Sam John commented on HBASE-7006:
---

[~jeffreyz] Do we need the metric like req count to be affected by the replay 
requests?

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-24 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641338#comment-13641338
 ] 

ramkrishna.s.vasudevan commented on HBASE-7006:
---

Patch looks good at a high level. Will go through it in detail.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13639871#comment-13639871
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12580153/hbase-7006-combined-v1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 21 new 
or modified tests.

{color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5417//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, hbase-7006-combined-v1.patch, 
> LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-23 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638964#comment-13638964
 ] 

Anoop Sam John commented on HBASE-7006:
---

The comparison numbers look promising! So now we make the region available for
writes immediately. Have you run the test with clients writing to the region
soon after it is opened for writes? Going through the patch...

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-22 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13638831#comment-13638831
 ] 

stack commented on HBASE-7006:
--

[~jeffreyz] Nice numbers in posted doc.

What does below mean sir?

{code}
+  // make current mutation as a distributed log replay change
+  protected boolean isReplay = false;
{code}

Why do we have this isReplay in a Mutation?  Because these edits get treated
differently over on the serverside?

Suggest calling the data member replay or logReplay or walReplay, and then the
accessor is isLogReplay or isWALReplay; isReplay reads like the name of a method
that returns whether the data member replay is true or not.
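
A minimal sketch of the naming being suggested (illustrative only, not the
patch's code):
{code}
// Data member named "replay"; the isXXX()/setXXX() forms are kept for the
// accessors rather than the field itself.
public abstract class Mutation {
  protected boolean replay = false; // marks a distributed-log-replay mutation

  public boolean isWALReplay() { return replay; }
  public void setWALReplay(boolean replay) { this.replay = replay; }
}
{code}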

Does this define belong in this patch?

{code}
+  /** Conf key that specifies region assignment timeout value */
+  public static final String REGION_ASSIGNMENT_TIME_OUT =
+      "hbase.master.region.assignment.time.out";
{code}

Why are we timing out assignments in this patch?

Is it log splitting that is referred to in the metric name below?

{code}
+  void updateMetaSplitTime(long time);
{code}

If so, should it be updateMetaWALSplitTime?  And given what this patch is 
about, should it be WALReplay?

Ditto for updateMetaSplitSize

Excuse me if I am not following what is going on w/ the above (because I see 
later that you have replay metrics going on)

Default is false?

{code}
+    distributedLogReplay =
+        this.conf.getBoolean(HConstants.DISTRIBUTED_LOG_REPLAY_KEY, false);
{code}

Should we turn it on in trunk and off in 0.95?  (Should we turn it on in 0.95 
so it gets a bit of testing?)
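
For reference, a minimal sketch of turning the feature on explicitly (in a real
deployment the key would normally be set in hbase-site.xml instead; Configuration
is org.apache.hadoop.conf.Configuration):
{code}
// Enable distributed log replay via the constant the patch reads,
// rather than hand-typing the key string.
Configuration conf = HBaseConfiguration.create();
conf.setBoolean(HConstants.DISTRIBUTED_LOG_REPLAY_KEY, true);
{code}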

Something wrong w/ license in WALEditsReplaySink

Skimmed the patch.  Let me come back w/ a decent review.  Looks good J.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-19 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637026#comment-13637026
 ] 

Ted Yu commented on HBASE-7006:
---

For ReplicationZookeeper.java :
{code}
+  public static byte[] toByteArray(
+  final long position) {
{code}
Considering the lockToByteArray() method that follows, maybe rename the above to
positionToByteArray().
{code}
+  public static final String REGION_ASSIGNMENT_TIME_OUT =
+      "hbase.master.region.assignment.time.out";
{code}
How about "hbase.master.region.assignment.timeout" ?
{code}
+  static final String REPLAY_BATCH_SIZE_DESC = "Number of changes of each replay batch.";
{code}
"Number of changes of each" -> "Number of changes in each"

For AssignmentManager.java :
{code}
+    long end = (timeOut <= 0) ? Long.MAX_VALUE : System.currentTimeMillis() + timeOut;
...
+  if (System.currentTimeMillis() > end) {
{code}
Please use EnvironmentEdge.
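
Concretely, the substitution being asked for would look something like this
(sketch only, mirroring the quoted snippet):
{code}
// EnvironmentEdgeManager is HBase's injectable clock; tests can substitute a
// controllable edge via EnvironmentEdgeManager.injectEdge(...), which plain
// System.currentTimeMillis() calls don't allow.
long end = (timeOut <= 0) ? Long.MAX_VALUE
    : EnvironmentEdgeManager.currentTimeMillis() + timeOut;
// ...
if (EnvironmentEdgeManager.currentTimeMillis() > end) {
  // region assignment timed out
}
{code}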

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-19 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637011#comment-13637011
 ] 

Jonathan Hsieh commented on HBASE-7006:
---

Lovely.  Thanks!

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-19 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636984#comment-13636984
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

[~jmhsieh] The initial performance numbers are in the attachment 'LogSplitting 
Comparison.pdf'. Thanks.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-19 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636983#comment-13636983
 ] 

Jonathan Hsieh commented on HBASE-7006:
---

[~jeffreyz] Do we have any numbers of how this improves our recovery time?

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636151#comment-13636151
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12579497/hbase-7006-combined.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 15 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.io.TestHeapSize

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5360//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-19 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636138#comment-13636138
 ] 

Anoop Sam John commented on HBASE-7006:
---

Will start reviewing the patch by tomorrow, Jeffrey Zhong. This will be
interesting stuff for MTTR.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636134#comment-13636134
 ] 

Hadoop QA commented on HBASE-7006:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12579488/hbase-7006-combined.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 15 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.io.TestHeapSize

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/5356//console

This message is automatically generated.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: hbase-7006-combined.patch, LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-17 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634456#comment-13634456
 ] 

Jimmy Xiang commented on HBASE-7006:


I prefer small patches; otherwise, it is hard to review.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-15 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632441#comment-13632441
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

[~jxiang] Thanks in advance for reviewing. The assumption documented in the
write-up has been verified and relies on the idempotence of HBase edits. I think
it makes sense to review the combined patch to reduce reviewing effort, but I'll
defer to each reviewer's preference.

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-15 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632359#comment-13632359
 ] 

Jimmy Xiang commented on HBASE-7006:


You mentioned that this patch depends on some assumption. Have you verified it?
If so, which patch should be reviewed and committed first? Or do you want them
all reviewed and committed together?

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-15 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632198#comment-13632198
 ] 

Jimmy Xiang commented on HBASE-7006:


Cool, that's great!

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-15 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13632172#comment-13632172
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

{quote}
it sounds like we trade disk IO for network IO
{quote}
No, we cut both the disk IO and the network IO related to the creation and
deletion of recovered.edits files.

Currently we replay the WAL to the destination region server, while in the old
way the destination RS reads recovered edits from the underlying HDFS. In terms
of network IO they're the same, because the old way still needs to read the
recovered-edits files across the wire. The difference is that in distributed
replay the WAL edits are pushed to the destination RS, while the old way pulls
edits from recovered.edits files (which are intermediate files).

In summary, the IOs related to recovered.edits files are all gone, without any
extra IOs. I think this question is common, so I'll include it in the write-up.
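
To make the push model concrete, a rough sketch (names are hypothetical; the
patch's actual sink class is WALEditsReplaySink):
{code}
// Group the dead server's WAL edits by region, then ship each batch over RPC
// to the region server now hosting that region, instead of writing
// recovered.edits files for it to pull from HDFS later.
Map<String, List<HLog.Entry>> batches = new HashMap<String, List<HLog.Entry>>();
HLog.Entry entry;
while ((entry = walReader.next()) != null) {
  String region = Bytes.toString(entry.getKey().getEncodedRegionName());
  List<HLog.Entry> batch = batches.get(region);
  if (batch == null) {
    batch = new ArrayList<HLog.Entry>();
    batches.put(region, batch);
  }
  batch.add(entry);
}
for (List<HLog.Entry> batch : batches.values()) {
  replaySink.replayEntries(batch); // hypothetical RPC call to the destination RS
}
{code}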

{quote}
Suppose a region server fails again in the middle, does a split worker need to
split the WAL again? This means a WAL may be read/split multiple times?
{quote}
We handle sequential RS failures like a new RS failure and replay the WALs it
left behind. We may read a WAL multiple times under sequential failures, but we
won't replay an edit multiple times once it has been flushed.
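
That flush guard can be sketched as follows (illustrative only; the real check
lives in the replay path and uses the region's last flushed sequence id):
{code}
// Skip edits at or below the region's last flushed sequence id: a flush
// already made them durable, so re-reading the WAL after a second failure
// must not re-apply them.
for (HLog.Entry entry : entries) {
  if (entry.getKey().getLogSeqNum() <= lastFlushedSequenceId) {
    continue; // already flushed; avoid duplicate replay
  }
  replay(entry); // hypothetical helper that ships the edit onward
}
{code}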

{quote}
In the attached performance testing, do we have a breakdown of how much time it
spends on reading the log file and writing to the recovered-edits file? How did
you measure the log splitting time?
{quote}
I don't have the breakdown, since reading and writing happen at the same time.
In normal cases writing finishes several seconds after reading is done. We have
a metric in SplitLogManager that measures the total splitting time, and that's
what I used in the testing.

The latest combined patch is attached to HBASE-7837.


> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-04-15 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631981#comment-13631981
 ] 

Jimmy Xiang commented on HBASE-7006:


I read the proposal and have some questions. At first glance, it sounds like we
trade disk IO for network IO, which should have better performance. As to the
memstore flush-write saving after recovered.edits have been replayed, the
proposal needs to do the same, right? You just write them to another WAL file,
isn't that true?

Suppose a region server fails again in the middle, does a split worker need to
split the WAL again? This means a WAL may be read/split multiple times?

In the attached performance testing, do we have a breakdown of how much time it
spends on reading the log file and writing to the recovered-edits file? How did
you measure the log splitting time?

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>  Components: MTTR
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.95.1
>
> Attachments: LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-02-21 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583653#comment-13583653
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

Marking it critical so that we can ship this in 0.96.

Thanks,
-Jeffrey

> [MTTR] Study distributed log splitting to see how we can make it faster
> ---
>
> Key: HBASE-7006
> URL: https://issues.apache.org/jira/browse/HBASE-7006
> Project: HBase
>  Issue Type: Bug
>Reporter: stack
>Assignee: Jeffrey Zhong
>Priority: Critical
> Fix For: 0.96.0
>
> Attachments: LogSplitting Comparison.pdf, 
> ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006.pdf
>
>
> Just saw interesting issue where a cluster went down  hard and 30 nodes had 
> 1700 WALs to replay.  Replay took almost an hour.  It looks like it could run 
> faster that much of the time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-02-12 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576821#comment-13576821
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

@Ted,

Yes, my first patch will include the major logic for this JIRA; it will be
attached to a sub-task JIRA (to be created) and submitted within the next two
days. There will be two more sub-JIRAs: one to create a replay command and the
other to add metrics for better reporting.

Thanks,
-Jeffrey 



[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-02-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576339#comment-13576339
 ] 

Ted Yu commented on HBASE-7006:
---

@Jeff:
Do you plan to publish your patch in sub-task of this JIRA ?



[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-02-11 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576304#comment-13576304
 ] 

Enis Soztutar commented on HBASE-7006:
--

Agreed that it is the middle ground. On region open, the RS has to do a read on 
the index, a seek, and a sequential read for each region. In your approach, 
however, as you reported off-list, we are paying for re-locating the regions 
and the RPC overhead instead of just streaming sequential writes to HDFS. I was 
just curious which one would be faster, given the current implementation. I am 
not suggesting that we should prototype that as well, especially given that we 
can open the regions for writes in 1-2 secs with this. 
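
For illustration, the read path I am describing might look like the sketch 
below; the index layout and class name are assumptions for the sketch, not an 
existing API:

{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// Hypothetical reader-side sketch for the block-file layout: on region open,
// consult the per-region index, seek once to the region's first edit, then
// stream that region's edits sequentially.
public class BlockFileRegionReader {
  private final RandomAccessFile blockFile;
  private final Map<String, long[]> index; // region name -> {offset, length}

  public BlockFileRegionReader(RandomAccessFile file, Map<String, long[]> index) {
    this.blockFile = file;
    this.index = index;
  }

  public byte[] readRegionEdits(String regionName) throws IOException {
    long[] span = index.get(regionName);
    if (span == null) {
      return new byte[0];            // this block holds no edits for the region
    }
    blockFile.seek(span[0]);         // one seek per region...
    byte[] edits = new byte[(int) span[1]];
    blockFile.readFully(edits);      // ...then a single sequential read
    return edits;
  }
}
{code}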




[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-02-11 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576244#comment-13576244
 ] 

Jeffrey Zhong commented on HBASE-7006:
--


The bigtable approach is a kind of middle ground between the existing 
implementation and the proposal in this JIRA. The file-block implementation 
seems to need more work, though. Each region server has to read all of the 
newly created block files to replay edits, but it cuts writes significantly, so 
it should improve on the existing approach (though not on the new proposal, 
since it still reads the recovery data twice, once during log splitting and 
once during the replay phase, and incurs some extra writes).
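
To make that accounting concrete, a rough back-of-the-envelope model; the byte 
counts are illustrative assumptions drawn from the description above, not 
measurements:

{code:java}
// Illustrative cost model, not a measurement: total HDFS I/O per scheme
// for W bytes of WAL data that need recovery.
public class RecoveryIoModel {
  // Existing split and the bigtable-style block files both read the WAL,
  // write the recovered data out again, and read it back during replay.
  static long splitThenReplay(long walBytes) {
    return 3 * walBytes;             // 2 reads + 1 extra write
  }

  // The new proposal reads the WAL once and replays edits straight into
  // the re-opened regions through the normal write path.
  static long directReplay(long walBytes) {
    return walBytes;                 // 1 read, no extra recovery write
  }

  public static void main(String[] args) {
    long w = 30L * 64 * 1024 * 1024; // e.g. 30 WALs of 64MB each
    System.out.println("split+replay:  " + splitThenReplay(w) + " bytes");
    System.out.println("direct replay: " + directReplay(w) + " bytes");
  }
}
{code}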

Thanks,
-Jeffrey

 



[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-02-11 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576225#comment-13576225
 ] 

Enis Soztutar commented on HBASE-7006:
--

These are excellent results, especially with a large # of regions. We will also 
benefit from other improvements to connection management, region discovery, 
etc., which means those numbers can go even lower. Let's try to get this in 
with the current set of changes; then, as we debug and learn more, we can do 
follow-ups. 

One thing we did not test is not writing a file per region per WAL file, but 
doing the bigtable approach instead: for each WAL file, read up to the DFS 
block size (128MB), sort the edits per region in memory, and write one file per 
block. The files have a simple per-region index. Not sure how we can test that 
easily, though. 
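
For what it's worth, a rough sketch of that split loop follows; the Entry 
stand-in, BLOCK_SIZE constant, and the omitted flush are assumptions for 
illustration, not the real WAL reader API:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the bigtable-style splitter described above: buffer
// WAL entries up to one DFS block, sort them by region in memory, then flush
// one output file per block together with a simple per-region index.
public class BlockSortingSplitter {
  static final long BLOCK_SIZE = 128L * 1024 * 1024; // DFS block size

  // Stand-in for a WAL entry; the real reader returns HLog.Entry objects.
  static class Entry {
    final byte[] regionName;
    final byte[] edit;
    Entry(byte[] regionName, byte[] edit) {
      this.regionName = regionName;
      this.edit = edit;
    }
    long size() { return regionName.length + edit.length; }
  }

  void split(Iterable<Entry> walEntries) {
    List<Entry> buffer = new ArrayList<Entry>();
    long buffered = 0;
    for (Entry e : walEntries) {
      buffer.add(e);
      buffered += e.size();
      if (buffered >= BLOCK_SIZE) {  // one output file per DFS block
        flushSortedBlock(buffer);
        buffer.clear();
        buffered = 0;
      }
    }
    if (!buffer.isEmpty()) {
      flushSortedBlock(buffer);
    }
  }

  void flushSortedBlock(List<Entry> block) {
    // Sort so each region's edits are contiguous, then record a
    // (region -> offset) index entry as each run is written out.
    Collections.sort(block, new Comparator<Entry>() {
      public int compare(Entry a, Entry b) {
        return compareBytes(a.regionName, b.regionName);
      }
    });
    // Writing the block file and its per-region index is omitted here.
  }

  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}
{code}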



[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-02-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576178#comment-13576178
 ] 

Ted Yu commented on HBASE-7006:
---

This is encouraging.

Looking forward to your patch.



[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-01-10 Thread Jeffrey Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550705#comment-13550705
 ] 

Jeffrey Zhong commented on HBASE-7006:
--

Thanks Stack for reviewing the proposal! 

{quote}
What if we do multiple WALs per regionserver? That shouldn't change your 
processing model far as I can see.
{quote}
Yeah, you're right; multiple WALs per RS won't affect the proposal.

Thanks,
-Jeffrey



[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2013-01-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550662#comment-13550662
 ] 

stack commented on HBASE-7006:
--

Excellent write up Jeffrey.

I was thinking myself that we might do what Nicolas suggests at the end.

It looks like you handle failures properly.

Savings will be large I'd think.

Actually simplifies the log splitting process I'd say.

What if we do multiple WALs per regionserver?  That shouldn't change your 
processing model far as I can see.



[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2012-10-18 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479513#comment-13479513
 ] 

stack commented on HBASE-7006:
--

[~nkeywal] No sir.  The limit was 8 WALs, but the write rate overran the limit, 
so it was almost 40 WALs each.



[jira] [Commented] (HBASE-7006) [MTTR] Study distributed log splitting to see how we can make it faster

2012-10-18 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478770#comment-13478770
 ] 

nkeywal commented on HBASE-7006:


Nothing related to HBASE-6738?
Isn't there a limit of 32 WALs per node (hence ~900 WALs)? Or did you lose more 
nodes?
