[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005853#comment-16005853 ]

Hadoop QA commented on HBASE-18027:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 23s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 58s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 54s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 6s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 47s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 51s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 30m 11s {color} | {color:green} Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 58s {color} | {color:red} hbase-server generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 101m 12s {color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 146m 2s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hbase-server |
| | Should org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$ReplicationEntryList be a _static_ inner class? At HBaseInterClusterReplicationEndpoint.java:[lines 198-216] |
| Failed junit tests | hadoop.hbase.replication.multiwal.TestReplicationEndpointWithMultipleWAL |
| | hadoop.hbase.replication.TestReplicationEndpoint |
| | hadoop.hbase.replication.multiwal.TestReplicationEndpointWithMultipleAsyncWAL |
| Timed out junit tests | org.apache.hadoop.hbase.master.procedure.TestTableDescriptorModificationFromClient |
| | org.apache.hadoop.hbase.TestAcidGuarantees |
| | org.apache.hadoop.hbase.master.snapshot.TestSnapshotFileCache |
| | org.apache.hadoop.hbase.master.procedure.TestRestoreSnapshotProcedure |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:757bf37 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12867479/HBASE-18027.patch |
| JIRA Issue | HBASE-18027 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 0699854560b0 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007092#comment-16007092 ]

Hadoop QA commented on HBASE-18027:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 32s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 47s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 48s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 42s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 48s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 27m 45s {color} | {color:green} Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 53s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 108m 27s {color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 149m 19s {color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.12.3 Server=1.12.3 Image:yetus/hbase:757bf37 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12867611/HBASE-18027.patch |
| JIRA Issue | HBASE-18027 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 742ef1c65a59 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 0ae0edc |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| whitespace | https://builds.apache.org/job/PreCommit-HBASE-Build/6766/artifact/patchprocess/whitespace-eol.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/6766/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/6766/console |
| Powered by | Apache Yetus 0.3.0 http://yetus.apache.org |

This message was automatically generated.

> HBaseInterClusterReplicationEndpoint should respect RPC size limits when
> batching edits
> ---
>
> Key: HBASE-18027
> URL: https://issues.apache.org/jira/browse/HBASE-18027
> Project: HBase
>
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007433#comment-16007433 ]

Andrew Purtell commented on HBASE-18027:

Can I get a review [~lhofhansl]?

> HBaseInterClusterReplicationEndpoint should respect RPC size limits when
> batching edits
> ---
>
> Key: HBASE-18027
> URL: https://issues.apache.org/jira/browse/HBASE-18027
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.0.0, 1.4.0, 1.3.1
> Reporter: Andrew Purtell
> Assignee: Andrew Purtell
> Fix For: 2.0.0, 1.4.0, 1.3.2
>
> Attachments: HBASE-18027-branch-1.patch, HBASE-18027.patch,
> HBASE-18027.patch
>
>
> In HBaseInterClusterReplicationEndpoint#replicate we try to replicate in
> batches. We create N lists. N is the minimum of configured replicator
> threads, number of 100-waledit batches, or number of current sinks. Every
> pending entry in the replication context is then placed in order by hash of
> encoded region name into one of these N lists. Each of the N lists is then
> sent all at once in one replication RPC. We do not test if the sum of data in
> each N list will exceed RPC size limits. This code presumes each individual
> edit is reasonably small. Not checking for aggregate size while assembling
> the lists into RPCs is an oversight and can lead to replication failure when
> that assumption is violated.
> We can fix this by generating as many replication RPC calls as we need to
> drain a list, keeping each RPC under limit, instead of assuming the whole
> list will fit in one.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
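
The batching described in the issue text above, together with the proposed fix, can be sketched roughly as follows. This is an illustration only and not the attached patch: the class `ReplicationBatchSketch`, its `Entry` type, and the methods `numLists`, `partitionByRegion`, and `splitBySize` are hypothetical names, and the `entries / 100 + 1` term is one possible reading of "number of 100-waledit batches".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from HBase itself.
final class ReplicationBatchSketch {

    /** Stand-in for a WAL entry: its encoded region name and an estimated size. */
    static final class Entry {
        final String encodedRegionName;
        final long sizeBytes;

        Entry(String encodedRegionName, long sizeBytes) {
            this.encodedRegionName = encodedRegionName;
            this.sizeBytes = sizeBytes;
        }
    }

    /** N = min(replicator threads, number of 100-entry batches, current sinks). */
    static int numLists(int maxThreads, int entryCount, int numSinks) {
        // Assumption: "number of 100-waledit batches" is entryCount / 100 + 1.
        return Math.min(Math.min(maxThreads, entryCount / 100 + 1), numSinks);
    }

    /** Places each entry, in order, into one of n lists by hash of its region name. */
    static List<List<Entry>> partitionByRegion(List<Entry> entries, int n) {
        List<List<Entry>> lists = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            lists.add(new ArrayList<>());
        }
        for (Entry e : entries) {
            // floorMod keeps the index non-negative for negative hash codes.
            int idx = Math.floorMod(e.encodedRegionName.hashCode(), n);
            lists.get(idx).add(e);
        }
        return lists;
    }

    /**
     * Drains one worklist into as many sub-batches as needed so that each
     * sub-batch stays at or under maxBytes. An entry that is larger than
     * maxBytes on its own is still emitted, alone in its own sub-batch.
     */
    static List<List<Entry>> splitBySize(List<Entry> worklist, long maxBytes) {
        List<List<Entry>> batches = new ArrayList<>();
        List<Entry> current = new ArrayList<>();
        long currentBytes = 0;
        for (Entry e : worklist) {
            if (!current.isEmpty() && currentBytes + e.sizeBytes > maxBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(e);
            currentBytes += e.sizeBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Each sub-batch returned by `splitBySize` would then be sent as its own replication RPC. Entries keep their original order within a list, so edits for the same region are still shipped in sequence.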
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007603#comment-16007603 ]

Hadoop QA commented on HBASE-18027:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 28s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 19s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 6s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 33s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 23s {color} | {color:green} branch-1 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 3m 47s {color} | {color:red} hbase-server in branch-1 has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 51s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 8s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 28m 47s {color} | {color:green} The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 50s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 149m 54s {color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 201m 49s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.master.balancer.TestStochasticLoadBalancer2 |
| | hadoop.hbase.regionserver.TestCompactionInDeadRegionServer |
| | hadoop.hbase.regionserver.TestPerColumnFamilyFlush |
| | hadoop.hbase.replication.TestSerialReplication |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:58c504e |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12867686/HBASE-18027-branch-1.patch |
| JIRA Issue | HBASE-18027 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux d27b0515c080 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | branch-1 / 51cb537 |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| findbugs | https://builds.apache.org/job/PreCommit-HBASE-Build/6768/artifact/patchprocess/branch-findbugs-hbase-server-warnings.html |
| whitespace | https://builds.apache.org/job/PreCommit-HBASE-Build/6768/artifact/patchprocess/whitespace-eol.txt |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/6768/artifact/patchprocess/patch-unit-hbase-server.txt |
| unit test logs | https://builds.apache.org/job/PreCommit-HBASE-Build/6768/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Re
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009126#comment-16009126 ]

Lars Hofhansl commented on HBASE-18027:
---

Looking
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009134#comment-16009134 ]

Lars Hofhansl commented on HBASE-18027:
---

So looking at the code... In the original code I assumed that the caller does the size enforcement, and indeed I see that happening in the code. {{HBaseInterClusterReplicationEndpoint.replicate}} is called from {{ReplicationSourceWorkerThread.shipEdits}}, which is called from {{ReplicationSourceWorkerThread.run}} after the call to {{ReplicationSourceWorkerThread.readAllEntriesToReplicateOrNextFile}}, which reads the next batch _and_ crucially enforces the replication batch size limit.

So no single batch issued from within {{replicate}} can be larger than the overall batch size enforced (which defaults to 64MB).

So I don't see how this causes a problem (but as usual, it is entirely possible that I missed a piece of the puzzle here).
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009136#comment-16009136 ]

Lars Hofhansl commented on HBASE-18027:
---

(And hence perhaps this is just checking that the replication batch size limit is <= the RPC size limit)
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009381#comment-16009381 ]

Andrew Purtell commented on HBASE-18027:

[~lhofhansl] The worklist handed to HICRE#Replicator can exceed the RPC limit, so we break it into separate RPCs if we need to. I think that's the best place to do it, since it is the code that is directly involved with creating the RPCs. The replication batch limit and the RPC size limits can be set in a way that accidentally conflicts, so we need to do this check at the last step.
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009409#comment-16009409 ]

Andrew Purtell commented on HBASE-18027:

Let me add some debug level logging where we detect the worklist exceeds RPC size limits in the proposed changes. That's a good point. It will indicate if the various configurations are in conflict and aid debugging if the caller's size estimation is broken or there is some other factor causing unexpectedly large worklists to be handed to Replicator.
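
The debug-level check mentioned in the comment above might look roughly like the sketch below. It is not taken from the patch: the class `WorklistSizeCheckSketch` and the method `exceedsRpcLimit` are hypothetical, the worklist is reduced to a list of estimated entry sizes, and `java.util.logging` at `FINE` level stands in for whatever logging facility the real code uses.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; names and logging framework are assumptions.
final class WorklistSizeCheckSketch {
    private static final Logger LOG =
        Logger.getLogger(WorklistSizeCheckSketch.class.getName());

    /**
     * Returns true when the summed entry sizes exceed the RPC size limit,
     * logging the condition at debug (FINE) level so that conflicting
     * configuration or a broken size estimate in the caller is visible.
     */
    static boolean exceedsRpcLimit(List<Long> entrySizes, long rpcLimitBytes) {
        long total = 0;
        for (long size : entrySizes) {
            total += size;
        }
        boolean over = total > rpcLimitBytes;
        if (over && LOG.isLoggable(Level.FINE)) {
            LOG.fine("Worklist of " + entrySizes.size() + " entries (" + total
                + " bytes) exceeds the RPC size limit of " + rpcLimitBytes
                + " bytes; it will be split into multiple RPCs");
        }
        return over;
    }
}
```

Logging only when the limit is actually exceeded keeps the normal path quiet, while a steady stream of these messages would suggest that the replication batch limit is configured above the RPC limit.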
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009503#comment-16009503 ]

Hadoop QA commented on HBASE-18027:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 28s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 17s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 18s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 27s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 25s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 25s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 37s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 34s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 60m 47s {color} | {color:green} Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 or 3.0.0-alpha2. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 21s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 49s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 138m 7s {color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 1m 0s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 226m 4s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestPerColumnFamilyFlush |
| | hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster |
| Timed out junit tests | org.apache.hadoop.hbase.replication.regionserver.TestWALEntryStream |
| | org.apache.hadoop.hbase.snapshot.TestSecureExportSnapshot |
| | org.apache.hadoop.hbase.snapshot.TestExportSnapshot |
| | org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient |
| | org.apache.hadoop.hbase.filter.TestFuzzyRowFilterEndToEnd |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.03.0-ce Server=17.03.0-ce Image:yetus/hbase:757bf37 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12867949/HBASE-18027.patch |
| JIRA Issue | HBASE-18027 |
| Optional Tests | asflicense javac javadoc unit findbugs hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 29c9fdc8e4a8 4.8.3-std-1 #1 SMP Fri Oct 21 11:15:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 305ffcb |
| Default Java | 1.8.0_131 |
| findbugs | v3.0.0 |
| whitespace | https://builds.apache.org/job/PreCommit-HBASE-Build/6781/artifact/patchprocess/whitespace-eol.txt |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/6781/artifact/patchprocess/patch-unit-hbase-server.txt |
| unit test logs | ht
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009511#comment-16009511 ]

Lars Hofhansl commented on HBASE-18027:
---

bq. The worklist handed to HICRE#Replicator can exceed the RPC limit so we break it into separate RPCs

But how would it do that? That can only happen when the last edit in the batch is huge (so it would be batch-size-limit + size-of-last-edit).
The initial design was for simplicity, so that only the outer code needs to do any bookkeeping (like progress in ZK).

Lemme look at the patch a little closer. Offhand I'd be more a fan of handling this in the calling code.

> HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
> ---
>
> Key: HBASE-18027
> URL: https://issues.apache.org/jira/browse/HBASE-18027
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.0.0, 1.4.0, 1.3.1
> Reporter: Andrew Purtell
> Assignee: Andrew Purtell
> Fix For: 2.0.0, 1.4.0, 1.3.2
>
> Attachments: HBASE-18027-branch-1.patch, HBASE-18027-branch-1.patch, HBASE-18027.patch, HBASE-18027.patch, HBASE-18027.patch
>
>
> In HBaseInterClusterReplicationEndpoint#replicate we try to replicate in batches. We create N lists. N is the minimum of configured replicator threads, number of 100-waledit batches, or number of current sinks. Every pending entry in the replication context is then placed in order by hash of encoded region name into one of these N lists. Each of the N lists is then sent all at once in one replication RPC. We do not test if the sum of data in each N list will exceed RPC size limits. This code presumes each individual edit is reasonably small. Not checking for aggregate size while assembling the lists into RPCs is an oversight and can lead to replication failure when that assumption is violated.
> We can fix this by generating as many replication RPC calls as we need to drain a list, keeping each RPC under limit, instead of assuming the whole list will fit in one.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
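For readers following the thread, the batching in the issue description and the proposed fix can be sketched roughly as below. This is a minimal model, not the HBase implementation: `SizedEntry`, `ReplicationBatching`, and all method names are hypothetical, `String.hashCode()` stands in for the hash of the encoded region name, and `entryCount / 100 + 1` is only an assumed reading of "number of 100-waledit batches".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a WAL entry: only the region name and an approximate size matter here.
final class SizedEntry {
    final String encodedRegionName;
    final long sizeBytes;

    SizedEntry(String encodedRegionName, long sizeBytes) {
        this.encodedRegionName = encodedRegionName;
        this.sizeBytes = sizeBytes;
    }
}

final class ReplicationBatching {
    // N = min(replicator threads, number of 100-entry batches, current sinks).
    // The formula for the middle term is an assumption; the result is clamped to at least 1.
    static int listCount(int maxThreads, int entryCount, int sinkCount) {
        int n = Math.min(Math.min(maxThreads, entryCount / 100 + 1), sinkCount);
        return Math.max(1, n);
    }

    // Places every entry, in arrival order, into one of n lists chosen by a hash of its region name.
    static List<List<SizedEntry>> partition(List<SizedEntry> entries, int n) {
        List<List<SizedEntry>> lists = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            lists.add(new ArrayList<>());
        }
        for (SizedEntry e : entries) {
            lists.get(Math.floorMod(e.encodedRegionName.hashCode(), n)).add(e);
        }
        return lists;
    }

    // Drains one list into as many batches as needed so that each batch sums to at most maxRpcBytes.
    // An entry larger than the limit by itself ends up alone in its own batch.
    static List<List<SizedEntry>> splitForRpc(List<SizedEntry> list, long maxRpcBytes) {
        List<List<SizedEntry>> batches = new ArrayList<>();
        List<SizedEntry> current = new ArrayList<>();
        long currentBytes = 0;
        for (SizedEntry e : list) {
            // Check before adding, so the running total never crosses the limit.
            if (!current.isEmpty() && currentBytes + e.sizeBytes > maxRpcBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(e);
            currentBytes += e.sizeBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Note that a single entry larger than the limit still produces an oversized batch in this sketch; splitting only helps when the aggregate, not one edit, is what exceeds the limit.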
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009513#comment-16009513 ]

Lars Hofhansl commented on HBASE-18027:
---

I also notice that readAllEntriesToReplicateOrNextFile calculates the size this way (in 1.3.x):
{code}
WAL.Entry entry = ...;
...
currentSize += entry.getEdit().heapSize();
currentSize += calculateTotalSizeOfStoreFiles(edit);
{code}

Perhaps that may be the discrepancy...? (and the fact that we check after we added the entry - as you point out)

We can do this patch of course. But I do think it'd be simpler and easier to follow/change later if we fix it in the caller and don't introduce another loop inside the sending code.
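The "check after we added the entry" point can be illustrated with a small model. This is a sketch under assumptions, not the HBase reader loop: `BatchFill` and both methods are hypothetical, and entry sizes are plain byte counts.

```java
import java.util.List;

final class BatchFill {
    // Add first, then test the limit: the batch can overshoot by up to the size of the last entry,
    // i.e. it can reach roughly batch-size-limit + size-of-last-edit.
    static long fillCheckAfter(List<Long> sizes, long limit) {
        long total = 0;
        for (long s : sizes) {
            total += s;
            if (total >= limit) {
                break;
            }
        }
        return total;
    }

    // Test the limit first, then add: the batch stays at or under the limit,
    // unless the very first entry alone is larger than the limit.
    static long fillCheckBefore(List<Long> sizes, long limit) {
        long total = 0;
        for (long s : sizes) {
            if (total > 0 && total + s > limit) {
                break;
            }
            total += s;
        }
        return total;
    }
}
```

With entries of 30, 30, 30 and 500 bytes and a limit of 100, the first variant accepts all four entries and the second stops after the third, which is the overshoot described in the earlier comment.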
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009567#comment-16009567 ]

Andrew Purtell commented on HBASE-18027:

[~lhofhansl] The problems we are facing in production are of the nature "Replication failure across many superpod hosts with error "Rpc data length X exceeded limit YY, set hbase.ipc.max.request.size on server to override this limit(not recommended)"". It's not clear how we are getting to such huge RPCs in the first place. Obviously some WALEdits are very large. Phoenix is in the mix. I noticed this code in Replicator does not split the worklist up if the sum is larger than the RPC limit.

This is with 0.98 (and patched, too), which is dead code. Perhaps we table this change and wait and see if the same problems occur with 1.3?
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009570#comment-16009570 ]

Andrew Purtell commented on HBASE-18027:

If you want to do this in the caller instead, it needs to factor in RPC size limits there, which are independent of the setting for replication batch size.
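One way the caller could account for both settings is to cap its batch at whichever limit is smaller. The sketch below is only an illustration: `EffectiveLimit` and its parameters are hypothetical, and the `headroom` fraction is an added assumption to cover the gap between an in-memory size estimate and the bytes actually sent in the RPC.

```java
final class EffectiveLimit {
    // replicationBatchBytes: the configured replication batch size capacity.
    // rpcMaxRequestBytes: the server-side RPC limit (hbase.ipc.max.request.size in the error above).
    // headroom: fraction of the RPC limit held back for framing and estimate error (assumed, 0 <= headroom < 1).
    static long effectiveBatchBytes(long replicationBatchBytes, long rpcMaxRequestBytes, double headroom) {
        if (headroom < 0.0 || headroom >= 1.0) {
            throw new IllegalArgumentException("headroom must be in [0, 1)");
        }
        long rpcBudget = (long) (rpcMaxRequestBytes * (1.0 - headroom));
        // The two settings are independent, so the caller takes the stricter of the two.
        return Math.min(replicationBatchBytes, rpcBudget);
    }
}
```

If the batch size capacity is already well below the RPC limit, this returns the batch size unchanged; the cap only matters when the batch setting has been raised above what one RPC may carry.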
[jira] [Commented] (HBASE-18027) HBaseInterClusterReplicationEndpoint should respect RPC size limits when batching edits
[ https://issues.apache.org/jira/browse/HBASE-18027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011109#comment-16011109 ]

Andrew Purtell commented on HBASE-18027:

[~lhofhansl] and I discussed this in person. Summarizing the discussion here: Rather than introduce a new inner loop (which would result in a loop in a loop in a loop), lift the check for RPC size limit violation up into the caller. Because the implementation of Replicator won't change, a different unit test will be necessary. There is also a TRACE level debug log line that I want to make DEBUG and move. I will put up a new patch shortly.