[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single batch
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683864#comment-16683864 ]

Wellington Chevreuil commented on HBASE-21461:
----------------------------------------------

Thanks [~stack]. I have submitted another patch as a txt file; applied to hbase-operator-tools, it would create the structure below:
{noformat}
hbase-operator-tools
  - hbase-hbck2
  - hbase-replication
      - wal-split-replication-cp
{noformat}
Let me know if this is OK to add to the tools repo. BTW, I'm assuming we can't attach files here as *.patch*, since jira would try to apply them against the main hbase repository, right?

> Region CoProcessor for splitting large WAL entries in smaller batches, to
> handle situation when faulty ingestion had created too many mutations for
> same cell in single batch
> --------------------------------------------------------------------------
>
>                 Key: HBASE-21461
>                 URL: https://issues.apache.org/jira/browse/HBASE-21461
>             Project: HBase
>          Issue Type: New Feature
>          Components: hbase-operator-tools, Replication
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt, HBASE-21461-master.001.txt
>
> On replication-enabled deployments, faulty ingestion clients can produce a single WalEntry containing too many edits for the same cell. This causes *ReplicationSink*, in the target cluster, to attempt a single batch mutation with too many operations, which in turn can produce very large RPC requests that may not fit in the target RS RPC queue. In this case, the messages below are seen on the target RS trying to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, attempt=4/4 failed=2ops, last exception:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
> Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small?
> on regionserver01.example.com,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 actions: RemoteWithExtrasException: 2 times,
>   at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
>   at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
>   at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
>   at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
>   at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996)
> {noformat}
> When this problem manifests, replication gets stuck and WAL files pile up in the source cluster's WALs/oldWALs folders. The typical workaround requires manual cleanup of the replication znodes in ZK, plus manual WAL replay for the WAL files containing the large entry.
> This CP handles the issue by checking for large WAL entries and splitting them into smaller batches in the *preReplicateLogEntries* method hook.
> *Additional Note*: HBASE-18027 introduced some safeguards against such large RPC requests, which may already help avoid this scenario. That fix is not available for 1.2 releases, though, so this CP tool may still be relevant for 1.2 clusters. It may also be worth having as a workaround for any unknown large RPC issue scenarios.
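For readers skimming the thread, a minimal sketch of the splitting idea follows. This is an illustration, not the attached patch: the package name, the cell-count threshold, and the replayInSubBatches helper are hypothetical; the hook signature follows branch-1's RegionServerObserver (CP API v1).
{code:java}
package org.apache.hbase.operator.tools.replication; // hypothetical package

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.CellScanner;
import org.apache.hadoop.hbase.coprocessor.BaseRegionServerObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionServerCoprocessorEnvironment;
import org.apache.hadoop.hbase.protobuf.generated.AdminProtos.WALEntry;

public class WalEntrySplitterObserver extends BaseRegionServerObserver {

  // Hypothetical cap on how many cells to apply per sink batch.
  private static final int MAX_CELLS_PER_BATCH = 10000;

  @Override
  public void preReplicateLogEntries(ObserverContext<RegionServerCoprocessorEnvironment> ctx,
      List<WALEntry> entries, CellScanner cells) throws IOException {
    long totalCells = 0;
    for (WALEntry entry : entries) {
      totalCells += entry.getAssociatedCellCount();
    }
    if (totalCells <= MAX_CELLS_PER_BATCH) {
      return; // small enough: let the default ReplicationSink apply it in one batch
    }
    // Oversized entry list: replay the cells ourselves in bounded sub-batches,
    // in the original order and with the original timestamps, then skip the
    // default sink for this RPC.
    replayInSubBatches(entries, cells, MAX_CELLS_PER_BATCH);
    ctx.bypass();
  }

  private void replayInSubBatches(List<WALEntry> entries, CellScanner cells, int maxCells)
      throws IOException {
    // Elided: walk the CellScanner, group cells into Puts/Deletes of at most
    // maxCells cells per sub-batch, and submit each sub-batch synchronously
    // via Table.batch() against the entry's table.
  }
}
{code}
With such an observer on the sink RegionServers, an oversized entry list would be applied in bounded chunks instead of one huge batch call.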
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682722#comment-16682722 ]

stack commented on HBASE-21461:
-------------------------------

Let's do option #2. I can help. It's too early (or too late -- smile) for #3. Should I try adding the patch here over in the tools repo?
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682563#comment-16682563 ]

Wellington Chevreuil commented on HBASE-21461:
----------------------------------------------

{quote}It could be here on this JIRA w/ instructions on how to build. Might be ok given limited audience... but wouldn't encourage confidence in the 'hosed' operator.
{quote}
Maybe have a built jar available here with install instructions. I guess requiring a whole build environment setup would be too discouraging for admins/operators.
{quote}Or it'd be in the tools repo... Your plan for a replication submodule sounds good. In it would be a submodule for this cp, setting the jdk7 compile target and having a dependency on branch-1.
{quote}
The benefit of this approach is that we start to tidy up the house and put most of the support/operations "hacks" on their own specific shelves. BTW, should we have another jira/thread to discuss what else could be moved to the "/operator-tools/replication" submodule (assuming there's none yet)?
{quote}Or, we start the cp 'store' repo... where we start putting cps. (smile).
{quote}
That could be another way to organise extra tools/features. Are there other CPs planned to be moved out of the hbase main project?
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682538#comment-16682538 ]

stack commented on HBASE-21461:
-------------------------------

It could be here on this JIRA w/ instructions on how to build. Might be ok given limited audience... but wouldn't encourage confidence in the 'hosed' operator.

Or it'd be in the tools repo... Your plan for a replication submodule sounds good. In it would be a submodule for this cp, setting the jdk7 compile target and having a dependency on branch-1.

Or, we start the cp 'store' repo... where we start putting cps. (smile).
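To make the packaging concrete, here is a hedged sketch of what such a submodule pom might set; the parent coordinates, artifactId, and the branch-1 dependency version are assumptions, not the actual hbase-operator-tools layout:
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <!-- assumed parent coordinates -->
    <groupId>org.apache.hbase.operator.tools</groupId>
    <artifactId>hbase-operator-tools</artifactId>
    <version>1.0.0-SNAPSHOT</version>
  </parent>
  <artifactId>wal-split-replication-cp</artifactId>

  <properties>
    <!-- jdk7 compile target, since the CP is meant to run on branch-1 clusters -->
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>

  <dependencies>
    <!-- compile against a branch-1 HBase; the exact version is illustrative -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>1.2.8</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
{code}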
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682392#comment-16682392 ]

Wellington Chevreuil commented on HBASE-21461:
----------------------------------------------

Thanks for the insights, [~stack]!
{quote}I agree it's an operator-tool but it is a bit 'odd' being branch-1 only and a CP only (small audience -- but super cool throwing these hosed operators a lifeline...).
{quote}
My thought was to have it as the first feature of a "replication" sub-module of operator-tools. Any other utilities for replication-related operational issues could then be placed there as well. The limited audience, though, might indeed be something to weigh when deciding whether it's worth the effort for now.
{quote}How would we package it? Would we build a jar over in hbase-operator-tools and then the operator would take it and install it when they had a constipated replication stream?
{quote}
Yeah, operators would need to download it (if we are planning to expose a download page for operator-tools) or build it, then install it. Put that way, it does not really sound like a tool, since it's not a simple matter of running an external application that interacts with and fixes hbase problems. Maybe we should call it a "medicine" (a laxative one :)).
{quote}One other thought is that we add to the refguide a section on constipation (smile) w/ a pointer here w/ instructions on how to install.
{quote}
Liked this idea too. In this case, where and how would we place the CP? Were you thinking of providing the built jar somewhere, or just the raw code in patch format attached to a jira? I tend to prefer the former, as a means to cover the broader audience of operators who may not be familiar with the build process. Installation itself would then just be a config change, like the one sketched below.
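If the built-jar route is taken, installing the CP on the sink cluster would presumably be the usual RegionServer observer registration: place the jar on the RegionServer classpath and register the class in hbase-site.xml (the class name below is the hypothetical one from the sketch above), then rolling-restart the sink RegionServers:
{code:xml}
<property>
  <name>hbase.coprocessor.regionserver.classes</name>
  <value>org.apache.hbase.operator.tools.replication.WalEntrySplitterObserver</value>
</property>
{code}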
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682153#comment-16682153 ]

stack commented on HBASE-21461:
-------------------------------

Good on you, [~wchevreuil]. I wish we had a 'store' for CPs. We'd put it 'there'. I agree it's an operator-tool but it is a bit 'odd' being branch-1 only and a CP only (small audience -- but super cool throwing these hosed operators a lifeline...).

How would we package it? Would we build a jar over in hbase-operator-tools and then the operator would take it and install it when they had a constipated replication stream?

One other thought is that we add to the refguide a section on constipation (smile) w/ a pointer here w/ instructions on how to install.
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681710#comment-16681710 ]

Wellington Chevreuil commented on HBASE-21461:
----------------------------------------------

{quote}My only concern is that this may be pigeon-hole'd into only having relevance for a small amount of deploys. However, even if one person finds value from it, it's probably worth it.
{quote}
Yeah, that's likely to be the case, mainly as more and more clusters move to newer versions that already have the mentioned fix (unless another unforeseen condition can trigger a similar problem). It also relies on CP API version 1, so it's not even compatible with hbase 2 (which may not be a big deal, as the issue's cause is already tackled in version 2). But since we had this ready and usable, I thought it worth sharing anyway :)
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681697#comment-16681697 ]

Josh Elser commented on HBASE-21461:
------------------------------------

{quote}It will still replicate in the same sequence, just in several batches instead of a single large one. This is currently done synchronously. Also, it preserves the OP's original timestamp from the source, which I think is most critical here to maintain the correct state.
{quote}
Ok, cool. When you put it that way, I agree :). My brain is still sputtering to get started.
{quote}This CP, however, is thought of more as an admin tool (that's why I propose it as part of operator-tools).
{quote}
Gotcha. I don't think we have a well-defined "measure" of what we want to put into operator-tools yet. My only concern is that this may be pigeon-hole'd into only having relevance for a small amount of deploys. However, even if one person finds value from it, it's probably worth it. [~stack] or [~busbey], any thoughts on including such a tool in operator-tools?
{quote}Yeah, definitely worth trying. I haven't actually evaluated such a backport; I was trying to integrate it into our own distribution that's based on 1.2 (with some divergences), but couldn't manage to get it working properly. I can try a "pure" branch-1.2, though.
{quote}
Cool, that's definitely a parallel thread for us to keep a finger on. Making sure our upstream has the necessary changes when we can make them is important. Thanks for the great info, Wellington. Making my life easy :)
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681685#comment-16681685 ]

Wellington Chevreuil commented on HBASE-21461:
----------------------------------------------

Thanks for reviewing this, [~elserj]!
{quote}Is it safe to split up a WalEdit? Wouldn't this introduce the potential for us to replicate the wrong state in the destination cluster? I'm not positive, but it's giving me pause.
{quote}
It will still replicate in the same sequence, just in several batches instead of a single large one. This is currently done synchronously. Also, it preserves the OP's original timestamp from the source, which I think is most critical here to maintain the correct state (see the fragment below).
{quote}If that is safe (or at least, safe enough), couldn't we just split up the WalEdit within the source RS and prevent the need for an extra CP/user-intervention?
{quote}
That would be the ideal and definitive solution to keep the problem from happening. This CP, however, is thought of more as an admin tool (that's why I propose it as part of operator-tools). For deployments already affected by this issue, a source-side fix would still require manual intervention to clean up and unblock the replication sinks. The CP, once plugged in, would just let the stuck replication edits drain, with no other manual intervention (much simpler than having to clean up znodes and oldWALs, run the wal player, etc). Also, the definitive fixes from HBASE-18027 require an upgrade; from a product-support perspective, many organizations have restrictive upgrade policies.
{quote}Could/should we backport 18027 to branch-1.2? Have you looked at that/found it infeasible?
{quote}
Yeah, definitely worth trying. I haven't actually evaluated such a backport; I was trying to integrate it into our own distribution that's based on 1.2 (with some divergences), but couldn't manage to get it working properly. I can try a "pure" branch-1.2, though.
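A small fragment to illustrate the timestamp point (the wrapper class and method here are made up for illustration): in the 1.x client API, Put.add(Cell) carries the cell's own timestamp, so re-applying a split batch does not rewrite versions on the target cluster.
{code:java}
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Put;

public final class TimestampPreservingPut {
  // Put.add(Cell) keeps the cell's original timestamp, unlike addColumn()
  // without an explicit ts, which would resolve to the sink's current time.
  public static Put toPut(Cell cell) throws IOException {
    Put put = new Put(CellUtil.cloneRow(cell));
    return put.add(cell);
  }
}
{code}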
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681646#comment-16681646 ]

Josh Elser commented on HBASE-21461:
------------------------------------

{quote}On replication-enabled deployments, faulty ingestion clients can produce a single WalEntry containing too many edits for the same cell. This causes *ReplicationSink*, in the target cluster, to attempt a single batch mutation with too many operations, which in turn can produce very large RPC requests that may not fit in the target RS RPC queue. In this case, the messages below are seen on the target RS trying to perform the sink:
{quote}
[~wchevreuil], trying to understand this one a little better...

Is it safe to split up a WalEdit? Wouldn't this introduce the potential for us to replicate the wrong state in the destination cluster? I'm not positive, but it's giving me pause.

If that is safe (or at least, safe enough), couldn't we just split up the WalEdit within the source RS and prevent the need for an extra CP/user-intervention?
{quote}HBASE-18027 introduced some safeguards against such large RPC requests, which may already help avoid this scenario. That fix is not available for 1.2 releases, though, so this CP tool may still be relevant for 1.2 clusters.
{quote}
Could/should we backport 18027 to branch-1.2? Have you looked at that/found it infeasible?
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681348#comment-16681348 ]

Wellington Chevreuil commented on HBASE-21461:
----------------------------------------------

Uploaded an initial patch as txt, since I'm not sure it would get applied against the proper hbase-operator-tools repository. Since this specific CP depends on hbase branch-1, maybe we should create a similar branch structure for the hbase-operator-tools repository, so that we can place tools targeted at specific hbase versions on the related branches?