[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-12 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683864#comment-16683864
 ] 

Wellington Chevreuil commented on HBASE-21461:
--

Thanks [~stack]. I have submitted another patch as a txt file; if applied to 
hbase-operator-tools, it would create the structure below:
{noformat}
hbase-operator-tools
   - hbase-hbck2
   - hbase-replication
   - wal-split-replication-cp{noformat}

Let me know if this is ok to add to the tools repo. BTW, I'm assuming we can't 
attach files here as *.patch*, as jira would try to run it against the main 
hbase repository, right?

> Region CoProcessor for splitting large WAL entries in smaller batches, to 
> handle situation when faulty ingestion had created too many mutations for 
> same cell in single batch
> -
>
> Key: HBASE-21461
> URL: https://issues.apache.org/jira/browse/HBASE-21461
> Project: HBase
>  Issue Type: New Feature
>  Components: hbase-operator-tools, Replication
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt, 
> HBASE-21461-master.001.txt
>
>
> On deployments with replication enabled, it's possible that faulty ingestion 
> clients produce a single WalEntry containing too many edits for the same cell. 
> This causes *ReplicationSink*, in the target cluster, to attempt a single 
> batch mutation with too many operations, which in turn can lead to very large 
> RPC requests that may not fit in the target RS rpc queue. In this 
> case, the messages below are seen on the target RS trying to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
> attempt=4/4 failed=2ops, last exception: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
>  Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
> too small? on regionserver01.example.com,60020,1524334173359, tracking 
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
> actions: RemoteWithExtrasException: 2 times, 
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication gets stuck and wal files pile up in 
> the source cluster WALs/oldWALs folders. The typical workaround requires 
> manual cleanup of replication znodes in ZK and manual WAL replay for the WAL 
> files containing the large entry.
> This CP would handle the issue by checking for large wal entries and 
> splitting those into smaller batches in the *preReplicateLogEntries* method 
> hook.
> *Additional Note*: HBASE-18027 introduced some safeguards against such large RPC 
> requests, which may already help avoid this scenario. Those are not available 
> for 1.2 releases, though, so this CP tool may still be relevant for 1.2 
> clusters. It may also still be worth having to work around any potential 
> unknown large RPC issue scenarios.





[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682722#comment-16682722
 ] 

stack commented on HBASE-21461:
---

Let's do option #2. I can help. It's too early (or too late -- smile) for #3. Should I 
try adding the patch here over in the tools repo?



[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-10 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682563#comment-16682563
 ] 

Wellington Chevreuil commented on HBASE-21461:
--

{quote}It could be here on this JIRA w/ instructions on how to build. Might be 
ok given limited audience... but wouldn't encourage confidence in the 'hosed' 
operator.
{quote}
Maybe have a built jar available here with install instructions. I guess 
requiring a whole build environment setup would be too discouraging for 
admins/operators.
{quote}Or it'd be in tools repo... Your plan for a replication submodule sounds 
good. In it would be a submodule for this cp ... setting the jdk7 compile 
target and having dependency on branch-1.
{quote}
The benefit of this approach is that we start tidying up the house and putting most 
of the support/operations "hacks" on their own shelves. BTW, should we have another 
jira/thread to discuss what else could be moved to the 
"/operator-tools/replication" submodule (assuming there's none yet)?
{quote}Or, we start the cp 'store' repo... where we start putting cps. 
(smile).
{quote}
That could be another way to organise extra tools/features. Are there other CPs 
planned to be moved out of the main hbase project?



[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682538#comment-16682538
 ] 

stack commented on HBASE-21461:
---

It could be here on this JIRA w/ instructions on how to build. Might be ok 
given limited audience... but wouldn't encourage confidence in the 'hosed' 
operator.

Or it'd be in tools repo... Your plan for a replication submodule sounds good. 
In it would be a submodule for this cp ... setting the jdk7 compile target and 
having dependency on branch-1.

Or, we start the cp 'store' repo... where we start putting cps. (smile).





[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-10 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682392#comment-16682392
 ] 

Wellington Chevreuil commented on HBASE-21461:
--

Thanks for the insights, [~stack]!
{quote}I agree it's an operator-tool but it is a bit 'odd' being branch-1 only 
and a CP only (small audience – but super cool throwing these hosed operators a 
lifeline...).
{quote}
My thought was to have it as the first feature of a "replication" sub-module in 
operator-tools. Any other potential utilities for replication-related operational 
issues could then be placed there as well. The limited audience, though, might 
indeed be something to consider in weighing whether it's really worth the effort for now.
{quote}How would we package it? Would we build a jar over in 
hbase-operator-tool and then operator would take it and install when they had a 
constipated replication stream?
{quote}
Yeah, operators would need to download it (if we are planning to expose a download 
page for operator-tools) or build it, and then install it. Put that way, it does not 
really sound like a tool, since it's not a simple matter of running an external 
application that interacts with and fixes hbase problems. Maybe we should call it a 
"medicine" (a laxative one :)).
{quote}One other thought is that we add to the refguide a section on 
constipation (smile) w/ a pointer here w/ instructions on how to install.
{quote}
I liked this idea too. In this case, where and how would we place the CP? Were you 
thinking of providing the built jar somewhere, or just the raw code attached to a 
jira in patch format? I tend to prefer the former, as a means to cover a 
broader audience of operators who may not be familiar with the build process.
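
If we do end up shipping a built jar, installing it on the target (sink) cluster 
should boil down to dropping the jar on each region server's classpath and 
registering the observer in hbase-site.xml, followed by a rolling restart of the 
region servers. Something along these lines (the class name below is just a 
placeholder, not the final one):
{noformat}
<property>
  <name>hbase.coprocessor.regionserver.classes</name>
  <value>org.apache.hadoop.hbase.coprocessor.example.WalEntrySplitterObserver</value>
</property>{noformat}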

 


[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-09 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682153#comment-16682153
 ] 

stack commented on HBASE-21461:
---

Good on you [~wchevreuil]

I wish we had a 'store' for CPs. We'd put it 'there'. I agree it's an 
operator-tool but it is a bit 'odd' being branch-1 only and a CP only (small 
audience -- but super cool throwing these hosed operators a lifeline...). How 
would we package it? Would we build a jar over in hbase-operator-tool and then 
operator would take it and install when they had a constipated replication 
stream?

One other thought is that we add to the refguide a section on constipation 
(smile) w/ a pointer here w/ instructions on how to install.



[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-09 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681710#comment-16681710
 ] 

Wellington Chevreuil commented on HBASE-21461:
--


{quote}My only concern is that this may be pigeon-hole'd into only having 
relevance for a small amount of deploys. However, even if one person finds 
value from it, it's probably worth it.
{quote}

Yeah, that's likely to be the case, mainly as more and more clusters move to 
newer versions that already have the mentioned fix (unless another unforeseen 
condition can trigger a similar problem). It also relies on CP API version 1, so 
it's not even compatible with hbase 2 (which may not be a big deal, as the 
root cause is already tackled in version 2). But since we had this ready and 
usable, I thought it was worth sharing anyway :)




[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-09 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681697#comment-16681697
 ] 

Josh Elser commented on HBASE-21461:


{quote}It will still replicate in the same sequence, just in several batches 
instead of a single large one. This is currently done synchronously. Also, it 
preserves the op's original timestamp from the source, which I think is the most 
critical thing here to maintain correct state.
{quote}
Ok, cool. When you put it that way, I agree :). My brain is still sputtering to 
get started.
{quote}This CP, however, is intended more as an admin tool (that's why I propose 
it as part of operator tools)
{quote}
Gotcha. I don't think we have a well-defined "measure" of what we want to put 
into operator-tools yet. My only concern is that this may be pigeon-hole'd into 
only having relevance for a small amount of deploys. However, even if one 
person finds value from it, it's probably worth it.

[~stack] or [~busbey], any thoughts on including such a tool into 
operator-tools?
{quote}Yeah, it's definitely worth trying. I haven't actually evaluated such a 
backport; I was trying to integrate it into our own distribution that's based on 
1.2 (with some divergences), but couldn't manage to get it working properly. I 
can try a "pure" branch-1.2, though.
{quote}
Cool, that's definitely a parallel thread for us to keep a finger on. Making 
sure our upstream has the necessary changes when we can make them is important.

Thanks for the great info, Wellington. Making my life easy :)



[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-09 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681685#comment-16681685
 ] 

Wellington Chevreuil commented on HBASE-21461:
--

Thanks for reviewing this, [~elserj] !
{quote}Is it safe to split up a WalEdit? Wouldn't this introduce the potential 
for us to replicate the wrong state in the destination cluster? I'm not 
positive, but it's giving me pause.
{quote}
It will still replicate in the same sequence, just in several batches instead 
of a single large one. This is currently done synchronously. Also, it preserves 
the op's original timestamp from the source, which I think is the most critical 
thing here to maintain correct state.

{quote}
If that is safe (or at least, safe enough), couldn't we just split up the 
WalEdit within the source RS and prevent the need for an extra 
CP/user-intervention?
{quote}

That would be the ideal and definitive solution to prevent the problem from 
happening. This CP, however, is intended more as an admin tool (that's why I 
propose it as part of operator tools). For deployments already affected by 
this issue, a source-side fix would still require manual intervention to clean up 
and unblock the replication sinks. The CP, once plugged in, would just allow the 
stuck replication edits to drain, with no other manual intervention (much 
simpler than having to clean up znodes and oldWALs, run the wal player, etc). 
Also, the definitive fix from HBASE-18027 requires an upgrade. From a product 
support perspective, many organizations have restrictive upgrade policies.

{quote}
Could/should we backport 18027 to branch-1.2? Have you looked at that/found it 
infeasible?
{quote}
Yeah, it's definitely worth trying. I haven't actually evaluated such a backport; I 
was trying to integrate it into our own distribution that's based on 1.2 (with 
some divergences), but couldn't manage to get it working properly. I can try a 
"pure" branch-1.2, though.


[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-09 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681646#comment-16681646
 ] 

Josh Elser commented on HBASE-21461:


{quote}On deployments with replication enabled, it's possible that faulty 
ingestion clients produce a single WalEntry containing too many edits for the 
same cell. This causes *ReplicationSink*, in the target cluster, to 
attempt a single batch mutation with too many operations, which in turn can lead 
to very large RPC requests that may not fit in the target RS rpc queue. 
In this case, the messages below are seen on the target RS trying to perform the 
sink:
{quote}
[~wchevreuil], trying to understand this one a little better...

Is it safe to split up a WalEdit? Wouldn't this introduce the potential for us 
to replicate the wrong state in the destination cluster? I'm not positive, but 
it's giving me pause.

If that is safe (or at least, safe enough), couldn't we just split up the 
WalEdit within the source RS and prevent the need for an extra 
CP/user-intervention?
{quote}HBASE-18027 introduced some safeguards against such large RPC requests, which 
may already help avoid this scenario. Those are not available for 1.2 releases, 
though, so this CP tool may still be relevant for 1.2 clusters.
{quote}
Could/should we backport 18027 to branch-1.2? Have you looked at that/found it 
infeasible?



[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-09 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681348#comment-16681348
 ] 

Wellington Chevreuil commented on HBASE-21461:
--

Uploaded an initial patch as txt, since I'm not sure it would be applied against 
the proper hbase-operator-tools repository. Since this specific CP depends on 
hbase branch-1, maybe we should create a similar branch structure for the 
hbase-operator-tools repository, so that we can place tools targeted at specific 
hbase versions on related branches?
