[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single batch

Wellington Chevreuil (JIRA) Sat, 10 Nov 2018 05:10:08 -0800


    [ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682392#comment-16682392 ]


Wellington Chevreuil commented on HBASE-21461:
----------------------------------------------

Thanks for the insights, [~stack]!
{quote}I agree it an operator-tool but it is a bit 'odd' being branch-1 only 
and a CP only (small audience – but super cool throwing these hosed operators a 
lifeline...).
{quote}
My thought was to have it as the first feature of a "replication" sub-module 
of operator-tool. Any other utilities for replication-related operational 
issues could then be placed there as well. The limited audience, though, is 
indeed something to consider when deciding whether it's worth the effort for 
now.
{quote}How would we package it? Would we build a jar over in 
hbase-operator-tool and then operator would take it and install when they had a 
constipated replication stream?
{quote}
Yeah, operators would need to download it (if we are planning to expose a 
download page for operator-tool) or build it, and then install it. Put that 
way, it does not really sound like a tool, since it's not a simple matter of 
running an external application that interacts with hbase and fixes its 
problems. Maybe we should call it a "medicine" (a laxative one :)).
{quote}One other thought is that we add to the refguide a section on 
constipation (smile) w/ a pointer here w/ instructions on how to install.
{quote}
Liked this idea too. In this case, where and how would we place the CP? Were 
you thinking of providing the built jar somewhere, or just the raw code in 
patch format attached to a jira? I tend to prefer the former, as a means to 
cover a broader audience of operators that may not be familiar with the build 
process.

 

> Region CoProcessor for splitting large WAL entries in smaller batches, to 
> handle situation when faulty ingestion had created too many mutations for 
> same cell in single batch
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21461
>                 URL: https://issues.apache.org/jira/browse/HBASE-21461
>             Project: HBase
>          Issue Type: New Feature
>          Components: hbase-operator-tools, Replication
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> With replication-enabled deployments, it's possible that faulty ingestion 
> clients may lead to a single WalEntry containing too many edits for the same 
> cell. This would cause *ReplicationSink*, in the target cluster, to attempt a 
> single batch mutation with too many operations, which in turn can lead to 
> very large RPC requests that may not fit in the final target RS rpc queue. In 
> this case, the messages below are seen on the target RS trying to perform the 
> sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
> attempt=4/4 failed=2ops, last exception: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
>  Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
> too small? on regionserver01.example.com,60020,1524334173359, tracking 
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
> actions: RemoteWithExtrasException: 2 times, 
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication will be stuck and wal files will 
> pile up in the source cluster WALs/oldWALs folder. The typical workaround 
> requires manual cleanup of replication znodes in ZK, and manual WAL replay 
> for the WAL files containing the large entry.
> This CP would handle the issue by checking for large wal entries and 
> splitting those into smaller batches on the *preReplicateLogEntries* method 
> hook.
> *Additional Note*: HBASE-18027 introduced some safeguards against such large 
> RPC requests, which may already help avoid such a scenario. That is not 
> available for 1.2 releases, though, and this CP tool may still be relevant 
> for 1.2 clusters. It may also still be worth having it to work around any 
> potential unknown large-RPC scenarios.
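
The splitting idea in the description above can be illustrated with a minimal 
sketch. This is not the code from the attached patch: the class name 
BatchSplitter, the split method and the maxBatchSize parameter are 
hypothetical, and plain Java lists stand in for the edits of a WAL entry so 
the example runs without HBase on the classpath. It only shows how one 
oversized list of operations could be cut into consecutive, order-preserving 
batches, which matters when many edits target the same cell.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {

    /**
     * Splits items into consecutive sublists of at most maxBatchSize
     * elements. Original ordering is preserved within and across batches.
     */
    public static <T> List<List<T>> split(List<T> items, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Ten placeholder "edits" standing in for the cells of one large entry.
        List<Integer> edits = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            edits.add(i);
        }
        List<List<Integer>> batches = split(edits, 4);
        System.out.println(batches.size() + " batches: " + batches);
    }
}
```

In a real coprocessor, each resulting batch would presumably be applied as its 
own, smaller mutation request instead of one very large RPC; how the patch 
actually does that is not shown here.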



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
