[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on HBASE-21461 started by Wellington Chevreuil.
----------------------------------------------------

> Region CoProcessor for splitting large WAL entries in smaller batches, to
> handle situation when faulty ingestion had created too many mutations for
> same cell in single batch
> --------------------------------------------------------------------------
>
>                 Key: HBASE-21461
>                 URL: https://issues.apache.org/jira/browse/HBASE-21461
>             Project: HBase
>          Issue Type: New Feature
>          Components: hbase-operator-tools, Replication
>            Reporter: Wellington Chevreuil
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt,
>                      HBASE-21461-master.001.txt
>
>
> In deployments with replication enabled, faulty ingestion clients may produce
> a single WALEntry containing too many edits for the same cell. This causes
> *ReplicationSink* in the target cluster to attempt a single batch mutation
> with too many operations, which in turn leads to very large RPC requests that
> may not fit in the target RS RPC queue. In this case, the messages below are
> seen on the target RS trying to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME,
> attempt=4/4 failed=2ops, last exception:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
> Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size
> too small?
> on regionserver01.example.com,60020,1524334173359, tracking
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2
> actions: RemoteWithExtrasException: 2 times,
> at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication gets stuck and WAL files pile up in
> the source cluster's WALs/oldWALs folder. The typical workaround requires
> manual cleanup of the replication znodes in ZK, plus manual replay of the WAL
> files containing the large entry.
> This CP handles the issue by checking for large WAL entries and splitting
> them into smaller batches in the *preReplicateLogEntries* method hook.
> *Additional Note*: HBASE-18027 introduced safeguards against such large RPC
> requests, which may already help avoid this scenario. Those are not available
> in 1.2 releases, though, so this CP tool is still relevant for 1.2 clusters.
> It may also be worth having as a workaround for any unknown large RPC issue
> scenarios.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
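The core technique the description refers to, partitioning one oversized batch of operations into sub-batches below a size limit before handing them to the sink, can be sketched in plain Java. Note this is only an illustration of the splitting idea: the class and parameter names (`BatchSplitter`, `maxBatchSize`) are invented here and are not taken from the attached patch, which operates on HBase WALEntry/Mutation lists inside the replication hook.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of the batch-splitting idea behind the WAL entry
 * splitter CP: rather than submitting one oversized batch (which can exceed
 * the RS call queue limit), partition it into consecutive sub-batches of at
 * most a configured size. Names here are hypothetical, not from the patch.
 */
public class BatchSplitter {

  /**
   * Splits {@code ops} into consecutive sub-lists, each containing at most
   * {@code maxBatchSize} elements, preserving the original order.
   */
  public static <T> List<List<T>> split(List<T> ops, int maxBatchSize) {
    if (maxBatchSize <= 0) {
      throw new IllegalArgumentException("maxBatchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < ops.size(); start += maxBatchSize) {
      int end = Math.min(start + maxBatchSize, ops.size());
      // Copy the sub-list so each batch is independent of the source list.
      batches.add(new ArrayList<>(ops.subList(start, end)));
    }
    return batches;
  }
}
```

Each resulting sub-batch can then be applied as its own mutation batch, keeping every individual RPC under the queue-size limit instead of failing the whole entry with CallQueueTooBigException.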