[ https://issues.apache.org/jira/browse/HBASE-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Purtell updated HBASE-22301:
-----------------------------------
    Attachment: HBASE-22301-branch-1.patch

> Consider rolling the WAL if the HDFS write pipeline is slow
> -----------------------------------------------------------
>
>                 Key: HBASE-22301
>                 URL: https://issues.apache.org/jira/browse/HBASE-22301
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>             Fix For: 3.0.0, 1.5.0, 2.3.0
>
>         Attachments: HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch
>
>
> Consider the case where a subset of the HDFS fleet is unhealthy, suffering a gray failure rather than an outright outage. HDFS operations, notably syncs, are abnormally slow on pipelines that include this subset of hosts. If the regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be consumed waiting for acks from the datanodes in the pipeline (recall that some of them are sick). Imagine a write-heavy application distributing load uniformly over the cluster at a fairly high rate. With the WAL subsystem slowed by HDFS-level issues, all handlers can be blocked waiting to append to the WAL. Once all handlers are blocked, the application experiences backpressure. All (HBase) clients eventually have too many outstanding writes and block.
>
> Because the application distributes writes near-uniformly across the keyspace, the probability that any given service endpoint will dispatch a request to an impacted regionserver, even a single impacted regionserver, approaches 1.0. So the probability that all service endpoints will be affected approaches 1.0.
>
> In order to break the logjam, we need to remove the slow datanodes. Although HDFS-level monitoring, mechanisms, and procedures exist for this, we should also attempt to take mitigating action at the HBase layer as soon as we find ourselves in trouble. It would be enough to remove the affected datanodes from the writer pipelines. A very simple strategy that can be effective is described below.
>
> This is with branch-1 code. I think branch-2's async WAL can mitigate the problem but may still be susceptible; branch-2's sync WAL is susceptible.
>
> We already roll the WAL writer if the pipeline suffers the failure of a datanode and the replication factor on the pipeline is too low. We should also consider how much time it took for the write pipeline to complete a sync the last time we measured it, or the maximum over the interval from now back to the last time we checked. If the sync time exceeds a configured threshold, roll the log writer then too. Fortunately we don't need to know which datanode is making the WAL write pipeline slow, only that syncs on the pipeline are too slow and exceed a threshold. That is enough information to know when to roll. Once we roll, we will get three new randomly selected datanodes. On most clusters the probability that the new pipeline includes the slow datanode will be low. (And if for some reason it does end up with a problematic datanode again, we roll again.)
>
> This is not a silver bullet, but it can be a reasonably effective mitigation.
>
> Provide a metric for tracking when a log roll is requested (and for what reason).
>
> Emit a log line at log roll time that includes datanode pipeline details for further debugging and analysis, similar to the existing slow FSHLog sync log line.
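>
> As a rough illustration of the above, the decision could be as simple as tracking the maximum sync time observed since the last check, comparing it against a configured threshold, bumping a counter metric, and logging the pipeline when a roll is requested. The sketch below is only a sketch under those assumptions; the class, method, and threshold names (SlowSyncRollPolicy, shouldRollForSlowSync, rollOnSyncTimeMs) are hypothetical and are not the actual branch-1 WAL internals.
>
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
> import java.util.concurrent.atomic.LongAccumulator;
> import java.util.logging.Logger;
>
> /**
>  * Illustrative sketch only: tracks observed WAL sync latencies and decides
>  * whether to request a roll. Class and method names are hypothetical.
>  */
> public class SlowSyncRollPolicy {
>   private static final Logger LOG = Logger.getLogger(SlowSyncRollPolicy.class.getName());
>
>   // Configured threshold for requesting a roll, in milliseconds.
>   private final long rollOnSyncTimeMs;
>   // Maximum sync time observed since the last check; reset when read.
>   private final LongAccumulator maxSyncTimeMs = new LongAccumulator(Math::max, 0L);
>   // Stands in for a "WAL roll requested due to slow sync" metric.
>   private final AtomicLong slowSyncRollsRequested = new AtomicLong();
>
>   public SlowSyncRollPolicy(long rollOnSyncTimeMs) {
>     this.rollOnSyncTimeMs = rollOnSyncTimeMs;
>   }
>
>   /** Record the latency of a completed sync on the current write pipeline. */
>   public void onSyncCompleted(long syncTimeMs) {
>     maxSyncTimeMs.accumulate(syncTimeMs);
>   }
>
>   /**
>    * Periodic check (e.g. from the log roller): returns true if the maximum
>    * sync time seen since the last check exceeded the threshold.
>    */
>   public boolean shouldRollForSlowSync(String pipelineDescription) {
>     long observedMax = maxSyncTimeMs.getThenReset();
>     if (observedMax <= rollOnSyncTimeMs) {
>       return false;
>     }
>     slowSyncRollsRequested.incrementAndGet();
>     // Similar in spirit to the existing slow FSHLog sync warning: include the
>     // datanode pipeline so operators can spot the suspect hosts.
>     LOG.warning(String.format(
>         "Requesting WAL roll: max sync time %d ms exceeded threshold %d ms, pipeline: %s",
>         observedMax, rollOnSyncTimeMs, pipelineDescription));
>     return true;
>   }
>
>   /** Exposed so a metrics source could publish the counter. */
>   public long getSlowSyncRollsRequested() {
>     return slowSyncRollsRequested.get();
>   }
> }
> {code}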
> If we roll too many times within a short interval, that probably means there is a widespread problem with the fleet, so our mitigation is not helping and may be exacerbating those problems or operator difficulties. Ensure log roll requests triggered by this new feature happen infrequently enough not to cause difficulties under either normal or abnormal conditions. A very simple strategy that could work well under both is to define a fairly lengthy interval, default 5 minutes, and then ensure we do not roll more than once during that interval for this reason. A sketch of such a guard follows.
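>
> To make that concrete, here is a minimal sketch of an interval guard, again with hypothetical names (SlowSyncRollLimiter, tryAcquire); the real patch may wire this differently:
>
> {code:java}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicLong;
>
> /**
>  * Illustrative guard only: allows at most one slow-sync-triggered roll per
>  * configured interval (default 5 minutes). Names are hypothetical.
>  */
> public class SlowSyncRollLimiter {
>   public static final long DEFAULT_MIN_INTERVAL_MS = TimeUnit.MINUTES.toMillis(5);
>
>   private final long minIntervalMs;
>   // Epoch millis of the last roll we permitted for this reason.
>   private final AtomicLong lastPermittedRollMs = new AtomicLong(0L);
>
>   public SlowSyncRollLimiter(long minIntervalMs) {
>     this.minIntervalMs = minIntervalMs;
>   }
>
>   /**
>    * @param nowMs current time in epoch millis, e.g. System.currentTimeMillis()
>    * @return true at most once per interval; when false, the caller should skip
>    *         the slow-sync roll request even if syncs are still slow
>    */
>   public boolean tryAcquire(long nowMs) {
>     long last = lastPermittedRollMs.get();
>     if (nowMs - last < minIntervalMs) {
>       return false;
>     }
>     // CAS so concurrent callers cannot both trigger a roll in the same interval.
>     return lastPermittedRollMs.compareAndSet(last, nowMs);
>   }
> }
> {code}
>
> Composing the two sketches, the roller would request a roll for this reason only when both shouldRollForSlowSync(...) and tryAcquire(now) return true, which bounds slow-sync roll requests to one per interval even during a fleet-wide HDFS problem.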