[ https://issues.apache.org/jira/browse/HBASE-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Purtell updated HBASE-22301:
-----------------------------------
    Status: Patch Available  (was: Open)

Patch for branch-1

> Consider rolling the WAL if the HDFS write pipeline is slow
> -----------------------------------------------------------
>
>                 Key: HBASE-22301
>                 URL: https://issues.apache.org/jira/browse/HBASE-22301
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>             Fix For: 3.0.0, 1.5.0, 2.3.0
>
>         Attachments: HBASE-22301-branch-1.patch
>
>
> Consider the case when a subset of the HDFS fleet is unhealthy but suffering
> a gray failure, not an outright outage. HDFS operations, notably syncs, are
> abnormally slow on pipelines which include this subset of hosts. If the
> regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be
> consumed waiting for acks from the datanodes in the pipeline (recall that
> some of them are sick). Imagine a write-heavy application distributing load
> uniformly over the cluster at a fairly high rate. With the WAL subsystem
> slowed by HDFS-level issues, all handlers can be blocked waiting to append to
> the WAL. Once all handlers are blocked, the application will experience
> backpressure.
> This is with branch-1 code. I think branch-2's async WAL can mitigate this
> but can still be susceptible. branch-2 sync WAL is susceptible.
> We already roll the WAL writer if the pipeline suffers the failure of a
> datanode and the replication factor on the pipeline is too low. We should
> also consider how much time it took for the write pipeline to complete a sync
> the last time we measured it, or the max over the interval from the last
> time we checked until now. If the sync time exceeds a configured threshold,
> roll the log writer then too. Fortunately we don't need to know which
> datanode is making the WAL write pipeline slow, only that syncs on the
> pipeline are too slow and exceeding a threshold. This is enough information
> to know when to roll it. Once we roll it, we will get three new randomly
> selected datanodes. On most clusters the probability that the new pipeline
> includes the slow datanode will be low. (And if for some reason it does end
> up with a problematic datanode again, we roll again.)
> This is not a silver bullet but it can be a reasonably effective mitigation.
> Provide a metric for tracking when log roll is requested (and for what
> reason).
> Emit a log line at log roll time that includes datanode pipeline details for
> further debugging and analysis, similar to the existing slow FSHLog sync log
> line.
> If we roll too many times within a short interval of time, this probably
> means there is a widespread problem with the fleet, so our mitigation is not
> helping and may be exacerbating those problems or operator difficulties.
> Ensure log roll requests triggered by this new feature happen infrequently
> enough to not cause difficulties under either normal or abnormal conditions.
> A very simple strategy that could work well under both normal and abnormal
> conditions is to define a fairly lengthy interval, default 5 minutes, and
> then ensure we do not roll more than once during this interval for this
> reason.
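
The policy described above (track the maximum sync time since the last check, request a roll when it exceeds a threshold, and allow at most one such roll per interval) can be sketched roughly as follows. This is only an illustration and is not taken from the attached patch: the class name SlowSyncRollPolicy, its method names, and the 10 second threshold are all assumptions. Only the 5 minute default interval comes from the description.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper, NOT the code in HBASE-22301-branch-1.patch.
 * Decides whether slow syncs should trigger a WAL roll request,
 * allowing at most one such request per configured interval.
 */
final class SlowSyncRollPolicy {
  private final long slowSyncThresholdMs; // assumed config value
  private final long minRollIntervalMs;   // description suggests 5 minutes

  private long maxSyncMsSinceLastCheck = 0L;
  private long lastRollRequestMs = 0L;
  private boolean everRequested = false;
  private long slowSyncRollRequests = 0L; // stand-in for a real metric

  SlowSyncRollPolicy(long slowSyncThresholdMs, long minRollIntervalMs) {
    this.slowSyncThresholdMs = slowSyncThresholdMs;
    this.minRollIntervalMs = minRollIntervalMs;
  }

  /** Record how long the write pipeline took to complete one sync. */
  synchronized void recordSync(long syncDurationMs) {
    maxSyncMsSinceLastCheck = Math.max(maxSyncMsSinceLastCheck, syncDurationMs);
  }

  /**
   * Check the max sync time observed since the previous check.
   * Returns true if a roll should be requested now for the slow sync reason.
   */
  synchronized boolean checkAndMaybeRequestRoll(long nowMs) {
    long observedMax = maxSyncMsSinceLastCheck;
    maxSyncMsSinceLastCheck = 0L; // start a new observation window
    if (observedMax <= slowSyncThresholdMs) {
      return false; // pipeline looks healthy
    }
    if (everRequested && nowMs - lastRollRequestMs < minRollIntervalMs) {
      return false; // already rolled for this reason within the interval
    }
    everRequested = true;
    lastRollRequestMs = nowMs;
    slowSyncRollRequests++;
    return true;
  }

  /** Number of roll requests made because of slow syncs. */
  synchronized long getSlowSyncRollRequests() {
    return slowSyncRollRequests;
  }

  public static void main(String[] args) {
    // 10 s threshold is an illustrative assumption; 5 min is from the description.
    SlowSyncRollPolicy policy = new SlowSyncRollPolicy(
        TimeUnit.SECONDS.toMillis(10), TimeUnit.MINUTES.toMillis(5));
    policy.recordSync(12_000L);
    long now = System.currentTimeMillis();
    System.out.println("roll requested: " + policy.checkAndMaybeRequestRoll(now));
    policy.recordSync(15_000L);
    System.out.println("roll requested again: " + policy.checkAndMaybeRequestRoll(now + 1_000L));
  }
}
```

In this sketch the caller would invoke recordSync from the sync path and checkAndMaybeRequestRoll from whatever periodic check already evaluates the low-replication condition. The counter stands in for the per-reason metric the description asks for. Resetting the window on every check keeps a single old slow sync from triggering rolls indefinitely.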