[ https://issues.apache.org/jira/browse/HBASE-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827141#comment-16827141 ]

Andrew Purtell commented on HBASE-22301:
----------------------------------------

[~dmanning] It is possible there could be no writes for a long time. It is not 
very likely, but we can handle it. If we find that the difference between 'now' 
and the last time we triggered a roll is more than twice the monitoring interval 
when the count finally goes over the threshold, we can reset the count instead 
of requesting a roll. This will prevent the corner case you describe.
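
For illustration, here is a minimal sketch of that guard. The class, field, and 
method names are hypothetical, not the actual FSHLog code; it only captures the 
check described above under those assumptions:

{code:java}
/**
 * Hypothetical sketch of the stale-count guard described above. When the slow
 * sync count crosses the threshold, we only signal a roll if the count
 * accumulated within roughly the monitoring interval; otherwise we treat the
 * count as stale (e.g. a long idle period with no writes) and reset it.
 */
class SlowSyncRollCheck {
  private final int slowSyncRollThreshold;   // e.g. 100 slow syncs
  private final long monitoringIntervalMs;   // e.g. 60,000 ms
  private int slowSyncCount;
  private long lastRollOrResetTimeMs;

  SlowSyncRollCheck(int threshold, long intervalMs) {
    this.slowSyncRollThreshold = threshold;
    this.monitoringIntervalMs = intervalMs;
    this.lastRollOrResetTimeMs = System.currentTimeMillis();
  }

  /** Called once per sync that exceeded the slow sync warning threshold. */
  synchronized boolean onSlowSync(long nowMs) {
    slowSyncCount++;
    if (slowSyncCount < slowSyncRollThreshold) {
      return false;
    }
    slowSyncCount = 0;
    boolean stale = nowMs - lastRollOrResetTimeMs > 2 * monitoringIntervalMs;
    lastRollOrResetTimeMs = nowMs;
    return !stale;   // true means the caller should request a WAL roll
  }
}
{code}

The idea is that the sync path would feed slow syncs into something like this and 
hand off to the existing roll request mechanism only when it returns true.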

Regarding the default thresholds in this patch: I picked 10 slow syncs in one 
minute as a totally arbitrary choice so I could complete the change and get a 
patch up for consideration. Now let's discuss what reasonable defaults should 
be.

Based on your analysis of our fleet under normal operation, this change would 
result in:
 - If the threshold is 10 slow syncs in 1 minute, we would request ~30,000 WAL 
rolls per day under normal operating conditions, across on the order of 100 
clusters. Load is distributed unevenly, so dividing this number evenly by the 
number of clusters doesn't make sense. This is more than we would want, I think. 
 - If the threshold is 200 slow syncs in 1 minute, we would request ~475 WAL 
rolls per day under normal operating conditions, across on the order of 100 
clusters. This would not be harmful. 
 - During the incident that inspired this change, we had in excess of 500 slow 
sync warnings in one minute.

As mentioned above, slow sync warnings can easily be false positives due to 
regionserver GC activity, which makes using them as a signal problematic, but 
not unreasonable if we set the thresholds to sufficiently discriminate abnormal 
conditions.

Also, bear in mind that under steady state writes we will frequently roll the 
log upon reaching the file size roll threshold anyway. False positive slow sync 
based rolls will be noise among this activity if we set the threshold right.

Therefore, I think the next patch will have a default threshold of 100 slow 
syncs in one minute. Still somewhat arbitrary, as defaults tend to be, but given 
the particular example of our production fleet that would amount to ~950 rolls 
per day under normal operating conditions across 100 clusters. In trade, it 
would still trigger even if a cluster is only under modest write load, and it 
would certainly have discriminated the HDFS level issues we encountered during 
our incident.
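
As a sketch of how those defaults could be exposed, something like the following 
would work. The property names and class here are placeholders for this 
discussion, not necessarily what the final patch will use:

{code:java}
import org.apache.hadoop.conf.Configuration;

/** Placeholder configuration keys and defaults for the discussion above. */
public final class SlowSyncRollConfig {
  // Hypothetical property names; the final patch may name these differently.
  public static final String ROLL_THRESHOLD_KEY =
      "hbase.regionserver.wal.slowsync.roll.threshold";
  public static final String CHECK_INTERVAL_KEY =
      "hbase.regionserver.wal.slowsync.roll.interval.ms";

  /** Default: 100 slow syncs within the monitoring interval requests a roll. */
  public static int rollThreshold(Configuration conf) {
    return conf.getInt(ROLL_THRESHOLD_KEY, 100);
  }

  /** Default: a one minute monitoring interval. */
  public static long checkIntervalMs(Configuration conf) {
    return conf.getLong(CHECK_INTERVAL_KEY, 60_000L);
  }

  private SlowSyncRollConfig() {}
}
{code}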

> Consider rolling the WAL if the HDFS write pipeline is slow
> -----------------------------------------------------------
>
>                 Key: HBASE-22301
>                 URL: https://issues.apache.org/jira/browse/HBASE-22301
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>             Fix For: 3.0.0, 1.5.0, 2.3.0
>
>         Attachments: HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, 
> HBASE-22301-branch-1.patch
>
>
> Consider the case when a subset of the HDFS fleet is unhealthy but suffering 
> a gray failure not an outright outage. HDFS operations, notably syncs, are 
> abnormally slow on pipelines which include this subset of hosts. If the 
> regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be 
> consumed waiting for acks from the datanodes in the pipeline (recall that 
> some of them are sick). Imagine a write heavy application distributing load 
> uniformly over the cluster at a fairly high rate. With the WAL subsystem 
> slowed by HDFS level issues, all handlers can be blocked waiting to append to 
> the WAL. Once all handlers are blocked, the application will experience 
> backpressure. All (HBase) clients eventually have too many outstanding writes 
> and block.
> Because the application is distributing writes near uniformly in the 
> keyspace, the probability any given service endpoint will dispatch a request 
> to an impacted regionserver, even a single regionserver, approaches 1.0. So 
> the probability that all service endpoints will be affected approaches 1.0.
> In order to break the logjam, we need to remove the slow datanodes. Although 
> there are HDFS level monitoring mechanisms and procedures for this, we 
> should also attempt to take mitigating action at the HBase layer as soon as 
> we find ourselves in trouble. It would be enough to remove the affected 
> datanodes from the writer pipelines. A super simple strategy that can be 
> effective is described below:
> This is with branch-1 code. I think branch-2's async WAL can mitigate the 
> problem but may still be susceptible. The branch-2 sync WAL is susceptible. 
> We already roll the WAL writer if the pipeline suffers the failure of a 
> datanode and the replication factor on the pipeline is too low. We should 
> also consider how much time it took for the write pipeline to complete a sync 
> the last time we measured it, or the max over the interval from now to the 
> last time we checked. If the sync time exceeds a configured threshold, roll 
> the log writer then too. Fortunately we don't need to know which datanode is 
> making the WAL write pipeline slow, only that syncs on the pipeline are too 
> slow and exceed a threshold. This is enough information to know when to 
> roll it. Once we roll it, we will get three new randomly selected datanodes. 
> On most clusters the probability the new pipeline includes the slow datanode 
> will be low. (And if for some reason it does end up with a problematic 
> datanode again, we roll again.)
> This is not a silver bullet, but it can be a reasonably effective mitigation.
> Provide a metric for tracking when log roll is requested (and for what 
> reason).
> Emit a log line at log roll time that includes datanode pipeline details for 
> further debugging and analysis, similar to the existing slow FSHLog sync log 
> line.
> If we roll too many times within a short interval of time, this probably means 
> there is a widespread problem with the fleet, and our mitigation is not 
> helping and may be exacerbating those problems or operator difficulties. 
> Ensure log roll requests triggered by this new feature happen infrequently 
> enough to not cause difficulties under either normal or abnormal conditions. 
> A very simple strategy that could work well under both normal and abnormal 
> conditions is to define a fairly lengthy interval, default 5 minutes, and 
> then ensure we do not roll more than once during this interval for this 
> reason.
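
To make the quoted strategy concrete, here is a rough sketch of how the 
sync-time check and the 5 minute guard interval could fit together. All names 
are illustrative assumptions, not the actual WAL roller code:

{code:java}
/**
 * Illustrative sketch of the strategy in the description: request a WAL roll
 * when a sync takes longer than a configured threshold, but no more than once
 * per guard interval (default 5 minutes). Names are hypothetical.
 */
class SlowPipelineRollGuard {
  private final long slowSyncThresholdMs;  // a sync slower than this is "too slow"
  private final long minRollIntervalMs;    // e.g. 5 * 60 * 1000
  private long lastRollRequestTimeMs;

  SlowPipelineRollGuard(long slowSyncThresholdMs, long minRollIntervalMs) {
    this.slowSyncThresholdMs = slowSyncThresholdMs;
    this.minRollIntervalMs = minRollIntervalMs;
  }

  /** @return true if the caller should request a log roll for this sync. */
  synchronized boolean onSyncCompleted(long syncDurationMs, long nowMs) {
    if (syncDurationMs < slowSyncThresholdMs) {
      return false;                    // pipeline looks healthy
    }
    if (nowMs - lastRollRequestTimeMs < minRollIntervalMs) {
      return false;                    // rolled recently; avoid thrashing
    }
    lastRollRequestTimeMs = nowMs;     // a roll picks new datanodes for the pipeline
    return true;
  }
}
{code}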



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
