[jira] [Commented] (TS-3549) configurable option to avoid thundering herd due to concurrent requests for the same object

Alan M. Carroll (JIRA) Mon, 27 Apr 2015 07:31:56 -0700

    [ https://issues.apache.org/jira/browse/TS-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514197#comment-14514197 ]


Alan M. Carroll commented on TS-3549:
-------------------------------------

Overall it seems reasonable, but there are some issues:

"thundering_herd" is a particularly bad name for the configuration option and 
member. "bypass_origin_on_write_lock_fail" or something similar would be much 
better.

There is a potential issue with changing '@DataInfo' to '@Ats-Internal' if 
that header gets written to cache. I need to check specifically whether the 
'@' headers are written to disk (hopefully not).

Also, I thought we had agreed on '@Ats-Internal', not '@Ats-Internal-Message'.

> configurable option to avoid thundering herd due to concurrent requests for 
> the same object
> -------------------------------------------------------------------------------------------
>
>                 Key: TS-3549
>                 URL: https://issues.apache.org/jira/browse/TS-3549
>             Project: Traffic Server
>          Issue Type: New Feature
>          Components: HTTP
>    Affects Versions: 5.3.0
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>             Fix For: 6.0.0
>
>         Attachments: TS-3549.diff
>
>
> When ATS is used as a delivery server for a live video streaming event, it's 
> possible that there are a huge number of concurrent requests for the same 
> object. Depending on the type of object being requested, the cache lookup 
> for those objects can result in either a stale copy of the object (e.g. 
> manifest files) or a complete cache miss (e.g. segment files). ATS currently 
> supports different types of connection collapse (e.g. *read-while-write* 
> functionality - 
> *https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#read-while-writer*,
>  swr etc.), but in order for *rww* to kick in, ATS requires that the complete 
> response headers for the object be received and validated. In other words, 
> until this happens, any number of incoming requests for the same object that 
> result in a cache miss or a stale copy would be forwarded to the origin. For 
> a scenario such as a live event, this leaves a significant window during 
> which hundreds of requests could be forwarded to the origin for the same 
> object. It has been observed in production that this results in a 
> significant increase in latency for the objects waiting in 
> read-while-write state. 
> Note that there are also a couple of settings, 
> *proxy.config.http.cache.open_read_retry_time* and 
> *proxy.config.http.cache.max_open_read_retries* 
> (*https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#open-read-retry-timeout*),
>  that can alleviate the thundering herd to some extent by retrying to get 
> the read lock for the object as configured. With these configured, ATS would 
> keep retrying to get the read lock for the configured number of attempts, 
> and if it's still not available because the write lock is held by the first 
> request that was forwarded to the origin (e.g. the response headers have not 
> been received yet), then all the waiting requests would simply be forwarded 
> to the origin (by disabling cache for each of them). 
> It is almost impossible to tune the above settings so that they help in all 
> possible situations (traffic, concurrent connections, network conditions, 
> etc.). For this reason, a configurable workaround is proposed below that 
> avoids the thundering herd completely. The patch below is mainly from 
> [~jlaue] and [~psudaemon], with some additional cleanup, configuration 
> control, debug headers, etc.
> Basically, when configured, on failing to obtain a write lock for an object 
> (which means there's another ongoing parallel request for the same object 
> that was forwarded to the origin), if it's a cache refresh miss, a stale copy 
> of the object is served, while if it's a complete cache miss, a *502* error 
> is returned to let the client (e.g. a player) retry. The *502* error 
> also includes a special internal ATS header named {{@ats-internal-messages}} 
> with the appropriate value to allow for custom logging or for plugins to take 
> any appropriate actions (e.g. prevent a fail-over if there's a plugin 
> that fails over on a regular 502 error).
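
For readers following the thread, the decision described in the last quoted paragraph can be sketched roughly as below. This is only an illustrative Python model, not the patch itself (ATS is written in C++). The names `Lookup`, `Response`, `handle_request` and `option_enabled`, and the header value "write-lock-failed", are hypothetical and chosen for this sketch. The fresh-hit branch is an assumption added for completeness.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Lookup(Enum):
    FRESH = "fresh"  # usable copy in cache (assumed case, not discussed above)
    STALE = "stale"  # copy in cache that needs a refresh
    MISS = "miss"    # nothing in cache


@dataclass
class Response:
    status: int
    source: str  # "cache", "origin", or "proxy"
    internal_header: Optional[str] = None  # hypothetical '@' header value


def handle_request(lookup: Lookup, got_write_lock: bool,
                   option_enabled: bool) -> Response:
    """Model what one request gets back under the proposed option."""
    if lookup is Lookup.FRESH:
        return Response(200, "cache")
    if got_write_lock or not option_enabled:
        # Either this is the request that holds the write lock, or the
        # option is off and the request falls through to the origin.
        return Response(200, "origin")
    if lookup is Lookup.STALE:
        # Refresh miss without the write lock: serve the stale copy.
        return Response(200, "cache")
    # Complete miss without the write lock: ask the client to retry.
    return Response(502, "proxy", internal_header="write-lock-failed")
```

With the option enabled, only the request holding the write lock reaches the origin. Every other concurrent request for the same object gets either the stale copy or the 502.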



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
