[ https://issues.apache.org/jira/browse/TS-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514241#comment-14514241 ]
Sudheer Vinukonda commented on TS-3549:
---------------------------------------

- I don't really have any particular preference for the naming, but can you explain why you think *thundering_herd* is a bad name? It seems a little more intuitive and self-explanatory (and something that *users* can perhaps relate to) than *bypass_origin_on_write_lock_fail*.
- If *@DataInfo* has never been used at all, can you explain what issues (if any) you anticipate?
- Again, I don't really have any particular preference over the naming; I am fine with changing the name to *@Ats-Internal* if that's what you prefer.

> configurable option to avoid thundering herd due to concurrent requests for
> the same object
> -------------------------------------------------------------------------------------------
>
>                 Key: TS-3549
>                 URL: https://issues.apache.org/jira/browse/TS-3549
>             Project: Traffic Server
>          Issue Type: New Feature
>          Components: HTTP
>    Affects Versions: 5.3.0
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>             Fix For: 6.0.0
>
>         Attachments: TS-3549.diff
>
>
> When ATS is used as a delivery server for a live video streaming event, it's
> possible that there are a huge number of concurrent requests for the same
> object. Depending on the type of the object being requested, the cache lookup
> for those objects can result in either a stale copy of the object (e.g.
> manifest files) or a complete cache miss (e.g. segment files). ATS currently
> supports different types of connection collapsing (e.g. *read-while-write*
> functionality -
> *https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#read-while-writer*,
> swr etc.) but, in order for *rww* to kick in, ATS requires that the complete
> response headers for the object be received and validated. In other words,
> until this happens, any number of incoming requests for the same object that
> result in a cache miss or a stale cache hit would be forwarded to the origin.
> For a scenario such as a live event, this leaves a significant window during
> which hundreds of requests could be forwarded to the origin for the same
> object. It has been observed in production that this results in a
> significant increase in latency for the objects waiting in read-while-write
> state.
> Note that there are also a couple of settings,
> *proxy.config.http.cache.open_read_retry_time* and
> *proxy.config.http.cache.max_open_read_retries*
> (*https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#open-read-retry-timeout*),
> that can alleviate the thundering herd to some extent by retrying to get
> the read lock for the object as configured. With these configured, ATS
> retries the read lock for the configured duration; if it is still not
> available because the write lock is held by the first request that was
> forwarded to the origin (e.g. the response headers have not been received
> yet), then all the waiting requests are simply forwarded to the origin (by
> disabling cache for each of them).
> It is almost impossible to tune the above settings so that they help in all
> possible situations (traffic, concurrent connections, network conditions,
> etc.). For this reason, a configurable workaround is proposed below that
> avoids the thundering herd completely. The patch below is mainly from
> [~jlaue] and [~psudaemon], with some additional cleanup, configuration
> control, debug headers, etc.
> Basically, when configured, on failing to obtain a write lock for an object
> (which means there is another ongoing parallel request for the same object
> that was forwarded to the origin), if it's a cache refresh miss, a stale copy
> of the object is served, while if it's a complete cache miss, a *502* error
> is returned so that the client (e.g. a player) can retry.
> The *502* error also includes a special internal ATS header named
> {{@ats-internal-messages}} with the appropriate value, to allow for custom
> logging or for plugins to take any appropriate action (e.g. to prevent a
> fail-over if there is a plugin that fails over on a regular 502 error).
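
For readers following the thread, the write-lock-failure handling described in the issue can be sketched roughly as below. This is only an illustrative Python model of the decision, not the code in TS-3549.diff: the names `Lookup`, `Decision`, `decide` and `option_enabled` are hypothetical, and the header value is a placeholder because the description only says "the appropriate value".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Lookup(Enum):
    """Outcome of the cache lookup for the requested object."""
    FRESH = "fresh"   # usable cached copy
    STALE = "stale"   # cached copy exists but needs a refresh
    MISS = "miss"     # nothing cached at all


@dataclass
class Decision:
    action: str                       # what the proxy does with the request
    status: Optional[int] = None      # set only when an error is returned
    headers: Dict[str, str] = field(default_factory=dict)


INTERNAL_HEADER = "@ats-internal-messages"
# Hypothetical placeholder: the issue does not state the actual value.
PLACEHOLDER_VALUE = "write-lock-not-acquired"


def decide(lookup: Lookup, got_write_lock: bool, option_enabled: bool) -> Decision:
    """Model the behaviour described in the issue (assumed, simplified)."""
    if lookup is Lookup.FRESH:
        return Decision("serve_cache")
    # Holding the write lock means this request is the one that refreshes or
    # fills the object; with the option off, waiting requests fall through
    # to the origin as described above.
    if got_write_lock or not option_enabled:
        return Decision("go_to_origin")
    # Option on, write lock held by another request for the same object.
    if lookup is Lookup.STALE:
        return Decision("serve_stale")
    return Decision("error", 502, {INTERNAL_HEADER: PLACEHOLDER_VALUE})
```

With the option enabled, only the request that holds the write lock reaches the origin; every concurrent request either gets the stale copy or a 502 it can retry, which is what removes the herd.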