A bit more information. Using jmxterm to inspect a node while it is "slow" at playing hints, I see the following on the node that has hints to play:

$>get MaxHintsInProgress
#mbean = org.apache.cassandra.db:type=StorageProxy:
MaxHintsInProgress = 2048;

$>get HintsInProgress
#mbean = org.apache.cassandra.db:type=StorageProxy:
HintsInProgress = 0;

$>get TotalHints
#mbean = org.apache.cassandra.db:type=StorageProxy:
TotalHints = 129687;

Is there some throttling that would cause hints to not be played at all if, for instance, the cluster is under enough load, or something related to a timeout setting?

On Fri, Oct 27, 2017 at 1:49 AM, Andrew Bialecki <andrew.biale...@klaviyo.com> wrote:

> We have a 96 node cluster running 3.11 with 256 vnodes each. We're running
> a rolling restart. As we restart nodes, we notice that each node takes a
> while to have all other nodes marked as up, and this corresponds to nodes
> that haven't finished playing hints.
>
> We looked at the hinted handoff throttling, noticed it was still the
> default of 1024, so we tried to turn it off by setting it to zero. Reading
> the source, it looks like that rate limiting won't take effect until the
> current set of hints has finished. So we made that change cluster wide and
> then restarted the next node. However, we still saw the same issue.
>
> Looking at iftop and network throughput, it's very low (~10kB/s) and
> therefore the few 100k of hints that accumulate while the node is
> restarting end up taking several minutes to get sent.
>
> Any other knobs we should be tuning to increase hinted handoff throughput?
> Or other reasons why hinted handoff runs so slowly?
>
> --
> Andrew Bialecki
>

--
Andrew Bialecki
<https://www.klaviyo.com/>
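
A rough sketch of arithmetic that might explain the ~10kB/s figure in this thread. It assumes, based on the comment for hinted_handoff_throttle_in_kb in cassandra.yaml, that the configured throttle is reduced in proportion to the number of other nodes in the cluster. The function name is hypothetical and used only for illustration. This is not Cassandra's code, and it does not explain why a throttle set to zero would still be slow.

```python
def effective_hint_throttle_kb(configured_kb: int, cluster_size: int) -> int:
    """Approximate per-delivery-thread hint throttle in KB/s.

    Assumption (not verified against the running cluster): the configured
    throttle is divided by the number of *other* nodes in the cluster,
    using integer division.
    """
    other_nodes = max(1, cluster_size - 1)
    return configured_kb // other_nodes


if __name__ == "__main__":
    # Values taken from the thread: 96 nodes, default throttle of 1024.
    rate = effective_hint_throttle_kb(1024, 96)
    print(f"effective throttle: ~{rate} KB/s per delivery thread")
```

If that assumption holds, the default of 1024 on a 96 node cluster works out to roughly 10 KB/s per delivery thread, which is in the same range as what iftop shows.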