Hi Nimi,

My suspicions would probably lie somewhere between GC and large partitions.

The first tool would probably be a trace, but if you're hitting full client
timeouts from dropped messages you may find it hard to capture a useful
trace. You can try running the trace with cqlsh's timeouts cranked all the
way up, against the local node with CL=ONE, to try to force the local
machine to answer.
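
For example, something like this from cqlsh on the node itself (the
table/pk are placeholders for one of your slow queries, and the exact
timeout option name varies a bit across cqlsh versions):

    # start cqlsh against the local node with a very generous client timeout
    cqlsh --request-timeout=3600 <local_node_ip>

    cqlsh> CONSISTENCY ONE;   -- keep the coordinator from waiting on other replicas
    cqlsh> TRACING ON;        -- print a trace after each query
    cqlsh> SELECT a, b FROM keyspace.table WHERE pk = x;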

What does nodetool tpstats report for dropped message counts? Are they very
high? Are they primarily restricted to READ, or do they include MUTATION,
etc.?
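
The dropped-message counts are in the table at the bottom of the output:

    nodetool tpstats
    # the "Message type / Dropped" section at the end lists drops per type
    # (READ, MUTATION, REQUEST_RESPONSE, ...)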

Are there specific PKs that trigger this behavior, either all the time or
more consistently? That would point to either very large partitions or
potentially bad hardware on a node. nodetool cfhistograms will show you
partition sizes at various percentiles as well as your max.
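
For example (substitute your keyspace and table; on newer versions the
command is tablehistograms):

    nodetool cfhistograms <keyspace> <table>
    # the "Partition Size (bytes)" column shows the 50/75/95/98/99th
    # percentiles and the Max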

GC metrics should be accessible via JMX, and you should also have
GCInspector entries in cassandra/system.log that give you per-collection
breakdowns.
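
For example (the log path depends on your install, and gcstats may not be
available on older versions):

    grep GCInspector /var/log/cassandra/system.log | tail -20
    # or aggregate GC numbers since the last time it was called:
    nodetool gcstats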

--Bryan


On Wed, Jul 6, 2016 at 6:22 PM, Nimi Wariboko Jr <n...@channelmeter.com>
wrote:

> Hi,
>
> I've begun experiencing very high tail latencies across my clusters. While
> Cassandra's internal metrics report <1ms read latencies, responses measured
> from within the driver in my applications (round trips of query/execute
> frames) have 90th-percentile times of up to a second for very basic
> queries (SELECT a,b FROM table WHERE pk=x).
>
> I've been studying the logs to try and get a handle on what could be going
> wrong. I don't think there are GC issues, but the logs mention dropped
> messages due to timeouts while the threadpools are nearly empty -
>
> https://gist.github.com/nemothekid/28b2a8e8353b3e60d7bbf390ed17987c
>
> Relevant line:
> REQUEST_RESPONSE messages were dropped in last 5000 ms: 1 for internal
> timeout and 0 for cross node timeout. Mean internal dropped latency: 54930
> ms and Mean cross-node dropped latency: 0 ms
>
> Are there any tools I can use to start to understand what is causing these
> issues?
>
> Nimi
>
>
