Hi Kurt

Thanks for the response. A few comments inline:

On Wed, Jun 28, 2017 at 1:17 PM, kurt greaves <k...@instaclustr.com> wrote:

> You're correct in that the timeout is only driver side. The server will
> have its own timeouts configured in the cassandra.yaml file.
>
Yup, OK.

> I suspect either that you have a node down in your cluster (or 4),
>
Nope, that’s not what is happening, as a) we have monitoring on all nodes,
and b) there is nothing in the logs.


> or your queries are gradually getting slower.
>
Perhaps, but we have query time metrics that don’t seem to indicate any
obvious issues.  See the attached metrics from the last 12 hours for quorum
queries.


> This kind of aligns with the slow query statements in your logs. Are you
> making changes/updates to the partitions that you are querying?
>
No


> It could be that the partitions are now spread across multiple SSTables
> and thus slowing things down. You should perform a trace to get a better
> idea of the issue.
>
If I run a CONSISTENCY QUORUM or ALL range query, it is noticeably slow in
cqlsh and unfortunately results in a trace failure: “Statement trace did
not complete within 10 seconds”.
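
For reference, I’ve been running the trace roughly like this in cqlsh; if
I remember the cqlshrc options correctly there is a [tracing]
max_trace_wait setting that would raise the 10-second wait, though that
obviously only makes cqlsh wait longer rather than fixing anything:

    -- in cqlsh
    CONSISTENCY QUORUM;   -- or ALL
    TRACING ON;
    -- ... the range query we use for monitoring goes here ...

    # in cqlshrc (my assumption about the option name)
    [tracing]
    max_trace_wait = 30.0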

> A hacky workaround would be to increase your read timeouts server side
> (read_timeout_in_ms), however this will mask underlying data model issues.
>
Yup, I certainly don’t like the idea of that.
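
For anyone following along, I believe the relevant server-side settings in
cassandra.yaml are the ones below (values shown are the defaults as I
understand them), and as you say, raising them would only mask the
problem:

    # cassandra.yaml (defaults, if I have them right)
    read_request_timeout_in_ms: 5000      # single-partition reads
    range_request_timeout_in_ms: 10000    # range scans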

I’m interested in what you said about the partitions being spread across
multiple SSTables.  Any pointers on what to look for there?
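
Is nodetool the right place to start? I was thinking of something along
these lines to see how many SSTables each read touches (tablehistograms is
cfhistograms on older versions; keyspace/table names anonymised):

    nodetool tablehistograms keyspace table   # SSTables column = SSTables read per query, by percentile
    nodetool tablestats keyspace.table        # SSTable count, partition sizes, tombstone counts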

I then wondered if perhaps a range query is really just not a good idea,
even if only for monitoring purposes. I tried querying for just one row
with the ID specified, i.e. something like SELECT * FROM keyspace.table
WHERE id = 123;  It was still incredibly slow (with CONSISTENCY ALL) and
failed a few times to generate a trace, but finally produced a trace that
can be seen at
https://gist.github.com/mattheworiordan/b1133008bf6fd14bfe6937a0004c8789#file-cassandra-trace-log.
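
For completeness, the cqlsh session was essentially:

    CONSISTENCY ALL;
    TRACING ON;
    SELECT * FROM keyspace.table WHERE id = 123;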

The worst offender seemed to be 34.207.246.175, so I ran the same query on
that instance itself to see whether it is under load or servicing requests
slowly, and it’s not. See
https://gist.github.com/mattheworiordan/b1133008bf6fd14bfe6937a0004c8789#file-local-cassandra-trace-log.

So as far as I can tell, it looks like there may be some issue with the
nodes communicating with each other, but the logs don’t reveal much.
Where to now?
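
In the meantime, I was planning to look for dropped messages or inter-node
backlog on each node with something like the following (assuming nodetool
is the right tool for this):

    nodetool status      # confirm every node sees the rest of the ring as Up/Normal
    nodetool tpstats     # the Dropped section should show any dropped READ / RANGE_SLICE messages
    nodetool netstats    # inter-node messaging and streaming state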

-- 

Regards,

Matthew O'Riordan
CEO who codes
Ably - simply better realtime <https://www.ably.io/>

*Ably News: Ably push notifications have gone live
<https://blog.ably.io/ably-push-notifications-are-now-available-64cb8ae37e74>*