Re: Understanding multi region read query and latency

Raphael Mazelier Sun, 07 Aug 2022 11:26:37 -0700

> "Read repair is in the blocking read path for the query, yep"

OK, interesting. This is not what I understood from the documentation. And I use the LOCAL_ONE consistency level.

I enabled tracing (see the attachment to my first message), but I didn't see read repair in the trace (and by the way, I tried to disable it completely on my table by setting both read_repair_chance and dclocal_read_repair_chance to 0).
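A simplified model of what those two table options do may help when reading traces. My understanding (worth checking against the source for your Cassandra version) is that in Cassandra 3.x each read draws one random number and compares it with both options: below `read_repair_chance` the coordinator sends the read to all replicas in every DC, otherwise below `dclocal_read_repair_chance` it sends it to all replicas in its own DC, otherwise only to as many replicas as the consistency level needs. Both options were removed in Cassandra 4.0. The function names below are hypothetical; this is a sketch, not Cassandra's code.

```python
import random


def read_repair_decision(read_repair_chance, dclocal_read_repair_chance, rng=random):
    """Simplified model of the Cassandra 3.x per-read decision (an assumption)."""
    # One random draw in [0, 1) per read, compared against both thresholds in turn.
    chance = rng.random()
    if read_repair_chance > chance:
        return "GLOBAL"    # read sent to all replicas in every DC
    if dclocal_read_repair_chance > chance:
        return "DC_LOCAL"  # read sent to all replicas in the coordinator's DC
    return "NONE"          # only the replicas the consistency level needs


def simulate(n, read_repair_chance, dclocal_read_repair_chance, seed=0):
    """Count how often each decision occurs over n simulated reads."""
    rng = random.Random(seed)
    counts = {"GLOBAL": 0, "DC_LOCAL": 0, "NONE": 0}
    for _ in range(n):
        counts[read_repair_decision(read_repair_chance, dclocal_read_repair_chance, rng)] += 1
    return counts


if __name__ == "__main__":
    # Both options at 0 (as tried above): no read should fan out.
    print(simulate(10_000, 0.0, 0.0))
    # 0.0 and 0.1, which I believe are the 3.x table defaults.
    print(simulate(10_000, 0.0, 0.1))
```

If both options really are 0 on the table, this model predicts that no read fans out because of them, so the extra replicas seen in the slow traces would need another explanation.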

The problem with enabling tracing in cqlsh is that I only get slow results. To get fast answers, I need to iterate faster over my queries.

I can provide the traces again for analysis. I got something more readable in Python.


Best,

--

Raphael


On 07/08/2022 19:30, C. Scott Andreas wrote:

> but still, as I understand the documentation, read repair should not be in the blocking path of a query?

Read repair is in the blocking read path for the query, yep. At quorum consistency levels, the read repair must complete before returning a result to the client to ensure the data returned would be visible on subsequent reads that address the remainder of the quorum.
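To make the blocking behaviour concrete, here is a toy model of what a coordinator does once replica responses disagree: it keeps the cell with the newest write timestamp (last write wins) and has to write it back to the stale replicas before answering the client. The `Cell` type, the `resolve` function and the addresses are hypothetical; real Cassandra compares digests first and reconciles per cell (including timestamp tie-breaking), which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int  # write timestamp (Cassandra conventionally uses microseconds)


def resolve(responses):
    """Return the winning cell and the sorted list of replicas needing repair.

    responses -- dict mapping a replica address to the Cell it returned
    (a hypothetical shape chosen for illustration).
    """
    # Last write wins: the cell with the highest timestamp is the answer.
    winner = max(responses.values(), key=lambda cell: cell.timestamp)
    # Every replica that returned something else must be repaired before the
    # coordinator replies; that write-back is the extra latency on this path.
    stale = sorted(addr for addr, cell in responses.items() if cell != winner)
    return winner, stale


if __name__ == "__main__":
    replies = {
        "10.0.1.10": Cell("v2", 2_000),
        "10.0.1.11": Cell("v1", 1_000),
    }
    print(resolve(replies))
```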

If you enable tracing (either for a single CQL statement that is expected to be slow, or probabilistically from the server side to catch a slow query in the act), that will help identify what’s happening.


- Scott

On Aug 7, 2022, at 10:25 AM, Raphael Mazelier <[email protected]> wrote:

Nope. And what really puzzles me is that the trace clearly shows the difference between queries. The fast queries only request a read from one replica, while the slow queries request reads from multiple replicas (and not only from replicas local to the DC).
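One way to compare fast and slow traces side by side is to reduce each trace to the set of nodes that logged events for it and flag any outside the local DC. The sketch assumes the events have already been fetched as `(source_address, activity)` pairs (for example from the `source` and `activity` columns of `system_traces.events`) and that you supply your own address-to-DC map; the function name, addresses and activity strings are hypothetical placeholders.

```python
from collections import Counter


def summarize_trace(events, dc_of, local_dc):
    """Summarize which nodes took part in one traced query.

    events   -- iterable of (source_address, activity) pairs (assumed shape)
    dc_of    -- dict mapping node address to its datacenter name
    local_dc -- the datacenter the client expects to stay in
    """
    per_node = Counter(source for source, _activity in events)
    # Addresses missing from dc_of are treated as remote, to be safe.
    remote = sorted(node for node in per_node if dc_of.get(node) != local_dc)
    return {
        "nodes": sorted(per_node),  # every node that logged an event
        "remote_nodes": remote,     # nodes outside the local DC
        "is_local_only": not remote,
    }


if __name__ == "__main__":
    dcs = {"10.0.1.10": "eu-west-1", "10.0.1.11": "eu-west-1", "10.2.1.10": "us-east-1"}
    slow_trace = [  # placeholder activities, not real trace messages
        ("10.0.1.10", "coordinator event"),
        ("10.0.1.10", "coordinator sends read"),
        ("10.2.1.10", "replica handles read"),
    ]
    print(summarize_trace(slow_trace, dcs, "eu-west-1"))
```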


On 07/08/2022 14:02, Stéphane Alleaume wrote:

Hi

Could some GC activity be affecting the coordinator node?

Kind regards
Stéphane

On Sun, 7 Aug 2022, 13:41, Raphael Mazelier <[email protected]> wrote:


    Thanks for the answer, but I was already aware of this. I use
    LOCAL_ONE as the consistency level.

    My client connects to a local seed, then chooses a local
    coordinator (as far as I can tell from the trace log).

    Then, for a batch of requests, approximately 98% are served
    in 2-3 ms in the local DC with a single read request, and 2%
    are handled by many nodes (according to the trace) and take
    much longer (250 ms).

    ?

    On 06/08/2022 14:30, Bowen Song via user wrote:


    See the diagram below. Your problem almost certainly arises
    from step 4, in which an incorrect consistency level set by the
    client caused the coordinator node to send the READ command to
    nodes in other DCs.

    The load balancing policy only affects steps 2 and 3, not
    steps 1 or 4.

    You should change the consistency level to
    LOCAL_ONE/LOCAL_QUORUM/etc. to fix the problem.

    On 05/08/2022 22:54, Bowen Song wrote:

    The DCAwareRoundRobinPolicy/TokenAwareHostPolicy controls
    which Cassandra coordinator node the client sends queries to,
    not the nodes it connects to, nor the nodes that perform the
    actual read.

    A client sends a CQL read query to a coordinator node, and the
    coordinator node parses the CQL query and sends READ requests
    to other nodes in the cluster based on the consistency level.

    Have you checked the consistency level of the session (and the
    query if applicable)? Is it prefixed with "LOCAL_"? If not,
    the coordinator will send the READ requests to non-local DCs.
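The arithmetic behind this is small enough to write down. The sketch below computes how many replica responses the coordinator waits for at a few consistency levels, and whether those replicas are restricted to the local DC, using the usual quorum formula floor(RF / 2) + 1. The function is a hypothetical illustration and only models ONE, LOCAL_ONE, QUORUM and LOCAL_QUORUM.

```python
def block_for(consistency, rf_per_dc, local_dc):
    """Return (responses_needed, local_only) for a read at this consistency level.

    rf_per_dc -- maps DC name to the keyspace's replication factor there
    local_dc  -- the coordinator's datacenter
    """
    total_rf = sum(rf_per_dc.values())
    local_rf = rf_per_dc[local_dc]
    if consistency == "ONE":
        # One replica from ANY DC; the snitch may rank a remote one as closest.
        return 1, False
    if consistency == "LOCAL_ONE":
        # One replica, restricted to the coordinator's DC.
        return 1, True
    if consistency == "QUORUM":
        # Majority of ALL replicas across every DC.
        return total_rf // 2 + 1, False
    if consistency == "LOCAL_QUORUM":
        # Majority of the replicas in the coordinator's DC only.
        return local_rf // 2 + 1, True
    raise ValueError(f"consistency level not modelled here: {consistency}")


if __name__ == "__main__":
    # Keyspace from this thread: RF 2 in each of three DCs.
    rf = {"ap-southeast-1": 2, "eu-west-1": 2, "us-east-1": 2}
    for cl in ("ONE", "LOCAL_ONE", "QUORUM", "LOCAL_QUORUM"):
        print(cl, block_for(cl, rf, "eu-west-1"))
```

With that keyspace, QUORUM needs 4 of the 6 replicas, so at least 2 responses must come from outside the local DC, while LOCAL_QUORUM needs only the 2 local replicas.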


    On 05/08/2022 19:40, Raphael Mazelier wrote:


    Hi Cassandra Users,

    I'm relatively new to Cassandra, and first I have to say I'm
    really impressed by the technology.

    Good design, and a lot of material for understanding the
    internals (the O'Reilly book helps a lot, as do The Last
    Pickle blog posts).

    I have a multi-datacenter C* cluster (US, Europe, Singapore)
    with eight nodes in each (two seeds per region), two racks
    in EU and Singapore, three in US. Everything is deployed in AWS.

    We have a keyspace configured with NetworkTopologyStrategy and
    two replicas in every region, like this: {'class':
    'NetworkTopologyStrategy', 'ap-southeast-1': '2',
    'eu-west-1': '2', 'us-east-1': '2'}


    While investigating a performance issue, I noticed strange
    things in my experiments:

    What we expect is very low latency, 3-5 ms max, for this
    specific select query. So we want every read to be local to
    each datacenter.

    We configure DCAwareRoundRobinPolicy(local_dc=DC) in Python,
    and the equivalent in Go:
    gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy("DC"))
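As I understand these policies, they only decide the order in which the driver tries coordinators: token awareness puts the partition's replicas first, and DC awareness keeps to local-DC hosts unless remote hosts are explicitly allowed (in the Python driver that knob is `used_hosts_per_remote_dc` on `DCAwareRoundRobinPolicy`). The sketch below reproduces that ordering on plain data; it is not the driver's code, and `query_plan`, its `remote_hosts_per_dc` parameter, the host list and the replica set are all hypothetical.

```python
def query_plan(hosts, replicas, local_dc, start=0, remote_hosts_per_dc=0):
    """Order candidate coordinators like a token-aware, DC-aware policy.

    hosts    -- list of (address, dc) pairs the driver knows about
    replicas -- set of addresses owning the partition being queried
    start    -- round-robin offset (real policies advance it per query)
    remote_hosts_per_dc -- how many hosts per remote DC may be tried (0 = none)
    """
    local = [addr for addr, dc in hosts if dc == local_dc]
    if local:
        # Round-robin: rotate the local host list by the offset.
        k = start % len(local)
        local = local[k:] + local[:k]
    # Token awareness: local replicas first, then the remaining local hosts.
    plan = [a for a in local if a in replicas]
    plan += [a for a in local if a not in replicas]
    # Remote DCs are appended only when explicitly allowed.
    for dc in sorted({d for _, d in hosts if d != local_dc}):
        plan += [addr for addr, d in hosts if d == dc][:remote_hosts_per_dc]
    return plan


if __name__ == "__main__":
    hosts = [
        ("10.0.1.10", "eu-west-1"),
        ("10.0.1.11", "eu-west-1"),
        ("10.0.1.12", "eu-west-1"),
        ("10.2.1.10", "us-east-1"),
    ]
    replicas = {"10.0.1.12", "10.2.1.10"}  # hypothetical token ownership
    print(query_plan(hosts, replicas, "eu-west-1"))
```

Whichever host ends up first only acts as the coordinator; which replicas it then reads from is governed by the consistency level, as Bowen explains earlier in the thread.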

    Testing a bit with two short programs (I can provide them) in
    Go and Python, I noticed very strange results. Basically, I run
    the same query over and over with a very limited set of ids.

    The first results were surprising because the very first query
    always took more than 250 ms; afterwards, by stressing C*
    (playing with the sleep between queries), I could achieve a
    good ratio of queries at 3-4 ms (what I expected).
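For iterating quickly without cqlsh tracing, a small timing harness that separates the fast majority from the slow tail is enough to see a split like the 98%/2% one reported in this thread. This is a sketch: `run_query` stands for whatever executes the SELECT (for instance a call to `session.execute` in the Python driver), and the 50 ms "slow" threshold is an arbitrary assumption.

```python
import statistics
import time


def time_calls(run_query, n):
    """Call run_query() n times and return each latency in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms, slow_threshold_ms=50.0):
    """Return median, mean, nearest-rank p99 and the share of slow calls."""
    if not latencies_ms:
        raise ValueError("no latencies to summarize")
    ordered = sorted(latencies_ms)
    # Nearest-rank p99 with integer math: rank = ceil(0.99 * n), 1-based.
    rank = (99 * len(ordered) + 99) // 100
    slow = sum(1 for x in ordered if x > slow_threshold_ms)
    return {
        "p50": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "p99": ordered[rank - 1],
        "slow_fraction": slow / len(ordered),
    }


if __name__ == "__main__":
    # Stand-in for a real query; in practice something like
    # lambda: session.execute(stmt, [key]) (stmt and key are hypothetical).
    print(summarize(time_calls(lambda: time.sleep(0.002), 20)))
```

With 98 calls at 2.5 ms and 2 at 250 ms, the median stays at 2.5 ms while the mean triples to about 7.45 ms and p99 lands on 250 ms, which is why a percentile view exposes this kind of tail better than an average.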

    My guess was that the long queries were somehow not executed
    locally (or at least involved multi-datacenter requests) and
    the short ones were.

    Activating tracing in my program (like enabling tracing in
    cqlsh) kind of confirmed my suspicion.

    (I will provide the traces as an attachment.)

    My question is: why does C* sometimes read non-locally? How can
    we disable it? What are the criteria for this?

    (By the way, I'm really not a fan of this multi-region design,
    given these very specific kinds of issues...)

    Also, a side question: why is C* so slow at connecting? It's
    as if it's trying to reach every node in each DC (we only
    provide local seeds, however). Sometimes it takes more than
    20 s...

    Any help appreciated.

    Best,

--

    Raphael Mazelier
