[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078265#comment-17078265 ]

Benedict Elliott Smith commented on CASSANDRA-15700:
----------------------------------------------------

bq. on what ground you say this is an isolated problem?

This appears to be an issue of communication: the word isolated refers to the 
problem _code_.  Please be assured that I consider this a _serious_ problem; I 
am simply not concerned about how quickly we resolve it.  There is no need to 
panic and rush a fix.

bq. I’d still like to understand how the pruning approach we're discussing here 
is important to the control flow semantics at all

In the words of Dirk Gently: everything's connected :)  Your prior alternative 
proposal to replace the existing semantics involved two distinct changes to 
control flow, namely introducing a hash timer wheel (something I’m in favour of 
generally, but demonstrably a control flow change, and preferable to defer 
until 5.0) and eliminating the expiry on enqueue.  I just consider these kinds 
of change to be riskier at this stage.
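
For context on the timer wheel option: a hashed timer wheel buckets deadlines 
into coarse time slots, so expiry only scans the slots that have elapsed rather 
than every pending item.  The following is a minimal, hypothetical sketch of 
the idea (all names are illustrative; this is not Cassandra code):

{noformat}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Minimal, hypothetical hashed timer wheel: deadlines hash into coarse buckets,
// so expiry only scans the buckets whose tick has elapsed rather than every
// pending message. Illustrative only; not Cassandra code.
final class SimpleTimerWheel<T>
{
    private static final class Entry<T>
    {
        final long deadlineNanos;
        final T item;
        Entry(long deadlineNanos, T item) { this.deadlineNanos = deadlineNanos; this.item = item; }
    }

    private final long tickNanos;            // width of one bucket
    private final Queue<Entry<T>>[] buckets;
    private long nextTickToExpire;           // first tick not yet expired

    @SuppressWarnings("unchecked")
    SimpleTimerWheel(int bucketCount, long tickNanos, long nowNanos)
    {
        this.tickNanos = tickNanos;
        this.buckets = new Queue[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            buckets[i] = new ArrayDeque<>();
        this.nextTickToExpire = nowNanos / tickNanos;
    }

    // O(1): drop the entry into the bucket its deadline hashes to.
    void schedule(T item, long deadlineNanos)
    {
        int index = (int) ((deadlineNanos / tickNanos) % buckets.length);
        buckets[index].add(new Entry<>(deadlineNanos, item));
    }

    // Called periodically from the event loop; scans only the elapsed buckets.
    void expire(long nowNanos, Consumer<T> onExpired)
    {
        long nowTick = nowNanos / tickNanos;
        for (; nextTickToExpire <= nowTick; nextTickToExpire++)
        {
            Queue<Entry<T>> bucket = buckets[(int) (nextTickToExpire % buckets.length)];
            for (int remaining = bucket.size(); remaining > 0; remaining--)
            {
                Entry<T> e = bucket.poll();
                if (e.deadlineNanos <= nowNanos)
                    onExpired.accept(e.item);  // due now
                else
                    bucket.add(e);             // belongs to a later wheel revolution
            }
        }
    }
}
{noformat}

The relevant point here is only that this is a genuinely different control flow 
from scanning a single queue, not a drop-in tweak to the existing one.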

bq. The collapsed stacks clearly show most time is spent by pruning itself, 
that is by iterating the queue, rather than by expiring messages

I believe it shows as much as 10% of the time being spent in real expirations?  That is not 
insignificant, and given how relatively cheap evaluating an expiration is, it 
_may_ well be the case that the algorithmic inefficiency we are discussing is 
incidental to the behaviour.

bq. to make 4.0 behave as similarly as possible to 3.11

This still leaves a significant gap: 4.0 will use the local node's message 
arrival time to determine the timeout for its response, so there is still 
plenty of scope for messages to expire ahead of 3.x.

bq. No, I meant to compute the "next expire time" as an approximation of the 
expire time of the processed messages, rather than relying on exactly computing 
it via the pruner at every event loop delivery run.
bq. We can keep the enqueue() pruning, as that's not the worst offender (see 
collapsed stacks).

The enqueue pruning is cheap because we compute the exact minimum expiration 
time, so it is invoked infrequently; if we only guess at this number, we offer 
no guarantee about the balance of new messages dropped in favour of expired 
messages.
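
Concretely, the property being relied on is roughly the following (a 
simplified, hypothetical sketch, not the actual OutboundMessageQueue code): 
because the earliest deadline is tracked exactly, enqueue() only pays for a 
prune when at least one message is known to have expired.

{noformat}
import java.util.ArrayDeque;
import java.util.Iterator;

// Simplified, hypothetical sketch of minimum-expiration-gated pruning; names
// are illustrative and this is not the actual OutboundMessageQueue code.
final class ExpiringQueue<T>
{
    private static final class Entry<T>
    {
        final long expiresAtNanos;
        final T payload;
        Entry(long expiresAtNanos, T payload) { this.expiresAtNanos = expiresAtNanos; this.payload = payload; }
    }

    private final ArrayDeque<Entry<T>> queue = new ArrayDeque<>();
    private long earliestExpiresAtNanos = Long.MAX_VALUE;

    void enqueue(T payload, long expiresAtNanos, long nowNanos)
    {
        // Common case: nothing can have expired yet, so this is just a
        // comparison, an append, and a min().
        if (nowNanos >= earliestExpiresAtNanos)
            prune(nowNanos);

        queue.add(new Entry<>(expiresAtNanos, payload));
        earliestExpiresAtNanos = Math.min(earliestExpiresAtNanos, expiresAtNanos);
    }

    private void prune(long nowNanos)
    {
        long newEarliest = Long.MAX_VALUE;
        for (Iterator<Entry<T>> it = queue.iterator(); it.hasNext();)
        {
            Entry<T> e = it.next();
            if (e.expiresAtNanos <= nowNanos)
                it.remove();                              // expired: drop it
            else
                newEarliest = Math.min(newEarliest, e.expiresAtNanos);
        }
        // Recomputed exactly; guessing here would lose the guarantee that a
        // prune only runs when it can reclaim something.
        earliestExpiresAtNanos = newEarliest;
    }
}
{noformat}

Replace the exact minimum with a guess and the gate may fire when nothing can 
be reclaimed, or fail to fire while expired messages accumulate ahead of new 
ones.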

bq. so why adding an additional mechanism and 4th config property (your 20% 
threshold)

Is this really an additional mechanism?  We already have a mechanism; we just 
pick our number differently.  Guessing at a number is also a mechanism, surely? 
How would we configure the guess algorithm, and why wouldn't we expose its 
parameters?  I had assumed we would not make this configurable, in the same way 
we would not make the assumptions of any guess algorithm configurable, since 
its purpose is simply to guarantee algorithmic complexity and to bound how far 
from our memory limits we permit expired messages to be preferred over 
unexpired messages.
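
To illustrate the kind of bound I mean (the 20% figure and all names below are 
purely illustrative, not an existing or proposed configuration property):

{noformat}
// Hypothetical illustration of a fixed-fraction bound; the 0.2 value and the
// class/field names are illustrative only, not an actual Cassandra config.
final class CapacityGate
{
    private static final double PRUNE_HEADROOM_FRACTION = 0.2;

    private final long capacityLimitBytes;
    private long queuedBytes;

    CapacityGate(long capacityLimitBytes)
    {
        this.capacityLimitBytes = capacityLimitBytes;
    }

    // Called before admitting a new message: returns true when the queue would
    // come within PRUNE_HEADROOM_FRACTION of the limit, so a prune can reclaim
    // space held by already-expired messages before they displace unexpired ones.
    boolean shouldPruneBeforeEnqueue(long messageSizeBytes)
    {
        long threshold = (long) (capacityLimitBytes * (1.0 - PRUNE_HEADROOM_FRACTION));
        return queuedBytes + messageSizeBytes > threshold;
    }

    void onEnqueued(long messageSizeBytes) { queuedBytes += messageSizeBytes; }
    void onRemoved(long messageSizeBytes)  { queuedBytes -= messageSizeBytes; }
}
{noformat}

Its only purpose is to bound worst-case behaviour, which is why I would not 
expose it, any more than we would expose the assumptions baked into a guess 
algorithm.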

Honestly, though, it sounds like there is not much between our proposals now, 
which is reassuring, so perhaps we should focus on the common ground we have.  
However, I am now on leave, so I would appreciate it if you could be patient 
until I return to continue this discussion.  I would also like to take another 
close look at the existing behaviour, as the pruning is closely related to the 
migration of new records from the MPSC queue to the internal queue (sketched 
roughly below).  I have been trying to respond promptly to your queries, but I 
feel that in doing so my responses have not been sufficiently well considered, 
and I would prefer to take the time to produce a complete and coherent view and 
proposal.  Is that acceptable to you?
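
For anyone following along, the relationship between the two queues is roughly 
the following shape (a simplified, hypothetical sketch, not the actual 
OutboundMessageQueue code): producers append to a concurrent, MPSC-style 
external queue, and the delivery thread migrates those records into an internal 
queue it owns, pruning expired messages as part of the same pass.

{noformat}
import java.util.ArrayDeque;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Simplified, hypothetical sketch of the two-queue structure under discussion;
// names are illustrative and this is not the actual Cassandra code.
final class TwoStageQueue<T>
{
    interface Expiry<T> { boolean isExpired(T item, long nowNanos); }

    // Producer side: many threads enqueue here without coordination.
    private final ConcurrentLinkedQueue<T> external = new ConcurrentLinkedQueue<>();
    // Consumer side: owned by the delivery thread only.
    private final ArrayDeque<T> internal = new ArrayDeque<>();

    void enqueue(T item)
    {
        external.add(item);
    }

    // Run on the delivery thread: migrate new records from the external queue
    // into the internal queue, dropping anything that has already expired.
    void drainAndPrune(long nowNanos, Expiry<T> expiry, Consumer<T> onExpired)
    {
        for (T item; (item = external.poll()) != null; )
        {
            if (expiry.isExpired(item, nowNanos))
                onExpired.accept(item);
            else
                internal.add(item);
        }
    }

    T poll()
    {
        return internal.poll();
    }
}
{noformat}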

> Performance regression on internode messaging
> ---------------------------------------------
>
>                 Key: CASSANDRA-15700
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Internode
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Normal
>             Fix For: 4.0
>
>         Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached  
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge number of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}}, which 
> spends half of its time pruning. But only a tiny portion of that pruning time 
> is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling messages during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning; otherwise, let's get rid of it.


