[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078042#comment-17078042 ]

Sergio Bossa commented on CASSANDRA-15700:
------------------------------------------

{quote}This would be a regression from 3.0 unfortunately
{quote}
Oh, that's a good point, I forgot we do that in 3.0+ too. We can keep the 
{{enqueue()}} pruning, as that's not the worst offender (see the collapsed 
stacks).
{quote}I assume you mean to maintain a guess of the number of items we _on 
average_ have in the queue?
{quote}
No, I meant computing the "next expire time" as an approximation based on the 
expire times of the processed messages, rather than computing it exactly via 
the pruner at every event loop delivery run.
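Something along these lines, as a rough sketch with hypothetical names (not 
the actual {{OutboundMessageQueue}} code), just to show the shape of the idea:
{code:java}
// Rough sketch, hypothetical names: track an approximate "next expire time" from
// the messages we actually process, and only run the full pruner once the clock
// passes that deadline, instead of pruning on every delivery run.
final class ApproximateExpirySketch
{
    // updated only by the single event loop thread in this sketch, so no atomics
    private long approxNextExpiresAtNanos = Long.MAX_VALUE;

    // called for each message handed to delivery
    void onProcessed(long messageExpiresAtNanos)
    {
        approxNextExpiresAtNanos = Math.min(approxNextExpiresAtNanos, messageExpiresAtNanos);
    }

    // called once per delivery run, instead of unconditionally pruning
    boolean shouldPrune(long nowNanos)
    {
        return nowNanos >= approxNextExpiresAtNanos;
    }

    // after an actual prune, reset to the exact deadline the pruner computed
    void onPruned(long exactNextExpiresAtNanos)
    {
        approxNextExpiresAtNanos = exactNextExpiresAtNanos;
    }
}
{code}
The point being that the pruner would only iterate the queue when the 
approximation says something has actually expired.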
{quote}One of the features running through this work is that we make strong 
guarantees, and this would weaken that.  I would prefer to be able to stipulate 
that (e.g.) we never reject an {{enqueue}} if more than 20% of the queue is 
already expired.
{quote}
How is that related to what we're discussing? Enqueuing new messages is 
regulated by memory limiting, so that's what gives us strong guarantees; also, 
we already prune the backlog when memory limits are hit.
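To be clear about what I mean by that, here's a rough illustration 
(hypothetical names, not the actual {{OutboundConnection}} / 
{{OutboundMessageQueue}} code) of the control flow as I understand it: enqueue 
is gated by the byte limits, and the backlog is pruned when those limits would 
be exceeded, independently of how often the event loop prunes.
{code:java}
// Rough illustration, hypothetical names: enqueue is bounded by a byte limit,
// and expired messages are pruned when that limit would be exceeded.
final class LimitedQueueSketch
{
    private final java.util.ArrayDeque<Message> queue = new java.util.ArrayDeque<>();
    private final long capacityBytes; // analogous to the send queue byte limits
    private long queuedBytes;

    LimitedQueueSketch(long capacityBytes)
    {
        this.capacityBytes = capacityBytes;
    }

    synchronized boolean enqueue(Message m, long nowNanos)
    {
        if (queuedBytes + m.sizeBytes > capacityBytes)
            pruneExpired(nowNanos); // reclaim space from expired messages first
        if (queuedBytes + m.sizeBytes > capacityBytes)
            return false;           // still over the limit: reject/fail the message
        queue.addLast(m);
        queuedBytes += m.sizeBytes;
        return true;
    }

    private void pruneExpired(long nowNanos)
    {
        queue.removeIf(m -> {
            boolean expired = m.expiresAtNanos <= nowNanos;
            if (expired)
                queuedBytes -= m.sizeBytes;
            return expired;
        });
    }

    static final class Message
    {
        final long sizeBytes;
        final long expiresAtNanos;

        Message(long sizeBytes, long expiresAtNanos)
        {
            this.sizeBytes = sizeBytes;
            this.expiresAtNanos = expiresAtNanos;
        }
    }
}
{code}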
{quote}messing with important control flow semantics is something I would 
prefer to avoid.
{quote}
I'd still like to understand how the pruning approach we're discussing here 
matters to the control flow semantics at all, as I don't think I've gotten a 
clear answer yet (though I might be missing the point). What I've 
heard/understood so far is:

1) It protects us against saturating memory upon network stalls.

2) It protects us against saturating memory upon too many expired messages.

AFAIU, neither of those holds: the current implementation doesn't actually 
satisfy #1, and #2 is already covered by the memory limits implementation.
{quote}I think the least risky approach is to change how we compute and select 
the expiration time we use for triggering an expiration. This also has the 
benefit of maintaining well-defined guarantees, is simple, and modifies no 
behaviours besides the selection of this value.
{quote}
This is something that can definitely be tried first, to reduce the number of 
pruner runs. I will test this next.
{quote}More specifically (e.g.), you pick a time {{t}}, such that you expect 
the set of messages with expiration less than {{t}}, say {{M(<t)}}, should 
collectively occupy no more than 20% of the connection's data or item limit.
{quote}
This is an example of borderline unnecessary complexity in my book :) The 
reason is, memory limits are already implemented via a separate mechanism and 
controlled by 3 different config properties 
({{internode_application_send_queue_*}}), so why add an additional mechanism 
and a 4th config property (your 20% threshold)? In other words, my definition 
of complexity is any additional piece of code or configuration that isn't 
justified by an actual use case or doesn't solve an actual problem, and I can 
hardly see what problem we're trying to solve with that. If I'm missing 
anything, I'm happy to hear further explanations.
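For reference, these are the knobs I'm referring to; names and defaults below 
are quoted from memory from the 4.0 {{cassandra.yaml}}, so double-check them:
{noformat}
internode_application_send_queue_capacity_in_bytes: 4194304                      # 4MiB per connection
internode_application_send_queue_reserve_endpoint_capacity_in_bytes: 134217728   # 128MiB per endpoint
internode_application_send_queue_reserve_global_capacity_in_bytes: 536870912     # 512MiB node-wide
{noformat}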
{quote}FWIW, we should also exclude the possibility that there really are more 
legitimate expirations occurring. The new patch properly enforces timeouts for 
an operation _and its responses_, so we may also be seeing a different 
behaviour as we enforce these timeouts accurately.
{quote}
Good point, I wouldn't exclude that, but after a thorough review (as thorough 
as a 2-day review of a large piece of unfamiliar code can be) we leaned 
against that being the case, because:

1) We disabled cross-node timeouts, to make 4.0 behave as similarly as 
possible to 3.11, and to remove some of the complexity around computing the 
expire time based on the creation time (which, being done via cross-node time 
translations, could be suffering from subtle bugs).

2) The collapsed stacks clearly show most time is spent by the pruning itself, 
that is, by iterating the queue, rather than by expiring messages: if the time 
were spent actually expiring messages, it would show up in the stacks.
{quote}It remains a relatively isolated problem, and nothing that I would be 
concerned about causing a delay to GA.
{quote}
This honestly worries me quite a bit: on what grounds do you say this is an 
isolated problem? And on what grounds is a massive performance regression not 
a concern for our GA date?

> Performance regression on internode messaging
> ---------------------------------------------
>
>                 Key: CASSANDRA-15700
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Internode
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Normal
>             Fix For: 4.0
>
>         Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) Abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge number of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}}, which 
> spends half of its time pruning. But only a tiny portion of that pruning 
> time is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling messages during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning; otherwise, let's get rid of it.


