[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077512#comment-17077512 ]

Benedict Elliott Smith commented on CASSANDRA-15700:
----------------------------------------------------

{quote}Remove pruning and expire time tracking from enqueue.
{quote}
This would be a regression from 3.0 unfortunately, though the semantics in 3.0 
were to use the head of the list, which we could potentially restore (not 
necessarily as trivially as might be desired, but it is achievable).
{quote}Compute a weighted average of each message expire time.
{quote}
I assume you mean to maintain a guess of the number of items we _on average_ 
have in the queue? In which case I am fairly strongly -1 on this category of 
approach. One of the features running through this work is that we make strong 
guarantees, and this would weaken that.  I would prefer to be able to stipulate 
that (e.g.) we never reject an {{enqueue}} if more than 20% of the queue is 
already expired.
{quote}I would personally opt for a fix which doesn't increase overall 
complexity
{quote}
Complexity is in the eye of the beholder :)  I think the main goal should be 
minimising risk to delivery. The stability of this patch has been tested 
_fairly_ extensively and has so far held up, so messing with important control 
flow semantics is something I would prefer to avoid.

I think the least risky approach is to change how we compute and select the 
expiration time we use for triggering an expiration. This also has the benefit 
of maintaining well-defined guarantees, is simple, and modifies no behaviour 
besides the selection of this value. For instance, we could pick an expiration 
time that we expect to cover a certain amount of data, then simply track 
_accurately_ on the delivery thread that it really does; once it doesn't, we 
prune and pick a better value.

More specifically (e.g.), you pick a time {{t}} such that you expect the set 
of messages with expiration less than {{t}}, say {{M(<t)}}, to collectively 
occupy no more than 20% of the connection's data or item limit. As you 
introduce new messages, you simply record the total amount of data represented 
by {{M(<t)}}, and if it crosses the threshold, you recompute, picking a new 
{{t}}. Assuming a random distribution of expiration times this should be 
amortised constant time: whichever {{t}} we pick first, a new {{t}} will need 
recomputing only {{O(1/n)}} of the time, at {{O(n)}} cost per recomputation, 
i.e. {{O(1)}} amortised per enqueue.
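
To make the shape of this concrete, a rough sketch of the bookkeeping follows. 
The class, field and method names here are purely illustrative assumptions, 
not the actual {{OutboundMessageQueue}}/{{OutboundConnection}} API:

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: class, field and method names are hypothetical,
// not the real internode messaging classes.
final class ExpirationThresholdTracker
{
    // Hypothetical stand-in for a queued message: expiration time plus serialized size.
    static final class QueuedMessage
    {
        final long expiresAtNanos;
        final int sizeBytes;

        QueuedMessage(long expiresAtNanos, int sizeBytes)
        {
            this.expiresAtNanos = expiresAtNanos;
            this.sizeBytes = sizeBytes;
        }
    }

    private static final double FRACTION = 0.2; // at most 20% of capacity may expire before t

    private final long capacityBytes;              // the connection's data limit
    private long thresholdNanos = Long.MAX_VALUE;  // the chosen t
    private long bytesBelowThreshold = 0;          // running total size of M(<t)

    ExpirationThresholdTracker(long capacityBytes)
    {
        this.capacityBytes = capacityBytes;
    }

    // Called on enqueue: cheap bookkeeping only; returns true when t should be recomputed.
    boolean onEnqueue(QueuedMessage message)
    {
        if (message.expiresAtNanos < thresholdNanos)
            bytesBelowThreshold += message.sizeBytes;
        return bytesBelowThreshold > FRACTION * capacityBytes;
    }

    // Recompute t from a snapshot of the queue, e.g. on the delivery thread,
    // piggy-backed on a prune. O(n log n) here for clarity; the cost is only
    // paid when the threshold is breached, which under a roughly uniform
    // distribution of expirations happens O(1/n) of the time per enqueue.
    void recompute(List<QueuedMessage> snapshot)
    {
        List<QueuedMessage> byExpiration = new ArrayList<>(snapshot);
        byExpiration.sort(Comparator.comparingLong((QueuedMessage m) -> m.expiresAtNanos));

        long budget = (long) (FRACTION * capacityBytes);
        long accumulated = 0;
        long newThreshold = Long.MAX_VALUE;
        for (QueuedMessage m : byExpiration)
        {
            if (accumulated + m.sizeBytes > budget)
            {
                // Everything expiring before this message fits within the budget.
                newThreshold = m.expiresAtNanos;
                break;
            }
            accumulated += m.sizeBytes;
        }
        thresholdNanos = newThreshold;
        bytesBelowThreshold = accumulated;
    }
}
{code}

If the budget is never breached, {{onEnqueue}} stays a couple of comparisons 
and an addition; the {{O(n)}} recompute only runs when {{M(<t)}} has actually 
grown past 20% of the limit, which is exactly the well-defined guarantee 
described above.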

This has the advantage of changing none of the important behaviours, so it 
requires only isolated analysis of this particular calculation, and the 
calculation itself is simple.

FWIW, we should also exclude the possibility that there really are more 
legitimate expirations occurring. The new patch properly enforces timeouts for 
an operation _and its responses_, so we may also be seeing different behaviour 
because we now enforce these timeouts accurately.  3.0 was much more lax in 
this regard, and there is a real chance that this is involved, given the 
presence of real expirations.  It _might_ be that the time spent pruning is 
immaterial compared to the impact of this behavioural change on progress; 
previously we may have been happily serving operations whose timeout had 
actually already elapsed (for instance). Unfortunately I don't know enough 
about the workload you are running to say much with confidence. Either way, 
this is a good opportunity to patch this known weakness, but further 
investigation might be warranted to isolate the possibly multiple causes of 
the aggregate behaviour you are seeing.
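
For clarity on what "an operation _and its responses_" means here, a minimal 
illustration (purely hypothetical names, not the actual {{Message}}/{{Verb}} 
API): the request and its response share a single deadline derived from the 
original operation, so a response is dropped once that original deadline has 
passed, rather than being given a fresh timeout of its own.

{code:java}
// Hypothetical sketch: a request and its response share one deadline, so the
// response expires once the original operation's timeout has elapsed.
final class SharedDeadline
{
    final long createdAtNanos;   // when the operation started
    final long expiresAtNanos;   // createdAt + the operation's timeout

    SharedDeadline(long createdAtNanos, long timeoutNanos)
    {
        this.createdAtNanos = createdAtNanos;
        this.expiresAtNanos = createdAtNanos + timeoutNanos;
    }

    // Used for both the outbound request and its response: neither is worth
    // sending or processing once the original deadline has passed.
    boolean hasExpired(long nowNanos)
    {
        return nowNanos >= expiresAtNanos;
    }
}
{code}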

I won't be able to participate in this discussion for a week or two, but I 
expect [~aleksey] will get involved soon. Either way, I don't think there is 
any need to solve this on such a short time horizon.

> Performance regression on internode messaging
> ---------------------------------------------
>
>                 Key: CASSANDRA-15700
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Internode
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Normal
>             Fix For: 4.0
>
>         Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached  
> system log), while 3.11 has almost none.
> 2) An abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge amount of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which will end up being 
> expired. This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}}, which 
> spends half of its time pruning. But only a tiny portion of that pruning 
> time is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling the message during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning; otherwise, let's get rid of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
