[ https://issues.apache.org/jira/browse/CASSANDRA-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077309#comment-17077309 ]

Sergio Bossa commented on CASSANDRA-15700:
------------------------------------------

{quote}Delivery will only be scheduled at enqueue if it is not already 
processing outstanding work, and though iirc the enqueue operation will itself 
forcefully prune in some circumstances, there are anyway no guarantees that new 
messages will be arriving to this connection.
{quote}
Correct.
{quote}Delivery will only reschedule itself once the outbound TCP channel has 
room, and so in a network outage that does not terminate the connection or 
stalled recipient process (for example), where it is making no (or very slow) 
progress, it will not be woken up to process further pending messages.
{quote}
Isn't this a problem regardless, given the current implementation? That is, we 
currently prune upon:

1) Receiving a new message.

2) Finishing processing messages in an event loop run.

As we said, we have no guarantee we'll keep receiving messages, which rules 
out #1, and if the event loop is stalled (e.g. due to network issues) we 
won't run the message processing loop anyway, which rules out #2 and leaves 
us with the same problem of expired messages sitting in the queue, as 
sketched below.
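
A sketch of why both triggers go quiet together during a stall (illustrative 
structure only, not the actual {{OutboundConnection}} code):

{code:java}
final class PruneTriggers
{
    void enqueue(Object message)
    {
        // trigger #1: fires only while new messages keep arriving
        pruneExpired(System.nanoTime());
        // ... add to the queue, maybe schedule delivery ...
    }

    void onEventLoopRunFinished()
    {
        // trigger #2: fires only when a delivery run completes;
        // a stalled channel never reaches this point
        pruneExpired(System.nanoTime());
    }

    private void pruneExpired(long nowNanos)
    {
        // drop every queued message whose expiry deadline is <= nowNanos
    }
}
{code}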

If this really is the problem we want to solve (a bit of an edge case, but 
fine), something as simple as scheduling a background prune at the earliest 
expiry deadline before "stalling" (i.e. before marking the connection 
unwritable) would do it, possibly using a Hashed Wheel Timer, because the 
Netty scheduler sucks a bit at scheduled tasks; see the sketch below.
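
Here's a minimal sketch of that idea, assuming hypothetical 
{{earliestExpiresAtNanos()}} and {{pruneExpired()}} hooks on the queue (only 
{{HashedWheelTimer}} is actual Netty API, the rest is illustrative):

{code:java}
import java.util.concurrent.TimeUnit;

import io.netty.util.HashedWheelTimer;
import io.netty.util.Timeout;
import io.netty.util.Timer;

public class StalledConnectionPruner
{
    // Coarse 100ms ticks are fine: expiry deadlines don't need tight precision.
    private final Timer wheel = new HashedWheelTimer(100, TimeUnit.MILLISECONDS);

    /**
     * To be called just before marking the connection unwritable: schedules a
     * one-shot prune at the earliest expiry deadline, so expired messages get
     * reclaimed even if no enqueue or event loop run ever triggers a prune.
     */
    public Timeout scheduleBackgroundPrune(OutboundQueue queue)
    {
        long delayNanos = Math.max(0, queue.earliestExpiresAtNanos() - System.nanoTime());
        return wheel.newTimeout(timeout -> queue.pruneExpired(System.nanoTime()),
                                delayNanos, TimeUnit.NANOSECONDS);
    }

    // Hypothetical queue hooks, not the actual OutboundMessageQueue API.
    public interface OutboundQueue
    {
        long earliestExpiresAtNanos(); // deadline of the soonest-to-expire message
        void pruneExpired(long nowNanos); // drop everything past its expiry
    }
}
{code}

The returned {{Timeout}} can simply be cancelled if the connection becomes 
writable again before the deadline fires.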

> Performance regression on internode messaging
> ---------------------------------------------
>
>                 Key: CASSANDRA-15700
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15700
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Internode
>            Reporter: Sergio Bossa
>            Assignee: Sergio Bossa
>            Priority: Normal
>             Fix For: 4.0
>
>         Attachments: Oss40vsOss311.png, oss40.gc, oss40_nogc.tar.xz, 
> oss40_system.log
>
>
> [~jasonstack] and I have been investigating a performance regression 
> affecting 4.0 during a 3-node, RF 3 write throughput test with a 
> timeseries-like workload, as shown in this plot, where blue is 3.11 and 
> orange is 4.0:
> !Oss40vsOss311.png|width=389,height=214!
>  It's been a bit of a long investigation, but two clues ended up standing out:
> 1) An abnormal number of expired messages on 4.0 (as shown in the attached 
> system log), while 3.11 has almost none.
> 2) An abnormal GC activity (as shown in the attached gc log).
> Turns out the two are related, as the [on expired 
> callback|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundConnection.java#L462]
>  creates a huge number of strings in the {{id()}} call. The next question is 
> what causes all those message expirations; we thoroughly reviewed the 
> internode messaging code and the only issue we could find so far is related 
> to the "batch pruning" calls 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L81]
>  and 
> [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundMessageQueue.java#L188]:
>  it _seems_ too much time is spent on those, causing the event loop to fall 
> behind in processing the rest of the messages, which then end up expiring. 
> This is supported by the analysis of the collapsed stacks (after 
> fixing the GC issue):
> {noformat}
> (tprint (top-aggregated-calls oss40nogc "EventLoopDelivery:doRun" 5))
> org/apache/cassandra/net/OutboundConnection$EventLoopDelivery:doRun 3456
> org/apache/cassandra/net/OutboundMessageQueue:access$600 1621
> org/apache/cassandra/net/PrunableArrayQueue:prune 1621
> org/apache/cassandra/net/OutboundMessageQueue$WithLock:close 1621
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1620
> {noformat}
> Those are the top 5 sampled calls from {{EventLoopDelivery#doRun()}}, which 
> spends half of its time pruning. But only a tiny portion of that pruning 
> time is spent actually expiring:
> {noformat}
> (tprint (top-aggregated-calls oss40nogc 
> "OutboundMessageQueue:pruneInternalQueueWithLock" 5))
> org/apache/cassandra/net/OutboundMessageQueue:pruneInternalQueueWithLock 1900
> org/apache/cassandra/net/PrunableArrayQueue:prune 1894
> org/apache/cassandra/net/OutboundMessageQueue$1Pruner:onPruned 147
> org/apache/cassandra/net/OutboundConnection$$Lambda$444/740904487:accept 147
> org/apache/cassandra/net/OutboundConnection:onExpired 147
> {noformat}
> And indeed, the {{PrunableArrayQueue:prune()}} self time is dominant:
> {noformat}
> (tprint (top-self-calls oss40nogc "PrunableArrayQueue:prune" 5))
> org/apache/cassandra/net/PrunableArrayQueue:prune 1718
> org/apache/cassandra/net/OutboundConnection:releaseCapacity 27
> java/util/concurrent/ConcurrentHashMap:replaceNode 19
> java/util/concurrent/ConcurrentLinkedQueue:offer 16
> java/util/concurrent/LinkedBlockingQueue:offer 15
> {noformat}
> That said, before proceeding with a PR to fix those issues, I'd like to 
> understand: what's the reason to prune so often, rather than just when 
> polling messages during delivery? If there's a reason I'm missing, let's 
> talk about how to optimize pruning; otherwise, let's get rid of it.
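>
> For reference, a minimal sketch of that poll-time alternative (illustrative 
> names and shape, not the actual {{OutboundMessageQueue}} API): expired 
> messages are discarded lazily as delivery polls them, so no separate 
> batch-prune pass competes with the event loop.
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Queue;
> import java.util.function.Consumer;
> import java.util.function.ToLongFunction;
>
> // Lazy pruning: expired messages are dropped one at a time at poll time,
> // so delivery never pays for a separate batch-prune pass.
> final class LazyPruneQueue<M>
> {
>     private final Queue<M> queue = new ArrayDeque<>();
>     private final ToLongFunction<M> expiresAtNanos; // per-message deadline
>     private final Consumer<M> onExpired;            // expiry callback (must stay cheap)
>
>     LazyPruneQueue(ToLongFunction<M> expiresAtNanos, Consumer<M> onExpired)
>     {
>         this.expiresAtNanos = expiresAtNanos;
>         this.onExpired = onExpired;
>     }
>
>     void add(M message) { queue.add(message); }
>
>     /** Returns the next live message, discarding expired ones on the way. */
>     M poll(long nowNanos)
>     {
>         M next;
>         while ((next = queue.poll()) != null && expiresAtNanos.applyAsLong(next) <= nowNanos)
>             onExpired.accept(next);
>         return next;
>     }
> }
> {code}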


