[jira] [Commented] (CASSANDRA-14660) Improve TokenMetaData cache populating performance for large cluster

Benedict (JIRA) Wed, 22 Aug 2018 06:01:52 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588800#comment-16588800
 ]


Benedict commented on CASSANDRA-14660:
--------------------------------------

[~wpc] thanks for the patch.  I'm going to go out on a limb and say that people 
are stretched too thin to review this in time for the 4.0 feature freeze, which 
is a shame.  That said, while I haven't taken a proper look at the patch, it 
looks fairly straightforward so I might be wrong.  

I would say there's a reasonable chance this qualifies for inclusion in 3.0+, 
as well, since this sounds like pretty bad behaviour.  In which case it may not 
be affected by the freeze.

However, while where at it, it's not clear to me why we recalculate during 
read, as opposed to on invalidation.  This might marginally slow down responses 
to ring changes, but these have no fixed schedule for notification anyway, so 
we must be robust to this, and all operations (including those entirely 
not-overlapping with the change) stopping briefly during a ring change seems 
unnecessarily burdensome.

> Improve TokenMetaData cache populating performance for large cluster
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-14660
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14660
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>         Environment: Benchmark is on MacOSX 10.13.5, 2017 MBP
>            Reporter: Pengchao Wang
>            Priority: Critical
>              Labels: Performance
>             Fix For: 3.0.x, 3.11.x, 4.x
>
>         Attachments: 14660-trunk.txt, TokenMetaDataBenchmark.java
>
>
> TokenMetaData#cachedOnlyTokenMap is a method C* used to get a consistent 
> token and topology view on coordinations without paying read lock cost. Upon 
> first read the method acquire a synchronize lock and generate a copy of major 
> token meta data structures and cached it, and upon every token meta data 
> changes(due to gossip changes), the cache get cleared and next read will 
> taking care of cache population.
> For small to medium size clusters this strategy works pretty well. But large 
> clusters can actually be suffered from the locking since cache populating is 
> much slower. On one of our largest cluster (~1000 nodes,  125k tokens, C* 
> 3.0.15)  each cache population take about 500~700ms, and during that there 
> are no requests can go through since synchronize lock was acquired. This 
> caused waves of timeouts errors when there are large amount gossip messages 
> propagating cross the cluster, such as in the case of cluster restarting.
> Base on profiling we found that the cost mostly comes from copying 
> tokenToEndpointMap. It is a SortedBiMultiValueMap made from a forward map use 
> TreeMap and a reverse map use guava TreeMultiMap. There is an optimization in 
> TreeMap helps reduce copying complexity from O(N*log(N)) to O(N) when copying 
> from already ordered data. But guava's TreeMultiMap copying missed that 
> optimization and make it ~10 times slower than it actually need to be on our 
> size of cluster.
> The patch attached to the issue replace the reverse TreeMultiMap<K, V> to a 
> vanilla TreeMap<K, TreeSet<V>> in SortedBiMultiValueMap to make sure we can 
> copy it O(N) time.
> I also attached a benchmark script (TokenMetaDataBenchmark.java), which 
> simulates a large cluster then measures average latency for TokenMetaData 
> cache populating.
> Benchmark result before and after that patch:
> {code:java}
> trunk: 
> before 100ms, after 13ms
> 3.0.x: 
> before 199ms, after 15ms
>  {code}
> (On 3.0.x even the forward TreeMap copying is slow, the O(N*log(N)) to O(N) 
> optimization is not applied because the key comparator is dynamically created 
> and TreeMap cannot determine the source and dest are in same order)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Commented] (CASSANDRA-14660) Improve TokenMetaData cache populating performance for large cluster

Reply via email to