[ https://issues.apache.org/jira/browse/CASSANDRA-14660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590593#comment-16590593 ]
Pengchao Wang commented on CASSANDRA-14660: ------------------------------------------- {quote}So a super simple quick fix would be to delete {{SortedBiMultiValMap}} and move its static methods to {{BiMultiValMap}} but called e.g. {{createWithSortedKeys}}. In this method we could provide a forward {{TreeMap}} and reverse {{HashMultimap}}. {quote} In HasMultimap the value collection is a HashSet, which probably will not work for us. E.g getting tokens for a specific endpoints will not give us sorted token anymore. {code:java} Set<Token> tokens = tokenToEndpointMap.inverse().get(endpoint); {code} {quote}If we were to use a map based on our BTree, we could eliminate the distinction between {{sortedTokens}} and this map, since we can treat a {{BTreeSet}} as a {{List}}, entirely eliminating the currently experienced pauses and all copying. {quote} The BTree based immutable map idea sounds very interesting. I can explore that options when I got some time. > Improve TokenMetaData cache populating performance for large cluster > -------------------------------------------------------------------- > > Key: CASSANDRA-14660 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14660 > Project: Cassandra > Issue Type: Improvement > Components: Coordination > Environment: Benchmark is on MacOSX 10.13.5, 2017 MBP > Reporter: Pengchao Wang > Priority: Critical > Labels: Performance > Fix For: 3.0.x, 3.11.x, 4.x > > Attachments: 14660-trunk.txt, TokenMetaDataBenchmark.java > > > TokenMetaData#cachedOnlyTokenMap is a method C* used to get a consistent > token and topology view on coordinations without paying read lock cost. Upon > first read the method acquire a synchronize lock and generate a copy of major > token meta data structures and cached it, and upon every token meta data > changes(due to gossip changes), the cache get cleared and next read will > taking care of cache population. > For small to medium size clusters this strategy works pretty well. But large > clusters can actually be suffered from the locking since cache populating is > much slower. On one of our largest cluster (~1000 nodes, 125k tokens, C* > 3.0.15) each cache population take about 500~700ms, and during that there > are no requests can go through since synchronize lock was acquired. This > caused waves of timeouts errors when there are large amount gossip messages > propagating cross the cluster, such as in the case of cluster restarting. > Base on profiling we found that the cost mostly comes from copying > tokenToEndpointMap. It is a SortedBiMultiValueMap made from a forward map use > TreeMap and a reverse map use guava TreeMultiMap. There is an optimization in > TreeMap helps reduce copying complexity from O(N*log(N)) to O(N) when copying > from already ordered data. But guava's TreeMultiMap copying missed that > optimization and make it ~10 times slower than it actually need to be on our > size of cluster. > The patch attached to the issue replace the reverse TreeMultiMap<K, V> to a > vanilla TreeMap<K, TreeSet<V>> in SortedBiMultiValueMap to make sure we can > copy it O(N) time. > I also attached a benchmark script (TokenMetaDataBenchmark.java), which > simulates a large cluster then measures average latency for TokenMetaData > cache populating. > Benchmark result before and after that patch: > {code:java} > trunk: > before 100ms, after 13ms > 3.0.x: > before 199ms, after 15ms > {code} > (On 3.0.x even the forward TreeMap copying is slow, the O(N*log(N)) to O(N) > optimization is not applied because the key comparator is dynamically created > and TreeMap cannot determine the source and dest are in same order) -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org