[ https://issues.apache.org/jira/browse/KAFKA-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780115#comment-16780115 ]
Jonathan Gordon commented on KAFKA-7652:
----------------------------------------

{quote}1) when you profile on latest trunk did you see the same pattern as observed in [https://i.imgur.com/IHxC2cZ.png] as well as in the trace logging compared with 0.10.2.x?
{quote}
The image you linked is actually for 0.10.2.x, which is our current deployment. It shows us gated by RocksDB, but that's actually *faster* than what we saw in 0.11.0.0, the recent trunk, or the test I just ran against 2.2.0-rc0:

[https://i.imgur.com/L6PWIEF.png]

{quote}2) practically the lookups in the caching layer is very cheap and hence even increased a lot it should not contribute to much overhead, whereas the fetches on the underlying store would be much more expensive. Could you confirm if the performance bottleneck is from the underlying rocksDB, or from the caching layer access?
{quote}
For 2.2.0-rc0, we're spending the bulk of our time trying to retrieve records from the NamedCache. See:

[^0.10.2.1-NamedCache.txt]
[^2.2.0-rc0_b-NamedCache.txt]

While I agree that a single cache retrieval should be cheaper than a store fetch, the latest logs show 1,096,089 cache hits per second on 2.2.0-rc0 versus 19,245 on 0.10.2.1. That roughly 57x increase (nearly two orders of magnitude) appears to outweigh whatever performance benefit we'd receive from the caching layer. This is just one of 8 tasks.

During their respective runs, the services consumed 8.4M messages (0.10.2.1) with no lag vs 637K messages (2.2.0-rc0) with considerable lag.

I'd be happy to run again with whatever custom logging or configuration you suggest to help further pinpoint the problem.
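As a quick sanity check on the figures above, here is a minimal Python sketch of the arithmetic. The numbers are copied from this comment; the variable names are only illustrative, and the throughput comparison assumes the two runs were of similar duration, which the comment does not state.

```python
import math

# Cache hits per second for one of the 8 tasks, as quoted in the comment.
hits_per_sec = {"0.10.2.1": 19_245, "2.2.0-rc0": 1_096_089}

# Messages consumed during each run. Run lengths are not given, so this
# comparison assumes the runs lasted about the same amount of time.
messages = {"0.10.2.1": 8_400_000, "2.2.0-rc0": 637_000}

# How many times more often the cache is hit on the newer version.
hit_ratio = hits_per_sec["2.2.0-rc0"] / hits_per_sec["0.10.2.1"]

# The same ratio expressed in orders of magnitude (base-10 logarithm).
hit_orders = math.log10(hit_ratio)

# How many times more messages the older version got through.
throughput_ratio = messages["0.10.2.1"] / messages["2.2.0-rc0"]

print(f"cache hit ratio: {hit_ratio:.1f}x ({hit_orders:.2f} orders of magnitude)")
print(f"throughput ratio: {throughput_ratio:.1f}x in favour of 0.10.2.1")
```

The hit ratio comes out just under two orders of magnitude, while throughput drops by a bit more than one, so the extra lookups track the slowdown in direction but not in exact proportion.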
> Kafka Streams Session store performance degradation from 0.10.2.2 to 0.11.0.0
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-7652
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7652
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.11.0.0, 0.11.0.1, 0.11.0.2, 0.11.0.3, 1.1.1, 2.0.0, 2.0.1
>            Reporter: Jonathan Gordon
>            Assignee: Guozhang Wang
>            Priority: Major
>              Labels: kip
>             Fix For: 2.2.0
>
>         Attachments: 0.10.2.1-NamedCache.txt, 2.2.0-rc0_b-NamedCache.txt, kafka_10_2_1_flushes.txt, kafka_11_0_3_flushes.txt
>
>
> I'm creating this issue in response to [~guozhang]'s request on the mailing
> list:
> [https://lists.apache.org/thread.html/97d620f4fd76be070ca4e2c70e2fda53cafe051e8fc4505dbcca0321@%3Cusers.kafka.apache.org%3E]
>
> We are attempting to upgrade our Kafka Streams application from 0.10.2.1 but
> experience a severe performance degradation. Most of the CPU time seems to be
> spent retrieving from the local cache. Here's an example thread profile with
> 0.11.0.0:
> [https://i.imgur.com/l5VEsC2.png]
>
> When things are running smoothly we're gated by retrieving from the state
> store, with acceptable performance. Here's an example thread profile with
> 0.10.2.1:
> [https://i.imgur.com/IHxC2cZ.png]
>
> Some investigation suggests that we're performing about 3 orders of magnitude
> more lookups on the NamedCache over a comparable time period. I've attached
> the NamedCache flush logs for 0.10.2.1 and 0.11.0.3.
>
> We're using session windows and have the app configured for
> commit.interval.ms = 30 * 1000 and cache.max.bytes.buffering = 10485760
>
> I'm happy to share more details if they would be helpful. Also happy to run
> tests on our data.
> I also found this issue, which seems like it may be related:
> https://issues.apache.org/jira/browse/KAFKA-4904
>
> KIP-420:
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-420%3A+Add+Single+Value+Fetch+in+Session+Stores]
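
For reference, the two settings quoted in the issue description can be written out and converted to more readable units. This is only a sketch in Python: the keys and values are the ones quoted above, while the `to_properties` helper and the variable names are hypothetical and not part of Kafka.

```python
# Settings quoted in the issue description. The keys are the Kafka Streams
# configuration names used there; the dict itself is just for illustration.
streams_config = {
    "commit.interval.ms": 30 * 1000,
    "cache.max.bytes.buffering": 10485760,
}


def to_properties(config):
    """Render a dict as Java-style .properties lines, sorted by key.

    Hypothetical helper for this example only.
    """
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


# Convert the raw values to seconds and MiB for readability.
commit_seconds = streams_config["commit.interval.ms"] / 1000
cache_mib = streams_config["cache.max.bytes.buffering"] / (1024 * 1024)

print(to_properties(streams_config))
print(f"commit interval: {commit_seconds:g} s, cache budget: {cache_mib:g} MiB")
```

So the application commits every 30 seconds with a 10 MiB record cache, which Kafka Streams shares across all stream threads of the instance.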