[ https://issues.apache.org/jira/browse/SAMZA-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319620#comment-15319620 ]
Yi Pan (Data Infrastructure) commented on SAMZA-957:
----------------------------------------------------

Merged and submitted. Thanks!

> Avoid unnecessary KV Store flushes (part 3)
> -------------------------------------------
>
>                 Key: SAMZA-957
>                 URL: https://issues.apache.org/jira/browse/SAMZA-957
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jake Maes
>            Assignee: Jake Maes
>             Fix For: 0.10.1
>
>         Attachments: SAMZA-957_1.patch
>
>
> We had an issue where RocksDB performance severely degraded for 23 hours and
> then resolved itself. To troubleshoot the issue I gathered some samples of
> the compaction stats from the RocksDB log and engaged with the RocksDB team
> via an existing, related issue:
> https://github.com/facebook/rocksdb/issues/696#issuecomment-222549220
> They pointed out that the job was flushing excessively:
> {quote}
> If you overload RocksDB with work (i.e. do bunch of writes really fast, or in
> your case, bunch of small flushes), it will begin stalling writes while the
> compactions (deferred work) completes. An interesting thing with RocksDB and
> LSM architecture is that the more behind you are on compactions, the more
> expensive the compactions are (due to increased write amplifications and
> single-threadness of L0->L1 compaction). So our write stalls have to be tuned
> exactly right for RocksDB to behave well with extremely high write rate.
> {quote}
> Looking through our commit history, I see that SAMZA-812 and SAMZA-873 were
> both intended to address this issue by reducing the number of flushes in
> CachedStore.
> To be fair, the job in question did not have the SAMZA-873 patch, but I see
> even more room for improvement. Namely, CachedStore should *never* flush the
> underlying store unless its flush() was called. It can purge its dirty items
> to trade off performance for correctness, but flushing is excessive.
> So, this patch will remove the flushes from the all() and range() methods,
> simplify the LRU logic, and add a good unit test to verify and explain the
> proper LRU behavior.
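
For readers unfamiliar with the distinction drawn in the description, here is a minimal Java sketch of "purge dirty items without flushing". It is not Samza's CachedStore and not the contents of the attached patch: the class names (CountingStore, WriteBackCacheSketch, FlushSketchDemo) and their methods are hypothetical, the LRU eviction logic is left out, and the underlying store is a plain in-memory map that only counts how often flush() is called.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Small demo entry point (hypothetical; for illustration only).
class FlushSketchDemo {
    public static void main(String[] args) {
        CountingStore store = new CountingStore();
        WriteBackCacheSketch cache = new WriteBackCacheSketch(store);

        cache.put("b", "2");
        cache.put("a", "1");

        // Iterating purges the dirty entries into the store but does not flush it.
        System.out.println("keys = " + cache.allKeys());
        System.out.println("flushes after allKeys() = " + store.flushCount);

        // Only an explicit flush() reaches the underlying store's flush().
        cache.flush();
        System.out.println("flushes after flush() = " + store.flushCount);
    }
}

// Hypothetical stand-in for the underlying store (RocksDB in the issue).
class CountingStore {
    final TreeMap<String, String> data = new TreeMap<>();
    int flushCount = 0;

    void putAll(Map<String, String> entries) { data.putAll(entries); }

    String get(String key) { return data.get(key); }

    List<String> allKeys() { return new ArrayList<>(data.keySet()); }

    // In a real LSM store this is the expensive operation; here it is only counted.
    void flush() { flushCount++; }
}

// Hypothetical write-back cache; eviction/LRU handling is intentionally omitted.
class WriteBackCacheSketch {
    private final CountingStore store;
    private final Map<String, String> dirty = new LinkedHashMap<>();

    WriteBackCacheSketch(CountingStore store) { this.store = store; }

    // Writes are buffered in the cache until they are purged.
    void put(String key, String value) { dirty.put(key, value); }

    String get(String key) {
        String cached = dirty.get(key);
        return cached != null ? cached : store.get(key);
    }

    // Write pending entries to the store WITHOUT asking the store to flush.
    private void purgeDirtyEntries() {
        if (!dirty.isEmpty()) {
            store.putAll(dirty);
            dirty.clear();
        }
    }

    // Iteration has to see pending writes, so purge first; no flush is issued.
    List<String> allKeys() {
        purgeDirtyEntries();
        return store.allKeys();
    }

    // The only path that calls the underlying store's flush().
    void flush() {
        purgeDirtyEntries();
        store.flush();
    }
}
```

In this sketch, iterating costs one batched write to the underlying store but never a flush, which is the trade-off the description argues for in all() and range().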