[ 
https://issues.apache.org/jira/browse/SAMZA-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319620#comment-15319620
 ] 

Yi Pan (Data Infrastructure) commented on SAMZA-957:
----------------------------------------------------

Merged and submitted. Thanks!

> Avoid unnecessary KV Store flushes (part 3)
> -------------------------------------------
>
>                 Key: SAMZA-957
>                 URL: https://issues.apache.org/jira/browse/SAMZA-957
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jake Maes
>            Assignee: Jake Maes
>             Fix For: 0.10.1
>
>         Attachments: SAMZA-957_1.patch
>
>
> We had an issue where RocksDB performance severely degraded for 23 hours and 
> then resolved itself. To troubleshoot the issue I gathered some samples of 
> the compaction stats from the RocksDB log and engaged with the RocksDB team 
> via an existing, related issue: 
> https://github.com/facebook/rocksdb/issues/696#issuecomment-222549220
> They pointed out that the job was flushing excessively:
> {quote}
> If you overload RocksDB with work (i.e. do bunch of writes really fast, or in 
> your case, bunch of small flushes), it will begin stalling writes while the 
> compactions (deferred work) completes. An interesting thing with RocksDB and 
> LSM architecture is that the more behind you are on compactions, the more 
> expensive the compactions are (due to increased write amplifications and 
> single-threadness of L0->L1 compaction). So our write stalls have to be tuned 
> exactly right for RocksDB to behave well with extremely high write rate.
> {quote}
> Looking through our commit history, I see that SAMZA-812 and SAMZA-873 were 
> both intended to address this issue by reducing the number of flushes in 
> CachedStore. 
> To be fair, the job in question did not have the SAMZA-873 patch, but I see 
> even more room for improvement. Namely, CachedStore should *never* flush the 
> underlying store unless its own flush() is called. It can purge its dirty 
> items to the underlying store, trading some performance for correctness, but 
> flushing is excessive. So, this patch will remove the flushes from the all() 
> and range() methods, simplify the LRU logic, and add a unit test to verify 
> and explain the proper LRU behavior.
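
The distinction the description draws between purging dirty items and flushing the underlying store can be sketched roughly as follows. This is a minimal illustration, not the SAMZA-957 patch or Samza's actual CachedStore: the class names CountingStore and WriteBackCache, the purgeDirty() helper, and the flushCount field are all hypothetical, and the LRU eviction logic is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names come from Samza.
class CachedStoreSketch {

    /** Stand-in for the underlying store. It only counts flush() calls. */
    static class CountingStore {
        final TreeMap<String, String> data = new TreeMap<>();
        int flushCount = 0;

        void putAll(Map<String, String> entries) { data.putAll(entries); }

        /** Returns all values in key order. */
        List<String> all() { return new ArrayList<>(data.values()); }

        /** In a real store this is the expensive operation. */
        void flush() { flushCount++; }
    }

    /** Write-back cache that buffers writes as dirty entries. */
    static class WriteBackCache {
        private final CountingStore store;
        private final LinkedHashMap<String, String> dirty = new LinkedHashMap<>();

        WriteBackCache(CountingStore store) { this.store = store; }

        void put(String key, String value) { dirty.put(key, value); }

        /** Writes dirty entries through to the store WITHOUT flushing it. */
        private void purgeDirty() {
            if (!dirty.isEmpty()) {
                store.putAll(dirty);
                dirty.clear();
            }
        }

        /** A scan must see buffered writes, so purge first, but do not flush. */
        List<String> all() {
            purgeDirty();
            return store.all();
        }

        /** Only an explicit flush() reaches the underlying store's flush(). */
        void flush() {
            purgeDirty();
            store.flush();
        }
    }
}
```

In this sketch, calling all() after a put() returns the buffered value while leaving the store's flush count at zero; the count only increases when flush() is called on the cache itself.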



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
