Thanks for the detailed explanation, Yun Tang, and for all of the effort
you have clearly put into it. Based on what was described here, I would
also vote for going forward with the upgrade.

It's a pity that this regression wasn't caught in the RocksDB community. I
have two questions/ideas:
1. Can we push for better benchmark coverage in the RocksDB project in the
future?
2. Can we try to catch this kind of problem with RocksDB earlier? For
example, with more frequent RocksDB upgrades, or by building test Flink
builds with the most recent RocksDB version to run our benchmarks and
validate newer RocksDB versions?

Best,
Piotrek

Wed, Aug 4, 2021 at 19:59 Yuval Itzchakov <yuva...@gmail.com> wrote:

> Hi Yun,
> Thank you for the elaborate explanation and even more so for the super
> hard work that you're doing digging into RocksDB and chasing after
> hundreds of commits in order to fix them so we can all benefit.
>
> I can say for myself that optimizing for memory is more important for us
> ATM, and I'm totally +1 for this.
>
> On Wed, Aug 4, 2021 at 8:50 PM Yun Tang <myas...@live.com> wrote:
>
>> Hi Yuval,
>>
>> Upgrading the RocksDB version has been a long story since Flink-1.10.
>> When we first planned to introduce the write buffer manager to help
>> control the memory usage of RocksDB, we actually wanted to bump up to
>> RocksDB-5.18 from the then-current RocksDB-5.17. However, we found a
>> performance regression in our micro-benchmark on state operations [1]
>> when bumping to RocksDB-5.18. We could not figure out the root cause at
>> that time and decided to cherry-pick the write buffer manager commits
>> into our own FRocksDB [2]. We finally released our own
>> frocksdbjni-5.17.2-artisans-2.0 at that time.
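>>
>> To illustrate what the write buffer manager gives us, here is a minimal
>> RocksJava sketch (the sizes and path below are made up for illustration;
>> this is not Flink's actual wiring, and it assumes a RocksJava version
>> that already exposes WriteBufferManager):
>>
>>     import org.rocksdb.Cache;
>>     import org.rocksdb.LRUCache;
>>     import org.rocksdb.Options;
>>     import org.rocksdb.RocksDB;
>>     import org.rocksdb.RocksDBException;
>>     import org.rocksdb.WriteBufferManager;
>>
>>     public class WriteBufferManagerExample {
>>         public static void main(String[] args) throws RocksDBException {
>>             RocksDB.loadLibrary();
>>             // Shared cache against which memtable memory is charged.
>>             Cache cache = new LRUCache(256 * 1024 * 1024);
>>             // Cap write buffer (memtable) memory at 128 MB, accounted
>>             // for inside the shared cache.
>>             WriteBufferManager wbm =
>>                 new WriteBufferManager(128 * 1024 * 1024, cache);
>>             try (Options options = new Options()
>>                      .setCreateIfMissing(true)
>>                      .setWriteBufferManager(wbm);
>>                  RocksDB db = RocksDB.open(options, "/tmp/rocksdb-wbm-demo")) {
>>                 db.put("key".getBytes(), "value".getBytes());
>>             }
>>         }
>>     }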
>>
>> As time went on, more and more bugs and missing features were reported
>> against the old RocksDB version, such as:
>>
>>    1. Cannot support the ARM platform [3]
>>    2. Does not have a stable deleteRange API, which is useful for Flink
>>    scale-out [4] (see the sketch after this list)
>>    3. Cannot support a strict block cache [5]
>>    4. Checkpoints might get stuck when using the UNIVERSAL compaction
>>    strategy [6]
>>    5. Uncontrolled log size made us disable the RocksDB internal LOG [7]
>>    6. RocksDB's optimizeForPointLookup option might cause data loss [8]
>>    7. The dummy entry used for memory control in RocksDB-5.17 is too
>>    large, leading to performance problems [9]
>>    8. Cannot support Alpine-based images.
>>    9. .......
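>>
>> As a sketch of the deleteRange use case from item 2 (the key range below
>> is made up for illustration and is not Flink's actual rescaling code, and
>> it assumes a RocksJava version where deleteRange is available): when an
>> operator takes over only part of a key range after rescaling, the
>> out-of-range keys can be dropped with a single range tombstone instead
>> of iterating and deleting key by key.
>>
>>     import org.rocksdb.Options;
>>     import org.rocksdb.RocksDB;
>>     import org.rocksdb.RocksDBException;
>>
>>     public class DeleteRangeExample {
>>         public static void main(String[] args) throws RocksDBException {
>>             RocksDB.loadLibrary();
>>             try (Options options = new Options().setCreateIfMissing(true);
>>                  RocksDB db = RocksDB.open(options, "/tmp/rocksdb-dr-demo")) {
>>                 // Deletes all keys in [beginKey, endKey) in one call
>>                 // instead of a per-key delete loop.
>>                 byte[] beginKey = {0x00};
>>                 byte[] endKey = {0x7f};
>>                 db.deleteRange(beginKey, endKey);
>>             }
>>         }
>>     }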
>>
>> Some of these bugs have been worked around, and some are still open.
>>
>> We decided to make some changes starting from Flink-1.12. First of all,
>> we reported the performance regression between RocksDB-5.17 and
>> RocksDB-5.18 to the RocksDB community [10]. However, as the RocksDB-5.x
>> versions are already a bit old for that community, and RocksJava might
>> not be a core concern for the Facebook developers, we did not get useful
>> replies. Thus, we decided to figure out the root cause of the performance
>> regression ourselves. Fortunately, we found the cause by binary-searching
>> the commits between RocksDB-5.17 and RocksDB-5.18, and posted an update
>> in the original thread [10]. In short, the performance regression is due
>> to the different implementations of `__thread` and `thread_local` in gcc;
>> the latter has a higher overhead under dynamic loading [11], which is
>> exactly how the current RocksJava jar package loads its native library.
>> With my patch [12], the performance regression disappears when comparing
>> RocksDB-5.18 against RocksDB-5.17.
>>
>> Unfortunately, RocksDB-5.18 still has many bugs, and we wanted to bump to
>> RocksDB-6.x. However, another performance regression appeared, even with
>> my patch [12]. From the previous experience, we knew that we had to
>> verify the built .so files with our Java-based benchmark instead of using
>> RocksDB's built-in db_bench. I started to search the 1340+ commits from
>> RocksDB-5.18 to RocksDB-6.11 for the performance problem. However, this
>> time I did not figure out the root cause even after spending several
>> weeks. Performance goes up and down across those commits, and I could not
>> pinpoint *the commit* that introduced the regression. Take the commit
>> that integrates the block cache tracer into the block-based table reader
>> [13] as an example: I noticed that this commit caused a slight
>> performance regression, possibly due to useless usage accounting in read
>> operations; however, the problematic code was changed again in later
>> commits. Thus, after several weeks of digging, I had to give up the
>> endless search through those thousand-plus commits, at least temporarily.
>> As the RocksDB community does not seem to make its project management
>> system public, unlike Apache's open JIRA, we do not know which benchmarks
>> they actually run before releasing each version to guarantee performance.
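>>
>> For context, verifying a locally built .so with a Java-based benchmark
>> can look roughly like the sketch below (a deliberately simplified
>> micro-benchmark, not the actual flink-benchmarks harness [1]; the
>> library path is a placeholder):
>>
>>     import java.util.Arrays;
>>     import org.rocksdb.Options;
>>     import org.rocksdb.RocksDB;
>>     import org.rocksdb.RocksDBException;
>>
>>     public class NativeLibSmokeBenchmark {
>>         public static void main(String[] args) throws RocksDBException {
>>             // Load the custom-built librocksdbjni from a local directory
>>             // instead of the .so bundled inside the RocksJava jar.
>>             RocksDB.loadLibrary(Arrays.asList("/path/to/custom/build"));
>>             try (Options options = new Options().setCreateIfMissing(true);
>>                  RocksDB db = RocksDB.open(options, "/tmp/rocksdb-bench")) {
>>                 byte[] value = new byte[128];
>>                 long start = System.nanoTime();
>>                 for (int i = 0; i < 1_000_000; i++) {
>>                     byte[] key = Integer.toString(i).getBytes();
>>                     db.put(key, value);
>>                     db.get(key);
>>                 }
>>                 long nanos = System.nanoTime() - start;
>>                 // 2M operations total (1M put + 1M get).
>>                 System.out.println("avg ns/op: " + nanos / 2_000_000.0);
>>             }
>>         }
>>     }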
>>
>> With my patch [12] on the latest RocksDB-6.20.3, we could get the results
>> on Nexmark shown in the original thread sent by Stephan, and we can see
>> that the performance is close in many real-world cases. We also hope that
>> new features, such as direct ByteBuffer support [14] in RocksJava, could
>> help improve RocksDB's performance in the future (see the sketch below).
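>>
>> As a small illustration of what the direct buffer API could look like
>> (assuming the ByteBuffer-based get/put from [14] is available in the
>> RocksJava version in use; this is a sketch, not Flink code):
>>
>>     import java.nio.ByteBuffer;
>>     import org.rocksdb.Options;
>>     import org.rocksdb.ReadOptions;
>>     import org.rocksdb.RocksDB;
>>     import org.rocksdb.RocksDBException;
>>     import org.rocksdb.WriteOptions;
>>
>>     public class DirectBufferExample {
>>         public static void main(String[] args) throws RocksDBException {
>>             RocksDB.loadLibrary();
>>             try (Options options = new Options().setCreateIfMissing(true);
>>                  RocksDB db = RocksDB.open(options, "/tmp/rocksdb-db-demo")) {
>>                 // Direct buffers let JNI access native memory directly
>>                 // instead of copying byte[] contents across the boundary.
>>                 ByteBuffer key = ByteBuffer.allocateDirect(16);
>>                 ByteBuffer val = ByteBuffer.allocateDirect(64);
>>                 key.put("k1".getBytes()).flip();
>>                 val.put("v1".getBytes()).flip();
>>                 db.put(new WriteOptions(), key, val);
>>
>>                 key.rewind();
>>                 ByteBuffer out = ByteBuffer.allocateDirect(64);
>>                 int len = db.get(new ReadOptions(), key, out);
>>                 System.out.println("value length: " + len);
>>             }
>>         }
>>     }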
>>
>> I hope this explains what we have already done.
>>
>>
>> [1] https://github.com/apache/flink-benchmarks
>> [2] https://github.com/ververica/frocksdb/tree/FRocksDB-5.17.2
>> [3] https://issues.apache.org/jira/browse/FLINK-13598
>> [4] https://issues.apache.org/jira/browse/FLINK-21321
>> [5] https://github.com/facebook/rocksdb/issues/6247
>> [6] https://issues.apache.org/jira/browse/FLINK-21726
>> [7] https://issues.apache.org/jira/browse/FLINK-15068
>> [8] https://issues.apache.org/jira/browse/FLINK-17800
>> [9] https://github.com/facebook/rocksdb/pull/5175
>> [10] https://github.com/facebook/rocksdb/issues/5774
>> [11] http://david-grs.github.io/tls_performance_overhead_cost_linux/
>> [12] https://github.com/ververica/frocksdb/pull/19
>> [13] https://github.com/facebook/rocksdb/pull/5441/
>> [14] https://github.com/facebook/rocksdb/pull/2283
>>
>>
>> Best,
>> Yun Tang
>>
>>
>> On Wed, Aug 4, 2021 at 2:36 PM Yuval Itzchakov <yuva...@gmail.com> wrote:
>>
>> We are heavy users of RocksDB and have had several issues with memory
>> management in Kubernetes; most of them actually went away when we
>> upgraded from Flink 1.9 to 1.13.
>>
>> Do we know why there's such a huge performance regression? Can we improve
>> this somehow by tweaking some flags? It would be great to see a more
>> in-depth explanation of the gains vs. losses of upgrading.
>>
>> On Wed, Aug 4, 2021 at 3:08 PM Stephan Ewen <se...@apache.org> wrote:
>>
>>
>>
>> ------------------------------
>> *From:* Nico Kruber <n...@apache.org>
>> *Sent:* Thursday, August 5, 2021 0:10
>> *To:* u...@flink.apache.org <u...@flink.apache.org>; dev <
>> dev@flink.apache.org>
>> *Subject:* Re: [ANNOUNCE] RocksDB Version Upgrade and Performance
>>
>> That's actually also what I'm seeing most of the time and what I'd
>> expect to improve with the newer RocksDB version.
>> Hence, I'd also favour the upgrade even if there is a slight catch with
>> respect to performance - we should, however, continue to investigate this
>> together with the RocksDB community.
>>
>>
>> Nico
>>
>> On Wednesday, 4 August 2021 14:26:32 CEST David Anderson wrote:
>> > I am hearing quite often from users who are struggling to manage memory
>> > usage, and these are all users using RocksDB. While I don't know for
>> > certain that RocksDB is the cause in every case, from my perspective,
>> > getting the better memory stability of version 6.20 in place is
>> > critical.
>> >
>> > Regards,
>> > David
>> >
>> > On Wed, Aug 4, 2021 at 8:08 AM Stephan Ewen <se...@apache.org> wrote:
>> > > Hi all!
>> > >
>> > > *!!! If you are a big user of the Embedded RocksDB State Backend and
>> > > have performance-sensitive workloads, please read this !!!*
>> > >
>> > > I want to quickly raise some awareness for a RocksDB version upgrade
>> > > we plan to do, and some possible impact on application performance.
>> > >
>> > > *We plan to upgrade RocksDB to version 6.20.* That version of RocksDB
>> > > unfortunately introduces a non-trivial performance regression. In our
>> > > Nexmark Benchmark, at least one query is up to 13% slower.
>> > > With some fixes, this can be improved, but even then there is an
>> > > overall *regression of up to 6% in some queries*. (See the attached
>> > > table for results from the relevant Nexmark Benchmark queries.)
>> > >
>> > > We would do this update nonetheless, because we need to get new
>> > > features and bugfixes from RocksDB in.
>> > >
>> > > Please respond to this mail thread if you have major concerns about
>> > > this.
>> > >
>> > >
>> > > *### Fallback Plan*
>> > >
>> > > Optionally, we could fall back to Plan B, which is to upgrade RocksDB
>> > > only to version 5.18.4. That version has no performance regression
>> > > (after applying a custom patch).
>> > >
>> > > While this spares us the performance degradation of RocksDB 6.20.x,
>> > > this has multiple disadvantages:
>> > >   - Does not include the better memory stability (strict cache
>> > >     control)
>> > >   - Misses out on some new features which some users asked about
>> > >   - Does not have the latest RocksDB bugfixes
>> > >
>> > > The last point is especially bad in my opinion. While we can
>> > > cherry-pick some bugfixes back (and have done this in the past), users
>> > > typically run into an issue first and need to trace it back to
>> > > RocksDB; then one of the committers can find the relevant patch on
>> > > RocksDB master and backport it. That isn't the greatest user
>> > > experience.
>> > >
>> > > Because of those disadvantages, we would prefer to do the upgrade to
>> > > the newer RocksDB version despite the unfortunate performance
>> > > regression.
>> > >
>> > > Best,
>> > > Stephan
>>
>>
>> --
>> Dr. Nico Kruber | Solutions Architect
>>
>> Follow us @VervericaData Ververica
>> --
>> Join Flink Forward - The Apache Flink Conference
>> Stream Processing | Event Driven | Real Time
>> --
>> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>> --
>> Ververica GmbH
>> Registered at Amtsgericht Charlottenburg: HRB 158244 B
>> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton
>> Wehner
>>
>>
>>
>
> --
> Best Regards,
> Yuval Itzchakov.
>
