Hi, all.
We built a standalone cluster with 8 TMs from the Flink 1.17-RC3 release package and ran Nexmark [1] on it. The test results show that Flink 1.17 has a significant performance degradation compared to Flink 1.13: about 8.49% lower Throughput/Cores on q18. Should we identify the cause of this degradation before Flink 1.17 is released?

From the JM logs, our initial suspicion is that the tasks are not stopped in time, but only after a checkpoint has completed. Some of the test logs and results are excerpted below. (*Note the much lower CPU usage in the last few metric samples of the Flink 1.17 run.*)

Flink 1.17: q18

Benchmark Queries: [q18]
==================================================================
Start to run query q18 with workload [tps=10 M, eventsNum=100 M, percentage=bid:46,auction:3,person:1,kafkaServers:null]
Start the warmup for at most 120000ms and 100000000 events.
Stop the warmup, cost 120100ms.
Monitor metrics after 10 seconds.
Start to monitor metrics until job is finished.
Current Cores=18.33 (8 TMs)
Current Cores=15.15 (8 TMs)
Current Cores=14.73 (8 TMs)
Current Cores=17.18 (8 TMs)
Current Cores=17.11 (8 TMs)
Current Cores=12.27 (8 TMs)
Current Cores=13.64 (8 TMs)
Current Cores=13.68 (8 TMs)
Current Cores=13.64 (8 TMs)
Current Cores=16.08 (8 TMs)
Current Cores=14.1 (8 TMs)
Current Cores=15.07 (8 TMs)
Current Cores=14.18 (8 TMs)
Current Cores=12.46 (8 TMs)
Current Cores=12.26 (8 TMs)
Current Cores=13.58 (8 TMs)
Current Cores=15.94 (8 TMs)
Current Cores=14.67 (8 TMs)
Current Cores=15.25 (8 TMs)
Current Cores=13.4 (8 TMs)
Current Cores=13.82 (8 TMs)
Current Cores=10.47 (8 TMs)
Current Cores=1.91 (8 TMs)
Current Cores=0.05 (8 TMs)
Current Cores=0.04 (8 TMs)
Current Cores=0.06 (8 TMs)
Current Cores=0.03 (8 TMs)
Current Cores=0.05 (8 TMs)
Current Cores=0.03 (8 TMs)
Current Cores=0.05 (8 TMs)
Current Cores=0.03 (8 TMs)
Current Cores=0.06 (8 TMs)
Summary Average: EventsNum=100,000,000, Cores=9.98, Time=170.295 s
Stop job query q18
-------------------------------- Nexmark Results
--------------------------------

+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Nexmark Query     | Events Num        | Cores             | Time(s)           | Cores * Time(s)   | Throughput/Cores  |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|q18                |100,000,000        |9.98               |170.295            |1699.286           |58.85 K/s          |
|Total              |100,000,000        |9.978              |170.295            |1699.286           |58.85 K/s          |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+

Flink 1.13: q18

Benchmark Queries: [q18]
==================================================================
Start to run query q18 with workload [tps=10 M, eventsNum=100 M, percentage=bid:46,auction:3,person:1,kafkaServers:null]
Start the warmup for at most 120000ms and 100000000 events.
Stop the warmup, cost 120100ms.
Monitor metrics after 10 seconds.
Start to monitor metrics until job is finished.
Current Cores=13.63 (8 TMs)
Current Cores=12.72 (8 TMs)
Current Cores=12.34 (8 TMs)
Current Cores=12.17 (8 TMs)
Current Cores=12.59 (8 TMs)
Current Cores=14.18 (8 TMs)
Current Cores=11.06 (8 TMs)
Current Cores=11.69 (8 TMs)
Current Cores=11.88 (8 TMs)
Current Cores=11.11 (8 TMs)
Current Cores=13.48 (8 TMs)
Current Cores=16.34 (8 TMs)
Current Cores=11.3 (8 TMs)
Current Cores=12.82 (8 TMs)
Current Cores=9.91 (8 TMs)
Current Cores=11.45 (8 TMs)
Current Cores=10.13 (8 TMs)
Current Cores=13.87 (8 TMs)
Current Cores=12.21 (8 TMs)
Current Cores=13.41 (8 TMs)
Current Cores=12.15 (8 TMs)
Current Cores=13.63 (8 TMs)
Current Cores=10.92 (8 TMs)
Current Cores=10.51 (8 TMs)
Summary Average: EventsNum=100,000,000, Cores=12.31, Time=126.297 s
Stop job query q18
-------------------------------- Nexmark Results --------------------------------

+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Nexmark Query     | Events Num        | Cores             | Time(s)           | Cores * Time(s)   | Throughput/Cores  |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|q18                |100,000,000        |12.31              |126.297            |1555.081           |64.31 K/s          |
|Total              |100,000,000        |12.313             |126.297            |1555.081           |64.31 K/s          |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+

[1] https://github.com/nexmark/nexmark

Best,
Yu Chen
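
P.S. For anyone who wants to check the numbers, here is a minimal sketch that recomputes the figures from the excerpts above. It assumes that Throughput/Cores is Events Num divided by Cores * Time(s), which does reproduce the reported 58.85 K/s and 64.31 K/s, and that the reported Cores value is the plain mean of the "Current Cores" samples. The 1.0-core cut-off used to separate the idle tail is a hypothetical threshold chosen only for illustration.

```python
# Values copied from the Nexmark result tables above (query q18).
EVENTS = 100_000_000
CORE_SECONDS = {"1.17": 1699.286, "1.13": 1555.081}  # "Cores * Time(s)" column


def throughput_per_core(events, core_seconds):
    """Events per core-second (assumed definition of Throughput/Cores)."""
    return events / core_seconds


tpc = {version: throughput_per_core(EVENTS, cs) for version, cs in CORE_SECONDS.items()}
regression = 1 - tpc["1.17"] / tpc["1.13"]

# "Current Cores" samples from the Flink 1.17 run, in log order.
cores_117 = [
    18.33, 15.15, 14.73, 17.18, 17.11, 12.27, 13.64, 13.68,
    13.64, 16.08, 14.1, 15.07, 14.18, 12.46, 12.26, 13.58,
    15.94, 14.67, 15.25, 13.4, 13.82, 10.47, 1.91, 0.05,
    0.04, 0.06, 0.03, 0.05, 0.03, 0.05, 0.03, 0.06,
]

# Mean over all samples, including the near-idle tail at the end of the run.
mean_all = sum(cores_117) / len(cores_117)

# Hypothetical cut-off: samples below 1.0 core are treated as the idle tail.
IDLE_THRESHOLD = 1.0
active = [c for c in cores_117 if c >= IDLE_THRESHOLD]
mean_active = sum(active) / len(active)

for version in ("1.17", "1.13"):
    print(f"Flink {version}: {tpc[version] / 1000:.2f} K events per core-second")
print(f"Regression in Throughput/Cores: {regression:.2%}")
print(f"1.17 mean cores, all samples: {mean_all:.2f}")
print(f"1.17 mean cores, active samples only: {mean_active:.2f}")
print(f"1.17 idle samples: {len(cores_117) - len(active)} of {len(cores_117)}")
```

The idle samples lower the average Cores and lengthen Time(s) together, so Cores * Time(s) stays roughly the same either way. How much of the gap in Throughput/Cores the idle tail explains would need to be confirmed separately.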