Hi, all.


We built a standalone cluster with 8 TMs from the Flink 1.17-RC3 release
package and ran the Nexmark[1] benchmark on it.



The test results show that Flink 1.17 has a significant performance
degradation compared to Flink 1.13 (about 8.49% lower Throughput/Cores on
q18). Should we identify the reason for this degradation before Flink 1.17
is released?



From the JM logs, we initially suspect that the task is not stopped in time
and only stops once a checkpoint has completed.



Some of the test logs and results are excerpted below. (*Note that the last
few Cores samples of the Flink 1.17 run show much lower CPU usage.*)



Flink 1.17: q18

Benchmark Queries: [q18]

==================================================================

Start to run query q18 with workload [tps=10 M, eventsNum=100 M,
percentage=bid:46,auction:3,person:1,kafkaServers:null]

Start the warmup for at most 120000ms and 100000000 events.

Stop the warmup, cost 120100ms.

Monitor metrics after 10 seconds.

Start to monitor metrics until job is finished.

Current Cores=18.33 (8 TMs)

Current Cores=15.15 (8 TMs)

Current Cores=14.73 (8 TMs)

Current Cores=17.18 (8 TMs)

Current Cores=17.11 (8 TMs)

Current Cores=12.27 (8 TMs)

Current Cores=13.64 (8 TMs)

Current Cores=13.68 (8 TMs)

Current Cores=13.64 (8 TMs)

Current Cores=16.08 (8 TMs)

Current Cores=14.1 (8 TMs)

Current Cores=15.07 (8 TMs)

Current Cores=14.18 (8 TMs)

Current Cores=12.46 (8 TMs)

Current Cores=12.26 (8 TMs)

Current Cores=13.58 (8 TMs)

Current Cores=15.94 (8 TMs)

Current Cores=14.67 (8 TMs)

Current Cores=15.25 (8 TMs)

Current Cores=13.4 (8 TMs)

Current Cores=13.82 (8 TMs)

Current Cores=10.47 (8 TMs)

Current Cores=1.91 (8 TMs)

Current Cores=0.05 (8 TMs)

Current Cores=0.04 (8 TMs)

Current Cores=0.06 (8 TMs)

Current Cores=0.03 (8 TMs)

Current Cores=0.05 (8 TMs)

Current Cores=0.03 (8 TMs)

Current Cores=0.05 (8 TMs)

Current Cores=0.03 (8 TMs)

Current Cores=0.06 (8 TMs)

Summary Average: EventsNum=100,000,000, Cores=9.98, Time=170.295 s

Stop job query q18

-------------------------------- Nexmark Results --------------------------------

+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Nexmark Query     | Events Num        | Cores             | Time(s)           | Cores * Time(s)   | Throughput/Cores  |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|q18                |100,000,000        |9.98               |170.295            |1699.286           |58.85 K/s          |
|Total              |100,000,000        |9.978              |170.295            |1699.286           |58.85 K/s          |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
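
The reported average of 9.98 cores for the Flink 1.17 run appears to include the idle samples at the end of the log (the 32 "Current Cores" samples above average to roughly that value). Below is a minimal Python sketch for comparing the plain average with an average over active samples only. The helper names and the idle threshold of 1.0 core are assumptions made for illustration; they are not part of Nexmark or Flink.

```python
import re

# Matches lines such as "Current Cores=13.82 (8 TMs)".
CORES_RE = re.compile(r"Current Cores=([0-9.]+)")


def parse_cores(log_text):
    """Extract every 'Current Cores' sample from the log text as a float."""
    return [float(value) for value in CORES_RE.findall(log_text)]


def mean(values):
    """Arithmetic mean; returns 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def active_samples(samples, idle_threshold=1.0):
    """Keep samples at or above the (assumed) idle threshold."""
    return [s for s in samples if s >= idle_threshold]


if __name__ == "__main__":
    # A short excerpt from the Flink 1.17 log above, used as sample input.
    sample_log = """
    Current Cores=13.82 (8 TMs)
    Current Cores=10.47 (8 TMs)
    Current Cores=1.91 (8 TMs)
    Current Cores=0.05 (8 TMs)
    Current Cores=0.04 (8 TMs)
    """
    cores = parse_cores(sample_log)
    print(f"all samples:    mean={mean(cores):.2f} over {len(cores)} samples")
    active = active_samples(cores)
    print(f"active samples: mean={mean(active):.2f} over {len(active)} samples")
```

Feeding in the full log of each run would show how much of the gap in the Cores column comes from the idle tail rather than from the active phase of the job.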







Flink 1.13: q18

Benchmark Queries: [q18]

==================================================================

Start to run query q18 with workload [tps=10 M, eventsNum=100 M,
percentage=bid:46,auction:3,person:1,kafkaServers:null]

Start the warmup for at most 120000ms and 100000000 events.

Stop the warmup, cost 120100ms.

Monitor metrics after 10 seconds.

Start to monitor metrics until job is finished.

Current Cores=13.63 (8 TMs)

Current Cores=12.72 (8 TMs)

Current Cores=12.34 (8 TMs)

Current Cores=12.17 (8 TMs)

Current Cores=12.59 (8 TMs)

Current Cores=14.18 (8 TMs)

Current Cores=11.06 (8 TMs)

Current Cores=11.69 (8 TMs)

Current Cores=11.88 (8 TMs)

Current Cores=11.11 (8 TMs)

Current Cores=13.48 (8 TMs)

Current Cores=16.34 (8 TMs)

Current Cores=11.3 (8 TMs)

Current Cores=12.82 (8 TMs)

Current Cores=9.91 (8 TMs)

Current Cores=11.45 (8 TMs)

Current Cores=10.13 (8 TMs)

Current Cores=13.87 (8 TMs)

Current Cores=12.21 (8 TMs)

Current Cores=13.41 (8 TMs)

Current Cores=12.15 (8 TMs)

Current Cores=13.63 (8 TMs)

Current Cores=10.92 (8 TMs)

Current Cores=10.51 (8 TMs)

Summary Average: EventsNum=100,000,000, Cores=12.31, Time=126.297 s

Stop job query q18

-------------------------------- Nexmark Results --------------------------------

+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
| Nexmark Query     | Events Num        | Cores             | Time(s)           | Cores * Time(s)   | Throughput/Cores  |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|q18                |100,000,000        |12.31              |126.297            |1555.081           |64.31 K/s          |
|Total              |100,000,000        |12.313             |126.297            |1555.081           |64.31 K/s          |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
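
For reference, the 8.49% figure can be reproduced from the "Cores * Time(s)" column of the two result tables. The following is a minimal Python sketch of that arithmetic. The function and variable names are made up for illustration, and it assumes that Throughput/Cores is simply the event count divided by core-seconds, which matches the table values after rounding.

```python
EVENTS = 100_000_000  # Events Num, identical in both runs


def throughput_per_core(events, core_seconds):
    """Events processed per core-second (the Throughput/Cores column)."""
    return events / core_seconds


def relative_drop(baseline, candidate):
    """Fraction by which candidate falls below baseline."""
    return (baseline - candidate) / baseline


# Core-seconds taken from the Cores * Time(s) column of each table.
flink_113 = throughput_per_core(EVENTS, 1555.081)
flink_117 = throughput_per_core(EVENTS, 1699.286)
drop = relative_drop(flink_113, flink_117)

if __name__ == "__main__":
    print(f"Flink 1.13: {flink_113 / 1000:.2f} K/s per core")
    print(f"Flink 1.17: {flink_117 / 1000:.2f} K/s per core")
    print(f"Degradation: {drop * 100:.2f}%")
```

Note that this metric combines CPU usage and run time, so it is affected both by the longer run time of 1.17 (170.295 s vs 126.297 s) and by its lower average Cores value.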







[1] https://github.com/nexmark/nexmark



Best

Yu Chen.
