[ 
https://issues.apache.org/jira/browse/FLINK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612408#comment-16612408
 ] 

陈梓立 commented on FLINK-10320:
-----------------------------

[~pnowojski] thanks for your quick reply! Sorry my response is a bit 
late (laugh).

IMO the most important thing to benchmark about {{JobMaster}} scheduling is how 
fast it reacts to RPC requests, which include slot offering, task deployment, 
task state updates, fault tolerance and so on.

But measuring them separately seems unreasonable since they rely on each other, 
so I would put the whole scheduling process into one benchmark. And you're 
right that its wall-clock time might not be the most expressive metric. Knowing 
little about JMH, I see that the existing benchmarks always measure either the 
latency or the ops of a given benchmark function, where ops is how many times 
the whole function executes within a given time. I'd appreciate it if the 
benchmark framework provides other, more precise metrics.
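For reference, JMH's {{@BenchmarkMode}} also offers {{Mode.SingleShotTime}} (one cold run, which fits "how long does one whole scheduling run take") and {{Mode.SampleTime}} besides {{Mode.Throughput}} and {{Mode.AverageTime}}. Below is a minimal plain-Java sketch of the difference between the two measurement styles; {{scheduleAllTasks}} is a hypothetical placeholder, not a real Flink call:

```java
import java.util.concurrent.TimeUnit;

public class ScheduleBenchmarkSketch {

    // Hypothetical stand-in for a whole JobMaster scheduling run.
    static void scheduleAllTasks() {
        long acc = 0;
        for (int i = 0; i < 1_000; i++) {
            acc += i; // pretend work
        }
        if (acc < 0) throw new IllegalStateException();
    }

    public static void main(String[] args) {
        // Single-shot latency: time one run, like JMH's Mode.SingleShotTime.
        long start = System.nanoTime();
        scheduleAllTasks();
        long singleShotNanos = System.nanoTime() - start;

        // Throughput: count executions in a fixed window, like Mode.Throughput.
        long ops = 0;
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(100);
        while (System.nanoTime() < deadline) {
            scheduleAllTasks();
            ops++;
        }

        System.out.println("single-shot nanos > 0: " + (singleShotNanos > 0));
        System.out.println("ops in 100 ms > 0: " + (ops > 0));
    }
}
```

For a one-off "startup to all tasks FINISHED" measurement, single-shot time seems the closer fit.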

For the implementation part, during an early attempt I also found that these 
mock {{TaskExecutor}}s are not that straightforward to implement. Rather than 
just providing a {{TaskExecutorGateway}}, we definitely have to simulate 
finishing the task, so a {{TaskExecutor}} is required. There is no 
{{TestingTaskExecutor}} yet, and to control how tasks finish we would have to 
override some methods of a real {{TaskExecutor}}. We would also need to 
simulate callback actions like heartbeats or task actions. It would be 
reasonable to run the TM/callback threads in a thread pool (separate from the 
JM's) so that we don't crash the local machine.
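To make the threading layout concrete, here is a minimal, self-contained sketch (all names hypothetical, not Flink APIs): the mock TM side runs on its own pool and finishes each "deployed" task immediately, reporting FINISHED back onto a separate single-threaded pool that plays the JobMaster's main thread:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MockTaskExecutorSketch {
    public static void main(String[] args) throws Exception {
        // Separate pools so simulated TM callbacks never starve the "JM" thread.
        ExecutorService jmPool = Executors.newSingleThreadExecutor();
        ExecutorService tmPool = Executors.newFixedThreadPool(4);

        int tasks = 8;
        CountDownLatch allFinished = new CountDownLatch(tasks);

        for (int i = 0; i < tasks; i++) {
            // "Deploy" on the TM pool; the mock executor finishes the task
            // immediately and reports FINISHED back on the JM pool, mimicking
            // a task-state-update RPC.
            tmPool.execute(() ->
                jmPool.execute(allFinished::countDown));
        }

        boolean done = allFinished.await(5, TimeUnit.SECONDS);
        System.out.println("all tasks reported FINISHED: " + done);

        tmPool.shutdown();
        jmPool.shutdown();
    }
}
```

Sizing the TM pool independently of the JM pool is the point: the benchmark can scale the number of simulated TaskExecutors without multiplying JM-side threads.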

> Introduce JobMaster schedule micro-benchmark
> --------------------------------------------
>
>                 Key: FLINK-10320
>                 URL: https://issues.apache.org/jira/browse/FLINK-10320
>             Project: Flink
>          Issue Type: Improvement
>          Components: Tests
>            Reporter: 陈梓立
>            Assignee: 陈梓立
>            Priority: Major
>
> Based on the {{org.apache.flink.streaming.runtime.io.benchmark}} stuff and the 
> repo [flink-benchmark|https://github.com/dataArtisans/flink-benchmarks], I 
> propose to introduce another micro-benchmark which focuses on {{JobMaster}} 
> scheduling performance.
> h3. Target
> Benchmark how long it takes from {{JobMaster}} startup (receiving the 
> {{JobGraph}} and initializing) until all tasks are RUNNING. Technically we use 
> a bounded stream and the TM finishes tasks as soon as they arrive, so the 
> interval we actually measure ends when all tasks are FINISHED.
> h3. Case
> 1. JobGraph that cover EAGER + PIPELINED edges
> 2. JobGraph that cover LAZY_FROM_SOURCES + PIPELINED edges
> 3. JobGraph that cover LAZY_FROM_SOURCES + BLOCKING edges
> ps: maybe also benchmark the case where sources read from {{InputSplit}}s?
> h3. Implementation
> Based on the flink-benchmark repo, we finally run benchmarks using JMH, so the 
> whole test suite is separated into two repos. The testing environment could be 
> located in the main repo, maybe under 
> flink-runtime/src/test/java/org/apache/flink/runtime/jobmaster/benchmark.
> To measure the performance of {{JobMaster}} scheduling, we need to simulate 
> an environment that:
> 1. has a real {{JobMaster}}
> 2. has a mock/testing {{ResourceManager}} that has infinite resources and 
> reacts immediately.
> 3. has one (or many?) mock/testing {{TaskExecutor}}s that deploy and finish 
> tasks immediately.
> [~trohrm...@apache.org] [~GJL] [~pnowojski] could you please review this 
> proposal to help clarify the goal and concrete details? Thanks in advance.
> Any suggestions are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
