[ 
https://issues.apache.org/jira/browse/FLINK-31125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated FLINK-31125:
-----------------------------
    Description: 
The Flink ML benchmark framework estimates the throughput by having a source 
operator generate a given number (e.g. 10^7) of input records with random 
values, letting the given AlgoOperator process these input records, and 
dividing the number of records by the total execution time. 
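The estimate itself is a simple ratio. A minimal sketch (the class and method names below are illustrative, not the benchmark framework's actual API):

```java
public class ThroughputEstimate {
    // Throughput = number of processed records / total execution time in seconds.
    static double recordsPerSecond(long numRecords, long elapsedNanos) {
        return numRecords / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        long numRecords = 10_000_000L;                 // e.g. 10^7 input records
        long elapsedNanos = 2_000_000_000L;            // measured execution time: 2 s
        // 10^7 records in 2 seconds -> 5,000,000 records/sec
        System.out.println(recordsPerSecond(numRecords, elapsedNanos));
    }
}
```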

The overhead of generating random values for all input records has an 
observable impact on the estimated throughput. We would like to minimize the 
overhead of the source operator so that the benchmark result reflects the 
throughput of the AlgoOperator as closely as possible.

Note that [spark-sql-perf|https://github.com/databricks/spark-sql-perf] 
generates all input records into memory in advance, before running the 
benchmark. This allows the Spark ML benchmark to read records from memory 
instead of generating values for those records during the benchmark.

We can generate a value once and re-use it for all input records. This 
approach minimizes the source operator overhead and allows us to compare the 
Flink ML benchmark result with the Spark ML benchmark result (from 
spark-sql-perf) fairly.
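A minimal sketch of this idea as a standalone helper (the class and method names are hypothetical, not part of Flink ML): the random value is drawn exactly once, so the per-record cost is only handing out the cached value.

```java
import java.util.Arrays;
import java.util.Random;

public class ReusedValueSource {
    // Generate one random value up front and reuse it for every record,
    // instead of paying the random-generation cost per record.
    public static double[] generate(int numRecords, long seed) {
        double value = new Random(seed).nextDouble(); // one-time generation cost
        double[] records = new double[numRecords];
        Arrays.fill(records, value);                  // reused for all records
        return records;
    }
}
```

Because every record carries the same value, the AlgoOperator still sees the full record count while the source does almost no work per record.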




  was:
Flink ML benchmark framework estimates the throughput by having a source 
operator generate a given number (e.g. 10^7) of input records with random 
values, let the given AlgoOperator process these input records, and divide the 
number of records by the total execution time. 

The overhead of generating random values for all input records has observable 
impact on the estimated throughput. We would like to minimize the overhead of 
the source operator so that the benchmark result can focus on the throughput of 
the AlgoOperator as much as possible.

Note that [spark-sql-perf|https://github.com/databricks/spark-sql-perf] 
generates all input records in advance into memory before running the 
benchmark. This allows Spark ML benchmark to read records from memory instead 
of generating values for those records during the benchmark.

We can generate value once and re-use it for all input records. This approach 
minimizes the overhead of source operator and allow us to compare the Flink ML 
benchmark result with Spark ML benchmark result (using spark-sql-perf) fairly.





> Flink ML benchmark framework should minimize the source operator overhead
> -------------------------------------------------------------------------
>
>                 Key: FLINK-31125
>                 URL: https://issues.apache.org/jira/browse/FLINK-31125
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Dong Lin
>            Assignee: Dong Lin
>            Priority: Major
>             Fix For: ml-2.2.0
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
