[ 
https://issues.apache.org/jira/browse/FLINK-31240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Lin updated FLINK-31240:
-----------------------------
    Description: 
In some cases, users may need to convert the underlying DataStream to Table and 
then convert it back to DataStream(e.g. some Flink ML libraries accept a Table 
as input and convert it to DataStream for calculation.). This would cause 
unnecessary overhead because of data conversion between the internal data type 
and the external data type.

We can reduce the overhead by checking if there are paired 
`fromDataStream`/`toDataStream` function call without any transformation, if so 
using the source datastream directly.

 The performance of Flink ML's Bucketizer algorithm[1] is used to demonstrate 
the impact of this optimization. The execution time is obtained by taking the 
median execution time across 5 runs for each setup.

Before optimization: 40746ms
After optimization: 12972ms
Thus this optimization reduces the total execution time of Flink ML's 
Bucketizer algorithm to about 1/3.

[1] 
https://github.com/apache/flink-ml/blob/master/flink-ml-benchmark/src/main/resources/bucketizer-benchmark.json

  was:
In some cases, users may need to convert the underlying DataStream to Table and 
then convert it back to DataStream(e.g. some Flink ML libraries accept a Table 
as input and convert it to DataStream for calculation.). This would cause 
unnecessary overhead because of data conversion between the internal data type 
and the external data type.

We can reduce the overhead by checking if there are paired 
`fromDataStream`/`toDataStream` function call without any transformation, if so 
using the source datastream directly.

 


> Reduce the overhead of conversion between DataStream and Table
> --------------------------------------------------------------
>
>                 Key: FLINK-31240
>                 URL: https://issues.apache.org/jira/browse/FLINK-31240
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / API
>            Reporter: Jiang Xin
>            Assignee: Yunfeng Zhou
>            Priority: Major
>              Labels: pull-request-available
>
> In some cases, users may need to convert the underlying DataStream to Table 
> and then convert it back to DataStream(e.g. some Flink ML libraries accept a 
> Table as input and convert it to DataStream for calculation.). This would 
> cause unnecessary overhead because of data conversion between the internal 
> data type and the external data type.
> We can reduce the overhead by checking if there are paired 
> `fromDataStream`/`toDataStream` function call without any transformation, if 
> so using the source datastream directly.
>  The performance of Flink ML's Bucketizer algorithm[1] is used to demonstrate 
> the impact of this optimization. The execution time is obtained by taking the 
> median execution time across 5 runs for each setup.
> Before optimization: 40746ms
> After optimization: 12972ms
> Thus this optimization reduces the total execution time of Flink ML's 
> Bucketizer algorithm to about 1/3.
> [1] 
> https://github.com/apache/flink-ml/blob/master/flink-ml-benchmark/src/main/resources/bucketizer-benchmark.json



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to