Andreas Hailu created FLINK-27827:
-------------------------------------
Summary: StreamExecutionEnvironment method supporting explicit Boundedness
Key: FLINK-27827
URL: https://issues.apache.org/jira/browse/FLINK-27827
Project: Flink
Issue Type: Improvement
Components: API / DataStream
Reporter: Andreas Hailu
When creating a {{DataStreamSource}}, an explicitly bounded input is only
returned if the {{InputFormat}} provided extends {{FileInputFormat}}. This
results in runtime exceptions when trying to run applications in BATCH
execution mode with {{InputFormat}}s that are not {{FileInputFormat}}s, e.g.
Apache Iceberg's inputs [1] or Flink's Hadoop MapReduce compatibility API's
inputs [2].
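For illustration, here is a minimal sketch of the failure mode, using the Hadoop
MapReduce compatibility wrapper only as an example of a non-{{FileInputFormat}}
input (any {{InputFormat}} that does not extend Flink's {{FileInputFormat}}
behaves the same way); the input path is a placeholder:
{code:java}
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class BoundednessProblem {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // A non-FileInputFormat input: the Hadoop MapReduce compatibility wrapper from [2].
        DataStream<Tuple2<LongWritable, Text>> input = env.createInput(
                HadoopInputs.readHadoopFile(
                        new TextInputFormat(), LongWritable.class, Text.class, "/placeholder/input"));
        input.print();

        // createInput() only yields a BOUNDED source for formats that extend Flink's
        // FileInputFormat, so this source is treated as CONTINUOUS_UNBOUNDED and
        // execute() fails, complaining about an UNBOUNDED source in BATCH mode.
        env.execute("batch-over-non-file-input-format");
    }
}
{code}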
I understand there is a {{DataSource}} API [3] that supports specifying the
boundedness of an input, but leveraging it would require connectors to revise
their APIs, which will take some time.
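For comparison, here is a minimal sketch of that API in use.
{{NumberSequenceSource}} is used only because it is a readily available bounded
{{Source}}; a migrated connector would plug in the same way, since the
boundedness is reported by the {{Source}} itself via {{Source#getBoundedness()}}:
{code:java}
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.lib.NumberSequenceSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Flip27BoundedSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // The Source declares its own boundedness via getBoundedness();
        // NumberSequenceSource reports BOUNDED, so BATCH execution is accepted.
        NumberSequenceSource source = new NumberSequenceSource(1, 1_000);
        DataStream<Long> numbers =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "bounded-sequence");

        numbers.print();
        env.execute("flip-27-bounded-source");
    }
}
{code}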
My organization is in the middle of migrating from the {{DataSet}} API to the
{{DataStream}} API, and we've found this to be a challenge, as nearly all of
our inputs are determined to be unbounded because the {{InputFormat}}s we use
are not {{FileInputFormat}}s. Our work-around was to apply a local patch to
{{StreamExecutionEnvironment}} adding a method that supports explicitly
bounded inputs.
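The patch itself is not attached here, but roughly, the added method could look
like the following hypothetical overload inside {{StreamExecutionEnvironment}}.
It assumes the existing internal
{{addSource(SourceFunction, String, TypeInformation, Boundedness)}} helper and
reuses {{InputFormatSourceFunction}}; the method and parameter names are
illustrative only:
{code:java}
// Hypothetical sketch, not the exact patch. Assumed imports:
//   org.apache.flink.api.common.io.InputFormat
//   org.apache.flink.api.common.typeinfo.TypeInformation
//   org.apache.flink.api.connector.source.Boundedness
//   org.apache.flink.streaming.api.datastream.DataStreamSource
//   org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction

/**
 * Creates a DataStream from the given InputFormat with the boundedness chosen by the
 * caller instead of being inferred from whether the format is a FileInputFormat.
 */
public <OUT> DataStreamSource<OUT> createInput(
        InputFormat<OUT, ?> inputFormat,
        TypeInformation<OUT> typeInfo,
        Boundedness boundedness) {

    InputFormatSourceFunction<OUT> function =
            new InputFormatSourceFunction<>(inputFormat, typeInfo);

    // Delegate to the internal addSource(...) overload that already accepts an
    // explicit Boundedness, so the rest of the translation stays unchanged.
    return addSource(function, "Custom Source", typeInfo, boundedness);
}
{code}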
As this helped us implement a Batch {{DataStream}} solution, perhaps it would
be helpful for others as well?
[1] [https://iceberg.apache.org/docs/latest/flink/#reading-with-datastream]
[2] [https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/dataset/hadoop_map_reduce/]
[3] [https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/sources/#the-data-source-api]