Hi all,
Thanks for your participation.
In this thread, we got one +1 each for option 1 and option 3. In the
original thread[1], we got two +1s for option 1, one +1 for option 2, and five
+1s and one -1 for option 3.
To summarize,
Option 1 (port side output to flatMap and deprecate split/select): three +1
Option 2 (introduce a new split/select and deprecate existing one): one +1
Option 3 ("correct" the existing split/select): six +1 and one -1
It seems that most people involved are in favor of "correcting" the existing
split/select. However, this will definitely break API compatibility, in a
subtle way.
IMO, the real behavior of consecutive split/select's has never been thoroughly
clarified. Even within the community, it is hard to say that we have reached a
consensus on its real semantics[2-4]. Though the initial design is not
ambiguous, there's no doubt that the concept has drifted.
As split/select is quite an old API, I have cc'ed this to more members. It
would be great if you could share your opinions on this.
Thanks,
Xingcan
[1]
https://lists.apache.org/thread.html/f94ea5c97f96c705527dcc809b0e2b69e87a4c5d400cb7c61859e1f4@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/FLINK-1772
[3] https://issues.apache.org/jira/browse/FLINK-5031
[4] https://issues.apache.org/jira/browse/FLINK-11084
> On Jul 5, 2019, at 12:04 AM, 杨力 <[email protected]> wrote:
>
> I prefer approach 1). I used to carry fields, which are needed only for
> splitting, in the outputs of flatMap functions. Replacing them with outputTags
> would simplify data structures.
>
> Xingcan Cui <[email protected]> wrote on Fri, Jul 5, 2019 at 2:20 AM:
> Hi folks,
>
> Two weeks ago, I started a thread [1] discussing whether we should discard
> the split/select methods (which have been marked as deprecated since v1.7)
> in the DataStream API.
>
> The fact is, these methods cause "unexpected" results when used
> consecutively (e.g., ds.split(a).select(b).split(c).select(d)) or multiple
> times on the same target (e.g., ds.split(a).select(b), ds.split(c).select(d)).
> The reason is that, following the initial design, the new split/select logic
> will always override the existing one on the same target operator, rather than
> append to it. Some users may not be aware of that, but if you are, a current
> solution would be to use the more powerful side output feature [2].
>
> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> added some
> restrictions to the existing split/select logic and suggested replacing it
> with side output in the future. However, considering that the side output is
> currently only available in the process function layer and that split/select
> may have been widely used in many real-world applications, we'd like to
> start a vote and listen to the community on how to deal with them.
>
> In the discussion thread [1], we proposed three solutions as follows. All of
> them are feasible but have different impacts on the public API.
>
> 1) Port the side output feature to DataStream API's flatMap and replace
> split/select with it.
>
> 2) Introduce a dedicated function in DataStream API (with the "correct"
> behavior but a different name) that can be used to replace the existing
> split/select.
>
> 3) Keep split/select but change the behavior/semantics to be "correct".
>
> Note that this is just a vote for gathering information, so feel free to
> participate and share your opinions.
>
> The voting time will end on July 7th 17:00 EDT.
>
> Thanks,
> Xingcan
>
> [1]
> https://lists.apache.org/thread.html/f94ea5c97f96c705527dcc809b0e2b69e87a4c5d400cb7c61859e1f4@%3Cdev.flink.apache.org%3E
> [2]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html