RE: About Deprecating split/select for DataStream API

xingcanc Mon, 17 Jun 2019 09:30:53 -0700

Hi all,

Thanks for sharing your thoughts on this topic.


First, we must admit that the current implementation for split/select is 
flawed. I roughly went through the source codes, the problem may be that for 
consecutive select/split(s), the former one will be overridden by the later one 
during StreamGraph generation phase. That's why we forbid this consecutive 
logic in FLINK-11084.

Now the question is whether we should guide users to migrate to the new side 
output feature or thoroughly rework the broken API with the correct semantics 
(instead of just trying to forbid all the "invalid" usages). 

Personally, I prefer the later solution because

1. The split/select may have been widely used without touching the broken part.
2. Though restricted compared with side output, the semantics for split/select 
itself is acceptable since union does not support different data types either.
3. We need a complete and easy-to-use transformation set for DataStream API. 
Enabling side output for flatMap may not be an ultimate solution.

To summarize, maybe we should not easily deprecate the split/select public API. 
If we come to a consensus on that, how about rewriting it based on side output? 
(like the implementation for join on coGroup)

Any feedback is welcome : )

Best,
Xingcan

-----Original Message-----
From: SHI Xiaogang <shixiaoga...@gmail.com> 
Sent: Monday, June 17, 2019 8:08 AM
To: Dawid Wysakowicz <dwysakow...@apache.org>
Cc: dev@flink.apache.org
Subject: Re: About Deprecating split/select for DataStream API

Hi Dawid,

Thanks a lot for your example.

I think most users will expect splitted1 to be empty in the example.

The unexpected results produced, in my opinion, is due to our problematic 
implementation, instead of the confusing semantics.
We can fix the problem if we add a SELECT operator to filter out unexpected 
records (Of course, we can find some optimization to improve the efficiency.).

After all, i prefer to fix the problems to make the results as expected.
What do you think?

Regards,
Xiaogang

Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午7:21写道：

> Yes you are correct. The problem I described applies to the split not 
> select as I wrote in the first email. Sorry for that.
>
> I will try to prepare a correct example. Let's have a look at this example:
>
>     val splitted1 = ds.split(if (1) then "a")
>
>     val splitted2 = ds.split(if (!=1) then "a")
>
> In those cases splitted1.select("a") -> will output all elements, the 
> same for splitted2, because the OutputSelector(s) are applied to 
> previous operator. The behavior I would assume is that splitted1 
> outputs only "1"s, whereas splitted2 all but "1"s
>
> On the other hand in a call
>
>     val splitted1 = ds.split(if ("1" or "2") then 
> "a").select("a").split(if ("3") then "b").select("b")
>
> I would assume an intersection of those two splits, so no results. 
> What actually happens is that it will be "1", "2" & "3"s. Actually, 
> right exceptions should be thrown in those cases not to produce 
> confusing results, but this just shows that this API is broken, if we 
> need to check for some prohibited configurations during runtime.
>
> Those weird behaviors are in my opinion results of the flawed API, as 
> it actually assigns an output selector to the previous operator. In 
> other words it modifies previous operator. I think it would be much 
> cleaner if this happened inside an operator rather than separately. 
> This is what SideOutputs do, as you define them inside the 
> ProcessFunction, rather than afterwards. Therefore I am very much in 
> favor of using them for those cases. Once again if the problem is that 
> they are available only in the ProcessFunction I would prefer enabling 
> them e.g. in FlatMap, rather than keeping the split/select.
>
>
>
> On 17/06/2019 09:40, SHI Xiaogang wrote:
> > Hi Dawid,
> >
> > As the select method is only allowed on SplitStreams, it's 
> > impossible to construct the example ds.split().select("a", "b").select("c", 
> > "d").
> >
> > Are you meaning ds.split().select("a", "b").split().select("c", "d")?
> > If so, then the tagging in the first split operation should not 
> > affect
> the
> > second one. Then
> >     splitted.select("a", "b") => empty
> >     splitted.select("c", "d") => ds
> >
> > I cannot quite catch your point here. It's appreciated if you can
> provide a
> > more concrete explanation?
> >
> > Regards,
> > Xiaogang Shi
> >
> >
> >
> >
> > Dawid Wysakowicz <dwysakow...@apache.org> 于2019年6月17日周一 下午3:10写道：
> >
> >> Hi all,
> >>
> >> Thank you for starting the discussion. To start with I have to say 
> >> I am not entirely against leaving them. On the other hand I totally 
> >> disagree that the semantics are clearly defined. Actually the 
> >> design is fundamentally flawed.
> >>
> >>    1. We use String as a selector for elements. This is not the cleanest
> >>    design, but I agree it is not the worst.
> >>    2. Users cannot define different types for different splits.
> >>    3. (The actual reason why I think it's actually better to drop the
> >>    split/select and introduce a better mechanism) The behavior of a
> split is
> >>    to actually add an output selector. We can have just a single
> selector on a
> >>    single operator, but the API allows (I would even say 
> >> encourages) to
> create
> >>    chains of split/select, which leads to undefined behavior. Take 
> >> this
> for
> >>    example: ds.split().select("a", "b").select("c", "d"). Which 
> >> tags
> should be
> >>    forwarded? ("a", "b", "c", "d") (union) or () (intersection). In 
> >> my
> opinion
> >>    the most obvious answer in this case would be the intersection. Let's
> >>    modify it slightly though and I would assume a different 
> >> behavior
> (the
> >>    union)
> >>
> >>            splitted = ds.split();
> >>
> >>            splitted.select("a", "b").map()
> >>
> >>            splitted.select("c", "d").map()
> >>
> >> Taking the 3rd argument into consideration I would be in favor of
> removing
> >> the current mechanism. I think the side outputs serve the purpose 
> >> much better with much cleaner semantics. I get the argument that 
> >> users are
> now
> >> forced to use processFunction if they want to use the side outputs. 
> >> If
> this
> >> is the main problem how about enabling them e.g. for flatMap as well?
> >>
> >> Best,
> >>
> >> Dawid
> >> On 17/06/2019 08:51, Jark Wu wrote:
> >>
> >> +1 to keep the split/select API. I think if there are some problems 
> >> +with
> >> the API, it's better to fix them instead of deprecating them.
> >> And select/split are straightforward and convenient APIs. It's 
> >> worth to have them.
> >>
> >> Regards,
> >> Jark
> >>
> >> On Mon, 17 Jun 2019 at 14:46, vino yang <yanghua1...@gmail.com> <
> yanghua1...@gmail.com> wrote:
> >>
> >>
> >> Hi,
> >>
> >> I also think it is valuable and reasonable to keep the split/select
> APIs.
> >> They are very convenient and widely used in our platform. I think 
> >> they
> are
> >> also used in other users' jobs.
> >> If the community has doubts about this, IMHO, it would be better to
> start a
> >> user survey.
> >>
> >> Best,
> >> Vino
> >>
> >> SHI Xiaogang <shixiaoga...@gmail.com> <shixiaoga...@gmail.com>
> 于2019年6月17日周一 上午11:55写道：
> >>
> >>
> >> Hi Xingcan,
> >>
> >> Thanks for bringing it up for discusson.
> >>
> >> I agree with you that we should not deprecate the split/select methods.
> >> Their semantics are very clear and they are widely adopted by Flink
> >>
> >> users.
> >>
> >> We should fix these problems instead of simply deprecating the methods.
> >>
> >> Regards,
> >> Xiaogang
> >>
> >> Xingcan Cui <xingc...@gmail.com> <xingc...@gmail.com> 于2019年6月15日周六
> 下午4:13写道：
> >>
> >>
> >> Hi all,
> >>
> >> Recently, I noticed that the split/select methods in DataStream API
> >>
> >> have
> >>
> >> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA 
> >> issue
> >> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <
> https://issues.apache.org/jira/browse/FLINK-11084>).
> >>
> >> Although the two methods can be replaced by the more powerful side
> >>
> >> output
> >>
> >> feature[1], I still doubt whether we should really remove them in 
> >> the future.
> >>
> >> 1. From semantics, the split/select is the reverse operation to the
> >>
> >> union
> >>
> >> transformation. Without them, the DataStream API seems to be 
> >> missing a piece.
> >>
> >> 2. From accessibility, the side output only works for process
> >>
> >> functions,
> >>
> >> which means it forces the user to dive into a lower API.
> >>
> >> According to FLINK-11084 <
> >> https://issues.apache.org/jira/browse/FLINK-11084> <
> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
> >> problems with the current implementation of the two methods. Maybe 
> >> we should fix the problems and re-active them again. Or if they 
> >> really
> >>
> >> need
> >>
> >> to
> >>
> >> be deprecated, we should at least mark the corresponding 
> >> documentation
> >>
> >> for
> >>
> >> that : )
> >>
> >> What do you think?
> >>
> >> Best,
> >> Xingcan
> >>
> >> [1]
> >>
> >>
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
> _output.html
> >>
> >> <
> >>
> >>
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
> _output.html
> >>
> >>
>
>

RE: About Deprecating split/select for DataStream API

Reply via email to