Re: [DISCUSS] FLIP-136: Improve interoperability between DataStream and Table API

Jark Wu Fri, 18 Sep 2020 02:05:07 -0700

Personally I think the fieldNames Map is confusing and not handy.
I just have an idea but not sure what you think.
What about adding a new constructor with List field names, this enables all
name-based setter/getters.
Regarding to List -> Map cost for every record, we can suggest users to
reuse the Row in the task.


new Row(int arity)
new Row(List<String> fieldNames)

final Row reuse = new Row(Arrays.asList("myField", "myOtherField"))
reuse.setField("myField", 12);
reuse.setField("myOtherField", "This is a test");

My point is that, if we can have a handy constructor for named Row, we may
not need to distinguish the named-only or positionAndNamed mode.
This can avoid (fast-fail) the potential problem when setting an invalid
field.

We can also come up with a new class for the field names which will
construct the Map and be shared among all Row instances.

What do you think?

Best,
Jark

On Thu, 17 Sep 2020 at 16:48, Timo Walther <[email protected]> wrote:

> Hi everyone,
>
> thanks for all the feedback. I updated the FLIP again on Thursday to
> integrate the feedback I got from Jingsong and Jark offline. In
> particular I updated the `Improve dealing with Row in DataStream API`
> section another time. We introduced static methods for Row that should
> make the semantics clear to users:
>
> // allows to use index-based setters and getters (equivalent to new
> Row(int))
> // method exists for completeness
> public static withPositions(int length);
>
> // allows to use name-based setters and getters
> public static withNames();
>
> // allows to use both name-based and position-based setters and getters
> public static withNamesAndPositions(Map<String, Integer> fieldNames);
>
> In any case, non of the existing methods will be deprecated and only
> additional functionality will be available through the methods above.
>
> I started a voting thread on Friday. Please feel free to vote.
>
> Regards,
> Timo
>
> On 10.09.20 10:21, Danny Chan wrote:
> > Thanks for driving this Timo, +1 for voting ~
> >
> > Best,
> > Danny Chan
> > 在 2020年9月10日 +0800 PM3:54，Timo Walther <[email protected]>，写道：
> >> Thanks everyone for this healthy discussion. I updated the FLIP with the
> >> outcome. I think the result is one of the last core API refactoring and
> >> users will be happy to have a consistent changelog support. Thanks for
> >> all the contributions.
> >>
> >> If there are no objections, I would continue with a voting.
> >>
> >> What do you think?
> >>
> >> Regards,
> >> Timo
> >>
> >>
> >> On 09.09.20 14:31, Danny Chan wrote:
> >>> Thanks, i'm fine with that.
> >>>
> >>>
> >>> Timo Walther <[email protected]> 于2020年9月9日周三 下午7:02写道：
> >>>
> >>>> I agree with Jark. It reduces confusion.
> >>>>
> >>>> The DataStream API doesn't know changelog processing at all. A
> >>>> DataStream of Row can be used with both `fromDataStream` and
> >>>> `fromChangelogStream`. But only the latter API will interpret it as a
> >>>> changelog something.
> >>>>
> >>>> And as I mentioned before, the `toChangelogStream` must work with Row
> >>>> otherwise users are confused due to duplicate records with a missing
> >>>> changeflag.
> >>>>
> >>>> I will update the FLIP-136 a last time. I hope we can then continue
> to a
> >>>> vote.
> >>>>
> >>>> Regards,
> >>>> Timo
> >>>>
> >>>>
> >>>> On 09.09.20 10:50, Danny Chan wrote:
> >>>>> I think it would bring in much confusion by a different API name just
> >>>> because the DataStream generic type is different.
> >>>>> If there are ChangelogMode that only works for Row, can we have a
> type
> >>>> check there ?
> >>>>>
> >>>>> Switch to a new API name does not really solve the problem well,
> people
> >>>> still need to declare the ChangelogMode explicitly, and there are some
> >>>> confusions:
> >>>>>
> >>>>> • Should DataStream of Row type always use #fromChangelogStream ?
> >>>>> • Does fromChangelogStream works for only INSERT ChangelogMode ?
> >>>>>
> >>>>>
> >>>>> Best,
> >>>>> Danny Chan
> >>>>> 在 2020年9月9日 +0800 PM4:21，Timo Walther <[email protected]>，写道：
> >>>>>> I had this in the inital design, but Jark had concerns at least for
> the
> >>>>>> `toChangelogStream(ChangelogMode)` (see earlier discussion).
> >>>>>>
> >>>>>> `fromDataStream(dataStream, schema, changelogMode)` would be
> possible.
> >>>>>>
> >>>>>> But in this case I would vote for a symmetric API. If we keep
> >>>>>> toChangelogStream we should also have a fromChangelogStream.
> >>>>>>
> >>>>>> And if we unify `toChangelogStream` and `toDataStream`, retractions
> >>>>>> cannot be represented for non-Rows and users will experience
> duplicate
> >>>>>> records with a missing changeflag.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Timo
> >>>>>>
> >>>>>>
> >>>>>> On 09.09.20 09:31, Danny Chan wrote:
> >>>>>>> “But I think the planner needs to
> >>>>>>> know whether the input is insert-only or not.”
> >>>>>>>
> >>>>>>> Does fromDataStream(dataStream, schema, changelogMode)
> >>>>>>>
> >>>>>>> solve your concerns ? People can pass around whatever ChangelogMode
> >>>> they like as an optional param.
> >>>>>>> By default: fromDataStream(dataStream, schema), the ChangelogMode
> is
> >>>> INSERT.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Danny Chan
> >>>>>>> 在 2020年9月9日 +0800 PM2:53，[email protected]，写道：
> >>>>>>>>
> >>>>>>>> But I think the planner needs to
> >>>>>>>> know whether the input is insert-only or not.
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
>

Re: [DISCUSS] FLIP-136: Improve interoperability between DataStream and Table API

Reply via email to