Re: [DISCUSS] FLIP-136: Improve interoperability between DataStream and Table API

Timo Walther Fri, 18 Sep 2020 07:14:17 -0700

Hi Jark,

the fieldNames map is not intended for users. I would also be fine tomake it a default scope constructor and access it with some internalutility class next to the Row class. The fieldNames map must only beused by serializers and converters. A user has no benefit in using it.

For the creation of new rows (without reusing, which only advanced usersusually do), I don't see a benefit of having:


final Row reuse = new Row(Arrays.asList("myField", "myOtherField"))
reuse.setField("myField", 12);
reuse.setField("myOtherField", "This is a test");

The purpose of Row.withName() is too create a Row easily and readablewithout declaring 50+ column names or dealing with indices in this range.

Personally, I would like to make Row an interface and have concrete rowimplementations for different purposes but this would break existingprograms too much.


What do you think?

Regards,
Timo


On 18.09.20 11:04, Jark Wu wrote:

Personally I think the fieldNames Map is confusing and not handy.
I just have an idea but not sure what you think.
What about adding a new constructor with List field names, this enables all
name-based setter/getters.
Regarding to List -> Map cost for every record, we can suggest users to
reuse the Row in the task.

new Row(int arity)
new Row(List<String> fieldNames)

final Row reuse = new Row(Arrays.asList("myField", "myOtherField"))
reuse.setField("myField", 12);
reuse.setField("myOtherField", "This is a test");

My point is that, if we can have a handy constructor for named Row, we may
not need to distinguish the named-only or positionAndNamed mode.
This can avoid (fast-fail) the potential problem when setting an invalid
field.

We can also come up with a new class for the field names which will
construct the Map and be shared among all Row instances.

What do you think?

Best,
Jark

On Thu, 17 Sep 2020 at 16:48, Timo Walther <[email protected]> wrote:

Hi everyone,

thanks for all the feedback. I updated the FLIP again on Thursday to
integrate the feedback I got from Jingsong and Jark offline. In
particular I updated the `Improve dealing with Row in DataStream API`
section another time. We introduced static methods for Row that should
make the semantics clear to users:

// allows to use index-based setters and getters (equivalent to new
Row(int))
// method exists for completeness
public static withPositions(int length);

// allows to use name-based setters and getters
public static withNames();

// allows to use both name-based and position-based setters and getters
public static withNamesAndPositions(Map<String, Integer> fieldNames);

In any case, non of the existing methods will be deprecated and only
additional functionality will be available through the methods above.

I started a voting thread on Friday. Please feel free to vote.

Regards,
Timo

On 10.09.20 10:21, Danny Chan wrote:

Thanks for driving this Timo, +1 for voting ~

Best,
Danny Chan
在 2020年9月10日 +0800 PM3:54，Timo Walther <[email protected]>，写道：

Thanks everyone for this healthy discussion. I updated the FLIP with the
outcome. I think the result is one of the last core API refactoring and
users will be happy to have a consistent changelog support. Thanks for
all the contributions.

If there are no objections, I would continue with a voting.

What do you think?

Regards,
Timo


On 09.09.20 14:31, Danny Chan wrote:

Thanks, i'm fine with that.


Timo Walther <[email protected]> 于2020年9月9日周三 下午7:02写道：

I agree with Jark. It reduces confusion.

The DataStream API doesn't know changelog processing at all. A
DataStream of Row can be used with both `fromDataStream` and
`fromChangelogStream`. But only the latter API will interpret it as a
changelog something.

And as I mentioned before, the `toChangelogStream` must work with Row
otherwise users are confused due to duplicate records with a missing
changeflag.

I will update the FLIP-136 a last time. I hope we can then continue

to a

vote.

Regards,
Timo


On 09.09.20 10:50, Danny Chan wrote:

I think it would bring in much confusion by a different API name just

because the DataStream generic type is different.

If there are ChangelogMode that only works for Row, can we have a

type

check there ?


Switch to a new API name does not really solve the problem well,

people

still need to declare the ChangelogMode explicitly, and there are some
confusions:


• Should DataStream of Row type always use #fromChangelogStream ?
• Does fromChangelogStream works for only INSERT ChangelogMode ?


Best,
Danny Chan
在 2020年9月9日 +0800 PM4:21，Timo Walther <[email protected]>，写道：

I had this in the inital design, but Jark had concerns at least for

the

`toChangelogStream(ChangelogMode)` (see earlier discussion).

`fromDataStream(dataStream, schema, changelogMode)` would be

possible.


But in this case I would vote for a symmetric API. If we keep
toChangelogStream we should also have a fromChangelogStream.

And if we unify `toChangelogStream` and `toDataStream`, retractions
cannot be represented for non-Rows and users will experience

duplicate

records with a missing changeflag.

Regards,
Timo


On 09.09.20 09:31, Danny Chan wrote:

“But I think the planner needs to
know whether the input is insert-only or not.”

Does fromDataStream(dataStream, schema, changelogMode)

solve your concerns ? People can pass around whatever ChangelogMode

they like as an optional param.

By default: fromDataStream(dataStream, schema), the ChangelogMode

is

INSERT.


Best,
Danny Chan
在 2020年9月9日 +0800 PM2:53，[email protected]，写道：


But I think the planner needs to
know whether the input is insert-only or not.

Re: [DISCUSS] FLIP-136: Improve interoperability between DataStream and Table API

Reply via email to