Thanks for the summary Timo ~

I want to clarify one point: what is the conclusion about fromChangelogStream 
and toChangelogStream? Should we use these names, or should we use 
fromDataStream with an optional ChangelogMode flag?

Best,
Danny Chan
On Sep 8, 2020, 8:22 PM +0800, Timo Walther <twal...@apache.org> wrote:
> Hi Danny,
>
> Your proposed signatures sound good to me.
>
> fromDataStream(dataStream, Schema)
> toDataStream(table, AbstractDataType<?>)
>
> They address all my concerns. The API would not be symmetric anymore,
> but this is fine with me. Others raised concerns about deprecating
> `fromDataStream(dataStream, Expression)`. Are they fine with this as well?
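>
> For illustration, a rough sketch of how the two methods could be used
> (assuming the proposed signatures; `tableEnv`, `schema`, and the POJO
> class `User` are just placeholders, not part of the FLIP text):
>
>   // attach a logical schema on top of the stream's physical type
>   Table table = tableEnv.fromDataStream(dataStream, schema);
>
>   // convert back; the target data type drives the produced class
>   DataStream<User> users = tableEnv.toDataStream(table, DataTypes.of(User.class));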
>
> If there are no objections, I would update the FLIP with the methods
> above. But let me briefly summarize my thoughts on this again, so that we
> are all on the same page:
> - The biggest discussion point seems to be fromInsertStream/toInsertStream.
> - I don’t have a strong opinion on naming; from/toDataStream would be
> fine for me. But:
> - It has slightly different type mappings and might break existing pipelines
> silently. This point can be neglected as the differences are only minor.
> - We need a way of declaring the rowtime attribute without declaring
> all columns again, to reduce manual schema work as much as possible (see
> the sketch after this list).
> - Both Dawid and I dislike the current “position based” or
> “name based” expression logic that looks like a projection but is not.
> - Actually, name-based expressions are not necessary, since we have
> positions for all new data types.
> - Schema is not suitable to influence the output type for toDataStream.
> It should be DataType.
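>
> As a sketch of the rowtime point above (the builder method names and the
> ts field are assumptions on my side, not the final API): only the
> event-time attribute and watermark are declared, the physical columns
> stay implicit:
>
>   Schema schema = Schema.newBuilder()
>       .columnByExpression("rowtime", "CAST(ts AS TIMESTAMP(3))")
>       .watermark("rowtime", "rowtime - INTERVAL '5' SECOND")
>       .build();
>   Table table = tableEnv.fromDataStream(dataStream, schema);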
>
> All items are solved by Danny's suggestion.
>
> Regards,
> Timo
>
>
>
> On 08.09.20 14:04, Danny Chan wrote:
> > Hi, Timo ~
> >
> > "It is not about changelog mode compatibility, it is about the type
> > compatibility.”
> >
> > For fromDataStream(dataStream, Schema), there should not be a compatibility 
> > problem or data type inconsistency. We know the logical type from the Schema 
> > and the physical type from the dataStream itself.
> >
> > For toDataStream(table, AbstractDataType<?>), we can get the logical type 
> > from the table itself
> > and the physical type from the passed data type.
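> >
> > As an illustration (a sketch only; the column names and types are placeholders):
> >
> >   DataStream<Tuple2<String, Integer>> stream = ...;  // physical type from the stream
> >   Table t = tableEnv.fromDataStream(
> >       stream,
> >       Schema.newBuilder()
> >           .column("f0", "STRING")
> >           .column("f1", "INT")
> >           .build());                                  // logical type from the Schema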
> >
> > If both behaviors are deterministic, what's the problem with type 
> > compatibility and safety?
> >
> > My concern is that in most cases people use the "insert stream" and do not 
> > need to care about the data stream's ChangelogMode, so there is no need to 
> > distinguish them in the APIs; an optional param is enough. If we introduce 
> > 2 new APIs there, people have to choose between them: can fromChangelogStream()
> > accept an insert stream? What is the behavior if fromInsertStream() 
> > accepts a changelog stream?
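> >
> > In other words, as a sketch of the two alternatives (the overload taking a
> > ChangelogMode is only my assumption of how an optional param could look):
> >
> >   // alternative 1: one entry point, changelog mode as an optional param
> >   tableEnv.fromDataStream(dataStream);                              // insert-only by default
> >   tableEnv.fromDataStream(rowStream, schema, ChangelogMode.all());  // hypothetical overload
> >
> >   // alternative 2 (current FLIP): separate methods per stream kind
> >   tableEnv.fromInsertStream(dataStream);
> >   tableEnv.fromChangelogStream(rowStream);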
> >
> >
> > "This means potentially duplicate definition of fields and their data types 
> > etc”
> >
> > I agree, because the table already has an underlying schema there.
> >
> > Best,
> > Danny Chan
> > On Sep 3, 2020, 8:12 PM +0800, Timo Walther <twal...@apache.org> wrote:
> > > Hi Danny,
> > >
> > > "if ChangelogMode.INSERT is the default, existing pipelines should be
> > > compatible"
> > >
> > > It is not about changelog mode compatibility, it is about the type
> > > compatibility. The renaming to `toInsertStream` is only to have a means
> > > of dealing with data type inconsistencies that could break existing
> > > pipelines.
> > >
> > > As the FLIP describes, the following new behavior should be implemented:
> > >
> > > - It does this by translating the TypeInformation to DataType.
> > > - This will happen with a new TypeInfoDataTypeConverter that will no
> > > longer produce LegacyTypeInformationType.
> > > - All types from DataStream API should be supported by this converter.
> > > - TupleTypeInfoBase will be translated into a proper RowType or
> > > StructuredType.
> > > - BigDecimals will be converted to DECIMAL(38,18) by default.
> > > - Composite types (tuples, POJOs, rows) will be flattened by default if
> > > they are used as top-level records (similar to the old behavior).
> > > - The order of POJO fields is determined by the DataTypeExtractor and
> > > must not be defined manually anymore (see the sketch after this list).
> > > - GenericTypeInfo is converted to RawType immediately by considering the
> > > current configuration.
> > > - A DataStream that originated from Table API will keep its DataType
> > > information due to ExternalTypeInfo implementing DataTypeQueryable.
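> > >
> > > As an example of the intended mapping (a sketch on my side, not part of
> > > the FLIP text):
> > >
> > >   public static class User {
> > >       public String name;
> > >       public java.math.BigDecimal amount;
> > >   }
> > >
> > >   // expected logical schema after conversion:
> > >   //   name   STRING
> > >   //   amount DECIMAL(38, 18)   -- the BigDecimal default
> > >   //
> > >   // the top-level POJO is flattened into these columns and the field
> > >   // order comes from the DataTypeExtractor, not from a manual declaration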
> > >
> > > I would feel safer if we do this under a new method name.
> > >
> > > "toDataStream(table, schema.bindTo(DataType))"
> > >
> > > This is what I meant by "integrate the DataType into the Schema class
> > > itself". Yes, we can do that if everybody is fine with it. But why
> > > should a user specify both a schema and a data type? This means
> > > potentially duplicate definitions of fields and their data types, etc.
> > >
> > > Regards,
> > > Timo
> > >
> > >
> > > On 03.09.20 11:31, Danny Chan wrote:
> > > > "It is a more conservative approach to introduce that in a
> > > > new method rather than changing the existing one under the hood and
> > > > potentially break existing pipelines silently”
> > > >
> > > > I like the idea actually, but if ChangelogMode.INSERT is the default, 
> > > > existing pipelines should be compatible. We can treat the other kinds of 
> > > > ChangelogMode as an extension.
> > > >
> > > > “for `toDataStream` users need to be
> > > > able to express whether they would prefer Row, POJO or atomic”
> > > >
> > > > I think in most cases people do not need to convert the stream to a 
> > > > Row or POJO, because the table projection always returns a flattened 
> > > > internal row. If people did want a POJO there, how about we bind the 
> > > > DataType to the existing schema, like this:
> > > >
> > > > toDataStream(table, schema.bindTo(DataType))
> > > >
> > > > Best,
> > > > Danny Chan
> > > > On Sep 3, 2020, 3:18 PM +0800, dev@flink.apache.org wrote:
> > > > >
> > > > > It is a more conservative approach to introduce that in a
> > > > > new method rather than changing the existing one under the hood and
> > > > > potentially break existing pipelines silently
> > > >
> > >
> >
>
