Re: [DISCUSS] FLIP-136: Improve interoperability between DataStream and Table API

Jark Wu Tue, 08 Sep 2020 23:49:12 -0700

Hi,

I'm +1 to use the naming of from/toDataStream, rather than
from/toInsertStream. So we don't need to deprecate the existing
`fromDataStream`.


I'm +1 to Danny's proposal: fromDataStream(dataStream, Schema) and
toDataStream(table, AbstractDataType<?>)

I think we can also keep the method `createTemporaryView(path,
DataStream<T>)`.
I don't have a strong opinion on deprecating fromDataStream(datastream,
exprs), but slightly prefer to keep them.

Best,
Jark

On Wed, 9 Sep 2020 at 14:34, Danny Chan <yuzhao....@gmail.com> wrote:

> Thanks for the summary Timo ~
>
> I want to clarify a little bit, what is the conclusion about the
> fromChangelogStream and toChangelogStream, should we use this name or we
> use fromDataStream with an optional ChangelogMode flag ?
>
> Best,
> Danny Chan
> 在 2020年9月8日 +0800 PM8:22，Timo Walther <twal...@apache.org>，写道：
> > Hi Danny,
> >
> > Your proposed signatures sound good to me.
> >
> > fromDataStream(dataStream, Schema)
> > toDataStream(table, AbstractDataType<?>)
> >
> > They address all my concerns. The API would not be symmetric anymore,
> > but this is fine with me. Others raised concerns about deprecating
> > `fromDataStream(dataStream, Expression)`. Are they fine with this as
> well?
> >
> > If there are no objections, I would update the FLIP with the methods
> > above. Bu let me briefly summarize my thoughts on this again, so that we
> > are all on the same page:
> > - The biggest discussion point seems the fromInsertStream/toInsertStream.
> > - I don’t have a strong opinion on naming, from/toDataStream would be
> > fine for me. But:
> > - It slightly different type mappings and might break existing pipelines
> > silently. This point can be neglected as the differences are only minor.
> > - We need a way of declaring the rowtime attribute but without declaring
> > all columns again. Reduce manual schema work as much as possible.
> > - Both Dawid and I don’t like the current either “position based” or
> > “name based” expression logic that looks like a projection but is not.
> > - Actually name based expressions are not necessary, since we have
> > positions for all new data types.
> > - Schema is not suitable to influence the output type for toDataStream.
> > It should be DataType.
> >
> > All items are solved by Danny's suggestion.
> >
> > Regards,
> > Timo
> >
> >
> >
> > On 08.09.20 14:04, Danny Chan wrote:
> > > Hi, Timo ~
> > >
> > > "It is not about changelog mode compatibility, it is about the type
> > > compatibility.”
> > >
> > > For fromDataStream(dataStream, Schema), there should not be
> compatibility problem or data type inconsistency. We know the logical type
> from Schema and physical type from the dataStream itself.
> > >
> > > For toDataStream(table, AbstractDataType<?>), we can get the logical
> type from the table itself
> > > and the physical type from the passed data type.
> > >
> > > If both behavior are deterministic, what's the problem for type
> compatibility and safety?
> > >
> > > My concern is that in most of the cases, people use the "insert
> stream", they do not need to care about
> > > the data stream ChangelogMode, so there is no need to distinguish them
> from the APIs, an optional param is enough. If we introduces 2 new API
> there, people have to choose between them, and can fromChangelogStream()
> > > accept an insert stream ? What is the behavior if fromInsertStream()
> accepts a changelog stream ?
> > >
> > >
> > > "This means potentially duplicate definition of fields and their data
> types etc”
> > >
> > > I agree, because table already has an underlying schema there.
> > >
> > > Best,
> > > Danny Chan
> > > 在 2020年9月3日 +0800 PM8:12，Timo Walther <twal...@apache.org>，写道：
> > > > Hi Danny,
> > > >
> > > > "if ChangelogMode.INSERT is the default, existing pipelines should be
> > > > compatible"
> > > >
> > > > It is not about changelog mode compatibility, it is about the type
> > > > compatibility. The renaming to `toInsertStream` is only to have a
> mean
> > > > of dealing with data type inconsistencies that could break existing
> > > > pipelines.
> > > >
> > > > As the FLIP describes, the following new behavior should be
> implemented:
> > > >
> > > > - It does this by translating the TypeInformation to DataType.
> > > > - This will happen with a new TypeInfoDataTypeConverter that will no
> > > > longer produce LegacyTypeInformationType.
> > > > - All types from DataStream API should be supported by this
> converter.
> > > > - TupleTypeInfoBase will be translated into a proper RowType or
> > > > StructuredType.
> > > > - BigDecimals will be converted to DECIMAL(38,18) by default.
> > > > - Composite types (tuples, POJOs, rows) will be flattened by default
> if
> > > > they are used as top-level records (similar to the old behavior).
> > > > - The order of POJO field's is determined by the DataTypeExtractor
> and
> > > > must not be defined manually anymore.
> > > > - GenericTypeInfo is converted to RawType immediately by considering
> the
> > > > current configuration.
> > > > - A DataStream that originated from Table API will keep its DataType
> > > > information due to ExternalTypeInfo implementing DataTypeQueryable.
> > > >
> > > > I would feel safer if we do this under a new method name.
> > > >
> > > > "toDataStream(table, schema.bindTo(DataType))"
> > > >
> > > > This is what I meant with "integrate the DataType into the Schema
> class
> > > > itself". Yes, we can do that if everybody is fine with it. But why
> > > > should a user specify both a schema and a data type? This means
> > > > potentially duplicate definition of fields and their data types etc.
> > > >
> > > > Regards,
> > > > Timo
> > > >
> > > >
> > > > On 03.09.20 11:31, Danny Chan wrote:
> > > > > "It is a more conservative approach to introduce that in a
> > > > > new method rather than changing the existing one under the hood and
> > > > > potentially break existing pipelines silently”
> > > > >
> > > > > I like the idea actually, but if ChangelogMode.INSERT is the
> default, existing pipelines should be compatible. We can see the other
> kinds of ChangelogMode as an extension.
> > > > >
> > > > > “for `toDataStream` users need to be
> > > > > able to express whether they would prefer Row, POJO or atomic”
> > > > >
> > > > > I think most of the cases people do not need to convert the stream
> to a Row or POJO, because the table projection always returns a flatternned
> internal row, if people did want a POJO there, how about we bind the
> DataType to the existing schema, like this
> > > > >
> > > > > toDataStream(table, schema.bindTo(DataType))
> > > > >
> > > > > Best,
> > > > > Danny Chan
> > > > > 在 2020年9月3日 +0800 PM3:18，dev@flink.apache.org，写道：
> > > > > >
> > > > > > It is a more conservative approach to introduce that in a
> > > > > > new method rather than changing the existing one under the hood
> and
> > > > > > potentially break existing pipelines silently
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-136: Improve interoperability between DataStream and Table API

Reply via email to