Thanks Timo for updating the formats section. That will be very helpful
for changelog support (FLIP-105).

I just left 2 minor comments about some method names. In general, I'm +1 to
start a vote.

--------------------------------------------------------------------------------------------------

Hi Becket,

I agree we shouldn't duplicate code, especially the runtime
implementations.
However, the interfaces proposed by FLIP-95 are mainly used during
optimization (compiling), not at runtime.
I don't think there is much to share there, because Table/SQL
is declarative while DataStream is imperative.
Take filter push down as an example: a DataStream FilterableSource might
accept a FilterFunction (which is a black box to the source).
Table sources, however, need to pick out the pushed filter expressions, and
some sources may only support "=", "<", ">" conditions.
Pushing an opaque FilterFunction doesn't work in the table ecosystem. That
means the connectors have to have some table-specific implementations.
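
To make that concrete, here is a rough sketch (none of the names below are
the actual FLIP-95 interfaces; they only illustrate the declarative vs.
imperative difference):

import java.util.ArrayList;
import java.util.List;

// Hypothetical shapes, only for illustration:
interface FilterExpression {
    boolean isSimpleComparison();  // "=", "<", ">"
}

interface TableFilterPushDown {
    // Receives resolved expressions and returns the ones it could NOT consume.
    List<FilterExpression> applyFilters(List<FilterExpression> filters);
}

class ExampleTableSource implements TableFilterPushDown {
    private final List<FilterExpression> pushed = new ArrayList<>();

    public List<FilterExpression> applyFilters(List<FilterExpression> filters) {
        List<FilterExpression> remaining = new ArrayList<>();
        for (FilterExpression e : filters) {
            if (e.isSimpleComparison()) {
                pushed.add(e);       // evaluated inside the source, e.g. to skip partitions
            } else {
                remaining.add(e);    // the planner keeps a Filter node for these
            }
        }
        return remaining;
    }
}

// A DataStream-style FilterableSource would only see an opaque function,
// e.g. source.applyFilter(row -> ...), which it cannot inspect in order to
// skip files, row groups, or partitions.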


Best,
Jark

On Tue, 24 Mar 2020 at 20:41, Kurt Young <ykt...@gmail.com> wrote:

> Hi Becket,
>
> I don't think DataStream should see SQL-specific concepts such as
> Filtering or ComputedColumn. It's better to stay within the SQL area and
> translate them into more generic concepts when translating to the
> DataStream/runtime layer, e.g. using a MapFunction to represent computed
> column logic.
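>
> Just to sketch what I mean (this is not a concrete proposal, only an
> illustration): a computed column such as `cost AS price * quantity` could
> be handed to the runtime as a plain MapFunction, so the DataStream layer
> never sees the SQL expression itself.
>
> import org.apache.flink.api.common.functions.MapFunction;
> import org.apache.flink.types.Row;
>
> class ComputedColumnExample {
>     // Generated by the planner from the column expression; the runtime only
>     // sees a generic MapFunction<Row, Row> that appends the computed field.
>     static final MapFunction<Row, Row> COST_COLUMN = row -> {
>         double price = (double) row.getField(0);
>         long quantity = (long) row.getField(1);
>         return Row.of(price, quantity, price * quantity);
>     };
> }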
>
> Best,
> Kurt
>
>
> On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <becket....@gmail.com> wrote:
>
> > Hi Timo and Dawid,
> >
> > It's really great that we have the same goal. I am actually wondering if
> > we can go one step further to avoid some of the interfaces in Table as
> > well.
> >
> > For example, if we have the FilterableSource, do we still need the
> > FilterableTableSource? Should DynamicTableSource just become a
> > Source<Row, SourceSplitT, EnumChkT>?
> >
> > Can you help me understand a bit more about the reason we need the
> > following relational representation / wrapper interfaces vs. the
> > interfaces that we could put on the Source in FLIP-27?
> >
> > DynamicTableSource vs. Source<Row, SourceSplitT, EnumChkT>
> > SupportsFilterablePushDown vs. FilterableSource
> > SupportsProjectablePushDown vs. ProjectableSource
> > SupportsWatermarkPushDown vs. WithWatermarkAssigner
> > SupportsComputedColumnPushDown vs. ComputedColumnDeserializer
> > ScanTableSource vs. ChangeLogDeserializer
> > LookUpTableSource vs. LookUpSource
> >
> > Assuming we have all the interfaces on the right side, do we still need
> > the interfaces on the left side? Note that the interfaces on the right
> > can be used by both DataStream and Table. If we do this, there will be
> > only one set of Source interfaces for Table and DataStream; the only
> > difference is that the Source for Table will have some specific plugins
> > and configurations. An omnipotent Source could implement all the above
> > interfaces and take a Deserializer that implements both
> > ComputedColumnDeserializer and ChangeLogDeserializer.
> >
> > Would the SQL planner work with that?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> >
> >
> >
> > On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <jingsongl...@gmail.com>
> > wrote:
> >
> > > +1. Thanks Timo for the design doc.
> > >
> > > We could also consider @Experimental. But I am +1 to @PublicEvolving;
> > > we should be confident in the current change.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <twal...@apache.org>
> wrote:
> > >
> > > > @Becket: We totally agree that we don't need table-specific connectors
> > > > at runtime. As Dawid said, the interfaces proposed here are just for
> > > > communication with the planner. Once the properties (watermarks,
> > > > computed columns, filters, projection, etc.) are negotiated, we can
> > > > configure a regular Flink connector.
> > > >
> > > > E.g. setting the watermark assigner and deserialization schema of a
> > > > Kafka connector.
> > > >
> > > > For better separation of concerns, Flink connectors should not include
> > > > relational interfaces or depend on flink-table. That is the
> > > > responsibility of the table sources/sinks.
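> > > >
> > > > To sketch the Kafka example above (just an illustration; the class
> > > > below is a placeholder and not part of the proposal): the table source
> > > > collects the negotiated properties and only at the very end builds a
> > > > plain connector from them.
> > > >
> > > > import java.util.Properties;
> > > > import org.apache.flink.api.common.serialization.DeserializationSchema;
> > > > import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
> > > > import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
> > > > import org.apache.flink.types.Row;
> > > >
> > > > class KafkaTableSourceSketch {
> > > >     // negotiated with the planner via the Supports* interfaces:
> > > >     private DeserializationSchema<Row> format;                     // may include computed columns
> > > >     private AssignerWithPeriodicWatermarks<Row> watermarkAssigner;  // pushed-down watermarks
> > > >     private Properties kafkaProps;
> > > >
> > > >     FlinkKafkaConsumer<Row> toRuntimeConnector(String topic) {
> > > >         FlinkKafkaConsumer<Row> consumer =
> > > >             new FlinkKafkaConsumer<>(topic, format, kafkaProps);
> > > >         consumer.assignTimestampsAndWatermarks(watermarkAssigner);
> > > >         return consumer;  // a regular connector without any flink-table dependency
> > > >     }
> > > > }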
> > > >
> > > > @Kurt: I would like to mark them @PublicEvolving already because we
> > > > need to deprecate the old interfaces as early as possible. We cannot
> > > > redirect to @Internal interfaces. They are not marked @Public, so we
> > > > can still evolve them. But a core design shift should not happen again;
> > > > it would leave a bad impression if we redesign over and over again.
> > > > Instead, we should be confident in the current change.
> > > >
> > > > Regards,
> > > > Timo
> > > >
> > > >
> > > > On 24.03.20 09:20, Dawid Wysakowicz wrote:
> > > > > Hi Becket,
> > > > >
> > > > > Answering your question, we have the same intention not to duplicate
> > > > > connectors between the DataStream and Table APIs. The interfaces
> > > > > proposed in the FLIP are a way to describe the relational properties
> > > > > of a source. The intention is, as you described, to translate all of
> > > > > those properties, expressed as expressions or other Table-specific
> > > > > structures, into a DataStream source. In other words, I think what we
> > > > > are doing here is in line with what you described.
> > > > >
> > > > > Best,
> > > > >
> > > > > Dawid
> > > > >
> > > > > On 24/03/2020 02:23, Becket Qin wrote:
> > > > >> Hi Timo,
> > > > >>
> > > > >> Thanks for the proposal. I completely agree that the current Table
> > > > >> connectors could be simplified quite a bit. I haven't finished
> > > > >> reading everything, but here are some quick thoughts.
> > > > >>
> > > > >> Actually, to me the biggest question is: why should there be two
> > > > >> different connector systems for DataStream and Table? What is the
> > > > >> fundamental reason preventing us from merging them into one?
> > > > >>
> > > > >> The basic functionality of a connector is to provide capabilities to
> > > > >> do IO and Serde. Conceptually, Table connectors should just be
> > > > >> DataStream connectors that are dealing with Rows. It seems that quite
> > > > >> a few of the special connector requirements are just a specific way
> > > > >> to do IO / Serde.
> > > > >> Taking SupportsFilterPushDown as an example, imagine we have the
> > > > >> following interface:
> > > > >>
> > > > >> interface FilterableSource<PREDICATE> {
> > > > >>      void applyFilterable(Supplier<PREDICATE> predicate);
> > > > >> }
> > > > >>
> > > > >> And if a ParquetSource would like to support filtering, it would
> > > > >> become:
> > > > >>
> > > > >> class ParquetSource implements Source,
> > > > >> FilterableSource<FilterPredicate> {
> > > > >>      ...
> > > > >> }
> > > > >>
> > > > >> For Table, one just needs to provide a predicate supplier that
> > > > >> converts an Expression to the specified predicate type. This has a
> > > > >> few benefits:
> > > > >> 1. The same unified filter API for sources, regardless of DataStream
> > > > >> or Table.
> > > > >> 2. DataStream users can now also use the ExpressionToPredicate
> > > > >> supplier if they want to.
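> > > > >>
> > > > >> A minimal sketch of such a supplier for the ParquetSource above (the
> > > > >> helper methods are made up; a real converter would handle many more
> > > > >> expression shapes):
> > > > >>
> > > > >> import java.util.function.Supplier;
> > > > >> import org.apache.flink.table.expressions.CallExpression;
> > > > >> import org.apache.parquet.filter2.predicate.FilterApi;
> > > > >> import org.apache.parquet.filter2.predicate.FilterPredicate;
> > > > >>
> > > > >> // Converts a Table expression like `a > 10` into a Parquet predicate.
> > > > >> Supplier<FilterPredicate> expressionToPredicate(CallExpression greaterThan) {
> > > > >>     String column = extractColumnName(greaterThan);  // hypothetical helper
> > > > >>     int literal = extractIntLiteral(greaterThan);    // hypothetical helper
> > > > >>     return () -> FilterApi.gt(FilterApi.intColumn(column), literal);
> > > > >> }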
> > > > >>
> > > > >> To summarize, my main point is that I am wondering if it is possible
> > > > >> to have a single set of connector interfaces for both Table and
> > > > >> DataStream, rather than having two hierarchies. I am not 100% sure if
> > > > >> this would work, but if it does, it would be a huge win from both a
> > > > >> code maintenance and a user experience perspective.
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Jiangjie (Becket) Qin
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
> > > > dwysakow...@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi Timo,
> > > > >>>
> > > > >>> Thank you for the proposal. I think it is an important
> improvement
> > > that
> > > > >>> will benefit many parts of the Table API. The proposal looks
> really
> > > > good
> > > > >>> to me and personally I would be comfortable with voting on the
> > > current
> > > > >>> state.
> > > > >>>
> > > > >>> Best,
> > > > >>>
> > > > >>> Dawid
> > > > >>>
> > > > >>> On 23/03/2020 18:53, Timo Walther wrote:
> > > > >>>> Hi everyone,
> > > > >>>>
> > > > >>>> I received some questions around how the new interfaces play
> > > together
> > > > >>>> with formats and their factories.
> > > > >>>>
> > > > >>>> Furthermore, for MySQL or Postgres CDC logs, the format should
> be
> > > able
> > > > >>>> to return a `ChangelogMode`.
> > > > >>>>
> > > > >>>> Also, I incorporated the feedback around the factory design in
> > > > general.
> > > > >>>>
> > > > >>>> I added a new section `Factory Interfaces` to the design document.
> > > > >>>> This should be helpful for understanding the big picture and how
> > > > >>>> the concepts connect.
> > > > >>>>
> > > > >>>> Please let me know what you think.
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Timo
> > > > >>>>
> > > > >>>>
> > > > >>>> On 18.03.20 13:43, Timo Walther wrote:
> > > > >>>>> Hi Benchao,
> > > > >>>>>
> > > > >>>>> this is a very good question. I will update the FLIP about
> this.
> > > > >>>>>
> > > > >>>>> The legacy planner will not support the new interfaces. It will
> > > only
> > > > >>>>> support the old interfaces. With the next release, I think the
> > > Blink
> > > > >>>>> planner is stable enough to be the default one as well.
> > > > >>>>>
> > > > >>>>> Regards,
> > > > >>>>> Timo
> > > > >>>>>
> > > > >>>>> On 18.03.20 08:45, Benchao Li wrote:
> > > > >>>>>> Hi Timo,
> > > > >>>>>>
> > > > >>>>>> Thank you and others for the efforts to prepare this FLIP.
> > > > >>>>>>
> > > > >>>>>> The FLIP LGTM generally.
> > > > >>>>>>
> > > > >>>>>> +1 for moving the Blink data structures to table-common; they
> > > > >>>>>> will be useful for UDFs too in the future.
> > > > >>>>>> A small question: do we plan to support the new interfaces and
> > > > >>>>>> data types in the legacy planner, or only in the Blink planner?
> > > > >>>>>>
> > > > >>>>>> And using primary keys from the DDL instead of key information
> > > > >>>>>> derived from each query is also a good idea; we have met use
> > > > >>>>>> cases before where this did not work very well.
> > > > >>>>>>
> > > > >>>>>> This FLIP also makes the dependencies of the table modules much
> > > > >>>>>> clearer; I like it very much.
> > > > >>>>>>
> > > > >>>>>> On Tue, Mar 17, 2020 at 1:36 AM Timo Walther <twal...@apache.org> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hi everyone,
> > > > >>>>>>>
> > > > >>>>>>> I'm happy to present the results of long discussions that we
> > had
> > > > >>>>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and
> many
> > > > more
> > > > >>>>>>> have contributed to this design document.
> > > > >>>>>>>
> > > > >>>>>>> We would like to propose new long-term table source and table
> > > sink
> > > > >>>>>>> interfaces:
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> > > > >>>>>>>
> > > > >>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
> > > > >>>>>>>
> > > > >>>>>>> The goals of this FLIP are:
> > > > >>>>>>>
> > > > >>>>>>> - Simplify the current interface architecture:
> > > > >>>>>>>        - Merge upsert, retract, and append sinks.
> > > > >>>>>>>        - Unify batch and streaming sources.
> > > > >>>>>>>        - Unify batch and streaming sinks.
> > > > >>>>>>>
> > > > >>>>>>> - Allow sources to produce a changelog:
> > > > >>>>>>>        - UpsertTableSources have been requested a lot by users.
> > > > >>>>>>> Now is the time to open the internal planner capabilities via
> > > > >>>>>>> the new interfaces.
> > > > >>>>>>>        - According to FLIP-105, we would like to support
> > > > >>>>>>> changelogs for processing formats such as Debezium.
> > > > >>>>>>>
> > > > >>>>>>> - Don't rely on DataStream API for sources and sinks:
> > > > >>>>>>>        - According to FLIP-32, the Table API and SQL should be
> > > > >>>>>>> independent of the DataStream API, which is why the
> > > > >>>>>>> `table-common` module has no dependencies on
> > > > >>>>>>> `flink-streaming-java`.
> > > > >>>>>>>        - Source and sink implementations should only depend on
> > > > >>>>>>> the `table-common` module after FLIP-27.
> > > > >>>>>>>        - Until FLIP-27 is ready, we still put most of the
> > > > >>>>>>> interfaces in `table-common` and strictly separate the
> > > > >>>>>>> interfaces that communicate with the planner from the actual
> > > > >>>>>>> runtime readers/writers.
> > > > >>>>>>>
> > > > >>>>>>> - Implement efficient sources and sinks without planner
> > > > >>>>>>> dependencies:
> > > > >>>>>>>        - Make Blink's internal data structures available to
> > > > >>>>>>> connectors.
> > > > >>>>>>>        - Introduce stable interfaces for data structures that
> > > > >>>>>>> can be marked as `@PublicEvolving`.
> > > > >>>>>>>        - Only require dependencies on `flink-table-common` in
> > > > >>>>>>> the future.
> > > > >>>>>>>
> > > > >>>>>>> It finalizes the concept of dynamic tables and considers how all
> > > > >>>>>>> source/sink related classes play together.
> > > > >>>>>>>
> > > > >>>>>>> We look forward to your feedback.
> > > > >>>>>>>
> > > > >>>>>>> Regards,
> > > > >>>>>>> Timo
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>
> > > > >
> > > >
> > > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
>
