Re: [DISCUSS] FLIP-146: Improve new TableSource and TableSink interfaces

Jingsong Li Thu, 24 Sep 2020 05:18:39 -0700

Thanks Kurt and Flavio for your feedback.

To Kurt:

> Briefly introduce use cases and what kind of options are needed in your
opinion.

In the "Choose Scan Parallelism" chapter:
- I explained the user cases
- I adjusted the relationship to make user specified parallelism more
convenient

To Flavio:

Yes, you can configure `scan.infer-parallelism.max` or directly use
`scan.parallelism`.

To Kurt:

> Regarding the interface `DataStreamScanProvider`, a concrete example would
help the discussion.

>From the user feedback in [1]. There are two users have similar following
feedback (CC: liu shouwei):
(It is from user-zh, Translate to English)

Briefly introduce the background. One of the tasks of our group is that
users write SQL on the page. We are responsible for converting SQL
processing into Flink jobs and running them on our platform. The conversion
process depends on our SQL SDK.
Let me give you a few examples that we often use and feel that the new API
1.11 is not easy to implement:
1. We now have a specific Kafka data format. One Kafka data will be
converted into n (n is a positive integer) row data. Our approach is to add
a process / flatmap phase to emit datastream to deal with this situation,
which is transparent to users.
2. At present, we have encapsulated some of our own sinks. We will add a
process / filter before the sink to perform buffer pool / micro batch /
data filtering functions.
3. Adjust or specify the source / sink parallelism to the user specified
value. We also do this on the datastream level.
4. For some special source sinks, they will be combined with keyby
operations (transparent to users). We also do this on the datastream level.

For example, in question 2 above, we can implement it in the sinkfunction,
but I personally think it may not be ideal in design. When designing and
arranging functions / operators, I am used to following the principle of
"single responsibility of operators". This is why I split multiple process
/ filter operators in front of sink functions instead of coupling these
functions to sink functions. On the other hand, without datastream, the
cost of migrating to the new API is relatively higher. Or, we have some
special reasons. When operators are arranged, we will modify the task chain
strategy. At this time, the flexibility of datastream is essential.

[1]https://issues.apache.org/jira/browse/FLINK-18674

Best,
Jingsong

On Thu, Sep 24, 2020 at 4:15 PM Kurt Young <ykt...@gmail.com> wrote:

> Yeah, JDBC is definitely a popular use case we should consider.
>
> Best,
> Kurt
>
>
> On Thu, Sep 24, 2020 at 4:11 PM Flavio Pompermaier <pomperma...@okkam.it>
> wrote:
>
> > Hi Kurt, in the past we had a very interesting use case in this regard:
> our
> > customer (oracle) db was quite big and running too many queries in
> parallel
> > was too heavy and it was causing the queries to fail.
> > So we had to limit the source parallelism to 2 threads. After the
> fetching
> > of the data the other operators could use the max parallelism as usual..
> >
> > Best,
> > Flavio
> >
> > On Thu, Sep 24, 2020 at 9:59 AM Kurt Young <ykt...@gmail.com> wrote:
> >
> > > Thanks Jingsong for driving this, this is indeed a useful feature and
> > lots
> > > of users are asking for it.
> > >
> > > For setting a fixed source parallelism, I'm wondering whether this is
> > > necessary. For kafka,
> > > I can imagine users would expect Flink will use the number of
> partitions
> > as
> > > the parallelism. If it's too
> > > large, one can use the max parallelism to make it smaller.
> > > But for ES, which doesn't have ability to decide a reasonable
> parallelism
> > > on its own, it might make sense
> > > to introduce a user specified parallelism for such table source.
> > >
> > > So I think it would be better to reorganize the document a little bit,
> to
> > > explain the connectors one by one. Briefly
> > > introduce use cases and what kind of options are needed in your
> opinion.
> > >
> > > Regarding the interface `DataStreamScanProvider`, a concrete example
> > would
> > > help the discussion. What kind
> > > of scenarios do you want to support? And what kind of connectors need
> > such
> > > an interface?
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Wed, Sep 23, 2020 at 9:30 PM admin <17626017...@163.com> wrote:
> > >
> > > > +1,it’s a good news
> > > >
> > > > > 2020年9月23日 下午6:22，Jingsong Li <jingsongl...@gmail.com> 写道：
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to start a discussion about improving the new TableSource
> > and
> > > > > TableSink interfaces.
> > > > >
> > > > > Most connectors have been migrated to FLIP-95, but there are still
> > the
> > > > > Filesystem and Hive that have not been migrated. They have some
> > > > > requirements on table connector API. And users also have some
> > > additional
> > > > > requirements：
> > > > > - Some connectors have the ability to infer parallelism, the
> > > parallelism
> > > > is
> > > > > good for most cases.
> > > > > - Users have customized parallelism configuration requirements for
> > > source
> > > > > and sink.
> > > > > - The connectors need to use topology to build their source/sink
> > > instead
> > > > of
> > > > > a single function. Like JIRA[1], Partition Commit feature and File
> > > > > Compaction feature.
> > > > >
> > > > > Details are in [2].
> > > > >
> > > > > [1]https://issues.apache.org/jira/browse/FLINK-18674
> > > > > [2]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-146%3A+Improve+new+TableSource+and+TableSink+interfaces
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > >
> > > >
> >
>

-- 
Best, Jingsong Lee

Re: [DISCUSS] FLIP-146: Improve new TableSource and TableSink interfaces

Reply via email to