Hi everyone,

Sorry for reviving this thread at this point in time. Generally, I think
this is a very valuable effort. Have we considered only providing a very
basic data generator (plus discarding and printing sink tables) in Apache
Flink and moving a more comprehensive data-generating table source to an
ecosystem project promoted on flink-packages.org? I think this has a lot of
potential (e.g. in combination with Java Faker [1]), but it would probably
be better served in a small, separately maintained repository.
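
For example, such an ecosystem connector could expose Faker generators
through table options, along these lines (a purely hypothetical sketch;
the connector name and option keys are made up for illustration):

CREATE TABLE fake_users (
    name STRING,
    address STRING
) WITH (
    'connector.type' = 'faker',
    'field.name.expression' = 'Name.fullName',
    'field.address.expression' = 'Address.fullAddress'
)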

Cheers,

Konstantin

[1] https://github.com/DiUS/java-faker


On Tue, Mar 24, 2020 at 9:10 AM Jingsong Li <jingsongl...@gmail.com> wrote:

> Hi all,
>
> I created https://issues.apache.org/jira/browse/FLINK-16743 for follow-up
> discussion. FYI.
>
> Best,
> Jingsong Lee
>
> On Tue, Mar 24, 2020 at 2:20 PM Bowen Li <bowenl...@gmail.com> wrote:
>
> > I agree with Jingsong that sink schema inference and system tables can be
> > considered later. I wouldn't recommend tackling them just for the sake of
> > simplifying the user experience to the extreme. Providing the handy source
> > and sink implementations above already offers users a ton of immediate
> > value.
> >
> >
> > On Mon, Mar 23, 2020 at 20:20 Jingsong Li <jingsongl...@gmail.com> wrote:
> >
> > > Hi Benchao,
> > >
> > > > do you think we need to add more columns with various types?
> > >
> > > I didn't list all types, but we should support primitive types, varchar,
> > > Decimal, Timestamp, etc.
> > > This can be done incrementally.
> > >
> > > Hi Benchao, Jark,
> > > About console and blackhole: yes, they can have no schema, since the
> > > schema can be inferred from the upstream node.
> > > - But right now we don't have a mechanism for this kind of configurable
> > > sink.
> > > - If we want to support it, we need a single approach that covers both
> > > sinks.
> > > - And users can use "CREATE TABLE ... LIKE" and other means to simplify
> > > the DDL, as in the sketch below.
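> > >
> > > For example, something like this (just a sketch, assuming the
> > > "CREATE TABLE ... LIKE" syntax currently under discussion):
> > >
> > > CREATE TABLE my_blackhole WITH (
> > >     'connector.type' = 'blackhole'
> > > ) LIKE source_table (EXCLUDING OPTIONS);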
> > >
> > > And for providing system/registered tables (`console` and `blackhole`):
> > > - I have no strong opinion on these system tables. In SQL it would be
> > > "INSERT INTO blackhole SELECT a /*int*/, b /*string*/ FROM tableA" and
> > > "INSERT INTO blackhole SELECT a /*double*/, b /*Map*/, c /*string*/ FROM
> > > tableB". That makes blackhole a universal table accepting any schema,
> > > which intuitively feels wrong to me.
> > > - Can users override these tables? If so, we need to ensure they can be
> > > overridden by catalog tables.
> > >
> > > So I think we can leave these system tables to the future, too.
> > > What do you think?
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Mon, Mar 23, 2020 at 4:44 PM Jark Wu <imj...@gmail.com> wrote:
> > >
> > > > Hi Jingsong,
> > > >
> > > > Regarding (2) and (3), I was thinking we could skip the manual DDL
> > > > work, so users can use them directly:
> > > >
> > > > # this will log results to `.out` files
> > > > INSERT INTO console
> > > > SELECT ...
> > > >
> > > > # this will drop all received records
> > > > INSERT INTO blackhole
> > > > SELECT ...
> > > >
> > > > Here `console` and `blackhole` are system sinks, which are similar to
> > > > system functions.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > On Mon, 23 Mar 2020 at 16:33, Benchao Li <libenc...@gmail.com> wrote:
> > > >
> > > > > Hi Jingsong,
> > > > >
> > > > > Thanks for bringing this up. Generally, it's a very good proposal.
> > > > >
> > > > > About the datagen source, do you think we need to add more columns
> > > > > with various types?
> > > > >
> > > > > About the print sink, do we need to specify the schema?
> > > > >
> > > > > On Mon, Mar 23, 2020 at 1:51 PM, Jingsong Li <jingsongl...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Bowen, Jark and Dian for your feedback and suggestions.
> > > > > >
> > > > > > I have reorganized it based on your suggestions, and here is an
> > > > > > attempt at the DDLs:
> > > > > >
> > > > > > 1. datagen source:
> > > > > > - easy startup/testing for streaming jobs
> > > > > > - performance testing
> > > > > >
> > > > > > DDL:
> > > > > > CREATE TABLE user (
> > > > > >     id BIGINT,
> > > > > >     age INT,
> > > > > >     description STRING
> > > > > > ) WITH (
> > > > > >     'connector.type' = 'datagen',
> > > > > >     'connector.rows-per-second'='100',
> > > > > >     'connector.total-records'='1000000',
> > > > > >
> > > > > >     'schema.id.generator' = 'sequence',
> > > > > >     'schema.id.generator.start' = '1',
> > > > > >
> > > > > >     'schema.age.generator' = 'random',
> > > > > >     'schema.age.generator.min' = '0',
> > > > > >     'schema.age.generator.max' = '100',
> > > > > >
> > > > > >     'schema.description.generator' = 'random',
> > > > > >     'schema.description.generator.length' = '100'
> > > > > > )
> > > > > >
> > > > > > The default is the random generator.
> > > > > > Hi Jark, I don't want to bring in complicated pattern rules, because
> > > > > > that can be done through computed columns (see the sketch below),
> > > > > > and it is hard to define standard pattern rules. I think we can
> > > > > > leave that to the future.
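> > > > > >
> > > > > > For example, on top of the datagen DDL above (the computed column
> > > > > > expression is just a sketch):
> > > > > >
> > > > > > CREATE TABLE user_names (
> > > > > >     id BIGINT,
> > > > > >     user_name AS CONCAT('User_', CAST(MOD(id, 10) AS STRING))
> > > > > > ) WITH (
> > > > > >     'connector.type' = 'datagen',
> > > > > >     'schema.id.generator' = 'sequence'
> > > > > > )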
> > > > > >
> > > > > > 2. print sink:
> > > > > > - easy testing for streaming jobs
> > > > > > - very useful in production debugging
> > > > > >
> > > > > > DDL:
> > > > > > CREATE TABLE print_table (
> > > > > >     ...
> > > > > > ) WITH (
> > > > > >     'connector.type' = 'print'
> > > > > > )
> > > > > >
> > > > > > 3. blackhole sink:
> > > > > > - very useful for high-performance testing of Flink
> > > > > > - I've also run into users who emit output from a UDF rather than a
> > > > > > sink, so they need this sink as well.
> > > > > >
> > > > > > DDL:
> > > > > > CREATE TABLE blackhole_table (
> > > > > >     ...
> > > > > > ) WITH (
> > > > > >     'connector.type' = 'blackhole'
> > > > > > )
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Best,
> > > > > > Jingsong Lee
> > > > > >
> > > > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <dian0511...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Jingsong for bringing up this discussion. +1 to this
> > > > > > > proposal. Bowen's proposal makes a lot of sense to me.
> > > > > > >
> > > > > > > This is also a painful problem for PyFlink users. Currently there
> > > > > > > is no built-in easy-to-use table source/sink, and it requires
> > > > > > > users to write a lot of code just to try out PyFlink. This is
> > > > > > > especially painful for new users who are not familiar with
> > > > > > > PyFlink/Flink. I have also gone through the tedious process Bowen
> > > > > > > described, e.g. writing a random source connector, a print sink
> > > > > > > and a blackhole sink, as there are no built-in ones to use.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Dian
> > > > > > >
> > > > > > > > On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > +1 to Bowen's proposal. I have also seen many requests for such
> > > > > > > > built-in connectors.
> > > > > > > >
> > > > > > > > I will leave some of my thoughts here:
> > > > > > > >
> > > > > > > >> 1. datagen source (random source)
> > > > > > > > I think we can merge the functionality of the sequence source
> > > > > > > > into the random source to allow users to customize their data
> > > > > > > > values. Flink can generate random data according to the field
> > > > > > > > types, and users can customize the values to be more domain
> > > > > > > > specific, e.g. 'field.user'='User_[1-9]{0,1}'.
> > > > > > > > This would be similar to kafka-connect-datagen [1].
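> > > > > > > >
> > > > > > > > A possible DDL shape (the option key is hypothetical, just to
> > > > > > > > illustrate the idea):
> > > > > > > >
> > > > > > > > CREATE TABLE users (
> > > > > > > >     user_name STRING
> > > > > > > > ) WITH (
> > > > > > > >     'connector.type' = 'datagen',
> > > > > > > >     'field.user_name' = 'User_[1-9]{0,1}'
> > > > > > > > )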
> > > > > > > >
> > > > > > > >> 2. console sink (print sink)
> > > > > > > > This will be very useful in production debugging, to easily
> > > > > > > > output an intermediate view or result view to a `.out` file,
> > > > > > > > so that we can look into the data representation or check for
> > > > > > > > dirty data. This should work out of the box without manual DDL
> > > > > > > > registration.
> > > > > > > >
> > > > > > > >> 3. blackhole sink (no output sink)
> > > > > > > > This is very useful for high-performance testing of Flink, to
> > > > > > > > measure the throughput of the whole pipeline without a real
> > > > > > > > sink. Presto also provides this as a built-in connector [2].
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jark
> > > > > > > >
> > > > > > > > [1]:
> > > > > > > > https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> +1.
> > > > > > > >>
> > > > > > > >> I would suggest taking a step even further and looking at what
> > > > > > > >> users really need to test/try/play with the Table API and Flink
> > > > > > > >> SQL. Besides this one, here are some more sources and sinks that
> > > > > > > >> I have developed or used previously to facilitate building Flink
> > > > > > > >> table/SQL pipelines.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>   1. random input data source
> > > > > > > >>      - should generate random data at a specified rate
> > > > > > > >>        according to the schema
> > > > > > > >>      - purposes
> > > > > > > >>         - test that the Flink pipeline works and data ends up
> > > > > > > >>           in external storage correctly
> > > > > > > >>         - stress test the Flink sink as well as tune the
> > > > > > > >>           external storage
> > > > > > > >>   2. print data sink
> > > > > > > >>      - should print data in row format to the console
> > > > > > > >>      - purposes
> > > > > > > >>         - make it easier to test a Flink SQL job e2e in the IDE
> > > > > > > >>         - test the Flink pipeline and ensure the output data
> > > > > > > >>           format/values are correct
> > > > > > > >>   3. no output data sink
> > > > > > > >>      - just swallows output data without doing anything
> > > > > > > >>      - purpose
> > > > > > > >>         - evaluate and tune the performance of the Flink source
> > > > > > > >>           and the whole pipeline; users don't need to worry
> > > > > > > >>           about sink back pressure
> > > > > > > >>
> > > > > > > >> These could be considered all together as an effort to lower
> > > > > > > >> the barrier to running Flink SQL/Table API and to facilitate
> > > > > > > >> users' daily work.
> > > > > > > >>
> > > > > > > >> Cheers,
> > > > > > > >> Bowen
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >>> Hi all,
> > > > > > > >>>
> > > > > > > >>> I heard some users complain that the Table API is difficult to
> > > > > > > >>> test. Now, with the SQL client, users are more and more inclined
> > > > > > > >>> to use it for testing rather than writing a program.
> > > > > > > >>> The most common example is the Kafka source. If users need to
> > > > > > > >>> test their SQL output and checkpointing, they need to:
> > > > > > > >>>
> > > > > > > >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> > > > > > > >>> - 2. Write a program that mocks input records and produces them
> > > > > > > >>>      to the Kafka topic.
> > > > > > > >>> - 3. Then test in Flink.
> > > > > > > >>>
> > > > > > > >>> Steps 1 and 2 are annoying, even though this test is E2E.
> > > > > > > >>>
> > > > > > > >>> Then I found StatefulSequenceSource. It is very good because it
> > > > > > > >>> already deals with checkpointing, so it works well with the
> > > > > > > >>> checkpoint mechanism. Usually, users have checkpointing turned
> > > > > > > >>> on in production.
> > > > > > > >>>
> > > > > > > >>> With computed columns, users can easily create a sequence
> > > > > > > >>> source DDL with the same schema as their Kafka DDL. Then they
> > > > > > > >>> can test inside Flink without launching anything else.
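> > > > > > > >>>
> > > > > > > >>> A sketch of what I mean (the connector name and options are
> > > > > > > >>> illustrative, not an existing connector):
> > > > > > > >>>
> > > > > > > >>> CREATE TABLE orders (
> > > > > > > >>>     seq BIGINT,
> > > > > > > >>>     order_id AS MOD(seq, 1000),
> > > > > > > >>>     order_time AS PROCTIME()
> > > > > > > >>> ) WITH (
> > > > > > > >>>     'connector.type' = 'sequence'
> > > > > > > >>> )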
> > > > > > > >>>
> > > > > > > >>> Have you considered this? What do you think?
> > > > > > > >>>
> > > > > > > >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author
> > > > > > > >>> of StatefulSequenceSource.
> > > > > > > >>>
> > > > > > > >>> Best,
> > > > > > > >>> Jingsong Lee
> > > > > > > >>>
> > > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > Best, Jingsong Lee
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Benchao Li
> > > > > School of Electronics Engineering and Computer Science, Peking
> > > University
> > > > > Tel:+86-15650713730
> > > > > Email: libenc...@gmail.com; libenc...@pku.edu.cn
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
>
>
> --
> Best, Jingsong Lee
>


-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk
