Hi Jingsong,

Regarding (2) and (3), I was thinking we could skip the manual DDL work, so users can use them directly:
# this will log results to `.out` files
INSERT INTO console SELECT ...

# this will drop all received records
INSERT INTO blackhole SELECT ...

Here `console` and `blackhole` are system sinks, which are similar to system functions.

Best,
Jark

On Mon, 23 Mar 2020 at 16:33, Benchao Li <libenc...@gmail.com> wrote:

> Hi Jingsong,
>
> Thanks for bringing this up. Generally, it's a very good proposal.
>
> About the datagen source, do you think we need to add more columns with
> various types?
>
> About the print sink, do we need to specify the schema?
>
> On Mon, Mar 23, 2020 at 1:51 PM, Jingsong Li <jingsongl...@gmail.com> wrote:
>
> > Thanks Bowen, Jark and Dian for your feedback and suggestions.
> >
> > I have reorganized the proposal with your suggestions, and will try to
> > expose these DDLs:
> >
> > 1. datagen source:
> >   - easy startup/testing for streaming jobs
> >   - performance testing
> >
> > DDL:
> > CREATE TABLE user (
> >   id BIGINT,
> >   age INT,
> >   description STRING
> > ) WITH (
> >   'connector.type' = 'datagen',
> >   'connector.rows-per-second' = '100',
> >   'connector.total-records' = '1000000',
> >
> >   'schema.id.generator' = 'sequence',
> >   'schema.id.generator.start' = '1',
> >
> >   'schema.age.generator' = 'random',
> >   'schema.age.generator.min' = '0',
> >   'schema.age.generator.max' = '100',
> >
> >   'schema.description.generator' = 'random',
> >   'schema.description.generator.length' = '100'
> > )
> >
> > The default is the random generator.
> > Hi Jark, I don't want to bring in complicated patterns, because they can
> > be done through computed columns. And it is hard to define standard
> > patterns; I think we can leave that to the future.
> >
> > 2. print sink:
> >   - easy testing for streaming jobs
> >   - very useful in production debugging
> >
> > DDL:
> > CREATE TABLE print_table (
> >   ...
> > ) WITH (
> >   'connector.type' = 'print'
> > )
> >
> > 3. blackhole sink:
> >   - very useful for high-performance testing of Flink
> >   - I've also run into users trying to use a UDF to output, not a sink,
> >     so they need this sink as well.
> >
> > DDL:
> > CREATE TABLE blackhole_table (
> >   ...
> > ) WITH (
> >   'connector.type' = 'blackhole'
> > )
> >
> > What do you think?
> >
> > Best,
> > Jingsong Lee
> >
> > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <dian0511...@gmail.com> wrote:
> >
> > > Thanks Jingsong for bringing up this discussion. +1 to this proposal.
> > > I think Bowen's proposal makes a lot of sense to me.
> > >
> > > This is also a painful problem for PyFlink users. Currently there is no
> > > built-in, easy-to-use table source/sink, and it requires users to write
> > > a lot of code just to try out PyFlink. This is especially painful for
> > > new users who are not familiar with PyFlink/Flink. I have also gone
> > > through the tedious process Bowen encountered, e.g. writing a random
> > > source connector, a print sink and also a blackhole sink, as there are
> > > no built-in ones to use.
> > >
> > > Regards,
> > > Dian
> > >
> > > > On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> > > >
> > > > +1 to Bowen's proposal. I also saw many requirements for such
> > > > built-in connectors.
> > > >
> > > > I will leave some of my thoughts here:
> > > >
> > > >> 1. datagen source (random source)
> > > > I think we can merge the functionality of the sequence source into
> > > > the random source to allow users to customize their data values.
> > > > Flink can generate random data according to the field types, and
> > > > users can customize their values to be more domain specific, e.g.
> > > > 'field.user'='User_[1-9]{0,1}'
> > > > This will be similar to kafka-datagen-connect [1].
> > > >
> > > >> 2. console sink (print sink)
> > > > This will be very useful in production debugging, to easily output an
> > > > intermediate view or result view to a `.out` file.
> > > > So we can look into the data representation, or check for dirty data.
> > > > This should work out of the box without manual DDL registration.
> > > >
> > > >> 3. blackhole sink (no output sink)
> > > > This is very useful for high-performance testing of Flink, to measure
> > > > the throughput of the whole pipeline without a sink.
> > > > Presto also provides this as a built-in connector [2].
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > [1]:
> > > > https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > >
> > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> > > >
> > > >> +1.
> > > >>
> > > >> I would suggest taking a step even further and seeing what users
> > > >> really need to test/try/play with the Table API and Flink SQL.
> > > >> Besides this one, here are some more sources and sinks that I have
> > > >> developed or used previously to facilitate building Flink table/SQL
> > > >> pipelines.
> > > >>
> > > >> 1. random input data source
> > > >>    - should generate random data at a specified rate according to
> > > >>      the schema
> > > >>    - purposes
> > > >>      - test the Flink pipeline and that data ends up in external
> > > >>        storage correctly
> > > >>      - stress test the Flink sink as well as tune the external
> > > >>        storage
> > > >> 2. print data sink
> > > >>    - should print data in row format to the console
> > > >>    - purposes
> > > >>      - make it easier to test Flink SQL jobs e2e in the IDE
> > > >>      - test the Flink pipeline and ensure the output data
> > > >>        format/values are correct
> > > >> 3. no output data sink
> > > >>    - just swallows output data without doing anything
> > > >>    - purpose
> > > >>      - evaluate and tune the performance of the Flink source and the
> > > >>        whole pipeline; users don't need to worry about sink back
> > > >>        pressure
> > > >>
> > > >> These may be taken into consideration all together as an effort to
> > > >> lower the threshold of running Flink SQL/Table API, and facilitate
> > > >> users' daily work.
> > > >>
> > > >> Cheers,
> > > >> Bowen
> > > >>
> > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> I heard some users complain that the Table API is difficult to
> > > >>> test. Now, with the SQL client, users are more and more inclined to
> > > >>> use it to test rather than write programs.
> > > >>> The most common example is the Kafka source. If users need to test
> > > >>> their SQL output and checkpointing, they need to:
> > > >>>
> > > >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> > > >>> - 2. Write a program, mock input records, and produce records to
> > > >>>      the Kafka topic.
> > > >>> - 3. Then test in Flink.
> > > >>>
> > > >>> Steps 1 and 2 are annoying, although this test is E2E.
> > > >>>
> > > >>> Then I found StatefulSequenceSource. It is very good because it has
> > > >>> dealt with checkpointing, so it works well with the checkpoint
> > > >>> mechanism. Usually, users have checkpointing turned on in
> > > >>> production.
> > > >>>
> > > >>> With computed columns, users can easily create a sequence source
> > > >>> DDL similar to the Kafka DDL. Then they can test inside Flink and
> > > >>> don't need to launch other things.
> > > >>>
> > > >>> Have you considered this? What do you think?
> > > >>>
> > > >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author of
> > > >>> StatefulSequenceSource.
> > > >>>
> > > >>> Best,
> > > >>> Jingsong Lee
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > Best, Jingsong Lee
> >
>
> --
> Benchao Li
> School of Electronics Engineering and Computer Science, Peking University
> Tel: +86-15650713730
> Email: libenc...@gmail.com; libenc...@pku.edu.cn
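The generator semantics proposed in this thread (a 'sequence' generator with a start value, and a 'random' generator with min/max for numbers or a length for strings) can be sketched in plain Python as follows. This is an illustrative sketch only, not the Flink connector's implementation: the spec-dictionary shape and the function names `make_generator`/`generate_rows` are invented for the example, and rate limiting ('connector.rows-per-second') is omitted.

```python
import itertools
import random


def make_generator(spec):
    """Build an infinite value generator from a datagen-style field spec."""
    if spec["generator"] == "sequence":
        # Monotonically increasing values starting at 'start'.
        return itertools.count(int(spec["start"]))
    if spec["generator"] == "random":
        if "length" in spec:
            # Random lowercase string of a fixed length.
            alphabet = "abcdefghijklmnopqrstuvwxyz"
            return iter(
                lambda: "".join(random.choices(alphabet, k=int(spec["length"]))),
                None,
            )
        # Random integer in the inclusive range [min, max].
        return iter(
            lambda: random.randint(int(spec["min"]), int(spec["max"])), None
        )
    raise ValueError("unknown generator: " + spec["generator"])


def generate_rows(schema, total_records):
    """Yield rows (as dicts) until 'connector.total-records' is reached."""
    gens = {name: make_generator(spec) for name, spec in schema.items()}
    for _ in range(total_records):
        yield {name: next(g) for name, g in gens.items()}


# Mirrors the DDL above: id is a sequence starting at 1, age is random in
# [0, 100], description is a random 100-character string.
schema = {
    "id": {"generator": "sequence", "start": "1"},
    "age": {"generator": "random", "min": "0", "max": "100"},
    "description": {"generator": "random", "length": "100"},
}
rows = list(generate_rows(schema, 5))
```

A real connector would additionally throttle emission to the configured rows-per-second rate and checkpoint the sequence position, which is exactly what makes StatefulSequenceSource attractive as a starting point.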