Hi Jingsong,

Regarding (2) and (3), I was thinking we could skip the manual DDL work, so users can use them directly:
# this will log results to `.out` files
INSERT INTO console SELECT ...

# this will drop all received records
INSERT INTO blackhole SELECT ...

Here `console` and `blackhole` are system sinks, which are similar to system functions.

Best,
Jark

On Mon, 23 Mar 2020 at 16:33, Benchao Li <libenc...@gmail.com> wrote:

> Hi Jingsong,
>
> Thanks for bringing this up. Generally, it's a very good proposal.
>
> About the datagen source, do you think we need to add more columns with
> various types?
>
> About the print sink, do we need to specify the schema?
>
> On Mon, Mar 23, 2020 at 1:51 PM, Jingsong Li <jingsongl...@gmail.com> wrote:
>
> > Thanks Bowen, Jark and Dian for your feedback and suggestions.
> >
> > I have reorganized the proposal with your suggestions, and will try to
> > expose these DDLs:
> >
> > 1. datagen source:
> >   - easy startup/testing for streaming jobs
> >   - performance testing
> >
> > DDL:
> > CREATE TABLE user (
> >   id BIGINT,
> >   age INT,
> >   description STRING
> > ) WITH (
> >   'connector.type' = 'datagen',
> >   'connector.rows-per-second' = '100',
> >   'connector.total-records' = '1000000',
> >
> >   'schema.id.generator' = 'sequence',
> >   'schema.id.generator.start' = '1',
> >
> >   'schema.age.generator' = 'random',
> >   'schema.age.generator.min' = '0',
> >   'schema.age.generator.max' = '100',
> >
> >   'schema.description.generator' = 'random',
> >   'schema.description.generator.length' = '100'
> > )
> >
> > The default is the random generator.
> > Hi Jark, I don't want to bring in complicated patterns, because they can
> > be done through computed columns. And it is hard to define standard
> > patterns; I think we can leave that to the future.
> >
> > 2. print sink:
> >   - easy testing for streaming jobs
> >   - very useful in production debugging
> >
> > DDL:
> > CREATE TABLE print_table (
> >   ...
> > ) WITH (
> >   'connector.type' = 'print'
> > )
> >
> > 3. blackhole sink:
> >   - very useful for high-performance testing of Flink
> >   - I've also run into users trying to use a UDF to output, not a sink,
> >     so they need this sink as well.
> >
> > DDL:
> > CREATE TABLE blackhole_table (
> >   ...
> > ) WITH (
> >   'connector.type' = 'blackhole'
> > )
> >
> > What do you think?
> >
> > Best,
> > Jingsong Lee
> >
> > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <dian0511...@gmail.com> wrote:
> >
> > > Thanks Jingsong for bringing up this discussion. +1 to this proposal.
> > > I think Bowen's proposal makes a lot of sense to me.
> > >
> > > This is also a painful problem for PyFlink users. Currently there is no
> > > built-in, easy-to-use table source/sink, and it requires users to write
> > > a lot of code just to try out PyFlink. This is especially painful for
> > > new users who are not familiar with PyFlink/Flink. I have also gone
> > > through the tedious process Bowen encountered, e.g. writing a random
> > > source connector, a print sink and also a blackhole sink, as there are
> > > no built-in ones to use.
> > >
> > > Regards,
> > > Dian
> > >
> > > > On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> > > >
> > > > +1 to Bowen's proposal. I also saw many requirements for such
> > > > built-in connectors.
> > > >
> > > > I will leave some of my thoughts here:
> > > >
> > > >> 1. datagen source (random source)
> > > > I think we can merge the functionality of the sequence source into
> > > > the random source to allow users to customize their data values.
> > > > Flink can generate random data according to the field types, and
> > > > users can customize their values to be more domain specific, e.g.
> > > > 'field.user'='User_[1-9]{0,1}'
> > > > This will be similar to kafka-datagen-connect [1].
> > > >
> > > >> 2. console sink (print sink)
> > > > This will be very useful in production debugging, to easily output an
> > > > intermediate view or result view to a `.out` file.
> > > > So we can look into the data representation, or check for dirty data.
> > > > This should work out of the box without manual DDL registration.
> > > >
> > > >> 3. blackhole sink (no output sink)
> > > > This is very useful for high-performance testing of Flink, to measure
> > > > the throughput of the whole pipeline without a sink.
> > > > Presto also provides this as a built-in connector [2].
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > > [1]:
> > > > https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > >
> > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> > > >
> > > >> +1.
> > > >>
> > > >> I would suggest taking a step even further and seeing what users
> > > >> really need to test/try/play with the Table API and Flink SQL.
> > > >> Besides this one, here are some more sources and sinks that I have
> > > >> developed or used previously to facilitate building Flink table/SQL
> > > >> pipelines.
> > > >>
> > > >> 1. random input data source
> > > >>    - should generate random data at a specified rate according to
> > > >>      the schema
> > > >>    - purposes
> > > >>      - test the Flink pipeline and that data ends up in external
> > > >>        storage correctly
> > > >>      - stress test the Flink sink as well as tune the external
> > > >>        storage
> > > >> 2. print data sink
> > > >>    - should print data in row format to the console
> > > >>    - purposes
> > > >>      - make it easier to test Flink SQL jobs e2e in the IDE
> > > >>      - test the Flink pipeline and ensure the output data
> > > >>        format/values are correct
> > > >> 3. no output data sink
> > > >>    - just swallows output data without doing anything
> > > >>    - purpose
> > > >>      - evaluate and tune the performance of the Flink source and the
> > > >>        whole pipeline; users don't need to worry about sink back
> > > >>        pressure
> > > >>
> > > >> These may be taken into consideration all together as an effort to
> > > >> lower the threshold of running Flink SQL/Table API, and facilitate
> > > >> users' daily work.
> > > >>
> > > >> Cheers,
> > > >> Bowen
> > > >>
> > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> I heard some users complain that the Table API is difficult to
> > > >>> test. Now, with the SQL client, users are more and more inclined to
> > > >>> use it to test rather than write programs.
> > > >>> The most common example is the Kafka source. If users need to test
> > > >>> their SQL output and checkpointing, they need to:
> > > >>>
> > > >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> > > >>> - 2. Write a program, mock input records, and produce records to
> > > >>>      the Kafka topic.
> > > >>> - 3. Then test in Flink.
> > > >>>
> > > >>> Steps 1 and 2 are annoying, although this test is E2E.
> > > >>>
> > > >>> Then I found StatefulSequenceSource. It is very good because it has
> > > >>> dealt with checkpointing, so it works well with the checkpoint
> > > >>> mechanism. Usually, users have checkpointing turned on in
> > > >>> production.
> > > >>>
> > > >>> With computed columns, users can easily create a sequence source
> > > >>> DDL similar to the Kafka DDL. Then they can test inside Flink and
> > > >>> don't need to launch other things.
> > > >>>
> > > >>> Have you considered this? What do you think?
> > > >>>
> > > >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author of
> > > >>> StatefulSequenceSource.
> > > >>>
> > > >>> Best,
> > > >>> Jingsong Lee
> > > >>>
> > > >>
> > >
> > >
> >
> > --
> > Best, Jingsong Lee
> >
>
> --
> Benchao Li
> School of Electronics Engineering and Computer Science, Peking University
> Tel: +86-15650713730
> Email: libenc...@gmail.com; libenc...@pku.edu.cn
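The generator semantics proposed in this thread (a 'sequence' generator with a start value, and a 'random' generator with min/max for numbers or a length for strings) can be sketched in plain Python as follows. This is an illustrative sketch only, not the Flink connector's implementation: the spec-dictionary shape and the function names `make_generator`/`generate_rows` are invented for the example, and rate limiting ('connector.rows-per-second') is omitted.

```python
import itertools
import random


def make_generator(spec):
    """Build an infinite value generator from a datagen-style field spec."""
    if spec["generator"] == "sequence":
        # Monotonically increasing values starting at 'start'.
        return itertools.count(int(spec["start"]))
    if spec["generator"] == "random":
        if "length" in spec:
            # Random lowercase string of a fixed length.
            alphabet = "abcdefghijklmnopqrstuvwxyz"
            return iter(
                lambda: "".join(random.choices(alphabet, k=int(spec["length"]))),
                None,
            )
        # Random integer in the inclusive range [min, max].
        return iter(
            lambda: random.randint(int(spec["min"]), int(spec["max"])), None
        )
    raise ValueError("unknown generator: " + spec["generator"])


def generate_rows(schema, total_records):
    """Yield rows (as dicts) until 'connector.total-records' is reached."""
    gens = {name: make_generator(spec) for name, spec in schema.items()}
    for _ in range(total_records):
        yield {name: next(g) for name, g in gens.items()}


# Mirrors the DDL above: id is a sequence starting at 1, age is random in
# [0, 100], description is a random 100-character string.
schema = {
    "id": {"generator": "sequence", "start": "1"},
    "age": {"generator": "random", "min": "0", "max": "100"},
    "description": {"generator": "random", "length": "100"},
}
rows = list(generate_rows(schema, 5))
```

A real connector would additionally throttle emission to the configured rows-per-second rate and checkpoint the sequence position, which is exactly what makes StatefulSequenceSource attractive as a starting point.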