Thanks Jingsong for bringing up this discussion. +1 to this proposal. I think Bowen's proposal makes much sense to me.
This is also a painful problem for PyFlink users. Currently there is no built-in easy-to-use table source/sink and it requires users to write a lot of code to trying out PyFlink. This is especially painful for new users who are not familiar with PyFlink/Flink. I have also encountered the tedious process Bowen encountered, e.g. writing random source connector, print sink and also blackhole print sink as there are no built-in ones to use. Regards, Dian > 在 2020年3月22日,上午11:24,Jark Wu <imj...@gmail.com> 写道: > > +1 to Bowen's proposal. I also saw many requirements on such built-in > connectors. > > I will leave some my thoughts here: > >> 1. datagen source (random source) > I think we can merge the functinality of sequence-source into random source > to allow users to custom their data values. > Flink can generate random data according to the field types, users > can customize their values to be more domain specific, e.g. > 'field.user'='User_[1-9]{0,1}' > This will be similar to kafka-datagen-connect[1]. > >> 2. console sink (print sink) > This will be very useful in production debugging, to easily output an > intermediate view or result view to a `.out` file. > So that we can look into the data representation, or check dirty data. > This should be out-of-box without manually DDL registration. > >> 3. blackhole sink (no output sink) > This is very useful for high performance testing of Flink, to meansure the > throughput of the whole pipeline without sink. > Presto also provides this as a built-in connector [2]. > > Best, > Jark > > [1]: > https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification > [2]: https://prestodb.io/docs/current/connector/blackhole.html > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote: > >> +1. >> >> I would suggest to take a step even further and see what users really need >> to test/try/play with table API and Flink SQL. Besides this one, here're >> some more sources and sinks that I have developed or used previously to >> facilitate building Flink table/SQL pipelines. >> >> >> 1. random input data source >> - should generate random data at a specified rate according to schema >> - purposes >> - test Flink pipeline and data can end up in external storage >> correctly >> - stress test Flink sink as well as tuning up external storage >> 2. print data sink >> - should print data in row format in console >> - purposes >> - make it easier to test Flink SQL job e2e in IDE >> - test Flink pipeline and ensure output data format/value is >> correct >> 3. no output data sink >> - just swallow output data without doing anything >> - purpose >> - evaluate and tune performance of Flink source and the whole >> pipeline. Users' don't need to worry about sink back pressure >> >> These may be taken into consideration all together as an effort to lower >> the threshold of running Flink SQL/table API, and facilitate users' daily >> work. >> >> Cheers, >> Bowen >> >> >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com> >> wrote: >> >>> Hi all, >>> >>> I heard some users complain that table is difficult to test. Now with SQL >>> client, users are more and more inclined to use it to test rather than >>> program. >>> The most common example is Kafka source. If users need to test their SQL >>> output and checkpoint, they need to: >>> >>> - 1.Launch a Kafka standalone, create a Kafka topic . >>> - 2.Write a program, mock input records, and produce records to Kafka >>> topic. >>> - 3.Then test in Flink. >>> >>> The step 1 and 2 are annoying, although this test is E2E. >>> >>> Then I found StatefulSequenceSource, it is very good because it has deal >>> with checkpoint things, so it is very good to checkpoint >> mechanism.Usually, >>> users are turned on checkpoint in production. >>> >>> With computed columns, user are easy to create a sequence source DDL same >>> to Kafka DDL. Then they can test inside Flink, don't need launch other >>> things. >>> >>> Have you consider this? What do you think? >>> >>> CC: @Aljoscha Krettek <aljos...@apache.org> the author >>> of StatefulSequenceSource. >>> >>> Best, >>> Jingsong Lee >>> >>