Thanks Jingsong for bringing up this discussion. +1 to this proposal. 
Bowen's proposal makes a lot of sense to me.

This is also a painful problem for PyFlink users. Currently there is no 
built-in, easy-to-use table source/sink, and users have to write a lot of 
code just to try out PyFlink. This is especially painful for new users who are 
not familiar with PyFlink/Flink. I have also gone through the tedious process 
Bowen described, e.g. writing a random source connector, a print sink and a 
blackhole sink, as there are no built-in ones to use. 

Regards,
Dian

> On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> 
> +1 to Bowen's proposal. I have also seen many requests for such built-in
> connectors.
> 
> I will leave some of my thoughts here:
> 
>> 1. datagen source (random source)
> I think we can merge the functionality of the sequence source into the random
> source to allow users to customize their data values.
> Flink can generate random data according to the field types, and users
> can customize the values to be more domain-specific, e.g.
> 'field.user'='User_[1-9]{0,1}'
> This would be similar to kafka-connect-datagen [1].
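> 
> For example, the DDL for such a datagen source might look like the sketch
> below (the connector name and option keys are illustrative assumptions,
> not a finalized design):
> 
>   CREATE TABLE users (
>     user_id   BIGINT,
>     user_name STRING
>   ) WITH (
>     'connector' = 'datagen',                -- hypothetical connector name
>     'rows-per-second' = '10',               -- hypothetical rate option
>     'field.user_name' = 'User_[1-9]{0,1}'   -- domain-specific value pattern
>   );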
> 
>> 2. console sink (print sink)
> This will be very useful for production debugging, to easily output an
> intermediate view or result view to a `.out` file,
> so that we can look into the data representation or check for dirty data.
> This should work out of the box, without manual DDL registration.
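> 
> Until such out-of-box support exists, a minimal sketch of a registered print
> table might look like this (the connector name is an assumption):
> 
>   CREATE TABLE print_sink (
>     user_id   BIGINT,
>     user_name STRING
>   ) WITH (
>     'connector' = 'print'   -- hypothetical print/console connector name
>   );
> 
>   -- rows written here would show up in the `.out` file
>   INSERT INTO print_sink SELECT * FROM users;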
> 
>> 3. blackhole sink (no output sink)
> This is very useful for high-performance testing of Flink, to measure the
> throughput of the whole pipeline without a real sink.
> Presto also provides this as a built-in connector [2].
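> 
> A minimal sketch, again assuming a hypothetical connector name:
> 
>   CREATE TABLE blackhole_sink (
>     user_id   BIGINT,
>     user_name STRING
>   ) WITH (
>     'connector' = 'blackhole'   -- hypothetical name; swallows all rows
>   );
> 
>   -- measures the throughput of the source and pipeline without a real sink
>   INSERT INTO blackhole_sink SELECT * FROM users;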
> 
> Best,
> Jark
> 
> [1]:
> https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> [2]: https://prestodb.io/docs/current/connector/blackhole.html
> 
> 
> On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> 
>> +1.
>> 
>> I would suggest taking a step even further and looking at what users really
>> need to test/try/play with the Table API and Flink SQL. Besides this one,
>> here are some more sources and sinks that I have developed or used previously
>> to facilitate building Flink Table/SQL pipelines.
>> 
>> 
>>   1. random input data source
>>      - should generate random data at a specified rate according to the schema
>>      - purposes
>>         - test the Flink pipeline and verify that data ends up in external
>>         storage correctly
>>         - stress-test the Flink sink as well as tune the external storage
>>   2. print data sink
>>      - should print data in row format to the console
>>      - purposes
>>         - make it easier to test a Flink SQL job e2e in the IDE
>>         - test the Flink pipeline and ensure the output data format/values
>>         are correct
>>   3. no output data sink
>>      - just swallows output data without doing anything
>>      - purpose
>>         - evaluate and tune the performance of the Flink source and the whole
>>         pipeline; users don't need to worry about sink backpressure
>> 
>> These may be taken into consideration all together as an effort to lower
>> the barrier to entry for Flink SQL/Table API and to facilitate users' daily
>> work. A rough sketch of how the three could fit together is below.
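>> 
>> (All connector names and options in this sketch are illustrative assumptions,
>> not a concrete design.)
>> 
>>   -- 1. random source produces rows at a fixed rate according to the schema
>>   CREATE TABLE random_orders (
>>     order_id BIGINT,
>>     amount   DOUBLE
>>   ) WITH (
>>     'connector' = 'random',
>>     'rows-per-second' = '100'
>>   );
>> 
>>   -- 2. print sink for checking output format/values e2e in the IDE
>>   CREATE TABLE console_out (
>>     order_id BIGINT,
>>     amount   DOUBLE
>>   ) WITH ('connector' = 'console');
>> 
>>   -- 3. no-output sink for measuring source/pipeline performance
>>   CREATE TABLE blackhole_out (
>>     order_id BIGINT,
>>     amount   DOUBLE
>>   ) WITH ('connector' = 'blackhole');
>> 
>>   -- verify correctness first, then swap the sink to measure raw throughput
>>   INSERT INTO console_out SELECT * FROM random_orders;
>>   INSERT INTO blackhole_out SELECT * FROM random_orders;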
>> 
>> Cheers,
>> Bowen
>> 
>> 
>> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com>
>> wrote:
>> 
>>> Hi all,
>>> 
>>> I have heard some users complain that the Table API is difficult to test. Now,
>>> with the SQL client, users are more and more inclined to use it for testing
>>> rather than writing programs.
>>> The most common example is the Kafka source. If users need to test their SQL
>>> output and checkpointing, they need to:
>>> 
>>> - 1. Launch a standalone Kafka cluster and create a Kafka topic.
>>> - 2. Write a program to mock input records and produce them to the Kafka
>>> topic.
>>> - 3. Then test in Flink.
>>> 
>>> Steps 1 and 2 are annoying, even though this test is E2E.
>>> 
>>> Then I found StatefulSequenceSource. It is very good because it already deals
>>> with checkpointing, so it works well with the checkpoint mechanism. Usually,
>>> users have checkpointing turned on in production.
>>> 
>>> With computed columns, users can easily create a sequence source DDL that
>>> looks the same as a Kafka DDL. Then they can test inside Flink without
>>> needing to launch anything else.
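>>> 
>>> For example (the connector name and options are only an illustration of the
>>> idea, not an existing feature), a sequence source plus computed columns could
>>> mimic a Kafka table:
>>> 
>>>   CREATE TABLE mock_kafka_orders (
>>>     id BIGINT,
>>>     -- computed columns derive Kafka-like fields from the sequence value
>>>     user_name AS CONCAT('user_', CAST(MOD(id, 100) AS STRING)),
>>>     ts AS PROCTIME()
>>>   ) WITH (
>>>     'connector' = 'sequence',   -- hypothetical, backed by StatefulSequenceSource
>>>     'start' = '1',
>>>     'end' = '1000000'
>>>   );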
>>> 
>>> Have you considered this? What do you think?
>>> 
>>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author
>>> of StatefulSequenceSource.
>>> 
>>> Best,
>>> Jingsong Lee
>>> 
>> 
