Thanks Yubin and Jane for the discussion!

+1 to fixing this bug. Although datagen is usually used as a test source,
it's still important to provide correct behavior for users.

For an invalid field length configured by the user, I think it's better to
raise an error than to silently fall back to the default value.

Taking Jane's examples above:
1. For fixed-length data types, we should not accept a WITH option that
overrides the length semantics declared in the schema.
2. For variable-length data types, both DDLs look OK, since STRING is
equivalent to VARCHAR(2147483647) and the user-defined length does not
exceed the declared length. However, the following one is invalid:
CREATE TABLE t1 (
   f0 VARCHAR(128)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '256');
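
The two rules above could be sketched roughly as follows. This is only a
minimal illustration, not the actual datagen connector code: the class
DatagenLengthCheck, the Kind enum, and the methods findLengthError and
validate are hypothetical names. It also assumes that any explicit
'fields.#.length' on a fixed-length type is rejected, even one equal to the
declared length.

```java
import java.util.Optional;
import java.util.OptionalInt;

// Hypothetical sketch, not the actual Flink datagen connector code.
final class DatagenLengthCheck {

    // Simplified stand-in for the type kinds discussed in this thread.
    enum Kind { CHAR, BINARY, VARCHAR, VARBINARY }

    static boolean isFixedLength(Kind kind) {
        return kind == Kind.CHAR || kind == Kind.BINARY;
    }

    // Returns an error message if the configured length conflicts with the
    // schema, or an empty Optional if the combination is acceptable.
    static Optional<String> findLengthError(
            String field, Kind kind, int declaredLength, OptionalInt configuredLength) {
        if (!configuredLength.isPresent()) {
            return Optional.empty(); // nothing configured, nothing to validate
        }
        int configured = configuredLength.getAsInt();
        String option = "'fields." + field + ".length'";
        if (isFixedLength(kind)) {
            // Rule 1: the schema alone defines the length of CHAR/BINARY.
            return Optional.of(option + " must not be set for fixed-length type "
                    + kind + "(" + declaredLength + ")");
        }
        if (configured > declaredLength) {
            // Rule 2: a variable-length field may not exceed its declared maximum.
            return Optional.of(option + " = " + configured
                    + " exceeds the declared length " + declaredLength);
        }
        return Optional.empty();
    }

    // Raises an error instead of silently falling back to a default.
    static void validate(
            String field, Kind kind, int declaredLength, OptionalInt configuredLength) {
        findLengthError(field, kind, declaredLength, configuredLength)
                .ifPresent(msg -> { throw new IllegalArgumentException(msg); });
    }
}
```

With this sketch, Jane's meow (VARCHAR(20), length 10) and purr (STRING,
i.e. a declared length of Integer.MAX_VALUE, length 10) pass, while foo,
bar, and t1 above are rejected with an exception.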

Another thing we may also want to take into consideration (not a bug, but
relevant) is supporting variable-length semantics for varchar. Since the
128 in varchar(128) is only the maximum length, we could extend datagen to
generate variable-length values (maybe behind a new option, e.g.,
'fields.f0.var-len'='true'). Of course, this is a new feature that is not
part of this problem.
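
As a rough illustration of what such an option might do, here is a minimal
Java sketch. It is not existing connector code: the class VarLenStringSketch
and the method nextString are hypothetical, and it assumes that the length
is drawn uniformly from 1 to the declared maximum (whether empty values
should be allowed is left open).

```java
import java.util.Random;

// Hypothetical sketch of variable-length generation; not Flink code.
final class VarLenStringSketch {

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

    // With varLen = false the value always has exactly maxLength characters;
    // with varLen = true its length is drawn uniformly from [1, maxLength].
    static String nextString(Random random, int maxLength, boolean varLen) {
        if (maxLength < 1) {
            throw new IllegalArgumentException("maxLength must be at least 1");
        }
        int length = varLen ? 1 + random.nextInt(maxLength) : maxLength;
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Roughly what varchar(128) with 'fields.f0.var-len'='true' could produce.
        System.out.println(nextString(random, 128, true).length());
    }
}
```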

Best,
Lincoln Lee


On Tue, Nov 21, 2023 at 9:07 PM Jane Chan <qingyue....@gmail.com> wrote:

> Hi Yubin,
>
> Thanks for driving this discussion. Perhaps a specific example can better
> illustrate the current issue.
>
> Consider the following DDL: f0 will always be generated with the default
> char length of 100, regardless of char(5), because the connector option
> 'fields.f0.length' is not specified [1].
>
>> CREATE TABLE foo (
>>    f0 CHAR(5)
>> ) WITH ('connector' = 'datagen');
>>
>
> Since a fixed-length type usually specifies its length explicitly in the
> DDL, the current design can be confusing for users to some extent.
>
> However, for the proposed changes, it would be preferable to provide
> specific details on how to handle the "not be user-defined" scenario. For
> example, should it be ignored or should an exception be thrown?
>
> To be more specific,
> 1. For fixed-length data types, what happens for the following two DDLs
>
>> CREATE TABLE foo (
>>    f0 CHAR(5)
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>
>> CREATE TABLE bar (
>>    f0 CHAR(5)
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '1');
>>
>
> 2. For variable-length data types, what happens for the following two DDLs
>
>> CREATE TABLE meow (
>>    f0 VARCHAR(20)
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>
>> CREATE TABLE purr (
>>    f0 STRING
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>
>
> Best,
> Jane
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/datagen/#fields-length
>
>
> On Mon, Nov 20, 2023 at 8:46 PM 李宇彬 <lixin58...@163.com> wrote:
>
>> Hi everyone,
>>
>>
>> Currently, the Datagen connector generates data that doesn't match the
>> schema definition when dealing with fixed-length and variable-length
>> fields. It defaults to a unified length of 100 and requires manual
>> configuration by the user. This violates the correctness of schema
>> constraints and hampers ease of use.
>>
>>
>> Jane Chan and I have discussed offline and I will summarize our
>> discussion below.
>>
>>
>> To enhance the datagen connector to automatically generate data that
>> conforms to the schema definition without additional manual
>> configuration, we propose handling the following data types
>> appropriately [1]:
>> 1. For fixed-length data types (char, binary), the length should be
>>    defined by the schema definition and not be user-defined.
>> 2. For variable-length data types (varchar, varbinary), the length
>>    should be defined by the schema definition, but allow for
>>    user-defined lengths that are smaller than the schema definition.
>>
>>
>>
>> Looking forward to your feedback :)
>>
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-32993
>>
>>
>> Best,
>> Yubin
>>
>>
