Re: [DISCUSS][FLINK-32993] Datagen connector handles length-constrained fields according to the schema definition by default

2023-11-21 Thread Lincoln Lee
Thanks Yubin and Jane for the discussion!

+1 to fix this bug. Although datagen is usually used as a test source, it's
important to provide the correct behavior for users.

For an invalid field length configured by users, I think it's better to
raise an error instead of silently using the default value.

Take Jane's examples (quoted below):
1. For fixed-length data types, we should not accept a WITH option that
overrides the length semantics in the schema.
2. For variable-length data types, both DDLs look OK, since STRING is
equivalent to VARCHAR(2147483647) and the user-defined length does not
exceed the declared length. The following one, however, is invalid:
CREATE TABLE t1 (
   f0 VARCHAR(128)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '256');
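
The two rules above could be sketched roughly as follows. This is only an
illustrative Java sketch under stated assumptions, not the connector's actual
code: the class LengthValidationSketch, the method resolveLength, and the
fixedLength flag are hypothetical names, and rejecting every user-defined
length for fixed-length types (even one equal to the declared length) is an
assumption.

```java
import java.util.OptionalInt;

/** Hypothetical sketch of the proposed checks; not the datagen connector's actual code. */
final class LengthValidationSketch {

    /**
     * @param fixedLength    true for CHAR/BINARY, false for VARCHAR/VARBINARY/STRING
     * @param declaredLength length taken from the schema, e.g. 128 for VARCHAR(128)
     * @param userLength     value of 'fields.#.length', or empty if the option is not set
     * @return the length the generator should use
     */
    static int resolveLength(boolean fixedLength, int declaredLength, OptionalInt userLength) {
        if (!userLength.isPresent()) {
            // No option configured: follow the schema definition.
            return declaredLength;
        }
        int configured = userLength.getAsInt();
        if (fixedLength) {
            // Rule 1 (assumption: any user-defined length is rejected for fixed-length types).
            throw new IllegalArgumentException(
                    "'fields.#.length' is not allowed for fixed-length types; declared length is "
                            + declaredLength);
        }
        if (configured < 1 || configured > declaredLength) {
            // Rule 2: fail loudly instead of silently falling back to a default.
            throw new IllegalArgumentException(
                    "Configured length " + configured + " must be between 1 and " + declaredLength);
        }
        return configured;
    }

    public static void main(String[] args) {
        // VARCHAR(20) with 'fields.f0.length' = '10' is accepted.
        System.out.println(resolveLength(false, 20, OptionalInt.of(10)));
        try {
            // VARCHAR(128) with 'fields.f0.length' = '256' is rejected.
            resolveLength(false, 128, OptionalInt.of(256));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Throwing at validation time means the misconfiguration would surface when the
table is created or planned, rather than as unexpectedly sized data later.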

Another thing we may also want to consider (not a bug, but relevant) is
supporting variable-length semantics for VARCHAR. Since the 128 in
VARCHAR(128) is only the maximum length, we could extend datagen to generate
variable-length values (maybe via a new option to enable it, e.g.,
'fields.f0.var-len'='true'). Of course, this is a new feature that is not
part of this problem.
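
To make the variable-length idea concrete, here is a minimal Java sketch. It
is not an existing datagen feature: the class VarLenSketch, the method
nextString, the alphabet, and the choice of 1 as the minimum length are all
assumptions made for illustration.

```java
import java.util.Random;

/** Hypothetical sketch of variable-length generation; not part of the datagen connector. */
final class VarLenSketch {

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

    /**
     * @param maxLength declared maximum, e.g. 128 for VARCHAR(128)
     * @param varLen    stands in for a hypothetical option such as 'fields.f0.var-len'
     */
    static String nextString(Random random, int maxLength, boolean varLen) {
        if (maxLength < 1) {
            throw new IllegalArgumentException("maxLength must be at least 1");
        }
        // With var-len enabled, pick a length in [1, maxLength]; otherwise always use maxLength.
        int length = varLen ? 1 + random.nextInt(maxLength) : maxLength;
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(nextString(random, 128, true).length());
        System.out.println(nextString(random, 128, false).length());
    }
}
```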

Best,
Lincoln Lee


On Tue, Nov 21, 2023 at 9:07 PM Jane Chan wrote:

> Hi Yubin,
>
> Thanks for driving this discussion. Perhaps a specific example can better
> illustrate the current issue.
>
> Consider the following DDL: f0 will always be generated with the default
> length of 100, regardless of CHAR(5), because the connector option
> 'fields.f0.length' is not specified [1].
>
>> CREATE TABLE foo (
>>   f0 CHAR(5)
>> ) WITH ('connector' = 'datagen');
>>
>
> Since a fixed-length type usually specifies its length explicitly in the
> DDL, the current behavior can be confusing for users.
>
> However, for the proposed changes, it would be preferable to provide
> specific details on how to handle the "not be user-defined" scenario. For
> example, should the option be ignored, or should an exception be thrown?
>
> To be more specific,
> 1. For fixed-length data types, what happens for the following two DDLs
>
>> CREATE TABLE foo (
>>   f0 CHAR(5)
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>
>> CREATE TABLE bar (
>>   f0 CHAR(5)
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '1');
>>
>
> 2. For variable-length data types, what happens for the following two DDLs
>
>> CREATE TABLE meow (
>>   f0 VARCHAR(20)
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>
>> CREATE TABLE purr (
>>   f0 STRING
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>
>
> Best,
> Jane
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/datagen/#fields-length
>
>
> On Mon, Nov 20, 2023 at 8:46 PM 李宇彬  wrote:
>
>> Hi everyone,
>>
>>
>> Currently, the Datagen connector generates data that doesn't match the
>> schema definition when dealing with fixed-length and variable-length
>> fields. It defaults to a unified length of 100 and requires manual
>> configuration by the user. This violates the correctness of schema
>> constraints and hampers ease of use.
>>
>>
>> Jane Chan and I discussed this offline, and I will summarize our
>> discussion below.
>>
>>
>> To enhance the datagen connector to automatically generate data that
>> conforms to the schema definition without additional manual
>> configuration, we propose handling the following data types
>> appropriately [1]:
>>   1. For fixed-length data types (CHAR, BINARY), the length should be
>>      defined by the schema definition and not be user-defined.
>>   2. For variable-length data types (VARCHAR, VARBINARY), the length
>>      should be defined by the schema definition, but user-defined
>>      lengths smaller than the schema definition are allowed.
>>
>>
>>
>> Looking forward to your feedback :)
>>
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-32993
>>
>>
>> Best,
>> Yubin
>>
>>


Re: [DISCUSS][FLINK-32993] Datagen connector handles length-constrained fields according to the schema definition by default

2023-11-21 Thread Jane Chan
Hi Yubin,

Thanks for driving this discussion. Perhaps a specific example can better
illustrate the current issue.

Consider the following DDL: f0 will always be generated with the default
length of 100, regardless of CHAR(5), because the connector option
'fields.f0.length' is not specified [1].

> CREATE TABLE foo (
>   f0 CHAR(5)
> ) WITH ('connector' = 'datagen');
>
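
The gap between the behavior described above and a schema-derived default can
be sketched in Java as follows. The class DefaultLengthSketch and both method
names are hypothetical; only the default of 100 comes from the documentation
referenced in [1], and how an explicitly configured option should be treated
is the open question in this thread, so the sketch simply passes it through.

```java
import java.util.OptionalInt;

/** Hypothetical sketch contrasting the current and the schema-derived default length. */
final class DefaultLengthSketch {

    /** Default of 'fields.#.length' as described in this thread. */
    static final int DEFAULT_LENGTH = 100;

    /** Current behavior as described above: the declared schema length is ignored. */
    static int currentLength(int declaredLength, OptionalInt userLength) {
        return userLength.orElse(DEFAULT_LENGTH);
    }

    /** Schema-derived default: fall back to the declared length when the option is absent. */
    static int proposedLength(int declaredLength, OptionalInt userLength) {
        // An explicit option is passed through unchanged here (assumption, see lead-in).
        return userLength.orElse(declaredLength);
    }

    public static void main(String[] args) {
        // f0 CHAR(5) without 'fields.f0.length'
        System.out.println(currentLength(5, OptionalInt.empty()));
        System.out.println(proposedLength(5, OptionalInt.empty()));
    }
}
```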

Since a fixed-length type usually specifies its length explicitly in the
DDL, the current behavior can be confusing for users.

However, for the proposed changes, it would be preferable to provide
specific details on how to handle the "not be user-defined" scenario. For
example, should the option be ignored, or should an exception be thrown?

To be more specific,
1. For fixed-length data types, what happens for the following two DDLs

> CREATE TABLE foo (
>   f0 CHAR(5)
> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>
> CREATE TABLE bar (
>   f0 CHAR(5)
> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '1');
>

2. For variable-length data types, what happens for the following two DDLs

> CREATE TABLE meow (
>   f0 VARCHAR(20)
> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>
> CREATE TABLE purr (
>   f0 STRING
> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>

Best,
Jane

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/datagen/#fields-length


On Mon, Nov 20, 2023 at 8:46 PM 李宇彬  wrote:

> Hi everyone,
>
>
> Currently, the Datagen connector generates data that doesn't match the
> schema definition when dealing with fixed-length and variable-length
> fields. It defaults to a unified length of 100 and requires manual
> configuration by the user. This violates the correctness of schema
> constraints and hampers ease of use.
>
>
> Jane Chan and I discussed this offline, and I will summarize our
> discussion below.
>
>
> To enhance the datagen connector to automatically generate data that
> conforms to the schema definition without additional manual
> configuration, we propose handling the following data types
> appropriately [1]:
>   1. For fixed-length data types (CHAR, BINARY), the length should be
>      defined by the schema definition and not be user-defined.
>   2. For variable-length data types (VARCHAR, VARBINARY), the length
>      should be defined by the schema definition, but user-defined
>      lengths smaller than the schema definition are allowed.
>
>
>
> Looking forward to your feedback :)
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-32993
>
>
> Best,
> Yubin
>
>