Hi, Yubin and Jane.

Big +1 for this fix. 


I also agree with Lincoln that we should throw an error, rather than leaving
all the complexity to the framework,
when the length in the schema and the length in the WITH options clearly conflict.


Of the four examples Jane provides, I think only the one below should pass
validation; the others should probably
throw a clear exception.


```
CREATE TABLE foo (
 f0 VARCHAR(20)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
```
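
To make the rule concrete, here is a rough sketch of the check as I read it from Yubin's proposal and Lincoln's reply. It is written in Python only for brevity and is not the connector's actual code: `resolve_length`, its parameters, and the error messages are hypothetical names for illustration. It also assumes that any 'fields.#.length' option on a fixed-length type is rejected, and it does not model STRING.

```python
from typing import Optional

# Type families as grouped in the proposal.
FIXED_LENGTH_TYPES = {"CHAR", "BINARY"}
VARIABLE_LENGTH_TYPES = {"VARCHAR", "VARBINARY"}


def resolve_length(type_name: str,
                   schema_length: int,
                   option_length: Optional[int] = None) -> int:
    """Return the length to generate, or raise ValueError on a conflict.

    schema_length is the length declared in the DDL, e.g. 20 for VARCHAR(20).
    option_length is the value of 'fields.<name>.length', or None if unset.
    """
    kind = type_name.upper()

    if kind in FIXED_LENGTH_TYPES:
        # Assumption: a fixed-length column never accepts a length option.
        if option_length is not None:
            raise ValueError(
                f"{kind}({schema_length}) is fixed-length; "
                "'fields.#.length' must not be set")
        return schema_length

    if kind in VARIABLE_LENGTH_TYPES:
        if option_length is None:
            # No option given: fall back to the declared length.
            return schema_length
        if option_length > schema_length:
            raise ValueError(
                f"'fields.#.length'={option_length} exceeds "
                f"{kind}({schema_length})")
        return option_length

    raise ValueError(f"type {kind} is not covered by this sketch")
```

Under these assumptions, `resolve_length("VARCHAR", 20, 10)` returns 10, while `("CHAR", 5, 10)`, `("CHAR", 5, 1)` and `("VARCHAR", 128, 256)` each raise an error.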



--

    Best!
    Xuyang





At 2023-11-21 23:37:32, "Lincoln Lee" <lincoln.8...@gmail.com> wrote:
>Thanks Yubin and Jane for the discussion!
>
>+1 to fix this bug. Although datagen is usually used as a test source, it's
>important to provide the correct behavior for users.
>
>For an invalid field length configured by the user, I think it's better to
>raise an error than to silently fall back to the default value.
>
>Taking Jane's examples above:
>1. For fixed-length data types, we should not accept a WITH option that
>overrides the length semantics of the schema.
>2. For variable-length data types, both DDLs look OK, since STRING is
>equivalent to VARCHAR(2147483647) and the user-defined length does not
>exceed the declared length,
>but the following one is invalid:
>CREATE TABLE t1 (
>   f0 VARCHAR(128)
>) WITH ('connector' = 'datagen', 'fields.f0.length' = '256');
>
>Another thing we may also take into consideration (not a bug, but relevant)
>is supporting variable-length semantics for varchar. Since the length 128
>in varchar(128) is just the maximum length, we could extend datagen to generate
>variable-length values (maybe with a new option to enable it, e.g.,
>'fields.f0.var-len'='true'). Of course, this is a new feature that is not
>part of this problem.
>
>Best,
>Lincoln Lee
>
>
>On Tue, Nov 21, 2023 at 21:07, Jane Chan <qingyue....@gmail.com> wrote:
>
>> Hi Yubin,
>>
>> Thanks for driving this discussion. Perhaps a specific example can better
>> illustrate the current issue.
>>
>> Consider the following DDL: f0 will always be generated with the default
>> char length of 100, regardless of the declared CHAR(5), because the
>> connector option 'fields.f0.length' is not specified [1].
>>
>>> CREATE TABLE foo (
>>>    f0 CHAR(5)
>>> ) WITH ('connector' = 'datagen');
>>>
>>
>> Since a fixed-length type usually has its length specified explicitly in
>> the DDL, the current design can be confusing for users to some
>> extent.
>>
>> However, for the proposed changes, it would be preferable to provide
>> specific details on how to handle the "not be user-defined" scenario. For
>> example, should it be ignored or should an exception be thrown?
>>
>> To be more specific,
>> 1. For fixed-length data types, what happens for the following two DDLs?
>>
>>> CREATE TABLE foo (
>>>    f0 CHAR(5)
>>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>>
>>> CREATE TABLE bar (
>>>    f0 CHAR(5)
>>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '1');
>>>
>>
>> 2. For variable-length data types, what happens for the following two DDLs?
>>
>>> CREATE TABLE meow (
>>>    f0 VARCHAR(20)
>>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>>
>>> CREATE TABLE purr (
>>>    f0 STRING
>>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>>
>>
>> Best,
>> Jane
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/datagen/#fields-length
>>
>>
>> On Mon, Nov 20, 2023 at 8:46 PM 李宇彬 <lixin58...@163.com> wrote:
>>
>>> Hi everyone,
>>>
>>>
>>> Currently, the Datagen connector generates data that doesn't match the
>>> schema definition
>>> when dealing with fixed-length and variable-length fields. It defaults to
>>> a unified length of 100
>>> and requires manual configuration by the user. This violates the
>>> correctness of schema constraints
>>> and hampers ease of use.
>>>
>>>
>>> Jane Chan and I have discussed this offline, and I will summarize our
>>> discussion below.
>>>
>>>
>>> To enhance the datagen connector to automatically generate data that
>>> conforms to the schema
>>> definition without additional manual configuration, we propose handling
>>> the following data types
>>> appropriately [1]:
>>>       1. For fixed-length data types (char, binary), the length should be
>>>          defined by the schema definition and not be user-defined.
>>>       2. For variable-length data types (varchar, varbinary), the length
>>>          should be defined by the schema definition, but allow for
>>>          user-defined lengths that are smaller than the schema definition.
>>>
>>>
>>>
>>> Looking forward to your feedback :)
>>>
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-32993
>>>
>>>
>>> Best,
>>> Yubin
>>>
>>>
