Thanks Yubin and Jane for the discussion! +1 to fixing this bug. Although the datagen connector is usually used as a test source, it's still important that it behaves correctly for users.
As for an invalid field length configured by the user, I think it's better to raise an error than to silently fall back to a default value. Taking Jane's examples above:

1. For fixed-length data types, we should not accept a WITH option that overrides the length semantics declared in the schema.
2. For variable-length data types, both DDLs look fine, since STRING is equal to VARCHAR(2147483647) and the user-defined length does not exceed the declared length. The following one, however, is invalid:

CREATE TABLE t1 (
  f0 VARCHAR(128)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '256');

Another thing we may also take into consideration (not a bug, but relevant) is supporting variable-length semantics for VARCHAR. Since the 128 in VARCHAR(128) is only the maximum length, we could extend datagen to generate variable-length values (perhaps behind a new option, e.g., 'fields.f0.var-len' = 'true'). Of course, this is a new feature and not part of this problem.

Best,
Lincoln Lee

Jane Chan <qingyue....@gmail.com> wrote on Tue, Nov 21, 2023 at 21:07:

> Hi Yubin,
>
> Thanks for driving this discussion. Perhaps a specific example can better
> illustrate the current issue.
>
> Considering the following DDL, f0 will always be generated with a default
> char length of 100, regardless of the declared CHAR(5), because the
> connector option 'fields.f0.length' is not specified [1].
>
>> CREATE TABLE foo (
>>   f0 CHAR(5)
>> ) WITH ('connector' = 'datagen');
>
> Since it's often the case for a fixed-length type to specify its length
> explicitly in the DDL, the current design can be confusing for users to
> some extent.
>
> However, for the proposed changes, it would be preferable to provide
> specific details on how to handle the "not be user-defined" scenario. For
> example, should the option be ignored, or should an exception be thrown?
>
> To be more specific,
> 1. For fixed-length data types, what happens for the following two DDLs?
>
>> CREATE TABLE foo (
>>   f0 CHAR(5)
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>
>> CREATE TABLE bar (
>>   f0 CHAR(5)
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '1');
>
> 2. For variable-length data types, what happens for the following two DDLs?
>
>> CREATE TABLE meow (
>>   f0 VARCHAR(20)
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>>
>> CREATE TABLE purr (
>>   f0 STRING
>> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>
> Best,
> Jane
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/datagen/#fields-length
>
> On Mon, Nov 20, 2023 at 8:46 PM 李宇彬 <lixin58...@163.com> wrote:
>
>> Hi everyone,
>>
>> Currently, the Datagen connector generates data that doesn't match the
>> schema definition when dealing with fixed-length and variable-length
>> fields. It defaults to a unified length of 100 and requires manual
>> configuration by the user. This violates the correctness of schema
>> constraints and hampers ease of use.
>>
>> Jane Chan and I have discussed this offline, and I will summarize our
>> discussion below.
>>
>> To enhance the datagen connector to automatically generate data that
>> conforms to the schema definition without additional manual
>> configuration, we propose handling the following data types
>> appropriately [1]:
>> 1. For fixed-length data types (char, binary), the length should be
>>    defined by the schema definition and not be user-defined.
>> 2. For variable-length data types (varchar, varbinary), the length
>>    should be defined by the schema definition, but allow for
>>    user-defined lengths that are smaller than the schema definition.
>>
>> Looking forward to your feedback :)
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-32993
>>
>> Best,
>> Yubin
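
The length rules discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions, not the datagen connector's actual code: the class `DatagenLengthCheck`, the enum `Kind`, and the method `resolveLength` are hypothetical names, and the sketch follows the suggestion of raising an error instead of silently ignoring an invalid 'fields.#.length' option. How an unconfigured STRING (i.e. VARCHAR(2147483647)) should be capped is not settled in the thread, and the sketch does not address it.

```java
import java.util.OptionalInt;

// Hypothetical helper; none of these names exist in Flink.
final class DatagenLengthCheck {

    // Assumed classification: FIXED covers CHAR/BINARY, VARIABLE covers VARCHAR/VARBINARY.
    enum Kind { FIXED, VARIABLE }

    /**
     * Returns the length to generate for a field.
     *
     * @param field            field name, used only in error messages
     * @param kind             whether the declared type is fixed- or variable-length
     * @param declaredLength   length from the schema, e.g. 5 for CHAR(5)
     * @param configuredLength value of 'fields.<field>.length', if the user set it
     */
    static int resolveLength(String field, Kind kind, int declaredLength,
                             OptionalInt configuredLength) {
        if (!configuredLength.isPresent()) {
            // No option given: the schema definition decides.
            return declaredLength;
        }
        int configured = configuredLength.getAsInt();
        if (kind == Kind.FIXED) {
            // Fixed-length types must not be overridden by a WITH option.
            throw new IllegalArgumentException(
                    "'fields." + field + ".length' is not allowed for a fixed-length type");
        }
        if (configured > declaredLength) {
            // Variable-length types accept only lengths within the declared maximum.
            throw new IllegalArgumentException(
                    "'fields." + field + ".length' = " + configured
                            + " exceeds the declared length " + declaredLength);
        }
        return configured;
    }

    public static void main(String[] args) {
        // CHAR(5) without the option: schema length is used.
        System.out.println(resolveLength("f0", Kind.FIXED, 5, OptionalInt.empty()));
        // VARCHAR(20) with 'fields.f0.length' = '10': accepted.
        System.out.println(resolveLength("f0", Kind.VARIABLE, 20, OptionalInt.of(10)));
        try {
            // VARCHAR(128) with 'fields.f0.length' = '256': rejected.
            resolveLength("f0", Kind.VARIABLE, 128, OptionalInt.of(256));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under these assumptions, both CHAR(5) DDLs that set 'fields.f0.length' (to '10' or to '1') are rejected, the VARCHAR(20) and STRING DDLs with length '10' are accepted, and the VARCHAR(128) DDL with length '256' is rejected.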