Thanks to Jane for providing examples that make the discussion clearer. Thanks to Lincoln and Xuyang for your feedback; I wholeheartedly agree that it is better to throw an error than to ignore the invalid length silently.
Taking the examples provided:

For fixed-length data types, both DDLs that set a custom length should throw an exception such as 'User-defined length of the fixed-length field f0 is not supported.'

For variable-length data types, the first DDL is legal and can be executed. If an illegal user-defined length is configured, an exception such as 'User-defined length of the VARCHAR field %s should be shorter than the schema definition.' will be thrown. As for the second DDL, the maximum length of VARCHAR and VARBINARY is very large (2^31 - 1), so when users do not specify a smaller field length, fields that occupy a huge amount of memory (estimated at more than 2 GB) would be generated by default, which can easily lead to "java.lang.OutOfMemoryError: Java heap space". I therefore recommend keeping the default length of these two types at 100, as before, while allowing the length to be configured to any value less than 2^31 - 1.

Looking forward to your suggestions, thanks!

Best!
Yubin

---- Replied Message ----
| From | Xuyang<xyzhong...@163.com> |
| Date | 11/22/2023 12:02 |
| To | <dev@flink.apache.org> |
| Subject | Re:Re: [DISCUSS][FLINK-32993] Datagen connector handles length-constrained fields according to the schema definition by default |

Hi, Yubin and Jane.

Big +1 for this fix. I also agree with Lincoln's view about throwing an error instead of leaving all the complexity to the framework when the length in the schema and the length in the WITH options are obviously in conflict.

Of the four examples Jane provides, I think only the one below can pass validation; the others should probably throw a clear exception.

```
CREATE TABLE foo (
    f0 VARCHAR(20)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
```

--
Best!
Xuyang

At 2023-11-21 23:37:32, "Lincoln Lee" <lincoln.8...@gmail.com> wrote:

Thanks Yubin and Jane for the discussion!

+1 to fix this bug. Although datagen is usually used as a test source, it's important to provide the correct behavior for users.
For an invalid field length configured by users, I think it's better to raise an error instead of silently using the default value. Take Jane's example above:

1. For fixed-length data types, we should not accept another WITH option that overwrites the length semantics in the schema.
2. For variable-length data types, both DDLs look OK, since STRING is equal to VARCHAR(2147483647) and the user-defined length does not exceed the definition, but the following one is invalid:

    CREATE TABLE t1 (
        f0 VARCHAR(128)
    ) WITH ('connector' = 'datagen', 'fields.f0.length' = '256');

Another thing we may also take into consideration (not a bug, but relevant) is supporting variable-length semantics for varchar. Since the length 128 in varchar(128) is just the maximum length, we could extend datagen to generate variable-length values (maybe with a new option to enable it, e.g., 'fields.f0.var-len'='true'). Of course, this is a new feature that is not part of this problem.

Best,
Lincoln Lee

Jane Chan <qingyue....@gmail.com> wrote on Tue, Nov 21, 2023 at 21:07:

Hi Yubin,

Thanks for driving this discussion. Perhaps a specific example can better illustrate the current issue.

Consider the following DDL: f0 will always be generated with a default char length of 100, regardless of char(5), because the connector option 'fields.f0.length' is not specified [1].

    CREATE TABLE foo (
        f0 CHAR(5)
    ) WITH ('connector' = 'datagen');

Since a fixed-length type usually has its length specified explicitly in the DDL, the current design can be confusing for users to some extent.

However, for the proposed changes, it would be preferable to provide specific details on how to handle the "not be user-defined" scenario. For example, should the option be ignored, or should an exception be thrown?

To be more specific:

1. For fixed-length data types, what happens for the following two DDLs?

    CREATE TABLE foo (
        f0 CHAR(5)
    ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');

    CREATE TABLE bar (
        f0 CHAR(5)
    ) WITH ('connector' = 'datagen', 'fields.f0.length' = '1');

2. For variable-length data types, what happens for the following two DDLs?

    CREATE TABLE meow (
        f0 VARCHAR(20)
    ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');

    CREATE TABLE purr (
        f0 STRING
    ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');

Best,
Jane

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/datagen/#fields-length

On Mon, Nov 20, 2023 at 8:46 PM 李宇彬 <lixin58...@163.com> wrote:

Hi everyone,

Currently, the Datagen connector generates data that doesn't match the schema definition when dealing with fixed-length and variable-length fields. It defaults to a unified length of 100 and requires manual configuration by the user. This violates the correctness of schema constraints and hampers ease of use.

Jane Chan and I have discussed this offline, and I will summarize our discussion below.

To enhance the datagen connector so that it automatically generates data conforming to the schema definition without additional manual configuration, we propose handling the following data types appropriately [1]:

1. For fixed-length data types (char, binary), the length should be defined by the schema definition and not be user-defined.
2. For variable-length data types (varchar, varbinary), the length should be defined by the schema definition, but user-defined lengths that are smaller than the schema definition are allowed.

Looking forward to your feedback :)

[1] https://issues.apache.org/jira/browse/FLINK-32993

Best,
Yubin
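
For illustration only, the rules discussed in this thread could be sketched as a small length-resolution routine. This is not the Flink implementation: the class `DatagenLengthCheck`, the enum `Kind`, the method `resolveLength`, and the constant `DEFAULT_VAR_LENGTH` are hypothetical names. The error messages are the ones proposed in Yubin's reply. Two points are assumptions: a user-defined length equal to the schema length is accepted, and the fallback to 100 applies only when the schema length is the maximum (STRING or the equivalent VARBINARY) and no length option is given.

```java
import java.util.OptionalInt;

// Hypothetical sketch of the validation rules from this thread; not Flink code.
class DatagenLengthCheck {

    enum Kind { CHAR, BINARY, VARCHAR, VARBINARY }

    // Default kept at 100 for max-length variable types, as recommended in the thread.
    static final int DEFAULT_VAR_LENGTH = 100;

    // Returns the length datagen would use for a field, or throws on a conflicting option.
    static int resolveLength(String field, Kind kind, int schemaLength, OptionalInt userLength) {
        boolean fixed = kind == Kind.CHAR || kind == Kind.BINARY;
        if (fixed) {
            // Fixed-length types: the schema decides; any 'fields.#.length' option is rejected.
            if (userLength.isPresent()) {
                throw new IllegalArgumentException(String.format(
                        "User-defined length of the fixed-length field %s is not supported.", field));
            }
            return schemaLength;
        }
        if (userLength.isPresent()) {
            int len = userLength.getAsInt();
            // Assumption: a length equal to the schema length is still accepted.
            if (len > schemaLength) {
                throw new IllegalArgumentException(String.format(
                        "User-defined length of the %s field %s should be shorter than the schema definition.",
                        kind, field));
            }
            return len;
        }
        // No option given: follow the schema, except for the 2^31 - 1 case,
        // where generating full-length values would risk an OutOfMemoryError.
        return schemaLength == Integer.MAX_VALUE ? DEFAULT_VAR_LENGTH : schemaLength;
    }

    public static void main(String[] args) {
        // CHAR(5) without a length option: the schema length is used.
        System.out.println(resolveLength("f0", Kind.CHAR, 5, OptionalInt.empty()));
        // VARCHAR(20) with 'fields.f0.length' = '10': the smaller user length is used.
        System.out.println(resolveLength("f0", Kind.VARCHAR, 20, OptionalInt.of(10)));
        // STRING without a length option: falls back to the default.
        System.out.println(resolveLength("f0", Kind.VARCHAR, Integer.MAX_VALUE, OptionalInt.empty()));
    }
}
```

Under these assumptions, Jane's `foo` and `bar` DDLs with CHAR(5) plus a length option are rejected, `meow` and `purr` resolve to length 10, and Lincoln's `t1` example (VARCHAR(128) with length 256) is rejected.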