Thanks, Jane, for providing examples that make the discussion clearer.
Thanks, Lincoln and Xuyang, for your feedback. I agree wholeheartedly that
it is better to throw an error than to silently ignore the invalid option.


Taking the examples provided:
1. For fixed-length data types, both DDLs specify a custom length, so each
should throw an exception like 'User-defined length of the fixed-length field
f0 is not supported.'
2. For variable-length data types:
The first DDL is legal and can be executed. If an illegal user-defined length
is configured, an exception is thrown like 'User-defined length of the VARCHAR
field %s should be shorter than the schema definition.'
For the second DDL, the maximum length of VARCHAR and VARBINARY is very large
(2^31 - 1). If the user does not specify a smaller field length, fields that
occupy a huge amount of memory (estimated at more than 2GB) would be generated
by default, which can easily lead to "java.lang.OutOfMemoryError: Java heap
space". I therefore recommend keeping the default length of these two types at
100, as before, while allowing the length to be configured to a value less
than 2^31 - 1.
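
The rules above can be sketched roughly as follows. This is only an
illustration in Python, not the connector's actual code: `resolve_length`, the
type-name sets, and the constants are hypothetical names, and the sketch
assumes that a user-defined length equal to the schema length is accepted.

```python
from typing import Optional

# All names below are hypothetical; none of them are Flink APIs.
FIXED_LENGTH_TYPES = {"CHAR", "BINARY"}
VARIABLE_LENGTH_TYPES = {"VARCHAR", "VARBINARY"}
MAX_LENGTH = 2**31 - 1      # schema length of STRING, i.e. VARCHAR(2147483647)
DEFAULT_VAR_LENGTH = 100    # proposed default when the schema length is the maximum


def resolve_length(field: str, type_name: str, schema_length: int,
                   user_length: Optional[int] = None) -> int:
    """Return the length the generator would use, or raise ValueError."""
    if type_name in FIXED_LENGTH_TYPES:
        # The schema alone defines the length; any 'fields.<name>.length' is rejected.
        if user_length is not None:
            raise ValueError(
                f"User-defined length of the fixed-length field {field} "
                "is not supported.")
        return schema_length

    if type_name in VARIABLE_LENGTH_TYPES:
        if user_length is None:
            # Fall back to 100 for maximum-length types to avoid huge values.
            if schema_length == MAX_LENGTH:
                return DEFAULT_VAR_LENGTH
            return schema_length
        # Assumption: a length equal to the schema length is allowed.
        if user_length > schema_length:
            raise ValueError(
                f"User-defined length of the {type_name} field {field} "
                "should be shorter than the schema definition.")
        return user_length

    raise ValueError(f"Type {type_name} has no length constraint in this sketch.")
```

Under these assumptions, VARCHAR(20) with 'fields.f0.length' = '10' resolves
to 10, STRING without the option resolves to 100, and CHAR(5) with any
user-defined length is rejected.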


Looking forward to your suggestions, thanks!


Best!
Yubin


---- Replied Message ----
| From | Xuyang<xyzhong...@163.com> |
| Date | 11/22/2023 12:02 |
| To | <dev@flink.apache.org> |
| Subject | Re:Re: [DISCUSS][FLINK-32993] Datagen connector handles 
length-constrained fields according to the schema definition by default |
Hi, Yubin and Jane.


Big +1 for this fix.


I also agree with Lincoln's view about throwing an error instead of leaving all
the complexity to the framework
when the length in the schema and the length in the WITH options obviously
conflict.


Of the four examples Jane provided, I think only the one below should pass
validation; the others should probably
throw a clear exception.


```
CREATE TABLE foo (
f0 VARCHAR(20)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
```



--

Best!
Xuyang





At 2023-11-21 23:37:32, "Lincoln Lee" <lincoln.8...@gmail.com> wrote:
Thanks Yubin and Jane for the discussion!

+1 to fixing this bug. Although datagen is usually used as a test source, it's
important to provide correct behavior for users.

For an invalid field length configured by the user, I think it's better to
raise an error than to silently fall back to the default value.

Take Jane's examples above:
1. For fixed-length data types, we should not accept a WITH option that
overrides the length semantics in the schema.
2. For variable-length data types, both DDLs look OK, since STRING is
equal to VARCHAR(2147483647) and the user-defined length does not exceed the
declared length.
However, the following one is invalid:
CREATE TABLE t1 (
f0 VARCHAR(128)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '256');

Another thing we may also want to consider (not a bug, but relevant) is
supporting variable-length semantics for VARCHAR. Since the length 128
in VARCHAR(128) is just the maximum length, we could extend datagen to generate
variable-length values (maybe with a new option to enable it, e.g.,
'fields.f0.var-len'='true'). Of course, this is a new feature that is not
part of this problem.
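
A minimal sketch of what such variable-length generation could look like, in
Python for illustration only. `generate_varchar` and its `var_len` flag are
hypothetical stand-ins for the suggested option; the minimum length of 1 and
the lowercase alphabet are arbitrary assumptions, not datagen behavior.

```python
import random
import string


def generate_varchar(max_length: int, var_len: bool, rng: random.Random) -> str:
    """Generate a random string for a VARCHAR(max_length) field (hypothetical helper)."""
    # With var_len enabled, draw a length uniformly from [1, max_length];
    # otherwise always produce exactly max_length characters.
    length = rng.randint(1, max_length) if var_len else max_length
    # Assumption: lowercase ASCII letters as the alphabet.
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
```

With `var_len=False` every value for VARCHAR(128) has 128 characters; with
`var_len=True` the lengths vary between 1 and 128.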

Best,
Lincoln Lee


On Tue, Nov 21, 2023 at 21:07, Jane Chan <qingyue....@gmail.com> wrote:

Hi Yubin,

Thanks for driving this discussion. Perhaps a specific example can better
illustrate the current issue.

Consider the following DDL: f0 will always be generated with a default
char length of 100, regardless of CHAR(5), because the connector option
'fields.f0.length' is not specified [1].

CREATE TABLE foo (
f0 CHAR(5)
) WITH ('connector' = 'datagen');


Since a fixed-length type usually specifies its length explicitly in the
DDL, the current design can be somewhat confusing for users.

However, for the proposed changes, it would be preferable to provide
specific details on how to handle the "not be user-defined" scenario. For
example, should it be ignored or should an exception be thrown?

To be more specific,
1. For fixed-length data types, what happens with the following two DDLs?

CREATE TABLE foo (
f0 CHAR(5)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');

CREATE TABLE bar (
f0 CHAR(5)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '1');


2. For variable-length data types, what happens with the following two DDLs?

CREATE TABLE meow (
f0 VARCHAR(20)
) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');

CREATE TABLE purr (
f0 STRING
) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');


Best,
Jane

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/datagen/#fields-length


On Mon, Nov 20, 2023 at 8:46 PM 李宇彬 <lixin58...@163.com> wrote:

Hi everyone,


Currently, the Datagen connector generates data that doesn't match the
schema definition
when dealing with fixed-length and variable-length fields. It defaults to
a unified length of 100
and requires manual configuration by the user. This violates the
correctness of schema constraints
and hampers ease of use.


Jane Chan and I have discussed this offline, and I will summarize our
discussion below.


To enhance the datagen connector to automatically generate data that
conforms to the schema
definition without additional manual configuration, we propose handling
the following data types
appropriately [1]:
1. For fixed-length data types (char, binary), the length should be
defined by the schema definition
and not be user-defined.
2. For variable-length data types (varchar, varbinary), the length
should be defined by the schema
definition, but allow for user-defined lengths that are smaller
than the schema definition.



Looking forward to your feedback :)


[1] https://issues.apache.org/jira/browse/FLINK-32993


Best,
Yubin

