Re: [DISCUSS] Hierarchies in ConfigOption

Timo Walther Wed, 06 May 2020 05:37:43 -0700

Cool, so let's go with:

format = json
json.fail-on-missing-field = true
json.ignore-parse-error = true


value.format = json
value.json.fail-on-missing-field = true
value.json.ignore-parse-error = true

Regards,
Timo


On 06.05.20 12:39, Jingsong Li wrote:

Hi,

+1 to:
format = parquet
parquet.compression = ...
parquet.block.size = ...
parquet.page.size = ...

For the formats like parquet and orc,
Not just Flink itself, but this way also let Flink keys compatible with the
property keys of Hadoop / Hive / Spark.

And like Jark said, this way works for Kafka key value too.

Best,
Jingsong Lee

On Wed, May 6, 2020 at 6:19 PM Jark Wu <[email protected]> wrote:

Hi,

I think Timo proposed a good idea to make both side happy. That is:

format = json
json.fail-on-missing-field = true
json.ignore-parse-error = true

value.format = json
value.json.fail-on-missing-field = true
value.json.ignore-parse-error = true

This is a valid hierarchies. Besides, it's more clear that the option
belongs to a specific component (i.e. json).
This will be more readable when we introducing more formats, e.g. parquet.

format = parquet
parquet.compression = ...
parquet.block.size = ...
parquet.page.size = ...

is more readable than current style:

format = parquet
format.compression = ...
format.block.size = ...
format.page.size = ...

To sum up, I'm +1 to use "format = json",  "json.fail-on-missing-field =
true".

Best,
Jark

On Wed, 6 May 2020 at 17:12, Danny Chan <[email protected]> wrote:

Hi, everyone ~

Allows me to share some thoughts here.

Personally i think for SQL, "format" is obviously better than "

format.name",

it is more concise and straight-forward, similar with Presto FORMAT[2]

and

KSQL VALUE_FORMAT[1]; i think we move from "connector.type" to

"connector"

for the same reason, the "type" or "name" suffix is implicit, SQL syntax
like the DDL is a top-level user API, so from my side keeping good
user-friendly syntax is more important.

@Timo I'm big +1 for the a good code style guide, but that does not mean
we should go for a json-style table options in the DDL, the DDL could

have

its own contract. Can we move "represent these config options in YAML" to
another topic ? Otherwise, how should we handle the "connector" key,

should

we prefix all the table options with "connector" ? The original inention

of

FLIP-122 is to remove some redundant prefix/suffix of the table options
because they are obviously implicit there, and the "connector." prefix

and

the ".type" or ".name" suffix are the ones we most want to delete.

@Dawid Although ".type" is just another 4 characters, but we force the

SQL

users to do the thing that is obvious reduadant, i know serialize catalog
table to YAML or use the options in DataStream has similar keys request,
but they are different use cases that i believe many SQL user would not
encounter, that means we force many users to obey rules for cases they
would never have.


[1] https://docs.ksqldb.io/en/latest/developer-guide/create-a-table/
[2] https://prestodb.io/docs/current/sql/create-table.html

Best,
Danny Chan
在 2020年5月4日 +0800 PM11:34，Till Rohrmann <[email protected]>，写道：

Hi everyone,

I like Timo's proposal to organize our configuration more hierarchical
since this is what the coding guide specifies. The benefit I see is

that

config options belonging to the same concept will be found in the same
nested object. Moreover, it will be possible to split the configuration
into unrelated parts which are fed to the respective components. That

way

one has a much better separation of concern since component A cannot

read

the configuration of component B.

Concerning Timo's last two proposals:

If fail-on-missing-field is a common configuration shared by all

formats,

then I would go with the first option:

format.kind: json
format.fail-on-missing-field: true

If fail-on-missing-field is specific for json, then one could go with

format: json
json.fail-on-missing-field: true

or

format.kind: json
format.json.fail-on-missing-field: true

Cheers,
Till


On Fri, May 1, 2020 at 11:55 AM Timo Walther <[email protected]>

wrote:

Hi Jark,

yes, in theory every connector can design options as they like. But

for

user experience and good coding style we should be consistent in

Flink

connectors and configuration. Because implementers of new connectors
will copy the design of existing ones.

Furthermore, I could image that people in the DataStream API would

also

like to configure their connector based on options in the near

future.

It might be the case that Flink DataStream API connectors will reuse

the

ConfigOptions from Table API for consistency.

I'm favoring either:

format.kind = json
format.fail-on-missing-field: true

Or:

format = json
json.fail-on-missing-field: true

Both are valid hierarchies.

Regards,
Timo


On 30.04.20 17:57, Jark Wu wrote:

Hi Dawid,

I just want to mention one of your response,

What you described with
'format' = 'csv',
'csv.allow-comments' = 'true',
'csv.ignore-parse-errors' = 'true'
would not work though as the `format` prefix is mandatory in the

sources

as only the properties with format

will be passed to the format factory in majority of cases. We

already

have some implicit contracts.

IIUC, in FLIP-95 and FLIP-122, the property key style are totally

decided

by connectors, not the framework.
So I custom connector can define above properties, and extract the

value

of

'format', i.e. 'csv', to find the format factory.
And extract the properties with `csv.` prefix and remove the

prefix,

and

pass the properties (e.g. 'allow-comments' = 'true')
into the format factory to create format.

So there is no a strict guarantee to have a "nested JSON style"

properties.

Users can still develop a custom connector with this
un-hierarchy properties and works well.

'format' = 'json',
'format.fail-on-missing-field' = 'false'

Best,
Jark


On Thu, 30 Apr 2020 at 14:29, Dawid Wysakowicz <

[email protected]>

wrote:

Hi all,

I'd like to start with a comment that I am ok with the current

state of

the FLIP-122 if there is a strong preference for it.

Nevertheless I

still

like the idea of adding `type` to the `format` to have it as

`format.type`

= `json`.

I wanted to clarify a few things though:

@Jingsong As far as I see it most of the users copy/paste the

properties

from the documentation to the SQL, so I don't think additional

four

characters are too cumbersome. Plus if you force the additional

suffix

onto

all the options of a format you introduce way more boilerplate

than if

we

added the `type/kind/name`

@Kurt I agree that we cannot force it, but I think it is more of

question to set standards/implicit contracts on the properties.

What you

described with
'format' = 'csv',
'csv.allow-comments' = 'true',
'csv.ignore-parse-errors' = 'true'

would not work though as the `format` prefix is mandatory in the

sources

as only the properties with format will be passed to the format

factory

in

majority of cases. We already have some implicit contracts.

@Forward I did not necessarily get the example. Aren't json and

bson two

separate formats? Do you mean you can have those two at the same

time?

Why

do you need to differentiate the options for each? The way I see

it is:


‘format(.name)' = 'json',
‘format.fail-on-missing-field' = 'false'

or

‘format(.name)' = 'bson',
‘format.fail-on-missing-field' = 'false'

@Benchao I'd be fine with any of name, kind, type(this we already

had in

the past)

Best,
Dawid

On 30/04/2020 04:17, Forward Xu wrote:

Here I have a little doubt. At present, our json only supports

the

conventional json format. If we need to implement json with bson,

json

with

avro, etc., how should we express it?
Do you need like the following:

‘format.name' = 'json',

‘format.json.fail-on-missing-field' = 'false'


‘format.name' = 'bson',

‘format.bson.fail-on-missing-field' = ‘false'


Best,

Forward

Benchao Li <[email protected]> <[email protected]>

于2020年4月30日周四

上午9:58写道：



Thanks Timo for staring the discussion.

Generally I like the idea to keep the config align with a

standard

like

json/yaml.

 From the user's perspective, I don't use table configs from a

config

file

like yaml or json for now,
And it's ok to change it to yaml like style. Actually we didn't

know

that

this could be a yaml like
configuration hierarchy. If it has a hierarchy, we maybe consider

that

in

the future to load the
config from a yaml/json file.

Regarding the name,
'format.kind' looks fine to me. However there is another name

from

the

top

of my head:
'format.name', WDYT?

Dawid Wysakowicz <[email protected]> <

[email protected]>

于2020年4月29日周三 下午11:56写道：



Hi all,

I also wanted to share my opinion.

When talking about a ConfigOption hierarchy we use for

configuring

Flink

cluster I would be a strong advocate for keeping a

yaml/hocon/json/...

compatible style. Those options are primarily read from a file

and

thus

should at least try to follow common practices for nested formats

if we

ever decide to switch to one.

Here the question is about the properties we use in SQL

statements. The

origin/destination of these usually will be external catalog,

usually in


a

flattened(key/value) representation so I agree it is not as

important as


in

the aforementioned case. Nevertheless having a yaml based catalog

or


being

able to have e.g. yaml based snapshots of a catalog in my opinion

is

appealing. At the same time cost of being able to have a nice
yaml/hocon/json representation is just adding a single suffix to

single(at most 2 key + value) property. The question is between

`format`


=

`json` vs `format.kind` = `json`. That said I'd be slighty in

favor of

doing it.

Just to have a full picture. Both cases can be represented in

yaml, but

the difference is significant:
format: 'json'
format.option: 'value'

vs
format:
kind: 'json'

option: 'value'

Best,
Dawid

On 29/04/2020 17:13, Flavio Pompermaier wrote:

Personally I don't have any preference here. Compliance wih

standard


YAML

parser is probably more important

On Wed, Apr 29, 2020 at 5:10 PM Jark Wu <[email protected]> <

[email protected]> wrote:



 From a user's perspective, I prefer the shorter one

"format=json",


because

it's more concise and straightforward. The "kind" is redundant

for


users.

Is there a real case requires to represent the configuration in

JSON

style?
As far as I can see, I don't see such requirement, and everything

works

fine by now.

So I'm in favor of "format=json". But if the community insist to

follow

code style on this, I'm also fine with the longer one.

Btw, I also CC user mailing list to listen more user's feedback.

Because I

think this is relative to usability.

Best,
Jark

On Wed, 29 Apr 2020 at 22:09, Chesnay Schepler <

[email protected]>

[email protected]>

wrote:

Therefore, should we advocate instead:

'format.kind' = 'json',
'format.fail-on-missing-field' = 'false'


Yes. That's pretty much it.

This is reasonable important to nail down as with such

violations I

believe we could not actually switch to a standard YAML parser.

On 29/04/2020 16:05, Timo Walther wrote:

Hi everyone,

discussions around ConfigOption seem to be very popular recently.

So I

would also like to get some opinions on a different topic.

How do we represent hierarchies in ConfigOption? In FLIP-122, we
agreed on the following DDL syntax:

CREATE TABLE fs_table (
...
) WITH (
'connector' = 'filesystem',
'path' = 'file:///path/to/whatever',
'format' = 'csv',
'format.allow-comments' = 'true',
'format.ignore-parse-errors' = 'true'
);

Of course this is slightly different from regular Flink core
configuration but a connector still needs to be configured based

on

these options.

However, I think this FLIP violates our code style guidelines

because

'format' = 'json',
'format.fail-on-missing-field' = 'false'

is an invalid hierarchy. `format` cannot be a string and a

top-level

object at the same time.

We have similar problems in our runtime configuration:

state.backend=
state.backend.incremental=
restart-strategy=
restart-strategy.fixed-delay.delay=
high-availability=
high-availability.cluster-id=

The code style guide states "Think of the configuration as nested
objects (JSON style)". So such hierarchies cannot be represented

in


a

nested JSON style.

Therefore, should we advocate instead:

'format.kind' = 'json',
'format.fail-on-missing-field' = 'false'

What do you think?

Thanks,
Timo

[1]

https://flink.apache.org/contributing/code-style-and-quality-components.html#configuration-changes


--

Benchao Li
School of Electronics Engineering and Computer Science, Peking

UniversityTel:+86-15650713730

Email: [email protected]; [email protected]

Re: [DISCUSS] Hierarchies in ConfigOption

Reply via email to