Hi, +1 to: format = parquet parquet.compression = ... parquet.block.size = ... parquet.page.size = ...
For the formats like parquet and orc, Not just Flink itself, but this way also let Flink keys compatible with the property keys of Hadoop / Hive / Spark. And like Jark said, this way works for Kafka key value too. Best, Jingsong Lee On Wed, May 6, 2020 at 6:19 PM Jark Wu <imj...@gmail.com> wrote: > Hi, > > I think Timo proposed a good idea to make both side happy. That is: > > format = json > json.fail-on-missing-field = true > json.ignore-parse-error = true > > value.format = json > value.json.fail-on-missing-field = true > value.json.ignore-parse-error = true > > This is a valid hierarchies. Besides, it's more clear that the option > belongs to a specific component (i.e. json). > This will be more readable when we introducing more formats, e.g. parquet. > > format = parquet > parquet.compression = ... > parquet.block.size = ... > parquet.page.size = ... > > is more readable than current style: > > format = parquet > format.compression = ... > format.block.size = ... > format.page.size = ... > > To sum up, I'm +1 to use "format = json", "json.fail-on-missing-field = > true". > > Best, > Jark > > On Wed, 6 May 2020 at 17:12, Danny Chan <yuzhao....@gmail.com> wrote: > > > Hi, everyone ~ > > > > Allows me to share some thoughts here. > > > > Personally i think for SQL, "format" is obviously better than " > format.name", > > it is more concise and straight-forward, similar with Presto FORMAT[2] > and > > KSQL VALUE_FORMAT[1]; i think we move from "connector.type" to > "connector" > > for the same reason, the "type" or "name" suffix is implicit, SQL syntax > > like the DDL is a top-level user API, so from my side keeping good > > user-friendly syntax is more important. > > > > @Timo I'm big +1 for the a good code style guide, but that does not mean > > we should go for a json-style table options in the DDL, the DDL could > have > > its own contract. Can we move "represent these config options in YAML" to > > another topic ? Otherwise, how should we handle the "connector" key, > should > > we prefix all the table options with "connector" ? The original inention > of > > FLIP-122 is to remove some redundant prefix/suffix of the table options > > because they are obviously implicit there, and the "connector." prefix > and > > the ".type" or ".name" suffix are the ones we most want to delete. > > > > @Dawid Although ".type" is just another 4 characters, but we force the > SQL > > users to do the thing that is obvious reduadant, i know serialize catalog > > table to YAML or use the options in DataStream has similar keys request, > > but they are different use cases that i believe many SQL user would not > > encounter, that means we force many users to obey rules for cases they > > would never have. > > > > > > [1] https://docs.ksqldb.io/en/latest/developer-guide/create-a-table/ > > [2] https://prestodb.io/docs/current/sql/create-table.html > > > > Best, > > Danny Chan > > 在 2020年5月4日 +0800 PM11:34,Till Rohrmann <trohrm...@apache.org>,写道: > > > Hi everyone, > > > > > > I like Timo's proposal to organize our configuration more hierarchical > > > since this is what the coding guide specifies. The benefit I see is > that > > > config options belonging to the same concept will be found in the same > > > nested object. Moreover, it will be possible to split the configuration > > > into unrelated parts which are fed to the respective components. That > way > > > one has a much better separation of concern since component A cannot > read > > > the configuration of component B. > > > > > > Concerning Timo's last two proposals: > > > > > > If fail-on-missing-field is a common configuration shared by all > formats, > > > then I would go with the first option: > > > > > > format.kind: json > > > format.fail-on-missing-field: true > > > > > > If fail-on-missing-field is specific for json, then one could go with > > > > > > format: json > > > json.fail-on-missing-field: true > > > > > > or > > > > > > format.kind: json > > > format.json.fail-on-missing-field: true > > > > > > Cheers, > > > Till > > > > > > > > > On Fri, May 1, 2020 at 11:55 AM Timo Walther <twal...@apache.org> > wrote: > > > > > > > Hi Jark, > > > > > > > > yes, in theory every connector can design options as they like. But > for > > > > user experience and good coding style we should be consistent in > Flink > > > > connectors and configuration. Because implementers of new connectors > > > > will copy the design of existing ones. > > > > > > > > Furthermore, I could image that people in the DataStream API would > also > > > > like to configure their connector based on options in the near > future. > > > > It might be the case that Flink DataStream API connectors will reuse > > the > > > > ConfigOptions from Table API for consistency. > > > > > > > > I'm favoring either: > > > > > > > > format.kind = json > > > > format.fail-on-missing-field: true > > > > > > > > Or: > > > > > > > > format = json > > > > json.fail-on-missing-field: true > > > > > > > > Both are valid hierarchies. > > > > > > > > Regards, > > > > Timo > > > > > > > > > > > > On 30.04.20 17:57, Jark Wu wrote: > > > > > Hi Dawid, > > > > > > > > > > I just want to mention one of your response, > > > > > > > > > > > What you described with > > > > > > 'format' = 'csv', > > > > > > 'csv.allow-comments' = 'true', > > > > > > 'csv.ignore-parse-errors' = 'true' > > > > > > would not work though as the `format` prefix is mandatory in the > > sources > > > > > as only the properties with format > > > > > > will be passed to the format factory in majority of cases. We > > already > > > > > have some implicit contracts. > > > > > > > > > > IIUC, in FLIP-95 and FLIP-122, the property key style are totally > > decided > > > > > by connectors, not the framework. > > > > > So I custom connector can define above properties, and extract the > > value > > > > of > > > > > 'format', i.e. 'csv', to find the format factory. > > > > > And extract the properties with `csv.` prefix and remove the > prefix, > > and > > > > > pass the properties (e.g. 'allow-comments' = 'true') > > > > > into the format factory to create format. > > > > > > > > > > So there is no a strict guarantee to have a "nested JSON style" > > > > properties. > > > > > Users can still develop a custom connector with this > > > > > un-hierarchy properties and works well. > > > > > > > > > > 'format' = 'json', > > > > > 'format.fail-on-missing-field' = 'false' > > > > > > > > > > Best, > > > > > Jark > > > > > > > > > > > > > > > On Thu, 30 Apr 2020 at 14:29, Dawid Wysakowicz < > > dwysakow...@apache.org> > > > > > wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > I'd like to start with a comment that I am ok with the current > > state of > > > > > > the FLIP-122 if there is a strong preference for it. > Nevertheless I > > > > still > > > > > > like the idea of adding `type` to the `format` to have it as > > > > `format.type` > > > > > > = `json`. > > > > > > > > > > > > I wanted to clarify a few things though: > > > > > > > > > > > > @Jingsong As far as I see it most of the users copy/paste the > > properties > > > > > > from the documentation to the SQL, so I don't think additional > four > > > > > > characters are too cumbersome. Plus if you force the additional > > suffix > > > > onto > > > > > > all the options of a format you introduce way more boilerplate > > than if > > > > we > > > > > > added the `type/kind/name` > > > > > > > > > > > > @Kurt I agree that we cannot force it, but I think it is more of > a > > > > > > question to set standards/implicit contracts on the properties. > > What you > > > > > > described with > > > > > > 'format' = 'csv', > > > > > > 'csv.allow-comments' = 'true', > > > > > > 'csv.ignore-parse-errors' = 'true' > > > > > > > > > > > > would not work though as the `format` prefix is mandatory in the > > sources > > > > > > as only the properties with format will be passed to the format > > factory > > > > in > > > > > > majority of cases. We already have some implicit contracts. > > > > > > > > > > > > @Forward I did not necessarily get the example. Aren't json and > > bson two > > > > > > separate formats? Do you mean you can have those two at the same > > time? > > > > Why > > > > > > do you need to differentiate the options for each? The way I see > > it is: > > > > > > > > > > > > ‘format(.name)' = 'json', > > > > > > ‘format.fail-on-missing-field' = 'false' > > > > > > > > > > > > or > > > > > > > > > > > > ‘format(.name)' = 'bson', > > > > > > ‘format.fail-on-missing-field' = 'false' > > > > > > > > > > > > @Benchao I'd be fine with any of name, kind, type(this we already > > had in > > > > > > the past) > > > > > > > > > > > > Best, > > > > > > Dawid > > > > > > > > > > > > On 30/04/2020 04:17, Forward Xu wrote: > > > > > > > > > > > > Here I have a little doubt. At present, our json only supports > the > > > > > > conventional json format. If we need to implement json with bson, > > json > > > > with > > > > > > avro, etc., how should we express it? > > > > > > Do you need like the following: > > > > > > > > > > > > ‘format.name' = 'json', > > > > > > > > > > > > ‘format.json.fail-on-missing-field' = 'false' > > > > > > > > > > > > > > > > > > ‘format.name' = 'bson', > > > > > > > > > > > > ‘format.bson.fail-on-missing-field' = ‘false' > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > Forward > > > > > > > > > > > > Benchao Li <libenc...@gmail.com> <libenc...@gmail.com> > > 于2020年4月30日周四 > > > > 上午9:58写道: > > > > > > > > > > > > > > > > > > Thanks Timo for staring the discussion. > > > > > > > > > > > > Generally I like the idea to keep the config align with a > standard > > like > > > > > > json/yaml. > > > > > > > > > > > > From the user's perspective, I don't use table configs from a > > config > > > > file > > > > > > like yaml or json for now, > > > > > > And it's ok to change it to yaml like style. Actually we didn't > > know > > > > that > > > > > > this could be a yaml like > > > > > > configuration hierarchy. If it has a hierarchy, we maybe consider > > that > > > > in > > > > > > the future to load the > > > > > > config from a yaml/json file. > > > > > > > > > > > > Regarding the name, > > > > > > 'format.kind' looks fine to me. However there is another name > from > > the > > > > top > > > > > > of my head: > > > > > > 'format.name', WDYT? > > > > > > > > > > > > Dawid Wysakowicz <dwysakow...@apache.org> < > dwysakow...@apache.org> > > > > 于2020年4月29日周三 下午11:56写道: > > > > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > I also wanted to share my opinion. > > > > > > > > > > > > When talking about a ConfigOption hierarchy we use for > configuring > > Flink > > > > > > cluster I would be a strong advocate for keeping a > > yaml/hocon/json/... > > > > > > compatible style. Those options are primarily read from a file > and > > thus > > > > > > should at least try to follow common practices for nested formats > > if we > > > > > > ever decide to switch to one. > > > > > > > > > > > > Here the question is about the properties we use in SQL > > statements. The > > > > > > origin/destination of these usually will be external catalog, > > usually in > > > > > > > > > > > > a > > > > > > > > > > > > flattened(key/value) representation so I agree it is not as > > important as > > > > > > > > > > > > in > > > > > > > > > > > > the aforementioned case. Nevertheless having a yaml based catalog > > or > > > > > > > > > > > > being > > > > > > > > > > > > able to have e.g. yaml based snapshots of a catalog in my opinion > > is > > > > > > appealing. At the same time cost of being able to have a nice > > > > > > yaml/hocon/json representation is just adding a single suffix to > a > > > > > > single(at most 2 key + value) property. The question is between > > `format` > > > > > > > > > > > > = > > > > > > > > > > > > `json` vs `format.kind` = `json`. That said I'd be slighty in > > favor of > > > > > > doing it. > > > > > > > > > > > > Just to have a full picture. Both cases can be represented in > > yaml, but > > > > > > the difference is significant: > > > > > > format: 'json' > > > > > > format.option: 'value' > > > > > > > > > > > > vs > > > > > > format: > > > > > > kind: 'json' > > > > > > > > > > > > option: 'value' > > > > > > > > > > > > Best, > > > > > > Dawid > > > > > > > > > > > > On 29/04/2020 17:13, Flavio Pompermaier wrote: > > > > > > > > > > > > Personally I don't have any preference here. Compliance wih > > standard > > > > > > > > > > > > YAML > > > > > > > > > > > > parser is probably more important > > > > > > > > > > > > On Wed, Apr 29, 2020 at 5:10 PM Jark Wu <imj...@gmail.com> < > > > > imj...@gmail.com> wrote: > > > > > > > > > > > > > > > > > > From a user's perspective, I prefer the shorter one > "format=json", > > > > > > > > > > > > because > > > > > > > > > > > > it's more concise and straightforward. The "kind" is redundant > for > > > > > > > > > > > > users. > > > > > > > > > > > > Is there a real case requires to represent the configuration in > > JSON > > > > > > style? > > > > > > As far as I can see, I don't see such requirement, and everything > > works > > > > > > fine by now. > > > > > > > > > > > > So I'm in favor of "format=json". But if the community insist to > > follow > > > > > > code style on this, I'm also fine with the longer one. > > > > > > > > > > > > Btw, I also CC user mailing list to listen more user's feedback. > > > > > > > > > > > > Because I > > > > > > > > > > > > think this is relative to usability. > > > > > > > > > > > > Best, > > > > > > Jark > > > > > > > > > > > > On Wed, 29 Apr 2020 at 22:09, Chesnay Schepler < > ches...@apache.org> > > < > > > > ches...@apache.org> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Therefore, should we advocate instead: > > > > > > > > > > > > > > 'format.kind' = 'json', > > > > > > > 'format.fail-on-missing-field' = 'false' > > > > > > > > > > > > Yes. That's pretty much it. > > > > > > > > > > > > This is reasonable important to nail down as with such > violations I > > > > > > believe we could not actually switch to a standard YAML parser. > > > > > > > > > > > > On 29/04/2020 16:05, Timo Walther wrote: > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > discussions around ConfigOption seem to be very popular recently. > > > > > > > > > > > > So I > > > > > > > > > > > > would also like to get some opinions on a different topic. > > > > > > > > > > > > How do we represent hierarchies in ConfigOption? In FLIP-122, we > > > > > > agreed on the following DDL syntax: > > > > > > > > > > > > CREATE TABLE fs_table ( > > > > > > ... > > > > > > ) WITH ( > > > > > > 'connector' = 'filesystem', > > > > > > 'path' = 'file:///path/to/whatever', > > > > > > 'format' = 'csv', > > > > > > 'format.allow-comments' = 'true', > > > > > > 'format.ignore-parse-errors' = 'true' > > > > > > ); > > > > > > > > > > > > Of course this is slightly different from regular Flink core > > > > > > configuration but a connector still needs to be configured based > on > > > > > > these options. > > > > > > > > > > > > However, I think this FLIP violates our code style guidelines > > > > > > > > > > > > because > > > > > > > > > > > > 'format' = 'json', > > > > > > 'format.fail-on-missing-field' = 'false' > > > > > > > > > > > > is an invalid hierarchy. `format` cannot be a string and a > > top-level > > > > > > object at the same time. > > > > > > > > > > > > We have similar problems in our runtime configuration: > > > > > > > > > > > > state.backend= > > > > > > state.backend.incremental= > > > > > > restart-strategy= > > > > > > restart-strategy.fixed-delay.delay= > > > > > > high-availability= > > > > > > high-availability.cluster-id= > > > > > > > > > > > > The code style guide states "Think of the configuration as nested > > > > > > objects (JSON style)". So such hierarchies cannot be represented > in > > > > > > > > > > > > a > > > > > > > > > > > > nested JSON style. > > > > > > > > > > > > Therefore, should we advocate instead: > > > > > > > > > > > > 'format.kind' = 'json', > > > > > > 'format.fail-on-missing-field' = 'false' > > > > > > > > > > > > What do you think? > > > > > > > > > > > > Thanks, > > > > > > Timo > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > https://flink.apache.org/contributing/code-style-and-quality-components.html#configuration-changes > > > > > > > > > > > > -- > > > > > > > > > > > > Benchao Li > > > > > > School of Electronics Engineering and Computer Science, Peking > > > > UniversityTel:+86-15650713730 > > > > > > Email: libenc...@gmail.com; libenc...@pku.edu.cn > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- Best, Jingsong Lee