+1, We could refer the Flink configuration to add some other properties for ConfigOption like description, fallbackKeys (collect the deprecated keys) Thanks, Harvey Yue
On 01/10/2020 06:54,lamberken<lamber...@163.com> wrote: hi @Vinoth Chandar, Got it, thanks. best, lamber-ken At 2020-01-09 23:52:52, "Vinoth Chandar" <vin...@apache.org> wrote: Hi lamber-ken, A ConfigOption class would be good indeed. +1 on starting incrementally with DataSource first and then iterating.. Thanks Vinoth On Tue, Jan 7, 2020 at 6:58 PM lamberken <lamber...@163.com> wrote: Hi @Vinoth, It's time to pick up this topic. Based on the content we talked about, here are my thoughts 1, Initial proposal aims to rework configuration framework includes(DataSource and WriteClient level), for compatibility, we can introduce a ConfigOption class and rework it on DataSource level. 2, It's very right that the scoped down version does not need a RFC[1], so change state from 'Under Discussion' to 'Close' ? [1] https://cwiki.apache.org/confluence/display/HUDI/RFC+-+11+%3A+Refactor+of+the+configuration+framework+of+hudi+project Best, Lamber-Ken At 2019-12-19 11:05:16, "Vinoth Chandar" <vin...@apache.org> wrote: Sounds good.. This scoped down version per se, does not need a RFC. On Wed, Dec 18, 2019 at 3:09 PM lamberken <lamber...@163.com> wrote: Hi @Vinoth I understand what you mean, I will continue to work on this when I finish reworking the new UI. :) best, lamber-ken At 2019-12-18 11:39:30, "Vinoth Chandar" <vin...@apache.org> wrote: Expect most users to use inputDF.write() approach... Uber uses the lower level RDD apis, like the DeltaStreamer tool does.. If we don't rename configs and still support a builder, it should be fine. I think we can scope this down to introducing a ConfigOption class that ties, the key,value, default together.. That definitely seems like a better abstraction. On Fri, Dec 13, 2019 at 5:18 PM lamberken <lamber...@163.com> wrote: Hi, @vinoth Okay, I see. If we don't want existing users to do any upgrading or reconfigurations, then this refactor work will not make much sense. This issue can be closed, because ConfigOptions and these builders do the same things. From another side, if we finish this work before a stable release, we will benefit a lot from it. We need to make a choice. btw, I have a question that users will use HoodieWriteConfig / HoodieWriteClient in their program? /---------------------------------------------------------------------------------------- HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder() .withPath(basePath) .forTable(tableName) .withSchema(schemaStr) .withProps(props) // pass raw k,v pairs from a property file. .withCompactionConfig(HoodieCompactionConfig.newBuilder().withXXX(...).build()) .withIndexConfig(HoodieIndexConfig.newBuilder().withXXX(...).build()) ... .build(); /---------------------------------------------------------------------------------------- OR /---------------------------------------------------------------------------------------- inputDF.write() .format("org.apache.hudi") .options(clientOpts) // any of the Hudi client opts can be passed in as well .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp") .option(HoodieWriteConfig.TABLE_NAME, tableName) .mode(SaveMode.Append) .save(basePath); /---------------------------------------------------------------------------------------- best, lamber-ken At 2019-12-14 08:43:06, "Vinoth Chandar" <vin...@apache.org> wrote: Hi, Are you saying these classes needs to change? If so, understood. But are you planning on renaming configs or relocating them? We dont want existing users to do any upgrading or reconfigurations On Fri, Dec 13, 2019 at 10:28 AM lamberken <lamber...@163.com> wrote: Hi, They need to change due to this, because only HoodieWriteConfig and *Options will be kept. best, lamber-ken At 2019-12-14 01:23:35, "Vinoth Chandar" <vin...@apache.org> wrote: Hi, We are trying to understand if existing jobs (datasource, deltastreamer, anything else) needs to change due to this. On Wed, Dec 11, 2019 at 7:18 PM lamberken <lamber...@163.com> wrote: Hi, @vinoth 1, Hoodie*Config classes are only used to set default value when call their build method currently. They will be replaced by HoodieMemoryOptions, HoodieIndexOptions, HoodieHBaseIndexOptions, etc... 2, I don't understand the question "It is not clear to me whether there is any external facing changes which changes this model.". Best, lamber-ken At 2019-12-12 11:01:36, "Vinoth Chandar" <vin...@apache.org> wrote: I actually prefer the builder pattern for making the configs, because I can do `builder.` in the IDE and actually see all the options... That said, most developers program against the Spark datasource and so this may not be useful, unless we expose a builder for that.. I will concede that since its also subjective anyway. But, to clarify Siva's question, you do intend to keep the different component level config classes right - HoodieIndexConfig, HoodieCompactionConfig? Once again, can you please explicitly address the following question, so we can get on the same page? It is not clear to me whether there is any external facing changes which changes this model. This is still the most critical question from both me and balaji. On Wed, Dec 11, 2019 at 11:35 AM lamberken <lamber...@163.com wrote: hi, @Sivabalan Yes, thanks very much for help me explain my initial proposal. Answer your question, we can call HoodieWriteConfig as a SystemConfig, we need to pass it everywhere. Actually, it may just contains a few custom configurations( does not include default configurations) Because each component has its own ConfigOptions. The old version HoodieWriteConfig includes all keys(custom configurations, default configurations), it is a fat. Best, lamber-ken At 2019-12-12 03:14:11, "Sivabalan" <n.siv...@gmail.com> wrote: Let me summarize your initial proposal and then will get into details. - Introduce ConfigOptions for ease of handling of default values. - Remove all Hoodie*Config classes and just have HoodieWriteConfig. What this means is that, every other config file will be replaced by ConfigOptions. eg, HoodieIndexConfigOption, HoodieCompactionConfigOption, etc. - Config option will take care of returning defaults for any property, even if an entire Config(eg IndexConfig) is not explicitly set. Here are the positives I see. - By way of having component level ConfigOptions, we bucketize the configs and have defaults set(same as before) - User doesn't need to set each component's config(eg IndexConfig) explicitly with HoodieWriteConfig. But have one question: - I see Bucketizing only in write path. How does one get hold of IndexConfigOptions as a consumer? For eg, If some class is using just IndexConfig alone, how will it consume? From your eg, I see only HoodieWriteConfig. Do we pass in HoodieWriteConfig everywhere then? Wouldn't that contradicts your initial proposal to not have a fat config class? May be can you expand your example below to show how a consumer of IndexConfig look like. Your eg: /** * New version */ // set value overrite the default value HoodieWriteConfig config = new HoodieWriteConfig(); config.set(HoodieIndexConfigOptions.INDEX_TYPE, HoodieIndex.IndexType.HBASE.name < http://hoodieindex.indextype.hbase.name/> ()) On Wed, Dec 11, 2019 at 8:33 AM lamberken < lamber...@163.com> wrote: Hi, On 1,2. Yes, you are right, moving the getter to the component level Config class itself. On 3, HoodieWriteConfig can also set value through ConfigOption, small code snippets. From the bellow snippets, we can see that clients need to know each component's builders and also call their "with" methods to override the default value in old version. But, in new version, clients just need to know each component's public config options, just like constants. So, these builders are redundant. /---------------------------------------------------------------------------/ public class HoodieIndexConfigOptions { public static final ConfigOption<String> INDEX_TYPE = ConfigOption .key("hoodie.index.type") .defaultValue(HoodieIndex.IndexType.BLOOM.name ()); } public class HoodieWriteConfig { public void setString(ConfigOption<String> option, String value) { this.props.put(option.key(), value); } } /** * New version */ // set value overrite the default value HoodieWriteConfig config = new HoodieWriteConfig(); config.set(HoodieIndexConfigOptions.INDEX_TYPE, HoodieIndex.IndexType.HBASE.name()) /** * Old version */ HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder() builder.withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build()) /---------------------------------------------------------------------------/ Another, users use hudi like bellow, here're all keys. /---------------------------------------------------------------------------/ df.write.format("hudi"). option("hoodie.insert.shuffle.parallelism", "10"). option("hoodie.upsert.shuffle.parallelism", "10"). option("hoodie.delete.shuffle.parallelism", "10"). option("hoodie.bulkinsert.shuffle.parallelism", "10"). option("hoodie.datasource.write.recordkey.field", "name"). option("hoodie.datasource.write.partitionpath.field", "location"). option("hoodie.datasource.write.precombine.field", "ts"). option("hoodie.table.name", tableName). mode(Overwrite). save(basePath); /---------------------------------------------------------------------------/ Last, as I responsed to @vino, it's reasonable to handle fallbackkeys. I think we need to do this step by step, it's easy to integrate FallbackKey in the future, it is not what we need right now in my opinion. If some places are still not very clear, feel free to feedback. Best, lamber-ken At 2019-12-11 23:41:31, "Vinoth Chandar" < vin...@apache.org wrote: Hi Lamber-ken, I looked at the sample PR you put up as well. On 1,2 => Seems your intent is to replace these with moving the getter to the component level Config class itself? I am fine with that (although I think its not that big of a hurdle really to use atm). But, once we do that we could pass just the specific component config into parts of code versus passing in the entire HoodieWriteConfig object. I am fine with moving the classes to a ConfigOption class as you suggested as well. On 3, I still we feel we will need the builder pattern going forward. to build the HoodieWriteConfig object. Like below? Cannot understand why we would want to change this. Could you please clarify? HoodieWriteConfig.Builder builder = HoodieWriteConfig.newBuilder().withPath(cfg.targetBasePath).combineInput(cfg.filterDupes, true) .withCompactionConfig(HoodieCompactionConfig.newBuilder().withPayloadClass(cfg.payloadClassName) // Inline compaction is disabled for continuous mode. otherwise enabled for MOR .withInlineCompaction(cfg.isInlineCompactionEnabled()).build()) .forTable(cfg.targetTableName) .withIndexConfig(HoodieIndexConfig.newBuilder().withIndexType(HoodieIndex.IndexType.BLOOM).build()) .withAutoCommit(false).withProps(props); Typically, we write RFCs for large changes that breaks existing behavior or introduces significantly complex new features.. If you are just planning to do the refactoring into ConfigOption class, per se you don't need a RFC. But , if you plan to address the fallback keys (or) your changes are going to break/change existing jobs, we would need a RFC. It is not clear to me whether there is any external facing changes which changes this model. I am still unclear on this as well. can you please explicitly clarify? thanks vinoth On Tue, Dec 10, 2019 at 12:35 PM lamberken < lamber...@163.com wrote: Hi, @Balaji @Vinoth I'm sorry, some places are not very clear, 1, We can see that HoodieMetricsConfig, HoodieStorageConfig, etc.. already defined in project. But we get property value by methods which defined in HoodieWriteConfig, like HoodieWriteConfig#getParquetMaxFileSize, HoodieWriteConfig#getParquetBlockSize, etc. It's means that Hoodie*Config are redundant. 2, These Hoodie*Config classes are used to set default value when call their build method, nothing else. 3, For current plan is keep the Builder pattern when configuring, when we are familiar with the config framework, We will find that Hoodie*Config class are redundant and methods prefixed with "get" in HoodieWriteConfig are also redundant. In addition, I create a pr[1] for initializing with a demo. At this demo, I create MetricsGraphiteReporterOptions which contains HOST, PORT, PREFIX, and remove getGraphiteServerHost, getGraphiteServerPort, getGraphiteMetricPrefix in HoodieMetricsConfig. https://github.com/apache/incubator-hudi/pull/1094 Best, lamber-ken At 2019-12-11 02:35:30, "Balaji Varadarajan" <v.bal...@ymail.com.INVALID wrote: Hi Lamber-Ken, Thanks for the time writing the proposal and thinking about improving Hudi usability. My preference would be to keep the Builder pattern when configuring. It is something I find it natural when configuring. It is not clear to me whether there is any external facing changes which changes this model. Would you mind adding some more details on the RFC. It would save time to read it in one place as opposed to checking out github repo :) Thanks,Balaji.V On Tuesday, December 10, 2019, 07:55:01 AM PST, Vinoth Chandar < vin...@apache.org> wrote: Hi , Thanks for the proposal. Some parts I agree, some parts I don't and some parts are unclear Agree : - On introducing a class that binds key, default value, provided value, and also may be a doc along with it (?). - Designing the framework to have fallback keys is good IMO. It helps us do things like https://issues.apache.org/jira/browse/HUDI-89 Disagree : - Not all configuration values are in HoodieWriteConfig, its not accurate. Configs are already split by components into HoodieIndexConfig, HoodieCompactionConfig etc.. - There are helpers for all these conveniently located in HoodieWriteConfig. I think some of the claims of usability seem subjective to me, speaking from hands-on experience writing jobs. So, if you proposing a large shake up (e.g not have a single properties file load all components), I would love to understand this at more depth. From my experience, well namespaced configs in a single properties file keeps it simple and understandable. Unclear : - What is impact on existing jobs - using RDD/WriteClient API, DataSource, DeltaStreamer level? Do you intend to change namespacing of configs? Thanks Vinoth On Tue, Dec 10, 2019 at 6:44 AM lamberken < lamber...@163.com> wrote: Hi, vino Reasonable, we can refactor this step by step. The first step now is to introduce the config framework. When our community is familiar with the config framework mechanism, it's easy to integrate FallbackKey in the future. Best, lamber-ken At 2019-12-10 11:51:22, "vino yang" < yanghua1...@gmail.com> wrote: Hi Lamber, Thanks for the proposal. +1 from my side. When it comes to configuration, it will involve how we handle deprecated configuration items in the future. In my opinion, we need to take this into consideration when designing. There are already some successful practices for our reference. For example, Flink defines some deprecated configurations as FallbackKey[1]. Maybe we can learn from these designs. WDYT? [1]: https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/FallbackKey.java Best, Vino lamberken <lamber...@163.com> 于2019年12月9日周一 下午11:19写道: Hi, all Currently, many configuration items and their default values are dispersed in the config file like HoodieWriteConfig. It’s very confused for developers, and it's easy for developers to use them in a wrong place especially when there are more and more configuration items. If we can solve this, developers will benefit from it and the code structure will be more concise. I had create a JIRA[1] and a under discuss RFC[2] to explain how to solve the problem, if you are interested in this, you can visit jira and RFC for detail. Any comments and feedback are welcome, WDYT? Best, lamber-ken [1] https://issues.apache.org/jira/projects/HUDI/issues/HUDI-375 [2] https://cwiki.apache.org/confluence/display/HUDI/RFC-11+%3A+Refactor+of+the+configuration+framework+of+hudi+project -- Regards, -Sivabalan