Hi Xia, +1 on introducing dynamic parallelism inference for HiveSource.
Orthogonal to this discussion, curious, how commonly HiveSource is used these days in the industry given the popularity of table formats/sources like Iceberg, Hudi and Delta lake? Thanks Venkat On Wed, Apr 24, 2024, 7:41 PM Xia Sun <xingbe...@gmail.com> wrote: > Hi everyone, > > Thanks for all the feedback! > > If there are no more comments, I would like to start the vote thread, > thanks again! > > Best, > Xia > > Ahmed Hamdy <hamdy10...@gmail.com> 于2024年4月18日周四 21:31写道: > > > Hi Xia, > > I have read through the FLIP and discussion and the new version of the > FLIP > > looks better. > > +1 for the proposal. > > Best Regards > > Ahmed Hamdy > > > > > > On Thu, 18 Apr 2024 at 12:21, Ron Liu <ron9....@gmail.com> wrote: > > > > > Hi, Xia > > > > > > Thanks for updating, looks good to me. > > > > > > Best, > > > Ron > > > > > > Xia Sun <xingbe...@gmail.com> 于2024年4月18日周四 19:11写道: > > > > > > > Hi Ron, > > > > Yes, presenting it in a table might be more intuitive. I have already > > > added > > > > the table in the "Public Interfaces | New Config Option" chapter of > > FLIP. > > > > PTAL~ > > > > > > > > Ron Liu <ron9....@gmail.com> 于2024年4月18日周四 18:10写道: > > > > > > > > > Hi, Xia > > > > > > > > > > Thanks for your reply. > > > > > > > > > > > That means, in terms > > > > > of priority, `table.exec.hive.infer-source-parallelism` > > > > > > `table.exec.hive.infer-source-parallelism.mode`. > > > > > > > > > > I still have some confusion, if the > > > > > `table.exec.hive.infer-source-parallelism` > > > > > >`table.exec.hive.infer-source-parallelism.mode`, currently > > > > > `table.exec.hive.infer-source-parallelism` default value is true, > > that > > > > > means always static parallelism inference work? Or perhaps after > this > > > > FLIP, > > > > > we changed the default behavior of > > > > > `table.exec.hive.infer-source-parallelism` to indicate dynamic > > > > parallelism > > > > > inference when enabled. > > > > > I think you should list the various behaviors of these two options > > that > > > > > coexist in FLIP by a table, only then users can know how the > dynamic > > > and > > > > > static parallelism inference work. > > > > > > > > > > Best, > > > > > Ron > > > > > > > > > > Xia Sun <xingbe...@gmail.com> 于2024年4月18日周四 16:33写道: > > > > > > > > > > > Hi Ron and Lijie, > > > > > > Thanks for joining the discussion and sharing your suggestions. > > > > > > > > > > > > > the InferMode class should also be introduced in the Public > > > > Interfaces > > > > > > > section! > > > > > > > > > > > > > > > > > > Thanks for the reminder, I have now added the InferMode class to > > the > > > > > Public > > > > > > Interfaces section as well. > > > > > > > > > > > > > `table.exec.hive.infer-source-parallelism.max` is 1024, I > checked > > > > > through > > > > > > > the code that the default value is 1000? > > > > > > > > > > > > > > > > > > I have checked and the default value of > > > > > > `table.exec.hive.infer-source-parallelism.max` is indeed 1000. > This > > > has > > > > > > been corrected in the FLIP. > > > > > > > > > > > > > how are`table.exec.hive.infer-source-parallelism` and > > > > > > > `table.exec.hive.infer-source-parallelism.mode` compatible? > > > > > > > > > > > > > > > > > > This is indeed a critical point. The current plan is to deprecate > > > > > > `table.exec.hive.infer-source-parallelism` but still utilize it > as > > > the > > > > > main > > > > > > switch for enabling automatic parallelism inference. That means, > in > > > > terms > > > > > > of priority, `table.exec.hive.infer-source-parallelism` > > > > > > > `table.exec.hive.infer-source-parallelism.mode`. In future > > versions, > > > if > > > > > > `table.exec.hive.infer-source-parallelism` is removed, this logic > > > will > > > > > also > > > > > > need to be revised, leaving only > > > > > > `table.exec.hive.infer-source-parallelism.mode` as the basis for > > > > deciding > > > > > > whether to enable parallelism inference. I have also added this > > > > > description > > > > > > to the FLIP. > > > > > > > > > > > > > > > > > > > In FLIP-367 it is supported to be able to set the Source's > > > > parallelism > > > > > > > individually, if in the future HiveSource also supports this > > > feature, > > > > > > > however, the default value of > > > > > > > `table.exec.hive.infer-source-parallelism.mode` is > > > > `InferMode.DYNAMIC`, > > > > > > at > > > > > > > this point will the parallelism be dynamically derived or will > > the > > > > > > manually > > > > > > > set parallelism take effect, and who has the higher priority? > > > > > > > > > > > > > > > > > > From my understanding, 'manually set parallelism' has the higher > > > > > priority, > > > > > > just like one of the preconditions for the effectiveness of > dynamic > > > > > > parallelism inference in the AdaptiveBatchScheduler is that the > > > > vertex's > > > > > > parallelism isn't set. I believe whether it's static inference or > > > > dynamic > > > > > > inference, the manually set parallelism by the user should be > > > > respected. > > > > > > > > > > > > > The `InferMode.NONE` option. > > > > > > > > > > > > Currently, 'adding InferMode.NONE' seems to be the prevailing > > > opinion. > > > > I > > > > > > will add InferMode.NONE as one of the Enum options in InferMode > > > class. > > > > > > > > > > > > Best, > > > > > > Xia > > > > > > > > > > > > Lijie Wang <wangdachui9...@gmail.com> 于2024年4月18日周四 13:50写道: > > > > > > > > > > > > > Thanks for driving the discussion. > > > > > > > > > > > > > > +1 for the proposal and +1 for the `InferMode.NONE` option. > > > > > > > > > > > > > > Best, > > > > > > > Lijie > > > > > > > > > > > > > > Ron liu <ron9....@gmail.com> 于2024年4月18日周四 11:36写道: > > > > > > > > > > > > > > > Hi, Xia > > > > > > > > > > > > > > > > Thanks for driving this FLIP. > > > > > > > > > > > > > > > > This proposal looks good to me overall. However, I have the > > > > following > > > > > > > minor > > > > > > > > questions: > > > > > > > > > > > > > > > > 1. FLIP introduced > > > `table.exec.hive.infer-source-parallelism.mode` > > > > > as a > > > > > > > new > > > > > > > > parameter, and the value is the enum class `InferMode`, I > think > > > the > > > > > > > > InferMode class should also be introduced in the Public > > > Interfaces > > > > > > > section! > > > > > > > > 2. You mentioned in FLIP that the default value of > > > > > > > > `table.exec.hive.infer-source-parallelism.max` is 1024, I > > checked > > > > > > through > > > > > > > > the code that the default value is 1000? > > > > > > > > 3. I also agree with Muhammet's idea that there is no need to > > > > > introduce > > > > > > > the > > > > > > > > option `table.exec.hive.infer-source-parallelism.enabled`, > and > > > that > > > > > > > > expanding the InferMode values will fulfill the need. There > is > > > > > another > > > > > > > > issue to consider here though, how are > > > > > > > > `table.exec.hive.infer-source-parallelism` and > > > > > > > > `table.exec.hive.infer-source-parallelism.mode` compatible? > > > > > > > > 4. In FLIP-367 it is supported to be able to set the Source's > > > > > > parallelism > > > > > > > > individually, if in the future HiveSource also supports this > > > > feature, > > > > > > > > however, the default value of > > > > > > > > `table.exec.hive.infer-source-parallelism.mode` is > `InferMode. > > > > > > DYNAMIC`, > > > > > > > at > > > > > > > > this point will the parallelism be dynamically derived or > will > > > the > > > > > > > manually > > > > > > > > set parallelism take effect, and who has the higher priority? > > > > > > > > > > > > > > > > Best, > > > > > > > > Ron > > > > > > > > > > > > > > > > Xia Sun <xingbe...@gmail.com> 于2024年4月17日周三 12:08写道: > > > > > > > > > > > > > > > > > Hi Jeyhun, Muhammet, > > > > > > > > > Thanks for all the feedback! > > > > > > > > > > > > > > > > > > > Could you please mention the default values for the new > > > > > > > configurations > > > > > > > > > > (e.g., table.exec.hive.infer-source-parallelism.mode, > > > > > > > > > > table.exec.hive.infer-source-parallelism.enabled, > > > > > > > > > > etc) ? > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for your suggestion. I have supplemented the > > explanation > > > > > > > regarding > > > > > > > > > the default values. > > > > > > > > > > > > > > > > > > > Since we are introducing the mode as a configuration > > option, > > > > > > > > > > could it make sense to have `InferMode.NONE` option > > also? > > > > > > > > > > The `NONE` option would disable the inference. > > > > > > > > > > > > > > > > > > > > > > > > > > > This is a good idea. Looking ahead, it could eliminate the > > need > > > > for > > > > > > > > > introducing > > > > > > > > > a new configuration option. I haven't identified any > > potential > > > > > > > > > compatibility issues > > > > > > > > > as yet. If there are no further ideas from others, I'll go > > > ahead > > > > > and > > > > > > > > update > > > > > > > > > the FLIP to > > > > > > > > > introducing InferMode.NONE. > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Xia > > > > > > > > > > > > > > > > > > Muhammet Orazov <mor+fl...@morazow.com.invalid> > > 于2024年4月17日周三 > > > > > > 10:31写道: > > > > > > > > > > > > > > > > > > > Hello Xia, > > > > > > > > > > > > > > > > > > > > Thanks for the FLIP! > > > > > > > > > > > > > > > > > > > > Since we are introducing the mode as a configuration > > option, > > > > > > > > > > could it make sense to have `InferMode.NONE` option also? > > > > > > > > > > The `NONE` option would disable the inference. > > > > > > > > > > > > > > > > > > > > This way we deprecate the > > > > > > `table.exec.hive.infer-source-parallelism` > > > > > > > > > > and no additional > > > > > > `table.exec.hive.infer-source-parallelism.enabled` > > > > > > > > > > option is required. > > > > > > > > > > > > > > > > > > > > What do you think? > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > Muhammet > > > > > > > > > > > > > > > > > > > > On 2024-04-16 07:07, Xia Sun wrote: > > > > > > > > > > > Hi everyone, > > > > > > > > > > > I would like to start a discussion on FLIP-445: Support > > > > dynamic > > > > > > > > > > > parallelism > > > > > > > > > > > inference for HiveSource[1]. > > > > > > > > > > > > > > > > > > > > > > FLIP-379[2] has introduced dynamic source parallelism > > > > inference > > > > > > for > > > > > > > > > > > batch > > > > > > > > > > > jobs, which can utilize runtime information to more > > > > accurately > > > > > > > decide > > > > > > > > > > > the > > > > > > > > > > > source parallelism. As a follow-up task, we plan to > > > implement > > > > > the > > > > > > > > > > > dynamic > > > > > > > > > > > parallelism inference interface for HiveSource, and > also > > > > switch > > > > > > the > > > > > > > > > > > default > > > > > > > > > > > static parallelism inference to dynamic parallelism > > > > inference. > > > > > > > > > > > > > > > > > > > > > > Looking forward to your feedback and suggestions, > thanks. > > > > > > > > > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-445*3A*Support*dynamic*parallelism*inference*for*HiveSource__;JSsrKysrKw!!IKRxdwAv5BmarQ!c0CAsY9V98vWfPtmYrdWLpqB6tt3Ytz5mrImBR6NVlu5b8312EwY77qg5mHIQ8ukltEsioqHtpyfTqptLCqBHw$ > > > > > > > > > > > [2] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-379*3A*Dynamic*source*parallelism*inference*for*batch*jobs__;JSsrKysrKys!!IKRxdwAv5BmarQ!c0CAsY9V98vWfPtmYrdWLpqB6tt3Ytz5mrImBR6NVlu5b8312EwY77qg5mHIQ8ukltEsioqHtpyfTqo1mQ_sdw$ > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > Xia > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >