Adding that information is already the more prevailing style at this moment, and it is usual to follow the prevailing style unless there is a specific reason not to. If there is confusion about this, I will explicitly add it to the guide (https://spark.apache.org/contributing.html). Let me know if this is still confusing or if you disagree.
On Thu, Feb 13, 2020 at 9:47 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Yes, that's probably our final goal: to revisit the configurations so that
> the documentation is cleanly structured and deduplicated. +1.
>
> One point I would like to add, though, is to keep adding such information to
> the documentation until we actually reach that final goal, since it will
> probably take a while to revisit and fix everything and make the
> documentation fully structured.
> If we go more conservatively, we can at least add such information to the
> new configurations being added from now on, and keep the existing
> configurations as they are.
>
>
> On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <dongjoon.h...@gmail.com> wrote:
>
>> Thank you for raising the issue, Hyukjin.
>>
>> According to the current status of the discussion, it seems we can agree
>> on updating the non-structured configurations and keeping the structured
>> configurations as-is.
>>
>> I'm +1 for revisiting the configurations if that is our direction. If
>> there are mismatches in the structured configurations, we can fix them
>> together.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <dmat...@comcast.net> wrote:
>>
>>> All are valid and valuable observations to put into practice:
>>>
>>> * structured and meaningful config names
>>> * explanatory text or a succinct description
>>> * easily accessible or searchable
>>>
>>> These are aspirational but gradually doable if we make them part of the
>>> dev and review cycle. Often meaningful config names, like security, come
>>> as an afterthought.
>>>
>>> At the AMA at the Spark Summit in Amsterdam last year, a few developers
>>> lamented the difficulty of finding Spark configs: what they do; what
>>> they are used for; when and why; and what their default values are.
>>>
>>> Though one can fetch them programmatically, one still has to know what
>>> specific config one is looking for.
>>>
>>> Cheers
>>> Jules
>>>
>>> On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>> Yeah, that's one of the reasons why I don't want to document this in the
>>> guide yet.
>>>
>>> I would like to make sure the dev people are on the same page that
>>> documenting this is the better practice. I don't intend to force it as a
>>> hard requirement; however, if it's pointed out in review, it would be
>>> better to address it.
>>>
>>> On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <cloud0...@gmail.com> wrote:
>>>
>>>> In general I think it's better to have more detailed documents, but we
>>>> don't have to force everyone to do it if the config name is structured.
>>>> I would +1 to documenting the relationship if we can't tell it from the
>>>> config names, e.g. spark.shuffle.service.enabled
>>>> and spark.dynamicAllocation.enabled.
>>>>
>>>> On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> Also, I would like to hear other people's thoughts here. Could I ask
>>>>> what you all think about this in general?
>>>>>
>>>>> On Wed, Feb 12, 2020 at 12:02 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>
>>>>>> To do that, we should explicitly document such structured
>>>>>> configurations and their implicit effects, which is currently missing.
>>>>>> I would be more than happy if we documented such implied
>>>>>> relationships, *and* if we were very sure all configurations are
>>>>>> structured correctly and coherently.
>>>>>> Until that point, I think it is more practical to simply document the
>>>>>> relationships for now.
>>>>>>
>>>>>> > Btw, maybe off-topic, `spark.dynamicAllocation` has another issue
>>>>>> in practice: whether to duplicate the description between the
>>>>>> configuration code and the docs. I have been asked to add descriptions
>>>>>> to the configuration code regardless, while the existing codebase
>>>>>> doesn't. This is a widely-used configuration.
>>>>>> This is actually something we should fix too. In the SQL
>>>>>> configurations, we no longer have such duplication as of
>>>>>> https://github.com/apache/spark/pull/27459, since the documentation is
>>>>>> now generated from the configuration code. We should do the same for
>>>>>> the other configurations.
>>>>>>
>>>>>> On Wed, Feb 12, 2020 at 11:47 AM Jungtaek Lim
>>>>>> <kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm looking into the case of `spark.dynamicAllocation`, and it seems
>>>>>>> to support my point.
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>>>>>
>>>>>>> I don't disagree with adding "This requires
>>>>>>> spark.shuffle.service.enabled to be set." to the description of
>>>>>>> `spark.dynamicAllocation.enabled`. This cannot be inferred
>>>>>>> implicitly, so it is better to have it.
>>>>>>>
>>>>>>> Here is why I'm in favor of structured configurations and implicit
>>>>>>> effects over describing everything explicitly:
>>>>>>>
>>>>>>> 1. There are 10 configurations besides
>>>>>>> `spark.dynamicAllocation.enabled` (assuming the doc doesn't miss
>>>>>>> any), and only 4 of them are referred to in the description of
>>>>>>> `spark.dynamicAllocation.enabled`; the majority of config keys are
>>>>>>> missing.
>>>>>>> 2. I think it's intentional, but the table starts with
>>>>>>> `spark.dynamicAllocation.enabled`, which conveys implicitly but
>>>>>>> intuitively that if you disable it, nothing related to dynamic
>>>>>>> allocation will work. Missing the majority of the config-key
>>>>>>> references does not make it hard to understand.
>>>>>>> 3. Even `spark.dynamicAllocation` has a bad case: see
>>>>>>> `spark.dynamicAllocation.shuffleTracking.enabled` and
>>>>>>> `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect
>>>>>>> the structure of the configuration. I think this is worse than not
>>>>>>> mentioning the relationship in the description.
>>>>>>> Let's assume the name had been
>>>>>>> `spark.dynamicAllocation.shuffleTracking.timeout`: isn't it intuitive
>>>>>>> that setting `spark.dynamicAllocation.shuffleTracking.enabled` to
>>>>>>> `false` would effectively disable
>>>>>>> `spark.dynamicAllocation.shuffleTracking.timeout`?
>>>>>>>
>>>>>>> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in
>>>>>>> practice: whether to duplicate the description between the
>>>>>>> configuration code and the docs. I have been asked to add
>>>>>>> descriptions to the configuration code regardless, while the existing
>>>>>>> codebase doesn't. This is a widely-used configuration.
>>>>>>>
>>>>>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sure, adding "[DISCUSS]" is a good practice to label such a thread.
>>>>>>>> I ought to have done it, although it might be "redundant" :-) since
>>>>>>>> anyone can give feedback on any thread on the Spark dev mailing list
>>>>>>>> and discuss it.
>>>>>>>>
>>>>>>>> Documenting the relationship is actually the more prevailing style,
>>>>>>>> given my rough reading of the configuration files. I would like to
>>>>>>>> treat the missing relationships as a bad pattern that started from
>>>>>>>> personal preference.
>>>>>>>>
>>>>>>>> > Personally I'd rather not think someone won't understand setting
>>>>>>>> `.enabled` to `false` means the functionality is disabled and
>>>>>>>> effectively it disables all sub-configurations.
>>>>>>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
>>>>>>>> configurations for `spark.sql.adaptive.*` are implicitly no-op. For
>>>>>>>> me this is pretty intuitive and one of the major
>>>>>>>> > benefits of the structured configurations.
>>>>>>>>
>>>>>>>> I don't think it is a good idea to assume users know such context.
>>>>>>>> One might think
>>>>>>>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` alone
>>>>>>>> can partially enable the feature.
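[The "implicit effect" debated above, where a parent `.enabled` flag set to `false` makes all sub-configurations no-ops, can be sketched with a small hypothetical resolver. The `effective` helper below is illustrative only, not Spark code, and the sample keys mirror the ones discussed in this thread.]

```python
# Sketch of structured-configuration gating: a sub-configuration only takes
# effect when every enclosing "<prefix>.enabled" gate (if present) is true.
# Hypothetical helper, not part of Spark.
def effective(conf: dict, key: str, default=None):
    """Return conf[key] only if no enclosing '<prefix>.enabled' gate is
    explicitly set to False; otherwise fall back to the default."""
    parts = key.split(".")
    for i in range(1, len(parts)):
        gate = ".".join(parts[:i]) + ".enabled"
        if gate != key and conf.get(gate) is False:
            return default  # parent feature disabled: sub-config is a no-op
    return conf.get(key, default)

conf = {
    "spark.dynamicAllocation.enabled": False,
    "spark.dynamicAllocation.shuffleTracking.timeout": "30s",
}
# The parent feature is disabled, so the timeout has no effect.
assert effective(conf, "spark.dynamicAllocation.shuffleTracking.timeout") is None
```

This is the intuition behind the "structured names are enough" position: the gating rule is mechanical, so it need not be restated on every sub-configuration, provided the names actually follow the structure.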
>>>>>>>> It is better to document this explicitly, since for some
>>>>>>>> configurations it is even difficult for users to confirm whether the
>>>>>>>> feature is working or not.
>>>>>>>> For instance, one might think setting
>>>>>>>> 'spark.eventLog.rolling.maxFileSize' automatically enables rolling,
>>>>>>>> and only realise later, after the file grows, that the log is not
>>>>>>>> rolling.
>>>>>>>>
>>>>>>>> On Wed, Feb 12, 2020 at 10:47 AM Jungtaek Lim
>>>>>>>> <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm sorry if I'm missing something, but ideally this would have
>>>>>>>>> been started as [DISCUSS], as I haven't seen any reference
>>>>>>>>> establishing consensus on this practice.
>>>>>>>>>
>>>>>>>>> For me it's just that there are two different practices co-existing
>>>>>>>>> in the codebase, meaning it's closer to individual preference (with
>>>>>>>>> an implicit agreement that others have different preferences), or
>>>>>>>>> it hasn't been discussed thoroughly.
>>>>>>>>>
>>>>>>>>> Personally, I find it hard to imagine someone not understanding
>>>>>>>>> that setting `.enabled` to `false` disables the functionality and
>>>>>>>>> effectively disables all sub-configurations. E.g. when
>>>>>>>>> `spark.sql.adaptive.enabled` is `false`, all the
>>>>>>>>> `spark.sql.adaptive.*` configurations are implicitly no-ops. For me
>>>>>>>>> this is pretty intuitive, and one of the major benefits of
>>>>>>>>> structured configurations.
>>>>>>>>>
>>>>>>>>> If we want to make it explicit, "all" sub-configurations would have
>>>>>>>>> to carry a redundant part of the doc, and more so if the condition
>>>>>>>>> is nested. I agree this is a good "be kind" step, but it is less
>>>>>>>>> pragmatic.
>>>>>>>>>
>>>>>>>>> I'd be happy to follow whatever consensus we reach in this thread.
>>>>>>>>> I'd appreciate more voices.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>
>>>>>>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> > I don't plan to document this officially yet
>>>>>>>>>> Just to prevent confusion: I meant that I don't yet plan to
>>>>>>>>>> document the fact that we should write out the relationships
>>>>>>>>>> between configurations as a code/review guideline at
>>>>>>>>>> https://spark.apache.org/contributing.html
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 12, 2020 at 9:57 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I happened to review some PRs and noticed that some
>>>>>>>>>>> configurations are missing necessary information.
>>>>>>>>>>>
>>>>>>>>>>> To be explicit, I would like to make sure we document the direct
>>>>>>>>>>> relationships between configurations. For example,
>>>>>>>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>>>>>>>>> can only take effect when `spark.sql.adaptive.enabled` is
>>>>>>>>>>> enabled, and that is clearly documented.
>>>>>>>>>>> We're good in general, given that we usually document these in
>>>>>>>>>>> Apache Spark; see 'spark.task.reaper.enabled',
>>>>>>>>>>> 'spark.dynamicAllocation.enabled',
>>>>>>>>>>> 'spark.sql.parquet.filterPushdown', etc.
>>>>>>>>>>>
>>>>>>>>>>> However, I noticed a pattern of this information being missing in
>>>>>>>>>>> some components, for example `spark.history.fs.cleaner.*`,
>>>>>>>>>>> `spark.history.kerberos.*` and `spark.history.ui.acls.*`.
>>>>>>>>>>>
>>>>>>>>>>> I hope we all start to document such information. Users cannot
>>>>>>>>>>> logically infer these relationships, and I myself had to read the
>>>>>>>>>>> code to confirm them when reviewing.
>>>>>>>>>>> I don't plan to document this officially yet because it looks
>>>>>>>>>>> like a pretty logical request to me; however, let me know if you
>>>>>>>>>>> have different opinions.
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
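[The concrete pitfall running through this thread is that a configuration key alone does not reveal its prerequisites. A small hypothetical checker makes the point; the `RELATED` table below is illustrative only, drawn from the two examples mentioned above, and the Spark documentation remains authoritative for the actual relationships.]

```python
# Hypothetical checker for config relationships that cannot be inferred
# from the key names alone. Not part of Spark; the dependency table is
# illustrative, based on the examples discussed in this thread.
RELATED = {
    # key: (prerequisite key, required value)
    "spark.dynamicAllocation.enabled": ("spark.shuffle.service.enabled", "true"),
    "spark.eventLog.rolling.maxFileSize": ("spark.eventLog.rolling.enabled", "true"),
}

def check(conf: dict) -> list:
    """Return warnings for keys whose documented prerequisite is unmet."""
    warnings = []
    for key, (dep, wanted) in RELATED.items():
        if key in conf and conf.get(dep) != wanted:
            warnings.append(f"{key} has no effect unless {dep}={wanted}")
    return warnings

# Setting only maxFileSize does not enable rolling, the surprise
# described earlier in the thread.
print(check({"spark.eventLog.rolling.maxFileSize": "128m"}))
```

Documenting the relationship in each configuration's description gives users the same information this table encodes, without requiring them to read the source.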