Adding that information is already the more prevailing style at this moment, and it is usual to follow the prevailing style unless there is a specific reason not to. If there is confusion about this, I will explicitly add it to the guide (https://spark.apache.org/contributing.html). Let me know if this is still confusing or if you disagree.
On Thu, Feb 13, 2020 at 9:47 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Yes, that's probably our final goal: to revisit the configurations so that
> the documentation is cleanly structured and deduplicated. +1.
>
> One point I would like to add, though, is to keep adding such information to
> the documentation until we actually reach that final goal, since it will
> probably take a while to revisit and fix everything and make the
> documentation fully structured.
> If we go more conservatively, we can at least add such information to the
> new configurations being added from now on, and keep the existing
> configurations as they are.
>
>
> On Thu, 13 Feb 2020, 06:12 Dongjoon Hyun, <dongjoon.h...@gmail.com> wrote:
>
>> Thank you for raising the issue, Hyukjin.
>>
>> According to the current status of the discussion, it seems we can agree
>> on updating the non-structured configurations and keeping the structured
>> configurations as-is.
>>
>> I'm +1 for revisiting the configurations if that is our direction. If
>> there are mismatches in the structured configurations, we can fix them
>> together.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 12, 2020 at 8:08 AM Jules Damji <dmat...@comcast.net> wrote:
>>
>>> All are valid and valuable observations to put into practice:
>>>
>>> * structured and meaningful config names
>>> * explanatory text or a succinct description
>>> * easily accessible or searchable
>>>
>>> These are aspirational but gradually doable if we make them part of the
>>> dev and review cycle. Often meaningful config names, like security, come
>>> as an afterthought.
>>>
>>> At the AMA at the Spark Summit in Amsterdam last year, a few developers
>>> lamented the difficulty of finding Spark configs: what they do; what
>>> they are used for; when and why; and what their default values are.
>>>
>>> Though one can fetch them programmatically, one still has to know what
>>> specific config one is looking for.
>>>
>>> Cheers
>>> Jules
>>>
>>> On Feb 12, 2020, at 5:19 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>> Yeah, that's one of the reasons why I don't want to document this in the
>>> guide yet.
>>>
>>> I would like to make sure the dev people are on the same page that
>>> documenting this is the better practice. I don't intend to force it as a
>>> hard requirement; however, if it's pointed out in review, it would be
>>> better to address it.
>>>
>>> On Wed, 12 Feb 2020, 21:30 Wenchen Fan, <cloud0...@gmail.com> wrote:
>>>
>>>> In general I think it's better to have more detailed documents, but we
>>>> don't have to force everyone to do it if the config name is structured.
>>>> I would +1 to documenting the relationship if we can't tell it from the
>>>> config names, e.g. spark.shuffle.service.enabled
>>>> and spark.dynamicAllocation.enabled.
>>>>
>>>> On Wed, Feb 12, 2020 at 7:54 PM Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> Also, I would like to hear other people's thoughts here. Could I ask
>>>>> what you all think about this in general?
>>>>>
>>>>> On Wed, Feb 12, 2020 at 12:02 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>
>>>>>> To do that, we should explicitly document such structured
>>>>>> configurations and their implicit effects, which is currently missing.
>>>>>> I would be more than happy if we documented such implied
>>>>>> relationships, *and* if we were very sure all configurations are
>>>>>> structured correctly and coherently.
>>>>>> Until that point, I think it is more practical to simply document the
>>>>>> relationships for now.
>>>>>>
>>>>>> > Btw, maybe off-topic, `spark.dynamicAllocation` has another issue
>>>>>> in practice: whether to duplicate the description between the
>>>>>> configuration code and the docs. I have been asked to add descriptions
>>>>>> to the configuration code regardless, while the existing codebase
>>>>>> doesn't. This is a widely-used configuration.
>>>>>> This is actually something we should fix too. In the SQL
>>>>>> configurations, we no longer have such duplication as of
>>>>>> https://github.com/apache/spark/pull/27459, since the documentation is
>>>>>> now generated from the configuration code. We should do the same for
>>>>>> the other configurations.
>>>>>>
>>>>>> On Wed, Feb 12, 2020 at 11:47 AM Jungtaek Lim
>>>>>> <kabhwan.opensou...@gmail.com> wrote:
>>>>>>
>>>>>>> I'm looking into the case of `spark.dynamicAllocation`, and it seems
>>>>>>> to support my point.
>>>>>>>
>>>>>>> https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation
>>>>>>>
>>>>>>> I don't disagree with adding "This requires
>>>>>>> spark.shuffle.service.enabled to be set." to the description of
>>>>>>> `spark.dynamicAllocation.enabled`. This cannot be inferred
>>>>>>> implicitly, so it is better to have it.
>>>>>>>
>>>>>>> Here is why I'm in favor of structured configurations and implicit
>>>>>>> effects over describing everything explicitly:
>>>>>>>
>>>>>>> 1. There are 10 configurations besides
>>>>>>> `spark.dynamicAllocation.enabled` (assuming the doc doesn't miss
>>>>>>> any), and only 4 of them are referred to in the description of
>>>>>>> `spark.dynamicAllocation.enabled`; the majority of config keys are
>>>>>>> missing.
>>>>>>> 2. I think it's intentional, but the table starts with
>>>>>>> `spark.dynamicAllocation.enabled`, which conveys implicitly but
>>>>>>> intuitively that if you disable it, nothing related to dynamic
>>>>>>> allocation will work. Missing the majority of the config-key
>>>>>>> references does not make it hard to understand.
>>>>>>> 3. Even `spark.dynamicAllocation` has a bad case: see
>>>>>>> `spark.dynamicAllocation.shuffleTracking.enabled` and
>>>>>>> `spark.dynamicAllocation.shuffleTimeout`. The latter does not respect
>>>>>>> the structure of the configuration. I think this is worse than not
>>>>>>> mentioning the relationship in the description.
>>>>>>> Let's assume the name had been
>>>>>>> `spark.dynamicAllocation.shuffleTracking.timeout`: isn't it intuitive
>>>>>>> that setting `spark.dynamicAllocation.shuffleTracking.enabled` to
>>>>>>> `false` would effectively disable
>>>>>>> `spark.dynamicAllocation.shuffleTracking.timeout`?
>>>>>>>
>>>>>>> Btw, maybe off-topic, `spark.dynamicAllocation` has another issue in
>>>>>>> practice: whether to duplicate the description between the
>>>>>>> configuration code and the docs. I have been asked to add
>>>>>>> descriptions to the configuration code regardless, while the existing
>>>>>>> codebase doesn't. This is a widely-used configuration.
>>>>>>>
>>>>>>> On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Sure, adding "[DISCUSS]" is a good practice to label such a thread.
>>>>>>>> I ought to have done it, although it might be "redundant" :-) since
>>>>>>>> anyone can give feedback on any thread on the Spark dev mailing list
>>>>>>>> and discuss it.
>>>>>>>>
>>>>>>>> Documenting the relationship is actually the more prevailing style,
>>>>>>>> given my rough reading of the configuration files. I would like to
>>>>>>>> treat the missing relationships as a bad pattern that started from
>>>>>>>> personal preference.
>>>>>>>>
>>>>>>>> > Personally I'd rather not think someone won't understand setting
>>>>>>>> `.enabled` to `false` means the functionality is disabled and
>>>>>>>> effectively it disables all sub-configurations.
>>>>>>>> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
>>>>>>>> configurations for `spark.sql.adaptive.*` are implicitly no-op. For
>>>>>>>> me this is pretty intuitive and one of the major
>>>>>>>> > benefits of the structured configurations.
>>>>>>>>
>>>>>>>> I don't think it is a good idea to assume users know such context.
>>>>>>>> One might think
>>>>>>>> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` alone
>>>>>>>> can partially enable the feature.
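[The "implicit effect" debated above, where a parent `.enabled` flag set to `false` makes all sub-configurations no-ops, can be sketched with a small hypothetical resolver. The `effective` helper below is illustrative only, not Spark code, and the sample keys mirror the ones discussed in this thread.]

```python
# Sketch of structured-configuration gating: a sub-configuration only takes
# effect when every enclosing "<prefix>.enabled" gate (if present) is true.
# Hypothetical helper, not part of Spark.
def effective(conf: dict, key: str, default=None):
    """Return conf[key] only if no enclosing '<prefix>.enabled' gate is
    explicitly set to False; otherwise fall back to the default."""
    parts = key.split(".")
    for i in range(1, len(parts)):
        gate = ".".join(parts[:i]) + ".enabled"
        if gate != key and conf.get(gate) is False:
            return default  # parent feature disabled: sub-config is a no-op
    return conf.get(key, default)

conf = {
    "spark.dynamicAllocation.enabled": False,
    "spark.dynamicAllocation.shuffleTracking.timeout": "30s",
}
# The parent feature is disabled, so the timeout has no effect.
assert effective(conf, "spark.dynamicAllocation.shuffleTracking.timeout") is None
```

This is the intuition behind the "structured names are enough" position: the gating rule is mechanical, so it need not be restated on every sub-configuration, provided the names actually follow the structure.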
>>>>>>>> It is better to document this explicitly, since for some
>>>>>>>> configurations it is even difficult for users to confirm whether the
>>>>>>>> feature is working or not.
>>>>>>>> For instance, one might think setting
>>>>>>>> 'spark.eventLog.rolling.maxFileSize' automatically enables rolling,
>>>>>>>> and only realise later, after the file grows, that the log is not
>>>>>>>> rolling.
>>>>>>>>
>>>>>>>> On Wed, Feb 12, 2020 at 10:47 AM Jungtaek Lim
>>>>>>>> <kabhwan.opensou...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I'm sorry if I'm missing something, but ideally this would have
>>>>>>>>> been started as [DISCUSS], as I haven't seen any reference
>>>>>>>>> establishing consensus on this practice.
>>>>>>>>>
>>>>>>>>> For me it's just that there are two different practices co-existing
>>>>>>>>> in the codebase, meaning it's closer to individual preference (with
>>>>>>>>> an implicit agreement that others have different preferences), or
>>>>>>>>> it hasn't been discussed thoroughly.
>>>>>>>>>
>>>>>>>>> Personally, I find it hard to imagine someone not understanding
>>>>>>>>> that setting `.enabled` to `false` disables the functionality and
>>>>>>>>> effectively disables all sub-configurations. E.g. when
>>>>>>>>> `spark.sql.adaptive.enabled` is `false`, all the
>>>>>>>>> `spark.sql.adaptive.*` configurations are implicitly no-ops. For me
>>>>>>>>> this is pretty intuitive, and one of the major benefits of
>>>>>>>>> structured configurations.
>>>>>>>>>
>>>>>>>>> If we want to make it explicit, "all" sub-configurations would have
>>>>>>>>> to carry a redundant part of the doc, and more so if the condition
>>>>>>>>> is nested. I agree this is a good "be kind" step, but it is less
>>>>>>>>> pragmatic.
>>>>>>>>>
>>>>>>>>> I'd be happy to follow whatever consensus we reach in this thread.
>>>>>>>>> I'd appreciate more voices.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>>>>>
>>>>>>>>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> > I don't plan to document this officially yet
>>>>>>>>>> Just to prevent confusion: I meant that I don't yet plan to
>>>>>>>>>> document the fact that we should write out the relationships
>>>>>>>>>> between configurations as a code/review guideline at
>>>>>>>>>> https://spark.apache.org/contributing.html
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 12, 2020 at 9:57 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I happened to review some PRs and noticed that some
>>>>>>>>>>> configurations are missing necessary information.
>>>>>>>>>>>
>>>>>>>>>>> To be explicit, I would like to make sure we document the direct
>>>>>>>>>>> relationships between configurations. For example,
>>>>>>>>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>>>>>>>>> can only take effect when `spark.sql.adaptive.enabled` is
>>>>>>>>>>> enabled, and that is clearly documented.
>>>>>>>>>>> We're good in general, given that we usually document these in
>>>>>>>>>>> Apache Spark; see 'spark.task.reaper.enabled',
>>>>>>>>>>> 'spark.dynamicAllocation.enabled',
>>>>>>>>>>> 'spark.sql.parquet.filterPushdown', etc.
>>>>>>>>>>>
>>>>>>>>>>> However, I noticed a pattern of this information being missing in
>>>>>>>>>>> some components, for example `spark.history.fs.cleaner.*`,
>>>>>>>>>>> `spark.history.kerberos.*` and `spark.history.ui.acls.*`.
>>>>>>>>>>>
>>>>>>>>>>> I hope we all start to document such information. Users cannot
>>>>>>>>>>> logically infer these relationships, and I myself had to read the
>>>>>>>>>>> code to confirm them when reviewing.
>>>>>>>>>>> I don't plan to document this officially yet because it looks
>>>>>>>>>>> like a pretty logical request to me; however, let me know if you
>>>>>>>>>>> have different opinions.
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
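[The concrete pitfall running through this thread is that a configuration key alone does not reveal its prerequisites. A small hypothetical checker makes the point; the `RELATED` table below is illustrative only, drawn from the two examples mentioned above, and the Spark documentation remains authoritative for the actual relationships.]

```python
# Hypothetical checker for config relationships that cannot be inferred
# from the key names alone. Not part of Spark; the dependency table is
# illustrative, based on the examples discussed in this thread.
RELATED = {
    # key: (prerequisite key, required value)
    "spark.dynamicAllocation.enabled": ("spark.shuffle.service.enabled", "true"),
    "spark.eventLog.rolling.maxFileSize": ("spark.eventLog.rolling.enabled", "true"),
}

def check(conf: dict) -> list:
    """Return warnings for keys whose documented prerequisite is unmet."""
    warnings = []
    for key, (dep, wanted) in RELATED.items():
        if key in conf and conf.get(dep) != wanted:
            warnings.append(f"{key} has no effect unless {dep}={wanted}")
    return warnings

# Setting only maxFileSize does not enable rolling, the surprise
# described earlier in the thread.
print(check({"spark.eventLog.rolling.maxFileSize": "128m"}))
```

Documenting the relationship in each configuration's description gives users the same information this table encodes, without requiring them to read the source.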