Re: Request to document the direct relationship between other configurations

Jungtaek Lim Tue, 11 Feb 2020 18:47:27 -0800

I'm looking into the case of `spark.dynamicAllocation` and this seems to be
the thing to support my voice.


https://github.com/apache/spark/blob/master/docs/configuration.md#dynamic-allocation

I don't disagree with adding "This requires spark.shuffle.service.enabled
to be set." in the description of `spark.dynamicAllocation.enabled`. This
cannot be inferred implicitly, hence it should be better to have it.

Why I'm in favor of structured configuration & implicit effect over
describing everything explicitly is there.

1. There're 10 configurations (if the doc doesn't miss any other
configuration) except `spark.dynamicAllocation.enabled`, and only 4
configurations are referred in the description of
`spark.dynamicAllocation.enabled` - majority of config keys are missing.
2. I think it's intentional, but the table starts
with `spark.dynamicAllocation.enabled` which talks implicitly but
intuitively that if you disable this then everything on dynamic allocation
won't work. Missing majority of references on config keys don't get it hard
to understand.
3. Even `spark.dynamicAllocation` has bad case - see
`spark.dynamicAllocation.shuffleTracking.enabled` and
`spark.dynamicAllocation.shuffleTimeout`. It is not respecting the
structure of configuration. I think this is worse than not explicitly
mentioning the description. Let's assume the name has
been `spark.dynamicAllocation.shuffleTracking.timeout` - isn't it intuitive
that setting `spark.dynamicAllocation.shuffleTracking.enabled` to `false`
would effectively disable `spark.dynamicAllocation.shuffleTracking.timeout`?

Btw, maybe off-topic, `spark.dynamicAllocation` is having another issue on
practice - whether to duplicate description between configuration code and
doc. I have been asked to add description on configuration code
regardlessly, and existing codebase doesn't. This configuration is
widely-used one.


On Wed, Feb 12, 2020 at 11:22 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Sure, adding "[DISCUSS]" is a good practice to label it. I had to do it
> although it might be "redundant" :-) since anyone can give feedback to any
> thread in Spark dev mailing list, and discuss.
>
> This is actually more prevailing given my rough reading of configuration
> files. I would like to see this missing relationship as a bad pattern,
> started from a personal preference.
>
> > Personally I'd rather not think someone won't understand setting
> `.enabled` to `false` means the functionality is disabled and effectively
> it disables all sub-configurations.
> > E.g. when `spark.sql.adaptive.enabled` is `false`, all the
> configurations for `spark.sql.adaptive.*` are implicitly no-op. For me this
> is pretty intuitive and the one of major
> > benefits of the structured configurations.
>
> I don't think this is a good idea we assume for users to know such
> contexts. One might think
> `spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled` can
> partially enable the feature. It is better to be explicit to document
> since some of configurations are even difficult for users to confirm if it
> is working or not.
> For instance, one might think setting 'spark.eventLog.rolling.maxFileSize'
> automatically enables rolling. Then, they realise the log is not rolling
> later after the file
> size becomes bigger.
>
>
> 2020년 2월 12일 (수) 오전 10:47, Jungtaek Lim <kabhwan.opensou...@gmail.com>님이
> 작성:
>
>> I'm sorry if I miss something, but this is ideally better to be started
>> as [DISCUSS] as I haven't seen any reference to have consensus on this
>> practice.
>>
>> For me it's just there're two different practices co-existing on the
>> codebase, meaning it's closer to the preference of individual (with
>> implicitly agreeing that others have different preferences), or it hasn't
>> been discussed thoughtfully.
>>
>> Personally I'd rather not think someone won't understand setting
>> `.enabled` to `false` means the functionality is disabled and effectively
>> it disables all sub-configurations. E.g. when `spark.sql.adaptive.enabled`
>> is `false`, all the configurations for `spark.sql.adaptive.*` are
>> implicitly no-op. For me this is pretty intuitive and the one of major
>> benefits of the structured configurations.
>>
>> If we want to make it explicit, "all" sub-configurations should have
>> redundant part of the doc. More redundant if the condition is nested. I
>> agree this is the good step of "be kind" but less pragmatic.
>>
>> I'd be happy to follow the consensus we would make in this thread.
>> Appreciate more voices.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Wed, Feb 12, 2020 at 10:36 AM Hyukjin Kwon <gurwls...@gmail.com>
>> wrote:
>>
>>> > I don't plan to document this officially yet
>>> Just to prevent confusion, I meant I don't yet plan to document the fact
>>> that we should write the relationships in configurations as a code/review
>>> guideline in https://spark.apache.org/contributing.html
>>>
>>>
>>> 2020년 2월 12일 (수) 오전 9:57, Hyukjin Kwon <gurwls...@gmail.com>님이 작성:
>>>
>>>> Hi all,
>>>>
>>>> I happened to review some PRs and I noticed that some configurations
>>>> don't have some information
>>>> necessary.
>>>>
>>>> To be explicit, I would like to make sure we document the direct
>>>> relationship between other configurations
>>>> in the documentation. For example,
>>>> `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled`
>>>> can be only enabled when `spark.sql.adaptive.enabled` is enabled.
>>>> That's clearly documented.
>>>> We're good in general given that we document them in general in Apache
>>>> Spark.
>>>> See 'spark.task.reaper.enabled', 'spark.dynamicAllocation.enabled',
>>>> 'spark.sql.parquet.filterPushdown', etc.
>>>>
>>>> However, I noticed such a pattern that such information is missing in
>>>> some components in general, for example,
>>>> `spark.history.fs.cleaner.*`, `spark.history.kerberos.*` and
>>>> `spark.history.ui.acls.* `
>>>>
>>>> I hope we all start to document such information. Logically users can't
>>>> know the relationship and I myself
>>>> had to read the codes to confirm when I review.
>>>> I don't plan to document this officially yet because to me it looks a
>>>> pretty logical request to me; however,
>>>> let me know if you guys have some different opinions.
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>

Re: Request to document the direct relationship between other configurations

Reply via email to