Hi all,

I'd like to discuss the naming policy for Spark configs. At the moment
naming depends on personal preference, which leads to inconsistent names.

In general, the config name should be a noun that describes its meaning
clearly.
Good examples:
spark.sql.session.timeZone
spark.sql.streaming.continuous.executorQueueSize
spark.sql.statistics.histogram.numBins
Bad example:
spark.sql.defaultSizeInBytes (default size for what?)

Also note that a config name has several parts joined by dots, and each
part is a namespace. Don't create namespaces unnecessarily.
Good examples:
spark.sql.execution.rangeExchange.sampleSizePerPartition
spark.sql.execution.arrow.maxRecordsPerBatch
Bad example:
spark.sql.windowExec.buffer.in.memory.threshold ("in" is not a useful
namespace, better to be .buffer.inMemoryThreshold)
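
To illustrate the namespace point, here is a rough sketch of a check that
splits a config name on dots and flags parts that are just filler words.
The FILLER_PARTS list and the function name are assumptions made up for
this example; nothing like this exists in Spark today.

```python
# Hypothetical list of filler words that rarely make useful namespaces.
# This is an assumption for illustration, not a list taken from Spark.
FILLER_PARTS = {"in", "of", "for", "to", "on", "by"}


def filler_namespaces(config_name):
    """Return the dot-separated parts of config_name that are filler words."""
    return [part for part in config_name.split(".") if part in FILLER_PARTS]


# The bad example above has one filler part; the suggested fix has none.
print(filler_namespaces("spark.sql.windowExec.buffer.in.memory.threshold"))
print(filler_namespaces("spark.sql.windowExec.buffer.inMemoryThreshold"))
```

A word list like this only catches the obvious cases; whether a namespace
is really useful still needs a human reviewer.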

For a big feature, usually we need to create an umbrella config to turn it
on/off, and other configs for fine-grained controls. These configs should
share the same namespace, and the umbrella config should be named like
featureName.enabled. For example:
spark.sql.cbo.enabled
spark.sql.cbo.starSchemaDetection
spark.sql.cbo.starJoinFTRatio
spark.sql.cbo.joinReorder.enabled
spark.sql.cbo.joinReorder.dp.threshold (BTW "dp" is not a good namespace)
spark.sql.cbo.joinReorder.card.weight (BTW "card" is not a good namespace)
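
As a minimal sketch of the umbrella-config rule, the helpers below take a
list of config names and check whether a feature namespace has a matching
featureName.enabled config. The helper names are hypothetical and only
meant to show the idea, using the cbo configs listed above as input.

```python
def feature_configs(config_names, feature_namespace):
    """Return the configs that live under the given feature namespace."""
    prefix = feature_namespace + "."
    return [name for name in config_names if name.startswith(prefix)]


def has_umbrella_config(config_names, feature_namespace):
    """Return True if the namespace has a '<namespace>.enabled' config."""
    return feature_namespace + ".enabled" in config_names


# The cbo configs from this email.
CBO_CONFIGS = [
    "spark.sql.cbo.enabled",
    "spark.sql.cbo.starSchemaDetection",
    "spark.sql.cbo.starJoinFTRatio",
    "spark.sql.cbo.joinReorder.enabled",
    "spark.sql.cbo.joinReorder.dp.threshold",
    "spark.sql.cbo.joinReorder.card.weight",
]

print(has_umbrella_config(CBO_CONFIGS, "spark.sql.cbo"))
print(has_umbrella_config(CBO_CONFIGS, "spark.sql.cbo.joinReorder"))
```

Both the feature and its joinReorder sub-feature have an umbrella config
here, which is the layout the rule asks for.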

For a boolean config, the last part should generally be a verb or verb
phrase, e.g. spark.sql.join.preferSortMergeJoin. If the config is for a
feature and you can't find a good verb for it, featureName.enabled is also
fine.

I'll update https://spark.apache.org/contributing.html after we reach a
consensus here. Any comments are welcome!

Thanks,
Wenchen
