Hey devs,

I think we can improve the way we currently handle our Spark configuration.
Specifically, I see a few issues.

Our SQL configs are scattered across a number of classes.

For example, we have SQL configs in SparkUtil, IcebergSource, 
SparkWriteBuilder, and Spark3Util. I think having a separate class for such 
constants would make sense, given that we already have SparkReadOptions and 
SparkWriteOptions.
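
For illustration, the new class could look roughly like this (the class name 
SparkSQLProperties and the property keys below are placeholders I made up for 
this sketch, not final names):

    // Hypothetical holder for Iceberg's Spark SQL session configs, mirroring
    // the existing SparkReadOptions and SparkWriteOptions constant classes.
    public class SparkSQLProperties {

      private SparkSQLProperties() {
      }

      // placeholder key: controls vectorized reads at the session level
      public static final String VECTORIZATION_ENABLED =
          "spark.sql.iceberg.vectorization.enabled";

      // placeholder key: controls nullability checks for incoming columns
      public static final String CHECK_NULLABILITY =
          "spark.sql.iceberg.check-nullability";
    }

That way, the SQL config keys would live in exactly one place, just like the 
read/write option names do today.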

We repeat the same code to parse the read/write options, session conf, and 
table metadata in multiple places.

For example, we duplicate the code for getting the write file format, 
split-related configs, whether to check the nullability and ordering of 
incoming columns, and so on. This happens partly because we have both Spark 2 
and Spark 3 integrations, but also because we have multiple scans/writers in 
Spark 3. With the upcoming support for merge-on-read, there will be even more 
classes where we would need to parse the same arguments.
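
To make the duplication concrete, here is a simplified sketch (the option and 
property names are approximate) of the kind of logic each writer currently 
re-implements to resolve the write file format:

    import java.util.Locale;
    import java.util.Map;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableProperties;

    class WriteFormatExample {
      // Repeated in each writer today (simplified): resolve the file format
      // from the write options, falling back to the table's metadata and then
      // to a hard-coded default.
      static FileFormat writeFileFormat(Map<String, String> options, Table table) {
        String formatName = options.getOrDefault(
            "write-format",                                    // write option first
            table.properties().getOrDefault(
                TableProperties.DEFAULT_FILE_FORMAT,           // then table metadata
                TableProperties.DEFAULT_FILE_FORMAT_DEFAULT)); // then the default
        return FileFormat.valueOf(formatName.toUpperCase(Locale.ENGLISH));
      }
    }

Every copy of logic like this is a chance for the implementations to drift 
apart.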

The config resolution logic complicates the classes where it is defined.

Our scan and writer builders are already complicated, so moving the config 
resolution logic into a dedicated class seems like a promising idea.
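
As an illustration (the class and method names here are placeholders, not 
necessarily what the PR uses), the builders could delegate to a dedicated conf 
object and only ask for final values:

    import java.util.Map;
    import org.apache.iceberg.Table;

    // Hypothetical class that owns all write config resolution, so that
    // SparkWriteBuilder and friends stay focused on building the write.
    public class SparkWriteConf {
      private final Table table;
      private final Map<String, String> writeOptions;

      public SparkWriteConf(Table table, Map<String, String> writeOptions) {
        this.table = table;
        this.writeOptions = writeOptions;
      }

      // each accessor hides the parsing and fallback logic that currently
      // lives inline in the builders
      public long targetDataFileSize() {
        String value = writeOptions.get("target-file-size-bytes");
        if (value != null) {
          return Long.parseLong(value);
        }
        return Long.parseLong(table.properties().getOrDefault(
            "write.target-file-size-bytes", "536870912"));
      }
    }

A builder would then just call 
new SparkWriteConf(table, writeOptions).targetDataFileSize() instead of 
inlining the lookup.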

We have an inconsistent precedence order for Spark configs.

Historically, we have been inconsistent about whether a read/write option 
should take precedence over the corresponding session config. At some point, 
we had a discussion [1] and reached consensus that the most common way to 
think about precedence is read/write option -> session conf -> table metadata. 
There are still places where the session config overrides options. I think we 
should finally fix that and be consistent, even though it may be a behavior 
change.
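
In code, the agreed precedence could look like the accessor below (a minimal 
sketch on the hypothetical SparkWriteConf class from above, assuming it also 
holds the session conf obtained via spark.conf(); the config names are 
illustrative):

    // Resolution following the agreed precedence:
    // write option -> session conf -> table metadata / default.
    public boolean checkNullability() {
      String optionValue = writeOptions.get("check-nullability");
      if (optionValue != null) {
        return Boolean.parseBoolean(optionValue);    // option wins over everything
      }
      String sessionValue = sessionConf.get("spark.sql.iceberg.check-nullability", null);
      if (sessionValue != null) {
        return Boolean.parseBoolean(sessionValue);   // session conf beats table metadata
      }
      return true;                                   // default when nothing is set
    }

The behavior change is exactly the first branch: an explicitly passed option 
would now always win, even in the places where the session config used to 
override it.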

I’ve put together a PR to cover these points and I’d appreciate your feedback. 
I’m raising this on the dev list because the last point is a behavior change 
and I want to make sure everyone is OK with it.

Anton

[1] - https://github.com/apache/iceberg/pull/2248#discussion_r580693954
