Hey devs, I think we can improve the way we handle our Spark configuration right now. Specifically, I see a few issues.
Our SQL configs are scattered across a number of classes. For example, we have some SQL configs in SparkUtil, IcebergSource, SparkWriteBuilder, Spark3Util. I think having a separate class for such constants would make sense, given that we already have SparkReadOptions and SparkWriteOptions. We repeat the same code to parse the read/write options, session conf, table metadata in multiple places. For example, we duplicate the code for getting the write file format, split related configs, whether to check nullability and order of incoming columns, etc. This happens partially because we have Spark 2 and Spark 3 integrations but also because we have multiple scans/writers in Spark 3. With upcoming support for merge-on-read, there will be even more classes where we would need to parse the same arguments. The config resolution logic complicates classes where it is defined. Our writer and scan builders are already complicated so removing the config resolution logic and dedicating a separate class for that seems like a promising idea. We have inconsistent precedence order for Spark configs. Historically, we interpreted whether a read/write option should take precedence over the corresponding session config inconsistently. At some point, we had a discussion [1] and reached consensus that the most common way to think about it is read option -> session conf -> table metadata. There are still places where the session config overrides options. I think we should finally fix that and be consistent even though it may be a behavior change. I’ve put together a PR to cover these points and I’d appreciate your feedback. I mention this on the dev list because the last point is a behavior change and I want to make sure everyone is OK with that. Anton [1] - https://github.com/apache/iceberg/pull/2248#discussion_r580693954