Thanks for raising the Iceberg convention, Ajantha! I think it's a good idea to investigate it before extracting a common module for multiple Spark versions.

Yufei

On Mon, Jun 1, 2026 at 6:07 AM Robert Stupp <[email protected]> wrote:

> Hi all,
>
> I also prefer the approach to not have duplicated code.
>
> Looking at the `spark/src` and `integration/src` directories, I see 23
> byte-identical files, and 4 more files that have slight Spark-version
> specific differences that can be "deduplicated" with base classes plus
> version specific adapters.
>
> This appears to match Polaris's use of Spark, which is different from
> projects that deeply integrate with Spark or Flink planning and execution
> internals.
>
> Robert
>
>
> On Mon, Jun 1, 2026 at 7:48 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Hi Dmitri,
> >
> > While I don't have a major concern with duplicating code in principle,
> > the main issue is the quantity of duplication. If the amount of
> > redundant code is large, it becomes significantly harder to maintain.
> >
> > For this reason, I prefer the second option of factoring out common code.
> >
> > Regards,
> > JB
> >
> > On Thu, May 28, 2026 at 11:21 PM Dmitri Bourlatchkov <[email protected]>
> > wrote:
> > >
> > > Hi All,
> > >
> > > This is another discussion stemming from today's Community Sync call and PR
> > > [4535].
> > >
> > > Adding support for Spark 4 apparently produced a substantial amount of
> > > "copied" code in [4535].
> > >
> > > Points in favour of copy:
> > >
> > > * Adjusting to differences between Spark versions is easier
> > >
> > > * Dropping support for old Spark versions is easy (when they expire).
> > >
> > > Points in favour of extracting common modules:
> > >
> > > * Nice code organization. Common code is unit-tested once.
> > >
> > > * Bug fixes in shared logic only need to be done in one place.
> > >
> > > * Polaris does not appear to depend on deep Spark API (no query planning,
> > > etc.) so differences between Spark versions can probably be handled by
> > > allowing a small number of customization points in the common code.
> > >
> > > I tend to prefer the second approach, that is factoring out common code and
> > > sharing it between Spark 3.x and 4.x modules with the expectation that the
> > > size of the common code is much larger than the size of the
> > > version-specific code.
> > >
> > > Thoughts?
> > >
> > > [4535] https://github.com/apache/polaris/pull/4535
> > >
> > > Thanks,
> > > Dmitri.
> > >
