Re: [DISCUSS] Code organization for Spark 3.x and 4.x

Ajantha Bhat Thu, 28 May 2026 23:17:22 -0700

We had a similar situation in the Iceberg community a long time ago, and we
decided to keep duplicated code instead of maintaining a common module.
https://github.com/apache/iceberg/pull/3313


A similar approach has been taken for the Flink modules too.

I cannot recall the exact decision points, but I think it was because
duplication gives us independent modules and faster development, and because
maintaining common code required more effort.
Maybe someone who remembers the Iceberg discussion can share more on this.

- Ajantha

On Fri, May 29, 2026 at 9:00 AM Yong Zheng <[email protected]> wrote:

> Hello Dmitri,
>
> Thanks for bringing this up. Personally, I like the second option as well,
> since our code base around Spark is really minimal. One thing I am not sure
> about is how we decide which code moves to the common module. Let's use a
> couple of examples:
> 1. Code that is 100% identical and unlikely to change: we can certainly
> move it.
> 2. Code that is 100% identical for now but may change in the next
> version: how do we decide this? Do we move it back from common to the
> Spark version-specific module? Or convert it to an adapter?
> 3. Code that already differs: do we keep two files that are 80% identical,
> one per Spark version, or convert them to an adapter?
>
> Thanks,
> Yong
>
> On 2026/05/28 21:20:57 Dmitri Bourlatchkov wrote:
> > Hi All,
> >
> > This is another discussion stemming from today's Community Sync call and
> > PR [4535].
> >
> > Adding support for Spark 4 apparently produced a substantial amount of
> > "copied" code in [4535].
> >
> > Points in favour of copy:
> >
> > * Adjusting to differences between Spark versions is easier
> >
> > * Dropping support for old Spark versions is easy (when they expire).
> >
> > Points in favour of extracting common modules:
> >
> > * Nice code organization. Common code is unit-tested once.
> >
> > * Bug fixes in shared logic only need to be done in one place.
> >
> > * Polaris does not appear to depend on deep Spark API (no query planning,
> > etc.) so differences between Spark versions can probably be handled by
> > allowing a small number of customization points in the common code.
> >
> > I tend to prefer the second approach, that is factoring out common code
> > and sharing it between Spark 3.x and 4.x modules with the expectation that
> > the size of the common code is much larger than the size of the
> > version-specific code.
> >
> > Thoughts?
> >
> > [4535] https://github.com/apache/polaris/pull/4535
> >
> > Thanks,
> > Dmitri.
> >
>
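For illustration only, the "customization points" / "adapter" idea from the quoted messages could be sketched roughly as below. Every name here (VersionAdapter, CommonCatalogLogic, Spark3Adapter, Spark4Adapter) is hypothetical and not taken from PR [4535] or the Polaris code base, and the difference in identifier quoting between the two adapters is invented purely to have something that varies; it does not describe a real Spark API difference.

```java
import java.util.List;
import java.util.stream.Collectors;

public class CommonModuleSketch {

    /** Hypothetical customization point: the only code that differs per Spark line. */
    interface VersionAdapter {
        String sparkLine();

        String quoteIdentifier(String name);
    }

    /** Shared logic that would live (and be unit-tested once) in a common module. */
    static final class CommonCatalogLogic {
        private final VersionAdapter adapter;

        CommonCatalogLogic(VersionAdapter adapter) {
            this.adapter = adapter;
        }

        String describeTable(List<String> namespace, String table) {
            // Quote every namespace level through the adapter, then join with dots.
            String ns = namespace.stream()
                    .map(adapter::quoteIdentifier)
                    .collect(Collectors.joining("."));
            String quotedTable = adapter.quoteIdentifier(table);
            String qualified = ns.isEmpty() ? quotedTable : ns + "." + quotedTable;
            return adapter.sparkLine() + ":" + qualified;
        }
    }

    /** Would live in the Spark 3.x module. The behavior is made up for this sketch. */
    static final class Spark3Adapter implements VersionAdapter {
        @Override
        public String sparkLine() {
            return "3.x";
        }

        @Override
        public String quoteIdentifier(String name) {
            return name;
        }
    }

    /** Would live in the Spark 4.x module. The behavior is made up for this sketch. */
    static final class Spark4Adapter implements VersionAdapter {
        @Override
        public String sparkLine() {
            return "4.x";
        }

        @Override
        public String quoteIdentifier(String name) {
            return "`" + name + "`";
        }
    }

    public static void main(String[] args) {
        List<String> ns = List.of("ns");
        System.out.println(new CommonCatalogLogic(new Spark3Adapter()).describeTable(ns, "t"));
        System.out.println(new CommonCatalogLogic(new Spark4Adapter()).describeTable(ns, "t"));
    }
}
```

With a layout like this, dropping an old Spark line would mean deleting its adapter module, while a class that starts to diverge (case 2 above) would gain a new method on the adapter interface instead of being copied. Whether that stays cheaper than duplication depends on how many such methods accumulate.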
