Hi Yong,

Thanks for the info!

DeltaHelper looks like a perfect example for the common/specific code
refactoring. It is a class owned by Polaris so we can define it as an
interface in the common code with different implementations specific to
Spark 3.x, 4.x, etc.
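
A minimal sketch of that split might look like the following. Only the DeltaHelper name comes from this thread; the method, the two implementation classes, the CommonCatalogLogic class, and the returned strings are hypothetical placeholders and are not taken from the Polaris code.

```java
// Common module: the only DeltaHelper type that shared code refers to.
// The method below is a hypothetical stand-in for whatever differs
// between Spark versions.
interface DeltaHelper {
    String deltaCatalogClassName();
}

// Spark 3.x module: hypothetical version-specific implementation.
final class Spark3DeltaHelper implements DeltaHelper {
    @Override
    public String deltaCatalogClassName() {
        return "example.spark3.DeltaCatalog"; // placeholder value
    }
}

// Spark 4.x module: hypothetical version-specific implementation.
final class Spark4DeltaHelper implements DeltaHelper {
    @Override
    public String deltaCatalogClassName() {
        return "example.spark4.DeltaCatalog"; // placeholder value
    }
}

// Common module: shared logic, written and unit-tested once.
// It receives the helper from the version-specific module and
// never names a Spark version itself.
final class CommonCatalogLogic {
    private final DeltaHelper deltaHelper;

    CommonCatalogLogic(DeltaHelper deltaHelper) {
        this.deltaHelper = deltaHelper;
    }

    String describeDeltaSupport() {
        return "Delta tables handled by " + deltaHelper.deltaCatalogClassName();
    }
}
```

Each version-specific module would construct the common logic with its own helper, e.g. new CommonCatalogLogic(new Spark4DeltaHelper()), so dropping an old Spark version means deleting one small implementation class.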

I haven't looked too deeply into this code, but it seems doable... WDYT?

Cheers,
Dmitri.

On Sat, May 30, 2026 at 1:30 PM Yong Zheng <[email protected]> wrote:

> Hello Dmitri,
>
> Yes. It may not be exactly a 20% diff, but we can use SparkCatalog.java
> as an example, where there are differences introduced by the Spark 4 API.
> The one that will actually have a 20% diff is DeltaHelper.java, where there
> are some major changes. Details can be found in
> https://github.com/apache/polaris/pull/4535#issuecomment-4551464567.
>
> Thanks,
> Yong Zheng
>
> On 2026/05/29 17:01:54 Dmitri Bourlatchkov wrote:
> > Hi Yong,
> >
> > Do you have an example of a file that is 80% identical between Spark 3
> > and 4? (sorry, I'm not very familiar with that codebase myself :)
> >
> > Thanks,
> > Dmitri.
> >
> > On Thu, May 28, 2026 at 11:30 PM Yong Zheng <[email protected]> wrote:
> >
> > > Hello Dmitri,
> > >
> > > Thanks for bringing this one up. Personally, I like the second option as
> > > well, as our code base around Spark is really minimal. One thing I am not
> > > sure about is how we decide which ones move to the common module. Let's
> > > use a couple of examples:
> > > 1. For the ones that are 100% identical and unlikely to change: for
> > > sure we can move those.
> > > 2. For the ones that are 100% identical for now but may change in the
> > > next version: how do we decide this? Do we move them back from common
> > > to the Spark version-specific module? Or convert to an adapter?
> > > 3. For the ones that already differ, do we keep two files that are
> > > 80% identical, one per Spark version, or convert to an adapter?
> > >
> > > Thanks,
> > > Yong
> > >
> > > On 2026/05/28 21:20:57 Dmitri Bourlatchkov wrote:
> > > > Hi All,
> > > >
> > > > > This is another discussion stemming from today's Community Sync call
> > > > > and PR [4535].
> > > >
> > > > > Adding support for Spark 4 apparently produced a substantial amount
> > > > > of "copied" code in [4535].
> > > >
> > > > Points in favour of copy:
> > > >
> > > > * Adjusting to differences between Spark versions is easier
> > > >
> > > > * Dropping support for old Spark versions is easy (when they expire).
> > > >
> > > > Points in favour of extracting common modules:
> > > >
> > > > * Nice code organization. Common code is unit-tested once.
> > > >
> > > > * Bug fixes in shared logic only need to be done in one place.
> > > >
> > > > > * Polaris does not appear to depend on deep Spark API (no query
> > > > > planning, etc.) so differences between Spark versions can probably
> > > > > be handled by allowing a small number of customization points in
> > > > > the common code.
> > > >
> > > > > I tend to prefer the second approach, that is factoring out common
> > > > > code and sharing it between Spark 3.x and 4.x modules with the
> > > > > expectation that the size of the common code is much larger than
> > > > > the size of the version-specific code.
> > > >
> > > > Thoughts?
> > > >
> > > > [4535] https://github.com/apache/polaris/pull/4535
> > > >
> > > > Thanks,
> > > > Dmitri.
> > > >
> > >
> >
>
