Thanks Alexey.

This has actually been the case for a while now, I think. From what I can
see, our Spark quickstart still suggests passing spark-avro in via
--packages, but the utilities-bundle examples rely on the fact that it is
pre-bundled.

I do acknowledge that with recent Spark 3.x versions, breakages have become
much more frequent, amplifying this pain. However, to prevent jobs from
failing upon upgrade (i.e. forcing everyone to redeploy streaming + batch
jobs with the --packages flag), I would prefer that we keep the same
bundling behavior, with the following simplifications:

1. We have three Spark profiles now - spark2, spark3.1.x, and spark3
(3.2.1). We continue to bundle spark-avro and support the latest Spark
minor version.
2. We retain, and make the docs clearer about, how users can "optionally"
unbundle and deploy for other versions (see the sketch below).
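
To illustrate (2), the docs could spell out something along these lines
(a sketch only - the exact profile activation and the versions here are
assumptions, to be checked against the build docs):

    # build the bundle for a specific Spark line (profile names as above)
    mvn clean package -DskipTests -Pspark3.1.x

    # then supply spark-avro at submit time, matched to the cluster's Spark
    spark-submit --packages org.apache.spark:spark-avro_2.12:3.1.2 ...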

Given the other large features going out, turned on by default this
release, I am not sure it's a good idea to introduce a breaking change like
this.

Thanks
Vinoth

On Tue, Mar 8, 2022 at 1:32 PM Alexey Kudinkin <[email protected]> wrote:

> Hello, everyone!
>
> While working on HUDI-3549 <
> https://issues.apache.org/jira/browse/HUDI-3549>,
> we were surprised to discover that Hudi actually bundles the "spark-avro"
> dependency *by default*.
>
> This is problematic b/c "spark-avro" is tightly coupled with some of the
> other Spark components making up its core distribution, i.e. components
> packaged in Spark itself rather than as external packages ("spark-sql"
> being one example).
>
> Regarding HUDI-3549
> <https://issues.apache.org/jira/browse/HUDI-3549> itself,
> the problem there unfolded as follows:
>
>    1. We built "hudi-spark-bundle" with "spark-avro" 3.2.1 bundled
>    along with it
>    2. @Sivabalan tried to use this Hudi bundle w/ Spark 3.2.0
>    3. It failed b/c "spark-avro" 3.2.1 is *not compatible* w/ "spark-sql"
>    3.2.0 (due to https://github.com/apache/spark/pull/34978, which fixed a
>    typo and renamed internal API methods in DataSourceUtils)
>
>
> To avoid these problems going forward, our proposal is to
>
>    1. *Unbundle* "spark-avro" from Hudi bundles by default (practically,
>    this means that Hudi users would now need to specify spark-avro via the
>    `--packages` flag, since it's not part of Spark's core distribution),
>    as sketched below
>    2. (Optional) If the community still sees value in bundling (and
>    shading) "spark-avro" in some cases, we can add a Maven profile that
>    allows doing that *ad hoc*.
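>
> As a concrete sketch of (1): a job running against a Spark 3.2.0 cluster
> would pin spark-avro to the matching version (coordinates below assume
> Scala 2.12; adjust to your build):
>
>    spark-submit \
>      --packages org.apache.spark:spark-avro_2.12:3.2.0 \
>      --jars /path/to/hudi-spark-bundle.jar \
>      ...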
>
> We've put up PR #4955 <https://github.com/apache/hudi/pull/4955> with the
> proposed changes.
>
> Looking forward to your feedback.
>
