I think the issue is whether a distribution of Spark is so materially
different from OSS that it causes problems for the larger community of
users. There's a legitimate question of whether such a thing can be called
"Apache Spark + changes", as describing it that way becomes meaningfully
inaccurate. And if it's inaccurate, then it's a trademark usage issue, and
a matter for the PMC to act on. I certainly recall this type of problem
from the early days of Hadoop - the project itself had 2 or 3 live branches
in development (was it 0.20.x vs 0.23.x vs 1.x? YARN vs no YARN?) picked up
by different vendors and it was unclear what "Apache Hadoop" meant in a
vendor distro. Or frankly, upstream.

In comparison, variation in the Scala maintenance release seems trivial. I'm
not clear from the thread what actual issue this causes for users. Is there
more to it - does this go hand in hand with JDK version and Ammonite, or
are those separate? What's an example of the practical user issue? Like, I
compile vs Spark 3.4.0 and because of Scala version differences it doesn't
run on some vendor distro? That's not great, but seems like a vendor
problem. Unless you tell me we are getting tons of bug reports to OSS Spark
as a result or something.

Is the implication that something in OSS Spark is being blocked to prefer
some set of vendor choices? Because the changes you're pointing to seem to
be going into Apache Spark, actually. It'd be more useful to be specific
and name names at this point; that seems fine.

The rest of this is just a discussion about Databricks choices. (If it's
not clear, I'm at Databricks but do not work on the Spark distro). We can
discuss but it seems off-topic _if_ it can't be connected to a problem for
OSS Spark. Anyway:

If it helps, _some_ important patches are described at
; I don't think this is exactly hidden.

Out of curiosity, how would you describe this software in the UI instead?
"3.4.0" is shorthand, because this is a little dropdown menu; the terminal
output is likewise not a place to list all patches. Would you propose
calling this "3.4.0 + patches"? That's the best I can think of,
but I don't think it addresses what you're getting at anyway. I think you'd
just prefer Databricks make a different choice, which is legitimate, but
that's an issue to take up with Databricks, not here.

On Mon, Jun 5, 2023 at 6:58 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, Sean.
> "+ patches" or "powered by Apache Spark 3.4.0" is not a problem as you
> mentioned. For the record, I also didn't bring up any old story here.
> > "Apache Spark 3.4.0 + patches"
> However, "including Apache Spark 3.4.0" still causes confusion even in a
> different way because of those missing patches, SPARK-40436 (Upgrade Scala
> to 2.12.17) and SPARK-39414 (Upgrade Scala to 2.12.16). Technically,
> Databricks Runtime doesn't include Apache Spark 3.4.0 while it claims it to
> the users.
> [image: image.png]
> It's a sad story from the Apache Spark Scala perspective because the users
> cannot even try to use the correct Scala 2.12.17 version in the runtime.
> All items I've shared are connected via a single theme, hurting Apache
> Spark Scala users.
> From (1) building Spark, (2) creating a fragmented Scala Spark runtime
> environment and (3) hidden user-facing documentation.
> Of course, I don't think those are designed in an organized way
> intentionally. It just happens at the same time.
> Based on your comments, let me ask you two questions. (1) When Databricks
> builds its internal Spark from its private code repository, is it a company
> policy to always expose "Apache 3.4.0" to the users like the following by
> ignoring all changes (whatever they are)? And (2) do you insist that it is
> normative and clear to the users and the community?
> > - The runtime logs "23/06/05 04:23:27 INFO SparkContext: Running Spark
> > version 3.4.0"
> > - UI shows Apache Spark logo and `3.4.0`.
