I think the issue is whether a distribution of Spark is so materially different from OSS that it causes problems for the larger community of users. There's a legitimate question of whether such a thing can still be called "Apache Spark + changes", if describing it that way becomes meaningfully inaccurate. And if it's inaccurate, then it's a trademark usage issue, and a matter for the PMC to act on. I certainly recall this type of problem from the early days of Hadoop: the project itself had 2 or 3 live branches in development (was it 0.20.x vs 0.23.x vs 1.x? YARN vs no YARN?) picked up by different vendors, and it was unclear what "Apache Hadoop" meant in a vendor distro. Or, frankly, upstream.
In comparison, variation in a Scala maintenance release seems trivial. I'm not clear from the thread what actual issue this causes for users. Is there more to it - does this go hand in hand with the JDK version and Ammonite, or are those separate? What's an example of the practical user issue? Like, I compile against Spark 3.4.0 and, because of Scala version differences, it doesn't run on some vendor distro? That's not great, but it seems like a vendor problem. Unless you tell me we are getting tons of bug reports to OSS Spark as a result, or something like that.

Is the implication that something in OSS Spark is being blocked to prefer some set of vendor choices? Because the changes you're pointing to seem to be going into Apache Spark, actually. It'd be more useful to be specific and name names at this point; that seems fine.

The rest of this is just a discussion about Databricks choices. (If it's not clear, I'm at Databricks but do not work on the Spark distro.) We can discuss, but it seems off-topic _if_ it can't be connected to a problem for OSS Spark. Anyway: if it helps, _some_ important patches are described at https://docs.databricks.com/release-notes/runtime/maintenance-updates.html ; I don't think this is exactly hidden.

Out of curiosity, how would you describe this software in the UI instead? "3.4.0" is shorthand, because this is a little dropdown menu; the terminal output is likewise not a place to list all patches. Would you propose calling this "3.4.0 + patches"? That's the best I can think of, but I don't think it addresses what you're getting at anyway. I think you'd just prefer Databricks make a different choice, which is legitimate, but that's an issue to take up with Databricks, not here.

On Mon, Jun 5, 2023 at 6:58 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

> Hi, Sean.
>
> "+ patches" or "powered by Apache Spark 3.4.0" is not a problem, as you
> mentioned. For the record, I also didn't bring up any old story here.
> > "Apache Spark 3.4.0 + patches"
>
> However, "including Apache Spark 3.4.0" still causes confusion, in a
> different way, because of those missing patches: SPARK-40436 (Upgrade Scala
> to 2.12.17) and SPARK-39414 (Upgrade Scala to 2.12.16). Technically,
> Databricks Runtime doesn't include Apache Spark 3.4.0, even though it
> claims to users that it does.
>
> [image: image.png]
>
> It's a sad story from the Apache Spark Scala perspective, because users
> cannot even try to use the correct Scala 2.12.17 version in the runtime.
>
> All the items I've shared are connected by a single theme, hurting Apache
> Spark Scala users: from (1) building Spark, to (2) creating a fragmented
> Scala Spark runtime environment, and (3) hidden user-facing documentation.
>
> Of course, I don't think those were designed in an organized way
> intentionally. It just happens at the same time.
>
> Based on your comments, let me ask you two questions. (1) When Databricks
> builds its internal Spark from its private code repository, is it company
> policy to always expose "Apache 3.4.0" to users, like the following, while
> ignoring all changes (whatever they are)? And (2) do you insist that this
> is normative and clear to the users and the community?
>
> - The runtime logs "23/06/05 04:23:27 INFO SparkContext: Running Spark
> version 3.4.0"
> - The UI shows the Apache Spark logo and `3.4.0`.
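As a concrete illustration of the mismatch being discussed above: upstream Apache Spark 3.4.0 pins Scala 2.12.17 (SPARK-40436), so a user could sanity-check whether a runtime that reports "3.4.0" actually matches the upstream pin. A minimal sketch - the mapping, function name, and version strings here are illustrative, not any Spark API:

```python
# Mapping of an upstream Apache Spark release to the Scala patch version
# its build pins. Only the 3.4.0 entry is taken from this thread
# (SPARK-40436); any other entries would need to be checked against the
# release's actual pom.xml.
UPSTREAM_SCALA = {"3.4.0": "2.12.17"}


def scala_matches_upstream(spark_version: str, runtime_scala_version: str) -> bool:
    """Return True if a runtime's reported Scala patch version matches the
    one pinned by the upstream Apache Spark release of the same number.

    Hypothetical helper for illustration; the runtime_scala_version would
    come from something like scala.util.Properties.versionNumberString in
    an actual Scala session.
    """
    expected = UPSTREAM_SCALA.get(spark_version)
    return expected is not None and runtime_scala_version == expected
```

For example, a runtime reporting Spark "3.4.0" with Scala "2.12.15" would fail this check, which is essentially the discrepancy Dongjoon is pointing at.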