I'm going to join in from an ASF community perspective.

Nobody should be making fundamental changes to an ASF code base with a PR
put up and then merged two hours later because of the needs of a single
vendor of a downstream product. That doesn't even give people in different
time zones the chance to review it. It goes completely against the concept
of "community", replacing it with private problems, not shared with anyone,
and large pieces of development work to address them, landed without any
opportunity for others to improve them. Pieces of work which presumably
must have been ongoing for some days.

I know doing stuff in public is time-consuming as you have to spend a lot
of time chasing reviews, but collaboration is essential as it ensures that
changes meet the needs of a broader community than one single vendor.
Avoiding that is exclusionary and unhealthy for a project.

If the Databricks products have some problem resolving user:key secrets in
paths in the virtual file system, that would be good to know, especially
the what and the why, as others may encounter it too. At the very least,
others should know what to do to avoid getting into the same situation.

If you want more nimble development, well, closed source gives you that.
Switching to commit-then-review on specific ASF repos is also allowed,
despite the inherent risks. We use it for some of the Hadoop release
packaging/testing, for rapid iteration of release process automation and
validation code.

Anyway, the patch has been reverted and discussions are now ongoing, as
they should have been from the outset.

Steve


On Wed, 24 Jul 2024 at 01:29, Hyukjin Kwon <gurwls...@apache.org> wrote:

> There is always a running session. I replied in the PR.
>
> On Tue, 23 Jul 2024 at 23:32, Dongjoon Hyun <dongj...@apache.org> wrote:
>
>> I'm bumping this thread because the overhead has already bitten us.
>> Here is a commit merged 3 hours ago.
>>
>> https://github.com/apache/spark/pull/47453
>> [SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in
>> spark ML reader/writer
>>
>> In short, unlike the original PR's claims, this commit starts to create
>> `SparkSession` in this layer. Although I understand why Hyukjin and
>> Martin claim that `SparkSession` will be there anyway, this is an
>> architectural change which we need to decide on explicitly, not
>> implicitly.
>>
>> > On 2024/07/13 05:33:32 Hyukjin Kwon wrote:
>> > We actually get the active Spark session, so it doesn't cause
>> > overhead. Also, even if we create one, it is created once, which
>> > should be pretty trivial overhead.
>>
>> If this architectural change is inevitable and needs to happen in
>> Apache Spark 4.0.0, can we have a dev document about this? If there is
>> no proper place, we can simply add it to the ML migration guide.
>>
>> Dongjoon.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
