Reverted, and opened a new one: https://github.com/apache/spark/pull/47341.
On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon <gurwls...@apache.org> wrote:

> Yeah, that's fine. I'll revert and open a fresh PR, including my own
> followup, when I get back home later today.
>
> On Sat, Jul 13, 2024 at 3:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> Even if the change is reasonable (and I can see arguments both ways),
>> it's important that we follow the process we agreed on. Merging a PR
>> without discussion* in ~2 hours from the initial proposal is not enough
>> time to reach a lazy consensus. If it were a small bug fix I could
>> understand, but this was a non-trivial change.
>>
>> * It was approved by another committer, but without any discussion, and
>> the approver and code author work for the same employer mentioned as the
>> justification for the change.
>>
>> On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> I think we should not have mentioned a specific vendor there. The change
>>> also shouldn't repartition; we should create a single partition instead.
>>>
>>> But in general, leveraging the Catalyst optimizer and SQL engine there
>>> is a good idea, as we can leverage all the optimizations there. For
>>> example, it will use UTF-8 encoding instead of a plain string ser/de. We
>>> made similar changes in JSON and CSV schema inference (it was an RDD
>>> before).
>>>
>>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> My bad, I meant to say: I believe the provided justification is
>>>> inappropriate.
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>
>>>>> So, looking at the PR, it does not appear to be removing any RDD APIs,
>>>>> but the justification provided for changing the ML backend to use the
>>>>> DataFrame APIs is indeed concerning.
>>>>>
>>>>> This PR appears to have been merged without proper review (or
>>>>> providing an opportunity for review).
>>>>>
>>>>> I'd like to remind people of the expectations we decided on together:
>>>>> https://spark.apache.org/committers.html
>>>>>
>>>>> I believe the provided justification for the change and would ask that
>>>>> we revert this PR so that a proper discussion can take place.
>>>>>
>>>>> "In databricks runtime, RDD read / write API has some issue for certain
>>>>> storage types that requires the account key, but Dataframe read / write
>>>>> API works."
>>>>>
>>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund <mar...@databricks.com.invalid> wrote:
>>>>>
>>>>>> I took a quick look at the PR and would like to understand your
>>>>>> concern better about:
>>>>>>
>>>>>> > SparkSession is heavier than SparkContext
>>>>>>
>>>>>> It looks like the PR is using the active SparkSession, not creating a
>>>>>> new one, etc. I would highly appreciate it if you could help me
>>>>>> understand this situation better.
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> Apache Spark's RDD API has played an essential and invaluable role
>>>>>>> from the beginning, and it will continue to do so even if it's not
>>>>>>> supported by Spark Connect.
>>>>>>>
>>>>>>> I have a concern about recent activity that replaces RDD usage with
>>>>>>> SparkSession blindly.
>>>>>>>
>>>>>>> For instance,
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/47328
>>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>>>> Dataframe read / write API
>>>>>>>
>>>>>>> This PR doesn't look proper to me in two ways:
>>>>>>> - SparkSession is heavier than SparkContext.
>>>>>>> - According to the following PR description, the background is also
>>>>>>> hidden from the community.
>>>>>>>
>>>>>>> > # Why are the changes needed?
>>>>>>> > In databricks runtime, RDD read / write API has some issue for
>>>>>>> certain storage types
>>>>>>> > that requires the account key, but Dataframe read / write API
>>>>>>> works.
>>>>>>>
>>>>>>> In addition, we don't know whether this PR fixes the mentioned
>>>>>>> unknown storage's issue, because it's not testable in the community
>>>>>>> test coverage.
>>>>>>>
>>>>>>> I'm wondering if the Apache Spark community aims to move away from
>>>>>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given that
>>>>>>> `Spark Connect` is not even GA in the community?
>>>>>>>
>>>>>>> Dongjoon.