I was looking at this email trail and the original issue raised by Martin Grund. I too agree that mistakes can and do happen.
On my part, kudos to Martin for raising the issue, and to @Hyukjin Kwon <gurwls...@apache.org> for the quick action that helped avoid potential delays. Thanks both.

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy>, Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom
view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Sat, 13 Jul 2024 at 23:14, Hyukjin Kwon <gurwls...@apache.org> wrote:

> 👍
>
> On Sun, Jul 14, 2024 at 1:07 AM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> Thank you :)
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> On Sat, Jul 13, 2024 at 1:37 AM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> Reverted, and opened a new one:
>>> https://github.com/apache/spark/pull/47341
>>>
>>> On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>
>>>> Yeah, that's fine. I'll revert and open a fresh PR including my own
>>>> followup when I get back home later today.
>>>>
>>>> On Sat, Jul 13, 2024 at 3:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>
>>>>> Even if the change is reasonable (and I can see arguments both ways),
>>>>> it's important that we follow the process we agreed on.
>>>>> Merging a PR without discussion* in ~2 hours from the initial
>>>>> proposal is not enough time to reach a lazy consensus. If it were a
>>>>> small bug fix I could understand, but this was a non-trivial change.
>>>>>
>>>>> * It was approved by another committer, but without any discussion, and
>>>>> the approver & code author work for the same employer mentioned as the
>>>>> justification for the change.
>>>>>
>>>>> On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>>>
>>>>>> I think we should not have mentioned a specific vendor there. The
>>>>>> change also shouldn't repartition; we should create a single partition.
>>>>>>
>>>>>> But in general, leveraging the Catalyst optimizer and SQL engine there
>>>>>> is a good idea, as we can leverage all the optimizations there. For
>>>>>> example, it will use UTF-8 encoding instead of plain string ser/de. We
>>>>>> made similar changes in JSON and CSV schema inference (it was an RDD
>>>>>> before).
>>>>>>
>>>>>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> My bad, I meant to say I believe the provided justification is
>>>>>>> inappropriate.
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>>
>>>>>>>> So looking at the PR, it does not appear to be removing any RDD APIs,
>>>>>>>> but the justification provided for changing the ML backend to use the
>>>>>>>> DataFrame APIs is indeed concerning.
>>>>>>>>
>>>>>>>> This PR appears to have been merged without proper review (or
>>>>>>>> providing an opportunity for review).
>>>>>>>> I’d like to remind people of the expectations we decided on
>>>>>>>> together: https://spark.apache.org/committers.html
>>>>>>>>
>>>>>>>> I believe the provided justification for the change and would ask
>>>>>>>> that we revert this PR so that a proper discussion can take place.
>>>>>>>>
>>>>>>>> “
>>>>>>>> In databricks runtime, RDD read / write API has some issue for
>>>>>>>> certain storage types that requires the account key, but Dataframe
>>>>>>>> read / write API works.
>>>>>>>> “
>>>>>>>>
>>>>>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>>>>>>>> <mar...@databricks.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> I took a quick look at the PR and would like to understand your
>>>>>>>>> concern better about:
>>>>>>>>>
>>>>>>>>> > SparkSession is heavier than SparkContext
>>>>>>>>>
>>>>>>>>> It looks like the PR is using the active SparkSession, not creating
>>>>>>>>> a new one, etc. I would highly appreciate it if you could help me
>>>>>>>>> understand this situation better.
>>>>>>>>>
>>>>>>>>> Thanks a lot!
>>>>>>>>>
>>>>>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, All.
>>>>>>>>>>
>>>>>>>>>> Apache Spark's RDD API has played an essential and invaluable role
>>>>>>>>>> from the beginning, and it will continue to, even if it's not
>>>>>>>>>> supported by Spark Connect.
>>>>>>>>>>
>>>>>>>>>> I have a concern about a recent activity which replaces RDD with
>>>>>>>>>> SparkSession blindly.
>>>>>>>>>> For instance:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/pull/47328
>>>>>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>>>>>>> Dataframe read / write API
>>>>>>>>>>
>>>>>>>>>> This PR doesn't look proper to me in two ways:
>>>>>>>>>> - SparkSession is heavier than SparkContext
>>>>>>>>>> - According to the following PR description, the background is
>>>>>>>>>> also hidden from the community.
>>>>>>>>>>
>>>>>>>>>> > # Why are the changes needed?
>>>>>>>>>> > In databricks runtime, RDD read / write API has some issue
>>>>>>>>>> for certain storage types
>>>>>>>>>> > that requires the account key, but Dataframe read / write API
>>>>>>>>>> works.
>>>>>>>>>>
>>>>>>>>>> In addition, we don't know if this PR fixes the mentioned unknown
>>>>>>>>>> storage's issue or not, because it's not testable in the community
>>>>>>>>>> test coverage.
>>>>>>>>>>
>>>>>>>>>> I'm wondering if the Apache Spark community aims to move away from
>>>>>>>>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given
>>>>>>>>>> that `Spark Connect` is not even GA in the community?
>>>>>>>>>>
>>>>>>>>>> Dongjoon.