I was looking at this email trail and the original issue raised by Martin Grund. I too agree that mistakes can and do happen.
On my part, kudos to Martin for raising the issue, and to @Hyukjin Kwon <gurwls...@apache.org> for the quick action that helped avoid potential delays. Thanks both.

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy>, Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom
view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Sat, 13 Jul 2024 at 23:14, Hyukjin Kwon <gurwls...@apache.org> wrote:

> 👍
>
> On Sun, Jul 14, 2024 at 1:07 AM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> Thank you :)
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>> On Sat, Jul 13, 2024 at 1:37 AM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> Reverted, and opened a new one:
>>> https://github.com/apache/spark/pull/47341
>>>
>>> On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>
>>>> Yeah, that's fine. I'll revert and open a fresh PR including my own
>>>> followup when I get back home later today.
>>>>
>>>> On Sat, Jul 13, 2024 at 3:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>
>>>>> Even if the change is reasonable (and I can see arguments both ways),
>>>>> it's important that we follow the process we agreed on.
>>>>> Merging a PR without discussion* in ~2 hours from the initial
>>>>> proposal is not enough time to reach a lazy consensus. If it were a
>>>>> small bug fix I could understand, but this was a non-trivial change.
>>>>>
>>>>> * It was approved by another committer, but without any discussion, and
>>>>> the approver & code author work for the same employer mentioned as the
>>>>> justification for the change.
>>>>>
>>>>> On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>>>
>>>>>> I think we should not have mentioned a specific vendor there. The
>>>>>> change also shouldn't repartition; we should create a single partition.
>>>>>>
>>>>>> But in general, leveraging the Catalyst optimizer and SQL engine there
>>>>>> is a good idea, as we can leverage all the optimizations there. For
>>>>>> example, it will use UTF-8 encoding instead of plain string ser/de. We
>>>>>> made similar changes in JSON and CSV schema inference (it was an RDD
>>>>>> before).
>>>>>>
>>>>>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>
>>>>>>> My bad, I meant to say I believe the provided justification is
>>>>>>> inappropriate.
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>>>>
>>>>>>>> So looking at the PR, it does not appear to be removing any RDD APIs,
>>>>>>>> but the justification provided for changing the ML backend to use the
>>>>>>>> DataFrame APIs is indeed concerning.
>>>>>>>>
>>>>>>>> This PR appears to have been merged without proper review (or
>>>>>>>> providing an opportunity for review).
>>>>>>>> I’d like to remind people of the expectations we decided on
>>>>>>>> together: https://spark.apache.org/committers.html
>>>>>>>>
>>>>>>>> I believe the provided justification for the change and would ask
>>>>>>>> that we revert this PR so that a proper discussion can take place.
>>>>>>>>
>>>>>>>> “
>>>>>>>> In databricks runtime, RDD read / write API has some issue for
>>>>>>>> certain storage types that requires the account key, but Dataframe
>>>>>>>> read / write API works.
>>>>>>>> “
>>>>>>>>
>>>>>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund
>>>>>>>> <mar...@databricks.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> I took a quick look at the PR and would like to understand your
>>>>>>>>> concern better about:
>>>>>>>>>
>>>>>>>>> > SparkSession is heavier than SparkContext
>>>>>>>>>
>>>>>>>>> It looks like the PR is using the active SparkSession, not creating
>>>>>>>>> a new one, etc. I would highly appreciate it if you could help me
>>>>>>>>> understand this situation better.
>>>>>>>>>
>>>>>>>>> Thanks a lot!
>>>>>>>>>
>>>>>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, All.
>>>>>>>>>>
>>>>>>>>>> Apache Spark's RDD API has played an essential and invaluable role
>>>>>>>>>> from the beginning, and it will continue to, even if it's not
>>>>>>>>>> supported by Spark Connect.
>>>>>>>>>>
>>>>>>>>>> I have a concern about a recent activity which replaces RDD with
>>>>>>>>>> SparkSession blindly.
>>>>>>>>>> For instance:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/pull/47328
>>>>>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>>>>>>> Dataframe read / write API
>>>>>>>>>>
>>>>>>>>>> This PR doesn't look proper to me in two ways:
>>>>>>>>>> - SparkSession is heavier than SparkContext
>>>>>>>>>> - According to the following PR description, the background is
>>>>>>>>>> also hidden from the community.
>>>>>>>>>>
>>>>>>>>>> > # Why are the changes needed?
>>>>>>>>>> > In databricks runtime, RDD read / write API has some issue
>>>>>>>>>> for certain storage types
>>>>>>>>>> > that requires the account key, but Dataframe read / write API
>>>>>>>>>> works.
>>>>>>>>>>
>>>>>>>>>> In addition, we don't know if this PR fixes the mentioned unknown
>>>>>>>>>> storage's issue or not, because it's not testable in the community
>>>>>>>>>> test coverage.
>>>>>>>>>>
>>>>>>>>>> I'm wondering if the Apache Spark community aims to move away from
>>>>>>>>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given
>>>>>>>>>> that `Spark Connect` is not even GA in the community?
>>>>>>>>>>
>>>>>>>>>> Dongjoon.