Reverted, and opened a new one: https://github.com/apache/spark/pull/47341.
On Sat, 13 Jul 2024 at 15:40, Hyukjin Kwon <gurwls...@apache.org> wrote:

> Yeah, that's fine. I'll revert and open a fresh PR, including my own
> followup, when I get back home later today.
>
> On Sat, Jul 13, 2024 at 3:08 PM Holden Karau <holden.ka...@gmail.com> wrote:
>
>> Even if the change is reasonable (and I can see arguments both ways),
>> it's important that we follow the process we agreed on. Merging a PR
>> without discussion* in ~2 hours from the initial proposal is not enough
>> time to reach a lazy consensus. If it were a small bug fix I could
>> understand, but this was a non-trivial change.
>>
>> * It was approved by another committer, but without any discussion, and
>> the approver and code author work for the same employer mentioned as the
>> justification for the change.
>>
>> On Fri, Jul 12, 2024 at 6:42 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> I think we should not have mentioned a specific vendor there. The change
>>> also shouldn't repartition; we should create a single partition instead.
>>>
>>> But in general, leveraging the Catalyst optimizer and SQL engine there
>>> is a good idea, as we can leverage all the optimizations there. For
>>> example, it will use UTF-8 encoding instead of a plain string ser/de. We
>>> made similar changes in JSON and CSV schema inference (it was an RDD
>>> before).
>>>
>>> On Sat, Jul 13, 2024 at 10:33 AM Holden Karau <holden.ka...@gmail.com> wrote:
>>>
>>>> My bad, I meant to say: I believe the provided justification is
>>>> inappropriate.
>>>>
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>> On Fri, Jul 12, 2024 at 5:14 PM Holden Karau <holden.ka...@gmail.com> wrote:
>>>>
>>>>> So, looking at the PR, it does not appear to be removing any RDD APIs,
>>>>> but the justification provided for changing the ML backend to use the
>>>>> DataFrame APIs is indeed concerning.
>>>>>
>>>>> This PR appears to have been merged without proper review (or
>>>>> providing an opportunity for review).
>>>>>
>>>>> I'd like to remind people of the expectations we decided on together:
>>>>> https://spark.apache.org/committers.html
>>>>>
>>>>> I believe the provided justification for the change and would ask that
>>>>> we revert this PR so that a proper discussion can take place.
>>>>>
>>>>> "In databricks runtime, RDD read / write API has some issue for certain
>>>>> storage types that requires the account key, but Dataframe read / write
>>>>> API works."
>>>>>
>>>>> On Fri, Jul 12, 2024 at 1:02 PM Martin Grund <mar...@databricks.com.invalid> wrote:
>>>>>
>>>>>> I took a quick look at the PR and would like to understand your
>>>>>> concern better about:
>>>>>>
>>>>>> > SparkSession is heavier than SparkContext
>>>>>>
>>>>>> It looks like the PR is using the active SparkSession, not creating a
>>>>>> new one, etc. I would highly appreciate it if you could help me
>>>>>> understand this situation better.
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> On Fri, Jul 12, 2024 at 8:52 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> Apache Spark's RDD API has played an essential and invaluable role
>>>>>>> from the beginning, and it will continue to do so even if it's not
>>>>>>> supported by Spark Connect.
>>>>>>>
>>>>>>> I have a concern about recent activity that replaces RDD usage with
>>>>>>> SparkSession blindly.
>>>>>>>
>>>>>>> For instance,
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/47328
>>>>>>> [SPARK-48883][ML][R] Replace RDD read / write API invocation with
>>>>>>> Dataframe read / write API
>>>>>>>
>>>>>>> This PR doesn't look proper to me in two ways:
>>>>>>> - SparkSession is heavier than SparkContext.
>>>>>>> - According to the following PR description, the background is also
>>>>>>> hidden from the community.
>>>>>>>
>>>>>>> > # Why are the changes needed?
>>>>>>> > In databricks runtime, RDD read / write API has some issue for
>>>>>>> certain storage types
>>>>>>> > that requires the account key, but Dataframe read / write API
>>>>>>> works.
>>>>>>>
>>>>>>> In addition, we don't know whether this PR fixes the mentioned
>>>>>>> unknown storage's issue, because it's not testable in the community
>>>>>>> test coverage.
>>>>>>>
>>>>>>> I'm wondering if the Apache Spark community aims to move away from
>>>>>>> RDD usage in favor of `Spark Connect`. Isn't it too early, given that
>>>>>>> `Spark Connect` is not even GA in the community?
>>>>>>>
>>>>>>> Dongjoon.