I want to thank *Ruifeng Zheng* publicly for his work listing all the signature differences in Core, SQL and Hive that we introduced in this upcoming release. For details, please read the files attached to SPARK-30982 <https://issues.apache.org/jira/browse/SPARK-30982>. I went over these files and submitted the following PRs to add back the SparkSQL APIs whose maintenance costs are low, based on my own experience in SparkSQL development:
- https://github.com/apache/spark/pull/27821
  - functions.toDegrees/toRadians
  - functions.approxCountDistinct
  - functions.monotonicallyIncreasingId
  - Column.!==
  - Dataset.explode
  - Dataset.registerTempTable
  - SQLContext.getOrCreate, setActive, clearActive, constructors
- https://github.com/apache/spark/pull/27815
  - HiveContext
  - createExternalTable APIs
- https://github.com/apache/spark/pull/27839
  - SQLContext.applySchema
  - SQLContext.parquetFile
  - SQLContext.jsonFile
  - SQLContext.jsonRDD
  - SQLContext.load
  - SQLContext.jdbc
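For those less familiar with these names, here is a rough, hypothetical sketch (not taken from any of the PRs above; it assumes the add-back PRs land, so the deprecated names compile again with warnings, and uses a local SparkSession) of how a few of the restored APIs map to their current equivalents:

```scala
// Hypothetical sketch: deprecated names vs. their current equivalents.
// Not taken from the PRs above; object/app/table names are illustrative only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DeprecatedApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deprecated-api-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(0.0, math.Pi / 2, math.Pi).toDF("rad")

    // Deprecated names kept for source compatibility (compile with warnings) ...
    df.select(toDegrees($"rad"), monotonicallyIncreasingId()).show()
    df.registerTempTable("angles")

    // ... and the equivalents users are steered towards.
    df.select(degrees($"rad"), monotonically_increasing_id()).show()
    df.createOrReplaceTempView("angles")

    // Column.!== is the deprecated spelling of the "not equal" test; Column.=!= is current.
    df.filter($"rad" =!= 0.0).show()

    spark.stop()
  }
}
```

The SQLContext readers in the last PR (parquetFile, jsonFile, load, jdbc) similarly map roughly onto the spark.read.parquet/json/load/jdbc entry points.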
If you think these APIs should not be added back, let me know and we can discuss the items further. In general, I think we should have provided more evidence and discussed these APIs publicly before dropping them in the first place.

+1 on Jungtaek's comments. When we make API changes (e.g., adding new APIs or changing existing ones), we should regularly publish them on the dev list. I am willing to lead this effort, work with my colleagues to summarize all the merged commits [especially the API changes], and then send a *bi-weekly digest* to the dev list. If you are willing to join this working group and help build these digests, feel free to send me a note [lix...@databricks.com].

Cheers,

Xiao

Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote on Sat, Mar 7, 2020 at 4:50 PM:

> +1 for Sean as well.
>
> Moreover, as I mentioned on the previous thread, if we want to be strict
> about retaining public APIs, what we really need along with this is a
> similar or stricter policy for adding public APIs. If we don't apply the
> policy symmetrically, the problem will only get worse: it is still not
> that hard to add a public API (it only requires a normal review), but once
> the API is added and released, it becomes really hard to remove it.
>
> If we consider adding and deprecating/removing public APIs to be
> "critical" changes for the project, IMHO it would give better visibility
> and more open discussion if they went through the dev@ mailing list
> instead of directly filing a PR. With so many PRs being submitted, it is
> nearly impossible to look into all of them - it would require us to
> "watch" the repo and receive tons of mail. Compared to the volume of
> GitHub PRs, the dev@ mailing list is not that crowded, so there is less
> chance of missing critical changes, and they would not be decided quickly
> by only a couple of committers.
>
> These suggestions would slow down development - which may make us realize
> we want to "classify/mark" user-facing public APIs separately from APIs
> that are merely exposed as public, and only apply these policies to the
> former. For the latter we don't need to guarantee anything.
>
> On Sun, Mar 8, 2020 at 4:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> +1 for Sean's concerns and questions.
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Mar 6, 2020 at 3:14 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> This thread established some good general principles, illustrated by a
>>> few good examples. It didn't draw specific conclusions about what to
>>> add back, which is why it wasn't at all controversial. What it means in
>>> specific cases is where there may be disagreement, and that harder
>>> question hasn't been addressed.
>>>
>>> The reverts I have seen so far seemed like the obvious ones, but yes,
>>> there are several more going on now, some pretty broad. I am not even
>>> sure what all of them are. In addition to those below, there is
>>> https://github.com/apache/spark/pull/27839. Would it be too much
>>> overhead to post to this thread any changes that one believes are
>>> endorsed by these principles, and perhaps a stricter interpretation of
>>> them, now? It's important enough that we should get any data points or
>>> input, and now. (We're obviously not going to debate each one.) A draft
>>> PR, or several, actually sounds like a good vehicle for that -- as long
>>> as people know about them!
>>>
>>> Also, is there any usage data available to share? Many arguments turn
>>> on "commonly used", but can we know that more concretely?
>>>
>>> Otherwise I think we'll back into implementing personal interpretations
>>> of general principles, which is arguably the issue in the first place,
>>> even when everyone believes in good faith in the same principles.
>>>
>>> On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> Recently, reverting PRs seems to be starting to spread like the
>>>> *well-known* virus. Can we finalize this first, before making
>>>> unofficial personal decisions? Technically, this thread was not a
>>>> vote, and our website doesn't have a clear policy yet.
>>>>
>>>> https://github.com/apache/spark/pull/27821
>>>> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>>>> ==> This technically reverts most of SPARK-25908.
>>>>
>>>> https://github.com/apache/spark/pull/27835
>>>> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the
>>>> operands"
>>>>
>>>> https://github.com/apache/spark/pull/27834
>>>> Revert "[SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default"
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> There is an on-going PR from Xiao referencing this email:
>>>>>
>>>>> https://github.com/apache/spark/pull/27821
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>> wrote:
>>>>>> >> 1. Could you estimate how many revert commits are required in
>>>>>> `branch-3.0` for the new rubric?
>>>>>>
>>>>>> Fair question about what actual change this implies for 3.0. So far
>>>>>> it seems like some targeted, quite reasonable reverts. I don't think
>>>>>> anyone's suggesting reverting loads of changes.
>>>>>>
>>>>>> >> 2. Are you going to revert all removed test cases for the
>>>>>> deprecated ones?
>>>>>> > This is a good point; making sure we keep the tests as well is
>>>>>> important (worse than removing a deprecated API is shipping it
>>>>>> broken).
>>>>>>
>>>>>> (I'd say, yes of course! which seems consistent with what is
>>>>>> happening now)
>>>>>>
>>>>>> >> 3. Does it cause any delay to the Apache Spark 3.0.0 release?
>>>>>> >> (I believe it was previously scheduled for June, before Spark
>>>>>> Summit 2020)
>>>>>> >
>>>>>> > I think it is OK if we need to delay in order to make a better
>>>>>> release, especially given our current preview releases are available
>>>>>> to gather community feedback.
>>>>>>
>>>>>> Of course these things block 3.0 -- all the more reason to keep them
>>>>>> specific and targeted -- but nothing so far seems inconsistent with
>>>>>> finishing in a month or two.
>>>>>>
>>>>>> >> Although there was a discussion already, I want to make sure of
>>>>>> the following tough parts.
>>>>>> >> 4. We are not going to add Scala 2.11 APIs, right?
>>>>>> > I hope not.
>>>>>> >>
>>>>>> >> 5. We are not going to support Python 2.x in Apache Spark 3.1+,
>>>>>> right?
>>>>>> > I think doing that would be bad; it has already reached end of
>>>>>> life elsewhere.
>>>>>>
>>>>>> Yeah, this is an important subtext -- the valuable principles here
>>>>>> could be interpreted in many different ways, depending on how much
>>>>>> you weigh the value of keeping APIs for compatibility against the
>>>>>> value of simplifying Spark and pushing users to newer APIs more
>>>>>> forcibly. They're all judgment calls, based on necessarily limited
>>>>>> data about the universe of users. We can only go on rare direct user
>>>>>> feedback, on feedback perhaps from vendors as proxies for a subset
>>>>>> of users, and the general good-faith judgment of committers who have
>>>>>> lived Spark for years.
>>>>>>
>>>>>> My specific interpretation is that the standard is (correctly)
>>>>>> tightening going forward, and retroactively a bit for 3.0. But I do
>>>>>> not think anyone is advocating for the logical extreme of, for
>>>>>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>>>>>> that falls out readily from the rubric here: maintaining 2.11
>>>>>> compatibility is really quite painful if you ever support 2.13 too,
>>>>>> for example.
>>>>>