Yea, +1 on Jungtaek's suggestion; having the same strict policy for adding new APIs sounds good.
> When we make API changes (e.g., adding new APIs or changing existing
> APIs), we should regularly publish them in the dev list. I am willing to
> lead this effort, work with my colleagues to summarize all the merged
> commits [especially the API changes], and then send the *bi-weekly
> digest* to the dev list.

This digest looks very helpful for the community, thanks, Xiao!

Bests,
Takeshi

On Sun, Mar 8, 2020 at 12:05 PM Xiao Li <gatorsm...@gmail.com> wrote:

> I want to thank *Ruifeng Zheng* publicly for his work listing all the
> signature differences in Core, SQL and Hive introduced in this upcoming
> release. For details, please read the files attached to SPARK-30982
> <https://issues.apache.org/jira/browse/SPARK-30982>. I went over these
> files and submitted the following PRs to add back the Spark SQL APIs
> whose maintenance costs are low, based on my own experience in Spark SQL
> development:
>
>    - https://github.com/apache/spark/pull/27821
>       - functions.toDegrees/toRadians
>       - functions.approxCountDistinct
>       - functions.monotonicallyIncreasingId
>       - Column.!==
>       - Dataset.explode
>       - Dataset.registerTempTable
>       - SQLContext.getOrCreate, setActive, clearActive, constructors
>    - https://github.com/apache/spark/pull/27815
>       - HiveContext
>       - createExternalTable APIs
>    - https://github.com/apache/spark/pull/27839
>       - SQLContext.applySchema
>       - SQLContext.parquetFile
>       - SQLContext.jsonFile
>       - SQLContext.jsonRDD
>       - SQLContext.load
>       - SQLContext.jdbc
>
> If you think these APIs should not be added back, let me know and we can
> discuss the items further. In general, I think we should provide more
> evidence and discuss these removals publicly when we drop such APIs in
> the first place.
>
> +1 on Jungtaek's comments. When we make API changes (e.g., adding new
> APIs or changing existing APIs), we should regularly publish them in the
> dev list. I am willing to lead this effort, work with my colleagues to
> summarize all the merged commits [especially the API changes], and then
> send the *bi-weekly digest* to the dev list. If you are willing to join
> this working group and help build these digests, feel free to send me a
> note [lix...@databricks.com].
>
> Cheers,
>
> Xiao
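For readers skimming the thread, here is a minimal sketch of how a few of the restored APIs listed above relate to their current replacements. It is only an illustration: it assumes a local SparkSession and a Spark version in which the deprecated names exist (2.4.x, or a 3.0 build with the add-back PRs merged), and the pairings follow the deprecation notes in the Scala API docs.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object RestoredApiSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("restored-api-sketch")
          .getOrCreate()
        import spark.implicits._

        val df = Seq(0.0, math.Pi).toDF("angle")

        // Deprecated-but-restored names on the left, current equivalents on the right.
        df.select(toDegrees($"angle"), degrees($"angle")).show()
        df.select(monotonicallyIncreasingId(), monotonically_increasing_id()).show()

        // Column.!== versus the current =!= operator.
        df.filter($"angle" !== 0.0).show()
        df.filter($"angle" =!= 0.0).show()

        // Dataset.registerTempTable versus createOrReplaceTempView.
        df.registerTempTable("angles_deprecated")
        df.createOrReplaceTempView("angles_current")

        spark.stop()
      }
    }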
> On Sat, Mar 7, 2020 at 4:50 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> +1 for Sean as well.
>>
>> Moreover, as I said on the previous thread, if we want to be strict
>> about retaining public APIs, what we really need to do along with this
>> is to have a similar or stricter policy for adding public APIs. If we
>> don't apply the policy symmetrically, the problem gets worse: it's still
>> not that hard to add a public API (it only requires a normal review),
>> but once the API is added and released it's going to be really hard to
>> remove it.
>>
>> If we consider adding and deprecating/removing public APIs as "critical"
>> changes for the project, IMHO it would give better visibility and more
>> open discussion if we made them go through the dev@ mailing list instead
>> of directly filing a PR. As there are so many PRs being submitted, it's
>> nearly impossible to look into all of them - it would require us to
>> "watch" the repo and receive tons of mail. Compared to the volume of
>> GitHub PRs, the dev@ mailing list is not that crowded, so there is less
>> chance of missing critical changes, and they wouldn't be decided quickly
>> by only a couple of committers.
>>
>> These suggestions would slow down development - which may make us
>> realize we want to "classify/mark" user-facing public APIs separately
>> from the rest (things merely exposed as public) and only apply all of
>> these policies to the former. For the latter we don't need to guarantee
>> anything.
>>
>> On Sun, Mar 8, 2020 at 4:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> +1 for Sean's concerns and questions.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Mar 6, 2020 at 3:14 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> This thread established some good general principles, illustrated by a
>>>> few good examples. It didn't draw specific conclusions about what to
>>>> add back, which is why it wasn't at all controversial. What it means
>>>> in specific cases is where there may be disagreement, and that harder
>>>> question hasn't been addressed.
>>>>
>>>> The reverts I have seen so far seemed like the obvious ones, but yes,
>>>> there are several more going on now, some pretty broad. I am not even
>>>> sure what all of them are. In addition to the ones below,
>>>> https://github.com/apache/spark/pull/27839. Would it be too much
>>>> overhead to post to this thread any changes that one believes are
>>>> endorsed by these principles, and perhaps a stricter interpretation of
>>>> them, now? It's important enough that we should get any data points or
>>>> input, and now. (We're obviously not going to debate each one.) A
>>>> draft PR, or several, actually sounds like a good vehicle for that --
>>>> as long as people know about them!
>>>>
>>>> Also, is there any usage data available to share? Many arguments turn
>>>> on 'commonly used', but can we know that more concretely?
>>>>
>>>> Otherwise I think we'll back into implementing personal
>>>> interpretations of general principles, which is arguably the issue in
>>>> the first place, even when everyone believes in good faith in the same
>>>> principles.
>>>>
>>>> On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> Recently, reverting PRs seems to be starting to spread like the
>>>>> *well-known* virus. Can we finalize this first before making
>>>>> unofficial personal decisions? Technically, this thread was not a
>>>>> vote and our website doesn't have a clear policy yet.
>>>>>
>>>>> https://github.com/apache/spark/pull/27821
>>>>> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>>>>> ==> This technically reverts most of SPARK-25908.
>>>>>
>>>>> https://github.com/apache/spark/pull/27835
>>>>> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the
>>>>> operands"
>>>>>
>>>>> https://github.com/apache/spark/pull/27834
>>>>> Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
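As an aside on the two behavior-change reverts listed just above, here is a minimal sketch of the observable difference they concern, assuming a local SparkSession. The result types and defaults described in the comments are my reading of the JIRA titles (SPARK-25457: `div` returning the operands' type instead of always LONG; SPARK-24640: `size(NULL)` returning NULL instead of the legacy -1), so treat them as assumptions rather than a definitive description of either branch.

    import org.apache.spark.sql.SparkSession

    object RevertedBehaviorSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("reverted-behavior-sketch")
          .getOrCreate()

        // SPARK-25457 (IntegralDivide): the reverted change made `div` return
        // the operands' type (e.g. INT for INT inputs) instead of always LONG.
        spark.sql("SELECT 7 div 2 AS q").printSchema()

        // SPARK-24640: the reverted change made size(NULL) return NULL by
        // default instead of the legacy -1.
        spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>)) AS s").show()

        spark.stop()
      }
    }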
>>>>> On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> There is an on-going PR from Xiao referencing this email.
>>>>>>
>>>>>> https://github.com/apache/spark/pull/27821
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>> >> 1. Could you estimate how many revert commits are required in
>>>>>>> `branch-3.0` for the new rubric?
>>>>>>>
>>>>>>> Fair question about what actual change this implies for 3.0. So far
>>>>>>> it seems like some targeted, quite reasonable reverts. I don't
>>>>>>> think anyone's suggesting reverting loads of changes.
>>>>>>>
>>>>>>> >> 2. Are you going to revert all removed test cases for the
>>>>>>> deprecated ones?
>>>>>>> > This is a good point; making sure we keep the tests as well is
>>>>>>> important (worse than removing a deprecated API is shipping it
>>>>>>> broken).
>>>>>>>
>>>>>>> (I'd say, yes of course! which seems consistent with what is
>>>>>>> happening now)
>>>>>>>
>>>>>>> >> 3. Does it make any delay for the Apache Spark 3.0.0 release?
>>>>>>> >> (I believe it was previously scheduled for June, before
>>>>>>> Spark Summit 2020)
>>>>>>> >
>>>>>>> > I think if we need to delay to make a better release this is ok,
>>>>>>> especially given our current preview releases being available to
>>>>>>> gather community feedback.
>>>>>>>
>>>>>>> Of course these things block 3.0 -- all the more reason to keep it
>>>>>>> specific and targeted -- but nothing so far seems inconsistent with
>>>>>>> finishing in a month or two.
>>>>>>>
>>>>>>> >> Although there was a discussion already, I want to make sure
>>>>>>> about the following tough parts.
>>>>>>> >> 4. We are not going to add the Scala 2.11 API back, right?
>>>>>>> > I hope not.
>>>>>>> >>
>>>>>>> >> 5. We are not going to support Python 2.x in Apache Spark
>>>>>>> 3.1+, right?
>>>>>>> > I think doing that would be bad; it has already reached end of
>>>>>>> life elsewhere.
>>>>>>>
>>>>>>> Yeah, this is an important subtext -- the valuable principles here
>>>>>>> could be interpreted in many different ways depending on how much
>>>>>>> you weigh the value of keeping APIs for compatibility vs. the value
>>>>>>> of simplifying Spark and pushing users to newer APIs more forcibly.
>>>>>>> They're all judgment calls, based on necessarily limited data about
>>>>>>> the universe of users. We can only go on rare direct user feedback,
>>>>>>> on feedback perhaps from vendors as proxies for a subset of users,
>>>>>>> and the general good-faith judgment of committers who have lived
>>>>>>> Spark for years.
>>>>>>>
>>>>>>> My specific interpretation is that the standard is (correctly)
>>>>>>> tightening going forward, and retroactively a bit for 3.0. But I do
>>>>>>> not think anyone is advocating for the logical extreme of, for
>>>>>>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>>>>>>> that falls out readily from the rubric here: maintaining 2.11
>>>>>>> compatibility is really quite painful if you ever support 2.13 too,
>>>>>>> for example.

-- 
---
Takeshi Yamamuro
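On the point above about keeping the tests for whatever gets added back, here is a minimal sketch of what such a check could look like. The suite and names are illustrative (not taken from Spark's actual test code), and it assumes ScalaTest and a local session; the only claim is that a restored deprecated alias should behave exactly like its replacement.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{degrees, toDegrees}
    import org.scalatest.funsuite.AnyFunSuite

    class RestoredApiSuite extends AnyFunSuite {
      test("restored toDegrees matches degrees") {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("restored-api-suite")
          .getOrCreate()
        import spark.implicits._
        try {
          val df = Seq(0.0, math.Pi / 2, math.Pi).toDF("angle")
          // The deprecated alias and its replacement should give identical results.
          val rows = df.select(toDegrees($"angle"), degrees($"angle")).collect()
          assert(rows.forall(r => r.getDouble(0) == r.getDouble(1)))
        } finally {
          spark.stop()
        }
      }
    }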