Yea, +1 on Jungtaek's suggestion; having the same strict policy for adding new APIs sounds good.
> When we make API changes (e.g., adding new APIs or changing existing
> APIs), we should regularly publish them in the dev list. I am willing to
> lead this effort, work with my colleagues to summarize all the merged
> commits [especially the API changes], and then send the *bi-weekly
> digest* to the dev list.

This digest looks very helpful for the community, thanks, Xiao!

Bests,
Takeshi

On Sun, Mar 8, 2020 at 12:05 PM Xiao Li <gatorsm...@gmail.com> wrote:

> I want to thank *Ruifeng Zheng* publicly for his work listing all the
> signature differences in Core, SQL and Hive introduced in this upcoming
> release. For details, please read the files attached to SPARK-30982
> <https://issues.apache.org/jira/browse/SPARK-30982>. I went over these
> files and submitted the following PRs to add back the Spark SQL APIs
> whose maintenance costs are low, based on my own experience in Spark SQL
> development:
>
>    - https://github.com/apache/spark/pull/27821
>       - functions.toDegrees/toRadians
>       - functions.approxCountDistinct
>       - functions.monotonicallyIncreasingId
>       - Column.!==
>       - Dataset.explode
>       - Dataset.registerTempTable
>       - SQLContext.getOrCreate, setActive, clearActive, constructors
>    - https://github.com/apache/spark/pull/27815
>       - HiveContext
>       - createExternalTable APIs
>    - https://github.com/apache/spark/pull/27839
>       - SQLContext.applySchema
>       - SQLContext.parquetFile
>       - SQLContext.jsonFile
>       - SQLContext.jsonRDD
>       - SQLContext.load
>       - SQLContext.jdbc
>
> If you think these APIs should not be added back, let me know and we can
> discuss the items further. In general, I think we should provide more
> evidence and discuss these removals publicly when we drop such APIs in
> the first place.
>
> +1 on Jungtaek's comments. When we make API changes (e.g., adding new
> APIs or changing existing APIs), we should regularly publish them in the
> dev list. I am willing to lead this effort, work with my colleagues to
> summarize all the merged commits [especially the API changes], and then
> send the *bi-weekly digest* to the dev list. If you are willing to join
> this working group and help build these digests, feel free to send me a
> note [lix...@databricks.com].
>
> Cheers,
>
> Xiao
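For readers skimming the thread, here is a minimal sketch of how a few of the restored APIs listed above relate to their current replacements. It is only an illustration: it assumes a local SparkSession and a Spark version in which the deprecated names exist (2.4.x, or a 3.0 build with the add-back PRs merged), and the pairings follow the deprecation notes in the Scala API docs.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object RestoredApiSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("restored-api-sketch")
          .getOrCreate()
        import spark.implicits._

        val df = Seq(0.0, math.Pi).toDF("angle")

        // Deprecated-but-restored names on the left, current equivalents on the right.
        df.select(toDegrees($"angle"), degrees($"angle")).show()
        df.select(monotonicallyIncreasingId(), monotonically_increasing_id()).show()

        // Column.!== versus the current =!= operator.
        df.filter($"angle" !== 0.0).show()
        df.filter($"angle" =!= 0.0).show()

        // Dataset.registerTempTable versus createOrReplaceTempView.
        df.registerTempTable("angles_deprecated")
        df.createOrReplaceTempView("angles_current")

        spark.stop()
      }
    }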
> On Sat, Mar 7, 2020 at 4:50 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> +1 for Sean as well.
>>
>> Moreover, as I said on the previous thread, if we want to be strict
>> about retaining public APIs, what we really need to do along with this
>> is to have a similar or stricter policy for adding public APIs. If we
>> don't apply the policy symmetrically, the problem gets worse: it's still
>> not that hard to add a public API (it only requires a normal review),
>> but once the API is added and released it's going to be really hard to
>> remove it.
>>
>> If we consider adding and deprecating/removing public APIs as "critical"
>> changes for the project, IMHO it would give better visibility and more
>> open discussion if we made them go through the dev@ mailing list instead
>> of directly filing a PR. As there are so many PRs being submitted, it's
>> nearly impossible to look into all of them - it would require us to
>> "watch" the repo and receive tons of mail. Compared to the volume of
>> GitHub PRs, the dev@ mailing list is not that crowded, so there is less
>> chance of missing critical changes, and they wouldn't be decided quickly
>> by only a couple of committers.
>>
>> These suggestions would slow down development - which may make us
>> realize we want to "classify/mark" user-facing public APIs separately
>> from the rest (things merely exposed as public) and only apply all of
>> these policies to the former. For the latter we don't need to guarantee
>> anything.
>>
>> On Sun, Mar 8, 2020 at 4:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>> wrote:
>>
>>> +1 for Sean's concerns and questions.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, Mar 6, 2020 at 3:14 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> This thread established some good general principles, illustrated by a
>>>> few good examples. It didn't draw specific conclusions about what to
>>>> add back, which is why it wasn't at all controversial. What it means
>>>> in specific cases is where there may be disagreement, and that harder
>>>> question hasn't been addressed.
>>>>
>>>> The reverts I have seen so far seemed like the obvious ones, but yes,
>>>> there are several more going on now, some pretty broad. I am not even
>>>> sure what all of them are. In addition to the ones below,
>>>> https://github.com/apache/spark/pull/27839. Would it be too much
>>>> overhead to post to this thread any changes that one believes are
>>>> endorsed by these principles, and perhaps a stricter interpretation of
>>>> them, now? It's important enough that we should get any data points or
>>>> input, and now. (We're obviously not going to debate each one.) A
>>>> draft PR, or several, actually sounds like a good vehicle for that --
>>>> as long as people know about them!
>>>>
>>>> Also, is there any usage data available to share? Many arguments turn
>>>> on 'commonly used', but can we know that more concretely?
>>>>
>>>> Otherwise I think we'll back into implementing personal
>>>> interpretations of general principles, which is arguably the issue in
>>>> the first place, even when everyone believes in good faith in the same
>>>> principles.
>>>>
>>>> On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> Recently, reverting PRs seems to be starting to spread like the
>>>>> *well-known* virus. Can we finalize this first before making
>>>>> unofficial personal decisions? Technically, this thread was not a
>>>>> vote and our website doesn't have a clear policy yet.
>>>>>
>>>>> https://github.com/apache/spark/pull/27821
>>>>> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>>>>> ==> This technically reverts most of SPARK-25908.
>>>>>
>>>>> https://github.com/apache/spark/pull/27835
>>>>> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the
>>>>> operands"
>>>>>
>>>>> https://github.com/apache/spark/pull/27834
>>>>> Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
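As an aside on the two behavior-change reverts listed just above, here is a minimal sketch of the observable difference they concern, assuming a local SparkSession. The result types and defaults described in the comments are my reading of the JIRA titles (SPARK-25457: `div` returning the operands' type instead of always LONG; SPARK-24640: `size(NULL)` returning NULL instead of the legacy -1), so treat them as assumptions rather than a definitive description of either branch.

    import org.apache.spark.sql.SparkSession

    object RevertedBehaviorSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("reverted-behavior-sketch")
          .getOrCreate()

        // SPARK-25457 (IntegralDivide): the reverted change made `div` return
        // the operands' type (e.g. INT for INT inputs) instead of always LONG.
        spark.sql("SELECT 7 div 2 AS q").printSchema()

        // SPARK-24640: the reverted change made size(NULL) return NULL by
        // default instead of the legacy -1.
        spark.sql("SELECT size(CAST(NULL AS ARRAY<INT>)) AS s").show()

        spark.stop()
      }
    }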
>>>>> On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> There is an on-going PR from Xiao referencing this email.
>>>>>>
>>>>>> https://github.com/apache/spark/pull/27821
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>> >> 1. Could you estimate how many revert commits are required in
>>>>>>> `branch-3.0` for the new rubric?
>>>>>>>
>>>>>>> Fair question about what actual change this implies for 3.0. So far
>>>>>>> it seems like some targeted, quite reasonable reverts. I don't
>>>>>>> think anyone's suggesting reverting loads of changes.
>>>>>>>
>>>>>>> >> 2. Are you going to revert all removed test cases for the
>>>>>>> deprecated ones?
>>>>>>> > This is a good point; making sure we keep the tests as well is
>>>>>>> important (worse than removing a deprecated API is shipping it
>>>>>>> broken).
>>>>>>>
>>>>>>> (I'd say, yes of course! which seems consistent with what is
>>>>>>> happening now)
>>>>>>>
>>>>>>> >> 3. Does it make any delay for the Apache Spark 3.0.0 release?
>>>>>>> >> (I believe it was previously scheduled for June, before
>>>>>>> Spark Summit 2020)
>>>>>>> >
>>>>>>> > I think if we need to delay to make a better release this is ok,
>>>>>>> especially given our current preview releases being available to
>>>>>>> gather community feedback.
>>>>>>>
>>>>>>> Of course these things block 3.0 -- all the more reason to keep it
>>>>>>> specific and targeted -- but nothing so far seems inconsistent with
>>>>>>> finishing in a month or two.
>>>>>>>
>>>>>>> >> Although there was a discussion already, I want to make sure
>>>>>>> about the following tough parts.
>>>>>>> >> 4. We are not going to add the Scala 2.11 API back, right?
>>>>>>> > I hope not.
>>>>>>> >>
>>>>>>> >> 5. We are not going to support Python 2.x in Apache Spark
>>>>>>> 3.1+, right?
>>>>>>> > I think doing that would be bad; it has already reached end of
>>>>>>> life elsewhere.
>>>>>>>
>>>>>>> Yeah, this is an important subtext -- the valuable principles here
>>>>>>> could be interpreted in many different ways depending on how much
>>>>>>> you weigh the value of keeping APIs for compatibility vs. the value
>>>>>>> of simplifying Spark and pushing users to newer APIs more forcibly.
>>>>>>> They're all judgment calls, based on necessarily limited data about
>>>>>>> the universe of users. We can only go on rare direct user feedback,
>>>>>>> on feedback perhaps from vendors as proxies for a subset of users,
>>>>>>> and the general good-faith judgment of committers who have lived
>>>>>>> Spark for years.
>>>>>>>
>>>>>>> My specific interpretation is that the standard is (correctly)
>>>>>>> tightening going forward, and retroactively a bit for 3.0. But I do
>>>>>>> not think anyone is advocating for the logical extreme of, for
>>>>>>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>>>>>>> that falls out readily from the rubric here: maintaining 2.11
>>>>>>> compatibility is really quite painful if you ever support 2.13 too,
>>>>>>> for example.

-- 
---
Takeshi Yamamuro
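On the point above about keeping the tests for whatever gets added back, here is a minimal sketch of what such a check could look like. The suite and names are illustrative (not taken from Spark's actual test code), and it assumes ScalaTest and a local session; the only claim is that a restored deprecated alias should behave exactly like its replacement.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{degrees, toDegrees}
    import org.scalatest.funsuite.AnyFunSuite

    class RestoredApiSuite extends AnyFunSuite {
      test("restored toDegrees matches degrees") {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("restored-api-suite")
          .getOrCreate()
        import spark.implicits._
        try {
          val df = Seq(0.0, math.Pi / 2, math.Pi).toDF("angle")
          // The deprecated alias and its replacement should give identical results.
          val rows = df.select(toDegrees($"angle"), degrees($"angle")).collect()
          assert(rows.forall(r => r.getDouble(0) == r.getDouble(1)))
        } finally {
          spark.stop()
        }
      }
    }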