Thank you all, especially for the audit efforts.

Until now, the whole community has been working together in the same
direction under the existing policy, which is always a good thing.

Since it seems we are considering a new direction, I created an umbrella
JIRA to track all related activities.

      https://issues.apache.org/jira/browse/SPARK-31085
      Amend Spark's Semantic Versioning Policy

As we know, a community-wide directional change always has a huge impact
on daily PR reviews and regular releases. So we should treat each reverting
PR as a normal, independent PR rather than a follow-up. Specifically, I
believe we need the following.

    1. Use new JIRA IDs instead of treating these as simple reverts or
        follow-ups, because we are not adding everything back blindly.
        For example,
            https://issues.apache.org/jira/browse/SPARK-31089
            "Add back ImageSchema.readImages in Spark 3.0"
        was created and closed as 'Won't Do' after weighing the trade-offs.
        We need a JIRA-issue-level history of each such request and its
        decision.

    2. Sometimes, as described by Michael, reverting alone is insufficient.
        We need to provide more fine-grained deprecation for users' safety,
        case by case (see the sketch after this list).

    3. Given the timeline, every newly added API should have test coverage
        in the same PR from the beginning.
        This is required because the whole reverting effort aims to give
        users back a working API.
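
For items 2 and 3 above, here is a minimal sketch of what I mean, in the
style of the `functions.toDegrees` add-back: the restored API is only a
deprecated shim delegating to the current one, and its test lands in the
same PR. This is illustrative only; it assumes a QueryTest-style suite
(with testImplicits imported) and is not copied from any specific PR.

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.degrees

    // Inside the functions object: the restored API is a deprecated shim
    // that only delegates; no removed logic is reintroduced.
    @deprecated("Use degrees instead", "2.1.0")
    def toDegrees(e: Column): Column = degrees(e)

    // Test coverage in the same PR, so we ship a *working* API back.
    test("toDegrees matches degrees") {
      val df = Seq(math.Pi).toDF("rad")
      checkAnswer(
        df.select(toDegrees($"rad")),
        df.select(degrees($"rad"))) // both yield 180.0
    }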

I believe we are having a good discussion in this thread.
We are making a big change in Apache Spark's history.
Please be part of it by taking action: replying, voting, and reviewing.

Thanks,
Dongjoon.


On Sat, Mar 7, 2020 at 11:20 PM Takeshi Yamamuro <linguin....@gmail.com>
wrote:

> Yea, +1 on Jungtaek's suggestion; having the same strict policy for adding
> new APIs looks nice.
>
> > When we make API changes (e.g., adding new APIs or changing existing
> ones), we should regularly publish them on the dev list. I am willing to
> lead this effort, work with my colleagues to summarize all the merged
> commits (especially the API changes), and then send the *bi-weekly
> digest* to the dev list
>
> This digest looks very helpful for the community, thanks, Xiao!
>
> Bests,
> Takeshi
>
> On Sun, Mar 8, 2020 at 12:05 PM Xiao Li <gatorsm...@gmail.com> wrote:
>
>> I want to publicly thank *Ruifeng Zheng* for his work listing all the
>> signature differences in Core, SQL, and Hive that we made in this upcoming
>> release. For details, please read the files attached to SPARK-30982
>> <https://issues.apache.org/jira/browse/SPARK-30982>. I went over these
>> files and submitted the following PRs to add back the SparkSQL APIs whose
>> maintenance costs are low, based on my own experience in SparkSQL
>> development:
>>
>>    - https://github.com/apache/spark/pull/27821
>>       - functions.toDegrees/toRadians
>>       - functions.approxCountDistinct
>>       - functions.monotonicallyIncreasingId
>>       - Column.!==
>>       - Dataset.explode
>>       - Dataset.registerTempTable
>>       - SQLContext.getOrCreate, setActive, clearActive, constructors
>>    - https://github.com/apache/spark/pull/27815
>>       - HiveContext
>>       - createExternalTable APIs
>>    - https://github.com/apache/spark/pull/27839
>>       - SQLContext.applySchema
>>       - SQLContext.parquetFile
>>       - SQLContext.jsonFile
>>       - SQLContext.jsonRDD
>>       - SQLContext.load
>>       - SQLContext.jdbc
>>
>> If you think these APIs should not be added back, let me know and we can
>> discuss the items further. In general, I think we should have provided
>> more evidence and discussed these APIs publicly before dropping them in
>> the first place.
>>
>> +1 on Jungtaek's comments. When we make API changes (e.g., adding new
>> APIs or changing existing ones), we should regularly publish them on the
>> dev list. I am willing to lead this effort, work with my colleagues to
>> summarize all the merged commits (especially the API changes), and then
>> send the *bi-weekly digest* to the dev list. If you are willing to join
>> this working group and help build these digests, feel free to send me a
>> note [lix...@databricks.com].
>>
>> Cheers,
>>
>> Xiao
>>
>>
>>
>>
>> Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote on Sat, Mar 7, 2020 at 4:50 PM:
>>
>>> +1 for Sean as well.
>>>
>>> Moreover, as I voiced in the previous thread, if we want to be strict
>>> about retaining public APIs, what we really need alongside this is a
>>> similarly strict (or stricter) policy for adding public APIs. If we
>>> don't apply the policy symmetrically, the problem will only get worse:
>>> it's still not that hard to add a public API (it only requires a normal
>>> review), but once an API is added and released, it's really hard to
>>> remove it.
>>>
>>> If we consider adding and deprecating/removing public APIs as "critical"
>>> changes for the project, IMHO it would give better visibility and more
>>> open discussion if we routed them through the dev@ mailing list instead
>>> of directly filing a PR. With so many PRs being submitted, it's nearly
>>> impossible to look into all of them - it would require us to "watch"
>>> the repo and receive tons of mail. Compared to the volume of GitHub PRs,
>>> the dev@ mailing list is not that crowded, so there is less chance of
>>> missing critical changes, and decisions aren't made quickly by only a
>>> couple of committers.
>>>
>>> These suggestions would slow down development - which may make us
>>> realize that we want to "classify/mark" user-facing public APIs versus
>>> others (merely exposed as public) and apply the full policies only to
>>> the former, as sketched below. For the latter, we wouldn't need to
>>> guarantee anything.
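>>>
>>> To make that classification concrete, one option is reusing the marker
>>> annotations Spark already ships in org.apache.spark.annotation. A rough
>>> sketch - the policy mapping in the comments is only my assumption, not
>>> an agreed rule, and the class names are hypothetical:
>>>
>>>     import org.apache.spark.annotation.{DeveloperApi, Stable}
>>>
>>>     // User-facing: the full versioning policy (and a dev@ discussion
>>>     // for additions/removals) would apply.
>>>     @Stable
>>>     class UserFacingReader {
>>>       def load(path: String): Unit = println(s"loading $path")
>>>     }
>>>
>>>     // Merely exposed as public (e.g., an extension point): no
>>>     // compatibility guarantee under this proposal.
>>>     @DeveloperApi
>>>     class InternalHook {
>>>       def onEvent(name: String): Unit = println(s"event: $name")
>>>     }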
>>>
>>>
>>> On Sun, Mar 8, 2020 at 4:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> +1 for Sean's concerns and questions.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Fri, Mar 6, 2020 at 3:14 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> This thread established some good general principles, illustrated by a
>>>>> few good examples. It didn't draw specific conclusions about what to add
>>>>> back, which is why it wasn't at all controversial. What it means in
>>>>> specific cases is where there may be disagreement, and that harder 
>>>>> question
>>>>> hasn't been addressed.
>>>>>
>>>>> The reverts I have seen so far seemed like the obvious ones, but yes,
>>>>> there are several more going on now, some pretty broad. I am not even sure
>>>>> what all of them are. In addition to those below, there is
>>>>> https://github.com/apache/spark/pull/27839. Would it be too much
>>>>> overhead to post to this thread any changes that one believes are endorsed
>>>>> by these principles (and perhaps by a stricter interpretation of them)?
>>>>> It's important enough that we should get any data points or input now.
>>>>> (We're obviously not going to debate each one.) A draft PR, or several,
>>>>> actually sounds like a good vehicle for that -- as long as people know
>>>>> about them!
>>>>>
>>>>> Also, is there any usage data available to share? Many arguments turn
>>>>> on 'commonly used', but can we know that more concretely?
>>>>>
>>>>> Otherwise I think we'll back into implementing personal
>>>>> interpretations of general principles, which is arguably the issue in the
>>>>> first place, even when everyone believes in good faith in the same
>>>>> principles.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> Recently, reverting PRs seem to be spreading like the
>>>>>> *well-known* virus.
>>>>>> Can we finalize this policy first, before making unofficial personal
>>>>>> decisions? Technically, this thread was not a vote, and our website
>>>>>> doesn't have a clear policy yet.
>>>>>>
>>>>>> https://github.com/apache/spark/pull/27821
>>>>>> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>>>>>>     ==> This technically reverts most of SPARK-25908.
>>>>>>
>>>>>> https://github.com/apache/spark/pull/27835
>>>>>> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the
>>>>>> operands"
>>>>>>
>>>>>> https://github.com/apache/spark/pull/27834
>>>>>> Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> There is an ongoing PR from Xiao referencing this email.
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/27821
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>>> wrote:
>>>>>>>> >>     1. Could you estimate how many revert commits are required
>>>>>>>> in `branch-3.0` for the new rubric?
>>>>>>>>
>>>>>>>> Fair question about what actual change this implies for 3.0. So far
>>>>>>>> it seems like some targeted, quite reasonable reverts. I don't think
>>>>>>>> anyone's suggesting reverting loads of changes.
>>>>>>>>
>>>>>>>>
>>>>>>>> >>     2. Are you going to revert all removed test cases for the
>>>>>>>> deprecated ones?
>>>>>>>> > This is a good point; making sure we keep the tests as well is
>>>>>>>> important (worse than removing a deprecated API is shipping it
>>>>>>>> broken).
>>>>>>>>
>>>>>>>> (I'd say, yes of course! which seems consistent with what is
>>>>>>>> happening now)
>>>>>>>>
>>>>>>>>
>>>>>>>> >>     3. Will this cause any delay for the Apache Spark 3.0.0
>>>>>>>> release?
>>>>>>>> >>         (I believe it was previously scheduled for June, before
>>>>>>>> Spark Summit 2020)
>>>>>>>> >
>>>>>>>> > I think if we need to delay to make a better release, that's OK,
>>>>>>>> especially given that our preview releases are available to gather
>>>>>>>> community feedback.
>>>>>>>>
>>>>>>>> Of course these things block 3.0 -- all the more reason to keep it
>>>>>>>> specific and targeted -- but nothing so far seems inconsistent with
>>>>>>>> finishing in a month or two.
>>>>>>>>
>>>>>>>>
>>>>>>>> >> Although there was a discussion already, I want to be sure about
>>>>>>>> the following tough parts.
>>>>>>>> >>     4. We are not going to add Scala 2.11 API, right?
>>>>>>>> > I hope not.
>>>>>>>> >>
>>>>>>>> >>     5. We are not going to support Python 2.x in Apache Spark
>>>>>>>> 3.1+, right?
>>>>>>>> > I think doing that would be bad; it has already reached end of
>>>>>>>> life elsewhere.
>>>>>>>>
>>>>>>>> Yeah, this is an important subtext -- the valuable principles here
>>>>>>>> could be interpreted in many different ways depending on how much
>>>>>>>> you weigh the value of keeping APIs for compatibility vs. the value
>>>>>>>> of simplifying Spark and pushing users to newer APIs more forcibly.
>>>>>>>> They're all judgment calls, based on necessarily limited data about
>>>>>>>> the universe of users. We can only go on rare direct user feedback,
>>>>>>>> on feedback perhaps from vendors as proxies for a subset of users,
>>>>>>>> and the general good-faith judgment of committers who have lived
>>>>>>>> Spark for years.
>>>>>>>>
>>>>>>>> My specific interpretation is that the standard is (correctly)
>>>>>>>> tightening going forward, and retroactively a bit for 3.0. But, I do
>>>>>>>> not think anyone is advocating for the logical extreme of, for
>>>>>>>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>>>>>>>> that falls out readily from the rubric here: maintaining 2.11
>>>>>>>> compatibility is really quite painful if you ever support 2.13 too,
>>>>>>>> for example.
>>>>>>>>
>>>>>>>
>
> --
> ---
> Takeshi Yamamuro
>
