I want to publicly thank *Ruifeng Zheng* for his work listing all the
signature differences in Core, SQL and Hive that we made in this upcoming
release. For details, please read the files attached to SPARK-30982
<https://issues.apache.org/jira/browse/SPARK-30982>. I went over these
files and submitted the following PRs to add back the SparkSQL APIs whose
maintenance costs are low, based on my own experience in SparkSQL
development (a short sketch of a few of these mappings follows the list):

   - https://github.com/apache/spark/pull/27821
      - functions.toDegrees/toRadians
      - functions.approxCountDistinct
      - functions.monotonicallyIncreasingId
      - Column.!==
      - Dataset.explode
      - Dataset.registerTempTable
      - SQLContext.getOrCreate, setActive, clearActive, constructors
   - https://github.com/apache/spark/pull/27815
      - HiveContext
      - createExternalTable APIs
   - https://github.com/apache/spark/pull/27839
      - SQLContext.applySchema
      - SQLContext.parquetFile
      - SQLContext.jsonFile
      - SQLContext.jsonRDD
      - SQLContext.load
      - SQLContext.jdbc
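
To make the mapping concrete, here is a minimal sketch (Scala, not
exhaustive) of how a few of the restored deprecated names line up with
their current replacements. It assumes a local SparkSession and a build
that still exposes the deprecated names (e.g. Spark 2.4, or a 3.0 build
with the PRs above merged); the table name t_old is only illustrative.

    // Sketch only: assumes the deprecated names below are available
    // (Spark 2.4, or Spark 3.0 with the add-back PRs above merged).
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deprecated-api-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 180.0), (2, 90.0)).toDF("id", "deg")

    // Deprecated alias on the left, current replacement on the right.
    df.select(toRadians($"deg"), radians($"deg")).show()
    df.select(approxCountDistinct($"id"), approx_count_distinct($"id")).show()
    df.select(monotonicallyIncreasingId(), monotonically_increasing_id()).show()
    df.filter($"id" !== 1).show()     // Column.!== vs. Column.=!=
    df.registerTempTable("t_old")     // vs. Dataset.createOrReplaceTempView
    spark.sql("SELECT count(*) FROM t_old").show()

    spark.stop()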

If you think these APIs should not be added back, let me know and we can
discuss the items further. In general, I think we should provide more
evidence and discuss it publicly when we drop APIs like these in the first
place.

+1 on Jungtaek's comments. When we make API changes (e.g., adding new
APIs or changing existing ones), we should regularly publish them on the
dev list. I am willing to lead this effort, work with my colleagues
to summarize all the merged commits [especially the API changes], and then
send a *bi-weekly digest* to the dev list. If you are willing to join
this working group and help build these digests, feel free to send me a
note [lix...@databricks.com].

Cheers,

Xiao




Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote on Sat, Mar 7, 2020 at 4:50 PM:

> +1 for Sean as well.
>
> Moreover, as I voiced on the previous thread, if we want to be strict
> about retaining public APIs, what we really need to do along with this is
> to have a similarly strict (or stricter) policy for adding public APIs. If
> we don't apply the policy symmetrically, the problem will get worse: it is
> still not that hard to add a public API (it only requires a normal review),
> but once an API is added and released it is really hard to remove it.
>
> If we consider adding public APIs and deprecating/removing public APIs as
> "critical" changes for the project, IMHO, it would give better visibility
> and more open discussion if we made them go through the dev@ mailing list
> instead of directly filing a PR. With so many PRs being submitted, it is
> nearly impossible to look into all of them - that would require us to
> "watch" the repo and receive tons of mail. Compared to GitHub PRs, the dev@
> mailing list is not that crowded, so there is less chance of missing the
> critical changes, and decisions are not made quickly by only a couple of
> committers.
>
> These suggestions would slow down development - that may make us
> realize we want to "classify/mark" user-facing public APIs versus others
> (merely exposed as public) and only apply all the policies to the former.
> For the latter we don't need to guarantee anything.
>
>
> On Sun, Mar 8, 2020 at 4:31 AM Dongjoon Hyun <dongjoon.h...@gmail.com>
> wrote:
>
>> +1 for Sean's concerns and questions.
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Mar 6, 2020 at 3:14 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> This thread established some good general principles, illustrated by a
>>> few good examples. It didn't draw specific conclusions about what to add
>>> back, which is why it wasn't at all controversial. What it means in
>>> specific cases is where there may be disagreement, and that harder question
>>> hasn't been addressed.
>>>
>>> The reverts I have seen so far seemed like the obvious ones, but yes,
>>> there are several more going on now, some pretty broad. I am not even sure
>>> what all of them are. In addition to below,
>>> https://github.com/apache/spark/pull/27839. Would it be too much
>>> overhead to post to this thread any changes that one believes are endorsed
>>> by these principles and perhaps a more strict interpretation of them now?
>>> It's important enough that we should get any data points or input now.
>>> (We're obviously not going to debate each one.) A draft PR, or several,
>>> actually sounds like a good vehicle for that -- as long as people know
>>> about them!
>>>
>>> Also, is there any usage data available to share? Many arguments turn
>>> on 'commonly used', but can we know that more concretely?
>>>
>>> Otherwise I think we'll back into implementing personal interpretations
>>> of general principles, which is arguably the issue in the first place, even
>>> when everyone believes in good faith in the same principles.
>>>
>>>
>>>
>>> On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> Recently, reverting PRs seems to be starting to spread like the
>>>> *well-known* virus.
>>>> Can we finalize this first, before making unofficial personal decisions?
>>>> Technically, this thread was not a vote and our website doesn't have a
>>>> clear policy yet.
>>>>
>>>> https://github.com/apache/spark/pull/27821
>>>> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>>>>     ==> This technically reverts most of SPARK-25908.
>>>>
>>>> https://github.com/apache/spark/pull/27835
>>>> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the
>>>> operands"
>>>>
>>>> https://github.com/apache/spark/pull/27834
>>>> Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> There is an ongoing PR from Xiao referencing this email.
>>>>>
>>>>> https://github.com/apache/spark/pull/27821
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>> wrote:
>>>>>> >>     1. Could you estimate how many revert commits are required in
>>>>>> `branch-3.0` for the new rubric?
>>>>>>
>>>>>> Fair question about what actual change this implies for 3.0. So far it
>>>>>> seems like some targeted, quite reasonable reverts. I don't think
>>>>>> anyone's suggesting reverting loads of changes.
>>>>>>
>>>>>>
>>>>>> >>     2. Are you going to revert all removed test cases for the
>>>>>> deprecated ones?
>>>>>> > This is a good point, making sure we keep the tests as well is
>>>>>> important (worse than removing a deprecated API is shipping it broken).
>>>>>>
>>>>>> (I'd say, yes of course! which seems consistent with what is
>>>>>> happening now)
>>>>>>
>>>>>>
>>>>>> >>     3. Does it make any delay for Apache Spark 3.0.0 release?
>>>>>> >>         (I believe it was previously scheduled on June before
>>>>>> Spark Summit 2020)
>>>>>> >
>>>>>> > I think if we need to delay to make a better release this is ok,
>>>>>> especially given our current preview releases being available to gather
>>>>>> community feedback.
>>>>>>
>>>>>> Of course these things block 3.0 -- all the more reason to keep it
>>>>>> specific and targeted -- but nothing so far seems inconsistent with
>>>>>> finishing in a month or two.
>>>>>>
>>>>>>
>>>>>> >> Although there was a discussion already, I want to make the
>>>>>> following tough parts sure.
>>>>>> >>     4. We are not going to add Scala 2.11 API, right?
>>>>>> > I hope not.
>>>>>> >>
>>>>>> >>     5. We are not going to support Python 2.x in Apache Spark
>>>>>> 3.1+, right?
>>>>>> > I think doing that would be bad; it's already end-of-life
>>>>>> elsewhere.
>>>>>>
>>>>>> Yeah this is an important subtext -- the valuable principles here
>>>>>> could be interpreted in many different ways depending on how much you
>>>>>> weight value of keeping APIs for compatibility vs value in simplifying
>>>>>> Spark and pushing users to newer APIs more forcibly. They're all
>>>>>> judgment calls, based on necessarily limited data about the universe
>>>>>> of users. We can only go on rare direct user feedback, on feedback
>>>>>> perhaps from vendors as proxies for a subset of users, and the general
>>>>>> good faith judgment of committers who have lived Spark for years.
>>>>>>
>>>>>> My specific interpretation is that the standard is (correctly)
>>>>>> tightening going forward, and retroactively a bit for 3.0. But, I do
>>>>>> not think anyone is advocating for the logical extreme of, for
>>>>>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>>>>>> that falls out readily from the rubric here: maintaining 2.11
>>>>>> compatibility is really quite painful if you ever support 2.13 too,
>>>>>> for example.
>>>>>>
>>>>>
