+1

Xiao

Michael Armbrust <mich...@databricks.com> wrote on Mon, Feb 24, 2020, at 3:03 PM:

> Hello Everyone,
>
> As more users have started upgrading to Spark 3.0 preview (including
> myself), there have been many discussions around APIs that have been broken
> compared with Spark 2.x. In many of these discussions, one of the
> rationales for breaking an API seems to be "Spark follows semantic
> versioning <https://spark.apache.org/versioning-policy.html>, so this
> major release is our chance to get it right [by breaking APIs]". Similarly,
> in many cases the response to questions about why an API was completely
> removed has been, "this API has been deprecated since x.x, so we have to
> remove it".
>
> As a long-time contributor to and user of Spark, I find this
> interpretation of the policy concerning. This reasoning misses the
> intention of the original policy, and I am worried that it will hurt the
> long-term success of the project.
>
> I definitely understand that these are hard decisions, and I'm not
> proposing that we never remove anything from Spark. However, I would like
> to give some additional context and also propose a different rubric for
> thinking about API breakage moving forward.
>
> Spark adopted semantic versioning back in 2014 during the preparations for
> the 1.0 release. As this was the first major release -- and as, up until
> fairly recently, Spark had only been an academic project -- no real
> promises about API stability had ever been made.
>
> During the discussion, some committers suggested that this was an
> opportunity to clean up cruft and give the Spark APIs a once-over, making
> cosmetic changes to improve consistency. However, in the end, it was
> decided that in many cases it was not in the best interests of the Spark
> community to break things just because we could. Matei actually said it
> pretty forcefully
> <http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-for-Spark-Release-Strategy-td464i20.html#a503>
> :
>
> I know that some names are suboptimal, but I absolutely detest breaking
> APIs, config names, etc. I’ve seen it happen way too often in other
> projects (even things we depend on that are officially post-1.0, like Akka
> or Protobuf or Hadoop), and it’s very painful. I think that we as fairly
> cutting-edge users are okay with libraries occasionally changing, but many
> others will consider it a show-stopper. Given this, I think that any
> cosmetic change now, even though it might improve clarity slightly, is not
> worth the tradeoff in terms of creating an update barrier for existing
> users.
>
> In the end, while some changes were made, most APIs remained the same and
> users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think
> this served the project very well, as compatibility means users are able to
> upgrade and we keep as many people on the latest versions of Spark (though
> maybe not the latest APIs of Spark) as possible.
>
> As Spark grows, I think compatibility actually becomes more important and
> we should be more conservative rather than less. Today, there are very
> likely more Spark programs running than there were at any other time in the
> past. Spark is no longer a tool used only by advanced hackers; it is now
> also running "traditional enterprise workloads." In many cases these jobs
> are powering important processes long after the original author leaves.
>
> Broken APIs can also affect libraries that extend Spark. This can be even
> harder on users: if a library they depend on has not been upgraded to use
> the new APIs, they are stuck.
>
> Given all of this, I'd like to propose the following rubric as an addition
> to our semantic versioning policy. After discussion and if people agree
> this is a good idea, I'll call a vote of the PMC to ratify its inclusion in
> the official policy.
>
> Considerations When Breaking APIs
>
> The Spark project strives to avoid breaking APIs or silently changing
> behavior, even at major versions. While this is not always possible, the
> balance of the following factors should be considered before choosing to
> break an API.
>
> Cost of Breaking an API
>
> Breaking an API almost always has a non-trivial cost to the users of
> Spark. A broken API means that Spark programs need to be rewritten before
> they can be upgraded. However, there are a few considerations when thinking
> about what the cost will be:
>
>    - Usage - an API that is actively used in many different places is
>      always very costly to break. While it is hard to know usage for sure,
>      there are a number of ways we can estimate it:
>       - How long has the API been in Spark?
>       - Is the API common even for basic programs?
>       - How often do we see recent questions in JIRA or mailing lists?
>       - How often does it appear in StackOverflow or blogs?
>    - Behavior after the break - How will a program that works today behave
>      after the break? The following are listed roughly in order of
>      increasing severity:
>       - Will there be a compiler or linker error?
>       - Will there be a runtime exception?
>       - Will that exception happen after significant processing has been
>         done?
>       - Will we silently return different answers? (very hard to debug,
>         might not even notice!)
>
>
> Cost of Maintaining an API
>
> Of course, the above does not mean that we will never break any APIs. We
> must also consider the cost both to the project and to our users of keeping
> the API in question.
>
>    - Project Costs - Every API we have needs to be tested and needs to
>      keep working as other parts of the project change. These costs are
>      significantly exacerbated when external dependencies change (the JVM,
>      Scala, etc.). In some cases, while maintaining a particular API is not
>      technically infeasible, the cost of doing so can become too high.
>    - User Costs - APIs also have a cognitive cost to users learning Spark
>      or trying to understand Spark programs. This cost becomes even higher
>      when the API in question has confusing or undefined semantics.
>
>
> Alternatives to Breaking an API
>
> In cases where there is a "Bad API", but where the cost of removal is also
> high, there are alternatives that should be considered that do not hurt
> existing users but do address some of the maintenance costs.
>
>
>    - Avoid Bad APIs - While this is a bit obvious, it is an important
>      point. Any time we add a new interface to Spark we should consider
>      that we might be stuck with this API forever. Think deeply about how
>      new APIs relate to existing ones, as well as how you expect them to
>      evolve over time.
>    - Deprecation Warnings - All deprecation warnings should point to a
>      clear alternative and should never just say that an API is deprecated
>      (see the sketch after this list).
>    - Updated Docs - Documentation should point to the "best" recommended
>      way of performing a given task. In cases where we maintain legacy
>      documentation, we should clearly point to newer APIs and show users
>      the "right" way.
>    - Community Work - Many people learn Spark by reading blogs and other
>      sites such as StackOverflow. However, many of these resources are out
>      of date. Updating them reduces the cost of eventually removing
>      deprecated APIs.
>
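> On the deprecation-warnings point, here is a minimal sketch (hypothetical
> code, not Spark's actual source) of a message that names its replacement
> instead of only saying that the method is deprecated:
>
>     import org.apache.spark.sql.DataFrame
>
>     object TableApi {
>       // The message tells users exactly what to call instead, and since when.
>       @deprecated("Use createOrReplaceTempView(viewName) instead.", "2.0.0")
>       def registerTempTable(df: DataFrame, tableName: String): Unit =
>         df.createOrReplaceTempView(tableName)
>     }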
>
> Examples
>
> Here are some examples of how I think the policy above could be applied to
> different issues that have been discussed recently. These are only to
> illustrate how to apply the above rubric, but are not intended to be part
> of the official policy.
>
> [SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow
> multiple creation of SparkContexts #23311
> <https://github.com/apache/spark/pull/23311>
>
>
>    - Cost to Break - Multiple contexts in a single JVM never worked
>      properly. When users tried it they would nearly always report that
>      Spark was broken (SPARK-2243
>      <https://issues.apache.org/jira/browse/SPARK-2243>), due to the
>      confusing set of log messages. Given this, I think it is very unlikely
>      that there are many real-world use cases active today. Even those
>      cases likely suffer from undiagnosed issues, as there are many areas
>      of Spark that assume a single context per JVM.
>    - Cost to Maintain - We have recently had users ask on the mailing list
>      if this was supported, as the conf led them to believe it was, and the
>      existence of this configuration as "supported" makes it harder to
>      reason about certain global state in SparkContext.
>
>
> Decision: Remove this configuration and related code.
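>
> For context, a rough sketch of what the escape hatch looked like in Spark
> 2.x (the app name here is made up):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     val conf = new SparkConf()
>       .setAppName("second-context-example")
>       .setMaster("local[*]")
>       .set("spark.driver.allowMultipleContexts", "true")
>     val sc1 = new SparkContext(conf)
>     // With the flag set, creating a second context in the same JVM only
>     // logged a warning instead of failing fast, even though much of Spark
>     // still assumed a single context per JVM.
>     val sc2 = new SparkContext(conf)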
>
> [SPARK-25908] Remove registerTempTable #22921
> <https://github.com/apache/spark/pull/22921/> (only looking at one API of
> this PR)
>
>
>    - Cost to Break - This is a wildly popular API of Spark SQL that has
>      been there since the first release. There are tons of blog posts and
>      examples that use this syntax if you google "dataframe registerTempTable
>      <https://www.google.com/search?q=dataframe+registertemptable&rlz=1C5CHFA_enUS746US746&oq=dataframe+registertemptable&aqs=chrome.0.0l8.3040j1j7&sourceid=chrome&ie=UTF-8>"
>      (even more than the "correct" API "dataframe createOrReplaceTempView
>      <https://www.google.com/search?rlz=1C5CHFA_enUS746US746&ei=TkZMXrj1ObzA0PEPpLKR2A4&q=dataframe+createorreplacetempview&oq=dataframe+createor&gs_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1......0....1..gws-wiz.......0i71j0i131.zP34wH1novM>").
>      All of these would be invalid for users of Spark 3.0.
>    - Cost to Maintain - This is just an alias, so there is not a lot of
>      extra machinery required to keep the API. Users have two ways to do
>      the same thing, but we can note in the docs that this is just an
>      alias.
>
>
> Decision: Do not remove this API; I would even consider un-deprecating
> it. I anecdotally asked several users, and this is the API they prefer
> over the "correct" one.
>
> [SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195
> <https://github.com/apache/spark/pull/24195>
>
>    - Cost to Break - I think this case actually exemplifies several
>      anti-patterns in breaking APIs. In some languages, the deprecation
>      warning gives you no help other than what version the function was
>      removed in. In R, it points users to a really deep conversation on the
>      semantics of time in Spark SQL. None of the messages tell you how you
>      should correctly be parsing a timestamp that is given to you in a
>      format other than UTC. My guess is that all users will blindly flip
>      the flag to true (to keep using this function), so you've only
>      succeeded in annoying them.
>    - Cost to Maintain - These are two relatively isolated expressions;
>      there should be little cost to keeping them. Users can be confused by
>      their semantics, so we should probably update the docs to point them
>      to a best practice (I only learned by complaining on the PR that a
>      good practice is to parse timestamps including the time zone in the
>      format expression, which naturally shifts them to UTC).
>
>
> Decision: Do not deprecate these two functions. We should update the docs
> to talk about best practices for parsing timestamps, including how to
> correctly shift them to UTC for storage.
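>
> To make that best practice concrete, a minimal sketch (the input value and
> column names are made up; assumes a spark-shell session so the implicits
> are available):
>
>     import org.apache.spark.sql.functions._
>     import spark.implicits._
>
>     val raw = Seq("2020-02-24 15:03:00 -08:00").toDF("ts_string")
>     // Including the zone offset in the pattern parses the value as an
>     // instant, so the stored timestamp is already shifted to UTC and no
>     // extra from_utc_timestamp/to_utc_timestamp call is needed.
>     val parsed = raw.select(
>       to_timestamp($"ts_string", "yyyy-MM-dd HH:mm:ss XXX").as("ts"))
>     parsed.show(false)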
>
> [SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
> <https://github.com/apache/spark/pull/24902>
>
>
>    - Cost to Break - The TRIM function takes two string parameters. If we
>      switch the parameter order, queries that use the TRIM function would
>      silently get different results on different versions of Spark. Users
>      may not notice it for a long time, and wrong query results may cause
>      serious problems for users.
>    - Cost to Maintain - We will have some inconsistency inside Spark, as
>      the TRIM function in the Scala API and in SQL have different parameter
>      orders.
>
>
> Decision: Do not switch the parameter order. Promote the TRIM(trimStr
> FROM srcStr) syntax in our SQL docs, as it is the SQL standard. Deprecate
> (with a warning, not by removing) the SQL TRIM function and move users to
> the SQL standard TRIM syntax.
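>
> For reference, the SQL-standard syntax being promoted here is unambiguous
> about which argument is the trim string (a small sketch, assuming a
> SparkSession named spark):
>
>     spark.sql("SELECT TRIM(BOTH 'x' FROM 'xxSparkxx')").show()      // Spark
>     spark.sql("SELECT TRIM(LEADING 'x' FROM 'xxSparkxx')").show()   // Sparkxx
>     spark.sql("SELECT TRIM(TRAILING 'x' FROM 'xxSparkxx')").show()  // xxSpark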
>
> Thanks for taking the time to read this! Happy to discuss the specifics
> and amend this policy as the community sees fit.
>
> Michael
>
>
