Hello Everyone,

As more users have started upgrading to Spark 3.0 preview (including
myself), there have been many discussions around APIs that have been broken
compared with Spark 2.x. In many of these discussions, one of the
rationales for breaking an API seems to be "Spark follows semantic
versioning <https://spark.apache.org/versioning-policy.html>, so this major
release is our chance to get it right [by breaking APIs]". Similarly, in
many cases the response to questions about why an API was completely
removed has been, "this API has been deprecated since x.x, so we have to
remove it".

As a long-time contributor to and user of Spark, I find this
interpretation of the policy concerning. This reasoning misses the
intention of the original policy, and I am worried that it will hurt the
long-term success of the project.

I definitely understand that these are hard decisions, and I'm not
proposing that we never remove anything from Spark. However, I would like
to give some additional context and also propose a different rubric for
thinking about API breakage moving forward.

Spark adopted semantic versioning back in 2014, during the preparations
for the 1.0 release. As this was the first major release -- and as, until
fairly recently, Spark had been only an academic project -- no real
promises about API stability had ever been made.

During the discussion, some committers suggested that this was an
opportunity to clean up cruft and give the Spark APIs a once-over, making
cosmetic changes to improve consistency. However, in the end, it was
decided that in many cases it was not in the best interests of the Spark
community to break things just because we could. Matei actually said it
pretty forcefully
<http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-for-Spark-Release-Strategy-td464i20.html#a503>:

I know that some names are suboptimal, but I absolutely detest breaking
APIs, config names, etc. I’ve seen it happen way too often in other
projects (even things we depend on that are officially post-1.0, like Akka
or Protobuf or Hadoop), and it’s very painful. I think that we as fairly
cutting-edge users are okay with libraries occasionally changing, but many
others will consider it a show-stopper. Given this, I think that any
cosmetic change now, even though it might improve clarity slightly, is not
worth the tradeoff in terms of creating an update barrier for existing
users.

In the end, while some changes were made, most APIs remained the same and
users of Spark <= 0.9 were able to upgrade to 1.0 fairly easily. I think
this served the project very well: compatibility means users can upgrade,
which keeps as many people as possible on the latest versions of Spark
(though maybe not the latest APIs of Spark).

As Spark grows, I think compatibility actually becomes more important and
we should be more conservative rather than less. Today, there are very
likely more Spark programs running than there were at any other time in the
past. Spark is no longer a tool used only by advanced hackers; it is now
also running "traditional enterprise workloads." In many cases these jobs
power important processes long after the original author has left.

Broken APIs can also affect libraries that extend Spark. This can be even
harder for users: if a library they depend on has not been upgraded to use
the new APIs, they are stuck.

Given all of this, I'd like to propose the following rubric as an addition
to our semantic versioning policy. After discussion and if people agree
this is a good idea, I'll call a vote of the PMC to ratify its inclusion in
the official policy.

Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing
behavior, even at major versions. While this is not always possible, the
balance of the following factors should be considered before choosing to
break an API.

Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark.
A broken API means that Spark programs need to be rewritten before they can
be upgraded. However, there are a few considerations when thinking about
what the cost will be:

   - Usage - an API that is actively used in many different places is
     always very costly to break. While it is hard to know usage for sure,
     there are a number of ways we can estimate it:
      - How long has the API been in Spark?
      - Is the API common even for basic programs?
      - How often do we see recent questions in JIRA or on the mailing
        lists?
      - How often does it appear on StackOverflow or in blogs?
   - Behavior after the break - How will a program that works today behave
     after the break? The following are listed roughly in order of
     increasing severity:
      - Will there be a compiler or linker error?
      - Will there be a runtime exception?
      - Will that exception happen after significant processing has been
        done?
      - Will we silently return different answers? (very hard to debug;
        users might not even notice!)


Cost of Maintaining an API

Of course, the above does not mean that we will never break any APIs. We
must also consider the cost both to the project and to our users of keeping
the API in question.

   - Project Costs - Every API we have needs to be tested and needs to
     keep working as other parts of the project change. These costs are
     significantly exacerbated when external dependencies change (the JVM,
     Scala, etc.). In some cases, while not technically infeasible, the
     cost of maintaining a particular API can become too high.
   - User Costs - APIs also have a cognitive cost to users learning Spark
     or trying to understand Spark programs. This cost becomes even higher
     when the API in question has confusing or undefined semantics.


Alternatives to Breaking an API

In cases where there is a "Bad API", but where the cost of removal is also
high, there are alternatives that should be considered that do not hurt
existing users but do address some of the maintenance costs.


   - Avoid Bad APIs - While this is a bit obvious, it is an important
     point. Any time we add a new interface to Spark we should consider
     that we might be stuck with this API forever. Think deeply about how
     new APIs relate to existing ones, as well as how you expect them to
     evolve over time.
   - Deprecation Warnings - All deprecation warnings should point to a
     clear alternative and should never just say that an API is deprecated
     (see the sketch after this list).
   - Updated Docs - Documentation should point to the "best" recommended
     way of performing a given task. In the cases where we maintain legacy
     documentation, we should clearly point to newer APIs and suggest to
     users the "right" way.
   - Community Work - Many people learn Spark by reading blogs and other
     sites such as StackOverflow. However, many of these resources are out
     of date. Updating them reduces the cost of eventually removing
     deprecated APIs.
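
As a concrete illustration of the deprecation point, here is a minimal
Scala sketch of a warning that names its replacement. It is modeled on
(but simplified from) how Spark's own Dataset deprecates
registerTempTable; the Catalog class here is hypothetical:

    class Catalog {
      // The message names the exact replacement and the version in which
      // the deprecation began, so callers know what to do without digging.
      @deprecated("Use createOrReplaceTempView(viewName) instead.", "2.0.0")
      def registerTempTable(tableName: String): Unit =
        createOrReplaceTempView(tableName)

      def createOrReplaceTempView(viewName: String): Unit =
        println(s"registered temp view '$viewName'") // stand-in for real work
    }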


Examples

Here are some examples of how I think the policy above could be applied to
different issues that have been discussed recently. These are only to
illustrate how to apply the above rubric, but are not intended to be part
of the official policy.

[SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow
multiple creation of SparkContexts #23311
<https://github.com/apache/spark/pull/23311>


   - Cost to Break - Multiple contexts in a single JVM never worked
     properly. When users tried it, they would nearly always report that
     Spark was broken (SPARK-2243
     <https://issues.apache.org/jira/browse/SPARK-2243>), due to the
     confusing set of log messages. Given this, I think it is very
     unlikely that there are many real-world use cases active today. Even
     those cases likely suffer from undiagnosed issues, as there are many
     areas of Spark that assume a single context per JVM.
   - Cost to Maintain - We have recently had users ask on the mailing list
     if this was supported, as the conf led them to believe it was, and
     the existence of this configuration as "supported" makes it harder to
     reason about certain global state in SparkContext.

Decision: Remove this configuration and related code.
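
For context, here is a minimal sketch of the Spark 2.x pattern this flag
nominally enabled. This is illustrative only; much of Spark assumed one
context per JVM, so the second context was never reliable:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("first")
      .set("spark.driver.allowMultipleContexts", "true") // suppressed the guard

    val sc1 = new SparkContext(conf)
    // In 2.x this line no longer threw, but the rest of Spark still assumed
    // a single context per JVM, so behavior here was never well defined.
    val sc2 = new SparkContext(conf.clone.setAppName("second"))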

[SPARK-25908] Remove registerTempTable #22921
<https://github.com/apache/spark/pull/22921/> (only looking at one API of
this PR)


   - Cost to Break - This is a wildly popular API of Spark SQL that has
     been there since the first release. There are tons of blog posts and
     examples that use this syntax if you google "dataframe
     registerTempTable"
     <https://www.google.com/search?q=dataframe+registertemptable&rlz=1C5CHFA_enUS746US746&oq=dataframe+registertemptable&aqs=chrome.0.0l8.3040j1j7&sourceid=chrome&ie=UTF-8>
     (even more than for the "correct" API, "dataframe
     createOrReplaceTempView"
     <https://www.google.com/search?rlz=1C5CHFA_enUS746US746&ei=TkZMXrj1ObzA0PEPpLKR2A4&q=dataframe+createorreplacetempview&oq=dataframe+createor&gs_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1......0....1..gws-wiz.......0i71j0i131.zP34wH1novM>).
     All of these will be invalid for users of Spark 3.0.
   - Cost to Maintain - This is just an alias, so there is not a lot of
     extra machinery required to keep the API. Users have two ways to do
     the same thing, but we can note in the docs that one is just an alias.


Decision: Do not remove this API; I would even consider un-deprecating it.
Anecdotally, several users I asked prefer this API over the "correct" one.

[SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195
<https://github.com/apache/spark/pull/24195>

   - Cost to Break - I think this case actually exemplifies several
     anti-patterns in breaking APIs. In some languages, the deprecation
     warning gives you no help other than the version in which the
     function will be removed. In R, it points users to a really deep
     conversation on the semantics of time in Spark SQL. None of the
     messages tell you how you should correctly parse a timestamp that is
     given to you in a format other than UTC. My guess is that all users
     will blindly flip the flag to true (to keep using these functions),
     so you've only succeeded in annoying them.
   - Cost to Maintain - These are two relatively isolated expressions, so
     there should be little cost to keeping them. Users can be confused by
     their semantics, so we probably should update the docs to point them
     to a best practice (I learned only by complaining on the PR that a
     good practice is to parse timestamps with the timezone included in
     the format expression, which naturally shifts them to UTC; see the
     sketch after the decision below).


Decision: Do not deprecate these two functions. We should update the docs
to talk about best practices for parsing timestamps, including how to
correctly shift them to UTC for storage.
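
Here is a minimal sketch of that best practice, assuming the input strings
carry an explicit offset (the pattern letters follow the JVM date-format
conventions Spark uses):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.to_timestamp

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // The offset is part of the format, so parsing yields a proper
    // TimestampType value, already normalized to UTC for storage; no
    // from_utc_timestamp call is needed.
    val df = Seq("2020-02-18 10:30:00 -08:00").toDF("raw")
    df.select(to_timestamp($"raw", "yyyy-MM-dd HH:mm:ss XXX").as("ts"))
      .show(false)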

[SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902
<https://github.com/apache/spark/pull/24902>


   - Cost to Break - The TRIM function takes two string parameters. If we
     switch the parameter order, queries that use the TRIM function will
     silently return different results on different versions of Spark.
     Users may not notice for a long time, and wrong query results can
     cause serious problems.
   - Cost to Maintain - We will have some inconsistency inside Spark, as
     the TRIM function in the Scala API and in SQL will have different
     parameter orders.


Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM
srcStr) syntax in our SQL docs, as it is the SQL standard. Deprecate (with
a warning, not by removal) the two-parameter SQL TRIM function and move
users to the SQL-standard TRIM syntax.
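
A quick sketch of the two spellings, run through spark.sql. The standard
form is unambiguous; the meaning of the two-parameter form is exactly the
part that varies by version, so treat its result as version-dependent:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // SQL-standard syntax: which argument is the trim string is explicit.
    spark.sql("SELECT TRIM(BOTH 'x' FROM 'xxSparkxx') AS t").show() // "Spark"

    // Function-call form: each position's meaning depends on the parameter
    // order a given Spark version chose -- the silent-wrong-results risk.
    spark.sql("SELECT TRIM('xxSparkxx', 'x') AS t").show()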

Thanks for taking the time to read this! Happy to discuss the specifics and
amend this policy as the community sees fit.

Michael
