+1

Xiao
Michael Armbrust <mich...@databricks.com> wrote on Mon, Feb 24, 2020 at 3:03 PM:

Hello Everyone,

As more users have started upgrading to the Spark 3.0 preview (including myself), there have been many discussions around APIs that have been broken compared with Spark 2.x. In many of these discussions, one of the rationales for breaking an API seems to be "Spark follows semantic versioning <https://spark.apache.org/versioning-policy.html>, so this major release is our chance to get it right [by breaking APIs]". Similarly, in many cases the response to questions about why an API was completely removed has been, "this API has been deprecated since x.x, so we have to remove it".

As a long-time contributor to and user of Spark, this interpretation of the policy is concerning to me. This reasoning misses the intention of the original policy, and I am worried that it will hurt the long-term success of the project.

I definitely understand that these are hard decisions, and I'm not proposing that we never remove anything from Spark. However, I would like to give some additional context and also propose a different rubric for thinking about API breakage moving forward.

Spark adopted semantic versioning back in 2014 during the preparations for the 1.0 release. As this was the first major release -- and as, up until fairly recently, Spark had only been an academic project -- no real promises about API stability had ever been made.

During the discussion, some committers suggested that this was an opportunity to clean up cruft and give the Spark APIs a once-over, making cosmetic changes to improve consistency. However, in the end, it was decided that in many cases it was not in the best interests of the Spark community to break things just because we could. Matei actually said it pretty forcefully <http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-for-Spark-Release-Strategy-td464i20.html#a503>:

"I know that some names are suboptimal, but I absolutely detest breaking APIs, config names, etc. I've seen it happen way too often in other projects (even things we depend on that are officially post-1.0, like Akka or Protobuf or Hadoop), and it's very painful. I think that we as fairly cutting-edge users are okay with libraries occasionally changing, but many others will consider it a show-stopper. Given this, I think that any cosmetic change now, even though it might improve clarity slightly, is not worth the tradeoff in terms of creating an update barrier for existing users."

In the end, while some changes were made, most APIs remained the same and users of Spark <= 0.9 were pretty easily able to upgrade to 1.0. I think this served the project very well, as compatibility means users are able to upgrade and we keep as many people on the latest versions of Spark (though maybe not the latest APIs of Spark) as possible.

As Spark grows, I think compatibility actually becomes more important and we should be more conservative rather than less. Today, there are very likely more Spark programs running than there were at any other time in the past. Spark is no longer a tool used only by advanced hackers; it is now also running "traditional enterprise workloads." In many cases these jobs are powering important processes long after the original author leaves.

Broken APIs can also affect libraries that extend Spark. This dependency can be even harder for users: if a library has not been upgraded to use the new APIs and they need that library, they are stuck.

Given all of this, I'd like to propose the following rubric as an addition to our semantic versioning policy. After discussion, and if people agree this is a good idea, I'll call a vote of the PMC to ratify its inclusion in the official policy.

Considerations When Breaking APIs

The Spark project strives to avoid breaking APIs or silently changing behavior, even at major versions. While this is not always possible, the balance of the following factors should be considered before choosing to break an API.

Cost of Breaking an API

Breaking an API almost always has a non-trivial cost to the users of Spark. A broken API means that Spark programs need to be rewritten before they can be upgraded. However, there are a few considerations when thinking about what the cost will be:

- Usage - an API that is actively used in many different places is always very costly to break. While it is hard to know usage for sure, there are a bunch of ways that we can estimate:
  - How long has the API been in Spark?
  - Is the API common even for basic programs?
  - How often do we see recent questions in JIRA or mailing lists?
  - How often does it appear in StackOverflow or blogs?
- Behavior after the break - How will a program that works today work after the break? The following are listed roughly in order of increasing severity:
  - Will there be a compiler or linker error?
  - Will there be a runtime exception?
  - Will that exception happen after significant processing has been done?
  - Will we silently return different answers? (very hard to debug, might not even notice!)

Cost of Maintaining an API

Of course, the above does not mean that we will never break any APIs. We must also consider the cost, both to the project and to our users, of keeping the API in question.

- Project Costs - Every API we have needs to be tested and needs to keep working as other parts of the project change. These costs are significantly exacerbated when external dependencies change (the JVM, Scala, etc.). In some cases, even when keeping a particular API remains technically feasible, the cost of maintaining it can become too high.
- User Costs - APIs also have a cognitive cost to users learning Spark or trying to understand Spark programs. This cost becomes even higher when the API in question has confusing or undefined semantics.

Alternatives to Breaking an API

In cases where there is a "Bad API", but where the cost of removal is also high, there are alternatives that should be considered that do not hurt existing users but do address some of the maintenance costs.

- Avoid Bad APIs - While this is a bit obvious, it is an important point. Anytime we are adding a new interface to Spark we should consider that we might be stuck with this API forever. Think deeply about how new APIs relate to existing ones, as well as how you expect them to evolve over time.
- Deprecation Warnings - All deprecation warnings should point to a clear alternative and should never just say that an API is deprecated (see the sketch after this list).
- Updated Docs - Documentation should point to the "best" recommended way of performing a given task. In the cases where we maintain legacy documentation, we should clearly point to newer APIs and suggest to users the "right" way.
- Community Work - Many people learn Spark by reading blogs and other sites such as StackOverflow. However, many of these resources are out of date. Update them to reduce the cost of eventually removing deprecated APIs.
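As a purely hypothetical sketch of the kind of deprecation warning described above (none of these names are real Spark APIs; they are invented for illustration), in plain Scala:

    // Hypothetical example of a deprecation message that names a concrete
    // replacement instead of only saying the method is deprecated.
    object LegacyFunctions {
      @deprecated(
        "Use parseDate(value, format) instead and state the expected date format explicitly",
        since = "3.0.0")
      def legacyParseDate(value: String): Long = parseDate(value, "yyyy-MM-dd")

      def parseDate(value: String, format: String): Long =
        java.time.LocalDate
          .parse(value, java.time.format.DateTimeFormatter.ofPattern(format))
          .toEpochDay
    }

The point is only that the message tells the user exactly what to call instead, so the warning is actionable rather than just alarming.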
Examples

Here are some examples of how I think the policy above could be applied to different issues that have been discussed recently. These are only meant to illustrate how to apply the above rubric and are not intended to be part of the official policy.

[SPARK-26362] Remove 'spark.driver.allowMultipleContexts' to disallow multiple creation of SparkContexts #23311 <https://github.com/apache/spark/pull/23311>

- Cost to Break - Multiple contexts in a single JVM never worked properly. When users tried it they would nearly always report that Spark was broken (SPARK-2243 <https://issues.apache.org/jira/browse/SPARK-2243>), due to the confusing set of log messages. Given this, I think it is very unlikely that there are many real-world use cases active today. Even those cases likely suffer from undiagnosed issues, as there are many areas of Spark that assume a single context per JVM.
- Cost to Maintain - We have recently had users ask on the mailing list if this was supported, as the conf led them to believe it was, and the existence of this configuration as "supported" makes it harder to reason about certain global state in SparkContext.

Decision: Remove this configuration and related code.

[SPARK-25908] Remove registerTempTable #22921 <https://github.com/apache/spark/pull/22921/> (only looking at one API of this PR)

- Cost to Break - This is a wildly popular API of Spark SQL that has been there since the first release. There are tons of blog posts and examples that use this syntax if you google "dataframe registerTempTable <https://www.google.com/search?q=dataframe+registertemptable&rlz=1C5CHFA_enUS746US746&oq=dataframe+registertemptable&aqs=chrome.0.0l8.3040j1j7&sourceid=chrome&ie=UTF-8>" (even more than the "correct" API "dataframe createOrReplaceTempView <https://www.google.com/search?rlz=1C5CHFA_enUS746US746&ei=TkZMXrj1ObzA0PEPpLKR2A4&q=dataframe+createorreplacetempview&oq=dataframe+createor&gs_l=psy-ab.3.0.0j0i22i30l7.663.1303..2750...0.3..1.212.782.7j0j1......0....1..gws-wiz.......0i71j0i131.zP34wH1novM>"). All of these will be invalid for users of Spark 3.0.
- Cost to Maintain - This is just an alias, so there is not a lot of extra machinery required to keep the API (see the short sketch below). Users have two ways to do the same thing, but we can note that this is just an alias in the docs.

Decision: Do not remove this API; I would even consider un-deprecating it. I anecdotally asked several users and this is the API they prefer over the "correct" one.
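To illustrate the "just an alias" point, a minimal sketch (assuming a local SparkSession; both methods are existing Spark APIs, one deprecated, one not):

    import org.apache.spark.sql.SparkSession

    object TempViewAliasDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("temp-view-alias-demo")
          .getOrCreate()

        val df = spark.range(3).toDF("id")

        // The deprecated name and its replacement register equivalent temporary
        // views; registerTempTable simply delegates to createOrReplaceTempView.
        df.registerTempTable("ids_old")
        df.createOrReplaceTempView("ids_new")

        spark.sql("SELECT COUNT(*) FROM ids_old").show()
        spark.sql("SELECT COUNT(*) FROM ids_new").show()

        spark.stop()
      }
    }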
[SPARK-25496] Deprecate from_utc_timestamp and to_utc_timestamp #24195 <https://github.com/apache/spark/pull/24195>

- Cost to Break - I think this case actually exemplifies several anti-patterns in breaking APIs. In some languages, the deprecation warning gives you no help other than the version in which the function was removed. In R, it points users to a really deep conversation on the semantics of time in Spark SQL. None of the messages tell you how you should correctly be parsing a timestamp that is given to you in a format other than UTC. My guess is that all users will blindly flip the flag to true (to keep using this function), so you've only succeeded in annoying them.
- Cost to Maintain - These are two relatively isolated expressions; there should be little cost to keeping them. Users can be confused by their semantics, so we probably should update the docs to point them to a best practice (I learned, only by complaining on the PR, that a good practice is to parse timestamps with the timezone included in the format expression, which naturally shifts them to UTC).

Decision: Do not deprecate these two functions. We should update the docs to talk about best practices for parsing timestamps, including how to correctly shift them to UTC for storage (the sketch below illustrates the format-pattern approach).
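A minimal sketch of that best practice (my own illustration, not part of the proposal): the input string carries an explicit offset, the offset is included in the format pattern, and the session time zone is set to UTC, so the parsed value lands in UTC without from_utc_timestamp.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.to_timestamp

    object ParseWithTimezoneDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("tz-parse-demo")
          .getOrCreate()
        import spark.implicits._

        // Display/store timestamps in UTC rather than the JVM default zone.
        spark.conf.set("spark.sql.session.timeZone", "UTC")

        val raw = Seq("2020-02-24 15:03:00 -0800").toDF("ts_string")

        // The XX pattern letter consumes the zone offset, so the value is
        // shifted during parsing instead of being "fixed up" afterwards with
        // from_utc_timestamp/to_utc_timestamp.
        val parsed = raw.withColumn(
          "ts_utc", to_timestamp($"ts_string", "yyyy-MM-dd HH:mm:ss XX"))
        parsed.show(truncate = false) // 2020-02-24 23:03:00 in UTC

        spark.stop()
      }
    }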
[SPARK-28093] Fix TRIM/LTRIM/RTRIM function parameter order issue #24902 <https://github.com/apache/spark/pull/24902>

- Cost to Break - The TRIM function takes two string parameters. If we switch the parameter order, queries that use the TRIM function would silently get different results on different versions of Spark. Users may not notice it for a long time, and wrong query results may cause serious problems for users.
- Cost to Maintain - We will have some inconsistency inside Spark, as the TRIM function in the Scala API and in SQL have different parameter orders.

Decision: Do not switch the parameter order. Promote the TRIM(trimStr FROM srcStr) syntax in our SQL docs, as it is the SQL standard. Deprecate (with a warning, not by removing) the two-parameter SQL TRIM function and move users to the SQL-standard TRIM syntax.

Thanks for taking the time to read this! Happy to discuss the specifics and amend this policy as the community sees fit.

Michael
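To make the TRIM decision above concrete, a small illustrative sketch (not part of the proposal) of the SQL-standard syntax it recommends promoting; the FROM keyword makes the roles of the two strings explicit, so there is no parameter order to get wrong:

    import org.apache.spark.sql.SparkSession

    object TrimSyntaxDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("trim-syntax-demo")
          .getOrCreate()

        // Plain TRIM strips whitespace from both ends.
        spark.sql("SELECT TRIM('  hello  ') AS t1").show()

        // SQL-standard syntax: TRIM([BOTH|LEADING|TRAILING] trimStr FROM srcStr).
        // The keyword states which argument is the trim string and which is the
        // source string, so the ambiguity discussed in SPARK-28093 never arises.
        spark.sql("SELECT TRIM('x' FROM 'xxhelloxx') AS t2").show()
        spark.sql("SELECT TRIM(LEADING 'x' FROM 'xxhelloxx') AS t3").show()

        spark.stop()
      }
    }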