Thank you, Josh and Xiao. That sounds great.

Do you think we can include some parts of that improvement in the `2.4.4`
documentation first, since that is the very next release?

Best,
Dongjoon.

On Sun, Jul 14, 2019 at 4:25 PM Xiao Li <lix...@databricks.com> wrote:

> Yeah, Josh! All these ideas sound good to me. All the top commercial
> database products have very detailed, easy-to-find guides documenting
> version upgrades.
>
> Currently, only the SQL and ML modules have migration or upgrade guides.
> Since the Spark 2.3 release, we have strictly required PR authors to
> document all behavior changes in the SQL component. I would suggest doing
> the same in the other modules, for example Spark Core and Structured
> Streaming. Any objections?
>
> Cheers,
>
> Xiao
>
>
>
> On Sun, Jul 14, 2019 at 2:05 PM Josh Rosen <rosenvi...@gmail.com> wrote:
>
>> I'd like to discuss the Spark SQL migration / upgrade guides in the Spark
>> documentation: these are valuable resources and I think we could increase
>> that value by making these docs easier to discover and by adding a bit more
>> structure to the existing content.
>>
>> For folks who aren't familiar with these docs: the Spark docs have a "SQL
>> Migration Guide" which lists the deprecations and changes of behavior in
>> each release:
>>
>>    - Latest published version:
>>    https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
>>    - Master branch version (will become 3.0):
>>    
>> https://github.com/apache/spark/blob/master/docs/sql-migration-guide-upgrade.md
>>
>> A lot of community work went into crafting this doc and I really
>> appreciate those efforts.
>>
>> This doc is a little hard to find, though, because it's not consistently
>> linked from release notes pages: the 2.4.0 page links it under "Changes of
>> Behavior" (
>> https://spark.apache.org/releases/spark-release-2-4-0.html#changes-of-behavior)
>> but subsequent maintenance releases do not link to it (
>> https://spark.apache.org/releases/spark-release-2-4-1.html). It's also
>> not very cross-linked from the rest of the Spark docs (e.g. the Overview
>> doc, doc drop-down menus, etc).
>>
>> I'm also concerned that the doc may be overwhelming to end users (as
>> opposed to Spark developers), for a few reasons:
>>
>>    - *Entries aren't grouped by component*, so users need to read the
>>    entire document to spot changes relevant to their use of Spark (for
>>    example, PySpark changes are not grouped together).
>>    - *Entries aren't ordered by impact / risk of change,* e.g.
>>    performance impact vs. loud behavior change (stopping with an explicit
>>    exception) vs. silent behavior change (e.g. changing default rounding
>>    behavior). If we assume limited reader attention, it may be important to
>>    prioritize the order in which we list entries, putting the
>>    highest-expected-impact / lowest-organic-discoverability changes first.
>>    - *We don't link JIRAs*, forcing users to do their own archaeology to
>>    learn more about a specific change.
>>
>> The existing ML migration guide addresses some of these issues, so maybe
>> we can emulate it in the SQL guide:
>> https://spark.apache.org/docs/latest/ml-guide.html#migration-guide
>>
>> I think that documentation clarity is especially important with Spark 3.0
>> around the corner: many folks will seek out this information when they
>> upgrade, so improving this guide can be a high-leverage, high-impact
>> activity.
>>
>> What do folks think? Does anyone have examples from other projects which
>> do a notably good job of crafting release notes / migration guides? I'd be
>> glad to help with pre-release editing after we decide on a structure and
>> style.
>>
>> Cheers,
>> Josh
>>
>
>
