Yeah, Josh! All these ideas sound good to me. All the top commercial database products have very detailed guides and documents about version upgrades. You can easily find them.
Currently, only the SQL and ML modules have migration or upgrade guides. Since the Spark 2.3 release, we have strictly required PR authors to document all behavior changes in the SQL component. I would suggest doing the same in the other modules, for example Spark Core and Structured Streaming. Any objection?

Cheers,
Xiao

On Sun, Jul 14, 2019 at 2:05 PM Josh Rosen <rosenvi...@gmail.com> wrote:

> I'd like to discuss the Spark SQL migration / upgrade guides in the Spark
> documentation: these are valuable resources and I think we could increase
> that value by making these docs easier to discover and by adding a bit more
> structure to the existing content.
>
> For folks who aren't familiar with these docs: the Spark docs have a "SQL
> Migration Guide" which lists the deprecations and changes of behavior in
> each release:
>
> - Latest published version:
>   https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
> - Master branch version (will become 3.0):
>   https://github.com/apache/spark/blob/master/docs/sql-migration-guide-upgrade.md
>
> A lot of community work went into crafting this doc and I really
> appreciate those efforts.
>
> This doc is a little hard to find, though, because it's not consistently
> linked from release notes pages: the 2.4.0 page links it under "Changes of
> Behavior"
> (https://spark.apache.org/releases/spark-release-2-4-0.html#changes-of-behavior),
> but subsequent maintenance releases do not link to it
> (https://spark.apache.org/releases/spark-release-2-4-1.html). It's also
> not very cross-linked from the rest of the Spark docs (e.g. the Overview
> doc, doc drop-down menus, etc.).
>
> I'm also concerned that the doc may be overwhelming to end users (as
> opposed to Spark developers):
>
> - *Entries aren't grouped by component*, so users need to read the
>   entire document to spot changes relevant to their use of Spark (for
>   example, PySpark changes are not grouped together).
> - *Entries aren't ordered by size / risk of change*, e.g. performance
>   impact vs. loud behavior change (stopping with an explicit exception) vs.
>   silent behavior change (e.g. changing default rounding behavior). If we
>   assume limited reader attention then it may be important to prioritize the
>   order in which we list entries, putting the highest-expected-impact /
>   lowest-organic-discoverability changes first.
> - *We don't link JIRAs*, forcing users to do their own archaeology to
>   learn more about a specific change.
>
> The existing ML migration guide addresses some of these issues, so maybe
> we can emulate it in the SQL guide:
> https://spark.apache.org/docs/latest/ml-guide.html#migration-guide
>
> I think that documentation clarity is especially important with Spark 3.0
> around the corner: many folks will seek out this information when they
> upgrade, so improving this guide can be a high-leverage, high-impact
> activity.
>
> What do folks think? Does anyone have examples from other projects which
> do a notably good job of crafting release notes / migration guides? I'd be
> glad to help with pre-release editing after we decide on a structure and
> style.
>
> Cheers,
> Josh