Re: Spark SQL upgrade / migration guide: discoverability and content organization

2019-07-14 Thread Jungtaek Lim
As one of the contributors to Structured Streaming, I would vote for having a
migration guide doc for Structured Streaming as well, once we decide on a
standard format for the migration guide.

In Spark 3.0.0 there are some breaking changes even in the SS area - one example
is SPARK-28199, for which Sean took care of leaving a release note, but a
migration guide would better help users moving from 2.4.x to 3.0.x, since the
release note is bound only to 3.0.0.

-Jungtaek Lim (HeartSaVioR)



-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior


Re: Spark SQL upgrade / migration guide: discoverability and content organization

2019-07-14 Thread Dongjoon Hyun
Thank you, Josh and Xiao. That sounds great.

Do you think we can get some parts of that improvement into the `2.4.4`
documentation first, since that is the very next release?

Bests,
Dongjoon.



Re: JDBC connector for DataSourceV2

2019-07-14 Thread Xianyin Xin
There's another PR, https://github.com/apache/spark/pull/21861, but it is based
on the old V2 APIs.

 

We'd better link the JIRAs (SPARK-24907, SPARK-25547, and SPARK-28380) and
finalize a plan.

 

Xianyin

 

From: Shiv Prashant Sood 
Date: Sunday, July 14, 2019 at 2:59 AM
To: Gabor Somogyi 
Cc: Xianyin Xin , Ryan Blue , 
, Spark Dev List 
Subject: Re: JDBC connector for DataSourceV2

 

To me this looks like a refactoring of the DS1 JDBC source to enable
user-provided connection factories. In itself a good change, but IMO not
DSV2-related.

 

I created a JIRA and added some goals. Please comment/add as relevant.

 

https://issues.apache.org/jira/browse/SPARK-28380

 

JIRA for a DataSourceV2 API based JDBC connector.

Goals:
- A generic connector based on JDBC that supports all databases (the min bar is
  support for all V1 databases; see the sketch below).
- A reference implementation and interface for any specialized JDBC connectors.
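
For reference, a minimal sketch of the existing V1 JDBC read/write usage that
sets that min bar (assuming an existing SparkSession `spark`; the URL, table,
and credentials below are placeholders, not real endpoints):

    // Read a table through the existing (V1) JDBC source.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   // placeholder URL
      .option("dbtable", "schema.some_table")            // placeholder table
      .option("user", "username")
      .option("password", "password")
      .load()

    // Write the result back through the same source.
    df.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")
      .option("dbtable", "schema.some_table_copy")
      .option("user", "username")
      .option("password", "password")
      .mode("append")
      .save()

A V2 connector would presumably need to accept the same options so that
existing jobs keep working.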
 

Regards,

Shiv

 

On Sat, Jul 13, 2019 at 2:17 AM Gabor Somogyi  wrote:

Hi Guys,

 

I don't know what the exact intention is here, but there is such a PR:
https://github.com/apache/spark/pull/22560

If that's what we need, maybe we can resurrect it. BTW, I'm also interested in...

 

BR,

G

 

 

On Sat, Jul 13, 2019 at 4:09 AM Shiv Prashant Sood  
wrote:

Thanks all. I can also contribute toward this effort.

 

Regards,

Shiv

Sent from my iPhone


On Jul 12, 2019, at 6:51 PM, Xianyin Xin  wrote:

If there’s nobody working on that, I’d like to contribute. 

 

Loop in @Gengliang Wang.

 

Xianyin

 

From: Ryan Blue 
Reply-To: 
Date: Saturday, July 13, 2019 at 6:54 AM
To: Shiv Prashant Sood 
Cc: Spark Dev List 
Subject: Re: JDBC connector for DataSourceV2

 

I'm not aware of a JDBC connector effort. It would be great to have someone 
build one!

 

On Fri, Jul 12, 2019 at 3:33 PM Shiv Prashant Sood  
wrote:

Can someone please help me understand the current status of a DataSourceV2-based
JDBC connector? I see connectors for various file formats in master, but can't
find a JDBC implementation or related JIRA.

 

The DataSourceV2 APIs look to me to be in good shape to attempt a JDBC connector
for the read/write path.
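
For illustration only, a very rough skeleton of what the read path could look
like, assuming the DSv2 interfaces as they eventually landed in Spark 3.0
(org.apache.spark.sql.connector.*) rather than the exact interfaces on master
at the time; the class names are made up and the JDBC plumbing is left
unimplemented:

    import java.util

    import scala.collection.JavaConverters._

    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReaderFactory, Scan, ScanBuilder}
    import org.apache.spark.sql.types.StructType
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    // Hypothetical entry point for a DSv2 JDBC source.
    class JdbcV2Provider extends TableProvider {
      // A real connector would infer this from the remote table's JDBC metadata.
      override def inferSchema(options: CaseInsensitiveStringMap): StructType = ???

      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table =
        new JdbcV2Table(schema, properties.asScala.toMap)
    }

    class JdbcV2Table(tableSchema: StructType, options: Map[String, String])
        extends Table with SupportsRead {
      override def name(): String = options.getOrElse("dbtable", "jdbc_table")
      override def schema(): StructType = tableSchema
      override def capabilities(): util.Set[TableCapability] =
        Set(TableCapability.BATCH_READ).asJava

      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
        new ScanBuilder {
          override def build(): Scan = new Scan {
            override def readSchema(): StructType = tableSchema
            override def toBatch(): Batch = new Batch {
              // One InputPartition per JDBC partition predicate / split.
              override def planInputPartitions(): Array[InputPartition] = ???
              override def createReaderFactory(): PartitionReaderFactory = ???
            }
          }
        }
    }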

Thanks & Regards,

Shiv


 

-- 

Ryan Blue

Software Engineer

Netflix



Re: Spark SQL upgrade / migration guide: discoverability and content organization

2019-07-14 Thread Xiao Li
Yeah, Josh! All these ideas sound good to me. All the top commercial database
products have very detailed guides/documents about version upgrading. You can
easily find them.

Currently, only the SQL and ML modules have migration or upgrade guides. Since
the Spark 2.3 release, we have strictly required PR authors to document all
behavior changes in the SQL component. I would suggest doing the same thing in
the other modules, for example Spark Core and Structured Streaming. Any
objection?

Cheers,

Xiao





Spark SQL upgrade / migration guide: discoverability and content organization

2019-07-14 Thread Josh Rosen
I'd like to discuss the Spark SQL migration / upgrade guides in the Spark
documentation: these are valuable resources and I think we could increase
that value by making these docs easier to discover and by adding a bit more
structure to the existing content.

For folks who aren't familiar with these docs: the Spark docs have a "SQL
Migration Guide" which lists the deprecations and changes of behavior in
each release:

   - Latest published version:
   https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
   - Master branch version (will become 3.0):
   
https://github.com/apache/spark/blob/master/docs/sql-migration-guide-upgrade.md

A lot of community work went into crafting this doc and I really appreciate
those efforts.

This doc is a little hard to find, though, because it's not consistently
linked from release notes pages: the 2.4.0 page links it under "Changes of
Behavior" (
https://spark.apache.org/releases/spark-release-2-4-0.html#changes-of-behavior)
but subsequent maintenance releases do not link to it (
https://spark.apache.org/releases/spark-release-2-4-1.html). It's also not
very cross-linked from the rest of the Spark docs (e.g. the Overview doc,
doc drop-down menus, etc).

I'm also concerned that the doc may be overwhelming to end users (as
opposed to Spark developers):

   - *Entries aren't grouped by component*, so users need to read the
   entire document to spot changes relevant to their use of Spark (for
   example, PySpark changes are not grouped together).
   - *Entries aren't ordered by size / risk of change,* e.g. performance
   impact vs. loud behavior change (stopping with an explicit exception) vs.
   silent behavior changes (e.g. changing default rounding behavior). If we
   assume limited reader attention then it may be important to prioritize the
   order in which we list entries, putting the highest-expected-impact /
   lowest-organic-discoverability changes first.
   - *We don't link JIRAs*, forcing users to do their own archaeology to
   learn more about a specific change.

The existing ML migration guide addresses some of these issues, so maybe we
can emulate it in the SQL guide:
https://spark.apache.org/docs/latest/ml-guide.html#migration-guide
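
To make that concrete, a restructured entry might look roughly like the
following. This is a purely illustrative sketch: the JIRA ids are placeholders
and the wording is invented, not taken from the current guide.

    ## Upgrading from Spark SQL 2.4 to 3.0

    ### PySpark
    - [SPARK-NNNNN] (silent behavior change) Since Spark 3.0, <what changed,
      what the old behavior was, and which legacy config restores it>.

    ### SQL
    - [SPARK-NNNNN] (loud behavior change) Since Spark 3.0, <operation> fails
      with an explicit exception instead of <old behavior>.

Grouping entries by component and leading each one with a JIRA link and a rough
risk label would address all three concerns above.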

I think that documentation clarity is especially important with Spark 3.0
around the corner: many folks will seek out this information when they
upgrade, so improving this guide can be a high-leverage, high-impact
activity.

What do folks think? Does anyone have examples from other projects which do
a notably good job of crafting release notes / migration guides? I'd be
glad to help with pre-release editing after we decide on a structure and
style.

Cheers,
Josh