[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema

2017-02-20 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875137#comment-15875137
 ] 

Nick Dimiduk commented on SPARK-19615:
--

Thanks for taking a look [~hyukjin.kwon]. These three bugs are indeed issues: 
in all cases, it seems Spark was not careful to map column names to the 
appropriate column from each side of the union. My experience with unions on 
1.6.3 and 2.1.0 has been much better. That said, I still see echoes of 
SPARK-9874 / SPARK-9813 when I extend one side or the other with null columns. 
I can file that as a separate issue if that's of interest to you.

As for what an RDBMS may or may not do, I'm not very aware or concerned. I'm 
thinking more about ease of use for a user. This is why I suggest a separate 
union method that would encapsulate this behavior. Parsed Spark SQL can 
exhibit whatever semantics the community deems appropriate, while users of 
the API still get access to this convenient functionality. I've implemented 
this logic in my application and it's quite complex, so it would be very 
good for Spark to provide this for its users.

> Provide Dataset union convenience for divergent schema
> --
>
> Key: SPARK-19615
> URL: https://issues.apache.org/jira/browse/SPARK-19615
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Nick Dimiduk
>
> Creating a union DataFrame over two sources that have different schema 
> definitions is surprisingly complex. Provide a version of the union method 
> that will infer a target schema as the result of merging the 
> sources. Automatically extend either side with {{null}} columns for any 
> missing columns that are nullable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema

2017-02-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873189#comment-15873189
 ] 

Hyukjin Kwon commented on SPARK-19615:
--

Let me leave some loosely related JIRAs: SPARK-9813, SPARK-9874, and SPARK-15918.




[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema

2017-02-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873188#comment-15873188
 ] 

Hyukjin Kwon commented on SPARK-19615:
--

As I remember, I checked the UNION operation in other DBMSes, and the current 
behaviour is correct and compliant. Could you check and leave references to 
other DBMSes, please?




[jira] [Commented] (SPARK-19615) Provide Dataset union convenience for divergent schema

2017-02-16 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870513#comment-15870513
 ] 

Nick Dimiduk commented on SPARK-19615:
--

IMHO, a union operation should be as generous as possible. This facilitates 
common ETL and data cleansing operations where the sources are sparse-schema 
structures (JSON, HBase, Elastic Search, ...). Here are a couple of examples 
of what I mean.

Given dataframes of type
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: string (nullable = true)
{noformat}
and
{noformat}
root
 |-- a: string (nullable = false)
 |-- c: string (nullable = true)
{noformat}
I would expect the union operation to infer the nullable columns from both 
sides to produce a dataframe of type
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: string (nullable = true)
 |-- c: string (nullable = true)
{noformat}
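
A minimal sketch of this flat case, written in plain Python rather than against the Spark API. Rows are modeled as dicts and the name {{padded_union}} is hypothetical; it only illustrates matching columns by name and padding the missing ones with null ({{None}}).

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a row is a plain dict of column name -> value (None stands for null).
Row = Dict[str, Optional[str]]


def padded_union(
    left_rows: List[Row],
    left_cols: List[str],
    right_rows: List[Row],
    right_cols: List[str],
) -> Tuple[List[str], List[Row]]:
    """Union two row sets by column name, padding missing columns with None."""
    # Target columns: left-side order first, then columns only the right side has.
    cols = list(left_cols) + [c for c in right_cols if c not in left_cols]
    # dict.get returns None for a column the row does not carry.
    rows = [{c: row.get(c) for c in cols} for row in left_rows + right_rows]
    return cols, rows


cols, rows = padded_union(
    [{"a": "1", "b": "x"}], ["a", "b"],
    [{"a": "2", "c": "y"}], ["a", "c"],
)
print(cols)
for row in rows:
    print(row)
```

The sketch ignores nullability and type checks; it only shows the name-based alignment that a positional union does not perform.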

This should work on an arbitrarily deep nesting of structs, so

{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b1: string (nullable = true)
 |    |-- b2: string (nullable = true)
{noformat}
unioned with
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b3: string (nullable = true)
 |    |-- b4: string (nullable = true)
{noformat}
would result in
{noformat}
root
 |-- a: string (nullable = false)
 |-- b: struct (nullable = false)
 |    |-- b1: string (nullable = true)
 |    |-- b2: string (nullable = true)
 |    |-- b3: string (nullable = true)
 |    |-- b4: string (nullable = true)
{noformat}
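
The nested merge could be sketched roughly as follows. This is plain Python, not Spark's {{StructType}}: a struct is modeled as a dict of field name to (type, nullable), and {{merge_schemas}} is a hypothetical helper. Rejecting a missing non-nullable field is an assumption taken from the issue description, which only pads nullable columns.

```python
from typing import Dict, Tuple, Union

# Assumption: a struct is a dict of field name -> (data_type, nullable), where
# data_type is either a type-name string or a nested struct dict.
DataType = Union[str, dict]
Struct = Dict[str, Tuple[DataType, bool]]


def merge_schemas(left: Struct, right: Struct) -> Struct:
    """Recursively merge two struct schemas by field name."""
    merged: Struct = {}
    # Keep left-side field order, then append right-only fields.
    names = list(left) + [n for n in right if n not in left]
    for name in names:
        if name in left and name in right:
            (ltype, lnull), (rtype, rnull) = left[name], right[name]
            if isinstance(ltype, dict) and isinstance(rtype, dict):
                # Both sides are structs: merge their fields recursively.
                merged[name] = (merge_schemas(ltype, rtype), lnull or rnull)
            elif ltype == rtype:
                merged[name] = (ltype, lnull or rnull)
            else:
                raise TypeError(f"conflicting types for field {name!r}")
        else:
            dtype, nullable = left[name] if name in left else right[name]
            if not nullable:
                # Only nullable fields can be padded with null on the other side.
                raise ValueError(f"field {name!r} is missing on one side and not nullable")
            merged[name] = (dtype, True)
    return merged


left = {
    "a": ("string", False),
    "b": ({"b1": ("string", True), "b2": ("string", True)}, False),
}
right = {
    "a": ("string", False),
    "b": ({"b3": ("string", True), "b4": ("string", True)}, False),
}
print(merge_schemas(left, right))
```

With the two example schemas above, the merged {{b}} struct carries {{b1}} through {{b4}} and stays non-nullable, while {{a}} is unchanged.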

> Provide Dataset union convenience for divergent schema
> --
>
> Key: SPARK-19615
> URL: https://issues.apache.org/jira/browse/SPARK-19615
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Nick Dimiduk
> Priority: Minor
>
> Creating a union DataFrame over two sources that have different schema 
> definitions is surprisingly complex. Provide a version of the union method 
> that will infer a target schema as the result of merging the 
> sources. Automatically extend either side with {{null}} columns for any 
> missing columns that are nullable.


