[jira] [Commented] (SPARK-18006) When union, spark SQL didn't complain about schema mismatch

Shawn Zhang (JIRA) Wed, 19 Oct 2016 02:15:08 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-18006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588139#comment-15588139
 ]


Shawn Zhang commented on SPARK-18006:
-------------------------------------

ds1.printSchema() shows 
root
 |-- dateline: long (nullable = false)
 |-- uid: long (nullable = false)

ds2.printSchema() shows
root
 |-- uid: long (nullable = false)
 |-- dateline: long (nullable = false)

ds1 contains
+-------------+---+
|     dateline|uid|
+-------------+---+
|1476866468186|  1|
|1476866468186|  2|
+-------------+---+

while the result of union contains
+-------------+--------+
|          uid|dateline|
+-------------+--------+
|1476866468186|       1|
|1476866468186|       2|
+-------------+--------+

You see, uid has become dateline.

> When union, spark SQL didn't complain about schema mismatch
> -----------------------------------------------------------
>
>                 Key: SPARK-18006
>                 URL: https://issues.apache.org/jira/browse/SPARK-18006
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.0.1
>            Reporter: Shawn Zhang
>            Priority: Minor
>
> When union two Dataset<Row>, spark will check they have same number of 
> columns. But if the order of column is different, strange result will be 
> generated.
> The output of the following code shows that column have being switched by 
> Spark.
> ================= Code =============
> package test;
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import audit_spark.SparkConfig;
> public class SchemaBug {
>       public static class User {
>               
>               public User(long uid, long dateline) {
>                       this.uid = uid;
>                       this.dateline = dateline;
>               }
>               long uid;
>               long dateline;
>               public long getUid() {
>                       return uid;
>               }
>               public void setUid(long uid) {
>                       this.uid = uid;
>               }
>               public long getDateline() {
>                       return dateline;
>               }
>               public void setDateline(long dateline) {
>                       this.dateline = dateline;
>               }
>               
>       }
>       public static void main(String[] args) {
>       
>               SparkSession sparkSession = SparkSession
>                           .builder()
>                           .appName("test")
>                               .config("spark.sql.warehouse.dir", "file:///")
>                           .getOrCreate();
>               
>               
>               StructType userSchema2 = new StructType(new StructField[]{
>                               new StructField("uid", DataTypes.LongType, 
> false, Metadata.empty()),
>                               new StructField("dateline", DataTypes.LongType, 
> false, Metadata.empty()),
>                               
>                               });
>               
>               List userList = new ArrayList();
>               userList.add(new User(1, System.currentTimeMillis()));
>               userList.add(new User(2, System.currentTimeMillis()));
>               Dataset<Row> ds1 = 
> SparkConfig.sparkSession.createDataFrame(userList, User.class);
>               Dataset<Row> ds2 = SparkConfig.sparkSession.createDataFrame(new 
> ArrayList(), userSchema2);
>               ds2.union(ds1).show();
>       }
> }
> =========== Program Output ===============
> |          uid|dateline|
> |1476867071496|       1|
> |1476867071496|       2|
> =========== Expected Output ===============
> |       dateline   |uid|
> |1476867071496|       1|
> |1476867071496|       2|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-18006) When union, spark SQL didn't complain about schema mismatch

Reply via email to