[jira] [Comment Edited] (SPARK-21402) Java encoders - switch fields on collectAsList

Paul Praet (JIRA) Thu, 10 Aug 2017 02:04:42 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121297#comment-16121297
 ]


Paul Praet edited comment on SPARK-21402 at 8/10/17 9:03 AM:
-------------------------------------------------------------

I can confirm this problem persists in Spark 2.2.0: fields get swapped when 
you use the bean encoder on a dataset with an array of structs. A plain struct 
works; an array of structs does not. Pretty big issue if you ask me.
I have a data model like this (all strings):

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- id: string (nullable = false)
 |-- type: string (nullable = false)
 |-- ssid: string (nullable = false)
+------------+------+--------+--------+
|writeKey    |id    |type    |ssid    |
+------------+------+--------+--------+
|someWriteKey|someId|someType|someSSID|
+------------+------+--------+--------+
{noformat}

When I convert into a struct, everything is still fine:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: struct (nullable = false)
 |    |-- id: string (nullable = false)
 |    |-- type: string (nullable = false)
 |    |-- ssid: string (nullable = false)

+------------+--------------------------+
|writeKey    |nodes                     |
+------------+--------------------------+
|someWriteKey|[someId,someType,someSSID]|
+------------+--------------------------+
{noformat}

When I do a groupBy on writeKey and a collect_set() on the nodes, we get:

{noformat}
root
 |-- writeKey: string (nullable = false)
 |-- nodes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = false)
 |    |    |-- type: string (nullable = false)
 |    |    |-- ssid: string (nullable = false)

+------------+----------------------------+
|writeKey    |nodes                       |
+------------+----------------------------+
|someWriteKey|[[someId,someType,someSSID]]|
+------------+----------------------------+
{noformat}

When I convert this to a typed Dataset in Java with the bean encoder...

{code:java}
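        // Group by writeKey and collect the node structs into an array column.
        Dataset<Row> dfArray = dfStruct.groupBy("writeKey")
                .agg(functions.collect_set("nodes").alias("nodes"));
        // Map each row onto the Topology bean (POJO classes are listed below).
        Encoder<Topology> topologyEncoder = Encoders.bean(Topology.class);
        Dataset<Topology> datasetMultiple = dfArray.as(topologyEncoder);
        System.out.println(datasetMultiple.first());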
{code}
This prints:

{noformat}
Topology{writeKey='someWriteKey', nodes=[Node{id='someId', type='someSSID', ssid='someType'}]}
{noformat}
The type and ssid values are swapped: type now holds 'someSSID' and ssid holds 
'someType', while id is still correct.
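
The output above is what you would get if the nested struct were read by position instead of by name. That is only a guess at the cause and is not confirmed anywhere in this thread. The sketch below is plain Java, not Spark code, and the class and method names are made up for illustration. It assumes the bean's properties are taken in alphabetical order (id, ssid, type) while the struct fields are in schema order (id, type, ssid).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT Spark's encoder implementation.
class FieldBindingSketch {

    // Bind by position: the i-th struct value goes to the i-th target property.
    static Map<String, String> bindByPosition(List<String> values, List<String> targetNames) {
        Map<String, String> bound = new LinkedHashMap<>();
        for (int i = 0; i < targetNames.size(); i++) {
            bound.put(targetNames.get(i), values.get(i));
        }
        return bound;
    }

    // Bind by name: look up where each target property sits in the source schema.
    static Map<String, String> bindByName(List<String> sourceNames, List<String> values,
                                          List<String> targetNames) {
        Map<String, String> bound = new LinkedHashMap<>();
        for (String name : targetNames) {
            bound.put(name, values.get(sourceNames.indexOf(name)));
        }
        return bound;
    }

    public static void main(String[] args) {
        // Struct field order and values as shown in the schema and table above.
        List<String> structFields = Arrays.asList("id", "type", "ssid");
        List<String> structValues = Arrays.asList("someId", "someType", "someSSID");

        // Assumption: the bean side sees its properties alphabetically.
        List<String> beanProps = new ArrayList<>(structFields);
        Collections.sort(beanProps); // id, ssid, type

        // Positional binding reproduces the swap: type=someSSID, ssid=someType.
        System.out.println(bindByPosition(structValues, beanProps));
        // Name-based binding gives the expected values.
        System.out.println(bindByName(structFields, structValues, beanProps));
    }
}
```

Positional binding reproduces exactly the swapped Node from the printed Topology, while binding by name does not, so the symptom at least fits an ordering mismatch.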

POJO classes:
{code:java}
    public static class Topology {
        private String writeKey;
        private List<Node> nodes;

        public Topology() {
        }

        public String getWriteKey() {
            return writeKey;
        }

        public void setWriteKey(String writeKey) {
            this.writeKey = writeKey;
        }

        public List<Node> getNodes() {
            return nodes;
        }

        public void setNodes(List<Node> nodes) {
            this.nodes = nodes;
        }

        @Override
        public String toString() {
            return "Topology{" +
                    "writeKey='" + writeKey + '\'' +
                    ", nodes=" + nodes +
                    '}';
        }
    }

    public static class Node {
        private String id;
        private String type;
        private String ssid;

        public Node() {
        }

        public String getId() {
            return id;
        }

        public void setId(String id) {
            this.id = id;
        }

        public String getType() {
            return type;
        }

        public void setType(String type) {
            this.type = type;
        }

        public String getSsid() {
            return ssid;
        }

        public void setSsid(String ssid) {
            this.ssid = ssid;
        }


        @Override
        public String toString() {
            return "Node{" +
                    "id='" + id + '\'' +
                    ", type='" + type + '\'' +
                    ", ssid='" + ssid + '\'' +
                    '}';
        }
    }
{code}




> Java encoders - switch fields on collectAsList
> ----------------------------------------------
>
>                 Key: SPARK-21402
>                 URL: https://issues.apache.org/jira/browse/SPARK-21402
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>         Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>            Reporter: Tom
>
> I have the following schema in a dataset -
> {noformat}
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  |    |-- key: string
>  |    |-- value: struct (valueContainsNull = true)
>  |    |    |-- startTime: long (nullable = true)
>  |    |    |-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
> {noformat}
> And I have the following classes (plus setters and getters, which I omitted 
> for simplicity) -
>  
> {code:java}
> public class MyClass {
>     private String userId;
>     private Map<String, MyDTO> data;
>     private Long offset;
> }
> public class MyDTO {
>     private long startTime;
>     private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
>         Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
>         Dataset<MyClass> results = raw_df.as(myClassEncoder);
>         List<MyClass> lst = results.collectAsList();
> {code}
>         
> I do several calculations to get the result I want, and the result is correct 
> all the way through, up until I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
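> {noformat}
> +--------------------------+------------------------+
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |1498854000                |1498870800              |
> +--------------------------+------------------------+
> {noformat}
> This is the result after collecting the results for - 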
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a Spark issue. Would love any suggestions on how to 
> work around it.
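
To check for this kind of mismatch, compare the order of the struct fields in the schema with the alphabetical order of the bean's properties. The sketch below uses the standard java.beans.Introspector on a class shaped like MyDTO from the report. The getters and setters are filled in here because the report omitted them, and the class and method names are hypothetical. Whether Spark's bean encoder really binds nested fields in this order is an assumption, not something this thread confirms.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java helper for inspecting a bean; it does not touch Spark.
class BeanPropertyOrderSketch {

    // Same fields as MyDTO in the report, with the omitted accessors filled in.
    public static class MyDTO {
        private long startTime;
        private long endTime;

        public long getStartTime() { return startTime; }
        public void setStartTime(long startTime) { this.startTime = startTime; }
        public long getEndTime() { return endTime; }
        public void setEndTime(long endTime) { this.endTime = endTime; }
    }

    // Returns the bean's property names in alphabetical order.
    static List<String> sortedPropertyNames(Class<?> beanClass) {
        List<String> names = new ArrayList<>();
        try {
            // Object.class as the stop class leaves out the inherited "class" property.
            for (PropertyDescriptor pd :
                    Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors()) {
                names.add(pd.getName());
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Could not introspect " + beanClass, e);
        }
        // Sort explicitly so the result does not depend on the introspector's own ordering.
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Schema order in the report is startTime, endTime; alphabetical order differs.
        System.out.println(sortedPropertyNames(MyDTO.class)); // prints [endTime, startTime]
    }
}
```

Alphabetically, endTime comes before startTime, which is the reverse of the schema order, so this would fit the swapped values in the report. If ordering is the cause, one possible workaround (untested here) is to build the nested struct with its fields in that alphabetical order before applying the encoder.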



