Ofer Mendelevitch created SPARK-10892:
-----------------------------------------

             Summary: Join with Data Frame returns wrong results
                 Key: SPARK-10892
                 URL: https://issues.apache.org/jira/browse/SPARK-10892
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.0, 1.4.1
            Reporter: Ofer Mendelevitch
            Priority: Critical


I'm attaching a simplified reproducible example of the problem:
1. Loading a JSON file from HDFS as a Data Frame
2. Creating 3 data frames: PRCP, TMIN, TMAX
3. Joining the data frames together. Each of those has a column "value" with 
the same name, so renaming them after the join.
4. The output seems incorrect; the first column has the correct values, but the 
two other columns seem to have a copy of the values from the first column.

Here's the sample code:

import org.apache.spark.sql._
val sqlc = new SQLContext(sc)

val weather = sqlc.read.format("json").load("data.json")
val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()

prcp.filter("year=2012 and month=10").show()
tmin.filter("year=2012 and month=10").show()
tmax.filter("year=2012 and month=10").show()

val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
          .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
            prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
            tmax("value").alias("TMAX")) )
out.filter("year=2012 and month=10").show()

The output is:

+--------+---+------+-----+-----------+-----+----+
|date_str|day|metric|month|    station|value|year|
+--------+---+------+-----+-----------+-----+----+
|20121001|  1|  PRCP|   10|USW00023272|    0|2012|
|20121002|  2|  PRCP|   10|USW00023272|    0|2012|
|20121003|  3|  PRCP|   10|USW00023272|    0|2012|
|20121004|  4|  PRCP|   10|USW00023272|    0|2012|
|20121005|  5|  PRCP|   10|USW00023272|    0|2012|
|20121006|  6|  PRCP|   10|USW00023272|    0|2012|
|20121007|  7|  PRCP|   10|USW00023272|    0|2012|
|20121008|  8|  PRCP|   10|USW00023272|    0|2012|
|20121009|  9|  PRCP|   10|USW00023272|    0|2012|
|20121010| 10|  PRCP|   10|USW00023272|    0|2012|
|20121011| 11|  PRCP|   10|USW00023272|    3|2012|
|20121012| 12|  PRCP|   10|USW00023272|    0|2012|
|20121013| 13|  PRCP|   10|USW00023272|    0|2012|
|20121014| 14|  PRCP|   10|USW00023272|    0|2012|
|20121015| 15|  PRCP|   10|USW00023272|    0|2012|
|20121016| 16|  PRCP|   10|USW00023272|    0|2012|
|20121017| 17|  PRCP|   10|USW00023272|    0|2012|
|20121018| 18|  PRCP|   10|USW00023272|    0|2012|
|20121019| 19|  PRCP|   10|USW00023272|    0|2012|
|20121020| 20|  PRCP|   10|USW00023272|    0|2012|
+--------+---+------+-----+-----------+-----+——+

+--------+---+------+-----+-----------+-----+----+
|date_str|day|metric|month|    station|value|year|
+--------+---+------+-----+-----------+-----+----+
|20121001|  1|  TMIN|   10|USW00023272|  139|2012|
|20121002|  2|  TMIN|   10|USW00023272|  178|2012|
|20121003|  3|  TMIN|   10|USW00023272|  144|2012|
|20121004|  4|  TMIN|   10|USW00023272|  144|2012|
|20121005|  5|  TMIN|   10|USW00023272|  139|2012|
|20121006|  6|  TMIN|   10|USW00023272|  128|2012|
|20121007|  7|  TMIN|   10|USW00023272|  122|2012|
|20121008|  8|  TMIN|   10|USW00023272|  122|2012|
|20121009|  9|  TMIN|   10|USW00023272|  139|2012|
|20121010| 10|  TMIN|   10|USW00023272|  128|2012|
|20121011| 11|  TMIN|   10|USW00023272|  122|2012|
|20121012| 12|  TMIN|   10|USW00023272|  117|2012|
|20121013| 13|  TMIN|   10|USW00023272|  122|2012|
|20121014| 14|  TMIN|   10|USW00023272|  128|2012|
|20121015| 15|  TMIN|   10|USW00023272|  128|2012|
|20121016| 16|  TMIN|   10|USW00023272|  156|2012|
|20121017| 17|  TMIN|   10|USW00023272|  139|2012|
|20121018| 18|  TMIN|   10|USW00023272|  161|2012|
|20121019| 19|  TMIN|   10|USW00023272|  133|2012|
|20121020| 20|  TMIN|   10|USW00023272|  122|2012|
+--------+---+------+-----+-----------+-----+——+


+--------+---+------+-----+-----------+-----+----+
|date_str|day|metric|month|    station|value|year|
+--------+---+------+-----+-----------+-----+----+
|20121001|  1|  TMAX|   10|USW00023272|  322|2012|
|20121002|  2|  TMAX|   10|USW00023272|  344|2012|
|20121003|  3|  TMAX|   10|USW00023272|  222|2012|
|20121004|  4|  TMAX|   10|USW00023272|  189|2012|
|20121005|  5|  TMAX|   10|USW00023272|  194|2012|
|20121006|  6|  TMAX|   10|USW00023272|  200|2012|
|20121007|  7|  TMAX|   10|USW00023272|  167|2012|
|20121008|  8|  TMAX|   10|USW00023272|  183|2012|
|20121009|  9|  TMAX|   10|USW00023272|  194|2012|
|20121010| 10|  TMAX|   10|USW00023272|  183|2012|
|20121011| 11|  TMAX|   10|USW00023272|  139|2012|
|20121012| 12|  TMAX|   10|USW00023272|  161|2012|
|20121013| 13|  TMAX|   10|USW00023272|  211|2012|
|20121014| 14|  TMAX|   10|USW00023272|  189|2012|
|20121015| 15|  TMAX|   10|USW00023272|  233|2012|
|20121016| 16|  TMAX|   10|USW00023272|  211|2012|
|20121017| 17|  TMAX|   10|USW00023272|  278|2012|
|20121018| 18|  TMAX|   10|USW00023272|  294|2012|
|20121019| 19|  TMAX|   10|USW00023272|  194|2012|
|20121020| 20|  TMAX|   10|USW00023272|  183|2012|
+--------+---+------+-----+-----------+-----+——+

And the join output is:

+----+-----+---+--------+----+----+----+
|year|month|day|date_str|PRCP|TMIN|TMAX|
+----+-----+---+--------+----+----+----+
|2012|   10|  1|20121001|   0|   0|   0|
|2012|   10|  2|20121002|   0|   0|   0|
|2012|   10|  3|20121003|   0|   0|   0|
|2012|   10|  4|20121004|   0|   0|   0|
|2012|   10|  5|20121005|   0|   0|   0|
|2012|   10|  6|20121006|   0|   0|   0|
|2012|   10|  7|20121007|   0|   0|   0|
|2012|   10|  8|20121008|   0|   0|   0|
|2012|   10|  9|20121009|   0|   0|   0|
|2012|   10| 10|20121010|   0|   0|   0|
|2012|   10| 11|20121011|   3|   3|   3|
|2012|   10| 12|20121012|   0|   0|   0|
|2012|   10| 13|20121013|   0|   0|   0|
|2012|   10| 14|20121014|   0|   0|   0|
|2012|   10| 15|20121015|   0|   0|   0|
|2012|   10| 16|20121016|   0|   0|   0|
|2012|   10| 17|20121017|   0|   0|   0|
|2012|   10| 18|20121018|   0|   0|   0|
|2012|   10| 19|20121019|   0|   0|   0|
|2012|   10| 20|20121020|   0|   0|   0|
+----+-----+---+--------+----+----+——+

Attachment:
- data.json file that is read from HDFS





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to