[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17003649#comment-17003649
 ] 

Takeshi Yamamuro commented on SPARK-10892:
------------------------------------------

Since I checked that the current master threw an exception for the query above, 
I'll close as resolved. If any problem, please reopen this. Thanks.
{code:java}
scala> val out = prcp.join(tmin, "date_str").join(tmax, 
"date_str").select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"), 
prcp("value").alias("PRCP"), tmin("value").alias("TMIN"), 
tmax("value").alias("TMAX"))
org.apache.spark.sql.AnalysisException: Column value#12L, value#12L are 
ambiguous. It's probably because you joined several Datasets together, and some 
of these Datasets are the same. This column points to one of the Datasets but 
Spark is unable to figure out which one. Please alias the Datasets with 
different names via `Dataset.as` before joining them, and specify the column 
using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. 
You can also set spark.sql.analyzer.failAmbiguousSelfJoin.enabled to false to 
disable this check.;
{code}

> Join with Data Frame returns wrong results
> ------------------------------------------
>
>                 Key: SPARK-10892
>                 URL: https://issues.apache.org/jira/browse/SPARK-10892
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0, 2.4.0
>            Reporter: Ofer Mendelevitch
>            Priority: Critical
>              Labels: correctness
>         Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column "value" with 
> the same name, so renaming them after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>           .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
>             prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
>             tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> +--------+---+------+-----+-----------+-----+----+
> |date_str|day|metric|month|    station|value|year|
> +--------+---+------+-----+-----------+-----+----+
> |20121001|  1|  PRCP|   10|USW00023272|    0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|    0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|    0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|    0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|    0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|    0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|    0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|    0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|    0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|    0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|    3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|    0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|    0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|    0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|    0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|    0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|    0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|    0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|    0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|    0|2012|
> +--------+---+------+-----+-----------+-----+——+
> +--------+---+------+-----+-----------+-----+----+
> |date_str|day|metric|month|    station|value|year|
> +--------+---+------+-----+-----------+-----+----+
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> +--------+---+------+-----+-----------+-----+——+
> +--------+---+------+-----+-----------+-----+----+
> |date_str|day|metric|month|    station|value|year|
> +--------+---+------+-----+-----------+-----+----+
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |20121006|  6|  TMAX|   10|USW00023272|  200|2012|
> |20121007|  7|  TMAX|   10|USW00023272|  167|2012|
> |20121008|  8|  TMAX|   10|USW00023272|  183|2012|
> |20121009|  9|  TMAX|   10|USW00023272|  194|2012|
> |20121010| 10|  TMAX|   10|USW00023272|  183|2012|
> |20121011| 11|  TMAX|   10|USW00023272|  139|2012|
> |20121012| 12|  TMAX|   10|USW00023272|  161|2012|
> |20121013| 13|  TMAX|   10|USW00023272|  211|2012|
> |20121014| 14|  TMAX|   10|USW00023272|  189|2012|
> |20121015| 15|  TMAX|   10|USW00023272|  233|2012|
> |20121016| 16|  TMAX|   10|USW00023272|  211|2012|
> |20121017| 17|  TMAX|   10|USW00023272|  278|2012|
> |20121018| 18|  TMAX|   10|USW00023272|  294|2012|
> |20121019| 19|  TMAX|   10|USW00023272|  194|2012|
> |20121020| 20|  TMAX|   10|USW00023272|  183|2012|
> +--------+---+------+-----+-----------+-----+——+
> {code}
> And the join output is:
> {code}
> +----+-----+---+--------+----+----+----+
> |year|month|day|date_str|PRCP|TMIN|TMAX|
> +----+-----+---+--------+----+----+----+
> |2012|   10|  1|20121001|   0|   0|   0|
> |2012|   10|  2|20121002|   0|   0|   0|
> |2012|   10|  3|20121003|   0|   0|   0|
> |2012|   10|  4|20121004|   0|   0|   0|
> |2012|   10|  5|20121005|   0|   0|   0|
> |2012|   10|  6|20121006|   0|   0|   0|
> |2012|   10|  7|20121007|   0|   0|   0|
> |2012|   10|  8|20121008|   0|   0|   0|
> |2012|   10|  9|20121009|   0|   0|   0|
> |2012|   10| 10|20121010|   0|   0|   0|
> |2012|   10| 11|20121011|   3|   3|   3|
> |2012|   10| 12|20121012|   0|   0|   0|
> |2012|   10| 13|20121013|   0|   0|   0|
> |2012|   10| 14|20121014|   0|   0|   0|
> |2012|   10| 15|20121015|   0|   0|   0|
> |2012|   10| 16|20121016|   0|   0|   0|
> |2012|   10| 17|20121017|   0|   0|   0|
> |2012|   10| 18|20121018|   0|   0|   0|
> |2012|   10| 19|20121019|   0|   0|   0|
> |2012|   10| 20|20121020|   0|   0|   0|
> +----+-----+---+--------+----+----+——+
> {code}
> Attachment:
> - data.json file that is read from HDFS



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to