[ https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246084#comment-14246084 ]
Stephen Boesch commented on SPARK-4775:
---------------------------------------

Thanks v much Michael. You hit the nail on the head. I will update our internal code here to remove that antipattern. Issue is being closed.

> Possible problem in a simple join? Getting duplicate rows and missing rows
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-4775
>                 URL: https://issues.apache.org/jira/browse/SPARK-4775
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: Run on Mac but should be agnostic
>            Reporter: Stephen Boesch
>            Assignee: Michael Armbrust
>
> I am working on testing of HBase joins. As part of this work some simple
> vanilla SparkSQL tests were created. Some of the results are surprising:
> here are the details:
> ------------------------------------
> Consider the following schema that includes two columns:
> {code}
> case class JoinTable2Cols(intcol: Int, strcol: String)
> {code}
> Let us register two temp tables using this schema and insert 2 rows and 4
> rows respectively:
> {code}
> val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, s"valA$ix")})
> rdd1.registerTempTable("SparkJoinTable1")
> val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
> val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, s"valB$is")})
> val table2 = rdd2.registerTempTable("SparkJoinTable2")
> {code}
> Here is the data in both tables:
> {code}
> Table1 Contents:
> [1,valA1]
> [2,valA2]
> Table2 Contents:
> [1,valB1]
> [1,valB2]
> [2,valB3]
> [2,valB4]
> {code}
> Now let us join the tables on the first column:
> {code}
> select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
>   t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
>   SparkJoinTable2 t2 on t1.intcol = t2.intcol
> {code}
> What results do we get:
> came back with 4 results
> {code}
> Results
> [1,1,valA1,valB2]
> [1,1,valA1,valB2]
> [2,2,valA2,valB4]
> [2,2,valA2,valB4]
> {code}
> Huh??
> Where did valB1 and valB3 go? Why do we have duplicate rows?
> Note: the expected results were:
> {code}
> Seq(1, 1, "valA1", "valB1"),
> Seq(1, 1, "valA1", "valB2"),
> Seq(2, 2, "valA2", "valB3"),
> Seq(2, 2, "valA2", "valB4"))
> {code}
> A standalone testing program is attached SparkSQLJoinSuite. An abridged
> version of the actual output is also attached.
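
For reference, the expected result set quoted above can be reproduced without Spark. The following is only a minimal sketch in plain Scala collections, assuming standard inner-join semantics on intcol; the object name ExpectedJoin and the value joined are hypothetical and are not part of the attached SparkSQLJoinSuite.

```scala
// Plain-Scala model of the inner join from the report; no Spark involved.
case class JoinTable2Cols(intcol: Int, strcol: String)

object ExpectedJoin {
  // Same contents as SparkJoinTable1 and SparkJoinTable2 in the report.
  val table1: Seq[JoinTable2Cols] =
    (1 to 2).map(ix => JoinTable2Cols(ix, s"valA$ix"))
  val table2: Seq[JoinTable2Cols] =
    Seq((1, 1), (1, 2), (2, 3), (2, 4)).map { case (ix, is) => JoinTable2Cols(ix, s"valB$is") }

  // Inner join on intcol: each matching (t1, t2) pair yields exactly one row.
  val joined: Seq[(Int, Int, String, String)] =
    for {
      t1 <- table1
      t2 <- table2
      if t1.intcol == t2.intcol
    } yield (t1.intcol, t2.intcol, t1.strcol, t2.strcol)

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

Each row of table2 matches exactly one row of table1, so the join should yield four distinct rows, one per valB value. That is the basis for the expectation stated in the report.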