[ https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236577#comment-14236577 ]
Stephen Boesch commented on SPARK-4775:
---------------------------------------

Here is the same logic for mysql: note the output in mysql matches the expected output from the test program:

create table tab1 (intcol int, strcol varchar(100));
create table tab2 (intcol int, strcol varchar(100));
insert into tab1 values (1,'valA1');
insert into tab1 values (2,'valA2');
insert into tab2 values (1,'valB1');
insert into tab2 values (1,'valB2');
insert into tab2 values (2,'valB3');
insert into tab2 values (2,'valB4');
select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, t2.strcol t2strcol
from tab1 t1 JOIN tab2 t2 on t1.intcol = t2.intcol;

mysql> select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
    -> t2.strcol t2strcol from tab1 t1 JOIN
    -> tab2 t2 on t1.intcol = t2.intcol;
+----------+----------+----------+----------+
| t1intcol | t2intcol | t1strcol | t2strcol |
+----------+----------+----------+----------+
|        1 |        1 | valA1    | valB1    |
|        1 |        1 | valA1    | valB2    |
|        2 |        2 | valA2    | valB3    |
|        2 |        2 | valA2    | valB4    |
+----------+----------+----------+----------+
4 rows in set (0.01 sec)

> Possible problem in a simple join? Getting duplicate rows and missing rows
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-4775
>                 URL: https://issues.apache.org/jira/browse/SPARK-4775
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: Run on Mac but should be agnostic
>            Reporter: Stephen Boesch
>
> I am working on testing of HBase joins. As part of this work some simple
> vanilla SparkSQL tests were created.
> Some of the results are surprising:
> here are the details:
> ------------------------------------
> Consider the following schema that includes two columns:
> case class JoinTable2Cols(intcol: Int, strcol: String)
> Let us register two temp tables using this schema and insert 2 rows and 4
> rows respectively:
> val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, s"valA$ix")})
> rdd1.registerTempTable("SparkJoinTable1")
> val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
> val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, s"valB$is")})
> val table2 = rdd2.registerTempTable("SparkJoinTable2")
> Here is the data in both tables:
> Table1 Contents:
> [1,valA1]
> [2,valA2]
> Table2 Contents:
> [1,valB1]
> [1,valB2]
> [2,valB3]
> [2,valB4]
> Now let us join the tables on the first column:
> select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
> t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
> SparkJoinTable2 t2 on t1.intcol = t2.intcol
> What results do we get:
> came back with 4 results
> Results
> [1,1,valA1,valB2]
> [1,1,valA1,valB2]
> [2,2,valA2,valB4]
> [2,2,valA2,valB4]
> Huh??
> Where did valB1 and valB3 go? Why do we have duplicate rows?
> Note: the expected results were:
> Seq(1, 1, "valA1", "valB1"),
> Seq(1, 1, "valA1", "valB2"),
> Seq(2, 2, "valA2", "valB3"),
> Seq(2, 2, "valA2", "valB4"))
> A standalone testing program is attached SparkSQLJoinSuite. An abridged
> version of the actual output is also attached.