Stephen Boesch created SPARK-4775:
-------------------------------------

             Summary: Possible problem in a simple join?  Getting duplicate rows and missing rows
                 Key: SPARK-4775
                 URL: https://issues.apache.org/jira/browse/SPARK-4775
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
         Environment: Run on Mac but should be agnostic
            Reporter: Stephen Boesch


I am testing HBase joins. As part of this work, some simple vanilla SparkSQL 
tests were created. Some of the results are surprising; here are the 
details:
------------------------------------
Consider the following schema that includes two columns:

case class JoinTable2Cols(intcol: Int, strcol: String)

Let us register two temp tables using this schema and insert 2 rows and 4 rows 
respectively:

    val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, s"valA$ix") })
    rdd1.registerTempTable("SparkJoinTable1")

    val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
    val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, s"valB$is") })
    val table2 = rdd2.registerTempTable("SparkJoinTable2")

Here is the data in both tables:

Table1 Contents:
[1,valA1]
[2,valA2]
Table2 Contents:
[1,valB1]
[1,valB2]
[2,valB3]
[2,valB4]

Now let us join the tables on the first column:

select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
                t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
                    SparkJoinTable2 t2 on t1.intcol = t2.intcol
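
For reference, the inner-join semantics this query asks for can be sketched on plain Scala collections, with no Spark involved. This is only an illustration of what the query should return for the data above; the object name ReferenceJoin and the value names are hypothetical and are not part of the attached test suite.

```scala
// Plain-Scala reference for the inner join above; Spark is not used here.
object ReferenceJoin {
  case class JoinTable2Cols(intcol: Int, strcol: String)

  // Same data as SparkJoinTable1 and SparkJoinTable2.
  val table1: Seq[JoinTable2Cols] =
    (1 to 2).map(ix => JoinTable2Cols(ix, s"valA$ix"))
  val table2: Seq[JoinTable2Cols] =
    Seq((1, 1), (1, 2), (2, 3), (2, 4)).map {
      case (ix, is) => JoinTable2Cols(ix, s"valB$is")
    }

  // Pair every t1 row with every t2 row that has the same intcol.
  val referenceJoin: Seq[(Int, Int, String, String)] = for {
    t1 <- table1
    t2 <- table2
    if t1.intcol == t2.intcol
  } yield (t1.intcol, t2.intcol, t1.strcol, t2.strcol)

  def main(args: Array[String]): Unit =
    referenceJoin.foreach(println)
}
```

Each row of the first table matches two rows of the second, so a correct join yields four distinct rows.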

What results do we get?


 came back with 4 results
Results
[1,1,valA1,valB2]
[1,1,valA1,valB2]
[2,2,valA2,valB4]
[2,2,valA2,valB4]


Huh??

Where did valB1 and valB3 go? Why do we have duplicate rows?

Note: the expected results were:

      Seq(1, 1, "valA1", "valB1"),
      Seq(1, 1, "valA1", "valB2"),
      Seq(2, 2, "valA2", "valB3"),
      Seq(2, 2, "valA2", "valB4")
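
To make the discrepancy explicit, the reported rows can be compared against the expected rows with a multiset difference. The following is a minimal sketch on plain Scala collections; the object name JoinResultDiff and the tuple representation of rows are assumptions made for illustration, and the actual rows are copied by hand from the output quoted in this report.

```scala
// Compares the reported join output with the expected output.
object JoinResultDiff {
  type Row = (Int, Int, String, String)

  val expected: Seq[Row] = Seq(
    (1, 1, "valA1", "valB1"),
    (1, 1, "valA1", "valB2"),
    (2, 2, "valA2", "valB3"),
    (2, 2, "valA2", "valB4"))

  // Rows as they appear in the reported output.
  val actual: Seq[Row] = Seq(
    (1, 1, "valA1", "valB2"),
    (1, 1, "valA1", "valB2"),
    (2, 2, "valA2", "valB4"),
    (2, 2, "valA2", "valB4"))

  // Seq.diff is a multiset difference: each element of the argument
  // cancels at most one matching occurrence in the receiver.
  val missing: Seq[Row] = expected.diff(actual)
  val unexpected: Seq[Row] = actual.diff(expected)

  def main(args: Array[String]): Unit = {
    println(s"missing:    $missing")
    println(s"unexpected: $unexpected")
  }
}
```

With the data above, missing holds the valB1 and valB3 rows, and unexpected holds one surplus copy each of the valB2 and valB4 rows.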


A standalone test program, SparkSQLJoinSuite, is attached. An abridged 
version of the actual output is also attached.






