[ 
https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246029#comment-14246029
 ] 

Michael Armbrust commented on SPARK-4775:
-----------------------------------------

I only skimmed the code, but I think the likely problem is that you are 
calling "toRDD".  This function is a developer API not intended for users 
and is documented as "Internal version of the RDD. Avoids copies and has no 
schema".  If you use it directly without defensively copying each row, you'll 
see weird repeated rows.  Instead, just use the SchemaRDD as an RDD and we'll 
do the copying for you.
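
To illustrate why skipping the defensive copy produces repeated rows, here is a minimal plain-Scala sketch with no Spark dependency. It is not Spark's implementation: `MutableRow`, `reusingIterator`, and the rest of `RowReuseDemo` are hypothetical names invented for this example, and the model is simplified (a single shared row object that is overwritten for every record). The mechanism is the one described above: if an iterator hands out the same mutable object each time, anything that buffers those references without copying ends up seeing only the last value written.

```scala
// Plain-Scala illustration only; every name here is hypothetical, not Spark API.
object RowReuseDemo {
  // Stand-in for an internal row object that is mutated in place.
  final class MutableRow(var intcol: Int, var strcol: String) {
    def copy(): MutableRow = new MutableRow(intcol, strcol)
    override def toString: String = s"[$intcol,$strcol]"
  }

  // Returns the SAME row instance on every call to next(), overwriting its fields.
  def reusingIterator(data: Seq[(Int, String)]): Iterator[MutableRow] = {
    val shared = new MutableRow(0, "")
    data.iterator.map { case (i, s) =>
      shared.intcol = i
      shared.strcol = s
      shared
    }
  }

  val data: Seq[(Int, String)] =
    Seq((1, "valB1"), (1, "valB2"), (2, "valB3"), (2, "valB4"))

  // Buffering without copying: the list holds four references to one object,
  // so every element shows the last values written to it.
  def withoutCopy: Seq[String] =
    reusingIterator(data).toList.map(_.toString)

  // Copying each row before buffering preserves the individual values.
  def withCopy: Seq[String] =
    reusingIterator(data).map(_.copy()).toList.map(_.toString)

  def main(args: Array[String]): Unit = {
    println(withoutCopy.mkString(", "))
    println(withCopy.mkString(", "))
  }
}
```

In this simplified model all four buffered rows collapse to the last one. In the reported join the duplicates appear per key instead, presumably because rows are buffered per join key, but the underlying cause (shared mutable rows held without a copy) is the same.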

> Possible problem in a simple join?  Getting duplicate rows and missing rows
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-4775
>                 URL: https://issues.apache.org/jira/browse/SPARK-4775
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: Run on Mac but should be agnostic
>            Reporter: Stephen Boesch
>            Assignee: Michael Armbrust
>
> I am working on testing HBase joins. As part of this work, some simple 
> vanilla SparkSQL tests were created.  Some of the results are surprising; 
> here are the details:
> ------------------------------------
> Consider the following schema that includes two columns:
> {code}
> case class JoinTable2Cols(intcol: Int, strcol: String)
> {code}
> Let us register two temp tables using this schema and insert 2 rows and 4 
> rows respectively:
> {code}
>     val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, s"valA$ix") })
>     rdd1.registerTempTable("SparkJoinTable1")
>     val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
>     val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, s"valB$is") })
>     rdd2.registerTempTable("SparkJoinTable2")
> {code}
> Here is the data in both tables:
> {code}
> Table1 Contents:
> [1,valA1]
> [2,valA2]
> Table2 Contents:
> [1,valB1]
> [1,valB2]
> [2,valB3]
> [2,valB4]
> {code}
> Now let us join the tables on the first column:
> {code}
> select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol,
>                 t2.strcol t2strcol from SparkJoinTable1 t1 JOIN
>                     SparkJoinTable2 t2 on t1.intcol = t2.intcol
> {code}
> What results do we get? The query came back with 4 results:
> {code}
> Results
> [1,1,valA1,valB2]
> [1,1,valA1,valB2]
> [2,2,valA2,valB4]
> [2,2,valA2,valB4]
> {code}
> Huh??
> Where did valB1 and valB3 go? Why do we have duplicate rows?
> Note: the expected results were:
> {code}
>       Seq(1, 1, "valA1", "valB1"),
>       Seq(1, 1, "valA1", "valB2"),
>       Seq(2, 2, "valA2", "valB3"),
>       Seq(2, 2, "valA2", "valB4"))
> {code}
> A standalone testing program, SparkSQLJoinSuite, is attached. An abridged 
> version of the actual output is also attached.


