[ https://issues.apache.org/jira/browse/SPARK-15441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297881#comment-15297881 ]
Zhan Zhang commented on SPARK-15441:
------------------------------------

Currently, new GenericInternalRow(right.output.length) is used as nullRow, but it cannot distinguish between the row itself being null and all of its columns being null. We could probably add a special nullRow that represents a null InternalRow, so that the Encoder can tell whether the object itself is null.

> dataset outer join seems to return incorrect result
> ---------------------------------------------------
>
>                 Key: SPARK-15441
>                 URL: https://issues.apache.org/jira/browse/SPARK-15441
>             Project: Spark
>          Issue Type: Bug
>          Components: sql
>            Reporter: Reynold Xin
>            Assignee: Wenchen Fan
>            Priority: Critical
>
> See notebook
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2836020637783173/5382278320999420/latest.html
> {code}
> import org.apache.spark.sql.functions
> val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
> val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()
> // The last row _1 should be null, rather than (null, -1)
> left.toDF("k", "v").as[(String, Int)].alias("left")
>   .joinWith(right.toDF("k", "u").as[(String, String)].alias("right"),
>     functions.col("left.k") === functions.col("right.k"), "right_outer")
>   .show()
> {code}
> The returned result currently is
> {code}
> +---------+-----+
> |       _1|   _2|
> +---------+-----+
> |    (a,2)|(a,x)|
> |    (a,1)|(a,x)|
> |    (b,3)|(b,y)|
> |(null,-1)|(d,z)|
> +---------+-----+
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
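
As a rough illustration of the ambiguity described in the comment, here is a minimal plain-Scala sketch that does not use Spark at all. The `Row` alias, the `NullRow` sentinel, `paddedRow`, and both decoder functions are hypothetical stand-ins, not Spark's InternalRow or Encoder API, and the -1 fallback for a null Int column is an assumption chosen only to mimic the reported output.

```scala
// Hypothetical sketch: none of these names are Spark classes or methods.
object NullRowSketch {
  // A row is modeled as a plain array of column values.
  type Row = Array[Any]

  // Assumed sentinel meaning "the row itself is null".
  // It is recognized by reference identity, never by its contents.
  val NullRow: Row = Array.empty[Any]

  // The padding described in the comment: a fresh row whose columns are all null.
  def paddedRow(width: Int): Row = new Array[Any](width)

  // Decoder with no sentinel: every row becomes a tuple, so an all-null row
  // cannot be told apart from a missing object. The -1 fallback is an
  // assumption that mimics the (null,-1) seen in the issue.
  def decodeNaive(row: Row): (String, Int) = {
    val k = row(0).asInstanceOf[String]
    val v = if (row(1) == null) -1 else row(1).asInstanceOf[Int]
    (k, v)
  }

  // Decoder that checks for the sentinel first, so a null object stays null
  // (modeled here as None).
  def decodeWithSentinel(row: Row): Option[(String, Int)] =
    if (row eq NullRow) None else Some(decodeNaive(row))

  def main(args: Array[String]): Unit = {
    println(decodeNaive(paddedRow(2)))               // (null,-1)
    println(decodeWithSentinel(NullRow))             // None
    println(decodeWithSentinel(Array[Any]("a", 1)))  // Some((a,1))
  }
}
```

With the sentinel, the unmatched side of an outer join could be decoded as a null object instead of an object whose fields are all defaults. Whether Spark should do this through a dedicated row instance or some other marker is left open here.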