[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15801624#comment-15801624 ]
Nattavut Sutyanyong commented on SPARK-19017: --------------------------------------------- One way to interpret the behaviour of the comparison of tuples is it operates strictly on 2-value logic of the comparison of each element of the tuples. That is, Case 1: (a1, a2) = (b1, b2) is interpreted as IF (a1 = b1) is true THEN true ELSE false AND IF (a2 = b2) is true THEN true ELSE false Case 2: (a1, a2) <> (b1, b2) is interpreted as a negation of the equality operator on the 2-value logic of each element. IF (a1 = b1) is true THEN false ELSE true OR IF (a2 = b2) is true THEN false ELSE true The NOT IN semantics is different that it preserves the 3-value logic in the comparison of each element. I don't know that we should align the two usages to the same semantics or not. > NOT IN subquery with more than one column may return incorrect results > ---------------------------------------------------------------------- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 > Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org