[ https://issues.apache.org/jira/browse/SPARK-19017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798758#comment-15798758 ]
Herman van Hovell commented on SPARK-19017: ------------------------------------------- Ok, my bad. Lets try this again. If I follow the NAAJ explanation in the section 6.1 of the Enhanced Subquery Optimizations in Oracle paper (http://www.vldb.org/pvldb/2/vldb09-423.pdf). Then I can rewrite the following query {{select * from tbl_a where a not in (select b from tbl_b)}} into the following (given that tbl_b has n elements): {noformat} SELECT * FROM TBL_a WHERE a <> b1 AND a <> b2 AND ... AND a <> bn {noformat} My basic thought here is that if we are comparing a complete tuple, for instance (2, 1), to an incomplete tuple, for instance (2, null), then the result should be unknown (null). Following this it would be obvious that a conjunctive predicate containing such an incomplete comparison should also evaluate to unknown. It now boils down to how we should compare the tuples; i.e. how is {{a <> b}} evaluated. Note that I do not have an answer here. The following two things could apply: * {{not(a.x1 = b.x1 and a.x2 = b.x2 and ... and a.xn = b.xn)}} - I would lean towards this option. * {{a.x1 <> b.x1 or a.x2 <> b.x2 or ... or a.xn <> b.xn)}} - Spark currently implements this option if you compare structs. > NOT IN subquery with more than one column may return incorrect results > ---------------------------------------------------------------------- > > Key: SPARK-19017 > URL: https://issues.apache.org/jira/browse/SPARK-19017 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 > Reporter: Nattavut Sutyanyong > > When putting more than one column in the NOT IN, the query may not return > correctly if there is a null data. We can demonstrate the problem with the > following data set and query: > {code} > Seq((2,1)).toDF("a1","b1").createOrReplaceTempView("t1") > Seq[(java.lang.Integer,java.lang.Integer)]((1,null)).toDF("a2","b2").createOrReplaceTempView("t2") > sql("select * from t1 where (a1,b1) not in (select a2,b2 from t2)").show > +---+---+ > | a1| b1| > +---+---+ > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org