[ https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745467#comment-14745467 ]

Cheng Hao commented on SPARK-4226:
----------------------------------

[~marmbrus] [~yhuai] After investigating a little bit, I think an anti-join is 
much more efficient than rewriting NOT IN / NOT EXISTS as a left outer join 
followed by null filtering. The anti-join can return a negative result as soon 
as it finds the first match in the second relation, whereas the left outer join 
has to go through every matching pair and then do the filtering.

Besides, for the NOT EXISTS clause, the implementation seems more complicated 
without the anti-join. For example:
{code}
mysql> select * from d1;
+------+------+
| a    | b    |
+------+------+
|    2 |    2 |
|    8 |   10 |
+------+------+
2 rows in set (0.00 sec)

mysql> select * from d2;
+------+------+
| a    | b    |
+------+------+
|    1 |    1 |
|    8 | NULL |
|    0 |    0 |
+------+------+
3 rows in set (0.00 sec)

mysql> select * from d1 where not exists (select b from d2 where d1.a=d2.a);
+------+------+
| a    | b    |
+------+------+
|    2 |    2 |
+------+------+
1 row in set (0.00 sec)

// If we rewrite the above query as a left outer join, the filter condition cannot simply be the subquery's project list.
mysql> select d1.a, d1.b from d1 left join d2 on d1.a=d2.a where d2.b is null;
+------+------+
| a    | b    |
+------+------+
|    8 |   10 |
|    2 |    2 |
+------+------+
2 rows in set (0.00 sec)
// gives a different result from NOT EXISTS.
{code}
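The divergence above can be sketched in plain Scala (not Spark — just a hypothetical nested-loop model of the two plans), with d2.b modeled as an Option to stand in for a nullable column. The row (8, 10) survives the outer-join rewrite only because its match (8, NULL) happens to have a NULL b, not because it is unmatched:

{code}
// Sketch, assuming the d1/d2 data from the MySQL example above.
object NotExistsRewrite {
  // d1(a, b) and d2(a, b); d2.b is nullable, modeled with Option.
  val d1: Seq[(Int, Int)] = Seq((2, 2), (8, 10))
  val d2: Seq[(Int, Option[Int])] = Seq((1, Some(1)), (8, None), (0, Some(0)))

  // Anti-join semantics: keep d1 rows with no match in d2 on column a.
  // `exists` stops at the first match, mirroring the anti-join's early exit.
  val viaAntiJoin: Seq[(Int, Int)] =
    d1.filterNot { case (a, _) => d2.exists(_._1 == a) }

  // Left outer join on a, then filter where d2.b IS NULL. A d1 row passes
  // either because it is unmatched (null-extended) or because a matching
  // d2 row has a NULL b — the second case is where the rewrite goes wrong.
  val viaOuterJoinNullFilter: Seq[(Int, Int)] =
    d1.filter { case (a, _) =>
      val matches = d2.filter(_._1 == a)
      matches.isEmpty || matches.exists(_._2.isEmpty)
    }

  def main(args: Array[String]): Unit = {
    println(s"NOT EXISTS via anti-join:       $viaAntiJoin")            // List((2,2))
    println(s"left outer join + b IS NULL:    $viaOuterJoinNullFilter") // List((2,2), (8,10))
  }
}
{code}

A correct outer-join rewrite would have to filter on a never-null marker from the right side (e.g. d2.a, the join key itself) rather than on the subquery's project list.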

If you feel that makes sense, I can reopen my PR and do the rebasing.

> SparkSQL - Add support for subqueries in predicates
> ---------------------------------------------------
>
>                 Key: SPARK-4226
>                 URL: https://issues.apache.org/jira/browse/SPARK-4226
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0
>         Environment: Spark 1.2 snapshot
>            Reporter: Terry Siu
>
> I have a test table defined in Hive as follows:
> {code:sql}
> CREATE TABLE sparkbug (
>   id INT,
>   event STRING
> ) STORED AS PARQUET;
> {code}
> and insert some sample data with ids 1, 2, 3.
> In a Spark shell, I then create a HiveContext and then execute the following 
> HQL to test out subquery predicates:
> {code}
> val hc = new HiveContext(sc)
> hc.hql("select customerid from sparkbug where customerid in (select 
> customerid from sparkbug where customerid in (2,3))")
> {code}
> I get the following error:
> {noformat}
> java.lang.RuntimeException: Unsupported language features in query: select 
> customerid from sparkbug where customerid in (select customerid from sparkbug 
> where customerid in (2,3))
> TOK_QUERY
>   TOK_FROM
>     TOK_TABREF
>       TOK_TABNAME
>         sparkbug
>   TOK_INSERT
>     TOK_DESTINATION
>       TOK_DIR
>         TOK_TMP_FILE
>     TOK_SELECT
>       TOK_SELEXPR
>         TOK_TABLE_OR_COL
>           customerid
>     TOK_WHERE
>       TOK_SUBQUERY_EXPR
>         TOK_SUBQUERY_OP
>           in
>         TOK_QUERY
>           TOK_FROM
>             TOK_TABREF
>               TOK_TABNAME
>                 sparkbug
>           TOK_INSERT
>             TOK_DESTINATION
>               TOK_DIR
>                 TOK_TMP_FILE
>             TOK_SELECT
>               TOK_SELEXPR
>                 TOK_TABLE_OR_COL
>                   customerid
>             TOK_WHERE
>               TOK_FUNCTION
>                 in
>                 TOK_TABLE_OR_COL
>                   customerid
>                 2
>                 3
>         TOK_TABLE_OR_COL
>           customerid
> scala.NotImplementedError: No parse rules for ASTNode type: 817, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR
>   TOK_SUBQUERY_OP
>     in
>   TOK_QUERY
>     TOK_FROM
>       TOK_TABREF
>         TOK_TABNAME
>           sparkbug
>     TOK_INSERT
>       TOK_DESTINATION
>         TOK_DIR
>           TOK_TMP_FILE
>       TOK_SELECT
>         TOK_SELEXPR
>           TOK_TABLE_OR_COL
>             customerid
>       TOK_WHERE
>         TOK_FUNCTION
>           in
>           TOK_TABLE_OR_COL
>             customerid
>           2
>           3
>   TOK_TABLE_OR_COL
>     customerid
> " +
> 	org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098)
>         at scala.sys.package$.error(package.scala:27)
>         at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252)
>         at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
>         at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
>         at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> {noformat}
> [This thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html] also brings up the lack of subquery support in SparkSQL. It would be nice to have subquery predicate support in a near-future release (1.3, maybe?).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
