Nicholas Chammas created SPARK-23945:
----------------------------------------

             Summary: Column.isin() should accept a single-column DataFrame as 
input
                 Key: SPARK-23945
                 URL: https://issues.apache.org/jira/browse/SPARK-23945
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Nicholas Chammas


In SQL you can filter rows based on the result of a subquery:

 
{code:java}
SELECT *
FROM table1
WHERE name NOT IN (
    SELECT name
    FROM table2
);{code}
In the Spark DataFrame API, the equivalent would probably look like this:
{code:java}
(table1
    .where(
        ~col('name').isin(
            table2.select('name')
        )
    )
){code}
However, .isin() currently [only accepts a local list of 
values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].

I imagine making this enhancement would happen as part of a larger effort to 
support correlated subqueries in the DataFrame API.

Or perhaps there is no plan to support this style of query in the DataFrame 
API, and queries like this should instead be written in a different way? How 
would we write a query like the one I have above in the DataFrame API, without 
needing to collect values locally for the NOT IN filter?

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to