[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2019-09-12 Thread Jean-Marc Spaggiari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928938#comment-16928938
 ] 

Jean-Marc Spaggiari commented on SPARK-23945:
-

That is what I ended up doing (with the DataFrame syntax), but it might be nice 
to have the other syntax option as well?

> Column.isin() should accept a single-column DataFrame as input
> --
>
> Key: SPARK-23945
> URL: https://issues.apache.org/jira/browse/SPARK-23945
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Nicholas Chammas
> Priority: Minor
>
> In SQL you can filter rows based on the result of a subquery:
> {code:sql}
> SELECT *
> FROM table1
> WHERE name NOT IN (
>     SELECT name
>     FROM table2
> );{code}
> In the Spark DataFrame API, the equivalent would probably look like this:
> {code:python}
> (table1
>     .where(
>         ~col('name').isin(
>             table2.select('name')
>         )
>     )
> ){code}
> However, .isin() currently [only accepts a local list of 
> values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin].
> I imagine making this enhancement would happen as part of a larger effort to 
> support correlated subqueries in the DataFrame API.
> Or perhaps there is no plan to support this style of query in the DataFrame 
> API, and queries like this should instead be written in a different way? How 
> would we write a query like the one I have above in the DataFrame API, 
> without needing to collect values locally for the NOT IN filter?
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input

2019-09-12 Thread Jean-Marc Spaggiari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928799#comment-16928799
 ] 

Jean-Marc Spaggiari commented on SPARK-23945:
-

What about a query like:
{code:sql}
SELECT *
FROM table1
WHERE (a, b) IN (
    SELECT a, b
    FROM table2
);
{code}
Is there an easy way to translate that into the DataFrame API?



