[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input
[ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928938#comment-16928938 ] Jean-Marc Spaggiari commented on SPARK-23945: - It is what I ended up doing (with the DataFrame syntax) but might be nice to have the other syntax option? > Column.isin() should accept a single-column DataFrame as input > -- > > Key: SPARK-23945 > URL: https://issues.apache.org/jira/browse/SPARK-23945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Nicholas Chammas >Priority: Minor > > In SQL you can filter rows based on the result of a subquery: > {code:java} > SELECT * > FROM table1 > WHERE name NOT IN ( > SELECT name > FROM table2 > );{code} > In the Spark DataFrame API, the equivalent would probably look like this: > {code:java} > (table1 > .where( > ~col('name').isin( > table2.select('name') > ) > ) > ){code} > However, .isin() currently [only accepts a local list of > values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin]. > I imagine making this enhancement would happen as part of a larger effort to > support correlated subqueries in the DataFrame API. > Or perhaps there is no plan to support this style of query in the DataFrame > API, and queries like this should instead be written in a different way? How > would we write a query like the one I have above in the DataFrame API, > without needing to collect values locally for the NOT IN filter? > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23945) Column.isin() should accept a single-column DataFrame as input
[ https://issues.apache.org/jira/browse/SPARK-23945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928799#comment-16928799 ] Jean-Marc Spaggiari commented on SPARK-23945: - What's about a query like: {code:java} SELECT * FROM table1 WHERE (a, b) IN ( SELECT a,b FROM table2 ); {code} Any easy way to translate that into the Dataframe API? > Column.isin() should accept a single-column DataFrame as input > -- > > Key: SPARK-23945 > URL: https://issues.apache.org/jira/browse/SPARK-23945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Nicholas Chammas >Priority: Minor > > In SQL you can filter rows based on the result of a subquery: > {code:java} > SELECT * > FROM table1 > WHERE name NOT IN ( > SELECT name > FROM table2 > );{code} > In the Spark DataFrame API, the equivalent would probably look like this: > {code:java} > (table1 > .where( > ~col('name').isin( > table2.select('name') > ) > ) > ){code} > However, .isin() currently [only accepts a local list of > values|http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.isin]. > I imagine making this enhancement would happen as part of a larger effort to > support correlated subqueries in the DataFrame API. > Or perhaps there is no plan to support this style of query in the DataFrame > API, and queries like this should instead be written in a different way? How > would we write a query like the one I have above in the DataFrame API, > without needing to collect values locally for the NOT IN filter? > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org