Nick, thanks for raising this. It looks useful to have something in the DF API that behaves like sub-queries, but I’m not sure that passing a DF works. Making every method accept a DF that may contain matching data seems like it puts a lot of work on the API, which would then have to accept a DF all over the place.
What about exposing transforms that make it easy to coerce data to what the method needs? Instead of passing a dataframe, you’d pass df.toSet to isin:

    val subQ = spark.sql("select distinct filter_col from source")
    val df = table.filter($"col".isin(subQ.toSet))

That also distinguishes between a sub-query and a correlated sub-query that uses values from the outer query. We would still need to come up with syntax for the correlated case, unless there’s a proposal already.

rb

On Mon, Apr 9, 2018 at 3:56 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> I just submitted SPARK-23945
> <https://issues.apache.org/jira/browse/SPARK-23945> but wanted to double
> check here to make sure I didn't miss something fundamental.
>
> Correlated subqueries are tracked at a high level in SPARK-18455
> <https://issues.apache.org/jira/browse/SPARK-18455>, but it's not clear
> to me whether they are "design-appropriate" for the DataFrame API.
>
> Are correlated subqueries a thing we can expect to have in the DataFrame
> API?
>
> Nick

--
Ryan Blue
Software Engineer
Netflix