Perhaps we can just have a function that turns a DataFrame into a Column? That'd work for both the correlated and uncorrelated cases, although in the correlated case we'd need to turn off eager analysis (otherwise there is no way to construct a valid DataFrame).
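To make the idea concrete, a purely hypothetical sketch (toColumn is a made-up name, not an existing Spark API; `table` and `source` are the DataFrames from the snippet below):

    // Hypothetical sketch only: toColumn does not exist in Spark today.

    // Uncorrelated: the sub-query is self-contained and can be analyzed eagerly.
    val subQ = spark.sql("select distinct filter_col from source")
    val df   = table.filter($"col".isin(subQ.toColumn))

    // Correlated: the inner plan references a column of the outer table, so it
    // can only be analyzed once the outer query resolves that reference; this
    // is where eager analysis would have to be turned off.
    val inner = source.where($"filter_col" === table("col")).select($"filter_col")
    val df2   = table.filter($"col".isin(inner.toColumn))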
On Thu, Apr 19, 2018 at 4:08 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Nick, thanks for raising this.
>
> It looks useful to have something in the DF API that behaves like
> sub-queries, but I’m not sure that passing a DF works. Making every method
> accept a DF that may contain matching data seems like it puts a lot of work
> on the API — which now has to accept a DF all over the place.
>
> What about exposing transforms that make it easy to coerce data to what
> the method needs? Instead of passing a dataframe, you’d pass df.toSet to
> isin:
>
>     val subQ = spark.sql("select distinct filter_col from source")
>     val df = table.filter($"col".isin(subQ.toSet))
>
> That also distinguishes between a sub-query and a correlated sub-query
> that uses values from the outer query. We would still need to come up with
> syntax for the correlated case, unless there’s a proposal already.
>
> rb
>
> On Mon, Apr 9, 2018 at 3:56 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> I just submitted SPARK-23945
>> <https://issues.apache.org/jira/browse/SPARK-23945> but wanted to double
>> check here to make sure I didn't miss something fundamental.
>>
>> Correlated subqueries are tracked at a high level in SPARK-18455
>> <https://issues.apache.org/jira/browse/SPARK-18455>, but it's not clear
>> to me whether they are "design-appropriate" for the DataFrame API.
>>
>> Are correlated subqueries a thing we can expect to have in the DataFrame
>> API?
>>
>> Nick
>
> --
> Ryan Blue
> Software Engineer
> Netflix
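For reference, the uncorrelated pattern Ryan sketches is already expressible with today's API; a minimal sketch below, where the outer table name "events" and its column "col" are assumed for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val subQ  = spark.sql("select distinct filter_col from source")
    val outer = spark.table("events")

    // Option 1: collect the distinct values to the driver and pass them to isin.
    // Only reasonable when the sub-query result is small.
    val values        = subQ.collect().map(_.get(0))
    val filteredLocal = outer.filter($"col".isin(values: _*))

    // Option 2: keep everything distributed with a left-semi join, which is
    // roughly what an IN sub-query typically plans to anyway.
    val filteredJoin = outer.join(subQ, outer("col") === subQ("filter_col"), "left_semi")

The open question in the thread is the correlated case, which neither option above covers.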