Nick, thanks for raising this.

It would be useful to have something in the DF API that behaves like a
sub-query, but I’m not sure that passing a DF is the right approach. Making
every method accept a DF that may contain matching data puts a lot of burden
on the API, which would then have to accept a DF all over the place.

What about exposing transforms that make it easy to coerce data to what the
method needs? Instead of passing a DataFrame, you’d pass subQ.toSet to isin:

val subQ = spark.sql("select distinct filter_col from source")
val df = table.filter($"col".isin(subQ.toSet))
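
(For reference, you can get close to this with today’s API by collecting the
sub-query result on the driver and splatting it into isin. That’s only viable
when the distinct value set is small enough to fit in driver memory; a rough
sketch, reusing the source/filter_col example above:)

// collect the distinct filter values to the driver; only safe when
// the sub-query result is small
val subQ = spark.sql("select distinct filter_col from source")
val values = subQ.collect().map(_.getAs[Any]("filter_col"))

// isin takes varargs, so splat the collected values into it
val df = table.filter($"col".isin(values: _*))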

That also distinguishes between a sub-query and a correlated sub-query that
uses values from the outer query. We would still need to come up with
syntax for the correlated case, unless there’s a proposal already.
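
As a stopgap for the correlated case, you can usually rewrite it as a join.
For example, an EXISTS sub-query correlated on a key column can be expressed
as a left semi join (the table and column names here are just for
illustration):

// keep rows of `table` that have a matching `key` in `other`, i.e.
// WHERE EXISTS (select 1 from other o where o.key = t.key)
val filtered = table.join(other, Seq("key"), "left_semi")

That’s a workaround, though, not a replacement for real syntax.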

rb

On Mon, Apr 9, 2018 at 3:56 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> I just submitted SPARK-23945
> <https://issues.apache.org/jira/browse/SPARK-23945> but wanted to double
> check here to make sure I didn't miss something fundamental.
>
> Correlated subqueries are tracked at a high level in SPARK-18455
> <https://issues.apache.org/jira/browse/SPARK-18455>, but it's not clear
> to me whether they are "design-appropriate" for the DataFrame API.
>
> Are correlated subqueries a thing we can expect to have in the DataFrame
> API?
>
> Nick
>
>


-- 
Ryan Blue
Software Engineer
Netflix
