Perhaps we can just have a function that turns a DataFrame into a Column? That'd work for both the correlated and uncorrelated cases, although in the correlated case we'd need to turn off eager analysis (otherwise there is no way to construct a valid DataFrame).
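To make the idea concrete, a purely hypothetical sketch (toColumn is a made-up name, not an existing Spark API; `table` and `source` are the DataFrames from the snippet below):

    // Hypothetical sketch only: toColumn does not exist in Spark today.

    // Uncorrelated: the sub-query is self-contained and can be analyzed eagerly.
    val subQ = spark.sql("select distinct filter_col from source")
    val df   = table.filter($"col".isin(subQ.toColumn))

    // Correlated: the inner plan references a column of the outer table, so it
    // can only be analyzed once the outer query resolves that reference; this
    // is where eager analysis would have to be turned off.
    val inner = source.where($"filter_col" === table("col")).select($"filter_col")
    val df2   = table.filter($"col".isin(inner.toColumn))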
On Thu, Apr 19, 2018 at 4:08 PM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> Nick, thanks for raising this.
>
> It looks useful to have something in the DF API that behaves like
> sub-queries, but I’m not sure that passing a DF works. Making every method
> accept a DF that may contain matching data seems like it puts a lot of work
> on the API — which now has to accept a DF all over the place.
>
> What about exposing transforms that make it easy to coerce data to what
> the method needs? Instead of passing a dataframe, you’d pass df.toSet to
> isin:
>
>     val subQ = spark.sql("select distinct filter_col from source")
>     val df = table.filter($"col".isin(subQ.toSet))
>
> That also distinguishes between a sub-query and a correlated sub-query
> that uses values from the outer query. We would still need to come up with
> syntax for the correlated case, unless there’s a proposal already.
>
> rb
>
> On Mon, Apr 9, 2018 at 3:56 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>
>> I just submitted SPARK-23945
>> <https://issues.apache.org/jira/browse/SPARK-23945> but wanted to double
>> check here to make sure I didn't miss something fundamental.
>>
>> Correlated subqueries are tracked at a high level in SPARK-18455
>> <https://issues.apache.org/jira/browse/SPARK-18455>, but it's not clear
>> to me whether they are "design-appropriate" for the DataFrame API.
>>
>> Are correlated subqueries a thing we can expect to have in the DataFrame
>> API?
>>
>> Nick
>
> --
> Ryan Blue
> Software Engineer
> Netflix
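For reference, the uncorrelated pattern Ryan sketches is already expressible with today's API; a minimal sketch below, where the outer table name "events" and its column "col" are assumed for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val subQ  = spark.sql("select distinct filter_col from source")
    val outer = spark.table("events")

    // Option 1: collect the distinct values to the driver and pass them to isin.
    // Only reasonable when the sub-query result is small.
    val values        = subQ.collect().map(_.get(0))
    val filteredLocal = outer.filter($"col".isin(values: _*))

    // Option 2: keep everything distributed with a left-semi join, which is
    // roughly what an IN sub-query typically plans to anyway.
    val filteredJoin = outer.join(subQ, outer("col") === subQ("filter_col"), "left_semi")

The open question in the thread is the correlated case, which neither option above covers.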