[ 
https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Florentino Sainz updated SPARK-31417:
-------------------------------------
    Description: 
*I would like to for someone to explore if the following feature makes sense. 
Allow users to use "isin" (or similar) predicates with Lists which are 
broadcasted.*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/61111172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having only Seq[Expression] in the case class, but 
also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
made a test-run with a predicate which receives and Broadcast variable and uses 
the value inside, and it works much better.

  was:
*I would like to for someone to explore if the following feature makes sense. 
Allow users to use "isin" (or similar) predicates with Lists which are 
broadcasted.*

As of now (AFAIK), users can only provide a list/sequence of elements which 
will be sent as part of the "In" predicate, which is part of the task, to the 
executors. So when this list is huge, this causes tasks to be "huge" (specially 
when the task number is big).

I'm coming from 
https://stackoverflow.com/questions/61111172/scala-spark-isin-broadcast-list 
(I'm the creator of the post, and also the 2nd answer). 

I know this is not common, but can be useful for people reading from "servers" 
which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete 
case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 
minute "calculating plan" to send the BIG tasks to the executors) versus 30 
minutes doing a full-scan (even if semi-filtered) on my table.

More or less I have a clear view (explained in stackoverflow) on what to modify 
If I wanted to create my own "BroadcastIn" in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
 however with my limited "idea", anyone who implemented In pushdown would have 
to re-adapt it to the new "BroadcastIn"

I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
for a corporation which uses that version).

Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
current case class "In" while keeping the compatibility with PushDown 
implementations, but not having the List as a val in the case class?* This is 
what makes the tasks huge, I made a test-run with a predicate which receives 
and Broadcast variable and uses the value inside, and it works much better.


> Allow broadcast variables when using isin code
> ----------------------------------------------
>
>                 Key: SPARK-31417
>                 URL: https://issues.apache.org/jira/browse/SPARK-31417
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: Florentino Sainz
>            Priority: Minor
>
> *I would like to for someone to explore if the following feature makes sense. 
> Allow users to use "isin" (or similar) predicates with Lists which are 
> broadcasted.*
> As of now (AFAIK), users can only provide a list/sequence of elements which 
> will be sent as part of the "In" predicate, which is part of the task, to the 
> executors. So when this list is huge, this causes tasks to be "huge" 
> (specially when the task number is big).
> I'm coming from 
> https://stackoverflow.com/questions/61111172/scala-spark-isin-broadcast-list 
> (I'm the creator of the post, and also the 2nd answer). 
> I know this is not common, but can be useful for people reading from 
> "servers" which are not in the same nodes than Spark (maybe S3/cloud?). In my 
> concrete case, querying Kudu with 200.000 elements in the isin takes 
> ~2minutes (+1 minute "calculating plan" to send the BIG tasks to the 
> executors) versus 30 minutes doing a full-scan (even if semi-filtered) on my 
> table.
> More or less I have a clear view (explained in stackoverflow) on what to 
> modify If I wanted to create my own "BroadcastIn" in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala,
>  however with my limited "idea", anyone who implemented In pushdown would 
> have to re-adapt it to the new "BroadcastIn"
> I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 
> 100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working 
> for a corporation which uses that version).
> Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an 
> expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend 
> current case class "In" while keeping the compatibility with PushDown 
> implementations, but not having only Seq[Expression] in the case class, but 
> also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I 
> made a test-run with a predicate which receives and Broadcast variable and 
> uses the value inside, and it works much better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to