[ 
https://issues.apache.org/jira/browse/SPARK-37536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rodrigo Boavida updated SPARK-37536:
------------------------------------
    Description: 
We have been using Spark in local mode as a small embedded, in-memory SQL 
database for our microservice.

Spark's powerful SQL features and flexibility enable developers to build 
efficient data-querying solutions. Because our solution deals with small 
datasets that must be queried through SQL at very low latencies, we found the 
embedded approach a very good fit.
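To make the usage pattern concrete, here is a minimal sketch of the embedded, local-mode setup described above (the object name, view name, and sample data are illustrative, not from the original report):

```scala
import org.apache.spark.sql.SparkSession

object EmbeddedQueryExample {
  def main(args: Array[String]): Unit = {
    // A single-JVM SparkSession used as an in-process SQL engine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("embedded-sql")
      .getOrCreate()
    import spark.implicits._

    // A small in-memory dataset registered as a temporary view.
    Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")
      .createOrReplaceTempView("events")

    // A low-latency aggregation; even in local mode, today's planner
    // still inserts a shuffle exchange for the GROUP BY.
    spark.sql("SELECT key, SUM(value) AS total FROM events GROUP BY key").show()

    spark.stop()
  }
}
```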

We found through experimentation that Spark in local mode gains significant 
performance improvements (on average 20-30%) when shuffling on aggregation 
operations is disabled. Shuffling is normally introduced when EnsureRequirements 
expands the query execution plan with ShuffleExchangeExec or 
BroadcastExchangeExec nodes; skipping that expansion avoids the overhead.

I will be raising a PR to propose introducing a new configuration variable:

*spark.sql.localMode.shuffle.enabled*

This variable will default to true. It will be checked when EnsureRequirements 
is applied during QueryExecution planning, in conjunction with a check that 
Spark is running in local mode; when the value is false in local mode, the 
execution plan is left unchanged (no exchange nodes are inserted).
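A rough sketch of the proposed check, to make the idea concrete. This is pseudocode against Spark's internals: the config name is the one proposed above, but the exact placement inside EnsureRequirements, and helpers such as ensureDistributionAndOrdering, are illustrative assumptions, not the final PR:

```scala
// Conceptually, inside EnsureRequirements#apply:
def apply(plan: SparkPlan): SparkPlan = {
  // Proposed flag; defaults to true so existing behavior is unchanged.
  val shuffleEnabled =
    conf.getConfString("spark.sql.localMode.shuffle.enabled", "true").toBoolean
  val isLocal = session.sparkContext.isLocal

  if (isLocal && !shuffleEnabled) {
    // Leave the plan as-is: no ShuffleExchangeExec / BroadcastExchangeExec added.
    plan
  } else {
    // Existing behavior: insert exchanges to satisfy distribution requirements.
    ensureDistributionAndOrdering(plan)
  }
}
```

An application would then opt in with something like `spark.conf.set("spark.sql.localMode.shuffle.enabled", "false")` on a local-mode session.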

Looking forward to any comments and feedback.

  was:
We have been using Spark on local mode, as a small embedded, in-memory SQL DB 
for our microservice.

Spark's powerful SQL features, and flexibility enables developers to build 
efficient data querying solutions. Due to the nature of our solution dealing 
with small datasets, which required to be queried through SQL, on very low 
latencies, we found the embedded approach a very good model.

We found through experimentation, that Spark on local mode, would gain 
significant performance improvements (on average between 20-30%) by disabling 
the shuffling on aggregation operations. This is done by expanding the query 
execution plan with ShuffleExchangeExec or the BroadcastExchangeExec.

I will be raising a PR, to propose introducing a new configuration variable 

*spark.shuffle.local.enabled*

This variable will default to true, and will be checked on the QueryExecution 
EnsureRequirements creation time, in conjunction with checking if Spark is 
running on local mode, will keep the execution plan unchanged if the value is 
false.

Looking forward any comments and feedback.


> Allow for API user to disable Shuffle Operations while running locally
> ----------------------------------------------------------------------
>
>                 Key: SPARK-37536
>                 URL: https://issues.apache.org/jira/browse/SPARK-37536
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.0
>         Environment: Spark running in local mode
>            Reporter: Rodrigo Boavida
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
