Rodrigo Boavida created SPARK-37536:
---------------------------------------

             Summary: Allow for API user to disable Shuffle Operations while 
running locally
                 Key: SPARK-37536
                 URL: https://issues.apache.org/jira/browse/SPARK-37536
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.3.0
         Environment: Spark running in local mode
            Reporter: Rodrigo Boavida


We have been using Spark in local mode as a small, embedded, in-memory SQL DB 
for our microservice.

Spark's powerful SQL features and flexibility enable developers to build 
efficient data querying solutions. Because our solution deals with small 
datasets that must be queried through SQL at very low latencies, we found the 
embedded approach a very good model.

We found through experimentation that Spark in local mode gains significant 
performance improvements (20-30% on average) when shuffling is disabled for 
aggregation operations. The shuffling is introduced when the query execution 
plan is expanded with ShuffleExchangeExec or BroadcastExchangeExec.
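
For readers who want to see the exchange in question, here is a minimal sketch 
that counts the ShuffleExchangeExec nodes in the physical plan of a small 
aggregation in local mode. The object name, the sample query and the partition 
count are illustrative assumptions, not taken from our service. Adaptive query 
execution is switched off only so that the exchange node is directly visible in 
the plan tree.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Illustrative helper (hypothetical name): builds a local session, plans a
// small aggregation and reports how many shuffle exchanges the plan contains.
object InspectShuffle {
  def countShuffles(): Int = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("inspect-shuffle")
      // With AQE on, the plan is wrapped in AdaptiveSparkPlanExec and the
      // exchange is not reachable through a plain tree traversal.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    try {
      // Four input partitions, so the aggregation cannot be satisfied
      // by a single-partition plan.
      val counts = spark.range(0L, 1000L, 1L, 4)
        .groupBy(($"id" % 10).as("bucket"))
        .count()

      val plan = counts.queryExecution.executedPlan
      println(plan.treeString)

      // Collect every shuffle exchange present in the planned tree.
      plan.collect { case s: ShuffleExchangeExec => s }.size
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"shuffle exchanges in plan: ${countShuffles()}")
}
```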

I will be raising a PR to propose a new configuration variable:

*spark.shuffle.local.enabled*

This variable will default to true and will be checked when QueryExecution 
creates the EnsureRequirements rule. If Spark is running in local mode and the 
value is false, the execution plan will be kept unchanged.
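
The decision described above can be sketched in plain Scala with no Spark 
dependency. This is only an illustration of the intended check, not the code of 
the PR: the object and method names are hypothetical, the configuration is 
modelled as a simple Map, and "local mode" is assumed to mean a master URL of 
the form local or local[N].

```scala
// Hypothetical sketch of the proposed gating logic. Only the config key comes
// from this issue; it is a proposal and does not exist in released Spark.
object LocalShuffleGate {
  val LocalShuffleEnabledKey = "spark.shuffle.local.enabled"

  // Assumption: local mode is identified by "local", "local[4]", "local[*]", etc.
  def isLocalMaster(master: String): Boolean =
    master == "local" || master.startsWith("local[")

  // Returns true when exchanges should be inserted as usual, and false only
  // when Spark runs locally AND the proposed flag is explicitly set to false.
  def shouldInsertExchanges(master: String, conf: Map[String, String]): Boolean = {
    // A missing key means the default, which is true.
    val enabled = conf.get(LocalShuffleEnabledKey).forall(_.toBoolean)
    !(isLocalMaster(master) && !enabled)
  }
}
```

With this shape, a cluster deployment ignores the flag entirely, so setting it 
to false by mistake outside local mode would leave the plan as it is today.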

Looking forward to any comments and feedback.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
