[ https://issues.apache.org/jira/browse/SPARK-37536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rodrigo Boavida updated SPARK-37536:
------------------------------------
    Description: 
We have been using Spark in local mode as a small, embedded, in-memory SQL DB for our microservice. Spark's powerful SQL features and flexibility enable developers to build efficient data-querying solutions. Because our solution deals with small datasets that must be queried through SQL at very low latencies, we found the embedded approach a very good model.

We found through experimentation that Spark in local mode gains significant performance improvements (on average between 20-30%) when shuffling is disabled for aggregation operations. Shuffles are introduced when the query execution plan is expanded with ShuffleExchangeExec or BroadcastExchangeExec nodes.

I will be raising a PR to propose introducing a new configuration variable, *spark.sql.localMode.shuffle.enabled*. It will default to true and will be checked when EnsureRequirements is applied during query execution planning; if Spark is running in local mode and the value is false, the execution plan is left unchanged (no exchange nodes are inserted).

Looking forward to any comments and feedback.

  was:
We have been using Spark in local mode as a small, embedded, in-memory SQL DB for our microservice. Spark's powerful SQL features and flexibility enable developers to build efficient data-querying solutions. Because our solution deals with small datasets that must be queried through SQL at very low latencies, we found the embedded approach a very good model.

We found through experimentation that Spark in local mode gains significant performance improvements (on average between 20-30%) when shuffling is disabled for aggregation operations. Shuffles are introduced when the query execution plan is expanded with ShuffleExchangeExec or BroadcastExchangeExec nodes.

I will be raising a PR to propose introducing a new configuration variable, *spark.shuffle.local.enabled*. It will default to true and will be checked when EnsureRequirements is applied during query execution planning; if Spark is running in local mode and the value is false, the execution plan is left unchanged (no exchange nodes are inserted).

Looking forward to any comments and feedback.
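To make the use case concrete, here is a minimal sketch of the embedded local-mode pattern described above. The only assumption beyond stock Spark is the *spark.sql.localMode.shuffle.enabled* setting, which is the configuration variable this ticket proposes and which does not exist in current releases:

{code:scala}
import org.apache.spark.sql.SparkSession

object EmbeddedSqlExample {
  def main(args: Array[String]): Unit = {
    // Local-mode SparkSession used as an embedded, in-memory SQL engine.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("embedded-sql")
      // Common low-latency tuning for tiny datasets.
      .config("spark.sql.shuffle.partitions", "1")
      // Proposed flag from this ticket -- not available in current Spark.
      .config("spark.sql.localMode.shuffle.enabled", "false")
      .getOrCreate()

    import spark.implicits._

    // A small dataset registered for SQL access, as in the use case above.
    Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")
      .createOrReplaceTempView("events")

    // A GROUP BY like this is where EnsureRequirements would normally
    // insert a shuffle exchange.
    spark.sql("SELECT key, SUM(value) AS total FROM events GROUP BY key").show()

    spark.stop()
  }
}
{code}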
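The guard itself could look roughly like the following. This is only an illustrative sketch of the check the PR would add around EnsureRequirements; the helper name is hypothetical, and only the config key comes from this proposal:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical helper: returns true when exchange insertion should be skipped.
def shouldSkipShuffleInsertion(session: SparkSession): Boolean = {
  // Only applies when the driver runs with a local master (local, local[*], ...).
  val isLocalMode = session.sparkContext.isLocal
  // Proposed config, defaulting to true so current behaviour is preserved.
  val shuffleEnabled =
    session.conf.get("spark.sql.localMode.shuffle.enabled", "true").toBoolean
  isLocalMode && !shuffleEnabled
}

// Sketch of the hook inside EnsureRequirements.apply:
//   if (shouldSkipShuffleInsertion(session)) plan   // leave plan unchanged
//   else <existing logic inserting ShuffleExchangeExec / BroadcastExchangeExec>
{code}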
> Allow for API user to disable Shuffle Operations while running locally
> ----------------------------------------------------------------------
>
>                 Key: SPARK-37536
>                 URL: https://issues.apache.org/jira/browse/SPARK-37536
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.0
>         Environment: Spark running in local mode
>            Reporter: Rodrigo Boavida
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org