[ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-41053:
-----------------------------------
    Summary: Better Spark UI scalability and Driver stability for large 
applications  (was: Support disk-based KV store in Spark live UI)

> Better Spark UI scalability and Driver stability for large applications
> -----------------------------------------------------------------------
>
>                 Key: SPARK-41053
>                 URL: https://issues.apache.org/jira/browse/SPARK-41053
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core, Web UI
>    Affects Versions: 3.4.0
>            Reporter: Gengliang Wang
>            Priority: Major
>
> The current architecture of the Spark live UI and the Spark history server 
> (SHS) is too simple to serve large clusters and heavy workloads:
>  * Spark stores all the live UI data in memory. The size can reach a few GB 
> and affects the driver's stability (OOM).
>  * Only 1000 queries are stored. Note that we can't simply raise this limit 
> under the current architecture. I did some memory profiling: storing one 
> query execution detail can take 800KB, while storing one task requires 
> 0.3KB. So for 1000 SQL queries with 1000 * 2000 tasks, the memory usage for 
> query execution and task data will be 1.4GB. The Spark UI stores UI data 
> for jobs/stages/executors as well, so storing 10k queries may take more 
> than 14GB.
>  * SHS has to parse JSON-format event logs on its initial start. The 
> uncompressed event logs can be as large as a few GB, and parsing them can 
> be quite slow. Some users reported waiting more than half an hour.
>  
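
The 1.4GB and 14GB figures follow from the per-item sizes quoted above. A minimal sketch of the arithmetic, assuming decimal units (1GB = 10^6 KB) and the 2000-tasks-per-query ratio used in the example; the function name is illustrative only and is not part of Spark:

```python
# Per-item sizes taken from the memory profiling quoted above.
QUERY_KB = 800.0  # one query execution detail
TASK_KB = 0.3     # one task


def ui_memory_gb(num_queries, tasks_per_query=2000):
    """Estimate memory for query execution and task data, in decimal GB."""
    query_kb = num_queries * QUERY_KB
    task_kb = num_queries * tasks_per_query * TASK_KB
    return (query_kb + task_kb) / 1e6


print(round(ui_memory_gb(1000), 2))    # 1.4
print(round(ui_memory_gb(10_000), 2))  # 14.0
```

The estimate covers query execution and task data only. Data for jobs/stages/executors comes on top, which is why the 10k-query case is described as more than 14GB.
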
> The proposal is to:
>  # Store all the live UI data in a local RocksDB instance with protobuf 
> serialization.
>  # Let SHS use the live UI's RocksDB files directly.
>  # If the RocksDB files are unavailable to SHS, write event logs in 
> protobuf for faster replay.
>  
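
The idea behind the first two proposal items can be illustrated with a toy example. This is only a sketch under stated assumptions, not Spark's implementation: `sqlite3` stands in for RocksDB, a fixed `struct` layout stands in for protobuf, and the `DiskKVStore` class, the key naming, and the task schema are all hypothetical.

```python
import os
import sqlite3
import struct
import tempfile

# Hypothetical fixed schema for a task record: task id, stage id, duration (ms).
TASK_FORMAT = "<qqq"


class DiskKVStore:
    """Toy disk-backed key-value store; sqlite3 is a stand-in for RocksDB."""

    def __init__(self, path):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)"
        )

    def put(self, key, value):
        self._db.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value)
        )
        self._db.commit()

    def get(self, key):
        row = self._db.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)
        ).fetchone()
        return None if row is None else row[0]

    def close(self):
        self._db.close()


path = os.path.join(tempfile.mkdtemp(), "ui_store.db")

# "Live UI" side: write the record to disk instead of holding it in memory.
live = DiskKVStore(path)
live.put("task/42", struct.pack(TASK_FORMAT, 42, 7, 1500))
live.close()

# "History server" side: open the same file directly, with no event-log replay.
shs = DiskKVStore(path)
task_id, stage_id, duration_ms = struct.unpack(TASK_FORMAT, shs.get("task/42"))
shs.close()
print(task_id, stage_id, duration_ms)  # 42 7 1500
```

The point of the sketch is that the writer and the reader share one on-disk file in a compact binary encoding, so the reader neither keeps everything in memory nor re-parses a JSON log.
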


