darion yaphet created SPARK-56734:
-------------------------------------
Summary: Optimize RocksDBPersistenceEngine by segregating data
into distinct Column Families
Key: SPARK-56734
URL: https://issues.apache.org/jira/browse/SPARK-56734
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 4.3.0
Reporter: darion yaphet
*Motivation*
Currently, {{RocksDBPersistenceEngine}} in the Spark Master stores all metadata
(Applications, Workers, Drivers) in a single default Column Family, using key
prefixes to distinguish them. This causes significant performance issues during
recovery:
* *Inefficient Scanning:* Reading a specific type (e.g., Applications) requires
scanning the entire database and performing expensive string prefix matching,
leading to *O(N_total)* complexity.
* *High Overhead:* The current approach wastes CPU on string operations and
causes cache contention between different data types.
*Proposed Solution*
Refactor {{RocksDBPersistenceEngine}} to use native *Column Families* for data
isolation (e.g., separate CFs for Apps, Workers, and Drivers).
* Eliminate key prefixing logic and route data directly to the corresponding
{{ColumnFamilyHandle}}.
* Allow the engine to scan only the relevant Column Family during recovery.
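The difference between the two layouts can be sketched with a toy in-memory model. This is not the Spark implementation and does not use the RocksDB API: a {{TreeMap}} stands in for a sorted key-value store, and the class name, the {{type_id}} key format, and the {{visited}} counter are hypothetical, introduced only for illustration. The prefix-scan read follows the behavior as this issue describes it (visit every key, keep those that match the prefix).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: TreeMap stands in for a sorted key-value store.
final class MetadataLayouts {
    // Layout A: one shared keyspace, metadata type encoded as a key prefix.
    final TreeMap<String, String> single = new TreeMap<>();
    // Layout B: one sorted map per metadata type (models one Column Family each).
    final Map<String, TreeMap<String, String>> families = new HashMap<>();
    // Number of entries touched by the most recent read.
    int visited;

    // Write the same record into both layouts so the reads can be compared.
    void put(String type, String id, String value) {
        single.put(type + "_" + id, value);
        families.computeIfAbsent(type, t -> new TreeMap<>()).put(id, value);
    }

    // Layout A read: walks every entry and filters by string prefix.
    List<String> readByPrefixScan(String type) {
        visited = 0;
        String prefix = type + "_";
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : single.entrySet()) {
            visited++;
            if (e.getKey().startsWith(prefix)) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    // Layout B read: walks only the entries stored for the requested type.
    List<String> readFamily(String type) {
        visited = 0;
        List<String> out = new ArrayList<>();
        TreeMap<String, String> family = families.get(type);
        if (family == null) {
            return out;
        }
        for (String value : family.values()) {
            visited++;
            out.add(value);
        }
        return out;
    }
}
```

With two applications, three workers, and one driver stored, {{readByPrefixScan("app")}} touches all six entries while {{readFamily("app")}} touches two, and both return the same records. In the RocksDB Java API, the per-type maps would correspond to {{ColumnFamilyDescriptor}} entries passed to {{RocksDB.open}}, with reads going through {{newIterator(ColumnFamilyHandle)}}; how the engine would wire these together is left to the actual patch.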
*Benefits*
* *Faster Recovery:* Reduces the cost of reading one metadata type from
*O(N_total)* to *O(N_type)*, which should shorten Master startup time.
* *Better Performance:* Removes string matching overhead and improves Block
Cache hit rates.
* *Granular Control:* Enables independent configuration (e.g., compression,
TTL) for different metadata types.