[DISCUSS] Replace IMap with RocksDB

Hi all,

I've drafted a high-level design for a RocksDB-based state backend for SeaTunnel. This is the first phase, and I'd like to collect feedback from the community.
Below is the draft:

SeaTunnel RocksDB-based State Backend: High-Level Design Draft

Motivation

Hazelcast IMap shows performance limitations under high-load environments. See issue #9851 for more details: https://github.com/apache/seatunnel/issues/9851

*Goal:* Replace Hazelcast IMap with a custom state backend based on local RocksDB, using Hazelcast PartitionService only for partition-to-node mapping and cluster membership.

------------------------------
Design Principles

- *Local State Storage:* Each node stores the state of its assigned partitions in a local RocksDB instance (a rough sketch of such a backend is included near the end of this draft).
- *Partition Routing:* Hazelcast PartitionService is used to determine the owner node for a given key (partitioning/routing only).
- *No Hazelcast Data Structures:* Hazelcast IMap will not be used to store business data or state.
- *Distributed Access:* State access (get/put/delete) must be routed to the partition owner. If the current node is not the owner, it forwards the request to the correct node.
- *Cluster-wide Operations:* For operations that require aggregating data from all partitions/nodes (e.g., metrics), a coordinator node collects results from all nodes and aggregates them.
- *Future-proofing:* Partition migration/rebalancing, state replication, and checkpoint integration will be addressed in subsequent phases.

------------------------------
High-Level Architecture

1. *Partition Ownership*
   - Use PartitionService to determine which node owns each partition/key.
   - Each node only reads/writes state for the partitions it owns.
2. *Local State Backend*
   - Use RocksDB as the local, persistent store for all state data belonging to owned partitions.
3. *State Access Protocol*
   - For any key-based operation:
     - If the current node is the owner, access RocksDB directly.
     - If not, forward the request to the owner node (via Hazelcast IExecutorService; a forwarding sketch is included near the end of this draft).
4. *Cluster-wide Operations*
   - For global queries (e.g., collect all metrics), the coordinator node sends requests to all nodes, gathers their local results, and aggregates the data (an aggregation sketch is also included near the end of this draft).

------------------------------
Example State Access Flow

Partition partition = hazelcast.getPartitionService().getPartition(key);
Member owner = partition.getOwner();
if (owner.equals(hazelcast.getCluster().getLocalMember())) {
    // Access local RocksDB
    value = localRocksdbGet(key);
} else {
    // Forward request to owner node
    value = remoteGet(owner.getAddress(), key);
}

------------------------------
Out of Scope / Future Work

- *Partition Rebalancing & Data Migration:* If the cluster topology changes (node join/leave), partition ownership may change. The current phase does not handle automatic migration of partition data between nodes, which can result in temporary data unavailability until migration logic is implemented in a future phase.
- *Replication & HA:* There is no built-in replication in this phase, so a single-node failure may result in data loss. Future phases will address replication and recovery, most likely by integrating with external storage, which is generally required for HA in distributed systems.
- *External Checkpoint/Restore Integration:* Not covered in this phase; planned for later, together with replication and HA.

------------------------------
Next Steps

1. Collect feedback on this high-level design from the community.
2. Upon agreement, proceed to detailed design (class diagrams, API specs, migration flow, error handling, etc.).
3. Develop a prototype of the local RocksDB backend.
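------------------------------
Illustrative Sketches

To make the ideas above a bit more concrete, here are a few rough sketches. They are illustrative only, not proposed final APIs; all class, method, and executor names (e.g., LocalRocksDbStateBackend, StateGetTask, "state-access") are placeholders I made up for this draft.

The first sketch shows the local state backend idea: each node opens its own RocksDB instance and prefixes every key with the owning partition id, so one partition's state forms a contiguous key range (which should help with future migration and cleanup).

// Sketch only: a minimal node-local RocksDB state backend.
// All names here are placeholders, not a proposed final API.
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class LocalRocksDbStateBackend implements AutoCloseable {

    static {
        RocksDB.loadLibrary();
    }

    private final RocksDB db;

    public LocalRocksDbStateBackend(String path) throws RocksDBException {
        // One RocksDB instance per node, holding state for all owned partitions.
        this.db = RocksDB.open(new Options().setCreateIfMissing(true), path);
    }

    // Prefix every key with its partition id so a partition's state can later be
    // scanned or migrated as a contiguous key range.
    private static byte[] composeKey(int partitionId, byte[] key) {
        byte[] composed = new byte[4 + key.length];
        composed[0] = (byte) (partitionId >>> 24);
        composed[1] = (byte) (partitionId >>> 16);
        composed[2] = (byte) (partitionId >>> 8);
        composed[3] = (byte) partitionId;
        System.arraycopy(key, 0, composed, 4, key.length);
        return composed;
    }

    public void put(int partitionId, byte[] key, byte[] value) throws RocksDBException {
        db.put(composeKey(partitionId, key), value);
    }

    public byte[] get(int partitionId, byte[] key) throws RocksDBException {
        return db.get(composeKey(partitionId, key));
    }

    public void delete(int partitionId, byte[] key) throws RocksDBException {
        db.delete(composeKey(partitionId, key));
    }

    @Override
    public void close() {
        db.close();
    }
}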
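The second sketch shows the forwarding path from the state access flow above, using Hazelcast IExecutorService: if the current node is not the partition owner, a serializable task is submitted to the owner member and executed there. Package names assume Hazelcast 5.x; StateGetTask and localRocksdbGet() are the same kind of placeholder helpers used in the flow example.

// Sketch only: forwarding a get() to the partition owner via IExecutorService.
// Package names assume Hazelcast 5.x; task and helper names are placeholders.
import com.hazelcast.cluster.Member;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.partition.Partition;

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class ForwardingStateAccessor {

    private final HazelcastInstance hazelcast;

    public ForwardingStateAccessor(HazelcastInstance hazelcast) {
        this.hazelcast = hazelcast;
    }

    public byte[] get(byte[] key) throws Exception {
        Partition partition = hazelcast.getPartitionService().getPartition(key);
        Member owner = partition.getOwner();
        if (owner.localMember()) {
            // This node owns the partition: read the local RocksDB directly.
            return localRocksdbGet(key);
        }
        // Not the owner: run a serializable task on the owner member instead.
        IExecutorService executor = hazelcast.getExecutorService("state-access");
        Future<byte[]> future = executor.submitToMember(new StateGetTask(key), owner);
        return future.get();
    }

    // Executes on the owner node and reads from that node's local RocksDB.
    public static class StateGetTask implements Callable<byte[]>, Serializable {

        private final byte[] key;

        public StateGetTask(byte[] key) {
            this.key = key;
        }

        @Override
        public byte[] call() {
            return localRocksdbGet(key);
        }
    }

    // Placeholder for the node-local RocksDB read shown in the flow above.
    private static byte[] localRocksdbGet(byte[] key) {
        throw new UnsupportedOperationException("wire to the local RocksDB backend");
    }
}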
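The third sketch shows a cluster-wide operation: the coordinator submits a task to all members via submitToAllMembers() and merges the per-node results into a single view. MetricsSnapshotTask and the merge logic are placeholders for whatever metric format we settle on.

// Sketch only: coordinator-side aggregation of per-node results.
import com.hazelcast.cluster.Member;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class ClusterMetricsCollector {

    // Runs on every member and returns that node's locally stored metrics.
    public static class MetricsSnapshotTask
            implements Callable<Map<String, Long>>, Serializable {
        @Override
        public Map<String, Long> call() {
            // Placeholder: build the node-local metric snapshot from the
            // partitions this node owns in its local RocksDB.
            return new HashMap<>();
        }
    }

    public Map<String, Long> collect(HazelcastInstance hazelcast) throws Exception {
        IExecutorService executor = hazelcast.getExecutorService("state-access");
        Map<Member, Future<Map<String, Long>>> futures =
                executor.submitToAllMembers(new MetricsSnapshotTask());

        // Merge per-node snapshots into a single cluster-wide view.
        Map<String, Long> merged = new HashMap<>();
        for (Future<Map<String, Long>> future : futures.values()) {
            future.get().forEach((name, value) -> merged.merge(name, value, Long::sum));
        }
        return merged;
    }
}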
------------------------------

*Detailed class diagrams and API specifications will be provided in a follow-up document after initial feedback and community agreement.*

Looking forward to your feedback. Thanks!
