+1. Thanks Doyeon!
Doyeon Kim <[email protected]> wrote on Wednesday, September 24, 2025 at 12:59:
>
> [DISCUSS] Replace IMap with RocksDB
>
> Hi all,
>
> I've drafted a high-level design for a RocksDB-based state backend for
> SeaTunnel. This is the first phase, and I'd like to collect feedback
> from the community.
>
> Below is the draft:
>
> SeaTunnel RocksDB-based State Backend: High-Level Design Draft
>
> Motivation
>
> Hazelcast IMap shows performance limitations under high-load
> environments. See issue #9851 for more details
> (https://github.com/apache/seatunnel/issues/9851).
>
> *Goal:*
>
> Replace Hazelcast IMap with a custom state backend based on local
> RocksDB, using Hazelcast PartitionService only for partition-to-node
> mapping and cluster membership.
>
> ------------------------------
> Design Principles
>
> - *Local State Storage:*
>   Each node stores its assigned partition state in a local RocksDB
>   instance.
>
> - *Partition Routing:*
>   Hazelcast PartitionService is used to determine the owner node for a
>   given key (partitioning/routing only).
>
> - *No Hazelcast Data Structures:*
>   Hazelcast IMap will not be used for business data/state storage.
>
> - *Distributed Access:*
>   State access (get/put/delete) must be routed to the partition owner.
>   If the current node is not the owner, it must forward the request to
>   the correct node.
>
> - *Cluster-wide Operations:*
>   For operations that require aggregating data from all
>   partitions/nodes (e.g., metrics), a coordinator node collects
>   results from all nodes and aggregates them.
>
> - *Future-proofing:*
>   Partition migration/rebalancing, state replication, and checkpoint
>   integration will be addressed in subsequent phases.
>
> ------------------------------
> High-Level Architecture
>
> 1. *Partition Ownership*
>    - Use PartitionService to determine which node owns each
>      partition/key.
>    - Each node only reads/writes state for the partitions it owns.
>
> 2. *Local State Backend*
>    - Use RocksDB as the local, persistent store for all state data
>      belonging to owned partitions.
>
> 3. *State Access Protocol*
>    - For any key-based operation:
>      - If the current node is the owner, access RocksDB directly.
>      - If not, forward the request to the owner node (via Hazelcast
>        IExecutorService).
>
> 4. *Cluster-wide Operations*
>    - For global queries (e.g., collect all metrics), the coordinator
>      node sends requests to all nodes, gathers their local results,
>      and aggregates the data.
>
> ------------------------------
> Example State Access Flow
>
> Partition partition = hazelcast.getPartitionService().getPartition(key);
> Member owner = partition.getOwner();
>
> byte[] value;
> if (owner.equals(hazelcast.getCluster().getLocalMember())) {
>     // Access local RocksDB
>     value = localRocksdbGet(key);
> } else {
>     // Forward request to owner node
>     value = remoteGet(owner.getAddress(), key);
> }
>
> ------------------------------
> Out of Scope / Future Work
>
> - *Partition Rebalancing & Data Migration:*
>   If cluster topology changes (node join/leave), partition ownership
>   may change. The current phase does not handle automatic migration of
>   partition data between nodes, which can result in temporary data
>   unavailability until migration logic is implemented in a future
>   phase.
>
> - *Replication & HA:*
>   There is no built-in replication in this phase, so a single-node
>   failure may result in data loss. In future phases, we will address
>   replication and recovery, most likely by integrating with external
>   storage, which is generally required for HA in distributed systems.
>
> - *External Checkpoint/Restore Integration:*
>   Not covered in this phase, but planned for later together with
>   replication and HA.
>
> ------------------------------
> Next Steps
>
> 1. Collect feedback on this high-level design from the community.
> 2. Upon agreement, proceed to detailed design (class diagrams, API
>    specs, migration flow, error handling, etc.).
> 3. Develop a prototype for the local RocksDB backend.
>
> ------------------------------
>
> *Detailed class diagrams and API specifications will be provided in a
> follow-up document after initial feedback and community agreement.*
>
> Looking forward to your feedback. Thanks!
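To make the owner-lookup discussion concrete, here is a minimal, cluster-free sketch of the routing decision described in the draft's "Example State Access Flow". The `PartitionRouter`/`Member` names and the hash-mod partition assignment below are illustrative assumptions for discussion only; in the actual design, both the key-to-partition mapping and the partition table are owned by Hazelcast PartitionService.

```java
import java.util.List;

// Illustrative only: the real implementation delegates partition lookup
// to Hazelcast PartitionService; these types are stand-ins.
final class Member {
    final String address;
    final boolean local;

    Member(String address, boolean local) {
        this.address = address;
        this.local = local;
    }
}

final class PartitionRouter {
    private final int partitionCount;
    private final List<Member> members;

    PartitionRouter(int partitionCount, List<Member> members) {
        this.partitionCount = partitionCount;
        this.members = members;
    }

    // Map a key to a partition id (Hazelcast does this internally).
    int partitionId(String key) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Map a partition to its owner. Here a static round-robin assignment;
    // Hazelcast maintains a dynamic partition table instead.
    Member ownerOf(String key) {
        return members.get(partitionId(key) % members.size());
    }

    // The routing decision from the quoted flow: local RocksDB access
    // when this node owns the partition, otherwise forward to the owner
    // (e.g., via IExecutorService).
    String route(String key) {
        Member owner = ownerOf(key);
        return owner.local ? "local-rocksdb-get" : "forward-to:" + owner.address;
    }
}
```

One property worth noting: because both the key hash and the partition table are deterministic, every node resolves the same owner for a given key, so no coordination is needed on the read/write path itself.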

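On the "Cluster-wide Operations" point, the coordinator-side merge could look like the sketch below. In the real design each node would answer a coordinator request (e.g., via IExecutorService) with metrics read from its local RocksDB; here the per-node results are plain maps so only the aggregation step is shown, and the `MetricsAggregator` name and sum-merge semantics are my assumptions, not part of the draft.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative coordinator-side aggregation: merge per-node metric
// counters by summing the values of identical metric names.
final class MetricsAggregator {
    static Map<String, Long> aggregate(List<Map<String, Long>> perNodeResults) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> nodeResult : perNodeResults) {
            // Long::sum combines counts when two nodes report the same metric.
            nodeResult.forEach((metric, value) -> total.merge(metric, value, Long::sum));
        }
        return total;
    }
}
```

Whether a simple sum is the right merge for every metric (vs. max, min, or last-write) is probably one of the details for the follow-up design document.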