+1. Thanks Doyeon!

On Wed, Sep 24, 2025 at 12:59, Doyeon Kim <[email protected]> wrote:
>
> [DISCUSS] Replace IMap with RocksDB
> Hi all,
>
> I’ve drafted a high-level design for a RocksDB-based state backend for
> SeaTunnel.
> This is the first phase and I’d like to collect feedback from the community.
>
> Below is the draft:
> SeaTunnel RocksDB-based State Backend: High-Level Design Draft
>
> Motivation
>
> Hazelcast IMap shows performance limitations in high-load environments.
>
> See Issue #9851 for more details
> (https://github.com/apache/seatunnel/issues/9851).
>
> *Goal:*
>
> Replace Hazelcast IMap with a custom state backend based on local RocksDB,
> using Hazelcast PartitionService only for partition-to-node mapping and
> cluster membership.
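>
> For illustration only, the backend could expose a minimal key-based API along
> the lines of the sketch below (all names are placeholders, not a final
> proposal); routing to the partition owner would happen behind this interface:
>
> // Hypothetical sketch only, not a final API proposal.
> public interface StateBackend {
>     // Read the value for a key; routed to the partition owner.
>     byte[] get(byte[] key);
>
>     // Write a key/value pair on the partition owner.
>     void put(byte[] key, byte[] value);
>
>     // Remove a key on the partition owner.
>     void delete(byte[] key);
> }
>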
> ------------------------------
> Design Principles
>
>    - *Local State Storage:* Each node stores its assigned partition state
>      in a local RocksDB instance (see the sketch after this list).
>    - *Partition Routing:* Hazelcast PartitionService is used to determine
>      the owner node for a given key (partitioning/routing only).
>    - *No Hazelcast Data Structures:* Hazelcast IMap will not be used for
>      business data/state storage.
>    - *Distributed Access:* State access (get/put/delete) must be routed to
>      the partition owner. If the current node is not the owner, it must
>      forward the request to the correct node.
>    - *Cluster-wide Operations:* For operations that require aggregating
>      data from all partitions/nodes (e.g., metrics), a coordinator node
>      collects results from all nodes and aggregates them.
>    - *Future-proofing:* Partition migration/rebalancing, state replication,
>      and checkpoint integration will be addressed in subsequent phases.
>
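> As a rough sketch of the "Local State Storage" principle above, assuming the
> plain RocksDB Java API (column-family layout, key serialization, and tuning
> are left to the detailed design):
>
> import org.rocksdb.Options;
> import org.rocksdb.RocksDB;
> import org.rocksdb.RocksDBException;
>
> // Minimal per-node store holding the state of the partitions this node owns.
> public class LocalRocksDbStore implements AutoCloseable {
>     static { RocksDB.loadLibrary(); }
>
>     private final Options options;
>     private final RocksDB db;
>
>     public LocalRocksDbStore(String path) throws RocksDBException {
>         this.options = new Options().setCreateIfMissing(true);
>         this.db = RocksDB.open(options, path);
>     }
>
>     public byte[] get(byte[] key) throws RocksDBException { return db.get(key); }
>
>     public void put(byte[] key, byte[] value) throws RocksDBException {
>         db.put(key, value);
>     }
>
>     public void delete(byte[] key) throws RocksDBException { db.delete(key); }
>
>     @Override
>     public void close() {
>         db.close();
>         options.close();
>     }
> }
>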
> ------------------------------
> High-Level Architecture
>
>    1. *Partition Ownership*
>       - Use PartitionService to determine which node owns each
>       partition/key.
>       - Each node only reads/writes state for the partitions it owns.
>    2. *Local State Backend*
>       - Use RocksDB as the local, persistent store for all state data
>       belonging to owned partitions.
>    3. *State Access Protocol*
>       - For any key-based operation:
>          - If the current node is the owner, access RocksDB directly.
>          - If not, forward the request to the owner node (via Hazelcast
>          IExecutorService).
>    4. *Cluster-wide Operations*
>       - For global queries (e.g., collect all metrics), the coordinator
>       node sends requests to all nodes, gathers their local results, and
>       aggregates the data (see the sketch after this list).
>
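> To make item 4 concrete, a coordinator could fan a task out to every member
> and merge the per-node results. A rough sketch, assuming Hazelcast
> IExecutorService; MetricsSnapshotTask is a hypothetical serializable Callable
> that reads the node-local RocksDB state and returns that node's metrics:
>
> // Coordinator side: ask every node for its local metrics and merge them.
> IExecutorService executor = hazelcast.getExecutorService("state-backend");
> Map<Member, Future<Map<String, Long>>> perNode =
>         executor.submitToAllMembers(new MetricsSnapshotTask());
>
> Map<String, Long> aggregated = new HashMap<>();
> for (Future<Map<String, Long>> future : perNode.values()) {
>     // Each entry is one node's local result; merge counters by name.
>     future.get().forEach((name, count) ->
>             aggregated.merge(name, count, Long::sum));
> }
>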
> ------------------------------
> Example State Access Flow
>
> Partition partition = hazelcast.getPartitionService().getPartition(key);
> Member owner = partition.getOwner();
>
> byte[] value;
> if (owner.equals(hazelcast.getCluster().getLocalMember())) {
>     // Owner node: read directly from the local RocksDB instance
>     value = localRocksdbGet(key);
> } else {
>     // Not the owner: forward the request to the owner node
>     value = remoteGet(owner.getAddress(), key);
> }
>
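> In the flow above, localRocksdbGet and remoteGet are placeholders. One
> possible shape for the non-owner path, assuming Hazelcast IExecutorService
> (which targets a Member rather than an Address) and a hypothetical
> LocalStateRegistry that exposes the node-local RocksDB store (package names
> assume Hazelcast 5.x):
>
> import com.hazelcast.cluster.Member;
> import com.hazelcast.core.HazelcastInstance;
> import com.hazelcast.core.HazelcastInstanceAware;
> import com.hazelcast.core.IExecutorService;
> import java.io.Serializable;
> import java.util.concurrent.Callable;
> import java.util.concurrent.Future;
>
> // Runs on the owner node and reads from its local RocksDB instance.
> public class RemoteGetTask
>         implements Callable<byte[]>, Serializable, HazelcastInstanceAware {
>     private final byte[] key;
>     private transient HazelcastInstance hazelcast;
>
>     public RemoteGetTask(byte[] key) { this.key = key; }
>
>     @Override
>     public void setHazelcastInstance(HazelcastInstance hazelcast) {
>         this.hazelcast = hazelcast;
>     }
>
>     @Override
>     public byte[] call() throws Exception {
>         // Placeholder lookup of this node's local store.
>         return LocalStateRegistry.store(hazelcast).get(key);
>     }
> }
>
> // Caller side (the non-owner branch above):
> IExecutorService executor = hazelcast.getExecutorService("state-backend");
> Future<byte[]> future = executor.submitToMember(new RemoteGetTask(key), owner);
> byte[] value = future.get();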
>
> ------------------------------
> Out of Scope / Future Work
>
>    - *Partition Rebalancing & Data Migration:* If the cluster topology
>      changes (node join/leave), partition ownership may change. The current
>      phase does not handle automatic migration of partition data between
>      nodes, which can result in temporary data unavailability until
>      migration logic is implemented in a future phase.
>    - *Replication & HA:* There is no built-in replication in this phase, so
>      a single-node failure may result in data loss. Future phases will
>      address replication and recovery, most likely by integrating with
>      external storage, which is generally required for HA in distributed
>      systems.
>    - *External Checkpoint/Restore Integration:* Not covered in this phase,
>      but planned for later together with replication and HA.
>
> ------------------------------
> Next Steps
>
>    1. Collect feedback on this high-level design from the community.
>    2. Upon agreement, proceed to detailed design (class diagrams, API
>    specs, migration flow, error handling, etc.).
>    3. Develop a prototype of the local RocksDB state backend.
>
> ------------------------------
>
> *Detailed class diagrams and API specifications will be provided in a
> follow-up document after initial feedback and community agreement.*
>
>
> Looking forward to your feedback. Thanks!
