[DISCUSS] Replace IMap with RocksDB
Hi all,

I’ve drafted a high-level design for a RocksDB-based state backend for
SeaTunnel.
This is the first phase and I’d like to collect feedback from the community.

Below is the draft:
SeaTunnel RocksDB-based State Backend: High-Level Design Draft

Motivation

Hazelcast IMap shows performance limitations in high-load environments.

See Issue #9851 for more details:
https://github.com/apache/seatunnel/issues/9851

*Goal:*

Replace Hazelcast IMap with a custom state backend based on local RocksDB,
using Hazelcast PartitionService only for partition-to-node mapping and
cluster membership.
------------------------------
Design Principles

   - *Local State Storage:*
     Each node stores its assigned partition state in a local RocksDB
     instance.

   - *Partition Routing:*
     Hazelcast PartitionService is used to determine the owner node for a
     given key (partitioning/routing only).

   - *No Hazelcast Data Structures:*
     Hazelcast IMap will not be used for business data/state storage.

   - *Distributed Access:*
     State access (get/put/delete) must be routed to the partition owner.
     If the current node is not the owner, it must forward the request to
     the correct node (a rough API sketch follows this list).

   - *Cluster-wide Operations:*
     For operations that require aggregating data from all partitions/nodes
     (e.g., metrics), a coordinator node collects results from all nodes
     and aggregates them.

   - *Future-proofing:*
     Partition migration/rebalancing, state replication, and checkpoint
     integration will be addressed in subsequent phases.
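
To make the Distributed Access principle concrete, here is a rough sketch of
the key/value API the state backend could expose. The interface name and the
byte[]-based signatures are illustrative assumptions, not a final proposal:

// Hypothetical API sketch only; the name and signatures are not final.
public interface KeyedStateBackend {

    // Read a value: served from the local RocksDB instance if this node owns
    // the key's partition, otherwise forwarded to the owner node.
    byte[] get(byte[] key);

    // Write a value on the partition owner (locally or via forwarding).
    void put(byte[] key, byte[] value);

    // Remove a key on the partition owner (locally or via forwarding).
    void delete(byte[] key);
}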

------------------------------
High-Level Architecture

   1. *Partition Ownership*
      - Use PartitionService to determine which node owns each
      partition/key.
      - Each node only reads/writes state for the partitions it owns.
   2. *Local State Backend*
      - Use RocksDB as the local, persistent store for all state data
      belonging to owned partitions (a minimal sketch follows this list).
   3. *State Access Protocol*
      - For any key-based operation:
         - If the current node is the owner, access RocksDB directly.
         - If not, forward the request to the owner node (via Hazelcast
         IExecutorService).
   4. *Cluster-wide Operations*
      - For global queries (e.g., collect all metrics), the coordinator
      node sends requests to all nodes, gathers their local results, and
      aggregates the data.
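
As a concrete illustration of the Local State Backend item, below is a
minimal sketch of a node-local store built on the RocksDB Java API
(org.rocksdb). The class name, key encoding, and error handling are
placeholders to be settled in the detailed design:

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Minimal sketch of a node-local RocksDB store; not the final implementation.
public class LocalRocksDbStore implements AutoCloseable {

    static {
        RocksDB.loadLibrary(); // load the native RocksDB library once per JVM
    }

    private final Options options;
    private final RocksDB db;

    public LocalRocksDbStore(String path) throws RocksDBException {
        this.options = new Options().setCreateIfMissing(true);
        this.db = RocksDB.open(options, path); // one instance per node
    }

    public byte[] get(byte[] key) throws RocksDBException {
        return db.get(key); // returns null if the key is absent
    }

    public void put(byte[] key, byte[] value) throws RocksDBException {
        db.put(key, value);
    }

    public void delete(byte[] key) throws RocksDBException {
        db.delete(key);
    }

    @Override
    public void close() {
        db.close();
        options.close();
    }
}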

------------------------------
Example State Access Flow

Partition partition = hazelcast.getPartitionService().getPartition(key);
Member owner = partition.getOwner();

byte[] value; // RocksDB stores values as byte arrays
if (owner.equals(hazelcast.getCluster().getLocalMember())) {
    // Owner is the local member: read directly from the local RocksDB instance
    value = localRocksdbGet(key);
} else {
    // Not the owner: forward the request to the owner node
    value = remoteGet(owner.getAddress(), key);
}
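
The remoteGet(...) call above could be built on Hazelcast's IExecutorService,
as mentioned in the access protocol. A rough sketch follows; RemoteGetTask,
LocalStateHolder, and the executor name "state-access" are assumptions for
illustration, and the task is submitted to the owner Member directly rather
than to its address:

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

import com.hazelcast.cluster.Member;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

// Hypothetical task executed on the partition owner; LocalStateHolder is a
// placeholder for however the owner exposes its node-local RocksDB store.
public class RemoteGetTask implements Callable<byte[]>, Serializable {
    private final byte[] key;

    public RemoteGetTask(byte[] key) {
        this.key = key;
    }

    @Override
    public byte[] call() throws Exception {
        return LocalStateHolder.getStore().get(key);
    }
}

// Caller-side sketch: forward the read to the owner member and block for the result.
public class RemoteStateClient {
    public static byte[] remoteGet(HazelcastInstance hazelcast, Member owner, byte[] key)
            throws Exception {
        IExecutorService executor = hazelcast.getExecutorService("state-access");
        Future<byte[]> future = executor.submitToMember(new RemoteGetTask(key), owner);
        return future.get();
    }
}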

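For the cluster-wide operations described above (e.g., collecting metrics), a
coordinator can fan the request out with IExecutorService.submitToAllMembers
and merge the per-node results. MetricsSnapshotTask, the Map<String, Long>
metrics shape, and LocalStateHolder.localMetrics() below are hypothetical
placeholders:

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

import com.hazelcast.cluster.Member;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;

// Hypothetical task run on every member; it would snapshot that node's local metrics.
public class MetricsSnapshotTask implements Callable<Map<String, Long>>, Serializable {
    @Override
    public Map<String, Long> call() {
        // Placeholder: gather metrics from the node-local state backend.
        return LocalStateHolder.localMetrics();
    }
}

// Coordinator-side sketch: fan out to all members and aggregate the per-node results.
public class ClusterMetricsCollector {
    public static Map<String, Long> collect(HazelcastInstance hazelcast) throws Exception {
        IExecutorService executor = hazelcast.getExecutorService("state-access");
        Map<Member, Future<Map<String, Long>>> futures =
                executor.submitToAllMembers(new MetricsSnapshotTask());

        Map<String, Long> aggregated = new HashMap<>();
        for (Future<Map<String, Long>> future : futures.values()) {
            // Merge each node's local metrics into the cluster-wide view (summing here).
            future.get().forEach((name, count) -> aggregated.merge(name, count, Long::sum));
        }
        return aggregated;
    }
}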

------------------------------
Out of Scope / Future Work

   - *Partition Rebalancing & Data Migration:*
     If the cluster topology changes (node join/leave), partition ownership
     may change. The current phase does not handle automatic migration of
     partition data between nodes. This can result in temporary data
     unavailability until migration logic is implemented in a future phase.

   - *Replication & HA:*
     There is no built-in replication in this phase, so a single-node
     failure may result in data loss. In future phases, we will address
     replication and recovery, most likely by integrating with external
     storage, which is generally required for HA in distributed systems.

   - *External Checkpoint/Restore Integration:*
     Not covered in this phase, but planned for later together with
     replication and HA.

------------------------------
Next Steps

   1. Collect feedback on this high-level design from the community.
   2. Upon agreement, proceed to detailed design (class diagrams, API
   specs, migration flow, error handling, etc.).
   3. Develop a prototype of the local RocksDB state backend.

------------------------------

*Detailed class diagrams and API specifications will be provided in a
follow-up document after initial feedback and community agreement.*


Looking forward to your feedback. Thanks!
