SeaTunnel RocksDB-based State Backend V2 Design Doc
------------------------------
1. Partition Rebalancing & Data Migration

Motivation and Problem Definition

   - When the cluster topology changes (e.g., a node joins, leaves, or
   fails), partition (Shard/KeyGroup) ownership must be reassigned.
   - Partition data stored on the previous owner node must be migrated
   automatically to the new owner node.
   - Without this, partition data can become inaccessible or be lost
   temporarily during migration.

Design Approach (Reflected in Architecture/Diagrams)

   - *External Storage-Based Partition Migration*
      - Each partition’s RocksDB instance periodically uploads a snapshot
      to external storage (HDFS, local FS, etc.).
      - When a partition ownership transfer is decided (by PartitionService),
      the current owner node creates and uploads the latest snapshot.
      - The new owner node downloads the partition snapshot from external
      storage and restores it.
      - Data movement is “file-based” for simplicity and to avoid
      lock/network concurrency issues (see the snapshot upload/restore
      sketch after this list).
   - *PartitionService Event Trigger*
      - PartitionService detects and broadcasts partition ownership change
      events (leveraging Hazelcast's MembershipListener, MigrationListener,
      and PartitionLostListener).
      - Write/read operations for the affected partition are restricted
      until the ownership change and restore are complete (to ensure
      consistency); a listener/gating sketch also follows this list.
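
A minimal sketch of the file-based snapshot upload/restore path described
above. CheckpointStorage here is a placeholder for the external storage
abstraction (HDFS/local FS); class and method names are illustrative, not
the final API:

    import java.nio.file.Path;

    import org.rocksdb.Checkpoint;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    // Placeholder for the external storage abstraction; not the final API.
    interface CheckpointStorage {
        void upload(int partitionId, Path localSnapshotDir);
        void downloadLatest(int partitionId, Path localRestoreDir);
    }

    public class PartitionSnapshotter {

        private final CheckpointStorage storage;

        public PartitionSnapshotter(CheckpointStorage storage) {
            this.storage = storage;
        }

        /** On the current owner: create a local RocksDB checkpoint and upload it. */
        public void uploadSnapshot(RocksDB db, int partitionId, Path localTmpDir)
                throws RocksDBException {
            // Fresh directory per run; createCheckpoint requires a non-existing path.
            Path checkpointDir = localTmpDir.resolve(
                    "partition-" + partitionId + "-" + System.currentTimeMillis());
            try (Checkpoint checkpoint = Checkpoint.create(db)) {
                // A RocksDB checkpoint hard-links SST files, so the local copy is cheap.
                checkpoint.createCheckpoint(checkpointDir.toString());
            }
            storage.upload(partitionId, checkpointDir);
        }

        /** On the new owner: download the latest snapshot and reopen RocksDB on it. */
        public RocksDB restoreSnapshot(int partitionId, Path localDataDir)
                throws RocksDBException {
            storage.downloadLatest(partitionId, localDataDir);
            return RocksDB.open(localDataDir.toString());
        }
    }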

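A rough sketch of the event-trigger side: gating reads/writes for a
partition until its restore completes. The listener shape follows
Hazelcast 5.x; the gating mechanism itself is an assumption, not the
final design:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    import com.hazelcast.cluster.Member;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.partition.MigrationListener;
    import com.hazelcast.partition.MigrationState;
    import com.hazelcast.partition.ReplicaMigrationEvent;

    public class OwnershipChangeListener implements MigrationListener {

        private final HazelcastInstance hazelcast;
        private final Set<Integer> unavailable = ConcurrentHashMap.newKeySet();

        public OwnershipChangeListener(HazelcastInstance hazelcast) {
            this.hazelcast = hazelcast;
            hazelcast.getPartitionService().addMigrationListener(this);
        }

        /** State reads/writes for a partition are rejected or queued while this is true. */
        public boolean isUnavailable(int partitionId) {
            return unavailable.contains(partitionId);
        }

        @Override
        public void migrationStarted(MigrationState state) {
            // A cluster-wide migration round has started; per-partition gating happens below.
        }

        @Override
        public void migrationFinished(MigrationState state) { }

        @Override
        public void replicaMigrationCompleted(ReplicaMigrationEvent event) {
            Member local = hazelcast.getCluster().getLocalMember();
            // Only the primary replica (index 0) carries the RocksDB state managed here.
            if (event.getReplicaIndex() == 0 && local.equals(event.getDestination())) {
                unavailable.add(event.getPartitionId());
                // Placeholder: download and reopen the latest snapshot for this partition
                // (e.g., via PartitionSnapshotter above), then allow access again.
                unavailable.remove(event.getPartitionId());
            }
        }

        @Override
        public void replicaMigrationFailed(ReplicaMigrationEvent event) {
            unavailable.remove(event.getPartitionId());
        }
    }
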
Open Issues & Discussion Points

   - One open point is how to ensure partition data consistency during
   migration (e.g., blocking or synchronizing writes before taking the
   snapshot). Do you think there is a better alternative?

------------------------------
2. Replication & High Availability (HA)

Motivation and Problem Definition

   - Data loss must be prevented if a single node fails.
   - Production environments require data that is “always recoverable”.

Design Approach (Reflected in Diagrams)

   - *Periodic External Storage Snapshots for HA*
      - Each partition’s RocksDB instance saves a snapshot to external
      storage (HDFS, local FS, etc.) at a configured interval (see the
      scheduling sketch after this list).
      - On node failure, the affected partition is reassigned to a new
      node, which downloads and restores the latest snapshot.
      - This minimizes service downtime and bounds data loss to the
      snapshot interval.
   - *Real-Time Replication (Primary/Replica)*
      - For large-scale RocksDB state or ETL pipelines, real-time
      replication brings significant complexity and performance overhead,
      and is often less effective operationally.
      - The default is periodic snapshots; real-time replication may be
      considered only if explicitly required.
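
A minimal sketch of the periodic snapshot loop, assuming the
PartitionSnapshotter from the earlier sketch; the interval would come from
configuration, and the orchestration would sit with CheckpointCoordinator:

    import java.nio.file.Path;
    import java.time.Duration;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.rocksdb.RocksDB;

    public class PeriodicSnapshotScheduler {

        private final ScheduledExecutorService executor =
                Executors.newSingleThreadScheduledExecutor();
        private final PartitionSnapshotter snapshotter;   // from the earlier sketch

        public PeriodicSnapshotScheduler(PartitionSnapshotter snapshotter) {
            this.snapshotter = snapshotter;
        }

        /** Upload a fresh snapshot of one partition's RocksDB at a fixed interval. */
        public void schedule(RocksDB db, int partitionId, Path tmpDir, Duration interval) {
            executor.scheduleAtFixedRate(() -> {
                try {
                    snapshotter.uploadSnapshot(db, partitionId, tmpDir);
                } catch (Exception e) {
                    // A failed upload only widens the recovery window; the next run retries.
                }
            }, interval.toMillis(), interval.toMillis(), TimeUnit.MILLISECONDS);
        }

        public void shutdown() {
            executor.shutdownNow();
        }
    }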

Open Issues & Discussion Points

   - Assessing the need for real-time replication/Primary-Replica (periodic
   snapshots are the default)
   - Balancing snapshot interval, performance, and storage cost:
      - Shorter intervals → less potential data loss, higher cost
      - Longer intervals → more potential data loss, lower cost

------------------------------
Conclusion & Next Steps

   - This design adopts the *periodic external storage snapshot + recovery*
   pattern as the foundation for partition migration, HA, and restore.
   - Real-time replication/Primary-Replica will only be considered if
   operational requirements clearly justify it.
   - Integration with external storage will leverage SeaTunnel’s built-in
   abstractions and logic wherever possible to minimize implementation
   complexity and operational overhead.
   - Further enhancements and optimizations will be addressed incrementally
   as real-world operational needs arise.

------------------------------
[Reference] Main Components in Current Architecture Diagram

   - *ValueState* (Key-value state storage, IMAP replacement; a rough
   interface sketch follows this list)
   - *RocksDBStateBackend* (Manages ValueState and other state types;
   instance creation/lookup)
   - *RocksDBPartitionService* (Controls partition ownership, distributed
   lookup, and migration)
   - *CheckpointStorage/Factory* (Handles external storage integration and
   snapshot management)
   - *CheckpointCoordinator* (Orchestrates snapshots)
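
A rough sketch of how the listed components could fit together; the names
come from the diagram, but the method signatures are illustrative
assumptions, not the final API:

    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    /** Key-value state handle (IMAP replacement), one logical state per name. */
    interface ValueState<K, V> {
        V get(K key);
        void put(K key, V value);
        void delete(K key);
    }

    /**
     * Owns the per-partition RocksDB instances and hands out state handles;
     * partition lookup/migration is delegated to RocksDBPartitionService,
     * and snapshot persistence to CheckpointStorage via CheckpointCoordinator.
     */
    interface RocksDBStateBackend {
        <K, V> ValueState<K, V> getValueState(String name, Class<K> keyType, Class<V> valueType);
        RocksDB dbForPartition(int partitionId) throws RocksDBException;
    }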

→ The architecture is designed to be extensible and compatible with future
State types (e.g., MapState), new storage backends, or distributed engines.

https://github.com/apache/seatunnel/issues/9851 (The diagram is attached in
the issue comment.)

   - *Note:* The actual implementation may differ in some details as
   development proceeds; the design will be refined iteratively based on
   practical feedback and technical findings.


Best regards,

Doyeon

(GitHub ID: dybyte)
