SeaTunnel RocksDB-Based State Backend V2 Design Doc
------------------------------
1. Partition Rebalancing & Data Migration
Motivation and Problem Definition
- When cluster topology changes (e.g., node join/leave/failure),
partition (Shard/KeyGroup) ownership must be reassigned.
- Partition data stored on the previous owner node must be automatically
migrated to the new owner node.
- If migration is not implemented, partition data can become
inaccessible or be temporarily lost when ownership is reassigned.
Design Approach (Reflected in Architecture/Diagrams)
- *External Storage-Based Partition Migration*
- Each partition’s RocksDB instance periodically uploads a snapshot
to external storage (HDFS, local FS, etc.).
- When partition ownership transfer is decided (by PartitionService),
the current owner node creates and uploads the latest snapshot.
- The new owner node downloads and restores the partition snapshot
from external storage.
- Data movement is file-based for simplicity and to avoid
lock/network concurrency issues (see the sketch below).
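For illustration, here is a minimal sketch of this file-based path.
RocksDB's Checkpoint API is real; the ExternalStorage interface and the
"partitions/<id>" layout are hypothetical placeholders for whatever
storage abstraction is ultimately chosen:

```java
import org.rocksdb.Checkpoint;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class PartitionSnapshotter {

    static { RocksDB.loadLibrary(); }

    /** Hypothetical storage abstraction (HDFS, local FS, ...). */
    public interface ExternalStorage {
        void upload(String localDir, String remotePath);
        void download(String remotePath, String localDir);
    }

    private final ExternalStorage storage;

    public PartitionSnapshotter(ExternalStorage storage) {
        this.storage = storage;
    }

    /** Current owner: materialize a consistent on-disk snapshot and upload it. */
    public void uploadSnapshot(RocksDB db, int partitionId, String localDir)
            throws RocksDBException {
        try (Checkpoint checkpoint = Checkpoint.create(db)) {
            // Hard-links live SST files into localDir: cheap and point-in-time consistent.
            checkpoint.createCheckpoint(localDir);
        }
        storage.upload(localDir, "partitions/" + partitionId); // hypothetical call
    }

    /** New owner: download the latest snapshot and reopen the partition's DB. */
    public RocksDB restoreSnapshot(int partitionId, String localDir)
            throws RocksDBException {
        storage.download("partitions/" + partitionId, localDir); // hypothetical call
        try (Options options = new Options().setCreateIfMissing(false)) {
            return RocksDB.open(options, localDir);
        }
    }
}
```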
- *PartitionService Event Trigger*
- PartitionService detects and broadcasts partition ownership change
events (leveraging Hazelcast's MembershipListener, MigrationListener,
PartitionLostListener).
- Write/read operations on the affected partition are restricted
until the ownership change and restore are complete, to ensure
consistency (see the listener sketch below).
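As a rough sketch of the event wiring (using Hazelcast 5.x listener
signatures; only the migration and partition-lost hooks are shown, and
the handler bodies are illustrative comments, not decided behavior):

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.partition.MigrationListener;
import com.hazelcast.partition.MigrationState;
import com.hazelcast.partition.PartitionLostEvent;
import com.hazelcast.partition.PartitionLostListener;
import com.hazelcast.partition.ReplicaMigrationEvent;

public class PartitionEventHooks {

    /** Register ownership-change hooks on the embedded Hazelcast instance. */
    public void register(HazelcastInstance hz) {
        hz.getPartitionService().addMigrationListener(new MigrationListener() {
            @Override
            public void migrationStarted(MigrationState state) {
                // Restrict writes to affected partitions until restore completes.
            }

            @Override
            public void migrationFinished(MigrationState state) {
                // Re-enable reads/writes once ownership change and restore are done.
            }

            @Override
            public void replicaMigrationCompleted(ReplicaMigrationEvent event) {
                if (event.getReplicaIndex() == 0) {
                    // Primary ownership moved; on the new owner, trigger the
                    // snapshot download + RocksDB restore path (hypothetical hook).
                }
            }

            @Override
            public void replicaMigrationFailed(ReplicaMigrationEvent event) {
                // Surface the failure; the partition may need another restore attempt.
            }
        });

        hz.getPartitionService().addPartitionLostListener(new PartitionLostListener() {
            @Override
            public void partitionLost(PartitionLostEvent event) {
                // Owner and backups are gone: recover from the latest external snapshot.
            }
        });
    }
}
```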
Open Issues & Discussion Points
- How should partition data consistency be ensured during migration
(e.g., by blocking or synchronizing writes before the snapshot)? Is
there a better alternative? One possible blocking approach is sketched
below.
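To make the blocking option concrete, here is a minimal sketch using a
per-partition ReadWriteLock; the locking granularity and class names are
illustrative, not a decided design:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.rocksdb.Checkpoint;
import org.rocksdb.FlushOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class ConsistentSnapshotTaker {

    // One lock per partition: state writes take the read lock,
    // snapshots take the write lock.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public void put(RocksDB db, byte[] key, byte[] value) throws RocksDBException {
        lock.readLock().lock();
        try {
            db.put(key, value);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Writes are paused only while the hard-link checkpoint is created (fast). */
    public void snapshot(RocksDB db, String checkpointDir) throws RocksDBException {
        lock.writeLock().lock();
        try (FlushOptions flushOptions = new FlushOptions().setWaitForFlush(true);
             Checkpoint checkpoint = Checkpoint.create(db)) {
            db.flush(flushOptions); // persist the memtable so the checkpoint is complete
            checkpoint.createCheckpoint(checkpointDir);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```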
------------------------------
2. Replication & High Availability (HA)
Motivation and Problem Definition
- Data loss must be prevented if a single node fails.
- “Always recoverable” data is required for production environments.
Design Approach (Reflected in Diagrams)
- *Periodic External Storage Snapshots for HA*
- Each partition’s RocksDB instance saves a snapshot to external
storage (HDFS, local FS, etc.) at a configured interval.
- On node failure, the affected partition is reassigned to a new
node, which downloads and restores the latest snapshot.
- This minimizes service downtime and bounds data loss to at most
one snapshot interval (see the scheduling sketch below).
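A sketch of the periodic trigger, assuming a hypothetical
PartitionSnapshotFn hook into the snapshot/upload path sketched earlier:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicSnapshotScheduler {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /**
     * Periodically snapshot each locally owned partition to external storage.
     * intervalSeconds bounds the worst-case data loss on node failure.
     */
    public void start(Iterable<Integer> ownedPartitions,
                      long intervalSeconds,
                      PartitionSnapshotFn fn) {
        scheduler.scheduleAtFixedRate(() -> {
            for (int partitionId : ownedPartitions) {
                try {
                    fn.snapshot(partitionId); // upload this partition's checkpoint
                } catch (Exception e) {
                    // A failed snapshot must not kill the scheduler; log and retry
                    // on the next round.
                }
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }

    /** Hypothetical hook into the snapshot/upload path. */
    @FunctionalInterface
    public interface PartitionSnapshotFn {
        void snapshot(int partitionId) throws Exception;
    }
}
```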
- *Real-Time Replication (Primary/Replica)*
- For large-scale RocksDB state or ETL pipelines, real-time
replication adds significant complexity and performance overhead,
and is often less effective operationally than snapshot-based
recovery.
- The default is periodic snapshots; real-time replication may be
considered only if explicitly required.
Open Issues & Discussion Points
- Assessing the need for real-time replication/Primary-Replica (periodic
snapshots are the default)
- Balancing snapshot interval, performance, and storage cost:
- Shorter intervals → less potential data loss, higher cost
- Longer intervals → more potential data loss, lower cost
- For example, a 5-minute interval bounds worst-case loss to
roughly the last 5 minutes of state changes.
------------------------------
Conclusion & Next Steps
- This design adopts the *periodic external storage snapshot + recovery*
pattern as the foundation for partition migration, HA, and restore.
- Real-time replication/Primary-Replica will only be considered if
operational requirements clearly justify it.
- Integration with external storage will leverage SeaTunnel's built-in
abstractions and logic wherever possible to minimize implementation
complexity and operational overhead.
- Further enhancements and optimizations will be addressed incrementally
as real-world operational needs arise.
------------------------------
[Reference] Main Components in Current Architecture Diagram
- *ValueState* (Key-Value state storage, IMAP replacement)
- *RocksDBStateBackend* (Manages ValueState and other states, instance
creation/lookup)
- *RocksDBPartitionService* (Controls partition ownership, distributed
lookup, migration)
- *CheckpointStorage/Factory* (Handles external storage integration,
snapshot management)
- *CheckpointCoordinator* (Orchestrates snapshots)
→ The architecture is designed to be extensible and compatible with future
State types (e.g., MapState), new storage backends, or distributed engines.
https://github.com/apache/seatunnel/issues/9851 (Diagram is attached in the
issue comment.)
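For orientation, the components could map to interfaces roughly as
follows; the names echo the diagram, but every signature here is
hypothetical rather than taken from the implementation:

```java
// Hypothetical interfaces distilled from the diagram; illustrative only.
public final class ArchitectureSketch {

    /** Key-value state handle (the IMAP replacement). */
    interface ValueState<T> {
        T get(byte[] key) throws Exception;
        void put(byte[] key, T value) throws Exception;
    }

    /** Creates and looks up per-partition, RocksDB-backed state instances. */
    interface RocksDBStateBackend {
        <T> ValueState<T> getValueState(String name, int partitionId);
    }

    /** Controls partition ownership, distributed lookup, and migration. */
    interface RocksDBPartitionService {
        int partitionFor(byte[] key);
        boolean isLocalOwner(int partitionId);
    }

    /** External storage integration and snapshot management. */
    interface CheckpointStorage {
        void upload(int partitionId, String localCheckpointDir);
        String downloadLatest(int partitionId, String localTargetDir);
    }

    /** Orchestrates snapshots, e.g., on a schedule or on migration events. */
    interface CheckpointCoordinator {
        void triggerSnapshot(int partitionId);
    }

    private ArchitectureSketch() {}
}
```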
- *Note:* The actual implementation may differ in some details as
development proceeds; the design will be refined iteratively based on
practical feedback and technical findings.
Best regards,
Doyeon
(GitHub ID: dybyte)