yzeng1618 commented on issue #10204: URL: https://github.com/apache/seatunnel/issues/10204#issuecomment-3665428222
I think “data validation” splits cleanly into two categories: Data Quality (DQ) and Data Consistency / Reconciliation.

DQ fits naturally into SeaTunnel’s transform stage, where we can run row-level checks (null/type/range/enum/cross-field constraints) plus tagging, quarantine/side output, and metrics. This prevents dirty or invalid records from reaching the sink and causing downstream “inconsistency-like” issues (e.g., parse failures, invalid values causing partial writes, a null primary key breaking upsert semantics). Shifting DQ left, before the sink, therefore significantly reduces the likelihood of data inconsistency, and it also improves observability (rule hit counts, failure rates, etc.).

Post-sync source↔sink reconciliation (row count / checksum / PK diff), by contrast, is usually IO- and compute-intensive and is better handled by external engines/schedulers (Spark/Trino/Flink SQL) after the window closes (T+Δ). SeaTunnel can still emit standardized outputs (window_id, read/write counts, DQ metrics, run_id) to make that external reconciliation easier and more reliable.

Overall: SeaTunnel should focus on in-pipeline quality assurance and observability, while heavy post-sync reconciliation remains an external responsibility.
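To make the in-pipeline DQ idea concrete, here is a minimal sketch of row-level rule evaluation with a quarantine side output and per-rule hit metrics. All names (`RowRule`, `DqRouter`, the field names in the rows) are illustrative assumptions for this comment, not SeaTunnel APIs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// A named row-level rule: null/type/range/enum/cross-field checks all fit
// the same Predicate<row> shape. Hypothetical type, not a SeaTunnel class.
final class RowRule {
    final String name;
    final Predicate<Map<String, Object>> check;
    RowRule(String name, Predicate<Map<String, Object>> check) {
        this.name = name;
        this.check = check;
    }
}

// Routes each row to a clean output or a quarantine side output, and
// counts rule hits so fail rates can be exposed as metrics.
final class DqRouter {
    final List<Map<String, Object>> clean = new ArrayList<>();
    final List<Map<String, Object>> quarantine = new ArrayList<>();
    final Map<String, Integer> hitCounts = new HashMap<>();

    void route(Map<String, Object> row, List<RowRule> rules) {
        boolean ok = true;
        for (RowRule rule : rules) {
            if (!rule.check.test(row)) {
                ok = false;
                hitCounts.merge(rule.name, 1, Integer::sum); // rule-hit metric
            }
        }
        (ok ? clean : quarantine).add(row);
    }
}
```

The point of the side output is that dirty rows are diverted before the sink instead of silently dropped, so they can be inspected or replayed later.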
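And to illustrate the “standardized outputs” side: a per-window summary record plus an order-independent checksum is roughly what an external reconciliation job would consume. The record fields mirror the ones mentioned above (window_id, run_id, read/write counts), but this shape is an assumption for discussion, not a proposed SeaTunnel format:

```java
import java.util.List;

// Hypothetical per-window summary an engine could emit for external
// reconciliation (field names are illustrative only).
record RunSummary(String windowId, String runId,
                  long readCount, long writeCount, long checksum) {}

final class Reconcile {
    // Order-independent checksum: XOR of per-row key hashes, so source and
    // sink can be scanned in any order and still compared.
    static long checksum(List<String> rowKeys) {
        long acc = 0L;
        for (String k : rowKeys) {
            acc ^= k.hashCode() * 0x9E3779B97F4A7C15L;
        }
        return acc;
    }

    // An external T+Δ job would compare the source-side and sink-side
    // summaries for the same closed window.
    static boolean matches(RunSummary source, RunSummary sink) {
        return source.windowId().equals(sink.windowId())
            && source.readCount() == sink.writeCount()
            && source.checksum() == sink.checksum();
    }
}
```

XOR-of-hashes is chosen here only because it is commutative; a real reconciliation job would likely use a stronger aggregate (e.g., summed cryptographic digests) to reduce collision risk.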
