yzeng1618 commented on issue #10204: URL: https://github.com/apache/seatunnel/issues/10204#issuecomment-3665428222
I think “data validation” splits cleanly into two categories: Data Quality (DQ) and Data Consistency / Reconciliation.

DQ fits naturally into SeaTunnel’s transform stage, where we can run row-level checks (null/type/range/enum/cross-field constraints) plus tagging, quarantine/side output, and metrics. This prevents dirty or invalid records from reaching the sink and causing downstream “inconsistency-like” issues (e.g., parse failures, invalid values causing partial writes, a null primary key breaking upsert semantics). Shifting DQ left, before the sink, therefore significantly reduces the likelihood of data inconsistency, and it also improves observability (rule hit counts, failure rates, etc.).

Post-sync source↔sink reconciliation (row count / checksum / PK diff), by contrast, is usually IO- and compute-intensive and is better handled by external engines/schedulers (Spark/Trino/Flink SQL) after the window closes (T+Δ). SeaTunnel can still emit standardized outputs (window_id, read/write counts, DQ metrics, run_id) to make that external reconciliation easier and more reliable.

Overall: SeaTunnel should focus on in-pipeline quality assurance and observability, while heavy post-sync reconciliation remains an external responsibility.
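To make the in-pipeline DQ idea concrete, here is a minimal sketch of row-level rule evaluation with a quarantine side output and per-rule hit metrics. All names (`RowRule`, `DqRouter`, the field names in the rows) are illustrative assumptions for this comment, not SeaTunnel APIs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// A named row-level rule: null/type/range/enum/cross-field checks all fit
// the same Predicate<row> shape. Hypothetical type, not a SeaTunnel class.
final class RowRule {
    final String name;
    final Predicate<Map<String, Object>> check;
    RowRule(String name, Predicate<Map<String, Object>> check) {
        this.name = name;
        this.check = check;
    }
}

// Routes each row to a clean output or a quarantine side output, and
// counts rule hits so fail rates can be exposed as metrics.
final class DqRouter {
    final List<Map<String, Object>> clean = new ArrayList<>();
    final List<Map<String, Object>> quarantine = new ArrayList<>();
    final Map<String, Integer> hitCounts = new HashMap<>();

    void route(Map<String, Object> row, List<RowRule> rules) {
        boolean ok = true;
        for (RowRule rule : rules) {
            if (!rule.check.test(row)) {
                ok = false;
                hitCounts.merge(rule.name, 1, Integer::sum); // rule-hit metric
            }
        }
        (ok ? clean : quarantine).add(row);
    }
}
```

The point of the side output is that dirty rows are diverted before the sink instead of silently dropped, so they can be inspected or replayed later.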
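And to illustrate the “standardized outputs” side: a per-window summary record plus an order-independent checksum is roughly what an external reconciliation job would consume. The record fields mirror the ones mentioned above (window_id, run_id, read/write counts), but this shape is an assumption for discussion, not a proposed SeaTunnel format:

```java
import java.util.List;

// Hypothetical per-window summary an engine could emit for external
// reconciliation (field names are illustrative only).
record RunSummary(String windowId, String runId,
                  long readCount, long writeCount, long checksum) {}

final class Reconcile {
    // Order-independent checksum: XOR of per-row key hashes, so source and
    // sink can be scanned in any order and still compared.
    static long checksum(List<String> rowKeys) {
        long acc = 0L;
        for (String k : rowKeys) {
            acc ^= k.hashCode() * 0x9E3779B97F4A7C15L;
        }
        return acc;
    }

    // An external T+Δ job would compare the source-side and sink-side
    // summaries for the same closed window.
    static boolean matches(RunSummary source, RunSummary sink) {
        return source.windowId().equals(sink.windowId())
            && source.readCount() == sink.writeCount()
            && source.checksum() == sink.checksum();
    }
}
```

XOR-of-hashes is chosen here only because it is commutative; a real reconciliation job would likely use a stronger aggregate (e.g., summed cryptographic digests) to reduce collision risk.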
