prashantwason opened a new issue, #18976:
URL: https://github.com/apache/hudi/issues/18976

   ## Problem statement
   
   Every Hudi write produces commit metadata that records per-file and 
per-partition write statistics — `numInserts`, `numUpdates`, `numWrites`, 
`numDeletes`, and related counters. These stats are the primary source of truth 
that operators, pipelines, and reconciliation tooling use to answer the 
question: *"How many records did my write actually produce?"*
   
   However, when **deduplication** (`hoodie.combine.before.insert`) or 
**precombine** (during upsert) is enabled, multiple input records that share 
the same record key are collapsed into a single output record before anything 
is written. The commit metadata reports only the **final written count** — it 
does not report how many input records were collapsed along the way, or *why* 
the count shrank.
   
   This creates an **observability gap**: a discrepancy between input record 
count and written record count cannot be attributed to a cause.
   
   ### Concrete example
   
   Suppose an input RDD/Dataset contains 5 records that all share the same 
record key:
   
   ```
   key=A, ts=1
   key=A, ts=2
   key=A, ts=3
   key=A, ts=4
   key=A, ts=5
   ```
   
   With dedup/precombine enabled, Hudi keeps one record (say `ts=5`) and writes 
it. The commit metadata reports:
   
   ```
   numInserts = 1
   ```
   
   From this number alone, an operator **cannot tell the difference** between 
two very different scenarios:
   
   1. **Expected behavior:** 4 records were legitimate duplicates, correctly 
collapsed by precombine. Data is fully intact. :white_check_mark:
   2. **A bug / data loss:** records were silently dropped somewhere in the 
pipeline (a partitioning bug, a faulty merge, an index issue, etc.), and the "4 
missing" records were *not* actually duplicates. :x:
   
   Both scenarios look identical in commit metadata: `5 in -> 1 out`. There is 
no field that says "4 of these were dropped as duplicates."
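
The collapse in this example can be reproduced with a minimal plain-Python sketch. This is not Hudi code: the `precombine` helper and the `(key, ts)` tuple layout are assumptions made for illustration. It shows that the number of collapsed records is known at the moment of dedup, which is the information the commit metadata currently does not carry:

```python
def precombine(records):
    """Keep the record with the highest ts per key and count collapsed inputs.

    Illustrative only: `records` is a list of (key, ts) tuples, not a Hudi type.
    """
    latest = {}
    for key, ts in records:
        # Retain the largest ordering value seen so far for this key.
        if key not in latest or ts > latest[key]:
            latest[key] = ts
    kept = sorted(latest.items())
    # Every input record that did not survive was collapsed into another one.
    num_collapsed = len(records) - len(kept)
    return kept, num_collapsed


records = [("A", 1), ("A", 2), ("A", 3), ("A", 4), ("A", 5)]
kept, num_collapsed = precombine(records)
print(kept, num_collapsed)
```

Here `kept` holds the single surviving record and `num_collapsed` is the count that would need to be surfaced alongside `numInserts`.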
   
   ### Why this matters
   
   - **Data integrity / auditing:** Pipelines that reconcile source-vs-sink 
counts hit a dead end. A drop from 5 to 1 is unexplained, so it can be 
neither safely signed off as correct nor flagged as a real loss.
   - **Debugging:** When a genuine data-loss bug occurs, there is no metadata 
signal distinguishing it from normal dedup behavior, making root-cause analysis 
much harder.
   - **Trust:** Without dedup attribution, every count discrepancy requires 
manual, expensive investigation.
   
   ### Scope
   
   This applies to **both** write paths:
   
   - **Insert dedup** — duplicates dropped before insert when 
combine-before-insert is on.
   - **Upsert precombine** — multiple incoming records for the same key 
combined down to one (and combined against the existing record on disk).
   
   ## Proposed solution
   
   Extend Hudi commit metadata (`HoodieWriteStat` and the aggregated 
commit-level stats) with additional counters that make dedup/precombine 
explicit, for example:
   
   - `numDuplicates` / `numRecordsDeduplicated` — input records dropped because 
they shared a key with another input record.
   - `numPrecombined` — records eliminated by the precombine step specifically.
   
   With these stats, and given the input record count (`numInputRecords` below), the following invariant becomes verifiable:
   
   ```
   numInputRecords == numWrites + numDeletes + numDuplicates (+ numErrors)
   ```
   
   When this equation balances, a count drop is provably explained by 
deduplication. When it does **not** balance, the gap points at a real bug — 
turning a silent ambiguity into an actionable signal.
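
As a rough sketch of how reconciliation tooling might apply such an invariant, the check below computes the unexplained gap. The function name is hypothetical, `num_input_records` and `num_duplicates` stand for counters that do not exist in commit metadata today, and the sketch assumes `numWrites` counts only records that originate from this write's input:

```python
def unexplained_records(num_input_records, num_writes, num_deletes,
                        num_duplicates, num_errors=0):
    """Return how many input records are not accounted for by any counter.

    A result of 0 means the invariant balances; a positive result points at
    records that were neither written, deleted, deduplicated, nor errored.
    All parameter names are illustrative, not existing Hudi fields.
    """
    accounted = num_writes + num_deletes + num_duplicates + num_errors
    return num_input_records - accounted


# Scenario 1: 4 of 5 inputs were reported as duplicates, so nothing is missing.
balanced_gap = unexplained_records(5, num_writes=1, num_deletes=0, num_duplicates=4)

# Scenario 2: no duplicates were reported, so 4 records are unexplained.
suspicious_gap = unexplained_records(5, num_writes=1, num_deletes=0, num_duplicates=0)

print(balanced_gap, suspicious_gap)
```

With the proposed counters, the two scenarios from the concrete example would yield different gaps instead of the same `5 in -> 1 out`.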
   

