eugenegujing opened a new pull request, #5101:
URL: https://github.com/apache/texera/pull/5101

   # DataGuard — permission-gated data cleaning for Texera
   
   ## Introduction
   
   DataGuard brings the Claude Code permission UX to **data** instead of code. 
Drop a `CSVFileScan`, `CSVOldFileScan`, or `JSONLFileScan` onto the canvas and 
a floating checklist slides in: each detected issue gets one row, one risk-tier 
badge, one concrete proposed fix, and one Allow/Skip control. Nothing mutates 
the dataset until you click **Fix N & run**. Approved fixes are written back as 
a new dataset version through LakeFS, the operator is repointed at the cleaned 
data, the workflow auto-runs, and DataGuard immediately re-scans so you can 
chase the next round.
   
   Four detectors run on every scan: missing values, placeholder sentinels, 
duplicate IDs, and inconsistent label spellings.
   
   ## The problem
   
   Data cleaning fails today in two opposite, equally bad ways:
   
   - **Manual pandas in a notebook** — opaque, unauditable, no provenance, 
doesn't survive the person who wrote it.
   - **One-click auto-clean tools** — black-box decisions, no explanation, no 
human in the loop for the high-impact moves (drop rows, resolve conflicting 
IDs, clamp a value that might be a real rare case).
   
   DataGuard's bet is that the **interaction model** is the missing piece, not 
the algorithms. Treat each cleaning decision the way Claude Code treats a file 
edit: explain the evidence, propose the action, ask permission, log the answer.
   
   ## What ships in this PR
   
   ### Four detectors, all enabled by default
   
   | Detector | Default | Notes |
   |---|---|---|
   | Missing values | on | nulls, empties, configurable tokens (`na`, `n/a`, 
`null`, `none`, `nan`, case-insensitive) |
   | Placeholder values | on | numeric sentinels (`999`, `-1`), string 
sentinels (`unknown`) |
   | Duplicate IDs | on | honors an explicit ID column; otherwise infers from 
name patterns (`id`, `*_id`, `*Id`, dotted paths like `user.id`) |
   | Inconsistent labels | on | low-cardinality string columns where `trim + 
lowercase` keys collide (`Male` / `male` / `MALE`) |
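
As a rough illustration of two rows in the table, the sketch below shows how ID-column inference by name pattern and `trim + lowercase` collision detection could look. The function names, the `Row` type, and the `maxDistinct` cut-off for "low-cardinality" are assumptions made for this example, not DataGuard's actual code.

```typescript
// Hypothetical helpers for illustration only; names and thresholds are assumed.

type Row = Record<string, unknown>;

// Name patterns from the table: `id`, `*_id`, `*Id`, and dotted paths like `user.id`.
function looksLikeIdColumn(column: string): boolean {
  // For dotted paths, only the last segment is inspected.
  const leaf = column.split(".").pop() ?? column;
  return /^id$/i.test(leaf) || /_id$/i.test(leaf) || /[a-z0-9]Id$/.test(leaf);
}

// Group the string values of one column by their `trim + lowercase` key and
// return every key that more than one distinct spelling maps to.
function findLabelCollisions(
  rows: Row[],
  column: string,
  maxDistinct = 20, // assumed cut-off for "low-cardinality"
): Map<string, string[]> {
  const spellingsByKey = new Map<string, Set<string>>();
  const distinct = new Set<string>();
  for (const row of rows) {
    const value = row[column];
    if (typeof value !== "string") continue;
    distinct.add(value);
    const key = value.trim().toLowerCase();
    const spellings = spellingsByKey.get(key) ?? new Set<string>();
    spellings.add(value);
    spellingsByKey.set(key, spellings);
  }
  const collisions = new Map<string, string[]>();
  if (distinct.size > maxDistinct) return collisions; // not low-cardinality: skip
  spellingsByKey.forEach((spellings, key) => {
    if (spellings.size > 1) collisions.set(key, Array.from(spellings).sort());
  });
  return collisions;
}
```

With rows holding `Male`, `male`, and `MALE`, the single collision key is `male`, and the caller can propose one canonical spelling for the whole group.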
   
   ### Every fix asks before it lands
   
   Every proposal carries a risk tier (`low` / `medium` / `high` / `warning`) 
that drives both the UI affordance and the permission gate. `low` is 
pre-checked; `medium` is pending; **`high` and `warning` can never be 
auto-approved**, even when an "always allow" rule exists for the issue type — 
some decisions should never be automated away. The legacy "modify" verdict 
(free-text override) is rejected at both HTTP and WebSocket boundaries; it 
executed the original action while pretending to honor the user's edit, so we 
cut it rather than ship a lie.
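
The gating rule described above can be sketched as follows. The type names, `initialState`, and `parseVerdict` are hypothetical; in particular, the idea that an "always allow" rule pre-approves `medium` proposals is an assumption of this sketch, since the description only states that such a rule never applies to `high` or `warning`.

```typescript
// Hypothetical types and names; a sketch of the gating rule, not the real code.

type RiskTier = "low" | "medium" | "high" | "warning";
type Verdict = "allow" | "skip";
type InitialState = "pre-checked" | "pending";

// Decide how a proposal first appears in the checklist.
function initialState(tier: RiskTier, hasAlwaysAllowRule: boolean): InitialState {
  // `high` and `warning` always wait for an explicit human decision,
  // even when an "always allow" rule exists for the issue type.
  if (tier === "high" || tier === "warning") return "pending";
  if (tier === "low") return "pre-checked";
  // Assumption: an "always allow" rule may pre-approve `medium` proposals.
  return hasAlwaysAllowRule ? "pre-checked" : "pending";
}

// Validate a verdict arriving over HTTP or WebSocket. Anything other than
// allow/skip, including the legacy "modify" verdict, is rejected.
function parseVerdict(raw: unknown): Verdict {
  if (raw === "allow" || raw === "skip") return raw;
  throw new Error(`unsupported verdict: ${String(raw)}`);
}
```

Running the same validation at both boundaries means a client cannot reintroduce "modify" by switching transports.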
   
   ### Auto-trigger is the onboarding
   
   The user does nothing special — drop a dataset operator and the checklist 
appears. The scan runs the deterministic profiler and the LLM-backed proposal 
step server-side, bypassing the agent's ReAct loop, so the LLM can't decide to 
call a destructive tool and vaporize the workflow.
   
   ### Iterative cleanup loop
   
   Click **Fix N & run** and the loop is `v1 → Apply → v2 → Scan → Apply → v3 → 
…`, with the panel auto-rescanning the cleaned data after each round. A toolbar 
shield toggles DataGuard per workflow; when the panel is closed but the shield 
is on, a small floating icon hangs on the canvas as a one-click rescan.
   
   ### Locate-cycle for multi-row issues
   
   Each issue row has a pin. Click it and the Result Panel scrolls to the 
affected row and flashes the cell. If the issue affects multiple rows — a 
duplicate-ID cluster, say, with four offending rows — **the pin cycles**: first 
click 1 of 4, second 2 of 4, then 3 of 4, 4 of 4, and the fifth wraps back to 1 
of 4. The cursor is per-issue, so two different issues don't fight over the 
focus, and the tooltip previews the next position before you click.
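
A minimal sketch of the per-issue cycling cursor, assuming a hypothetical `LocateCursor` class keyed by issue id. The actual frontend service may be structured differently.

```typescript
// Hypothetical class; a sketch of a per-issue cycling cursor.

class LocateCursor {
  // Last visited 0-based position per issue id; absent means never clicked.
  private readonly positions = new Map<string, number>();

  // Position the next click would land on, used for the tooltip preview.
  peekNext(issueId: string, rowCount: number): number {
    if (rowCount <= 0) throw new RangeError("issue has no affected rows");
    const last = this.positions.get(issueId);
    return last === undefined ? 0 : (last + 1) % rowCount;
  }

  // Advance this issue's cursor and return the new position.
  click(issueId: string, rowCount: number): number {
    const next = this.peekNext(issueId, rowCount);
    this.positions.set(issueId, next);
    return next;
  }
}

// Human-readable label such as "2 of 4".
function positionLabel(position: number, rowCount: number): string {
  return `${position + 1} of ${rowCount}`;
}
```

Because positions are stored per issue id, clicking the pin on one issue leaves every other issue's cursor where it was.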
   
   ## What got fixed on this branch (the unglamorous moments)
   
   - **Null-cell fingerprint normalization.** Texera's JSONL scan parses 
through Jackson, and `JsonNullNode.asText()` returns the literal *string* 
`"null"` — not Java null. Before this fix, locating a row with any null cell 
silently fell through to a byte-order index path and flashed whatever shuffled 
display row happened to sit at that position. Both sides now canonicalize every 
missing form (`null`, `undefined`, `NaN`, `""`, the literal string `"null"`, 
and `na` / `n/a` / `nan` / `none` case-insensitive) to a single fingerprint 
token before comparison.
   - **Silent fingerprint fallback → "row not found" toast.** When the 
fingerprint walk legitimately exhausts (drift, post-Apply schema change), the 
result panel used to land on the wrong row without saying anything. A silent 
miss-into-wrong-row is strictly worse than no flash, so it now surfaces a toast 
instead.
   - **No-op write guard for `replace_value` and `standardize`.** When the LLM 
re-proposes a mapping whose target equals the cell that's already there (e.g., 
`{south: "South"}` after round 2 already standardized it), the applier now 
skips the no-op write. Without this, LakeFS aborts the version commit with "No 
changes detected in dataset" mid-iteration and the Apply loop dies on the 
second pass.
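
The three fixes above can be sketched together. Everything here is illustrative: the `<missing>` token, the field separator, the function names, and the assumption that the mapping is keyed by the `trim + lowercase` form of the cell are choices made for this example rather than details confirmed by the PR.

```typescript
// Hypothetical helpers; token names and data shapes are assumptions.

type Row = Record<string, unknown>;

const MISSING_TOKEN = "<missing>"; // single stand-in for every missing form
const MISSING_WORDS = new Set(["na", "n/a", "nan", "none"]); // matched case-insensitively

// Map every missing form to one token so both sides of the comparison agree.
function canonicalCell(value: unknown): string {
  if (value === null || value === undefined) return MISSING_TOKEN;
  if (typeof value === "number" && Number.isNaN(value)) return MISSING_TOKEN;
  const text = String(value);
  if (text === "" || text === "null") return MISSING_TOKEN;
  return MISSING_WORDS.has(text.toLowerCase()) ? MISSING_TOKEN : text;
}

// Fingerprint a row as its canonical cells in a fixed column order.
function rowFingerprint(row: Row, columns: string[]): string {
  return columns.map(c => canonicalCell(row[c])).join("\u001f");
}

// Return the display index of the matching row, or null when the walk
// exhausts. A null is meant to become a "row not found" toast, not a guess.
function locateRow(target: string, displayRows: Row[], columns: string[]): number | null {
  const index = displayRows.findIndex(r => rowFingerprint(r, columns) === target);
  return index === -1 ? null : index;
}

// Apply a standardize mapping in place and return how many cells changed.
// Cells that already equal their target are skipped, so a fully no-op
// mapping reports 0 and the caller can skip the version commit.
function applyStandardize(rows: Row[], column: string, mapping: Map<string, string>): number {
  let changed = 0;
  for (const row of rows) {
    const current = row[column];
    if (typeof current !== "string") continue;
    const target = mapping.get(current.trim().toLowerCase());
    if (target === undefined || target === current) continue; // no-op: skip the write
    row[column] = target;
    changed += 1;
  }
  return changed;
}
```

A caller would commit a new dataset version only when `applyStandardize` returns a count greater than zero, which avoids asking LakeFS to commit an unchanged dataset.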
   
   ## What's next
   
   - Broader operator coverage: support for Arrow, generic `FileScan`, and 
`TextInput`.
   
   
   ## How was this PR tested?
   
   **Automated**
   
   - `cd agent-service && bun test` → **231 pass / 0 fail / 500 expect calls** 
across 20 files (profiler, applier, permission gate with-approval, decision 
log, apply-batch end-to-end, JSONL parser, fingerprint contract, dataguard tool 
surface).
   - `cd agent-service && bun run typecheck` → exit 0.
   - `cd frontend && npx tsc --noEmit` → exit 0.
   - `cd frontend && npx ng test --watch=false` scoped to DataGuard specs → 
**60+ tests pass** across `dataguard-checklist`, `data-guard-row-navigator`, 
and `data-guard-results.service`. Two unrelated specs in this directory 
(`data-guard-jsonl.spec.ts`, `data-guard-auto-trigger.service.spec.ts`) trip on 
a pre-existing vitest / jsdom `Blob.text()` infrastructure gap; this is 
environment plumbing, not a regression introduced by this PR (the underlying 
code paths are covered by the agent-service side).
   - `sbt "scalafixAll --check"` and `sbt scalafmtCheckAll` → exit 0 (this PR 
touches no Scala).
   - Prettier check on all DataGuard frontend files → clean.
   
   **Manual end-to-end**
   
   1. Drop `CSVFileScan` on a dirty dataset with known missing values, an `age 
= 999` placeholder, duplicate IDs, and mixed-case labels → checklist 
auto-opens; all four enabled detectors fire with the right risk tiers.
   2. Drop `JSONLFileScan` on a file with nested objects and explicit `null` 
cells in a numeric column → flatten policy produces `user.id`-style columns; 
clicking the pin on a null-cell row lands on the correct row (not the wrong 
worker-shuffled row).
   3. Click **Fix N & run** with a mix of approved and skipped rows → new 
dataset version is created, operator is repointed, workflow re-runs, panel 
auto-rescans showing only residuals.
   4. Click the pin on a duplicate-ID cluster repeatedly → highlight cycles 1 → 
2 → 3 → 4 → 1.
   5. Toggle the toolbar shield off → auto-trigger goes silent. Toggle on, 
click the floating icon → fresh scan starts even mid-flight (queued, never 
concurrent).
   6. Trigger a no-op standardize / replace_value path (already-canonical 
column) → apply succeeds without a LakeFS "no changes detected" error.
   7. Supply a `validRanges` payload via the scan options → outlier detector 
activates and flags out-of-range values as WARNING.
   8. Confirm HIGH / WARNING risk tiers never auto-approve, even with "always 
allow" toggled on the issue type.
   
   ## Was this PR authored or co-authored using generative AI tooling?
   
   Yes. Generated-by: Anthropic Claude Opus 4.7 via Claude Code CLI. AI was 
used throughout the development cycle — design exploration, implementation, 
test authoring, code review. Team Feature is implemented. Every decision was 
reviewed and approved by a human author before being committed; the AI did not 
autonomously merge or push.
   
   
https://github.com/user-attachments/assets/eb405c63-dcaf-4512-9ca9-4c5b2f418da9
   

