[PR] [Hackathon] feat: add DataGuard for automatic Data Cleaning [texera]

via GitHub Sat, 16 May 2026 06:43:08 -0700


eugenegujing opened a new pull request, #5101:
URL: https://github.com/apache/texera/pull/5101

# DataGuard — permission-gated data cleaning for Texera

## Introduction

DataGuard brings the Claude Code permission UX to **data** instead of code.
Drop a `CSVFileScan`, `CSVOldFileScan`, or `JSONLFileScan` onto the canvas and
a floating checklist slides in: each detected issue gets one row, one risk-tier
badge, one concrete proposed fix, and one Allow/Skip control. Nothing mutates
the dataset until you click **Fix N & run**. Approved fixes are written back as
a new dataset version through LakeFS, the operator is repointed at the cleaned
data, the workflow auto-runs, and DataGuard immediately re-scans so you can
chase the next round.

Four detectors run on every scan: missing values, placeholder sentinels,
duplicate IDs, inconsistent label spellings.

## The problem

Data cleaning fails today in two opposite, equally bad ways:

- **Manual pandas in a notebook** — opaque, unauditable, no provenance,
doesn't survive the person who wrote it.
- **One-click auto-clean tools** — black-box decisions, no explanation, no
human in the loop for the high-impact moves (drop rows, resolve conflicting
IDs, clamp a value that might be a real rare case).

DataGuard's bet is that the **interaction model** is the missing piece, not
the algorithms. Treat each cleaning decision the way Claude Code treats a file
edit: explain the evidence, propose the action, ask permission, log the answer.

## What ships in this PR

### Four detectors, all enabled by default

| Detector | Default | Notes |
|---|---|---|
| Missing values | on | nulls, empties, configurable tokens (`na`, `n/a`, `null`, `none`, `nan`, case-insensitive) |
| Placeholder values | on | numeric sentinels (`999`, `-1`), string sentinels (`unknown`) |
| Duplicate IDs | on | honors an explicit ID column; otherwise infers from name patterns (`id`, `*_id`, `*Id`, dotted paths like `user.id`) |
| Inconsistent labels | on | low-cardinality string columns where `trim + lowercase` keys collide (`Male` / `male` / `MALE`) |

### Every fix asks before it lands

Every proposal carries a risk tier (`low` / `medium` / `high` / `warning`)
that drives both the UI affordance and the permission gate. `low` is
pre-checked; `medium` is pending; **`high` and `warning` can never be
auto-approved**, even when an "always allow" rule exists for the issue type —
some decisions should never be automated away. The legacy "modify" verdict
(free-text override) is rejected at both HTTP and WebSocket boundaries; it
executed the original action while pretending to honor the user's edit, so we
cut it rather than ship a lie.
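
The gate described above can be sketched as a small pure function. The type names, the `"approved"` / `"pending"` states, and the assumption that an "always allow" rule promotes `medium` proposals are hypothetical; only the tier names, the Allow/Skip controls, and the high/warning restriction come from this PR.

```typescript
// Hypothetical sketch of the permission gate; not the shipped implementation.
type RiskTier = "low" | "medium" | "high" | "warning";
type Decision = "approved" | "pending";
type Verdict = "allow" | "skip";

// Initial checkbox state for a proposal.
function initialDecision(tier: RiskTier, hasAlwaysAllowRule: boolean): Decision {
  // high and warning always wait for an explicit click, rule or no rule.
  if (tier === "high" || tier === "warning") return "pending";
  // low is pre-checked.
  if (tier === "low") return "approved";
  // medium: assumed to be promoted only by an "always allow" rule.
  return hasAlwaysAllowRule ? "approved" : "pending";
}

// Boundary validation: anything other than allow/skip (including the
// legacy "modify" verdict) is rejected instead of being silently coerced.
function parseVerdict(raw: string): Verdict {
  if (raw === "allow" || raw === "skip") return raw;
  throw new Error(`unsupported verdict: ${raw}`);
}
```

Keeping the tier check first means no later branch can accidentally approve a `high` or `warning` proposal, whatever rules exist for the issue type.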

### Auto-trigger is the onboarding

The user does nothing special — drop a dataset operator and the checklist
appears. The scan runs the deterministic profiler and the LLM-backed proposal
step server-side, bypassing the agent's ReAct loop, so the LLM can't decide to
call a destructive tool and vaporize the workflow.

### Iterative cleanup loop

Click **Fix N & run** and the loop is `v1 → Apply → v2 → Scan → Apply → v3 →
…`, with the panel auto-rescanning the cleaned data after each round. A toolbar
shield toggles DataGuard per workflow; when the panel is closed but the shield
is on, a small floating icon hangs on the canvas as a one-click rescan.

### Locate-cycle for multi-row issues

Each issue row has a pin. Click it and the Result Panel scrolls to the
affected row and flashes the cell. If the issue affects multiple rows — a
duplicate-ID cluster, say, with four offending rows — **the pin cycles**: first
click 1 of 4, second 2 of 4, then 3 of 4, 4 of 4, and the fifth wraps back to 1
of 4. The cursor is per-issue, so two different issues don't fight over the
focus, and the tooltip previews the next position before you click.
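
The cycling behavior reduces to a per-issue cursor with wraparound. The sketch below is illustrative only: `LocateCursor`, `peek`, and `click` are hypothetical names, and the real navigator presumably also handles scrolling and the cell flash.

```typescript
// Hypothetical per-issue cursor; positions are 1-based to match "1 of 4".
class LocateCursor {
  // Zero-based index of the row the next click will land on, keyed by issue.
  private readonly nextIndex = new Map<string, number>();

  // Position the next click will land on; usable for the tooltip preview.
  peek(issueId: string, rowCount: number): number {
    if (rowCount <= 0) throw new RangeError("issue has no affected rows");
    return ((this.nextIndex.get(issueId) ?? 0) % rowCount) + 1;
  }

  // Returns the position to highlight and advances the cursor, wrapping
  // back to the first row after the last one.
  click(issueId: string, rowCount: number): number {
    const position = this.peek(issueId, rowCount);
    this.nextIndex.set(issueId, position % rowCount);
    return position;
  }
}
```

Five clicks on a four-row issue return 1, 2, 3, 4, 1, and a second issue keeps its own cursor because the map is keyed by issue ID.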

## What got fixed on this branch (the unglamorous moments)

- **Null-cell fingerprint normalization.** Texera's JSONL scan parses
through Jackson, and `NullNode.asText()` returns the literal *string*
`"null"` — not Java null. Before this fix, locating a row with any null cell
silently fell through to a byte-order index path and flashed whatever shuffled
display row happened to sit at that position. Both sides now canonicalize every
missing form (`null`, `undefined`, `NaN`, `""`, the literal string `"null"`,
and `na` / `n/a` / `nan` / `none` case-insensitive) to a single fingerprint
token before comparison.
- **Silent fingerprint fallback → "row not found" toast.** When the
fingerprint walk legitimately exhausts (drift, post-Apply schema change), the
result panel used to land on the wrong row without saying anything. A silent
miss-into-wrong-row is strictly worse than no flash, so it now surfaces a toast
instead.
- **No-op write guard for `replace_value` and `standardize`.** When the LLM
re-proposes a mapping whose target equals the cell that's already there (e.g.,
`{south: "South"}` after round 2 already standardized it), the applier now
skips the no-op write. Without this, LakeFS aborts the version commit with "No
changes detected in dataset" mid-iteration and the Apply loop dies on the
second pass.
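
Two of these fixes are easy to show in miniature. The sketch below canonicalizes missing forms before fingerprinting and counts real changes before a write. All names are hypothetical, and two details are assumptions: cells are trimmed before the token check, and the standardize mapping is keyed by `trim + lowercase`.

```typescript
// Hypothetical sketch; the sentinel, names, and trimming are assumptions.
const MISSING_TOKEN = "\u0000MISSING";
const MISSING_STRINGS = new Set(["", "null", "na", "n/a", "nan", "none"]);

// Collapses every missing form to one token so both sides compare equal.
function canonicalizeCell(value: unknown): string {
  if (value === null || value === undefined) return MISSING_TOKEN;
  if (typeof value === "number" && Number.isNaN(value)) return MISSING_TOKEN;
  if (typeof value === "string" && MISSING_STRINGS.has(value.trim().toLowerCase())) {
    return MISSING_TOKEN;
  }
  return String(value);
}

// Fingerprint of a row over a fixed column order.
function rowFingerprint(row: Record<string, unknown>, columns: string[]): string {
  return JSON.stringify(columns.map(column => canonicalizeCell(row[column])));
}

// Applies a standardize mapping and reports how many cells actually changed.
// A caller can skip the dataset commit when `changed` is 0.
function applyStandardize(
  values: string[],
  mapping: Map<string, string>
): { values: string[]; changed: number } {
  let changed = 0;
  const out = values.map(value => {
    const target = mapping.get(value.trim().toLowerCase());
    if (target === undefined || target === value) return value; // no-op
    changed += 1;
    return target;
  });
  return { values: out, changed };
}
```

Under these assumptions, a row whose cell arrived as JSON `null` and the same row whose cell arrived as the string `"null"` produce identical fingerprints, and re-applying `south -> South` to an already-standardized column reports zero changes.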

## What's next

- Broader operator coverage: support for Arrow, generic `FileScan`, and
`TextInput`.

## How was this PR tested?

**Automated**

- `cd agent-service && bun test` → **231 pass / 0 fail / 500 expect calls**
across 20 files (profiler, applier, permission gate with-approval, decision
log, apply-batch end-to-end, JSONL parser, fingerprint contract, dataguard tool
surface).
- `cd agent-service && bun run typecheck` → exit 0.
- `cd frontend && npx tsc --noEmit` → exit 0.
- `cd frontend && npx ng test --watch=false` scoped to DataGuard specs →
**60+ tests pass** across `dataguard-checklist`, `data-guard-row-navigator`,
and `data-guard-results.service`. Two unrelated specs in this directory
(`data-guard-jsonl.spec.ts`, `data-guard-auto-trigger.service.spec.ts`) trip on
a pre-existing vitest / jsdom `Blob.text()` infrastructure gap; this is
environment plumbing, not a regression introduced by this PR (the underlying
code paths are covered by the agent-service side).
- `sbt "scalafixAll --check"` and `sbt scalafmtCheckAll` → exit 0 (this PR
touches no Scala).
- Prettier check on all DataGuard frontend files → clean.

**Manual end-to-end**

1. Drop `CSVFileScan` on a dirty dataset with known missing values, an `age
= 999` placeholder, duplicate IDs, and mixed-case labels → checklist
auto-opens; all four enabled detectors fire with the right risk tiers.
2. Drop `JSONLFileScan` on a file with nested objects and explicit `null`
cells in a numeric column → flatten policy produces `user.id`-style columns;
clicking the pin on a null-cell row lands on the correct row (not the wrong
worker-shuffled row).
3. Click **Fix N & run** with a mix of approved and skipped rows → new
dataset version is created, operator is repointed, workflow re-runs, panel
auto-rescans showing only residuals.
4. Click the pin on a duplicate-ID cluster repeatedly → highlight cycles 1 →
2 → 3 → 4 → 1.
5. Toggle the toolbar shield off → auto-trigger goes silent. Toggle on,
click the floating icon → fresh scan starts even mid-flight (queued, never
concurrent).
6. Trigger a no-op standardize / replace_value path (already-canonical
column) → apply succeeds without a LakeFS "no changes detected" error.
7. Supply a `validRanges` payload via the scan options → outlier detector
activates and flags out-of-range values as WARNING.
8. Confirm HIGH / WARNING risk tiers never auto-approve, even with "always
allow" toggled on the issue type.

## Was this PR authored or co-authored using generative AI tooling?

Yes. Generated-by: Anthropic Claude Opus 4.7 via Claude Code CLI. AI was
used throughout the development cycle — design exploration, implementation,
test authoring, code review. Team Feature is implemented. Every decision was
reviewed and approved by a human author before being committed; the AI did not
autonomously merge or push.

https://github.com/user-attachments/assets/eb405c63-dcaf-4512-9ca9-4c5b2f418da9

