tanishqgandhi1908 opened a new pull request, #5099: URL: https://github.com/apache/texera/pull/5099
## Motivation

Texera's result pane has historically been a static, page-by-page table viewer with a default page size of five rows. Users could glance at operator outputs and search by column **name**, but they could not interact with the data the way they would in a modern spreadsheet tool: no row-level filtering, no sorting, no full-data search, and no way to see, at a glance, *what an operator actually did to its input*.

That meant every debugging or exploration session looked roughly the same:

1. Click an operator.
2. Click through paginated pages.
3. Switch to the upstream operator.
4. Click through *its* pages.
5. Diff the two in your head.

It worked, but it was slow, fiddly, and easy to get wrong on wide or large tables.

This PR rethinks the result pane around two ideas:

1. **Treat the result pane like a spreadsheet, not a static table.** Sort, filter, search, reorder columns, hide columns, pin columns, all without leaving the operator. Make it work at scale by pushing the heavy lifting down to Iceberg.
2. **Make every operator self-explanatory.** Right above the data, show the user what changed compared to the upstream operator: row delta, column delta, schema diff. So instead of mentally diffing two tables, you *see* the diff inline.

## What changed (story version)

### Phase 1: From `nz-table` to ag-grid Community

The old `nz-table` view rendered every column to the DOM and capped at five rows per page. That worked for toy data but felt cramped, didn't sort or filter, and couldn't survive a 200-column table.

We swapped it for **ag-grid Community** (MIT-licensed, Apache-compatible) using the **Infinite Row Model** wired into Texera's existing WebSocket pagination protocol via a custom `IDatasource`.
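
As a rough illustration of how an infinite-row-model datasource can sit on top of a page-based protocol, here is a minimal TypeScript sketch. The `GetRowsParams` and `Datasource` interfaces are local, trimmed-down mirrors of ag-grid's `IGetRowsParams` and `IDatasource`. `PageResponse`, `RequestPage`, and `createPagedDatasource` are hypothetical names, and the zero-based `pageIndex` is an assumption rather than Texera's actual wire contract.

```typescript
// Local, trimmed-down mirrors of the ag-grid shapes used here so the sketch is
// self-contained. Real code would import IDatasource / IGetRowsParams from ag-grid.
interface GetRowsParams {
  startRow: number;
  endRow: number;
  successCallback(rows: unknown[], lastRow?: number): void;
  failCallback(): void;
}

interface Datasource {
  getRows(params: GetRowsParams): void;
}

// Hypothetical response and transport types (assumptions, not Texera's types).
interface PageResponse {
  rows: unknown[];
  totalRows: number;
}

type RequestPage = (
  pageIndex: number, // zero-based in this sketch (assumption)
  pageSize: number,
  onPage: (page: PageResponse) => void,
  onError: () => void
) => void;

function createPagedDatasource(requestPage: RequestPage): Datasource {
  return {
    getRows(params: GetRowsParams): void {
      // With a fixed block size, startRow is always a multiple of the block
      // size, so the row range maps onto exactly one page.
      const pageSize = params.endRow - params.startRow;
      const pageIndex = Math.floor(params.startRow / pageSize);
      requestPage(
        pageIndex,
        pageSize,
        page => params.successCallback(page.rows, page.totalRows),
        () => params.failCallback()
      );
    },
  };
}
```

The callback-style `requestPage` stands in for a WebSocket request/response pair; the grid only ever sees `successCallback` with the rows of one block and the total row count.
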
Out of the box, the user now gets:

- Sort + per-column filter menus
- Column reorder via drag
- Column hide/show via a toggle dropdown
- Column pin (left/right) via header context menu
- DOM column virtualization so 200-column tables render smoothly
- Pagination with **auto-fit page size**: resize the dock and the page size adjusts to the visible space

The grid is themed against Texera's existing Ant Design palette (no garish ag-grid defaults), and the per-column stats (Min / Max / Non-Null / category %) that lived in the old header are restored via a custom header component: same data, better layout.

### Phase 2: Backend pushdown

Spreadsheet UX is only useful if it scales. Texera stores operator results as **Iceberg / Parquet**, which can prune entire data files by partition + min/max stats and push predicates into the Parquet reader. We extended the protocol and the storage layer to take advantage of that:

- `ResultPaginationRequest` now carries optional `filters`, `sorts`, and `rowSearch` fields.
- `VirtualDocument` gains `getRangeWithQuery` + `countWithQuery` (defaulted to safe fallbacks so non-Iceberg documents keep working).
- A new `IcebergPredicateBuilder` translates the wire-format `ColumnFilter` into Iceberg `Expressions` with **type-aware value parsing** per column type (no silent string-coercion bugs).
- `IcebergDocument` implements both methods: predicate pushdown for ops Iceberg supports natively, residual evaluation in memory for `contains` / `endsWith` / `rowSearch`, and an in-memory sort capped at `storage.result.sort.max-rows` (default 100k).

When sort is requested but the matched count exceeds the cap, the backend returns rows in scan order with a `sortSkipped` flag, and the UI shows a friendly banner explaining how to narrow the filter. (Iceberg cannot push ORDER BY into the Parquet reader, so sort is the one place we have to spend JVM heap.)
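
The sort-cap fallback described above can be sketched as follows. The actual backend is Scala; this TypeScript version only illustrates the control flow, and `Row`, `SortSpec`, `SortedResult`, and `sortWithCap` are hypothetical names. It also assumes the matched rows are already in memory, which a real implementation would likely avoid by counting first.

```typescript
type Value = number | string;
type Row = Record<string, Value>;

interface SortSpec {
  column: string;
  descending: boolean;
}

interface SortedResult {
  rows: Row[];
  sortSkipped: boolean;
}

// Numbers compare numerically; everything else compares as strings.
// Missing values are treated as empty strings (a choice made for this sketch).
function compareValues(a: Value | undefined, b: Value | undefined): number {
  if (typeof a === "number" && typeof b === "number") return a - b;
  const left = String(a ?? "");
  const right = String(b ?? "");
  if (left < right) return -1;
  if (left > right) return 1;
  return 0;
}

function sortWithCap(matched: Row[], sorts: SortSpec[], maxSortRows: number): SortedResult {
  // No sort requested: scan order is the expected order.
  if (sorts.length === 0) return { rows: matched, sortSkipped: false };
  // Over the cap: keep scan order and tell the caller the sort was skipped.
  if (matched.length > maxSortRows) return { rows: matched, sortSkipped: true };
  // Under the cap: sort a copy so the input stays untouched.
  const sorted = [...matched].sort((left, right) => {
    for (const spec of sorts) {
      const order = compareValues(left[spec.column], right[spec.column]);
      if (order !== 0) return spec.descending ? -order : order;
    }
    return 0;
  });
  return { rows: sorted, sortSkipped: false };
}
```

Returning the flag alongside the rows, rather than failing the request, is what lets a UI keep showing data while it explains why the order was not applied.
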
### Phase 3: Full-data row search

A debounced `Search rows...` input above the grid sends a `rowSearch` string down to the backend, which compiles it into a multi-column `contains` predicate over all string columns. This is the **first** real "search inside the data" experience in the result pane; the existing column-name search continues to work alongside it.

### Phase 4: The transformation diff

This is the most ambitious idea: every operator, at a glance, tells you what it did. A compact strip above the grid renders:

- **Left pill**: upstream operator name with its row count and column count (taken from the frontend's per-operator cache, so no extra backend calls).
- **Middle**: row delta (e.g. `↓ -149 rows (-99.3%)`, color-coded green/red/neutral) and column delta (e.g. `+2 -1 ⇄1 cols` or `5 cols · unchanged`).
- **Right pill**: current operator.

Click the strip and it expands inline (no popup) into a detail drawer with:

- A two-row Before / After bar visualisation of row counts (scaled relative to the larger side, with the actual numbers right-aligned for clarity).
- Coloured tag lists for **Removed**, **Added**, **Type-changed**, and **Kept** columns.

For source operators with no input, the strip shows a friendly `▶ Source operator` chip. For multi-input operators (joins, unions), it collapses to `⛙ Combined from N inputs` and defers the pairwise diff for a future iteration.

All of this is computed from the data the frontend already maintains in `WorkflowResultService`, with **zero new backend round trips**.

### Layout: bottom dock instead of floating modal

The result panel itself was a draggable floating popup. We turned it into a **fixed bottom dock**: full viewport width, top-edge resize handle for height, no drag-to-move, no "return to corner" widget.
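
To make the transformation strip from Phase 4 concrete, here is a minimal sketch of how the row delta and the column buckets might be derived from two cached operator summaries. `Column`, `OperatorSummary`, `TransformationDiff`, and `diffOperators` are hypothetical names chosen for illustration; they are not the PR's code or the actual shape of the cache in `WorkflowResultService`.

```typescript
interface Column {
  name: string;
  type: string;
}

interface OperatorSummary {
  rowCount: number;
  columns: Column[];
}

interface TransformationDiff {
  rowDelta: number;
  rowDeltaPercent: number | null; // null when the upstream has no rows
  added: string[];
  removed: string[];
  typeChanged: string[];
  kept: string[];
}

function diffOperators(upstream: OperatorSummary, current: OperatorSummary): TransformationDiff {
  // Index upstream column types by name for constant-time lookup.
  const upstreamTypes = new Map<string, string>();
  for (const column of upstream.columns) upstreamTypes.set(column.name, column.type);

  const added: string[] = [];
  const typeChanged: string[] = [];
  const kept: string[] = [];
  const currentNames = new Set<string>();

  for (const column of current.columns) {
    currentNames.add(column.name);
    const previousType = upstreamTypes.get(column.name);
    if (previousType === undefined) added.push(column.name);
    else if (previousType !== column.type) typeChanged.push(column.name);
    else kept.push(column.name);
  }

  // Anything upstream had that the current operator no longer emits.
  const removed = upstream.columns
    .filter(column => !currentNames.has(column.name))
    .map(column => column.name);

  const rowDelta = current.rowCount - upstream.rowCount;
  const rowDeltaPercent = upstream.rowCount === 0 ? null : (rowDelta / upstream.rowCount) * 100;

  return { rowDelta, rowDeltaPercent, added, removed, typeChanged, kept };
}
```

With an upstream of 150 rows and a current output of 1 row, this yields a row delta of -149 and roughly -99.3%, matching the example label in the strip. Columns are matched by name only, so a rename shows up as one removed plus one added column.
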
Clicking a row no longer opens a modal. Instead, an inline row inspector slides in below the grid with a JSON tree view, prev/next/close, and visual selection on the corresponding row in the grid.

## Architectural notes

- **Frontend memory is bounded** regardless of dataset size: ag-grid's row + column virtualization keeps DOM at ~20–30 row nodes; the page cache evicts LRU at ~2 000 rows.
- **The frontend page cache is populated on response** for the unfiltered fast path, so paging back and forth costs zero WS round-trips after the first visit.
- **Wire format stays backward-compatible**: `columnOffset` / `columnLimit` / `columnSearch` are kept on `ResultPaginationRequest` with their defaults for the Python SDK and any external callers. The new frontend simply stops setting them; the bare-minimum payload also avoids a Jackson edge case where JS `Number.MAX_SAFE_INTEGER` overflows Scala's `Int`.
- **Filter / sort / rowSearch fields are elided** from the wire when empty, so the no-query path is byte-identical to the pre-PR shape.

## Risks and mitigations (also covered in the plan doc)

- **Sort beyond 100 k rows**: returned unsorted with a banner; user narrows the filter to get sort back. Spill-to-disk sort is a follow-up.
- **Filter value typing**: centralized in `IcebergPredicateBuilder` with per-Iceberg-type parsers; ag-grid picks the right filter component per column type so bad input is rare at the UI level.
- **Streaming results**: the existing `dirtyPageIndices` hook maps to `gridApi.purgeInfiniteCache()` so scroll position stays put while new rows land.
- **Bundle size**: ag-grid adds ~300 KB gzipped. We register only the Community modules we use; the result pane is a good candidate for future lazy-loading.
- **License**: ag-grid Community is MIT, which is [Category A under Apache policy](https://www.apache.org/legal/resolved.html#category-a). No commercial license is used or required.
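
The field elision mentioned in the architectural notes can be illustrated with a small request builder. This is a sketch under assumptions: only `filters`, `sorts`, and `rowSearch` come from the description above, while `operatorID`, `pageIndex`, `pageSize`, the inner shapes of `ColumnFilter` and `ColumnSort`, and `buildPaginationRequest` itself are hypothetical.

```typescript
// Hypothetical shapes for illustration only.
interface ColumnFilter {
  column: string;
  op: string;
  value: string;
}

interface ColumnSort {
  column: string;
  descending: boolean;
}

interface BaseRequest {
  operatorID: string;
  pageIndex: number;
  pageSize: number;
}

interface QueryOptions {
  filters?: ColumnFilter[];
  sorts?: ColumnSort[];
  rowSearch?: string;
}

type PaginationRequest = BaseRequest & QueryOptions;

function buildPaginationRequest(base: BaseRequest, query: QueryOptions): PaginationRequest {
  const request: PaginationRequest = { ...base };
  // Attach each query field only when it carries something, so an empty query
  // serializes to exactly the same JSON as the base request.
  if (query.filters && query.filters.length > 0) request.filters = query.filters;
  if (query.sorts && query.sorts.length > 0) request.sorts = query.sorts;
  if (query.rowSearch) request.rowSearch = query.rowSearch;
  return request;
}
```

Because absent keys never reach `JSON.stringify`, a request built with empty arrays and an empty search string produces the same payload as one built with no query at all.
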
## Future ideas this PR enables

The same data plane (per-operator schema + row count cache + WebSocket pagination) makes these reasonable follow-ups:

- Sort spill-to-disk via a temp Iceberg sort transform, which eliminates the 100 k cap.
- Filtered-count caching keyed by `hash(filters, rowSearch)` so count doesn't recompute per page.
- Cross-operator comparison ("diff this op's output against the same op from a previous run"), which reuses the schema-diff machinery.
- Bloom filters or inverted indices for fast row-search on huge string columns.

## Test plan

- [ ] Frontend builds clean (`yarn build`) and lints (`yarn lint`).
- [ ] Backend Scala compiles (`sbt WorkflowExecutionService/Compile/compile`).
- [ ] Run the **Iris CSV** sample workflow:
  - [ ] Sort any column → rows reorder across pages.
  - [ ] Filter `SepalLengthCm > 5` via the column header menu → grid + row count update; banner stays hidden.
  - [ ] Type into the "Search rows..." box → debounced backend round trip; matching rows appear.
  - [ ] Click a row → bottom inspector slides in; prev/next walks rows; × closes.
  - [ ] Resize the dock from the top edge → page size auto-adjusts; data unchanged.
  - [ ] Click the transformation strip → drawer expands showing schema diff with column tags.
- [ ] On a multi-million-row table, apply a narrowing filter on a partitioned column → confirm the backend logs show Iceberg pruning data files.
- [ ] Force a sort over more than 100 k matched rows → confirm the yellow "Too many rows to sort" banner appears and the grid shows scan order.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
