GitHub user tanishqgandhi1908 created a discussion: Design: interactive grid for the operator result pane

Design conversation for #5394.

## Making sort, filter, and row search work on the full dataset

The frontend only ever holds a small slice of the data, whatever pages the user has scrolled through. If sort and filter were evaluated only on the rows currently in browser memory, the user would silently get wrong results on any non-trivial dataset. To make them meaningful, the filter / sort / row-search criteria need to be evaluated on the backend, where the full dataset lives.

Operator results are already stored as Iceberg / Parquet files. Iceberg has two relevant capabilities for this:

- It can **skip entire data files** during a scan by comparing the filter against per-file min/max statistics it stores alongside the data.
- It can **push remaining row-level predicates into the Parquet reader**, so only matching rows are decoded.

The proposal is to surface these capabilities by extending the existing WebSocket pagination protocol with optional filter / sort / row-search fields, and adding methods to the storage abstraction that execute them through Iceberg:

- `ResultPaginationRequest` gains optional `filters`, `sorts`, and `rowSearch` fields. Requests without these fields take the same code path as today.
- `VirtualDocument` gains `getRangeWithQuery` and `countWithQuery` methods, defaulted to safe fallbacks so non-Iceberg document types continue to work unchanged.
- A new `IcebergPredicateBuilder` translates the wire-format `ColumnFilter` objects into Iceberg `Expressions`, with type-aware value parsing per column type so we don't silently mis-coerce strings into numbers.
- `IcebergDocument` implements both new methods. Operators Iceberg supports natively (`eq`, `ne`, `lt`, `le`, `gt`, `ge`, `startsWith`, `isNull`, `isNotNull`, `in`) are pushed down. `contains` and `endsWith` aren't pushdown-capable, so they're evaluated in memory over the iterator returned by the scan. `rowSearch` compiles to a multi-column `contains` and runs as a residual.

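
To make the pushdown / residual split above concrete, here is a minimal sketch in Python (chosen for brevity; it is not the prototype's code). The `ColumnFilter` field names, the helper function names, and the case-sensitive matching are all assumptions for illustration. Only the operator lists come from the proposal itself.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Operators the proposal lists as natively supported by Iceberg (pushed down).
PUSHDOWN_OPS = {"eq", "ne", "lt", "le", "gt", "ge",
                "startsWith", "isNull", "isNotNull", "in"}
# Operators the proposal says are evaluated in memory over the scan iterator.
RESIDUAL_OPS = {"contains", "endsWith"}


@dataclass(frozen=True)
class ColumnFilter:
    """Hypothetical wire shape; the real ColumnFilter fields may differ."""
    column: str
    op: str
    value: Optional[str] = None


def split_filters(
    filters: Iterable[ColumnFilter],
) -> Tuple[List[ColumnFilter], List[ColumnFilter]]:
    """Partition filters into (pushdown, residual) by operator."""
    pushdown: List[ColumnFilter] = []
    residual: List[ColumnFilter] = []
    for f in filters:
        if f.op in PUSHDOWN_OPS:
            pushdown.append(f)
        elif f.op in RESIDUAL_OPS:
            residual.append(f)
        else:
            raise ValueError(f"unsupported operator: {f.op}")
    return pushdown, residual


def matches_residual(row: Dict[str, object], f: ColumnFilter) -> bool:
    """Evaluate one residual filter; null cells never match (assumption)."""
    cell = row.get(f.column)
    if cell is None or f.value is None:
        return False
    text = str(cell)
    # Case-sensitive matching is an assumption; the proposal does not specify.
    return f.value in text if f.op == "contains" else text.endswith(f.value)


def matches_row_search(row: Dict[str, object], term: str) -> bool:
    """rowSearch modelled as `contains` over every column of the row."""
    return any(v is not None and term in str(v) for v in row.values())


def apply_residuals(
    rows: Iterable[Dict[str, object]],
    residual: List[ColumnFilter],
    row_search: Optional[str] = None,
) -> Iterator[Dict[str, object]]:
    """Lazily filter rows coming out of the (already pushed-down) scan."""
    for row in rows:
        if all(matches_residual(row, f) for f in residual) and (
            row_search is None or matches_row_search(row, row_search)
        ):
            yield row
```

In the proposed design, the `pushdown` list would be handed to `IcebergPredicateBuilder` and turned into Iceberg `Expressions`; that step is left out here because it depends on the Iceberg library. The sketch only shows the routing decision and the in-memory residual pass, which stays lazy so it does not materialize the filtered set.
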

**Sort is the one exception.** Iceberg has no `ORDER BY` pushdown, so a sort is necessarily executed in JVM memory over the filtered iterator. To prevent that from OOM-ing the backend on large filtered sets, sort is capped at a configurable row threshold (`storage.result.sort.max-rows`, default 100k). When the matched count exceeds the cap, rows are returned in scan order with a `sortSkipped` flag in the response, and the frontend shows a banner explaining how to narrow the filter to enable sorting.

## Architectural notes

- Frontend memory stays bounded: ag-grid virtualization keeps the DOM at ~20–30 row nodes regardless of dataset size.
- The existing pagination cache in `OperatorPaginationResultService` is populated on response, so revisiting a page requires no WebSocket round-trip.
- Wire format stays backward-compatible. `columnOffset` / `columnLimit` / `columnSearch` are kept on `ResultPaginationRequest` with their defaults; the new frontend simply stops setting them because column virtualization makes the column pager obsolete. New fields are skipped when empty, so the no-query path is byte-identical to today's payload.

## Reference implementation

The hackathon prototype, [#5099](https://github.com/apache/texera/pull/5099), has all of this working end-to-end. It's there for reference.

Happy to discuss more on this!!

GitHub link: https://github.com/apache/texera/discussions/5395
