GitHub user tanishqgandhi1908 created a discussion: Design: interactive grid 
for the operator result pane

Design conversation for #5394. 

## Making sort, filter, and row search work on the full dataset

The frontend only ever holds a small slice of the data: whichever pages the 
user has scrolled through. If sort and filter were evaluated only on the rows 
currently in browser memory, the user would silently get wrong results on any 
dataset larger than the loaded pages. To make them meaningful, the filter / 
sort / row-search criteria need to be evaluated on the backend, where the full 
dataset lives.

Operator results are already stored as Iceberg tables backed by Parquet files. 
Iceberg has two capabilities that are relevant here:

- It can **skip entire data files** during a scan by comparing the filter 
against per-file min/max statistics it stores alongside the data.
- It can **push remaining row-level predicates into the Parquet reader**, so 
only matching rows are decoded.
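
To make the file-skipping idea concrete, here is a minimal, library-free sketch of min/max pruning. It is not Iceberg's implementation; `FileStats`, `may_contain_greater_than`, and `prune_files` are hypothetical names, and the example assumes a single integer column with a `>` predicate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file summary: min and max of one integer column."""
    path: str
    min_value: int
    max_value: int


def may_contain_greater_than(stats: FileStats, threshold: int) -> bool:
    # A file can hold a row with value > threshold only if its max exceeds it.
    return stats.max_value > threshold


def prune_files(files, threshold):
    # Keep only the files whose statistics cannot rule out a match.
    return [f.path for f in files if may_contain_greater_than(f, threshold)]


files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# Filter "value > 150": part-0 is skipped without being opened.
survivors = prune_files(files, 150)
```

Files that survive pruning still have to be read; the row-level predicate decides which of their rows actually match.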

The proposal is to surface these capabilities by extending the existing 
WebSocket pagination protocol with optional filter / sort / row-search fields, 
and adding methods to the storage abstraction that execute them through Iceberg:

- `ResultPaginationRequest` gains optional `filters`, `sorts`, and `rowSearch` 
fields. Requests without these fields take the same code path as today.
- `VirtualDocument` gains `getRangeWithQuery` and `countWithQuery` methods, 
defaulted to safe fallbacks so non-Iceberg document types continue to work 
unchanged.
- A new `IcebergPredicateBuilder` translates the wire-format `ColumnFilter` 
objects into Iceberg `Expressions`, with type-aware value parsing per column 
type so we don't silently mis-coerce strings into numbers.
- `IcebergDocument` implements both new methods. Operators Iceberg supports 
natively (`eq`, `ne`, `lt`, `le`, `gt`, `ge`, `startsWith`, `isNull`, 
`isNotNull`, `in`) are pushed down. `contains` and `endsWith` aren't 
pushdown-capable, so they're evaluated in memory over the iterator returned by 
the scan. `rowSearch` compiles to a multi-column `contains` and runs as a 
residual.
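
The split between pushed-down and residual operators, plus the type-aware parsing, can be sketched roughly as follows. This is an illustration in Python rather than the JVM code: the `ColumnFilter` shape, the column type names, and the helper functions are assumptions, and matching is assumed to be case-sensitive.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Operator names as listed above.
PUSHDOWN_OPS = {"eq", "ne", "lt", "le", "gt", "ge",
                "startsWith", "isNull", "isNotNull", "in"}
RESIDUAL_OPS = {"contains", "endsWith"}


@dataclass(frozen=True)
class ColumnFilter:
    """Assumed wire shape: values arrive as strings."""
    column: str
    op: str
    value: Optional[str] = None


def parse_value(raw: str, column_type: str) -> Any:
    """Parse a wire value according to the column type; fail instead of guessing."""
    if column_type == "integer":
        return int(raw)  # raises ValueError on input such as "12abc"
    if column_type == "double":
        return float(raw)
    if column_type == "boolean":
        if raw.lower() not in ("true", "false"):
            raise ValueError(f"not a boolean: {raw!r}")
        return raw.lower() == "true"
    if column_type == "string":
        return raw
    raise ValueError(f"unsupported column type: {column_type}")


def split_filters(filters):
    """Return (pushdown, residual); reject operators in neither set."""
    unknown = [f.op for f in filters if f.op not in PUSHDOWN_OPS | RESIDUAL_OPS]
    if unknown:
        raise ValueError(f"unknown operators: {unknown}")
    pushdown = [f for f in filters if f.op in PUSHDOWN_OPS]
    residual = [f for f in filters if f.op in RESIDUAL_OPS]
    return pushdown, residual


def matches_residual(row: dict, residual, row_search: Optional[str] = None) -> bool:
    """Evaluate residual filters and the optional row search on one row."""
    for f in residual:
        cell = row.get(f.column)
        text = "" if cell is None else str(cell)
        needle = f.value or ""
        if f.op == "contains" and needle not in text:
            return False
        if f.op == "endsWith" and not text.endswith(needle):
            return False
    if row_search:
        # Row search behaves like "contains" over every column of the row.
        if not any(row_search in str(v) for v in row.values() if v is not None):
            return False
    return True


pushdown, residual = split_filters([
    ColumnFilter("age", "gt", "30"),
    ColumnFilter("name", "contains", "an"),
])
```

In this sketch only `pushdown` would be translated into scan predicates; `matches_residual` would then be applied to each row coming out of the scan iterator.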

**Sort is the one exception.** Iceberg has no `ORDER BY` pushdown, so a sort is 
necessarily executed in JVM memory over the filtered iterator. To prevent that 
from OOM-ing the backend on large filtered sets, sort is capped at a 
configurable row threshold (`storage.result.sort.max-rows`, default 100k). When 
the matched count exceeds the cap, rows are returned in scan order with a 
`sortSkipped` flag in the response, and the frontend shows a banner explaining 
how to narrow the filter to enable sorting.
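
A rough sketch of the cap logic, in Python for brevity. `sorted_page` is a hypothetical helper, and detecting the overflow by buffering at most `max_rows + 1` rows is an assumption of this sketch rather than a description of the prototype.

```python
from itertools import chain, islice


def sorted_page(rows, key, offset, limit, max_rows):
    """Return (page, sort_skipped) for one page of filtered rows.

    `rows` is the filtered iterator in scan order. At most max_rows + 1
    rows are buffered, so memory stays bounded even when the cap is hit.
    """
    iterator = iter(rows)
    buffered = list(islice(iterator, max_rows + 1))
    if len(buffered) > max_rows:
        # Over the cap: fall back to scan order and flag it for the frontend.
        scan_order = chain(buffered, iterator)
        return list(islice(scan_order, offset, offset + limit)), True
    buffered.sort(key=key)
    return buffered[offset:offset + limit], False


rows = [5, 3, 9, 1]
under_cap = sorted_page(rows, key=lambda r: r, offset=0, limit=2, max_rows=10)
over_cap = sorted_page(rows, key=lambda r: r, offset=0, limit=2, max_rows=3)
```

With four matching rows, a cap of 10 yields the first sorted page and `sort_skipped = False`; a cap of 3 yields the first two rows in scan order and `sort_skipped = True`.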

## Architectural notes

- Frontend memory stays bounded: ag-grid virtualization keeps the DOM at 
~20–30 row nodes regardless of dataset size.
- The existing pagination cache in `OperatorPaginationResultService` is 
populated on response, so revisiting a page needs no WebSocket round-trip.
- Wire format stays backward-compatible. `columnOffset` / `columnLimit` / 
`columnSearch` are kept on `ResultPaginationRequest` with their defaults; the 
new frontend simply stops setting them because column virtualization makes the 
column pager obsolete. New fields are skipped when empty so the no-query path 
is byte-identical to today's payload.
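
The "skipped when empty" rule can be illustrated with a small serializer sketch. Only `filters`, `sorts`, and `rowSearch` come from the proposal; the other key names (`operatorId`, `pageIndex`, `pageSize`) and the function itself are placeholders for illustration.

```python
import json


def encode_pagination_request(operator_id, page_index, page_size,
                              filters=None, sorts=None, row_search=None):
    # Baseline payload; these three key names are illustrative placeholders.
    payload = {
        "operatorId": operator_id,
        "pageIndex": page_index,
        "pageSize": page_size,
    }
    optional = {"filters": filters, "sorts": sorts, "rowSearch": row_search}
    for key, value in optional.items():
        if value:  # skips None, empty lists, and empty strings
            payload[key] = value
    return json.dumps(payload, separators=(",", ":"))


legacy = encode_pagination_request("op-1", 2, 50)
empty_query = encode_pagination_request("op-1", 2, 50,
                                        filters=[], sorts=[], row_search="")
with_query = encode_pagination_request(
    "op-1", 2, 50, filters=[{"column": "age", "op": "gt", "value": "30"}])
```

Here `empty_query` serializes to exactly the same string as `legacy`, which is the property that lets a backend without the new fields keep parsing no-query requests.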


## Reference implementation

The hackathon prototype, [#5099](https://github.com/apache/texera/pull/5099), 
has all of this working end-to-end and is available for reference.


Happy to discuss more on this!!

GitHub link: https://github.com/apache/texera/discussions/5395
