Hi,

I was working on limiting query results by their size expressed in bytes,
and some questions arose that I'd like to bring to the mailing list.

The semantics of queries (without aggregation) is that data limits are
applied to the raw data returned from the replicas. While this works fine
for row-count limits, since the number of rows is unlikely to change after
post-processing, it is less accurate for size-based limits, because cell
sizes may differ after post-processing (for example due to applying a
transformation function, a projection, or whatever).

We can truncate the results after post-processing to stay within the
user-provided limit in bytes, but if the result is smaller than the limit,
we will not fetch more. In that case the meaning of "limit" as an actual
limit still holds, though it would be misleading as a page size, because we
would not fetch the maximum amount of data that fits within the page size.

Such a problem is much more visible for "group by" queries with
aggregation. The paging and limiting mechanism is applied to rows rather
than groups, as it has no information about how much memory a single group
uses. For now, I've approximated a group's size as the size of the largest
participating row.
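To make that approximation concrete, here is a minimal sketch (the class
and method names are mine for illustration, not Cassandra internals): each
row merged into the current group updates the group's charged size to the
maximum row size seen so far.

```java
// Hypothetical sketch: approximate a group's size as the size of the
// largest row that participates in it. Names are illustrative only.
class GroupSizeTracker {
    private int approxGroupBytes = 0;

    // Called once per row that participates in the current group.
    void onRow(int rowSizeInBytes) {
        approxGroupBytes = Math.max(approxGroupBytes, rowSizeInBytes);
    }

    // The value charged against the bytes limit for this group.
    int approxGroupBytes() {
        return approxGroupBytes;
    }
}
```

This deliberately under-counts groups built from many small rows, which is
part of why the interpretation question below matters.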

The question concerns the intended interpretation of a size limit expressed
in bytes: do we want this mechanism to let users precisely control the size
of the result set, or do we instead want it to limit the amount of memory
used internally for the data and prevent problems (assuming a size limit
and a row-count limit can be used simultaneously, so that we stop when we
reach either of the specified limits)?
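The "stop when we reach either limit" combination could be sketched as
follows (again a hypothetical illustration, not the actual Cassandra
counter implementation): both counters are advanced per row, and fetching
stops as soon as one of them is exhausted.

```java
// Hypothetical sketch: account for rows and bytes together, stopping
// as soon as either user-specified limit is reached.
class DualLimitCounter {
    private final int rowLimit;
    private final long bytesLimit;
    private int rows = 0;
    private long bytes = 0;

    DualLimitCounter(int rowLimit, long bytesLimit) {
        this.rowLimit = rowLimit;
        this.bytesLimit = bytesLimit;
    }

    // Account for one returned row; returns true while more rows may
    // still be fetched, false once either limit has been reached.
    boolean account(long rowSizeInBytes) {
        rows++;
        bytes += rowSizeInBytes;
        return rows < rowLimit && bytes < bytesLimit;
    }
}
```

Under this reading the bytes limit is a memory-protection bound rather
than a precise result-set size, which sidesteps the post-processing
inaccuracy described above.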

https://issues.apache.org/jira/browse/CASSANDRA-11745

thanks,
- - -- --- ----- -------- -------------
Jacek Lewandowski
