I agree that this is more suitable as a paging option, and not as a CQL LIMIT option. 

If it were to be a CQL LIMIT option though, then IMO it should be accurate with respect to the result set; there shouldn’t be any further results that could have been returned within the LIMIT.

On 12 Jun 2023, at 10:16, Benjamin Lerer <ble...@apache.org> wrote:


Thanks Jacek for raising that discussion.

I do not have in mind a scenario where it could be useful to specify a LIMIT in bytes. The LIMIT clause is usually used when you know how many rows you wish to display or use. Unless somebody has a useful scenario in mind I do not think that there is a need for that feature.

Paging in bytes makes sense to me as the paging mechanism is transparent for the user in most drivers. It is simply a way to optimize your memory usage from end to end.
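For example, with the DataStax Java driver the application just iterates the result set and the driver fetches the next page behind the scenes. This is only a rough sketch to illustrate that transparency (keyspace/table names are made up; a page size in bytes does not exist in the driver today, it is exactly what this thread is about):

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class TransparentPaging {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // The page size is a fetch hint, not a result limit: when iteration
            // crosses a page boundary the driver fetches the next page behind
            // the scenes, so application code never handles pages explicitly.
            SimpleStatement stmt = SimpleStatement
                    .newInstance("SELECT * FROM ks.tbl")
                    .setPageSize(5000); // rows per page today; a size-in-bytes
                                        // variant would slot into the same place
            for (Row row : session.execute(stmt)) {
                process(row);
            }
        }
    }

    private static void process(Row row) {
        // application logic
    }
}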

I do not like the approach of using both of them simultaneously, because if you request a page with a certain number of rows and do not get it, it is really confusing and can be a problem for some use cases. We have users who keep their session open, together with the paging information, in order to display pages of data.

On Mon, 12 Jun 2023 at 09:08, Jacek Lewandowski <lewandowski.ja...@gmail.com> wrote:
Hi,

I was working on limiting query results by their size expressed in bytes, and some questions arose that I'd like to bring to the mailing list.

Regarding the semantics of queries without aggregation: data limits are applied to the raw data returned from replicas. This works fine for row-number limits, since the number of rows is unlikely to change during post-processing, but it is less accurate for size-based limits, because cell sizes may change during post-processing (for example due to applying a transformation function, a projection, or similar).
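To make the mismatch concrete, here is a toy illustration (plain Java, nothing Cassandra-specific; the "transformation" is just string concatenation standing in for an arbitrary function):

import java.nio.charset.StandardCharsets;

public class SizeAfterPostProcessing {
    public static void main(String[] args) {
        String rawCell = "2023-06-12";                    // value as shipped by the replica
        String transformed = rawCell + "T00:00:00.000Z";  // stand-in for some transformation function

        int rawBytes = rawCell.getBytes(StandardCharsets.UTF_8).length;        // 10
        int resultBytes = transformed.getBytes(StandardCharsets.UTF_8).length; // 24

        // A byte limit accounted against rawBytes does not reflect what the
        // client actually receives once the transformation has been applied.
        System.out.printf("raw = %d bytes, post-processed = %d bytes%n", rawBytes, resultBytes);
    }
}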

We can truncate the results after post-processing to stay within the user-provided limit in bytes, but if the result is smaller than the limit, we will not fetch more. In that case the meaning of "limit" as an actual upper bound still holds, but it would be misleading as a page size, because we would not fetch the maximum amount of data that fits within the page size.

Such a problem is much more visible for "group by" queries with aggregation. The paging and limiting mechanism is applied to the rows rather than groups, as it has no information about how much memory a single group uses. For now, I've approximated a group size as the size of the largest participating row.
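Roughly, that heuristic looks like the sketch below (the names and the byte-array row representation are made up for illustration, not the actual code):

import java.util.List;

public class GroupSizeEstimate {
    // Hypothetical per-group accounting: rows are represented here as raw byte
    // arrays, standing in for whatever size information is available before
    // post-processing.
    static long estimateGroupSize(List<byte[]> participatingRows) {
        long largest = 0;
        for (byte[] row : participatingRows) {
            largest = Math.max(largest, row.length);
        }
        // Approximate the group's footprint by its largest participating row,
        // since the size of the aggregated result is not known at this point.
        return largest;
    }
}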

The problem concerns the allowed interpretation of the size limit expressed in bytes: do we want to use this mechanism to let users precisely control the size of the result set, or do we instead want to use it to limit the amount of memory used internally for the data and prevent problems (assuming a size limit and a row-count limit can be used simultaneously, stopping as soon as either limit is reached)?
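If we go with the second interpretation, combining both limits could be as simple as a pair of counters that stop the query at whichever budget is exhausted first; a rough sketch (invented class, not actual Cassandra code):

public final class CombinedLimits {
    private final int rowLimit;     // maximum number of rows
    private final long byteLimit;   // maximum accumulated size in bytes
    private int rowsCounted;
    private long bytesCounted;

    CombinedLimits(int rowLimit, long byteLimit) {
        this.rowLimit = rowLimit;
        this.byteLimit = byteLimit;
    }

    // Account for one more row of the given size; returns false as soon as
    // either budget would be exceeded, i.e. we stop at whichever limit is
    // reached first.
    boolean tryAccept(long rowSizeInBytes) {
        if (rowsCounted >= rowLimit || bytesCounted + rowSizeInBytes > byteLimit) {
            return false;
        }
        rowsCounted++;
        bytesCounted += rowSizeInBytes;
        return true;
    }
}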


thanks,
- - -- --- ----- -------- -------------
Jacek Lewandowski
