I think it definitely makes sense to discount `timeAllowed` for the shard
requests.  I was thinking about this and related matters recently at my
last job. I recall thinking that a possible solution is to use the NOW
timestamp passed along from the coordinator to the shards as the epoch to
base the time allowance on, since it's captured early by the coordinator.
On the other hand, that could be considered a hack / abuse of the intent
of NOW, so maybe a separate parameter (or HTTP header) should be used to
pass this along.
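
A minimal sketch of the discounting idea, assuming NOW is available as
epoch millis on both sides (the class and method names below are
illustrative, not actual Solr APIs):

```java
// Hypothetical sketch: compute the remaining timeAllowed budget for a
// shard request by subtracting the time already spent at the coordinator,
// using the NOW epoch timestamp the coordinator already propagates.
public class TimeAllowedDiscount {

  /**
   * @param timeAllowedMs  original timeAllowed from the request
   * @param nowEpochMs     NOW captured at the coordinator (epoch millis)
   * @param currentEpochMs current time when building the shard request
   * @return remaining budget, clamped to a 1 ms floor so shards can still
   *         return partial results instead of timing out immediately
   *         (a real implementation might fail fast instead)
   */
  static long discountTimeAllowed(long timeAllowedMs, long nowEpochMs, long currentEpochMs) {
    long elapsed = Math.max(0L, currentEpochMs - nowEpochMs);
    long remaining = timeAllowedMs - elapsed;
    return Math.max(1L, remaining);
  }

  public static void main(String[] args) {
    // 5s budget, 1.2s already spent at the coordinator -> 3.8s left for shards
    System.out.println(discountTimeAllowed(5000L, 1_000_000L, 1_001_200L)); // prints 3800
  }
}
```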

I don't think it's worth bothering with this for memory or CPU limits,
but I'm not against it.

On Mon, Sep 22, 2025 at 10:11 AM Andrzej Białecki <[email protected]> wrote:

> Hi devs,
>
> While working on SOLR-17869 two things occurred to me:
>
> 1. Discounting the `timeAllowed`
>
> When `timeAllowed` is > 0, `TopGroupsShardRequestFactory` discounts the
> time already spent at the query coordinator when creating shard requests.
> This happens ONLY for grouping queries; for all other queries
> `timeAllowed` is passed as-is to the shard requests.
>
> The net effect is that for other query types the same value of
> `timeAllowed` is applied to the processing at the coordinator as well as to
> the per-shard local processing. Since `timeAllowed` is a wall-clock time
> relative to the time of the local request, not discounting the time already
> spent increases the likelihood of the query timing out at the coordinator
> (because even if shard requests all fit within the `timeAllowed` *as
> started at each shard*, it still takes some time to return the results and
> post-process them at the coordinator where timeAllowed was started much
> earlier).
>
> From this POV it seems that `timeAllowed` should always be discounted
> before sending shard requests, for all types of queries, which is not the
> case today.
>
> Another question comes to mind, too: should we discount other types of
> query limits as well? For other limit types the same “budget” is applied
> both at the coordinator and locally per-shard. However, CPU and memory
> limits are not tied to an absolute point in time, so maybe it doesn’t make
> sense to discount them by the amount spent at the coordinator, unless we
> want to limit the “end-to-end” impact of a query across both local and
> distrib sub-requests. WDYT?
>
>
> 2. QueryCommand.timeAllowed
>
> During the upgrade to Lucene 10 a few blocks of code were commented out in
> SolrIndexSearcher, and one of them was this (SolrIndexSearcher:309):
>
> /* final long timeAllowed = cmd.getTimeAllowed();
> if (timeAllowed > 0) {
>   setTimeout(new QueryTimeoutImpl(timeAllowed));
> }*/
>
> I was surprised that none of the tests failed with this change and started
> digging. QueryCommand.timeAllowed is initialized from request params. But
> it looks like this property is no longer needed because now for each
> request we initialize the QueryLimits from the same request params, which
> serve the same purpose.
>
> What’s more, I think the commented-out section was a bug because it would
> call `IndexSearcher.setTimeout` on a searcher instance that was used for
> multiple queries, and that setting would persist even for other subsequent
> requests.
>
> I propose to remove `QueryCommand.timeAllowed` altogether.
>
> —
>
> Andrzej Białecki
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>