[
https://issues.apache.org/jira/browse/SOLR-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891263#comment-17891263
]
Gus Heck commented on SOLR-17158:
---------------------------------
Also, adding metadata about what fraction of shards completed seems like a
reasonable follow-on feature, along with full info about which shards completed
in the debug case... but one of the things I find difficult about failover-type
behavior here is that there are several types of failures:
# The Limit was just too small; even a healthy server can't answer in the
allotted time/space (this is a 4xx-type case if an error is to be thrown)
# The query is unreasonable, and even a healthy server can't answer it in the
allotted time/space (this is a 4xx-type case if an error is to be thrown)
# The query and Limit are reasonable, but the system is overloaded (these are
5xx-like cases if an error is to be thrown):
## The cluster is under extreme load, and thus all shards are going to be
unable to answer.
## This individual node is under extreme load, and an alternative node might
answer.
In every case except 3.2, repeating the request is harmful. The code already
detects and retries if the HTTP communication fails, but adding this
timeAllowed parameter means that we can effectively hide from that retry code.
If 3.1 is the usual problem, that's a good thing; if 3.2 is most common, that's
not so good. In the case where zero shards responded, 3.1 seems much more
likely.

So after pondering all this for a long time, I've come to the view that
throwing exceptions, or otherwise using the response to gauge server health, is
a poor substitute for system monitoring. So certainly the metadata you suggest
might be nice to see for troubleshooting, but I'm leery of the notion that it
might be used for automated fail-over / fall-back.
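To make the troubleshooting (rather than failover) use concrete: a minimal
plain-Java sketch of a client surfacing the existing partialResults flag from
the responseHeader, here modeled as a simple Map. The "shardsCompleted" key is
purely hypothetical shorthand for the follow-on metadata discussed above; no
such field exists in Solr today.

```java
import java.util.Map;

public class PartialResultsCheck {

    /** True when Solr marked the response partial via the responseHeader flag. */
    static boolean isPartial(Map<String, Object> responseHeader) {
        return Boolean.TRUE.equals(responseHeader.get("partialResults"));
    }

    public static void main(String[] args) {
        // Modeled responseHeader; "shardsCompleted" is a hypothetical name for
        // the suggested shard-fraction metadata, not an existing Solr field.
        Map<String, Object> header =
            Map.of("status", 0, "QTime", 52, "partialResults", true, "shardsCompleted", "3/8");
        if (isPartial(header)) {
            System.out.println("partial results; shards completed: "
                + header.get("shardsCompleted"));
        }
    }
}
```

The point is that this stays a log line for a human, not an input to automated
fail-over logic.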
Also, as you can see, if we start throwing errors we have no way to decide
which error to throw... 4xx says "user, you must change your request" and 5xx
says "come back later with that request, we've got problems"... So this is
another part of why I settled on 200 OK, YOU ASKED FOR IT ;)
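To make the retry-hiding concern above concrete, here is a sketch (the
Transport interface and the loop itself are hypothetical, not Solr's actual
retry code) of typical client retry logic that reacts to transport failures and
5xx, but never fires on a 200 OK carrying partial results:

```java
import java.io.IOException;

public class RetryLoop {

    /** Minimal stand-in for an HTTP response: status code plus partial flag. */
    record Response(int status, boolean partialResults) {}

    /** Hypothetical transport call; a real client would issue the HTTP request here. */
    interface Transport {
        Response fetch() throws IOException;
    }

    /**
     * Retries on IOException or 5xx, up to maxTries. A 200 OK with
     * partialResults=true sails straight through, which is how timeAllowed can
     * hide a struggling node from this kind of loop.
     */
    static Response query(Transport t, int maxTries) {
        for (int i = 0; i < maxTries; i++) {
            try {
                Response r = t.fetch();
                if (r.status() < 500) {
                    return r; // includes 200 OK partial results: no retry
                }
            } catch (IOException e) {
                // transport failure: fall through and retry
            }
        }
        return new Response(503, false); // gave up
    }
}
```

Whether sailing through is good or bad depends on whether case 3.1 or 3.2 is
the usual culprit, which is exactly the ambiguity described above.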
> Terminate distributed processing quickly when query limit is reached
> --------------------------------------------------------------------
>
> Key: SOLR-17158
> URL: https://issues.apache.org/jira/browse/SOLR-17158
> Project: Solr
> Issue Type: Sub-task
> Components: Query Limits
> Reporter: Andrzej Bialecki
> Assignee: Gus Heck
> Priority: Major
> Labels: pull-request-available
> Fix For: main (10.0), 9.8
>
> Time Spent: 8h 50m
> Remaining Estimate: 0h
>
> Solr should make sure that when query limits are reached and partial results
> are not needed (and not wanted) then both the processing in shards and in the
> query coordinator should be terminated as quickly as possible, and Solr
> should minimize wasted resources spent on e.g. returning data from the
> remaining shards, merging responses in the coordinator, or returning any data
> back to the user.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]