mcvsubbu commented on issue #4484: Pinot query timeout due to the broker waiting for a single non-responsive server
URL: https://github.com/apache/incubator-pinot/issues/4484#issuecomment-528440340

As currently designed, the broker (after pruning segments) comes up with the set of segments that must be covered in order to respond to the query. The broker has pre-constructed routing entries that determine which servers need to be reached to cover these segments. The broker then forwards the request and waits for responses, until a timeout. If all servers have not responded by then, the response is flagged as partial in the metadata.

There are multiple metrics on servers and brokers (system, JVM, and Pinot level) that administrators can set alerts on. Monitoring systems may also auto-restart any of these entities on specific alerts. We expect that a large-scale site-facing installation (one that cannot tolerate more than a small number of such partial/failed responses) has such monitoring and automated repair systems in place.

That said, there is scope for improvement here. One simple improvement is to let the request specify the timeout it can tolerate; the broker waits for at most that long and then returns a partial response. Alerts may be set on the partial-response flag to indicate that administrator intervention is needed.

The broker may also support some (fancy) algorithms to back off routing to specific servers that have responded late (or not at all), and then include them again slowly over time, only to back off once more if the underlying problem has not been repaired. Think of this as a score attached to each server in the routing entries: the score improves as the server responds faster, and decreases when it times out or misses a response. The broker tends to favor servers with higher scores over those with lower ones (perhaps giving a smaller number of segments to servers with lower scores).
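The score-based backoff idea could be sketched roughly as below. This is a minimal illustration under stated assumptions, not Pinot's actual routing code; the class and method names (`ScoredRouting`, `recordSuccess`, `recordTimeout`) and the `ALPHA`/`MIN_SCORE` constants are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScoredRouting {
    private static final double ALPHA = 0.2;      // smoothing factor for score updates
    private static final double MIN_SCORE = 0.05; // floor, so a degraded server is still retried slowly
    private final Map<String, Double> scores = new HashMap<>();

    /** Current score of a server; unseen servers start fully healthy at 1.0. */
    public double score(String server) {
        return scores.getOrDefault(server, 1.0);
    }

    /** A timely response nudges the score back toward 1.0. */
    public void recordSuccess(String server) {
        double old = score(server);
        scores.put(server, old + ALPHA * (1.0 - old));
    }

    /** A timeout or missed response decays the score, but never below MIN_SCORE. */
    public void recordTimeout(String server) {
        double old = score(server);
        scores.put(server, Math.max(MIN_SCORE, old * (1.0 - ALPHA)));
    }

    /** Among servers that can cover a segment, prefer the one with the highest score. */
    public String preferredServer(List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String s : candidates) {
            double sc = score(s);
            if (sc > bestScore) {
                bestScore = sc;
                best = s;
            }
        }
        return best;
    }
}
```

A broker could also use the score as a weight when splitting segments among replicas, rather than as a hard preference, so that a recovering server gradually earns back its share of the load.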
Neither of these (nor any other approach that I can think of) auto-fixes a permanent problem on the server without some external intervention (restarts/resets/hardware replacement/whatever). Therefore, any system with stringent requirements should have appropriate alerts set to trigger these operations and, if possible, some auto-remediation applied for the most common causes.