[ https://issues.apache.org/jira/browse/SOLR-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836662#comment-13836662 ]
Timothy Potter commented on SOLR-5474:
--------------------------------------

Thanks for the info about changes to ZkStateReader for SOLR-5473.

I'm trying to think about how to differentiate between downed nodes and slow queries using this approach. Let's consider the scenario where there are two nodes serving a shard (A & B) and LazyCloudSolrServer sends a query request to node A.

Imagine that node A is down, but the client application doesn't know that yet because its cached state is stale. The request will time out after some configurable duration. After the timeout, LazyCloudSolrServer refreshes the cached state, realizes node A is down, and sends the request to node B, where the query succeeds.

However, if node A is actually healthy and the cause of the timeout is a slow query, then the client should have waited longer. After refreshing the state from ZooKeeper (in response to the timeout), the client can see that A was healthy, so the cause of the timeout was likely a slow query. So does the client re-send the slow query? That seems like it could end up in a loop of timeouts and resends. Does LazyCloudSolrServer keep track of how many attempts it has made for a given query? Just brainstorming here. I know Solr supports the timeAllowed parameter for a query, but that's optional.

I suppose this scenario is still possible even with the current approach of having a watcher on the state znode on the client side. Although, I have to think that under the current approach, the probability of sending a request to a downed node goes down since state is refreshed in real time. The zk version doesn't help here because if node A is down, the only thing the client can do is wait for the request to time out.
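To make the attempt-tracking question concrete, here is a minimal sketch of one possible policy, not a proposal for the actual client. Every name in it (BoundedRetrySketch, RequestTimeoutException, the liveNodes and send callbacks, maxAttempts) is hypothetical and stands in for whatever LazyCloudSolrServer would really use. It caps the total number of requests, re-reads the live-node set after a timeout, and gives up rather than resending when the node that timed out is still reported healthy:

```java
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these types exist in SolrJ. */
class BoundedRetrySketch {

    /** Thrown by the hypothetical transport when a request times out. */
    public static class RequestTimeoutException extends RuntimeException {
        public RequestTimeoutException(String node) {
            super("timed out: " + node);
        }
    }

    /**
     * Tries the replicas in order, sending at most maxAttempts requests in total.
     * After a timeout the live-node set is re-read; a node that is still live is
     * assumed to have been slow rather than down, and the query is not resent.
     */
    public static String query(List<String> replicas,
                               Supplier<Set<String>> liveNodes,  // stands in for a state refresh from ZooKeeper
                               Function<String, String> send,    // stands in for the HTTP request to one node
                               int maxAttempts) {
        int attempts = 0;
        for (String node : replicas) {
            if (attempts >= maxAttempts) {
                break;
            }
            attempts++;
            try {
                return send.apply(node);
            } catch (RequestTimeoutException e) {
                if (liveNodes.get().contains(node)) {
                    // Fresh state says the node is healthy: likely a slow query,
                    // so stop here instead of entering a timeout/resend loop.
                    throw new IllegalStateException("slow query suspected on " + node, e);
                }
                // Fresh state says the node is down: move on to the next replica.
            }
        }
        throw new IllegalStateException("no replica answered within " + maxAttempts + " attempts");
    }
}
```

Failing fast on a suspected slow query is only one choice; the client could instead retry once with a longer timeout, as long as the attempt counter still bounds the loop.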
> Have a new mode for SolrJ to not watch any ZKNode
> -------------------------------------------------
>
>                 Key: SOLR-5474
>                 URL: https://issues.apache.org/jira/browse/SOLR-5474
>             Project: Solr
>          Issue Type: Sub-task
>          Components: SolrCloud
>            Reporter: Noble Paul
>
> In this mode SolrJ would not watch any ZK node.
> It fetches the state on demand and caches the most recently used n
> collections in memory.
> SolrJ would not listen to any ZK node. When a request comes for a collection
> ‘xcoll’, it would first check if such a collection exists.
> If yes, it first looks up the details in the local cache for that collection.
> If not found in the cache, it fetches the node /collections/xcoll/state.json
> and caches the information.
> Any query/update will be sent with an extra query param specifying the
> collection name, shard name, role (Leader/Replica), and range (example:
> _target_=xcoll:shard1:L:80000000-b332ffff). A node would throw an error
> (INVALID_NODE) if it does not serve the collection/shard/role/range combo.
> If SolrJ gets an INVALID_NODE error, it would invalidate the cache and fetch
> fresh state information for that collection (and cache it again).
> If there is a connection timeout, SolrJ assumes the node is down, re-fetches
> the state for the collection, and tries again.
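
For reference, the cache behaviour in the quoted issue description could look roughly like the sketch below. It is an illustration under assumptions, not SolrJ code: the class name CollectionStateCacheSketch, the fetchFromZk callback (standing in for a read of /collections/<name>/state.json), and the use of a plain String for the state are all hypothetical. It uses java.util.LinkedHashMap in access order to keep only the most recently used n collections, and builds the _target_ value in the collection:shard:role:range form shown in the issue's example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of the on-demand, most-recently-used state cache; not SolrJ code. */
class CollectionStateCacheSketch {
    private final Map<String, String> cache;
    private final Function<String, String> fetchFromZk; // stands in for reading state.json from ZooKeeper

    CollectionStateCacheSketch(final int maxCollections, Function<String, String> fetchFromZk) {
        this.fetchFromZk = fetchFromZk;
        // accessOrder=true keeps entries ordered from least to most recently accessed,
        // so removeEldestEntry evicts the least recently used collection.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxCollections;
            }
        };
    }

    /** Returns the cached state for a collection, fetching it on a cache miss. */
    synchronized String getState(String collection) {
        String state = cache.get(collection);
        if (state == null) {
            state = fetchFromZk.apply(collection);
            cache.put(collection, state);
        }
        return state;
    }

    /** For an INVALID_NODE error or a connection timeout: drop the entry and fetch fresh state. */
    synchronized String refresh(String collection) {
        cache.remove(collection);
        return getState(collection);
    }

    /** Builds the _target_ value, e.g. xcoll:shard1:L:80000000-b332ffff. */
    static String targetParam(String collection, String shard, String role, String range) {
        return collection + ":" + shard + ":" + role + ":" + range;
    }
}
```

A caller would use getState("xcoll") before routing a request and call refresh("xcoll") only after the server rejects the request or the connection times out, which is the point where the slow-query ambiguity discussed above comes in.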