[ 
https://issues.apache.org/jira/browse/SOLR-5474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836662#comment-13836662
 ] 

Timothy Potter commented on SOLR-5474:
--------------------------------------

Thanks for the info about changes to ZkStateReader for SOLR-5473. I'm trying to 
think about how to differentiate between downed nodes and slow queries using 
this approach.

Let's consider the scenario where there are two nodes serving a shard (A & B) 
and LazyCloudSolrServer sends a query request to node A. Imagine that node A is 
down, but the client application doesn't know that yet because its cached state 
is stale. The request will time out after some configurable duration. After the 
timeout, LazyCloudSolrServer refreshes the cached state and realizes node A is 
down, so it sends the request to node B and the query succeeds.

However, if node A is actually healthy and the cause of the timeout is a slow 
query, then the client should have waited longer. After refreshing the state 
from ZooKeeper (in response to the timeout), the client can realize that since 
A was healthy, the cause of the timeout was likely a slow query. So does the 
client re-send the slow query? That seems like it could end up in a loop of 
timeouts and resends. Does LazyCloudSolrServer keep track of how many attempts 
it has made for a given query? Just brainstorming here. I know Solr supports 
the timeAllowed parameter for a query, but that's optional.

I suppose this scenario is still possible even with the current approach of 
having a watcher on the state znode on the client side. However, I have to 
think that under the current approach, the probability of sending a request to 
a downed node goes down since state is refreshed in real time. The zk version 
doesn't help here because if node A is down, the only thing the client can do 
is wait for the request to time out.
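
To make the question concrete, here is a rough sketch of one possible policy, 
not a proposal for the actual LazyCloudSolrServer code. Every name in it 
(TimeoutRetrySketch, Transport, Reason, the liveNodes supplier) is 
hypothetical and stands in for the real request and ZooKeeper plumbing. After 
a timeout it re-reads the live nodes: if the target is gone it fails over, if 
the target is still live it reports a suspected slow query instead of 
resending, and a cap on attempts prevents a timeout / resend loop.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch only: none of these types exist in SolrJ.
final class TimeoutRetrySketch {
    enum Reason { ANSWERED, SLOW_QUERY_SUSPECTED, NO_LIVE_REPLICA, ATTEMPTS_EXHAUSTED }

    /** Stand-in for the HTTP call: true = answered in time, false = timed out. */
    interface Transport {
        boolean send(String node, String query);
    }

    static final class Result {
        final Reason reason;
        final String node;   // node that answered or is suspected slow, else null
        final int attempts;  // number of requests actually sent

        Result(Reason reason, String node, int attempts) {
            this.reason = reason;
            this.node = node;
            this.attempts = attempts;
        }
    }

    private final Transport transport;
    private final Supplier<Set<String>> liveNodes; // stand-in for re-reading state from ZooKeeper
    private final int maxAttempts;

    TimeoutRetrySketch(Transport transport, Supplier<Set<String>> liveNodes, int maxAttempts) {
        this.transport = transport;
        this.liveNodes = liveNodes;
        this.maxAttempts = maxAttempts;
    }

    /** replicas is the (possibly stale) cached replica list for the shard. */
    Result query(List<String> replicas, String query) {
        int next = 0;
        int attempts = 0;
        while (attempts < maxAttempts && next < replicas.size()) {
            String target = replicas.get(next);
            attempts++;
            if (transport.send(target, query)) {
                return new Result(Reason.ANSWERED, target, attempts);
            }
            // Timed out: refresh state to tell a downed node from a slow query.
            Set<String> live = liveNodes.get();
            if (live.contains(target)) {
                // Node is healthy, so the query itself is probably slow; do not resend.
                return new Result(Reason.SLOW_QUERY_SUSPECTED, target, attempts);
            }
            // Node is down: skip ahead to the next replica that is live.
            next++;
            while (next < replicas.size() && !live.contains(replicas.get(next))) {
                next++;
            }
        }
        Reason reason = next >= replicas.size() ? Reason.NO_LIVE_REPLICA : Reason.ATTEMPTS_EXHAUSTED;
        return new Result(reason, null, attempts);
    }
}
```

Whether a suspected slow query should be surfaced to the caller, retried with 
a longer timeout, or sent to another replica is exactly the open question; 
the sketch just picks the most conservative option.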

> Have a new mode for SolrJ to not watch any ZKNode
> -------------------------------------------------
>
>                 Key: SOLR-5474
>                 URL: https://issues.apache.org/jira/browse/SOLR-5474
>             Project: Solr
>          Issue Type: Sub-task
>          Components: SolrCloud
>            Reporter: Noble Paul
>
> In this mode SolrJ would not watch any ZK node.
> It fetches the state on demand and caches the most recently used n 
> collections in memory.
> SolrJ would not listen to any ZK node. When a request comes for a collection 
> ‘xcoll’
> it would first check if such a collection exists.
> If yes, it first looks up the details in the local cache for that collection.
> If not found in cache, it fetches the node /collections/xcoll/state.json and 
> caches the information.
> Any query/update will be sent with an extra query param specifying the 
> collection name, shard name, Role (Leader/Replica), and range (example 
> _target_=xcoll:shard1:L:80000000-b332ffff). A node would throw an error 
> (INVALID_NODE) if it does not serve the collection/shard/Role/range combo.
> If SolrJ gets an INVALID_NODE error, it would invalidate the cache and fetch 
> fresh state information for that collection (and cache it again).
> If there is a connection timeout, SolrJ assumes the node is down, re-fetches 
> the state for the collection, and tries again.
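
For illustration, the on-demand cache described in the issue could look 
roughly like the sketch below. It is only a sketch under assumptions: 
CollectionStateCache and its fetcher function are hypothetical names, the 
fetcher stands in for reading /collections/<name>/state.json from ZooKeeper, 
and the state is kept as an opaque String. It keeps the n most recently used 
collections with an access-ordered LinkedHashMap and exposes invalidate() for 
the INVALID_NODE and timeout cases.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch only: not part of SolrJ.
final class CollectionStateCache {
    private final Function<String, String> fetcher; // stand-in for the ZooKeeper read
    private final Map<String, String> cache;

    CollectionStateCache(final int maxCollections, Function<String, String> fetcher) {
        this.fetcher = fetcher;
        // accessOrder=true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the LRU collection once the cap is exceeded.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxCollections;
            }
        };
    }

    /** Returns cached state, fetching and caching it on a miss. */
    synchronized String get(String collection) {
        String state = cache.get(collection);
        if (state == null) {
            state = fetcher.apply(collection);
            cache.put(collection, state);
        }
        return state;
    }

    /** Drop the entry, e.g. after INVALID_NODE or a connection timeout. */
    synchronized void invalidate(String collection) {
        cache.remove(collection);
    }

    /** containsKey does not count as an access, so this does not disturb LRU order. */
    synchronized boolean isCached(String collection) {
        return cache.containsKey(collection);
    }
}
```

A caller would invoke invalidate("xcoll") on INVALID_NODE or a timeout and 
then call get("xcoll") again, which forces a fresh fetch.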



