[ 
https://issues.apache.org/jira/browse/SOLR-12415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493880#comment-16493880
 ] 

Shawn Heisey commented on SOLR-12415:
-------------------------------------

I see these as possible solutions in situations where the base URLs do not 
include a core/collection name:

 * Change the zombie server check query to a benign action on one of the admin 
handlers that responds quickly.
 ** This would only be guaranteed to work on versions where the chosen handler 
is implicitly added, which might mean possible compatibility issues with older 
server versions.
 ** A different admin handler might be needed for checks on URLs that *do* 
include the core/collection name.
 ** The Javadoc for the client would need to declare which handlers are used 
for zombie checks, what server version added those handlers implicitly, and 
indicate that there must be explicit config in older versions.
 * Move 'setDefaultCollection' from CloudSolrClient to SolrClient, fix any 
problems that causes, and require setDefaultCollection for LBHttpSolrClient to 
work properly.
 * Have LBHttpSolrClient make a CoreAdmin call to get a list of valid cores and 
choose one for the zombie server check - but only if setDefaultCollection was 
not used.
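
The options above differ mainly in which URL the alive check ends up
requesting. A minimal sketch of that decision in plain Java follows. The
names here are hypothetical and not part of SolrJ: `aliveCheckUrl` is an
illustrative helper, `knownCores` stands in for whatever a CoreAdmin call
would return, and `/admin/info/system` is only an assumed choice of benign
admin request, since no handler has been picked in this issue.

```java
import java.util.List;
import java.util.Optional;

// Illustrative only: none of these names exist in SolrJ.
class ZombieCheckTarget {
    // Check query modeled on the rows=0, distrib=false request in the issue's log.
    private static final String PING_QUERY = "/select?q=*:*&rows=0&distrib=false";

    // Assumed node-level admin request used when no core/collection is known.
    private static final String ADMIN_PATH = "/admin/info/system";

    static String aliveCheckUrl(String baseUrl,
                                Optional<String> defaultCollection,
                                List<String> knownCores) {
        // Normalize a trailing slash so paths can be appended uniformly.
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        if (defaultCollection.isPresent()) {
            // Option 2: an explicitly configured default collection wins.
            return base + "/" + defaultCollection.get() + PING_QUERY;
        }
        if (!knownCores.isEmpty()) {
            // Option 3: otherwise pick a core reported by the server.
            return base + "/" + knownCores.get(0) + PING_QUERY;
        }
        // Option 1: fall back to an admin request that needs no core name.
        return base + ADMIN_PATH;
    }

    public static void main(String[] args) {
        System.out.println(
                aliveCheckUrl("http://localhost:8984/solr", Optional.empty(), List.of()));
    }
}
```

With a root base URL and nothing else known, the sketch falls back to the
admin path instead of `/solr/select`, which is the request that returns 404
in the reported logs.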

If CloudSolrClient relies on its internal LBHttpSolrClient to re-enable zombie 
servers, this might affect CloudSolrClient too.  I suspect that the cloud 
client relies more on info from ZooKeeper.

I was a little surprised to learn that LBHttpSolrClient assumes all servers are 
good until a request fails.  I would have expected alive checks to happen 
before then.
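
To make that behavior concrete, here is a simplified model in plain Java.
It is a sketch under stated assumptions, not the LBHttpSolrClient
implementation: `ServerTracker`, `requestFailed`, and `checkZombies` are
hypothetical names, and the probe is passed in as a predicate so that the
example needs no network access.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative model only; this is not the LBHttpSolrClient code.
class ServerTracker {
    private final Set<String> alive = new LinkedHashSet<>();
    private final Set<String> zombies = new LinkedHashSet<>();

    ServerTracker(String... baseUrls) {
        // Every server starts in the alive set without being checked first.
        for (String url : baseUrls) {
            alive.add(url);
        }
    }

    // A failed real request is what demotes a server to zombie.
    void requestFailed(String url) {
        if (alive.remove(url)) {
            zombies.add(url);
        }
    }

    // Periodic check: a zombie is revived only if the probe succeeds.
    void checkZombies(Predicate<String> probe) {
        for (String url : new ArrayList<>(zombies)) {
            if (probe.test(url)) {
                zombies.remove(url);
                alive.add(url);
            }
        }
    }

    boolean isAlive(String url) {
        return alive.contains(url);
    }
}
```

In this model, a probe that always fails (for example because the check URL
returns 404) leaves the server in the zombie set indefinitely, even after
the server itself is back up.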


> Solr Loadbalancer client LBHttpSolrClient not working as expected, if a Solr 
> node goes down, it is unable to detect when it become live again due to 404 
> error
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-12415
>                 URL: https://issues.apache.org/jira/browse/SOLR-12415
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrJ
>    Affects Versions: 7.2.1, 7.3.1, 7.4
>         Environment: Solr 7.2.1
> 2 servers - master and slave.
>            Reporter: Grzegorz Lebek
>            Priority: Critical
>
> *Context*
>  When LBHttpSolrClient has been constructed using *base root URLs* and a 
> slave goes down and then comes back up, the client is unable to mark it as 
> alive again due to a 404 error.
> Logs below:
> {code:java}
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 >> "GET 
> /solr/select?q=%3A&rows=0&sort=docid+asc&distrib=false&wt=javabin&version=2 
> HTTP/1.1[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 >> 
> "User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 
> 1.0[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 >> "Host: 
> localhost:8984[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 >> 
> "Connection: Keep-Alive[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 >> "[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << "HTTP/1.1 
> 404 Not Found[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << 
> "Cache-Control: must-revalidate,no-cache,no-store[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << 
> "Content-Type: text/html;charset=iso-8859-1[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << 
> "Content-Length: 243[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << "[\r][\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << "<html>[\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << "<head>[\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << "<meta 
> http-equiv="Content-Type" content="text/html;charset=utf-8"/>[\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << 
> "<title>Error 404 Not Found</title>[\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << 
> "</head>[\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << 
> "<body><h2>HTTP ERROR 404</h2>[\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << "<p>Problem 
> accessing /solr/select. Reason:[\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << "<pre> Not 
> Found</pre></p>[\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << 
> "</body>[\n]"
>  DEBUG [aliveCheckExecutor-1-thread-1] [wire] http-outgoing-83 << 
> "</html>[\n]"{code}
> *Analysis*
>  When using only *base root URLs* in an LBHttpSolrClient, we need to pass a 
> "*collection*" parameter when sending a request. This works fine, except 
> that in the method 
> {code:java}
> private void checkAZombieServer(ServerWrapper zombieServer){code}
> it queries Solr without the collection parameter to check whether the 
> server is alive. This causes HTML content (apparently the dashboard) to be 
> returned, so the method falls into its exception clause. As a result, even 
> if the server is back, it will never be marked as alive again.
>  I debugged this: if we pass a collection name there as a second parameter, 
> the server responds correctly.
> The suggestion is either to pass the collection name somehow or to change 
> the way zombie servers are pinged.
> *Steps to reproduce*
> Run 2 servers - master and slave. Create a client using base URLs. Index, 
> test search, etc.
> Turn off the slave server and, after a couple of seconds, turn it on again.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
