Sorry, my fault. Here is my email again, without images:
I’m experiencing a strange behaviour with a SolrCloud cluster.
Cluster description
I have a cluster with a total of 38 nodes. All nodes run the following
software:
- OS: Debian GNU/Linux 9.13 (stretch)
- JRE: openjdk version "11.0.6" 2020-01-14
- Apache Solr: Apache Solr 8.11.2
The cluster nodes are divided as follows:
Nodes used for indexing
solrindex-01
solrindex-02
Nodes used for queries
solrquery-01
solrquery-02
Cluster nodes with collections
solrnode-01
…
solrnode-34
Configuration of the collection
In the cluster I have a collection (e.g. testcollection) that is split across
the nodes into several shards (one shard per month, e.g. shard_202201,
shard_202202, ...).
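To make the naming scheme concrete, here is a minimal Python sketch that lists one shard name per calendar month. The function name `monthly_shard_names` and the `shard_` prefix default are illustrative assumptions based on the examples above; they are not taken from the actual cluster configuration.

```python
from datetime import date
from typing import List


def monthly_shard_names(start: date, end: date, prefix: str = "shard_") -> List[str]:
    """Return one shard name per calendar month from start to end, inclusive.

    Hypothetical helper: the prefix and the YYYYMM suffix follow the
    shard_202201 / shard_202202 examples in the text.
    """
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}{month:02d}")
        month += 1
        if month > 12:  # roll over to January of the next year
            year, month = year + 1, 1
    return names


if __name__ == "__main__":
    for name in monthly_shard_names(date(2021, 11, 1), date(2022, 2, 1)):
        print(name)
```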
Problem
From time to time the solrquery-01 node can no longer query the entire
collection: it is unable to contact some replicas of the collection hosted on
other nodes of the cluster. The problem does not resolve itself; I have to
restart the Apache Solr service on the solrquery-01 node to recover.
In particular:
If I query a specific replica through the solrquery-01 node, the request
remains pending until it times out.
Query
http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=track&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/
Response
{
"response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]},
"debug":{
"track":{
"rid":"solrquery-01.volo.local-232528",
"EXECUTE_QUERY":{
"http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/":{
"Exception":"Timeout occured while waiting response from server at:
http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/select"}}}}
}
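The probe above can be scripted so that it is easy to repeat against many replicas. The following is a minimal Python sketch, not a definitive tool: the helper names `build_probe_url` and `probe` and the 30-second client timeout are assumptions, while the `q`, `debug=track` and `shards` parameters are the ones used in the query above.

```python
import json
import urllib.parse
import urllib.request


def build_probe_url(coordinator: str, collection: str, query: str, replica_url: str) -> str:
    """Build a select URL that asks `coordinator` to query only `replica_url`."""
    params = urllib.parse.urlencode(
        {"q": query, "debug": "track", "shards": replica_url}
    )
    return f"http://{coordinator}/solr/{collection}/select?{params}"


def probe(coordinator: str, collection: str, query: str, replica_url: str,
          timeout: float = 30.0) -> dict:
    """Run the probe and return the parsed JSON, or {"error": ...} on failure.

    The timeout is a client-side assumption; it is unrelated to Solr's own
    inter-node timeouts.
    """
    url = build_probe_url(coordinator, collection, query, replica_url)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except OSError as exc:  # covers URLError and socket timeouts
        return {"error": str(exc)}


if __name__ == "__main__":
    print(build_probe_url(
        "solrquery-01:8080",
        "volocomapi_search",
        "UniqueReference:DOC_EXAMPLE",  # placeholder document id
        "http://solrnode-24.volo.local:8080/solr/"
        "volocomapi_search_shard_201501_replica_n575/",
    ))
```

Calling `probe` with the same arguments from solrquery-01 and from another node gives two results that can be compared side by side.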
When I execute the same query through another node (e.g. solrnode-01), it
succeeds.
Query
http://solrnode-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=track&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/
Response:
{
"response":{"numFound":0,"start":0,"maxScore":0.0,"numFoundExact":true,"docs":[]},
"debug":{
"track":{
"rid":"solrnode-01.volo.local-1849853",
"EXECUTE_QUERY":{
"http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/":{
"QTime":"0",
"ElapsedTime":"28",
"RequestPurpose":"GET_TOP_IDS,SET_TERM_STATS",
"NumFound":"0",
"Response":"{responseHeader={zkConnected=true,status=0,QTime=0},response={numFound=0,numFoundExact=true,start=0,maxScore=0.0,docs=[]},sort_values={},debug={}}"}}}}
}
The query also succeeds if I run it through the solrquery-01 node against a
different replica (here, the one on solrnode-23).
Query
http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=track&shards=http://solrnode-23.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n573/
Response
{
"response":{"numFound":0,"start":0,"maxScore":0.0,"numFoundExact":true,"docs":[]},
"debug":{
"track":{
"rid":"solrquery-01.volo.local-232531",
"EXECUTE_QUERY":{
"http://solrnode-23.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n573/":{
"QTime":"0",
"ElapsedTime":"88",
"RequestPurpose":"GET_TOP_IDS,SET_TERM_STATS",
"NumFound":"0",
"Response":"{responseHeader={zkConnected=true,status=0,QTime=0},response={numFound=0,numFoundExact=true,start=0,maxScore=0.0,docs=[]},sort_values={},debug={}}"}}}}
}
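With many replicas to check, the `debug.track` section of each response can be summarised automatically. This is a minimal Python sketch under the assumption that the responses have the shape shown above (a phase such as `EXECUTE_QUERY` mapping shard URLs to either timing fields or an `Exception` string); the function name `shard_outcomes` is hypothetical.

```python
def shard_outcomes(response: dict) -> dict:
    """Map each shard URL found in debug.track to "ok" or its exception text."""
    outcomes = {}
    track = response.get("debug", {}).get("track", {})
    for phase, shards in track.items():
        if not isinstance(shards, dict):
            continue  # skip scalar entries such as "rid"
        for shard_url, info in shards.items():
            if isinstance(info, dict) and "Exception" in info:
                outcomes[shard_url] = info["Exception"]  # a failure wins over "ok"
            else:
                outcomes.setdefault(shard_url, "ok")
    return outcomes


if __name__ == "__main__":
    sample = {
        "debug": {
            "track": {
                "rid": "example-rid",
                "EXECUTE_QUERY": {
                    "http://replica-a/": {"QTime": "0", "ElapsedTime": "28"},
                    "http://replica-b/": {"Exception": "Timeout while waiting"},
                },
            }
        }
    }
    for shard, outcome in shard_outcomes(sample).items():
        print(shard, "->", outcome)
```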
Checking the network traffic with tcpdump shows no packets at all leaving the
solrquery-01 machine for the failing replica, whereas the solrnode-01 machine
shows the expected connection.
tcpdump from the solrquery-01 machine
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
tcpdump on the solrnode-01 machine
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
10:57:10.979736 IP solrnode-01.volo.local.39888 > solrnode-24.volo.local.http-alt: Flags [P.], seq 881884455:881885148, ack 1974049136, win 364, options [nop,nop,TS val 561210041 ecr 561833498], length 693: HTTP
10:57:11.008007 IP solrnode-01.volo.local.39888 > solrnode-24.volo.local.http-alt: Flags [.], ack 132, win 364, options [nop,nop,TS val 561210048 ecr 561835614], length 0
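Since no packets leave solrquery-01, one further check is whether a fresh TCP connection to the replica's host can be opened from that machine outside of Solr. The sketch below is a minimal Python example of such a check; the function name `can_connect` and the 5-second timeout are assumptions. A successful connection would only suggest that basic reachability is fine; it would not by itself explain the timeouts.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure, or timeout
        return False


if __name__ == "__main__":
    # Host and port taken from the failing shard URL above.
    print(can_connect("solrnode-24.volo.local", 8080))
```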
Question
Do you have any suggestions on how to investigate this issue further, or on
possible solutions?
Thank you in advance,
Matteo
Matteo Diarena
Direttore Innovazione
Volocom s.r.l. (www.volocom.it - [email protected])
Via Antonio Cechov, 50 - 20151 MILANO
Via Leone XIII, 95 - 00165 ROMA
Tel +39 02 89453024 / +39 02 89453023
Mobile +39 345 2129244
[email protected]
-----Original message-----
From: Vincenzo D'Amore <[email protected]>
Sent: 05 September 2022 00:34
To: [email protected]
Subject: Re: SolrCloud node fail to connect to another node in the cluster
Hi Matteo, FYI, the images have been removed from your email.
The mailing list ate them. You'll need to give us text, not images.
On Thu, 1 Sep 2022 at 16:35, Matteo Diarena <[email protected]> wrote:
> Dear all,
>
> I’m experiencing a strange behaviour with a SolrCloud cluster.
>
>
>
> *Cluster description *
>
> I have a cluster with a total of 38 nodes. All nodes are installed
> with the following features:
>
> - *OS*: Debian GNU/Linux 9.13 (stretch)
> - JRE: openjdk version "11.0.6" 2020-01-14
> - Apache Solr: Apache Solr 8.11.2
>
>
>
> The cluster nodes are divided as follows:
>
>
>
> *Nodes used for indexing*
>
> solrindex-01
>
> solrindex-02
>
>
>
> *Nodes used for queries*
>
> solrquery-01
>
> solrquery-02
>
>
>
> *Cluster nodes with collections*
>
> solrnode-01
>
> …
>
> solrnode-34
>
>
>
> *Configuration of the collection*
>
> In the cluster I have a collection (i.e testcollection) divided on the
> various nodes through different shards (one shard for each month, i.e.
> shard_202201, shard_202202, ...)
>
>
>
> *Problem*
>
> From time to time the solrquery-01 node is no longer able to query the
> entire collection and in particular it is unable to contact some
> replicas of the collection present on the other nodes of the cluster.
> The problem does not resolve itself but it is necessary to restart the
> Apache Solr service on the solrquery-01 node.
>
>
>
> In particular:
>
> If I try to query a specific replica from the solrquery-01 node, the
> request remains pending until it times out
>
>
>
> Query
>
>
> http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=true&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/
>
>
>
> Response
>
>
>
> By executing the same query from another node (eg: solrnode-01) the
> query is successful.
>
>
>
> Query
>
>
> http://solrnode-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=true&shards=http://solrnode-24.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n575/
>
>
>
>
>
> Response:
>
>
>
> The same happens if I try to run the query to a different replica
>
>
>
> Query
>
>
> http://solrquery-01:8080/solr/volocomapi_search/select?q=UniqueReference:DOC_EBF3D4C11F1239852490280F583D052FC214A10D6E716BD98C19CBC599E5EFED&debug=true&shards=http://solrnode-23.volo.local:8080/solr/volocomapi_search_shard_201501_replica_n573/
>
>
>
> Response
>
>
>
>
>
> Checking the network traffic with tcpdump on the solrquery-01 machine
> does not show any connection as it does on the solrnode-01 machine
>
>
>
> *tcpdump from the solrquery-01 machine*
>
>
>
> *tcpdump on the solrnode-01 machine*
>
>
>
> *Question*
>
> Do you have any suggestions on how to investigate this issue further?
> Suggestions on possible solutions?
>
>
>
>
>
> Thank you in advance,
>
> Matteo
>
--
Vincenzo D'Amore