[ 
https://issues.apache.org/jira/browse/SOLR-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583657#comment-14583657
 ] 

Erick Erickson commented on SOLR-6875:
--------------------------------------

Do any of the logs on the leaders mention "leader initiated recovery"? And how 
fast are you sending documents to Solr? I've seen situations where flooding 
Solr with "too many" updates causes some wonky behavior, because there are some 
inefficiencies in how leaders talk to replicas. See Tim Potter's blog post: 
http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
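
For anyone who wants to check their own leader logs for this, here is a minimal Python sketch of the search. The marker strings are the phrase asked about above plus two messages copied from the leader log quoted in this issue; the function name and the sample lines are made up for illustration, and other Solr versions may word these messages differently.

```python
import re

# Substrings that suggest the leader pushed a replica into recovery.
# The first is the phrase asked about above; the other two are copied from
# the leader log quoted in this issue. Treat the list as an assumption:
# other Solr versions may word these messages differently.
MARKERS = (
    "leader initiated recovery",
    "on behalf of un-reachable replica",
    "try to start recovery on replica",
)

_PATTERN = re.compile("|".join(re.escape(m) for m in MARKERS), re.IGNORECASE)


def find_recovery_lines(lines):
    """Return (line_number, text) for every line that contains a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if _PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Illustrative sample lines, not real log output.
    sample = [
        "INFO  - ordinary update request",
        "WARN  - Leader is publishing core=x state=down on behalf of un-reachable replica http://host/solr/x/",
    ]
    for number, text in find_recovery_lines(sample):
        print(number, text)
```

To run it against a real file, pass the open file object, e.g. `find_recovery_lines(open("solr.log"))`, where the path is whatever your install uses.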

The symptom I saw was two-fold:
1> The leader forced the follower into recovery. No errors were reported on the 
follower, just a timeout on the leader.
2> A bazillion updates were coming in as fast as possible, and there were a 
lot of threads outstanding on the leader from ConcurrentUpdateSolrServer.

Not saying this is your problem, but if you see something like this it'd be 
good to know when tracking this down. If you don't have followers going down 
then this isn't the issue.
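
If you want to test whether the update rate is a factor, one option is to pace the indexing client and see whether the recoveries stop. The following is only a sketch under assumptions: `send_batch` is a hypothetical caller-supplied function that posts one list of documents to Solr and returns when the request completes, and the batch size and interval are arbitrary starting points, not recommendations.

```python
import time
from itertools import islice


def send_paced(docs, send_batch, batch_size=500, min_interval=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Send docs in batches, starting at most one batch per min_interval seconds.

    send_batch is a caller-supplied (hypothetical) function that posts one
    list of documents to Solr and returns when the request completes.
    Returns the number of batches sent.
    """
    iterator = iter(docs)
    sent = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return sent
        started = clock()
        send_batch(batch)
        sent += 1
        # Sleep off whatever is left of the interval so a fast client
        # cannot pile requests onto the leader.
        remaining = min_interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)
```

The `sleep` and `clock` parameters exist only so the pacing can be exercised without real delays; in normal use leave them at their defaults.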

> No data integrity between replicas
> ----------------------------------
>
>                 Key: SOLR-6875
>                 URL: https://issues.apache.org/jira/browse/SOLR-6875
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10.2
>         Environment: One replica is @ Linux solr1.devops.wegohealth.com 
> 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 x86_64 
> x86_64 x86_64 GNU/Linux
> Another replica is @ Linux solr2.devops.wegohealth.com 3.16.0-23-generic 
> #30-Ubuntu SMP Thu Oct 16 13:17:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> Solr is running with the following options:
> * -Xms12G
> * -Xmx16G
> * -XX:+UseConcMarkSweepGC
> * -XX:+UseLargePages
> * -XX:+CMSParallelRemarkEnabled
> * -XX:+ParallelRefProcEnabled
> * -XX:+UseLargePages
> * -XX:+AggressiveOpts
> * -XX:CMSInitiatingOccupancyFraction=75
>            Reporter: Alexander S.
>         Attachments: replica1.png, replica2.png
>
>
> Setup: SolrCloud with 2 shards, each with 2 replicas, 4 nodes in total.
> Indexing is stopped. One replica of a shard (Solr1) shows 45 574 039 docs, 
> and the other (Solr1.1) shows 45 574 038 docs.
> Solr1 is the leader, and these errors appeared in its logs:
> {code}
> ERROR - 2014-12-20 09:54:38.783; org.apache.solr.update.StreamingSolrServers$1; error
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:196)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> WARN  - 2014-12-20 09:54:38.787; org.apache.solr.update.processor.DistributedUpdateProcessor; Error sending update
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:196)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> WARN  - 2014-12-20 09:54:38.813; org.apache.solr.cloud.ZkController; Leader is publishing core=crm-prod coreNodeName =10.128.209.232:8081_solr_crm-prod state=down on behalf of un-reachable replica http://10.128.209.232:8081/solr/crm-prod/; forcePublishState? false
> ERROR - 2014-12-20 09:54:38.818; org.apache.solr.update.processor.DistributedUpdateProcessor; Setting up to try to start recovery on replica http://10.128.209.232:8081/solr/crm-prod/ after: java.net.SocketException: Connection reset
> {code}
> On Solr1.1:
> {code}
> WARN  - 2014-12-20 09:54:38.854; org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for core=crm-prod coreNodeName=10.128.209.232:8081_solr_crm-prod
> {code}
> Index optimization was running at that time.
> It was not a system crash; the server is up and was running smoothly with 
> plenty of available resources: lots of CPU, free RAM and a very fast SSD 
> RAID. So whatever happened, Solr should recover properly, as MySQL does, 
> for example.


