[ 
https://issues.apache.org/jira/browse/SOLR-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582960#comment-14582960
 ] 

Alexander S. edited comment on SOLR-6875 at 6/12/15 5:24 AM:
-------------------------------------------------------------

Got another error today on 4 shards set up, each has 2 replicas (8 nodes in 
total).

On the shard 4/replica 1 I see the next error: [^replica1.png]
On the shard 4/replica 2 the next: [^replica2.png]

Here's the backtrace for the error on the first screenshot:
{code}
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at 
org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
        at 
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at 
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at 
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
        at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
        at 
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
        at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
        at 
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
{code}

After all this replica 1 shows:
{quote}
numDocs: 28 215 608
{quote}

And the replica 2 shows:
{quote}
numDocs: 28 215 609
{quote}

Everything worked well for a few months until yesterday, when we started to 
reindex some data (like 1.7m records).

Our Solr set up is using large pages and there's enough resources. Here's how 
we run the instances:
{code}
exec chpst -u solr java -Xms6G -Xmx8G -XX:+UseConcMarkSweepGC 
-XX:+UseLargePages -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled 
-XX:+UseLargePages -XX:+AggressiveOpts -XX:CMSInitiatingOccupancyFraction=75 
-DzkHost=zoo5.devops:2181,zoo4.devops:2181,zoo1.devops:2181,zoo2.devops:2181,zoo3.devops:2181
 -Dcollection.configName=Carmen -Dbootstrap_confdir=./solr/conf 
-Dbootstrap_conf=true -DnumShards=4 -jar start.jar etc/jetty.xml
{code}

The server has 16 CPU cores and SSD RAID 10, the load average is between 2 and 
3 usually. The charts also don't show anything suspicious in server load, it is 
very stable.

So seems like something went wrong during recovery after the network error. Not 
sure how to debug that deeper and what those warnings in the log mean, for 
example the last 2 messages on the first screenshot, from 
DistributedUpdateProcessor and CoreAdminHandler.


was (Author: aheaven):
Get another error today on 4 shards set up, each has 2 replicas (8 nodes in 
total).

On the shard 4/replica 1 I see the next error: [^replica1.png]
On the shard 4/replica 2 the next: [^replica2.png]

Here's the backtrace for the error on the first screenshot:
{code}
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at 
org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
        at 
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at 
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at 
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
        at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
        at 
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
        at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
        at 
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
{code}

After all this replica 1 shows:
{quote}
numDocs: 28 215 608
{quote}

And the replica 2 shows:
{quote}
numDocs: 28 215 609
{quote}

Everything worked well for a few months until yesterday, when we started to 
reindex some data (like 1.7m records).

Our Solr set up is using large pages and there's enough resources. Here's how 
we run the instances:
{code}
exec chpst -u solr java -Xms6G -Xmx8G -XX:+UseConcMarkSweepGC 
-XX:+UseLargePages -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled 
-XX:+UseLargePages -XX:+AggressiveOpts -XX:CMSInitiatingOccupancyFraction=75 
-DzkHost=zoo5.devops:2181,zoo4.devops:2181,zoo1.devops:2181,zoo2.devops:2181,zoo3.devops:2181
 -Dcollection.configName=Carmen -Dbootstrap_confdir=./solr/conf 
-Dbootstrap_conf=true -DnumShards=4 -jar start.jar etc/jetty.xml
{code}

The server has 16 CPU cores and SSD RAID 10, the load average is between 2 and 
3 usually. The charts also don't show anything suspicious in server load, it is 
very stable.

So seems like something went wrong during recovery after the network error. Not 
sure how to debug that deeper and what those warnings in the log mean, for 
example the last 2 messages on the first screenshot, from 
DistributedUpdateProcessor and CoreAdminHandler.

> No data integrity between replicas
> ----------------------------------
>
>                 Key: SOLR-6875
>                 URL: https://issues.apache.org/jira/browse/SOLR-6875
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10.2
>         Environment: One replica is @ Linux solr1.devops.wegohealth.com 
> 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 x86_64 
> x86_64 x86_64 GNU/Linux
> Another replica is @ Linux solr2.devops.wegohealth.com 3.16.0-23-generic 
> #30-Ubuntu SMP Thu Oct 16 13:17:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> Solr is running with the next options:
> * -Xms12G
> * -Xmx16G
> * -XX:+UseConcMarkSweepGC
> * -XX:+UseLargePages
> * -XX:+CMSParallelRemarkEnabled
> * -XX:+ParallelRefProcEnabled
> * -XX:+UseLargePages
> * -XX:+AggressiveOpts
> * -XX:CMSInitiatingOccupancyFraction=75
>            Reporter: Alexander S.
>         Attachments: replica1.png, replica2.png
>
>
> Setup: SolrCloud with 2 shards, each with 2 replicas, 4 nodes in total.
> Indexing is stopped, one replica of a shard (Solr1) shows 45 574 039 docs, 
> and another (Solr1.1) 45 574 038 docs.
> Solr1 is the leader, these errors appeared in the logs:
> {code}
> ERROR - 2014-12-20 09:54:38.783; 
> org.apache.solr.update.StreamingSolrServers$1; error
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:196)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at 
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at 
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
>         at 
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at 
> org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
>         at 
> org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at 
> org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at 
> org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
>         at 
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
>         at 
> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
>         at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
>         at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> WARN  - 2014-12-20 09:54:38.787; 
> org.apache.solr.update.processor.DistributedUpdateProcessor; Error sending 
> update
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:196)
>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>         at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at 
> org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at 
> org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at 
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
>         at 
> org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at 
> org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
>         at 
> org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at 
> org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at 
> org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at 
> org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
>         at 
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
>         at 
> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
>         at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
>         at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>         at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>         at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>         at 
> org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> WARN  - 2014-12-20 09:54:38.813; org.apache.solr.cloud.ZkController; Leader 
> is publishing core=crm-prod coreNodeName =10.128.209.232:8081_solr_crm-prod 
> state=down on behalf of un-reachable replica 
> http://10.128.209.232:8081/solr/crm-prod/; forcePublishState? false
> ERROR - 2014-12-20 09:54:38.818; 
> org.apache.solr.update.processor.DistributedUpdateProcessor; Setting up to 
> try to start recovery on replica http://10.128.209.232:8081/solr/crm-prod/ 
> after: java.net.SocketException: Connection reset
> {code}
> On Solr1.1:
> {code}
> WARN  - 2014-12-20 09:54:38.854; org.apache.solr.cloud.RecoveryStrategy; 
> Stopping recovery for core=crm-prod 
> coreNodeName=10.128.209.232:8081_solr_crm-prod
> {code}
> Index optimization was running at that time.
> It was not a system crash, the server is up and was running smoothly with a 
> lot of available resources on board, lots of CPU, available RAM and a very 
> fast SSD RAID. So whatever happened Solr should get recovered properly, e.g. 
> as mysql does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to