[jira] [Comment Edited] (SOLR-6875) No data integrity between replicas

2015-06-11 Thread Alexander S. (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582960#comment-14582960
 ] 

Alexander S. edited comment on SOLR-6875 at 6/12/15 5:24 AM:
-

Got another error today on a 4-shard setup, each shard with 2 replicas (8 nodes 
in total).

On shard 4/replica 1 I see the following error: [^replica1.png]
On shard 4/replica 2, the following: [^replica2.png]

Here's the backtrace for the error in the first screenshot:
{code}
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:196)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{code}

After all this, replica 1 shows:
{quote}
numDocs: 28 215 608
{quote}

And replica 2 shows:
{quote}
numDocs: 28 215 609
{quote}
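For anyone who wants to reproduce these per-replica numbers outside the admin UI, here is a minimal sketch. It assumes each core is queried directly with `distrib=false` so the request is not fanned out to other shards; the core URL in the usage note and the helper names `replica_count_url` and `num_found` are hypothetical, not something from our setup.

```python
import json
from urllib.parse import urlencode


def replica_count_url(core_url):
    """Build a select URL that counts documents on a single core only.

    distrib=false keeps the query local to the core it is sent to.
    """
    params = {"q": "*:*", "rows": 0, "wt": "json", "distrib": "false"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def num_found(response_body):
    """Extract numFound from the JSON body of a Solr select response."""
    return json.loads(response_body)["response"]["numFound"]
```

Fetching `replica_count_url("http://localhost:8983/solr/some_core")` (a placeholder address) for each replica of a shard and comparing the `num_found` results should show the same one-document gap as above.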

Everything worked well for a few months until yesterday, when we started to 
reindex some data (about 1.7M records).

Our Solr setup uses large pages and has enough resources. Here's how 
we run the instances:
{code}
exec chpst -u solr java -Xms6G -Xmx8G -XX:+UseConcMarkSweepGC \
  -XX:+UseLargePages -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled \
  -XX:+UseLargePages -XX:+AggressiveOpts -XX:CMSInitiatingOccupancyFraction=75 \
  -DzkHost=zoo5.devops:2181,zoo4.devops:2181,zoo1.devops:2181,zoo2.devops:2181,zoo3.devops:2181 \
  -Dcollection.configName=Carmen -Dbootstrap_confdir=./solr/conf \
  -Dbootstrap_conf=true -DnumShards=4 -jar start.jar etc/jetty.xml
{code}

The server has 16 CPU cores and SSD RAID 10; the load average is usually 
between 2 and 3. The charts don't show anything suspicious in server load 
either; it is very stable.

So it seems something went wrong during recovery after the network error. I'm 
not sure how to debug this further or what the warnings in the log mean, for 
example the last 2 messages in the first screenshot, from 
DistributedUpdateProcessor and CoreAdminHandler.
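
One way to narrow this down might be to find which document exists on only one replica. A minimal sketch, assuming the unique keys of each replica can be exported in sorted order (how they are exported is left out here); `diff_sorted` is a hypothetical helper name. It walks both sorted streams in step, so it does not need to hold 28 million IDs in memory.

```python
def diff_sorted(ids_a, ids_b):
    """Yield (doc_id, side) for IDs present in only one of two sorted streams."""
    it_a, it_b = iter(ids_a), iter(ids_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a < b):
            # ID exists only in the first stream
            yield a, "only_in_a"
            a = next(it_a, None)
        elif a is None or b < a:
            # ID exists only in the second stream
            yield b, "only_in_b"
            b = next(it_b, None)
        else:
            # Same ID on both sides: advance both streams
            a, b = next(it_a, None), next(it_b, None)
```

Called with the ID streams of replica 1 and replica 2, this would report the extra document on replica 2 as `only_in_b`.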


[jira] [Comment Edited] (SOLR-6875) No data integrity between replicas

2015-01-11 Thread Alexander S. (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272877#comment-14272877
 ] 

Alexander S. edited comment on SOLR-6875 at 1/11/15 11:33 AM:
--

Now we have 4 shards, each with 2 replicas (8 nodes in total), and the following picture:
{noformat}
Shard 1:
  Replica 1: *14 486 089*
  Replica 2: *14 496 445*

Shard 2
  Replica 1: 14 496 609
  Replica 2: 14 496 609

Shard 3
  Replica 1: 14 492 812
  Replica 2: 14 492 812

Shard 4
  Replica 1: 14 488 755
  Replica 2: 14 488 755
{noformat}
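
To make the mismatch explicit, here is a minimal sketch that parses a listing in the format above and reports, per shard, how far apart the replicas are. The function names `parse_counts` and `mismatched` are hypothetical; the numbers are the ones from the listing.

```python
import re

LISTING = """\
Shard 1:
  Replica 1: 14 486 089
  Replica 2: 14 496 445

Shard 2
  Replica 1: 14 496 609
  Replica 2: 14 496 609

Shard 3
  Replica 1: 14 492 812
  Replica 2: 14 492 812

Shard 4
  Replica 1: 14 488 755
  Replica 2: 14 488 755
"""


def parse_counts(text):
    """Return {shard_number: [doc counts per replica]} from the listing."""
    counts = {}
    shard = None
    for line in text.splitlines():
        line = line.strip().replace("*", "")  # drop emphasis markers
        m = re.match(r"Shard (\d+):?$", line)
        if m:
            shard = int(m.group(1))
            counts[shard] = []
            continue
        m = re.match(r"Replica \d+: ([\d ]+)$", line)
        if m and shard is not None:
            counts[shard].append(int(m.group(1).replace(" ", "")))
    return counts


def mismatched(counts):
    """Return {shard_number: max - min} for shards whose replicas differ."""
    return {s: max(c) - min(c) for s, c in counts.items() if len(set(c)) > 1}


print(mismatched(parse_counts(LISTING)))  # {1: 10356}
```

Only shard 1 is reported: its replicas differ by 10 356 documents, while shards 2 to 4 agree.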

How could this be? We didn't see anything like this before upgrading from 4.8.1 
to 4.10.2. We also enabled checkIntegrityAtMerge; could that be the reason?


 No data integrity between replicas
 --

 Key: SOLR-6875
 URL: https://issues.apache.org/jira/browse/SOLR-6875
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.10.2
 Environment: One replica is @ Linux solr1.devops.wegohealth.com 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
 Another replica is @ Linux solr2.devops.wegohealth.com 3.16.0-23-generic #30-Ubuntu SMP Thu Oct 16 13:17:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 Solr is running with the following options:
 * -Xms12G
 * -Xmx16G
 * -XX:+UseConcMarkSweepGC
 * -XX:+UseLargePages
 * -XX:+CMSParallelRemarkEnabled
 * -XX:+ParallelRefProcEnabled
 * -XX:+UseLargePages
 * -XX:+AggressiveOpts
 * -XX:CMSInitiatingOccupancyFraction=75
Reporter: Alexander S.

 Setup: SolrCloud with 2 shards, each with 2 replicas, 4 nodes in total.
 Indexing is stopped. One replica of a shard (Solr1) shows 45 574 039 docs, 
 and the other (Solr1.1) shows 45 574 038 docs.
 Solr1 is the leader. These errors appeared in the logs:
 {code}
 ERROR - 2014-12-20 09:54:38.783; org.apache.solr.update.StreamingSolrServers$1; error
 java.net.SocketException: Connection reset
     at java.net.SocketInputStream.read(SocketInputStream.java:196)
     at java.net.SocketInputStream.read(SocketInputStream.java:122)
     at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
     at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
     at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
     at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
     at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
     at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
     at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
     at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:744)
 WARN  - 2014-12-20 09:54:38.787; org.apache.solr.update.processor.DistributedUpdateProcessor; 
