[jira] [Comment Edited] (SOLR-6875) No data integrity between replicas
    [ https://issues.apache.org/jira/browse/SOLR-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582960#comment-14582960 ]

Alexander S. edited comment on SOLR-6875 at 6/12/15 5:24 AM:
-------------------------------------------------------------

Got another error today on a 4-shard setup, each shard with 2 replicas (8 nodes in total).

On shard 4/replica 1 I see the following error:
[^replica1.png]

On shard 4/replica 2, the following:
[^replica2.png]

Here's the backtrace for the error on the first screenshot:
{code}
java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:196)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
	at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
	at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
	at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
	at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
{code}

After all this, replica 1 shows:
{quote}
numDocs: 28 215 608
{quote}

And replica 2 shows:
{quote}
numDocs: 28 215 609
{quote}

Everything worked well for a few months until yesterday, when we started to reindex some data (about 1.7m records). Our Solr setup uses large pages and there are enough resources. Here's how we run the instances:
{code}
exec chpst -u solr java -Xms6G -Xmx8G -XX:+UseConcMarkSweepGC -XX:+UseLargePages -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts -XX:CMSInitiatingOccupancyFraction=75 -DzkHost=zoo5.devops:2181,zoo4.devops:2181,zoo1.devops:2181,zoo2.devops:2181,zoo3.devops:2181 -Dcollection.configName=Carmen -Dbootstrap_confdir=./solr/conf -Dbootstrap_conf=true -DnumShards=4 -jar start.jar etc/jetty.xml
{code}

The server has 16 CPU cores and SSD RAID 10; the load average is usually between 2 and 3. The charts also don't show anything suspicious in server load, it is very stable.

So it seems like something went wrong during recovery after the network error. I am not sure how to debug that deeper, or what those warnings in the log mean, for example the last 2 messages on the first screenshot, from DistributedUpdateProcessor and CoreAdminHandler.
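One way to keep an eye on this kind of drift is to ask every replica core for its own document count and flag the shards whose replicas disagree. The following is only a minimal sketch, not something taken from this setup: the core URLs passed to `fetch_num_found` would be your own, and the function names are made up for illustration. It assumes the standard `/select` handler with `distrib=false`, so that each core answers from its local index only; the `numFound` of a match-all query should then correspond to the `numDocs` figure in the admin UI, provided the core's searcher is current.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_num_found(core_url, timeout=10):
    """Return numFound for a match-all query answered by this core only.

    core_url is the base URL of one replica core (hypothetical, e.g.
    "http://host:8983/solr/corename"). distrib=false keeps the request
    from fanning out to the other shards.
    """
    params = urlencode({"q": "*:*", "rows": 0, "wt": "json", "distrib": "false"})
    with urlopen(f"{core_url}/select?{params}", timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def find_mismatches(counts_by_shard):
    """Return {shard: (min, max, spread)} for shards whose replicas disagree."""
    mismatches = {}
    for shard, counts in counts_by_shard.items():
        values = list(counts.values())
        if len(set(values)) > 1:
            mismatches[shard] = (min(values), max(values), max(values) - min(values))
    return mismatches


# Counts reported above for shard 4, typed in by hand rather than fetched.
reported = {"shard4": {"replica1": 28215608, "replica2": 28215609}}
mismatches = find_mismatches(reported)
```

With the numbers from this comment, `mismatches` contains a single entry for `shard4` with a spread of one document. Such a check only shows that the counts differ; it does not tell which replica is right or which document is missing.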
[jira] [Comment Edited] (SOLR-6875) No data integrity between replicas
    [ https://issues.apache.org/jira/browse/SOLR-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272877#comment-14272877 ]

Alexander S. edited comment on SOLR-6875 at 1/11/15 11:33 AM:
--------------------------------------------------------------

Now we have 4 shards, each with 2 replicas (8 total nodes) and the following picture:
{noformat}
Shard 1:
  Replica 1: *14 486 089*
  Replica 2: *14 496 445*
Shard 2
  Replica 1: 14 496 609
  Replica 2: 14 496 609
Shard 3
  Replica 1: 14 492 812
  Replica 2: 14 492 812
Shard 4
  Replica 1: 14 488 755
  Replica 2: 14 488 755
{noformat}

How could it be? We didn't see anything like that before the upgrade from 4.8.1 to 4.10.2. Also, we enabled checkIntegrityAtMerge; could it be the reason?


was (Author: aheaven):
Now we have 4 shards, each with 2 replicas (8 total nodes) and the following picture:
{noformat}
Shard 1:
  Replica 1: 14 486 089
  Replica 2: 14 496 445
Shard 2
  Replica 1: 14 496 609
  Replica 2: 14 496 609
Shard 3
  Replica 1: 14 492 812
  Replica 2: 14 492 812
Shard 4
  Replica 1: 14 488 755
  Replica 2: 14 488 755
{noformat}

How could it be? We didn't see anything like that before the upgrade from 4.8.1 to 4.10.2. Also, we enabled checkIntegrityAtMerge; could it be the reason?

No data integrity between replicas
----------------------------------

Key: SOLR-6875
URL: https://issues.apache.org/jira/browse/SOLR-6875
Project: Solr
Issue Type: Bug
Affects Versions: 4.10.2
Environment: One replica is @ Linux solr1.devops.wegohealth.com 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Another replica is @ Linux solr2.devops.wegohealth.com 3.16.0-23-generic #30-Ubuntu SMP Thu Oct 16 13:17:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Solr is running with the following options:
* -Xms12G
* -Xmx16G
* -XX:+UseConcMarkSweepGC
* -XX:+UseLargePages
* -XX:+CMSParallelRemarkEnabled
* -XX:+ParallelRefProcEnabled
* -XX:+UseLargePages
* -XX:+AggressiveOpts
* -XX:CMSInitiatingOccupancyFraction=75
Reporter: Alexander S.

Setup: SolrCloud with 2 shards, each with 2 replicas, 4 nodes in total.

Indexing is stopped. One replica of a shard (Solr1) shows 45 574 039 docs, and the other (Solr1.1) shows 45 574 038 docs.

Solr1 is the leader; these errors appeared in the logs:
{code}
ERROR - 2014-12-20 09:54:38.783; org.apache.solr.update.StreamingSolrServers$1; error
java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:196)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
	at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
	at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
	at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
	at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
	at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
	at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
	at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
	at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
	at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
WARN - 2014-12-20 09:54:38.787; org.apache.solr.update.processor.DistributedUpdateProcessor;