[ https://issues.apache.org/jira/browse/SOLR-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962052#comment-16962052 ]
Erick Erickson commented on SOLR-7483:
--------------------------------------

I think this can be closed? Sounds like some other JIRAs we've had where TLOG replay was inefficient....

> Investigate ways to deal with the tlog growing indefinitely while it's being replayed
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-7483
>                 URL: https://issues.apache.org/jira/browse/SOLR-7483
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Timothy Potter
>            Priority: Major
>
> While trying to track down the data-loss issue I found while testing SOLR-7332, one of my replicas was forced into recovery by the leader due to a network error (I'm over-stressing Solr as part of this test) ...
> In the leader log:
> {code}
> INFO - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; I/O exception (java.net.SocketException) caught when processing request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985: Broken pipe
> INFO - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; Retrying request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985
> ERROR - 2015-04-28 21:36:55.091; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.solr.update.StreamingSolrClients$1; error
> org.apache.http.NoHttpResponseException: ec2-54-242-70-241.compute-1.amazonaws.com:8985 failed to respond
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:243)
>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> In the logs on the replica, I see a bunch of failed checksum messages, like:
> {code}
> WARN - 2015-04-28 21:38:43.345; [ ] org.apache.solr.handler.IndexFetcher; File _xv.si did not match. expected checksum is 617655777 and actual is checksum 1090588695. expected length is 419 and actual length is 419
> WARN - 2015-04-28 21:38:43.349; [ ] org.apache.solr.handler.IndexFetcher; File _xv.fnm did not match. expected checksum is 1992662616 and actual is checksum 1632122630. expected length is 1756 and actual length is 1756
> WARN - 2015-04-28 21:38:43.353; [ ] org.apache.solr.handler.IndexFetcher; File _xv.nvm did not match. expected checksum is 384078655 and actual is checksum 3108095639. expected length is 92 and actual length is 92
> {code}
> This tells me it tried a snapshot pull of the index from the leader ...
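For context on those WARN lines: during replication the replica compares each local index file's length and checksum against the values the leader reports, and re-fetches anything that differs. Below is a minimal sketch of that per-file comparison; it is illustrative only, and the RemoteFileMeta holder and needsFetch method are made up for this sketch, not Solr's actual IndexFetcher API.

{code}
import java.util.Objects;

// Illustrative only; not Solr's actual IndexFetcher implementation.
public class FileCompareSketch {

    // Hypothetical holder for the per-file metadata the leader reports.
    static final class RemoteFileMeta {
        final String name;
        final long checksum;
        final long length;
        RemoteFileMeta(String name, long checksum, long length) {
            this.name = Objects.requireNonNull(name);
            this.checksum = checksum;
            this.length = length;
        }
    }

    // A local file is reusable only if both length and checksum match; any
    // mismatch means it must be pulled again from the leader, which is what
    // the WARN lines above report.
    static boolean needsFetch(RemoteFileMeta remote, long localChecksum, long localLength) {
        boolean matches = remote.length == localLength && remote.checksum == localChecksum;
        if (!matches) {
            System.out.printf(
                "File %s did not match. expected checksum is %d and actual is checksum %d. "
                    + "expected length is %d and actual length is %d%n",
                remote.name, remote.checksum, localChecksum, remote.length, localLength);
        }
        return !matches;
    }

    public static void main(String[] args) {
        // Values from the first WARN line above: lengths agree but checksums
        // differ, so the file still has to be re-fetched.
        RemoteFileMeta si = new RemoteFileMeta("_xv.si", 617655777L, 419L);
        System.out.println("fetch _xv.si? " + needsFetch(si, 1090588695L, 419L));
    }
}
{code}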
> Also, I see the replica started to replay the tlog (presumably the snapshot pull succeeded - of course my logging is set to WARN so I'm not seeing the full story in the logs):
> {code}
> WARN - 2015-04-28 21:38:45.656; [ ] org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/vol0/cloud85/solr/perf10x2_shard2_replica1/data/tlog/tlog.0000000000000000046 refcount=2} active=true starting pos=56770101
> {code}
> The problem is the tlog continues to grow and grow while this "replay" is happening ... when I first looked at the tlog, it was 769m; a few minutes later, it's at 2.2g and still growing, i.e. the leader is still pounding it with updates it can't keep up with.
> The good thing of course is that the updates are being persisted to durable storage on the replica, so it's better than if the replica were just marked down. So maybe there isn't much we can do about this, but I wanted to capture the description of this event in a JIRA so we can investigate it further.
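To make the failure mode above concrete: replay drains the tlog from a fixed starting position while the leader keeps forwarding new updates that are appended behind it, so whenever appends outpace replay the unreplayed backlog, and with it the tlog file, grows without bound. Here is a back-of-the-envelope sketch of that dynamic; the append and replay rates are assumed, not measured, and this is not Solr's UpdateLog/LogReplayer code.

{code}
// Toy model of the dynamic described above; NOT Solr's UpdateLog/LogReplayer code.
// While the replayer drains the tlog from its starting position, the leader keeps
// forwarding new updates that are appended behind it; if appends outpace replay,
// the backlog (and therefore the tlog file) grows without bound.
public class TlogReplayBacklog {
    public static void main(String[] args) {
        long appendBytesPerSec = 8_000_000;   // assumed leader indexing load (made up)
        long replayBytesPerSec = 2_000_000;   // assumed replay throughput (made up)
        long tlogSize = 769L * 1024 * 1024;   // ~769m, roughly the size when first observed
        long replayPos = 56_770_101;          // "starting pos" from the log line above

        for (int sec = 60; sec <= 300; sec += 60) {
            long size = tlogSize + appendBytesPerSec * sec;       // file only ever grows
            long replayed = replayPos + replayBytesPerSec * sec;  // replay chases the tail
            System.out.printf("t=%3ds  tlog size=%,d bytes  unreplayed backlog=%,d bytes%n",
                sec, size, size - replayed);
        }
    }
}
{code}

For scale, the 769m to 2.2g growth reported above works out to roughly 5 to 8 MB/s of appended updates if "a few minutes" means three to five, which replay clearly wasn't draining.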