[ https://issues.apache.org/jira/browse/SOLR-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962052#comment-16962052 ]

Erick Erickson commented on SOLR-7483:
--------------------------------------

I think this can be closed? Sounds like some other JIRAs we've had where TLOG 
replay was inefficient....

> Investigate ways to deal with the tlog growing indefinitely while it's being 
> replayed
> -------------------------------------------------------------------------------------
>
>                 Key: SOLR-7483
>                 URL: https://issues.apache.org/jira/browse/SOLR-7483
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Timothy Potter
>            Priority: Major
>
> While trying to track down the data-loss issue I found while testing 
> SOLR-7332, one of my replicas was forced into recovery by the leader due to a 
> network error (I'm over-stressing Solr as part of this test) ... 
> In the leader log:
> {code}
> INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; I/O exception (java.net.SocketException) caught when processing request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985: Broken pipe
> INFO  - 2015-04-28 21:36:55.096; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.http.impl.client.DefaultRequestDirector; Retrying request to {}->http://ec2-54-242-70-241.compute-1.amazonaws.com:8985
> ERROR - 2015-04-28 21:36:55.091; [perf10x2 shard2 core_node7 perf10x2_shard2_replica2] org.apache.solr.update.StreamingSolrClients$1; error
> org.apache.http.NoHttpResponseException: ec2-54-242-70-241.compute-1.amazonaws.com:8985 failed to respond
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
>         at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
>         at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
>         at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
>         at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
>         at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:243)
>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> In the logs on the replica, I see a bunch of failed-checksum messages, like:
> {code}
> WARN  - 2015-04-28 21:38:43.345; [   ] org.apache.solr.handler.IndexFetcher; File _xv.si did not match. expected checksum is 617655777 and actual is checksum 1090588695. expected length is 419 and actual length is 419
> WARN  - 2015-04-28 21:38:43.349; [   ] org.apache.solr.handler.IndexFetcher; File _xv.fnm did not match. expected checksum is 1992662616 and actual is checksum 1632122630. expected length is 1756 and actual length is 1756
> WARN  - 2015-04-28 21:38:43.353; [   ] org.apache.solr.handler.IndexFetcher; File _xv.nvm did not match. expected checksum is 384078655 and actual is checksum 3108095639. expected length is 92 and actual length is 92
> {code}
> This tells me it tried a snapshot pull of the index from the leader ...
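> For illustration only (this is not the actual IndexFetcher code; the path is hypothetical and the expected values are simply the ones from the _xv.si warning above), the kind of per-file check those messages describe amounts to comparing a locally computed checksum and length against what the leader reports:
> {code}
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.zip.CRC32;
> 
> public class SegmentFileCheck {
> 
>     /** Returns true when the local copy matches the expected checksum and length. */
>     static boolean matches(Path file, long expectedChecksum, long expectedLength) throws IOException {
>         byte[] bytes = Files.readAllBytes(file);    // fine for small segment metadata files
>         if (bytes.length != expectedLength) {
>             return false;                           // length mismatch -> file must be re-fetched
>         }
>         CRC32 crc = new CRC32();
>         crc.update(bytes, 0, bytes.length);
>         return crc.getValue() == expectedChecksum;  // checksum mismatch -> file must be re-fetched
>     }
> 
>     public static void main(String[] args) throws IOException {
>         // expected values taken from the _xv.si warning above; the path is made up
>         Path file = Path.of("data/index/_xv.si");
>         boolean ok = matches(file, 617655777L, 419L);
>         System.out.println(ok ? "file matches" : "file did not match, will be replaced");
>     }
> }
> {code}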
> Also, I see the replica started to replay the tlog (presumably the snapshot 
> pull succeeded - of course my logging is set to WARN, so I'm not seeing the 
> full story in the logs):
> {code}
> WARN  - 2015-04-28 21:38:45.656; [   ] org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/vol0/cloud85/solr/perf10x2_shard2_replica1/data/tlog/tlog.0000000000000000046 refcount=2} active=true starting pos=56770101
> {code}
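> Just to make "starting pos" concrete: a replay like this is essentially re-reading records from a saved byte offset to the end of the log and re-applying them. A minimal sketch, assuming a hypothetical length-prefixed record framing (Solr's actual tlog encoding differs):
> {code}
> import java.io.EOFException;
> import java.io.IOException;
> import java.io.RandomAccessFile;
> 
> public class TlogReplaySketch {
> 
>     /** Replays length-prefixed records from startPos to the current end of the log file. */
>     static long replayFrom(String tlogPath, long startPos) throws IOException {
>         long applied = 0;
>         try (RandomAccessFile log = new RandomAccessFile(tlogPath, "r")) {
>             log.seek(startPos);              // the "starting pos" from the message above
>             while (true) {
>                 int len;
>                 try {
>                     len = log.readInt();     // hypothetical length-prefixed framing
>                 } catch (EOFException eof) {
>                     break;                   // caught up with everything written so far
>                 }
>                 byte[] record = new byte[len];
>                 log.readFully(record);
>                 // ... apply the buffered update to the local index here ...
>                 applied++;
>             }
>         }
>         return applied;
>     }
> }
> {code}
> Note that while such a loop runs, new updates keep getting appended past the reader's position, which is exactly why the file can keep growing during replay.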
> The problem is that the tlog continues to grow and grow while this "replay" is 
> happening ... when I first looked at the tlog it was 769m; a few minutes 
> later it was at 2.2g and still growing, i.e. the leader is still pounding 
> the replica with updates faster than it can apply them.
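> To put rough numbers on that (only the 769m and 2.2g figures come from the observation above; the elapsed time is an assumed value): whenever the leader's append rate exceeds the replay rate, the backlog grows linearly and replay can never catch up.
> {code}
> public class TlogGrowthEstimate {
>     public static void main(String[] args) {
>         double startMb = 769.0;
>         double endMb = 2.2 * 1024;      // ~2252.8 MB
>         double elapsedMin = 3.0;        // "a few minutes" -- assumed value
> 
>         // net growth = (append rate - replay rate) integrated over time:
>         // backlog(t) = backlog(0) + (appendRate - replayRate) * t
>         double netGrowthPerMin = (endMb - startMb) / elapsedMin;
>         System.out.printf("net tlog growth ~= %.0f MB/min%n", netGrowthPerMin);
>     }
> }
> {code}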
> The good thing of course is that the updates are being persisted to durable 
> storage on the replica, so it's better than if the replica was just marked 
> down. So maybe there isn't much we can do about this, but I wanted to capture 
> the description of this event in a JIRA so we can investigate it further.


