[ https://issues.apache.org/jira/browse/SOLR-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264572#comment-15264572 ]

Timothy Potter commented on SOLR-9050:
--------------------------------------

hmmm ... I reproduced the STE locally, and the request gets retried multiple 
times (as expected), but I didn't see those retries in my prod env? Or maybe I 
just got incomplete logs from my ops team :P
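For context, the intended retry behavior is roughly: when a read times out partway through a file, the fetcher should resume from the bytes it already has instead of throwing the partial download away. A toy standalone sketch of that idea (all names hypothetical, not the actual IndexFetcher code):

```java
// Toy model: a download that "times out" once mid-stream, and a retry loop
// that resumes from the current offset rather than restarting at byte 0.
import java.net.SocketTimeoutException;

public class ResumeDemo {
    static final int FILE_SIZE = 100;
    static int downloaded = 0;        // bytes safely persisted so far
    static boolean timedOutOnce = false;

    // Simulated stream read: delivers bytes until done, timing out once at byte 60.
    static void readStream() throws SocketTimeoutException {
        while (downloaded < FILE_SIZE) {
            if (!timedOutOnce && downloaded == 60) {
                timedOutOnce = true;
                throw new SocketTimeoutException("Read timed out");
            }
            downloaded++;             // one byte "arrives" and is kept
        }
    }

    // Retry loop: on timeout, resume from 'downloaded' instead of discarding it.
    static int fetchWithResume(int maxRetries) {
        for (int tries = 0; tries <= maxRetries; tries++) {
            try {
                readStream();
                return downloaded;    // complete file
            } catch (SocketTimeoutException e) {
                // fall through and retry from the current offset
            }
        }
        return -1;                    // gave up; caller would fall back to a full fetch
    }

    public static void main(String[] args) {
        System.out.println(fetchWithResume(3));
    }
}
```

In this model one timeout costs only a retry of the remaining bytes; the failure mode in this issue is the opposite: the partial file is discarded and the whole index is fetched again.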

> IndexFetcher not retrying after SocketTimeoutException correctly, which leads to trying a full download again
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9050
>                 URL: https://issues.apache.org/jira/browse/SOLR-9050
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java)
>    Affects Versions: 5.3.1
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>         Attachments: SOLR-9050.patch
>
>
> I'm seeing a problem where reading a large file from the leader (in SolrCloud 
> mode) during index replication leads to a SocketTimeoutException:
> {code}
> 2016-04-28 16:22:23.568 WARN  (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.IndexFetcher Error in fetching file: _405k.cfs (downloaded 7314866176 of 9990844536 bytes)
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:150)
>         at java.net.SocketInputStream.read(SocketInputStream.java:121)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:253)
>         at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
>         at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
>         at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
>         at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
>         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
>         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:140)
>         at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:167)
>         at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:161)
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1312)
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1275)
>         at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> {code}
> and this leads to the following error in cleanup:
> {code}
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: Unable to download _405k.cfs completely. Downloaded 7314866176!=9990844536
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1406)
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1286)
>         at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
>         at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:423)
>         at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
>         at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:380)
>         at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
>         at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:165)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> {code}
> So a simple read timeout exception leads to re-downloading the whole index 
> again, and again, and again ...
> It also looks like any exception raised in fetchPackets gets squelched if 
> cleanup (called in the finally block) raises its own exception.
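On the squelching point: a minimal standalone sketch (hypothetical names, not the actual IndexFetcher code) of how an exception thrown from cleanup in a finally block replaces the original exception entirely, and how Throwable.addSuppressed() preserves both:

```java
// Toy reproduction of exception squelching in a finally-block cleanup,
// plus one way to keep the primary exception via addSuppressed().
public class SquelchDemo {

    static void fetch()   { throw new RuntimeException("read timed out"); }
    static void cleanup() { throw new IllegalStateException("incomplete download"); }

    // Pattern described in the bug report: cleanup() runs in finally and throws,
    // so the caller only ever sees the cleanup failure.
    static Exception naive() {
        try {
            try {
                fetch();
            } finally {
                cleanup();           // throws, replacing the fetch exception entirely
            }
        } catch (Exception e) {
            return e;                // "incomplete download" -- the timeout is gone
        }
        return null;
    }

    // One way to keep both: remember the primary exception and attach the
    // cleanup failure to it as a suppressed exception.
    static Exception safer() {
        Exception primary = null;
        try {
            try {
                fetch();
            } catch (Exception e) {
                primary = e;
                throw e;
            } finally {
                try {
                    cleanup();
                } catch (Exception c) {
                    if (primary != null) primary.addSuppressed(c);
                    else throw c;    // cleanup failed even though fetch succeeded
                }
            }
        } catch (Exception e) {
            return e;                // "read timed out", cleanup failure attached
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(naive().getMessage());
        System.out.println(safer().getMessage());
        System.out.println(safer().getSuppressed()[0].getMessage());
    }
}
```

With the naive pattern the SocketTimeoutException never reaches the caller's retry logic, which would explain why the fetch is treated as a hard failure instead of being retried.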



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
