[ https://issues.apache.org/jira/browse/SOLR-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264572#comment-15264572 ]
Timothy Potter commented on SOLR-9050:
--------------------------------------

hmmm ... I reproduced the STE locally, and the request gets retried multiple times (as expected), but I didn't see that retry in my prod env. Or maybe I just got incomplete logs from my ops team :P

> IndexFetcher not retrying after SocketTimeoutException correctly, which leads
> to trying a full download again
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9050
>                 URL: https://issues.apache.org/jira/browse/SOLR-9050
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java)
>    Affects Versions: 5.3.1
>            Reporter: Timothy Potter
>            Assignee: Timothy Potter
>         Attachments: SOLR-9050.patch
>
>
> I'm seeing a problem where reading a large file from the leader (in SolrCloud
> mode) during index replication leads to a SocketTimeoutException:
> {code}
> 2016-04-28 16:22:23.568 WARN  (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.IndexFetcher Error in fetching file: _405k.cfs (downloaded 7314866176 of 9990844536 bytes)
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:150)
>         at java.net.SocketInputStream.read(SocketInputStream.java:121)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
>         at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
>         at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
>         at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:253)
>         at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227)
>         at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186)
>         at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
>         at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
>         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
>         at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:140)
>         at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:167)
>         at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:161)
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1312)
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1275)
>         at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
> {code}
> and this leads to the following error in cleanup:
> {code}
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.h.ReplicationHandler Index fetch failed :org.apache.solr.common.SolrException: Unable to download _405k.cfs completely. Downloaded 7314866176!=9990844536
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1406)
>         at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1286)
>         at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:800)
>         at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:423)
>         at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:254)
>         at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:380)
>         at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> 2016-04-28 16:26:04.332 ERROR (RecoveryThread-foo_shard11_replica2) [c:foo s:shard11 r:core_node139 x:foo_shard11_replica2] o.a.s.c.RecoveryStrategy Error while trying to recover:org.apache.solr.common.SolrException: Replication for recovery failed.
>         at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:165)
>         at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
>         at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
> {code}
> So a simple read timeout exception leads to re-downloading the whole index
> again, and again, and again ...
> It also looks like any exception raised in fetchPackets would be squelched if
> an exception is raised in cleanup (called in the finally block)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
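The retry behavior the issue asks for can be sketched in isolation. This is a hedged illustration of the idea, not the actual SOLR-9050 patch or Solr's `IndexFetcher` API: the `FileSource` interface, `MAX_RETRIES`, and all names here are hypothetical. The point is that a read timeout mid-file should trigger a bounded retry from the current offset, rather than immediately failing the fetch and forcing a full index re-download.

```java
import java.net.SocketTimeoutException;

public class RetryingFetch {
    // Hypothetical abstraction over the packet stream from the leader.
    interface FileSource {
        // Returns bytes read for the next packet, -1 at end of file;
        // throws SocketTimeoutException when a read stalls.
        int readPacket() throws SocketTimeoutException;
    }

    static final int MAX_RETRIES = 5;

    // Fetch one file, tolerating up to MAX_RETRIES consecutive timeouts.
    static long fetchFile(FileSource source) throws SocketTimeoutException {
        long bytes = 0;
        int retries = 0;
        while (true) {
            try {
                int n = source.readPacket();
                if (n < 0) return bytes;   // end of file: success
                bytes += n;
                retries = 0;               // progress made, reset retry budget
            } catch (SocketTimeoutException e) {
                if (++retries > MAX_RETRIES) throw e;  // give up; caller falls back
                // otherwise loop and retry from the current offset
            }
        }
    }
}
```

Because the retry budget resets whenever a packet arrives, a slow-but-progressing transfer of a multi-gigabyte `.cfs` file is never aborted; only a connection that stalls repeatedly with no progress is.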
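The squelching the last paragraph describes is the standard Java try/finally masking pattern: an exception thrown from a finally block replaces whatever exception the try block was propagating. A minimal sketch, assuming nothing about Solr's actual code (the `fetchPackets`/`cleanup` names merely mirror the report), with a safer variant that preserves the original exception via `Throwable.addSuppressed` (Java 7+):

```java
public class SquelchDemo {
    static void fetchPackets() {
        // stand-in for the real fetch failing with a read timeout
        throw new RuntimeException("Read timed out");
    }

    static void cleanup() {
        // stand-in for cleanup noticing the incomplete download
        throw new RuntimeException("Unable to download file completely");
    }

    // Buggy shape: the cleanup exception thrown in finally masks
    // the original fetch exception, which is silently discarded.
    static void fetchFileBuggy() {
        try {
            fetchPackets();
        } finally {
            cleanup();
        }
    }

    // Safer shape: keep the primary exception and attach the cleanup
    // failure as a suppressed exception instead of replacing it.
    static void fetchFileSafe() {
        RuntimeException primary = null;
        try {
            fetchPackets();
        } catch (RuntimeException e) {
            primary = e;
        } finally {
            try {
                cleanup();
            } catch (RuntimeException ce) {
                if (primary != null) primary.addSuppressed(ce);
                else primary = ce;
            }
        }
        if (primary != null) throw primary;
    }
}
```

With the buggy shape, the caller only ever sees "Unable to download file completely", which matches the log above: the `SocketTimeoutException` root cause never appears in the `ReplicationHandler` error.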