[ 
https://issues.apache.org/jira/browse/SOLR-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15864195#comment-15864195
 ] 

Erick Erickson edited comment on SOLR-10006 at 2/13/17 7:13 PM:
----------------------------------------------------------------

Still fails. See the attached log for everything after I restarted the Solr 
node on which I had removed some index files from one of the cores. This is on 
a fresh 6x pull from within the last hour.

The take-away here is that the Solr core must be restarted so that there is 
never an open searcher on that core; perhaps your stress test isn't doing 
that? In this state, commands appear to succeed.

So I poked around a little more; here are a couple of observations:

> For this scenario to fail you must restart Solr. I suspect the precondition 
> here is that the searcher has never been successfully opened.

> Reloading the core from the admin UI silently fails when a .doc file has 
> been removed. By that I mean the UI doesn't show any problems even though 
> the log file has exceptions.

> The Core Admin API correctly reports an error for action=RELOAD, though 
> (via curl or the like).
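
The Core Admin call I mean looks roughly like this; the host/port and core 
name below are from my local reproduction (assumptions, they will differ 
elsewhere), and the script only builds and prints the URL:

```shell
# Build the Core Admin RELOAD request that correctly reports the error.
# Host/port and core name are assumptions from this local reproduction.
SOLR="http://localhost:8983/solr"
CORE="eoe_shard1_replica2"
RELOAD_URL="${SOLR}/admin/cores?action=RELOAD&core=${CORE}&wt=json"
echo "$RELOAD_URL"
# Issue it with, e.g.: curl "$RELOAD_URL"
```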

> The admin UI still thinks the replica is active.

> A search on the replica with distrib=false also succeeds, even when I set a 
> very large start parameter, but I suspect this is a function of there still 
> being an open file handle on the file I deleted, so it's "kinda there" until 
> restart.
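
A sketch of that non-distributed query (the replica's host/port is an 
assumption; the core name is the one from the log below; the script only 
builds and prints the URL):

```shell
# Query the single replica directly, bypassing distributed search, with a
# deliberately large start offset. Host/port are assumptions.
CORE_URL="http://localhost:8983/solr/eoe_shard1_replica2"
QUERY_URL="${CORE_URL}/select?q=*:*&distrib=false&start=100000&rows=10"
echo "$QUERY_URL"
# curl "$QUERY_URL"   # still succeeds while the deleted file's handle is open
```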

> At this point (the searcher is working even though the .doc file is 
> missing), a fetchindex doesn't think there's any work to do, so it 
> "succeeds", i.e. it doesn't fetch from the masterUrl. Here are the log 
> messages:

INFO  - 2017-02-13 18:50:57.434; [c:eoe s:shard1 r:core_node2 
x:eoe_shard1_replica2] org.apache.solr.core.SolrCore; [eoe_shard1_replica2]  
webapp=/solr path=/replication 
params={masterUrl=http://localhost:8982/solr/eoe_shard1_replica1&command=fetchindex}
 status=0 QTime=0
INFO  - 2017-02-13 18:50:57.439; [c:eoe s:shard1 r:core_node2 
x:eoe_shard1_replica2] org.apache.solr.handler.IndexFetcher; Master's 
generation: 4
INFO  - 2017-02-13 18:50:57.439; [c:eoe s:shard1 r:core_node2 
x:eoe_shard1_replica2] org.apache.solr.handler.IndexFetcher; Master's version: 
1487010762766
INFO  - 2017-02-13 18:50:57.439; [c:eoe s:shard1 r:core_node2 
x:eoe_shard1_replica2] org.apache.solr.handler.IndexFetcher; Slave's 
generation: 4
INFO  - 2017-02-13 18:50:57.439; [c:eoe s:shard1 r:core_node2 
x:eoe_shard1_replica2] org.apache.solr.handler.IndexFetcher; Slave's version: 
1487010762766
INFO  - 2017-02-13 18:50:57.439; [c:eoe s:shard1 r:core_node2 
x:eoe_shard1_replica2] org.apache.solr.handler.IndexFetcher; Slave in sync with 
master.
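
The fetchindex request that produced those log lines can be reconstructed 
like this (the replica's own host/port is an assumption; the masterUrl 
matches the logged request; the script only builds and prints the URL):

```shell
# Rebuild the /replication fetchindex call seen in the log. The replica's
# host/port is an assumption; masterUrl is taken from the logged params.
REPLICA="http://localhost:8983/solr/eoe_shard1_replica2"
MASTER_URL="http://localhost:8982/solr/eoe_shard1_replica1"
FETCH_URL="${REPLICA}/replication?command=fetchindex&masterUrl=${MASTER_URL}"
echo "$FETCH_URL"
# curl "$FETCH_URL"   # IndexFetcher compares generations and skips the fetch
```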



was (Author: erickerickson):
Still fails, see the attached log for everything after I restarted the solr 
node that I had removed some index files from one of the cores on. This is on a 
fresh 6x pull in the last hour.

> Cannot do a full sync (fetchindex) if the replica can't open a searcher
> -----------------------------------------------------------------------
>
>                 Key: SOLR-10006
>                 URL: https://issues.apache.org/jira/browse/SOLR-10006
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 5.3.1, 6.4
>            Reporter: Erick Erickson
>         Attachments: SOLR-10006.patch, SOLR-10006.patch, solr.log, solr.log
>
>
> Doing a full sync or fetchindex requires an open searcher and if you can't 
> open the searcher those operations fail.
> For discussion. I've seen a situation in the field where a replica's index 
> became corrupt. When the node was restarted, the replica tried to do a full 
> sync but failed because the core couldn't open a searcher. The replica went 
> into an endless sync/fail/sync cycle.
> I couldn't reproduce that exact scenario, but it's easy enough to get into a 
> similar situation. Create a 2x2 collection and index some docs. Then stop one 
> of the instances and go in and remove a couple of segments files and restart.
> The replica stays in the "down" state, fine so far.
> Manually issue a fetchindex. That fails because the replica can't open a 
> searcher. Sure, issuing a fetchindex is abusive... but I think it's the same 
> underlying issue: why should we care about the state of a replica's current 
> index when we're going to completely replace it anyway?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
