[jira] [Updated] (SOLR-9836) Add more graceful recovery steps when failing to create SolrCore

Mike Drob (JIRA) Thu, 15 Dec 2016 14:18:19 -0800

     [ https://issues.apache.org/jira/browse/SOLR-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Mike Drob updated SOLR-9836:
----------------------------
    Attachment: SOLR-9836.patch

Current WIP patch.

* Moved {{modifyIndexProps}} to {{SolrCore}}
* Added a system property toggle for controlling the desired recovery behaviour.
** The property name and values are placeholders and by no means final.
** Used an enum because it made sense logically at the time; not sure whether 
this actually matters.
* Switched to looking for {{CorruptIndexException}}
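
A minimal sketch of what an enum-backed system property toggle could look like. Everything named here is an assumption made for illustration: the enum {{CoreRecoveryMode}}, its constant names, and the property name {{solr.core.corruptIndexRecovery}} are hypothetical and are not taken from the patch. The three modes mirror the options discussed in this comment (do nothing, recover from leader, fall back to an earlier segments file).

```java
import java.util.Locale;

/** Hypothetical recovery modes; names are illustrative, not those used in the patch. */
enum CoreRecoveryMode {
    NONE,                        // log and give up on the core (the current behaviour)
    RECOVER_FROM_LEADER,         // discard the local index and replicate from the leader
    FALL_BACK_TO_EARLIER_COMMIT; // try to open an earlier segments_n

    /** Hypothetical property name, chosen only for this sketch. */
    static final String PROPERTY = "solr.core.corruptIndexRecovery";

    /** Reads the mode from the JVM system properties. */
    static CoreRecoveryMode fromSystemProperty() {
        return parse(System.getProperty(PROPERTY));
    }

    /** Parses a raw value; unset, blank, or unknown values fall back to NONE. */
    static CoreRecoveryMode parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return NONE;
        }
        try {
            return valueOf(raw.trim().toUpperCase(Locale.ROOT).replace('-', '_'));
        } catch (IllegalArgumentException e) {
            return NONE; // unknown value: keep the conservative default
        }
    }
}
```

With this shape, starting the JVM with {{-Dsolr.core.corruptIndexRecovery=recover-from-leader}} would select {{RECOVER_FROM_LEADER}}, and a typo in the value would leave the existing behaviour untouched.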

* The implementation of falling back to an earlier segments file is missing, 
pending the questions below (there is a unit test, though).
** It's very hard to tell whether it was actually the segments file that was 
corrupt, or something else.
** Is it sufficient to delete {{segments_n}} and let Lucene try to read from 
the new "latest" commit? Will this screw up replication? Do we need to update 
the generation anywhere else? And I'm still nervous about indiscriminately 
deleting files where recovery might be possible. I guess that's the point of 
the config options.
** Another option is to hack a {{FilterDirectory}} on the index that would hide 
the latest {{segments_n}} file instead of deleting it. That might be enough to 
open the index, but we will likely end up with write conflicts the next time we 
commit.
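
To make the "hide the latest {{segments_n}}" idea concrete, here is a rough sketch of just the name-filtering step, written against plain file-name arrays rather than Lucene's API. The class {{LatestSegmentsHider}} and its methods are hypothetical. A real version would presumably wrap this logic in a {{FilterDirectory}} subclass that overrides {{listAll()}}, and it does nothing about the write-conflict concern above. It relies on Lucene encoding the commit generation in the file name suffix in base 36.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: filters a directory listing so the newest commit is invisible. */
final class LatestSegmentsHider {
    private static final String PREFIX = "segments_";

    /** Returns the generation encoded in a segments_N name, or -1 if the name is not one. */
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX)) {
            return -1;
        }
        try {
            // The suffix is the generation written in base 36 (Character.MAX_RADIX).
            return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Returns the listing without the highest-generation segments_N file. */
    static String[] hideLatestCommit(String[] files) {
        String latest = null;
        long latestGen = -1;
        for (String f : files) {
            long gen = generationOf(f);
            if (gen > latestGen) {
                latestGen = gen;
                latest = f;
            }
        }
        List<String> visible = new ArrayList<>();
        for (String f : files) {
            if (!f.equals(latest)) {
                visible.add(f); // everything except the newest commit point stays visible
            }
        }
        return visible.toArray(new String[0]);
    }
}
```

Given {{segments_1}}, {{segments_2}} and {{segments_a}}, the filter hides {{segments_a}} (generation 10), so a reader working from the filtered listing would see {{segments_2}} as the latest commit.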

The more I toss this idea around, the more it feels like something that would 
be more cleanly handled at the Lucene level. It is possibly best to start with 
two options (recover from leader, do nothing) instead of the initial three 
proposed by [~markrmil...@gmail.com], and to expand on them later.

> Add more graceful recovery steps when failing to create SolrCore
> ----------------------------------------------------------------
>
>                 Key: SOLR-9836
>                 URL: https://issues.apache.org/jira/browse/SOLR-9836
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Mike Drob
>         Attachments: SOLR-9836.patch, SOLR-9836.patch
>
>
> I have seen several cases where there is a zero-length segments_n file. We 
> haven't identified the root cause of these issues (possibly a poorly timed 
> crash during replication?) but if there is another node available then Solr 
> should be able to recover from this situation. Currently, we log and give up 
> on loading that core, leaving the user to manually intervene.
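
As an illustration of the symptom described in the issue, the following sketch scans an index directory for zero-length {{segments_n}} files using only {{java.nio.file}}. The class and method names are hypothetical, and this is not how the patch detects the problem (the patch looks for {{CorruptIndexException}}); it is only one cheap heuristic for the specific zero-length case.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical check for the zero-length segments_n symptom. */
final class EmptySegmentsFileCheck {

    /** Returns every regular file matching segments_* in indexDir whose size is zero. */
    static List<Path> findEmptySegmentsFiles(Path indexDir) throws IOException {
        List<Path> empty = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir, "segments_*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p) && Files.size(p) == 0) {
                    empty.add(p);
                }
            }
        }
        return empty;
    }
}
```

A non-empty result would point at the segments file itself rather than at some other part of the index, which is the distinction the comment above says is otherwise hard to make.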



