1) CheckIndex is not supposed to change a corrupt segment, only remove it.

2) Are you using local hard disks, or do you run on a common SAN or remote
file server? I have seen corruption errors on SANs, where existing files
have random changes.
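(For anyone following along: CheckIndex is a command-line tool shipped in the Lucene core jar. A minimal sketch of running it offline, matching the caveats quoted below; the jar name and index path are assumptions for your setup, and Solr, or any other IndexWriter, should be stopped first:)

```shell
# Stop Solr (or any process holding an IndexWriter) before touching the index.

# Read-only check first -- this never modifies the index:
java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex \
    /path/to/solr/data/index

# Only if corruption is confirmed: -fix rewrites the segments file,
# dropping any broken segments (their documents are lost):
java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex \
    /path/to/solr/data/index -fix
```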
On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> Generally it's not safe to run CheckIndex if a writer is also open on the
> index.
>
> It's not safe because CheckIndex could hit FNFEs on opening files,
> or, if you use -fix, CheckIndex will change the index out from under
> your other IndexWriter (which will then cause other kinds of
> corruption).
>
> That said, I don't think the corruption that CheckIndex is detecting
> in your index would be caused by having a writer open on the index.
> Your first CheckIndex has a different deletes file (_phe_p3.del, with
> 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
> 44828 deleted docs), so it must somehow have to do with that change.
>
> One question: if you have a corrupt index, and run CheckIndex on it
> several times in a row, does it always fail in the same way? (Ie the
> same term hits the below exception.)
>
> Is there any way I could get a copy of one of your corrupt cases? I
> can then dig...
>
> Mike
>
> On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
> <stephane.delp...@blogspirit.com> wrote:
>> I understand less and less what is happening to my Solr.
>>
>> I did a checkIndex (without -fix) and there was an error...
>>
>> So I did another checkIndex with -fix, and then the error was gone. The
>> segment was alright.
>>
>> During checkIndex I do not shut down the Solr server, I just make sure
>> no client connects to the server.
>>
>> Should I shut down the Solr server during checkIndex?
>>
>> first checkIndex:
>>
>>   4 of 17: name=_phe docCount=264148
>>     compound=false
>>     hasProx=true
>>     numFiles=9
>>     size (MB)=928.977
>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>> java.vendor=Sun Microsystems Inc.}
>>     has deletions [delFileName=_phe_p3.del]
>>     test: open reader.........OK [44824 deleted docs]
>>     test: fields..............OK [51 fields]
>>     test: field norms.........OK [51 fields]
>>     test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0]
>> java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0
>>         at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>     test: stored fields.......OK [7206878 total field count; avg 32.86 fields per doc]
>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>> FAILED
>>     WARNING: fixIndex() would remove reference to this segment; full exception:
>> java.lang.RuntimeException: Term Index test failed
>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>
>> a few minutes later:
>>
>>   4 of 18: name=_phe docCount=264148
>>     compound=false
>>     hasProx=true
>>     numFiles=9
>>     size (MB)=928.977
>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>> java.vendor=Sun Microsystems Inc.}
>>     has deletions [delFileName=_phe_p4.del]
>>     test: open reader.........OK [44828 deleted docs]
>>     test: fields..............OK [51 fields]
>>     test: field norms.........OK [51 fields]
>>     test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs; 28919124 tokens]
>>     test: stored fields.......OK [7206764 total field count; avg 32.86 fields per doc]
>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>
>> On 12/01/2011 16:50, Michael McCandless wrote:
>>>
>>> Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted 0?
>>>
>>> It looks like new deletions were flushed against the segment (del file
>>> changed from _ncc_22s.del to _ncc_24f.del).
>>>
>>> Are you hitting any exceptions during indexing?
>>>
>>> Mike
>>>
>>> On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
>>> <stephane.delp...@blogspirit.com> wrote:
>>>>
>>>> I got another corruption.
>>>>
>>>> It sure looks like it's the same type of error (on a different field).
>>>>
>>>> It's also not linked to a merge, since the segment size did not change.
>>>>
>>>> *** good segment:
>>>>
>>>>   1 of 9: name=_ncc docCount=1841685
>>>>     compound=false
>>>>     hasProx=true
>>>>     numFiles=9
>>>>     size (MB)=6,683.447
>>>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>> java.vendor=Sun Microsystems Inc.}
>>>>     has deletions [delFileName=_ncc_22s.del]
>>>>     test: open reader.........OK [275881 deleted docs]
>>>>     test: fields..............OK [51 fields]
>>>>     test: field norms.........OK [51 fields]
>>>>     test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs; 204561440 tokens]
>>>>     test: stored fields.......OK [45511958 total field count; avg 29.066 fields per doc]
>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>
>>>> a few hours later:
>>>>
>>>> *** broken segment:
>>>>
>>>>   1 of 17: name=_ncc docCount=1841685
>>>>     compound=false
>>>>     hasProx=true
>>>>     numFiles=9
>>>>     size (MB)=6,683.447
>>>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>> java.vendor=Sun Microsystems Inc.}
>>>>     has deletions [delFileName=_ncc_24f.del]
>>>>     test: open reader.........OK [278167 deleted docs]
>>>>     test: fields..............OK [51 fields]
>>>>     test: field norms.........OK [51 fields]
>>>>     test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num docs seen 0 + num docs deleted 0]
>>>> java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen 0 + num docs deleted 0
>>>>         at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>     test: stored fields.......OK [45429565 total field count; avg 29.056 fields per doc]
>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>> FAILED
>>>>     WARNING: fixIndex() would remove reference to this segment; full exception:
>>>> java.lang.RuntimeException: Term Index test failed
>>>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>
>>>> I'll activate infoStream for next time.
>>>>
>>>> Thanks,
>>>>
>>>> On 12/01/2011 00:49, Michael McCandless wrote:
>>>>>
>>>>> When you hit corruption, is it always this same problem?
>>>>>
>>>>>   java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
>>>>> num docs seen 0 + num docs deleted 0
>>>>>
>>>>> Can you run with Lucene's IndexWriter infoStream turned on, and catch
>>>>> the output leading to the corruption? If something is somehow messing
>>>>> up the bits in the deletes file, that could cause this.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
>>>>> <stephane.delp...@blogspirit.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We are using:
>>>>>> Solr Specification Version: 1.4.1
>>>>>> Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
>>>>>> Lucene Specification Version: 2.9.3
>>>>>> Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>>>>>>
>>>>>> # java -version
>>>>>> java version "1.6.0_20"
>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>>>>
>>>>>> We want to index 4M docs in one core (and when that works fine we will
>>>>>> add other cores with 2M docs on the same server). (1 doc ~= 1 kB)
>>>>>>
>>>>>> We use Solr replication every 5 minutes to update the slave server
>>>>>> (queries are executed on the slave only).
>>>>>>
>>>>>> Documents change very quickly; during a normal day we will have approx:
>>>>>> * 200 000 updated docs
>>>>>> * 1 000 new docs
>>>>>> * 200 deleted docs
>>>>>>
>>>>>> I attached the last good checkIndex: solr20110107.txt
>>>>>> And the corrupted one: solr20110110.txt
>>>>>>
>>>>>> This is not the first time a segment has gotten corrupted on this
>>>>>> server; that's why I ran frequent "checkIndex" passes. (But as you can
>>>>>> see, the first segment is 1,800,000 docs and it works fine!)
>>>>>>
>>>>>> I can't find any "SEVERE", "FATAL" or "exception" in the Solr logs.
>>>>>>
>>>>>> I also attached my schema.xml and solrconfig.xml.
>>>>>>
>>>>>> Is there something wrong with what we are doing? Do you need other
>>>>>> info?
>>>>>>
>>>>>> Thanks,
>>>>>
>>>>
>>>
>>
>

--
Lance Norskog
goks...@gmail.com
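(Editor's note on the infoStream suggestion made earlier in the thread: in Solr 1.4 the IndexWriter infoStream can be enabled from solrconfig.xml rather than in code. This is a sketch based on the 1.4 example config; the element placement and file name should be verified against your own solrconfig.xml.)

```xml
<!-- inside the <indexDefaults> section of solrconfig.xml -->
<!-- When set to true, IndexWriter logs its flush/merge/delete
     activity to the named file; this is the output Mike asked
     to see leading up to the corruption. -->
<infoStream file="INFOSTREAM.txt">true</infoStream>
```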