1) CheckIndex is not supposed to change a corrupt segment, only remove it.

2) Are you using local hard disks, or do you run on a common SAN or remote
file server? I have seen corruption errors on SANs, where existing files
have random changes.
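(For anyone following along: CheckIndex is a command-line tool shipped in the Lucene core jar. A minimal sketch of running it offline, matching the caveats quoted below; the jar name and index path are assumptions for your setup, and Solr, or any other IndexWriter, should be stopped first:)

```shell
# Stop Solr (or any process holding an IndexWriter) before touching the index.

# Read-only check first -- this never modifies the index:
java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex \
    /path/to/solr/data/index

# Only if corruption is confirmed: -fix rewrites the segments file,
# dropping any broken segments (their documents are lost):
java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex \
    /path/to/solr/data/index -fix
```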
On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> Generally it's not safe to run CheckIndex if a writer is also open on the
> index.
>
> It's not safe because CheckIndex could hit FNFEs on opening files,
> or, if you use -fix, CheckIndex will change the index out from under
> your other IndexWriter (which will then cause other kinds of
> corruption).
>
> That said, I don't think the corruption that CheckIndex is detecting
> in your index would be caused by having a writer open on the index.
> Your first CheckIndex has a different deletes file (_phe_p3.del, with
> 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with
> 44828 deleted docs), so it must somehow have to do with that change.
>
> One question: if you have a corrupt index, and run CheckIndex on it
> several times in a row, does it always fail in the same way? (Ie the
> same term hits the below exception.)
>
> Is there any way I could get a copy of one of your corrupt cases? I
> can then dig...
>
> Mike
>
> On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat
> <stephane.delp...@blogspirit.com> wrote:
>> I understand less and less what is happening to my Solr.
>>
>> I did a checkIndex (without -fix) and there was an error...
>>
>> So I did another checkIndex with -fix, and then the error was gone. The
>> segment was alright.
>>
>> During checkIndex I do not shut down the Solr server, I just make sure
>> no client connects to the server.
>>
>> Should I shut down the Solr server during checkIndex?
>>
>> first checkIndex:
>>
>>   4 of 17: name=_phe docCount=264148
>>     compound=false
>>     hasProx=true
>>     numFiles=9
>>     size (MB)=928.977
>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>> java.vendor=Sun Microsystems Inc.}
>>     has deletions [delFileName=_phe_p3.del]
>>     test: open reader.........OK [44824 deleted docs]
>>     test: fields..............OK [51 fields]
>>     test: field norms.........OK [51 fields]
>>     test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0]
>> java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0
>>         at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>     test: stored fields.......OK [7206878 total field count; avg 32.86 fields per doc]
>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>> FAILED
>>     WARNING: fixIndex() would remove reference to this segment; full exception:
>> java.lang.RuntimeException: Term Index test failed
>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>
>> a few minutes later:
>>
>>   4 of 18: name=_phe docCount=264148
>>     compound=false
>>     hasProx=true
>>     numFiles=9
>>     size (MB)=928.977
>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>> java.vendor=Sun Microsystems Inc.}
>>     has deletions [delFileName=_phe_p4.del]
>>     test: open reader.........OK [44828 deleted docs]
>>     test: fields..............OK [51 fields]
>>     test: field norms.........OK [51 fields]
>>     test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs; 28919124 tokens]
>>     test: stored fields.......OK [7206764 total field count; avg 32.86 fields per doc]
>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>
>> On 12/01/2011 16:50, Michael McCandless wrote:
>>>
>>> Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted 0?
>>>
>>> It looks like new deletions were flushed against the segment (del file
>>> changed from _ncc_22s.del to _ncc_24f.del).
>>>
>>> Are you hitting any exceptions during indexing?
>>>
>>> Mike
>>>
>>> On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat
>>> <stephane.delp...@blogspirit.com> wrote:
>>>>
>>>> I got another corruption.
>>>>
>>>> It sure looks like it's the same type of error (on a different field).
>>>>
>>>> It's also not linked to a merge, since the segment size did not change.
>>>>
>>>> *** good segment:
>>>>
>>>>   1 of 9: name=_ncc docCount=1841685
>>>>     compound=false
>>>>     hasProx=true
>>>>     numFiles=9
>>>>     size (MB)=6,683.447
>>>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>> java.vendor=Sun Microsystems Inc.}
>>>>     has deletions [delFileName=_ncc_22s.del]
>>>>     test: open reader.........OK [275881 deleted docs]
>>>>     test: fields..............OK [51 fields]
>>>>     test: field norms.........OK [51 fields]
>>>>     test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs; 204561440 tokens]
>>>>     test: stored fields.......OK [45511958 total field count; avg 29.066 fields per doc]
>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>>
>>>> a few hours later:
>>>>
>>>> *** broken segment:
>>>>
>>>>   1 of 17: name=_ncc docCount=1841685
>>>>     compound=false
>>>>     hasProx=true
>>>>     numFiles=9
>>>>     size (MB)=6,683.447
>>>>     diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64,
>>>> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
>>>> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
>>>> java.vendor=Sun Microsystems Inc.}
>>>>     has deletions [delFileName=_ncc_24f.del]
>>>>     test: open reader.........OK [278167 deleted docs]
>>>>     test: fields..............OK [51 fields]
>>>>     test: field norms.........OK [51 fields]
>>>>     test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num docs seen 0 + num docs deleted 0]
>>>> java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen 0 + num docs deleted 0
>>>>         at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>>>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>>>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>     test: stored fields.......OK [45429565 total field count; avg 29.056 fields per doc]
>>>>     test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>>>> FAILED
>>>>     WARNING: fixIndex() would remove reference to this segment; full exception:
>>>> java.lang.RuntimeException: Term Index test failed
>>>>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>>>>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>>>>
>>>> I'll activate infoStream for next time.
>>>>
>>>> Thanks,
>>>>
>>>> On 12/01/2011 00:49, Michael McCandless wrote:
>>>>>
>>>>> When you hit corruption, is it always this same problem?
>>>>>
>>>>>   java.lang.RuntimeException: term source:margolisphil docFreq=1 !=
>>>>> num docs seen 0 + num docs deleted 0
>>>>>
>>>>> Can you run with Lucene's IndexWriter infoStream turned on, and catch
>>>>> the output leading to the corruption? If something is somehow messing
>>>>> up the bits in the deletes file, that could cause this.
>>>>>
>>>>> Mike
>>>>>
>>>>> On Mon, Jan 10, 2011 at 5:52 AM, Stéphane Delprat
>>>>> <stephane.delp...@blogspirit.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We are using:
>>>>>> Solr Specification Version: 1.4.1
>>>>>> Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
>>>>>> Lucene Specification Version: 2.9.3
>>>>>> Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
>>>>>>
>>>>>> # java -version
>>>>>> java version "1.6.0_20"
>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>>>>
>>>>>> We want to index 4M docs in one core (and when that works fine we will
>>>>>> add other cores with 2M docs on the same server). (1 doc ~= 1 kB)
>>>>>>
>>>>>> We use Solr replication every 5 minutes to update the slave server
>>>>>> (queries are executed on the slave only).
>>>>>>
>>>>>> Documents change very quickly; during a normal day we will have approx:
>>>>>> * 200 000 updated docs
>>>>>> * 1 000 new docs
>>>>>> * 200 deleted docs
>>>>>>
>>>>>> I attached the last good checkIndex: solr20110107.txt
>>>>>> And the corrupted one: solr20110110.txt
>>>>>>
>>>>>> This is not the first time a segment has gotten corrupted on this
>>>>>> server; that's why I ran frequent "checkIndex" passes. (But as you can
>>>>>> see, the first segment is 1,800,000 docs and it works fine!)
>>>>>>
>>>>>> I can't find any "SEVERE", "FATAL" or "exception" in the Solr logs.
>>>>>>
>>>>>> I also attached my schema.xml and solrconfig.xml.
>>>>>>
>>>>>> Is there something wrong with what we are doing? Do you need other
>>>>>> info?
>>>>>>
>>>>>> Thanks,
>>>>>
>>>>
>>>
>>
>

--
Lance Norskog
goks...@gmail.com
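(Editor's note on the infoStream suggestion made earlier in the thread: in Solr 1.4 the IndexWriter infoStream can be enabled from solrconfig.xml rather than in code. This is a sketch based on the 1.4 example config; the element placement and file name should be verified against your own solrconfig.xml.)

```xml
<!-- inside the <indexDefaults> section of solrconfig.xml -->
<!-- When set to true, IndexWriter logs its flush/merge/delete
     activity to the named file; this is the output Mike asked
     to see leading up to the corruption. -->
<infoStream file="INFOSTREAM.txt">true</infoStream>
```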