Btw, this 2.9 indexer is fast! I indexed 4 GB (1.07 million docs), including
optimization, in just under 30 min.
I set the RAM buffer to 1.9 GB with setRAMBufferSizeMB.

Peter
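
For anyone hitting the same symptom as in the quoted messages below: here is a minimal sketch, assuming the source text is available as plain Java strings, of how one might look for U+FFFF before a document reaches the indexer. The class and method names are made up for illustration and are not part of Lucene, and swapping in U+FFFD is only one plausible cleanup, not a description of what the LUCENE-2016 patch does internally.

```java
class FfffScan {
    // U+FFFF is a Unicode noncharacter; a cast keeps raw escapes out of the source.
    static final char ILLEGAL = (char) 0xFFFF;
    // U+FFFD is the Unicode replacement character (an assumed substitute).
    static final char REPLACEMENT = (char) 0xFFFD;

    // Index of the first U+FFFF in text, or -1 if there is none.
    static int firstIllegal(String text) {
        return text.indexOf(ILLEGAL);
    }

    // Copy of text with every U+FFFF swapped for U+FFFD.
    static String sanitize(String text) {
        return text.replace(ILLEGAL, REPLACEMENT);
    }

    public static void main(String[] args) {
        // Made-up sample value; real field contents will differ.
        String field = "cfid196$" + ILLEGAL + "leader";
        int pos = firstIllegal(field);
        if (pos >= 0) {
            System.out.println("U+FFFF found at offset " + pos);
            field = sanitize(field);
        }
        System.out.println("clean: " + (firstIllegal(field) < 0));
    }
}
```

Running a check like this over the raw documents, rather than inside the Tokenizer, may be an easier way to isolate the handful of affected sources.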

On Thu, Oct 29, 2009 at 3:46 PM, Peter Keegan <peterlkee...@gmail.com> wrote:

> A handful of the source documents did contain the U+FFFF character. The
> patch from LUCENE-2016 <https://issues.apache.org/jira/browse/LUCENE-2016>
> fixed the problem.
> Thanks Mike!
>
> Peter
>
>
> On Wed, Oct 28, 2009 at 1:29 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hmm, only a few affected terms, and all this particular
>> "literals:cfid196$" term, with optional suffixes.  Really strange.
>>
>> One thing that's odd is the exact term "literals:cfid196$" is printed
>> twice, which should never happen (every unique term should be stored
>> only once, in the terms dict).
>>
>> And, otherwise, CheckIndex got through the index just fine.
>>
>> Try searching a TermQuery with these affected terms and see if it
>> succeeds?  If so, maybe try making an index with one or two of
>> them, alone, and see if that index shows the problem?
>>
>> OK I'm attaching more mods.  Can you re-run your CheckIndex?  It will
>> produce an enormous amount of output, but if you can excise the few
>> lines around when that warning comes out & post back that'd be great.
>>
>> Mike
>>
>> On Wed, Oct 28, 2009 at 12:23 PM, Peter Keegan <peterlkee...@gmail.com>
>> wrote:
>> > Just to be safe, I ran with the official jar file from one of the mirrors
>> > and reproduced the problem.
>> > The debug session is not showing any characters = '\uffff' (checking this
>> > in Tokenizer).
>> > The output from the modified CheckIndex follows. There are only a few terms
>> > with the inconsistency. They are all legitimate terms from the app's
>> > context. With this info, I might be able to isolate the source documents.
>> > What should I be looking for when they are indexed?
>> >
>> > CheckIndex output:
>> >
>> > Opening index @ D:\mnsavs\lresumes1\lresumes1.luc\lresumes1.search.main.4
>> >
>> > Segments file=segments_2 numSegments=3 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
>> >  1 of 3: name=_0 docCount=413585
>> >    compound=false
>> >    hasProx=true
>> >    numFiles=8
>> >    size (MB)=1,148.817
>> >    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >    docStoreOffset=0
>> >    docStoreSegment=_0
>> >    docStoreIsCompoundFile=false
>> >    no deletions
>> >    test: open reader.........OK
>> >    test: fields..............OK [33 fields]
>> >    test: field norms.........OK [33 fields]
>> >    test: terms, freq, prox...OK [7704753 terms; 180326717 terms/docs pairs; 340244234 tokens]
>> >    test: stored fields.......OK [1240755 total field count; avg 3 fields per doc]
>> >    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>> >
>> >  2 of 3: name=_1 docCount=359068
>> >    compound=false
>> >    hasProx=true
>> >    numFiles=8
>> >    size (MB)=1,125.161
>> >    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >    docStoreOffset=413585
>> >    docStoreSegment=_0
>> >    docStoreIsCompoundFile=false
>> >    no deletions
>> >    test: open reader.........OK
>> >    test: fields..............OK [33 fields]
>> >    test: field norms.........OK [33 fields]
>> >    test: terms, freq, prox...WARNING: term  literals:cfid196$ docFreq=43 != num docs seen 4 + num docs deleted 0
>> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>> > WARNING: term  literals:cfid196$commandant docFreq=1 != num docs seen 9 + num docs deleted 0
>> > WARNING: term  literals:cfid196$on docFreq=3178 != num docs seen 1 + num docs deleted 0
>> > OK [7137621 terms; 179101847 terms/docs pairs; 346076058 tokens]
>> >    test: stored fields.......OK [1077204 total field count; avg 3 fields per doc]
>> >    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>> >
>> >  3 of 3: name=_2 docCount=304849
>> >    compound=false
>> >    hasProx=true
>> >    numFiles=8
>> >    size (MB)=962.004
>> >    diagnostics = {os.version=5.2, os=Windows 2003, lucene.version=2.9.0
>> > 817268P - 2009-09-21 10:25:09, source=flush, os.arch=amd64,
>> > java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >    docStoreOffset=772653
>> >    docStoreSegment=_0
>> >    docStoreIsCompoundFile=false
>> >    no deletions
>> >    test: open reader.........OK
>> >    test: fields..............OK [33 fields]
>> >    test: field norms.........OK [33 fields]
>> >    test: terms, freq, prox...WARNING: term  contents:? docFreq=1 != num docs seen 246 + num docs deleted 0
>> > WARNING: term  literals:cfid196$ docFreq=45 != num docs seen 4 + num docs deleted 0
>> > WARNING: term  literals:cfid196$ docFreq=1 != num docs seen 4 + num docs deleted 0
>> > WARNING: term  literals:cfid196$cashier docFreq=1 != num docs seen 37 + num docs deleted 0
>> > WARNING: term  literals:cfid196$interrogation docFreq=181 != num docs seen 1 + num docs deleted 0
>> > WARNING: term  literals:cfid196$leader docFreq=1 != num docs seen 353 + num docs deleted 0
>> > WARNING: term  literals:cfid196$microsoft docFreq=3114 != num docs seen 1 + num docs deleted 0
>> > WARNING: term  literals:cfid196$nt docFreq=200 != num docs seen 1 + num docs deleted 0
>> > OK [6497769 terms; 145296880 terms/docs pairs; 293458734 tokens]
>> >    test: stored fields.......OK [914547 total field count; avg 3 fields per doc]
>> >    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
>> >
>> > No problems were detected with this index.
>> >
>> > Peter
>> >
>> >
>> > On Wed, Oct 28, 2009 at 11:29 AM, Michael McCandless <
>> > luc...@mikemccandless.com> wrote:
>> >
>> >> On Wed, Oct 28, 2009 at 10:58 AM, Peter Keegan <peterlkee...@gmail.com>
>> >> wrote:
>> >> > The only change I made to the source code was the patch for
>> >> > PayloadNearQuery (LUCENE-1986).
>> >>
>> >> That patch certainly shouldn't lead to this.
>> >>
>> >> > It's possible that our content contains U+FFFF. I will run in debugger
>> >> > and see.
>> >>
>> >> OK may as well check just so we cover all possibilities.
>> >>
>> >> > The data is 'sensitive', so I may not be able to provide a bad segment,
>> >> > unfortunately.
>> >>
>> >> OK, maybe we can modify your CheckIndex instead.  Let's start with
>> >> this, which prints a warning whenever the docFreq differs but
>> >> otherwise continues (vs throwing RuntimeException).  I'm curious how
>> >> many terms show this, and whether the TermEnum keeps working after
>> >> this term that has different docFreq:
>> >>
>> >> Index: src/java/org/apache/lucene/index/CheckIndex.java
>> >> ===================================================================
>> >> --- src/java/org/apache/lucene/index/CheckIndex.java    (revision 829889)
>> >> +++ src/java/org/apache/lucene/index/CheckIndex.java    (working copy)
>> >> @@ -672,8 +672,8 @@
>> >>         }
>> >>
>> >>         if (freq0 + delCount != docFreq) {
>> >> -          throw new RuntimeException("term " + term + " docFreq=" +
>> >> -                                     docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>> >> +          System.out.println("WARNING: term  " + term + " docFreq=" +
>> >> +                             docFreq + " != num docs seen " + freq0 + " + num docs deleted " + delCount);
>> >>         }
>> >>       }
>> >>
>> >> Mike
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>> >
>>
>
>
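
The consistency test discussed in this thread can be sketched outside Lucene as follows. This is only an illustration of the condition in the patched CheckIndex quoted above (docFreq stored in the terms dictionary versus documents actually seen plus deleted documents). The class name, the method, and the use of a plain String for the term are assumptions made for the example, not Lucene API.

```java
class DocFreqCheck {
    // Returns a warning in the same shape as the patched CheckIndex prints,
    // or null when the stored docFreq matches docs seen + docs deleted.
    static String check(String term, int docFreq, int docsSeen, int delCount) {
        if (docsSeen + delCount != docFreq) {
            return "WARNING: term  " + term + " docFreq=" + docFreq
                + " != num docs seen " + docsSeen
                + " + num docs deleted " + delCount;
        }
        return null;
    }

    public static void main(String[] args) {
        // Numbers taken from segment _1 in the CheckIndex output quoted above.
        String warning = check("literals:cfid196$on", 3178, 1, 0);
        if (warning != null) {
            System.out.println(warning);
        }
    }
}
```

Reporting instead of throwing is what lets the run continue past the first bad term, so every affected term in a segment gets listed.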
