Hi Phil,
Regarding the problem with my app: it wasn't what you suggested (about the tokens, etc.).
For some later work, my indexer creates both a "path" field that is analyzed
(and thus tokenized, etc.) and another field, "fullpath", which is not
analyzed (and thus not tokenized).
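In case it's useful to see, the indexing side looks roughly like this (a
sketch against the Lucene 2.4 API; the two field names are my real ones, but
the directory path and writer setup are made up for illustration, and error
handling is omitted):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    // Index one file path as both an analyzed and a non-analyzed field.
    IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/tmp/testindex"),
            new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    String path = "/a/b/c.txt";
    // "path" is analyzed, so the analyzer tokenizes it at index time.
    doc.add(new Field("path", path, Field.Store.YES, Field.Index.ANALYZED));
    // "fullpath" is not analyzed: the whole path is kept as a single term.
    doc.add(new Field("fullpath", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    writer.close();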
The problem with my app was that I was creating a TermEnum:
    Term term = new Term("fullpath", "");
    TermEnum termsEnumerator = reader.terms(term);
and then going immediately into a while loop:
    while (termsEnumerator.next()) {
        ...
    }
i.e., I was skipping the first term in the TermEnum: reader.terms() returns an
enumerator that is already positioned on the first matching term, so the
initial .next() bumps it straight to the 2nd term.
Anyway, the code that I ended up with is:
    try {
        // reader.terms() returns an enumerator already positioned on the
        // first matching term, so handle that term before calling next().
        currentTerm = termsEnumerator.term();
        if (currentTerm != null
                && currentTerm.field().equalsIgnoreCase("fullpath")) {
            termpathcount++;
            System.out.println("Count=" + termpathcount + " Field = ["
                    + currentTerm.field() + "] Term = [" + currentTerm.text() + "]");
        }
        // next() advances to the 2nd term and beyond.
        while (termsEnumerator.next()) {
            currentTerm = termsEnumerator.term();
            currentField = currentTerm.field();
            if (currentField.equalsIgnoreCase("fullpath")) {
                termpathcount++;
                System.out.println("Count=" + termpathcount + " Field = ["
                        + currentField + "] Term = [" + currentTerm.text() + "]");
            }
        }
        termsEnumerator.close();
        System.out.println("Matching terms count = " + termpathcount);
    } catch (Exception e) {
        System.out.println("** ERROR **: Exception while stepping through"
                + " index: [" + e + "]");
        e.printStackTrace();
    }
and that seems to be working perfectly.
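For what it's worth, the same traversal can also be written as a do/while,
which makes it harder to forget that first term. This is just a sketch
against the same Lucene 2.x TermEnum API ("reader" is the IndexReader from
above):

    TermEnum te = reader.terms(new Term("fullpath", ""));
    try {
        // term() is already valid on entry; call next() only after using it.
        do {
            Term t = te.term();
            // Terms are ordered by field, so stop once we leave "fullpath"
            // (t is null if the seek ran off the end of the index).
            if (t == null || !t.field().equals("fullpath")) {
                break;
            }
            System.out.println("fullpath term: " + t.text());
        } while (te.next());
    } finally {
        te.close();
    }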
Also, thanks for following up on that Luke problem. That was one piece of
this "puzzle" that was driving me batty :)
Jim
---- Phil Whelan <[email protected]> wrote:
> Hi Jim,
>
> On Sun, Aug 2, 2009 at 1:32 AM, <[email protected]> wrote:
> > I first noticed the problem that I'm seeing while working on this latter
> > app. Basically, what I noticed was that although I was adding 13 documents
> > to the index, when I listed the "path" terms, there were only 12 of them.
>
> Field text (the whole "path" in your case) and terms (the tokens of
> the field text) are different.
>
> The StandardAnalyzer breaks up words like this...
> Field text = "/a/b/c.txt"
> Tokens = {"a","b","c","txt"}
>
> So this 1 field of 1 document becomes 4 terms/tokens (I'm not sure
> whether there's a real difference in terminology between "terms" and
> "tokens", sorry).
> Therefore, you're going to have more terms than documents initially,
> but as the overlap in term usage increases this changes.
>
> For instance, these 3 paths
> "/a/b/c/d.txt", "/b/c/d/a.txt", "/c/d/a/b.txt" still produce only 4
> distinct terms in total, since they share the same tokens.
>
> In fact, StandardAnalyzer goes a bit further than that and removes
> "stop-words", such as "a" (or "an", "the") as it's designed for
> general text searching.
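>
> If you want to see exactly which tokens the analyzer emits for a given
> string, a little test like this works (a sketch from memory against the
> Lucene 2.4 TokenStream API, so double-check it against your version;
> error handling is omitted):
>
>     import java.io.StringReader;
>     import org.apache.lucene.analysis.Token;
>     import org.apache.lucene.analysis.TokenStream;
>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>
>     // Dump the tokens StandardAnalyzer produces for one path string.
>     StandardAnalyzer analyzer = new StandardAnalyzer();
>     TokenStream stream = analyzer.tokenStream("path",
>             new StringReader("/a/b/c.txt"));
>     Token tok = new Token();
>     while ((tok = stream.next(tok)) != null) {
>         // prints: b, c, txt -- "a" is removed as a stop word
>         System.out.println(tok.term());
>     }
>     stream.close();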
>
> That said, I think you have a point with the next part of your question...
>
> > So then, I reviewed the index using Luke, and what I saw with that was that
> > there were indeed only 12 "path" terms (under "Term Count" on the left),
> > but, when I clicked the "Show Top Terms" in Luke, there were 13 terms
> > listed by Luke.
>
> Yes, I just checked this and it seems to be a bug in Luke. The
> "Term Count" it shows is always 1 less than it should be. Well spotted.
>
> Cheers,
> Phil