Hi Phil,
Regarding the problem with my app: it wasn't what you suggested (about the tokens, etc.).
For some later work, my indexer creates both a "path" field that is analyzed
(and thus tokenized, etc.) and another field, "fullpath", which is not
analyzed (and thus not tokenized).
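In case it's useful to see, the indexing side looks roughly like this (a
sketch against the Lucene 2.4 API; the two field names are my real ones, but
the directory path and writer setup are made up for illustration, and error
handling is omitted):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    // Index one file path as both an analyzed and a non-analyzed field.
    IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/tmp/testindex"),
            new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    String path = "/a/b/c.txt";
    // "path" is analyzed, so the analyzer tokenizes it at index time.
    doc.add(new Field("path", path, Field.Store.YES, Field.Index.ANALYZED));
    // "fullpath" is not analyzed: the whole path is kept as a single term.
    doc.add(new Field("fullpath", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);
    writer.close();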
The problem with my app was that I was creating a TermEnum:
    Term term = new Term("fullpath", "");
    TermEnum termsEnumerator = reader.terms(term);
and then going immediately into a while loop:
    while (termsEnumerator.next()) {
        ...
    }
i.e., I was skipping the first term in the TermEnum: reader.terms() returns an
enumerator that is already positioned on the first matching term, so the
initial .next() bumps it straight to the 2nd term.
Anyway, the code that I ended up with is:
    try {
        // reader.terms() returns an enumerator already positioned on the
        // first matching term, so handle that term before calling next().
        currentTerm = termsEnumerator.term();
        if (currentTerm != null
                && currentTerm.field().equalsIgnoreCase("fullpath")) {
            termpathcount++;
            System.out.println("Count=" + termpathcount + " Field = ["
                    + currentTerm.field() + "] Term = [" + currentTerm.text() + "]");
        }
        // next() advances to the 2nd term and beyond.
        while (termsEnumerator.next()) {
            currentTerm = termsEnumerator.term();
            currentField = currentTerm.field();
            if (currentField.equalsIgnoreCase("fullpath")) {
                termpathcount++;
                System.out.println("Count=" + termpathcount + " Field = ["
                        + currentField + "] Term = [" + currentTerm.text() + "]");
            }
        }
        termsEnumerator.close();
        System.out.println("Matching terms count = " + termpathcount);
    } catch (Exception e) {
        System.out.println("** ERROR **: Exception while stepping through"
                + " index: [" + e + "]");
        e.printStackTrace();
    }
and that seems to be working perfectly.
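For what it's worth, the same traversal can also be written as a do/while,
which makes it harder to forget that first term. This is just a sketch
against the same Lucene 2.x TermEnum API ("reader" is the IndexReader from
above):

    TermEnum te = reader.terms(new Term("fullpath", ""));
    try {
        // term() is already valid on entry; call next() only after using it.
        do {
            Term t = te.term();
            // Terms are ordered by field, so stop once we leave "fullpath"
            // (t is null if the seek ran off the end of the index).
            if (t == null || !t.field().equals("fullpath")) {
                break;
            }
            System.out.println("fullpath term: " + t.text());
        } while (te.next());
    } finally {
        te.close();
    }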
Also, thanks for following up on that Luke problem. That was one piece of
this "puzzle" that was driving me batty :)
Jim
---- Phil Whelan <[email protected]> wrote:
> Hi Jim,
>
> On Sun, Aug 2, 2009 at 1:32 AM, <[email protected]> wrote:
> > I first noticed the problem that I'm seeing while working on this latter
> > app. Basically, what I noticed was that although I was adding 13 documents
> > to the index, when I listed the "path" terms, there were only 12 of them.
>
> Field text (the whole "path" in your case) and terms (the tokens of
> the field text) are different.
>
> The StandardAnalyzer breaks up words like this...
> Field text = "/a/b/c.txt"
> Tokens = {"a","b","c","txt"}
>
> So this 1 field of 1 document becomes 4 terms/tokens (I'm not sure
> whether there's a real difference in terminology between "terms" and
> "tokens", sorry).
> Therefore, you're going to have more terms than documents initially,
> but as the overlap in term usage increases this changes.
>
> For instance, these 3 paths
> "/a/b/c/d.txt", "/b/c/d/a.txt", "/c/d/a/b.txt" still produce only 4
> distinct terms in total, since they share the same tokens.
>
> In fact, StandardAnalyzer goes a bit further than that and removes
> "stop-words", such as "a" (or "an", "the") as it's designed for
> general text searching.
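>
> If you want to see exactly which tokens the analyzer emits for a given
> string, a little test like this works (a sketch from memory against the
> Lucene 2.4 TokenStream API, so double-check it against your version;
> error handling is omitted):
>
>     import java.io.StringReader;
>     import org.apache.lucene.analysis.Token;
>     import org.apache.lucene.analysis.TokenStream;
>     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>
>     // Dump the tokens StandardAnalyzer produces for one path string.
>     StandardAnalyzer analyzer = new StandardAnalyzer();
>     TokenStream stream = analyzer.tokenStream("path",
>             new StringReader("/a/b/c.txt"));
>     Token tok = new Token();
>     while ((tok = stream.next(tok)) != null) {
>         // prints: b, c, txt -- "a" is removed as a stop word
>         System.out.println(tok.term());
>     }
>     stream.close();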
>
> That said, I think you have a point with the next part of your question...
>
> > So then, I reviewed the index using Luke, and what I saw with that was that
> > there were indeed only 12 "path" terms (under "Term Count" on the left),
> > but, when I clicked the "Show Top Terms" in Luke, there were 13 terms
> > listed by Luke.
>
> Yes, I just checked this and it seems to be a bug in Luke. The
> "Term Count" it shows is always 1 less than it should be. Well spotted.
>
> Cheers,
> Phil