Hi Phil,

It turns out the problem with my app wasn't what you suggested (about the tokens, etc.).

For background: for some later processing, my indexer creates both a "path" field that is analyzed (and thus tokenized, etc.) and another field, "fullpath", which is not analyzed (and thus not tokenized), along the lines of the sketch below.
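(This isn't my exact indexing code, just a minimal sketch of the idea against the Lucene 2.4-style Field API; "pathString" and "writer" are stand-ins:)

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // assumes an IndexWriter ("writer") built with StandardAnalyzer
    String pathString = "/a/b/c.txt"; // example value
    Document doc = new Document();
    // "path" is analyzed: the analyzer breaks it into multiple terms
    doc.add(new Field("path", pathString, Field.Store.YES, Field.Index.ANALYZED));
    // "fullpath" is not analyzed: the entire path is indexed as a single term
    doc.add(new Field("fullpath", pathString, Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);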
The actual problem with my app was that I was creating a TermEnum:

    Term term = new Term("fullpath", "");
    termsEnumerator = reader.terms(term);

and then going immediately into a while loop:

    while (termsEnumerator.next()) {
        . . .
    }

i.e., I was ignoring the 1st term in the TermEnum, since reader.terms() leaves the enumerator positioned on the 1st term, and the initial .next() bumps it straight to the 2nd. Anyway, the code that I ended up with is:

    try {
        System.out.println("Outside while: About to get 1st termsEnumerator.term()...");
        // reader.terms() has already positioned the enumerator on the 1st term,
        // so handle that term before entering the loop
        // (currentTerm, currentField, and termpathcount are declared elsewhere)
        currentTerm = termsEnumerator.term();
        currentField = currentTerm.field();
        termpathcount++;
        System.out.println("Outside while: 1st Field = [" + currentField + "] Term = [" + currentTerm.text() + "]");
        System.out.println("Outside while: About to drop into while()...");
        while (termsEnumerator.next()) {
            currentTerm = termsEnumerator.term();
            currentField = currentTerm.field();
            if (currentField.equalsIgnoreCase("fullpath")) {
                termpathcount++;
                System.out.println("Count=" + termpathcount + " Field = [" + currentField + "] Term = [" + currentTerm.text() + "]");
            }
        } // end while()
        termsEnumerator.close();
        System.out.println("Matching terms count = " + termpathcount);
    } catch (Exception e) {
        System.out.println("** ERROR **: Exception while stepping through index: [" + e + "]");
        e.printStackTrace();
    }

and that seems to be working perfectly.
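(As an aside, the same iteration can be written as a do/while so the 1st term isn't a special case. This is just an untested sketch against the same TermEnum API; it also breaks out early once the enumeration moves past the "fullpath" field, since terms are sorted by field name:)

    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    TermEnum termsEnum = reader.terms(new Term("fullpath", ""));
    int count = 0;
    try {
        // term() is null if the index has no terms at or after the seek point
        if (termsEnum.term() != null) {
            do {
                Term t = termsEnum.term();
                if (!t.field().equals("fullpath")) {
                    break; // past the "fullpath" field; nothing more to count
                }
                count++;
                System.out.println("Count=" + count + " Term = [" + t.text() + "]");
            } while (termsEnum.next());
        }
    } finally {
        termsEnum.close();
    }
    System.out.println("Matching terms count = " + count);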
Also, thanks for following up re. that Luke problem. That was one piece of this "puzzle" that was kind of driving me batty :)!!

Jim

---- Phil Whelan <phil...@gmail.com> wrote:
> Hi Jim,
>
> On Sun, Aug 2, 2009 at 1:32 AM, <oh...@cox.net> wrote:
> > I first noticed the problem that I'm seeing while working on this latter
> > app. Basically, what I noticed was that while I was adding 13 documents
> > to the index, when I listed the "path" terms, there were only 12 of them.
>
> Field text (the whole "path" in your case) and terms (the tokens of the
> field text) are different.
>
> The StandardAnalyzer breaks up words like this...
> Field text = "/a/b/c.txt"
> Tokens = {"a", "b", "c", "txt"}
>
> So this 1 field of 1 document becomes 4 terms/tokens (not sure if there
> is a difference in terminology between "terms" and "tokens", sorry).
> Therefore, you're going to have more terms than documents initially, but
> as the overlap in term usage increases, this changes.
>
> For instance, these 3 paths, "/a/b/c/d.txt", "/b/c/d/a.txt", and
> "/c/d/a/b.txt", still amount to only 4 terms in total, since they share
> the same terms.
>
> In fact, StandardAnalyzer goes a bit further than that and removes
> "stop-words", such as "a" (or "an", "the"), as it's designed for general
> text searching.
>
> That said, I think you have a point with the next part of your question...
>
> > So then, I reviewed the index using Luke, and what I saw with that was
> > that there were indeed only 12 "path" terms (under "Term Count" on the
> > left), but, when I clicked "Show Top Terms" in Luke, there were 13
> > terms listed by Luke.
>
> Yes, I just checked this, and it seems to be a bug in Luke: it always
> shows 1 fewer in "Term Count" than it should. Well spotted.
>
> Cheers,
> Phil
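P.S. In case it's useful to anyone else reading this thread: here's a quick way to see exactly which terms StandardAnalyzer produces for a given path. Just a sketch (written against the Lucene 2.9-style TokenStream/attribute API, so it may need adjusting for other versions):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
    TokenStream stream = analyzer.tokenStream("path", new StringReader("/a/b/c.txt"));
    TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
    while (stream.incrementToken()) {
        // prints each token the analyzer emits; the exact set depends on
        // the analyzer version and its stop-word list
        System.out.println(termAtt.term());
    }
    stream.close();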