Bok Tomi,

What do you mean by "terms are misrepresented"?  What should they be, and what 
are you seeing?
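
One guess, since I haven't seen your indexing code: terms often look mangled
when the bytes of a file are decoded with a different charset than the one
they were written in. The sketch below uses plain Java only, no Lucene calls.
The class name CharsetDemo and the choice of UTF-8 vs. ISO-8859-1 are my
assumptions for illustration, not something taken from your setup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Decode raw bytes using an explicitly chosen charset.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "test" followed by c-with-caron (U+010D), written as an escape so
        // the encoding of this source file does not matter.
        String original = "test\u010d";

        // Assumption: the file on disk was saved as UTF-8.
        byte[] onDisk = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(onDisk, StandardCharsets.UTF_8);
        String wrong = decode(onDisk, StandardCharsets.ISO_8859_1);

        // The correctly decoded string matches the original; the other one
        // turns the single accented letter into two unrelated characters.
        System.out.println("UTF-8 matches original:      " + right.equals(original));
        System.out.println("ISO-8859-1 matches original: " + wrong.equals(original));
        System.out.println("lengths: " + right.length() + " vs " + wrong.length());
    }
}
```

If something like this is going on, passing the charset explicitly when you
open the files (for example to InputStreamReader) rather than relying on the
platform default would be the first thing I'd try.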

> What I'm not clear on is how can I see the problematic *terms* in the list of 
> terms, but not the documents they're stored in?

Are you saying that the content got indexed, but the file names did not?

Out of curiosity (note my last name), which analyzer/tokenizer are you using?
Is there an equivalent of the Porter stemmer for Croatian?  I could use
that. :)

Otis

----- Original Message ----
From: Tomi NA <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, July 13, 2006 8:19:31 AM
Subject: accented characters, wildcards and other problems

I've done a bit of testing with accented characters (Croatian, to be
specific) and can't really explain what I see when I explore the index
with luke.
I've used accented characters in directory names, file names and file contents.
Now, in the list of terms (in "Top ranking terms", "Overview" tab) I
see that 2 out of 5 terms are misrepresented, but are indexed,
nonetheless.
The files whose names contain the problematic characters also contain
those characters in their contents, i.e. if the file name is "file[x].txt",
the file contents are "test[x]", where [x] represents the accented
character. What I'm not clear on is how can I see the problematic
*terms* in the list of terms, but not the documents they're stored in?

That's one issue. The other is somewhat simpler, I expect.
A search for "test*" returns no results. According to the FAQ, it
should, so what am I missing?

t.n.a.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




