arnaudbuffet wrote:

For text files, the data could be in different languages and therefore in
different encodings. If the data are in Turkish, for example, the special
characters and accents are not recognized in my Lucene index. Is there a
way to resolve this problem? How do I work with the encoding?
I've been looking at a similar problem recently. There's org.apache.lucene.analysis.ISOLatin1AccentFilter on the svn trunk, which may be quite close to what you want.

I also have a Perl script here that I used to generate a downgrading table for a C program. I can let you have the Perl script as is, but if there's enough interest(*) I'll use it to generate, say, a CompoundAsciiFilter, since it converts compound characters like á, æ and ﬃ (the ffi-ligature, in case it doesn't display) to a, ae and ffi. The table is actually built from http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt, so it winds up having nearly 1200 entries. An earlier version converted all compound characters to their constituent parts, but this version only converts characters that are made up entirely of ASCII characters and modifiers.
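For reference, here's a minimal sketch of how that filter slots into an analyzer chain, assuming the Lucene 2.x Analyzer API (the filter is only on trunk at the moment, so the exact package or constructor may differ):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.ISOLatin1AccentFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class AccentFoldingAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(reader);
            stream = new LowerCaseFilter(stream);
            // Fold ISO Latin-1 accented characters down to an ASCII
            // base form, e.g. á -> a, é -> e, æ -> ae.
            stream = new ISOLatin1AccentFilter(stream);
            return stream;
        }
    }

One thing to keep in mind: the same folding analyzer has to be used at both index time and query time, otherwise accented query terms won't match the folded terms in the index.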
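And for anyone who doesn't want to wait for a generated filter, roughly the same UnicodeData.txt decomposition can be had from Java 6's java.text.Normalizer. This is only a sketch of the idea, not my Perl script, and CompoundAsciiFolder is just an illustrative name:

    import java.text.Normalizer;

    public class CompoundAsciiFolder {
        // Decompose compound characters with NFKD (the same decomposition
        // data UnicodeData.txt carries), then keep only the ASCII pieces;
        // combining accents are dropped. So á -> a and ﬃ -> ffi.
        public static String fold(String input) {
            String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
            StringBuilder out = new StringBuilder(decomposed.length());
            for (int i = 0; i < decomposed.length(); i++) {
                char c = decomposed.charAt(i);
                if (c < 128) {
                    out.append(c); // plain ASCII: keep
                }
                // Non-ASCII leftovers are dropped. Note that æ has no
                // decomposition in UnicodeData.txt, so it disappears here;
                // æ -> ae needs an explicit table entry, which is what
                // ISOLatin1AccentFilter provides.
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(fold("ﬃ")); // prints "ffi"
            System.out.println(fold("á")); // prints "a"
        }
    }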

jch

(*) Any interest, actually. Might be enough for me to be interested.
