arnaudbuffet wrote:

For text files, the data could be in different languages and therefore in
different encodings. If the data are in Turkish, for example, the special
characters and accents are not recognized in my Lucene index. Is there a
way to resolve this problem? How do I work with the encoding?
I've been looking at a similar problem recently. There's org.apache.lucene.analysis.ISOLatin1AccentFilter on the svn trunk, which may be quite close to what you want.

I also have a Perl script here that I used to generate a downgrading table for a C program. I can let you have the Perl script as is, but if there's enough interest(*) I'll use it to generate, say, a CompoundAsciiFilter, since it converts compound characters like á, æ and ﬃ (the ffi-ligature, in case it doesn't display) to a, ae and ffi. The table is actually built from http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt, so it winds up having nearly 1200 entries. An earlier version converted all compound characters to their constituent parts, but this version only converts characters that are made up entirely of ASCII characters and modifiers.
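For reference, here's a minimal sketch of how that filter slots into an analyzer chain, assuming the Lucene 2.x Analyzer API (the filter is only on trunk at the moment, so the exact package or constructor may differ):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.ISOLatin1AccentFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class AccentFoldingAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(reader);
            stream = new LowerCaseFilter(stream);
            // Fold ISO Latin-1 accented characters down to an ASCII
            // base form, e.g. á -> a, é -> e, æ -> ae.
            stream = new ISOLatin1AccentFilter(stream);
            return stream;
        }
    }

One thing to keep in mind: the same folding analyzer has to be used at both index time and query time, otherwise accented query terms won't match the folded terms in the index.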
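And for anyone who doesn't want to wait for a generated filter, roughly the same UnicodeData.txt decomposition can be had from Java 6's java.text.Normalizer. This is only a sketch of the idea, not my Perl script, and CompoundAsciiFolder is just an illustrative name:

    import java.text.Normalizer;

    public class CompoundAsciiFolder {
        // Decompose compound characters with NFKD (the same decomposition
        // data UnicodeData.txt carries), then keep only the ASCII pieces;
        // combining accents are dropped. So á -> a and ﬃ -> ffi.
        public static String fold(String input) {
            String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
            StringBuilder out = new StringBuilder(decomposed.length());
            for (int i = 0; i < decomposed.length(); i++) {
                char c = decomposed.charAt(i);
                if (c < 128) {
                    out.append(c); // plain ASCII: keep
                }
                // Non-ASCII leftovers are dropped. Note that æ has no
                // decomposition in UnicodeData.txt, so it disappears here;
                // æ -> ae needs an explicit table entry, which is what
                // ISOLatin1AccentFilter provides.
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(fold("ﬃ")); // prints "ffi"
            System.out.println(fold("á")); // prints "a"
        }
    }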

jch

(*) Any interest, actually. Might be enough for me to be interested.
