John Cherouvim wrote:

I'm having some problems indexing my UTF-8 HTML pages. I am running Lucene on Linux and I cannot understand why the generated index depends on the locale of my operating system. If I do set | grep LANG I get LANG=el_GR, which is Greek. If I set this to en_US, the generated index is different. Why is this the case? My HTML files are all UTF-8.

What version of Linux are you using?

On Fedora Core 4 (and probably other Fedoras and RHEL), LANG=el_GR sets the character set to ISO 8859-7, e.g. (on my various machines):

   $ LANG=el_GR date | iconv -f iso88597
   Πεμ Σεπ 29 11:59:19 BST 2005
   $ LANG=el_GR.utf8 date
   Πεμ Σεπ 29 12:01:40 BST 2005

(Everything in FC4 is UTF-8 so it displays right, and it seems that the Greek for "Sep" is "Σεπ" -- no surprises there I guess.)

In your case, setting LANG=el_GR.utf8 and replacing "date" with whatever command you use to generate the indexes should do the right thing.
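If the locale dependence comes from the indexing code decoding files with the JVM's default charset (an assumption on my part; I have not seen the indexer), another option is to name the charset explicitly when decoding. The following is only a minimal sketch of that idea: the class and method names are made up for illustration, it involves no Lucene calls, and it assumes the JVM ships the ISO-8859-7 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
public class ExplicitCharsetDemo {

    // Decode bytes with the charset the caller names, never the platform
    // default, so the result does not change with LANG.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // The Greek abbreviation from the date output above, written with
        // Unicode escapes so the encoding of this source file is irrelevant.
        String text = "\u03A3\u03B5\u03C0";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Same bytes, two decoders: only the UTF-8 one recovers the text.
        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, Charset.forName("ISO-8859-7"));

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("decoded as UTF-8 matches:      " + right.equals(text));
        System.out.println("decoded as ISO-8859-7 matches: " + wrong.equals(text));
    }
}
```

When reading from a stream, new InputStreamReader(in, StandardCharsets.UTF_8) does the same job; a FileReader constructed without a charset falls back to the default, which on JVMs of this era follows the locale.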

jch
