John Cherouvim wrote:
I'm having some problems indexing my UTF-8 HTML pages. I am running
Lucene on Linux and I cannot understand why the generated index
depends on the locale of my operating system.
If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set
this to en_US the index generated will be different. Why is this the
case? My HTMLs are all UTF-8.
What version of Linux are you using?
On Fedora Core 4 (and probably other Fedoras and RHEL) LANG=el_GR sets
the character set to ISO 8859-7, e.g. (on my various machines):
$ LANG=el_GR date | iconv -f iso88597
Πεμ Σεπ 29 11:59:19 BST 2005
$ LANG=el_GR.utf8 date
Πεμ Σεπ 29 12:01:40 BST 2005
(Everything in FC4 is UTF-8, so it displays correctly, and it seems that the
Greek for "Sep" is "Σεπ" -- no surprises there, I guess.)
In your case, replacing "date" in the second example (the one with
LANG=el_GR.utf8) with whatever command you use to generate the indexes
should make it read your pages as UTF-8.
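
To see why the charset matters, here is a rough sketch that decodes the same UTF-8 bytes twice with iconv, once as UTF-8 and once as ISO 8859-7. The file name sample.txt and the variable names are made up for the illustration, and I am assuming that your indexer decodes its input using the locale's character set, which I have not checked.

```shell
#!/bin/sh
# UTF-8 encoding of the Greek text "Πεμ", written as octal byte escapes
# so the script itself does not depend on the terminal's locale.
# (sample.txt is a hypothetical scratch file.)
printf '\316\240\316\265\316\274' > sample.txt

# Decoding with the correct charset round-trips the text unchanged.
as_utf8=$(iconv -f UTF-8 -t UTF-8 sample.txt)

# Decoding the same six bytes as ISO 8859-7 (the charset LANG=el_GR
# selects on my machines) treats every byte as a separate character.
as_greek=$(iconv -f ISO-8859-7 -t UTF-8 sample.txt)

printf 'read as UTF-8:      %s\n' "$as_utf8"
printf 'read as ISO 8859-7: %s\n' "$as_greek"

rm -f sample.txt
```

The two lines printed differ, which would be enough to make an index built under LANG=el_GR differ from one built under a UTF-8 locale.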
jch