Andrzej,

Cheers! Good to know. Thanks!

r/d

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 02, 2006 5:01 PM
To: nutch-user@lucene.apache.org
Subject: Re: hi all

Dan Morrill wrote:
> Since you are using Luke to see the index, Luke may not have the character
> support built in for non-UTF-8 character sets (meaning gork when you look at
> it). I went to the Luke site http://www.getopt.org/luke/ to see if they make
> mention of the character sets they support, but there is nothing that states
> they support any character set.
>
> When you run your search, do you see good characters, or do you see gork?
> Luke may not be able to understand the ISO character sets. (Hypothesis).
>

Hi,

(I'm the guy behind Luke)

Luke uses UTF-8, because that's what Lucene stores in the index. You may
experience problems with the default font that it uses, i.e. the font may
not support all Unicode characters. Please try changing the font (in
Settings) and see if it helps.

Another frequent source of garbled characters is reading the original
content using the wrong encoding, e.g. if you read a UTF-8 file using your
native platform encoding like Latin1 or Big5, or the other way around. Then
the already broken characters get encoded to UTF-8 when Lucene writes out
the index, and are restored from UTF-8 to their broken form when Luke reads
the index.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
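
For anyone who wants to see the second failure mode in isolation, here is a minimal Python sketch of it. It does not touch Lucene, Nutch, or Luke at all: the "index" is only simulated by a UTF-8 encode followed by a UTF-8 decode, and the function names and the sample word are made up for illustration. It assumes the source file is UTF-8 and was wrongly read as Latin1.

```python
# Sketch of the mix-up described above. Nothing here is Lucene or Luke API;
# all names are hypothetical and the index is simulated with plain
# str.encode / bytes.decode calls.

def misread_utf8_as_latin1(text: str) -> str:
    """Simulate reading a UTF-8 file with the wrong (Latin1) decoder."""
    file_bytes = text.encode("utf-8")      # what is actually on disk
    return file_bytes.decode("latin-1")    # wrong decoder: one char per byte


def index_round_trip(text: str) -> str:
    """Simulate writing text to a UTF-8 index and reading it back."""
    stored = text.encode("utf-8")          # written out as UTF-8
    return stored.decode("utf-8")          # read back as UTF-8


def repair(text: str) -> str:
    """Undo the misread, assuming it really was UTF-8 taken for Latin1."""
    return text.encode("latin-1").decode("utf-8")


original = "café"
broken = misread_utf8_as_latin1(original)  # damage happens here, before indexing
shown = index_round_trip(broken)           # the index preserves the damage as-is

print(repr(original), repr(broken), repr(shown), repr(repair(shown)))
```

The point of the sketch is that the UTF-8 round trip is lossless, so the viewer shows exactly the broken string that was fed in. The fix belongs at the point where the content is first decoded, not in the index or the viewer.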