Andrzej,

Cheers! Good to know. Thanks!
r/d

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Sunday, April 02, 2006 5:01 PM
To: nutch-user@lucene.apache.org
Subject: Re: hi all

Dan Morrill wrote:
> Since you are using Luke to see the index, luke may not have the character
> support built in for non utf-8 character sets (meaning gork when you look at
> it). I went to the luke site http://www.getopt.org/luke/ to see if they make
> mention of the character sets they support, but there is nothing that states
> they support any character set.
>
> When you run your search, do you see good characters, or do you see gork?
> Luke may not be able to understand the ISO character sets. (Hypothesis). 
>   

Hi,

(I'm the guy behind Luke)

Luke uses UTF-8, because that's what Lucene stores in the index. You may 
experience problems with the default font that it uses, which may not 
support all Unicode characters. Please try changing the font 
(in Settings) and see if it helps.

Another frequent source of garbled characters is reading the 
original content using the wrong encoding, e.g. if you read a UTF-8 file 
using your native platform encoding like Latin1 or Big5, or the other 
way around. The broken characters then get encoded to UTF-8 when 
Lucene writes out the index, and are restored from UTF-8 in their broken 
form when Luke reads the index.
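
To illustrate the effect, here is a minimal sketch in Python (not Lucene or Luke code). The helper names misread_utf8, index_round_trip and repair are made up for this example, and Latin1 is assumed as the wrong encoding:

```python
def misread_utf8(text, wrong_encoding="latin-1"):
    """Simulate reading a UTF-8 file with the wrong encoding (hypothetical helper)."""
    utf8_bytes = text.encode("utf-8")          # what is actually on disk
    return utf8_bytes.decode(wrong_encoding)   # what the reader thinks it saw


def index_round_trip(text):
    """Stand-in for writing to a UTF-8 index and reading it back unchanged."""
    stored = text.encode("utf-8")
    return stored.decode("utf-8")


def repair(broken, wrong_encoding="latin-1"):
    """Undo the misread, assuming we know which wrong encoding was used."""
    return broken.encode(wrong_encoding).decode("utf-8")


original = "café"
broken = misread_utf8(original)    # damage happens here, before indexing
shown = index_round_trip(broken)   # the index faithfully preserves the damage

print(repr(shown))                 # 'cafÃ©'
print(repair(shown) == original)   # True
```

The UTF-8 round trip itself is lossless, so the garbling has to be fixed where the content is read, not in the viewer. The repair step works here because Latin1 maps every byte to a character; with other encodings (Big5, for instance) the misread may fail or lose information.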

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

