Also note that both Apache and Tomcat have a default setting that 
forcibly re-encodes all pages. In Tomcat it is the <DecodeInterceptor /> 
element in server.xml; in Apache it is the line "AddDefaultCharset on" 
in httpd.conf. These are applied _after_ any servlet output, so they can 
lead to strange results. Be sure to turn off both directives when you 
test different encoding problems.
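
A minimal sketch of what such a forced re-encode can do to the text, 
assuming the servlet wrote UTF-8 and a later layer treats those bytes as 
ISO-8859-1. The class and method names are made up for illustration; this 
is plain Java, not Tomcat or Apache code:

```java
import java.nio.charset.StandardCharsets;

class ReencodeDemo {
    // Hypothetical helper: encodes the text as UTF-8, then decodes the
    // same bytes as ISO-8859-1, the way a mismatched layer would.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String word = "civilingeni\u00f8r"; // o-slash written as an escape
        String broken = misdecode(word);
        // The single o-slash (U+00F8) becomes two characters: U+00C3, U+00B8.
        System.out.println(word.length() + " -> " + broken.length());
        System.out.println(broken);
    }
}
```

Plain ASCII text passes through unchanged, which is why this kind of 
problem often shows up only on the national characters.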

Last but not least is the encoding the SQL database was created in. On 
DB2 I have to use the right CREATE DATABASE options to get Norwegian 
character support (db2 CREATE DATABASE mydb USING CODESET ISO-8859-1 
TERRITORY NO COLLATE USING SYSTEM;). Without the correct encoding in 
the CREATE DATABASE command, the database behaves strangely when 
sorting and in insert/update scenarios.

To be sure everything works, make sure that all steps use the same 
encoding, just like you use the same analyzer (perhaps encoding should 
be part of an analyzer?!?):

1: create the database with ISO-8859-1 encoding (my favorite)...

        CREATE DATABASE mydb USING CODESET ISO-8859-1 TERRITORY NO COLLATE USING SYSTEM;

2: in the indexer force feed lucene with ISO-8859-1 strings:

        String value = resultset.getString("fieldname");
        document.add(Field.UnStored("fieldname", new String(value.getBytes("ISO-8859-1"))));
        ...

3: force encode all queries to lucene in the same manner
        
        String querystring = httprequest.getParameter("query");
        querystring = new String(querystring.getBytes("ISO-8859-1"));
        ...
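
The point of steps 2 and 3 can be sketched without Lucene at all. This is 
only an illustration under assumptions: the HashSet stands in for the 
index, and the class name, the ENCODING constant and the decode() helper 
are hypothetical. It also names the charset when decoding, because 
new String(bytes) without a charset argument falls back to the platform 
default, which may differ between machines:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

class SameEncodingDemo {
    // Hypothetical: the one encoding every step agrees on.
    static final Charset ENCODING = StandardCharsets.ISO_8859_1;

    // Decode raw bytes with the agreed encoding; used for index and query input.
    static String decode(byte[] raw) {
        return new String(raw, ENCODING);
    }

    public static void main(String[] args) {
        Set<String> index = new HashSet<>(); // stand-in for the Lucene index

        // "s?ndag" with the o-slash as the single ISO-8859-1 byte 0xF8
        byte[] fromDatabase = {'s', (byte) 0xF8, 'n', 'd', 'a', 'g'};
        index.add(decode(fromDatabase));

        byte[] fromRequest = {'s', (byte) 0xF8, 'n', 'd', 'a', 'g'};

        // Same encoding on both sides: the query term matches.
        System.out.println(index.contains(decode(fromRequest)));

        // Query decoded as UTF-8 instead: 0xF8 is not valid there, no match.
        String mismatched = new String(fromRequest, StandardCharsets.UTF_8);
        System.out.println(index.contains(mismatched));
    }
}
```

The first lookup finds the term and the second does not, even though both 
start from the same bytes.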


Regards, karl øie


On Sunday, Oct 13, 2002, at 14:15 Europe/Oslo, Chris Davis wrote:

> To Dominator,
>
> Were you able to solve the display problem as well?  I am having a 
> similar problem with documents that contain the “ (open double quote, 
> &#8220;).  I am not concerned with searching on the character, but when 
> I attempt to display a stored field with this character, it does not 
> display correctly.  Even stranger, the closing quote (&#8221;) does 
> display.
>
> To All,
>
> I have browsed through the majority of messages related to Unicode in 
> the archive, and my reading tells me that Lucene does not normally 
> change the data that is "stored" for a field.  Can someone give me 
> some pointers on how to troubleshoot this problem?
>
> Note:  I am indexing data that is being pulled from a SQL Server 2000 
> DB on Windows 2000.
>
> -------------------
>
>
> In an earlier message Dominator wrote:
>
>> I print out a result string it shows a very strange result, for 
>> example
>> search for: "civilingeniør" string: "civilingeniÃ¸r".. 
>> I'm sure it's an
>> unicode problem, but where can I change it??
>
>
>
> Dominator wrote:
>
> thx, with your help I could solve the problem
>
> "karl øie" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]...
> i had such problems with norwegian characters and it came down to
> making sure the querystring has the same encoding as the index has.
>
> since this is again a java.lang.String encoding question i had these
> problems with querystrings coming from java Servlets and CLI. For both
> the quickfix was to re-encode the query in UTF-8/16:
>
> String querystring = argv[0]; // or: httprequest.getParameter("query")
> querystring = new String(querystring.getBytes("UTF-8"));
> ...
>
> this fixed my norwegian/sami problems...
>
>
> Regards, karl øie
>
> On Monday, Oct 7, 2002, at 13:04 Europe/Oslo, Dominator wrote:
>
>>> I use czech language with more bizarre characters and there is no
>>> problem at all. Are you sure, that your XML contains character set
>>> information?
>>
>> yes, I tried <?xml version="1.0" encoding="ISO-8859-2"?> and <?xml
>> version="1.0" encoding="UTF-8"?> but I get the same strange 
>> characters.
>>
>> --
>> To unsubscribe, e-mail:
>> <mailto:[EMAIL PROTECTED]>
>> For additional commands, e-mail:
>> <mailto:[EMAIL PROTECTED]>


