Thanks a lot for the time you spent understanding my problem and checking for a solution in Neko! It helps a lot.

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Friday, April 27, 2007 4:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Unicode characters

: -fetch a web page
: -decode entities and unicode characters (such as &#149;) using Neko
:  library
: -get a unicode String in Java
: -Sent it to SOLR through XML created by SAX, with the right encoding
:  (UTF-8) specified everywhere (writer, header etc...)
: -it apparently arrives clean on the SOLR side (verified in our logs).
: -In the query output from SOLR (XML message), the character is not
:  encoded as an entity (not &#149;) but the character itself is used
:  (character 149=95 hexadecimal).

Just because someone uses an html entity to display a character in a web
page doesn't mean it needs to be "escaped" in XML ... i think that in
theory we could use numeric entities to escape *every* character but that
would make the XML responses a lot bigger ... so in general Solr only
escapes the characters that need to be escaped to have a valid UTF-8 XML
response.

You may also be having some additional problems since 149 (hex 95) is not
a printable Unicode character, it's a C1 control character (U+0095,
MESSAGE WAITING) ... it sounds like you're dealing with HTML where people
were using the numeric value from the "Windows-1252" charset, where byte
0x95 is a bullet. you may want to modify your parsing code to do some
mappings between "control" characters that you know aren't meant to be
control characters before you ever send them to solr.

a quick search for "Neko windows-1252" indicates that enough people have
had problems with this that it is a built in feature...

http://people.apache.org/~andyc/neko/doc/html/settings.html

  "http://cyberneko.org/html/features/scanner/fix-mswindows-refs
   Specifies whether to fix character entity references for Microsoft
   Windows characters as described at
   http://www.cs.tut.fi/~jkorpela/www/windows-chars.html."
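
For anyone who wants to do the mapping by hand rather than rely on the Neko feature, here is a minimal sketch in plain Java. It assumes any character in the range U+0080 to U+009F really came from a Windows-1252 numeric reference and re-decodes it as the matching Windows-1252 byte. The class and method names (Windows1252Fixer, fix) are made up for illustration and are not part of Neko or Solr.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Neko or Solr: remaps C1 control
// characters (U+0080..U+009F) to the characters that the same byte
// values represent in Windows-1252.
final class Windows1252Fixer {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    static String fix(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                // Reinterpret the code point as a single Windows-1252 byte.
                String decoded = new String(new byte[] {(byte) c}, CP1252);
                char mapped = decoded.charAt(0);
                // A few bytes (e.g. 0x81) are undefined in Windows-1252 and
                // may decode to U+FFFD; keep the original character then.
                out.append(mapped == '\uFFFD' ? c : mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = fix("\u0095 item");
        // U+0095 becomes the bullet, U+2022
        System.out.println(Integer.toHexString(fixed.charAt(0))); // prints "2022"
    }
}
```

You would call something like this on the text after entity decoding and before building the XML for Solr. If the assumption is wrong for your data (i.e. the documents really do contain C1 control characters on purpose), skip it.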

(I've run into this a number of times over the years when dealing with
content created by windows users, as you can see from my one and only
thread on "JavaJunkies" ...
http://www.javajunkies.org/index.pl?node_id=3436 )

-Hoss