Re: Unicode characters that are not legal XML characters
I have written a class to deal with this problem.

    public class XmlCharFilter {

        /** Returns the input with every character that XML 1.0 does not allow removed. */
        public static String doFilter(String in) {
            if (in == null || in.isEmpty()) return ""; // vacancy test.

            StringBuilder out = new StringBuilder(in.length()); // Used to hold the output.
            int current; // Used to reference the current code point.
            for (int i = 0; i < in.length(); ) {
                // Read a whole code point so that surrogate pairs stay together.
                // i stays inside the string, so no IndexOutOfBoundsException here.
                current = in.codePointAt(i);
                i += Character.charCount(current);
                if ((current == 0x9) || (current == 0xA) || (current == 0xD)
                        || ((current >= 0x20) && (current <= 0xD7FF))
                        || ((current >= 0xE000) && (current <= 0xFFFD))
                        || ((current >= 0x10000) && (current <= 0x10FFFF)))
                    out.appendCodePoint(current);
            }
            return out.toString();
        }
    }
Re: Unicode characters that are not legal XML characters
I believe you can use the following Unicode characters in XML documents: U+0009, U+000A, U+000D, [U+0020-U+D7FF], [U+E000-U+FFFD], and [U+10000-U+10FFFF].

One of your documents contains the character with code 22. Since the parser calls it a CTRL-CHAR, that is presumably decimal 22, i.e. U+0016, a control character that is not legal in XML (U+0022 would be the ordinary double quote, which is legal).
http://www.unicode.org/unicode/reports/tr20/#White

If your data is all text, you can probably safely remove the disallowed control characters.

-Bryan

On Dec 23, 2008, at 5:50 AM, rohit arora wrote:

> Hi,
>
> When I give the post command to build my index from my (database / XML) file, it gives me an error like this:
>
> com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 22))
>  at [row,col {unknown-source}]: [1676,86]
>
> I found a built-in function in Perl to convert all my character data to "UTF-8" format. I also found that there are many Unicode characters that are not legal XML characters.
>
> Can anyone help me find the list of all the legal XML characters, so that I can strip every character except those?
>
> with regards
> Rohit Arora
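To find out which characters in a document are the offending ones before stripping anything, a scan along the following lines may help. This is only a sketch: the class and method names are made up for illustration, the ranges are the XML 1.0 ones listed above, and rows and columns are counted in code points starting at 1, which may not match exactly how a given parser counts columns.

```java
import java.util.ArrayList;
import java.util.List;

final class IllegalXmlCharFinder {

    /** True if cp is one of the code points XML 1.0 allows. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns one "row,col: U+XXXX" entry per illegal code point. */
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        int row = 1;
        int col = 1;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!isXmlChar(cp)) {
                hits.add(String.format("%d,%d: U+%04X", row, col, cp));
            }
            // A line feed starts a new row; everything else advances the column.
            if (cp == '\n') {
                row++;
                col = 1;
            } else {
                col++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "<doc>\n<field>bad\u0016char</field>\n</doc>";
        for (String hit : find(sample)) {
            System.out.println(hit); // prints 2,11: U+0016
        }
    }
}
```

Running it over the text that is about to be posted shows where the control characters are, which makes it easier to decide whether dropping them is safe for your data.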
Re: Unicode characters that are not legal XML characters
Message written on 2008-12-23, at 14:46, by rohit arora:

> When I give the post command to build my index from my (database / XML) file, it gives me an error like this:
>
> com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 22))
>  at [row,col {unknown-source}]: [1676,86]
>
> I found a built-in function in Perl to convert all my character data to "UTF-8" format. I also found that there are many Unicode characters that are not legal XML characters.
>
> Can anyone help me find the list of all the legal XML characters, so that I can strip every character except those?

http://en.wikipedia.org/wiki/Unicode_control_characters

Basically, the characters that are not legal are the control characters from 0 to 31, apart from tab (9), line feed (10) and carriage return (13). The DEL character (127) is legal in XML 1.0, though discouraged.

-- 
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zg...@redefine.pl
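As a quick check of which code points in the range 0 to 31, plus DEL (127), XML 1.0 actually accepts, something like the following can be run. It is a sketch that assumes the character ranges of the XML 1.0 specification; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ControlCharCheck {

    /** True if cp is one of the code points XML 1.0 allows. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Code points among 0..31 and 127 that XML 1.0 accepts. */
    static List<Integer> legalControls() {
        List<Integer> legal = new ArrayList<>();
        for (int cp = 0; cp <= 127; cp++) {
            boolean control = cp < 32 || cp == 127;
            if (control && isXmlChar(cp)) {
                legal.add(cp);
            }
        }
        return legal;
    }

    public static void main(String[] args) {
        System.out.println(legalControls()); // prints [9, 10, 13, 127]
    }
}
```

So tab, line feed, carriage return and DEL pass, and the remaining 28 code points below 32 are the ones to strip.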
RE: Unicode characters
Thanks a lot for the time you spent understanding my problem and checking for a solution in Neko! It helps a lot.

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Friday, April 27, 2007 4:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Unicode characters
Re: Unicode characters
: -fetch a web page
: -decode entities and unicode characters (such as &#149;) using the Neko library
: -get a unicode String in Java
: -Sent it to SOLR through XML created by SAX, with the right encoding (UTF-8) specified everywhere (writer, header etc...)
: -it apparently arrives clean on the SOLR side (verified in our logs).
: -In the query output from SOLR (XML message), the character is not encoded as an entity (not &#149;) but the character itself is used (character 149=95 hexadecimal).

Just because someone uses an HTML entity to display a character in a web page doesn't mean it needs to be "escaped" in XML ... I think that in theory we could use numeric entities to escape *every* character, but that would make the XML responses a lot bigger ... so in general Solr only escapes the characters that need to be escaped to have a valid UTF-8 XML response.

You may also be having some additional problems, since 149 (hex 95) is not a printable Unicode character, it's a control character (MESSAGE WAITING) ... it sounds like you're dealing with HTML where people were using the numeric value from the "Windows-1252" charset. You may want to modify your parsing code to do some mappings for "control" characters that you know aren't meant to be control characters before you ever send them to Solr.

A quick search for "Neko windows-1252" indicates that enough people have had problems with this that it is a built-in feature...

http://people.apache.org/~andyc/neko/doc/html/settings.html

  "http://cyberneko.org/html/features/scanner/fix-mswindows-refs
  Specifies whether to fix character entity references for Microsoft Windows characters as described at http://www.cs.tut.fi/~jkorpela/www/windows-chars.html."

(I've run into this a number of times over the years when dealing with content created by Windows users, as you can see from my one and only thread on "JavaJunkies" ... http://www.javajunkies.org/index.pl?node_id=3436 )

-Hoss
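If the parser cannot do the mapping for you, the remapping described above could be sketched by hand roughly as follows. This is not Neko's implementation; the class name is made up, and it assumes the JDK in use ships a "windows-1252" charset (mainstream JDKs do, but it is not one of the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;

final class Cp1252Fixer {

    // Assumption: this JDK provides the windows-1252 charset.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Replaces each character in U+0080..U+009F with the character that the
     * same byte value stands for in windows-1252.
     */
    static String fix(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                char mapped = new String(new byte[] {(byte) c}, CP1252).charAt(0);
                // A few byte values are undefined in windows-1252 and may
                // decode to U+FFFD; in that case keep the original character.
                out.append(mapped == '\uFFFD' ? c : mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Character 149 (hex 95) becomes U+2022, the bullet.
        System.out.println(fix("a \u0095 b").equals("a \u2022 b")); // prints true
    }
}
```

Applying this to the decoded String before building the XML for Solr means the index holds the bullet, curly quotes and so on instead of control characters.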
Re: Unicode characters
On 4/27/07, HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
> -In the query output from SOLR (XML message), the character is not encoded as an entity (not &#149;) but the character itself is used (character 149=95 hexadecimal).

That's fine, as they are equivalent representations, and that character is directly representable in UTF-8 (which Solr uses for its output).

Is this causing a problem for you somehow?

-Yonik
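The equivalence can be checked with the XML parser that ships with the JDK. The sketch below parses the same element once with the numeric reference and once with the literal character, then prints the UTF-8 bytes of the result; the class and method names are only illustrative.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class EntityEquivalence {

    /** Parses a small XML document and returns the text of its root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    /** UTF-8 bytes of s as space-separated hex. */
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String viaReference = rootText("<a>&#149;</a>");
        String viaLiteral = rootText("<a>\u0095</a>");
        System.out.println(viaReference.equals(viaLiteral)); // prints true
        System.out.println(utf8Hex(viaLiteral));             // prints C2 95
    }
}
```

Both documents give the application the same one-character string, so a client reading Solr's response sees no difference between the two forms.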