Re: Unicode characters that are not legal XML characters;

2008-12-23 Thread lucas song
I have written a class to deal with this problem.
public class XmlCharFilter {
    /** Returns the input with every character that is not legal in XML 1.0 removed. */
    public static String doFilter(String in) {
        if (in == null || in.isEmpty())
            return ""; // Nothing to filter.
        StringBuilder out = new StringBuilder(in.length()); // Holds the output.
        for (int i = 0; i < in.length(); ) {
            // Read a whole code point so that surrogate pairs are handled.
            int current = in.codePointAt(i);
            i += Character.charCount(current);
            // Keep only code points matched by the XML 1.0 Char production.
            if ((current == 0x9) || (current == 0xA) || (current == 0xD)
                    || ((current >= 0x20) && (current <= 0xD7FF))
                    || ((current >= 0xE000) && (current <= 0xFFFD))
                    || ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.appendCodePoint(current);
        }
        return out.toString();
    }
}



2008/12/23 Jarek Zgoda 

> Message written on 2008-12-23 at 14:46 by rohit arora:
>
>> When I run the post command to build my index from my (database / XML)
>> file, it gives me an error like this:
>>
>> com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character
>> ((CTRL-CHAR, code 22))
>>  at [row,col {unknown-source}]: [1676,86]
>>
>> I found a built-in function in Perl to convert all my character data to
>> UTF-8, but I find that there are many Unicode characters that are not
>> legal XML characters.
>>
>> Can anyone help me find the list of all the legal XML characters, so
>> that I can strip every character except those?
>>
>
>
> http://en.wikipedia.org/wiki/Unicode_control_characters
>
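> Basically, anything from 0 to 31 except tab (9), line feed (10), and
> carriage return (13). The DEL character (127) is technically allowed in
> XML 1.0 but discouraged, so it is best stripped as well.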
>
> --
> We read Knuth so you don't have to. - Tim Peters
>
> Jarek Zgoda, R&D, Redefine
> jarek.zg...@redefine.pl
>
>


Re: Unicode characters that are not legal XML characters

2008-12-23 Thread Bryan Talbot
I believe you can use the following Unicode characters in XML  
documents: U+0009, U+000A, U+000D, [U+0020-U+D7FF], [U+E000-U+FFFD],  
and [U+10000-U+10FFFF]


One of your documents contains the character with code 22 (U+0016,  
assuming the parser reports the code in decimal), which is a control  
character that is not allowed in XML.


http://www.unicode.org/unicode/reports/tr20/#White

If your data is all text, you can probably safely remove the  
disallowed control characters.
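
For what it's worth, here is a minimal sketch of stripping everything outside the ranges listed above with a single regular expression. The class and method names (XmlRegexStrip, strip) are made up for illustration, and the \x{...} syntax assumes Java 7 or later; treat it as a starting point rather than a tested filter for your data.

```java
import java.util.regex.Pattern;

public class XmlRegexStrip {
    // Negated character class: matches any code point that is NOT in the
    // XML 1.0 Char production (tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD,
    // U+10000-U+10FFFF).
    private static final Pattern ILLEGAL = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Hypothetical helper: removes every code point the pattern matches. */
    public static String strip(String text) {
        return ILLEGAL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Character 22 is the one reported in the exception above.
        String dirty = "legal" + (char) 22 + " text";
        System.out.println(strip(dirty)); // prints "legal text"
    }
}
```

Removing the characters silently is a judgment call; if they carry meaning in your data, replacing them with a space or logging them may be the better choice.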



-Bryan




On Dec 23, 2008, at 5:50 AM, rohit arora wrote:




Hi,

When I run the post command to build my index from my (database / XML)
file, it gives me an error like this:

com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character
((CTRL-CHAR, code 22))
 at [row,col {unknown-source}]: [1676,86]

I found a built-in function in Perl to convert all my character data to
UTF-8, but I find that there are many Unicode characters that are not
legal XML characters.

Can anyone help me find the list of all the legal XML characters, so
that I can strip every character except those?


with regards
 Rohit Arora







Re: Unicode characters that are not legal XML characters;

2008-12-23 Thread Jarek Zgoda

Message written on 2008-12-23 at 14:46 by rohit arora:

When I run the post command to build my index from my (database / XML)
file, it gives me an error like this:

com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character
((CTRL-CHAR, code 22))
 at [row,col {unknown-source}]: [1676,86]

I found a built-in function in Perl to convert all my character data to
UTF-8, but I find that there are many Unicode characters that are not
legal XML characters.

Can anyone help me find the list of all the legal XML characters, so
that I can strip every character except those?



http://en.wikipedia.org/wiki/Unicode_control_characters

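Basically, anything from 0 to 31 except tab (9), line feed (10), and
carriage return (13). The DEL character (127) is technically allowed in
XML 1.0 but discouraged, so it is best stripped as well.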

--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
jarek.zg...@redefine.pl



RE: Unicode characters

2007-05-01 Thread HUYLEBROECK Jeremy RD-ILAB-SSF

Thanks a lot for the time you spent understanding my problem and
checking for a solution in Neko!
It helps a lot.


-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Friday, April 27, 2007 4:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Unicode characters 


: -fetch a web page
: -decode entities and unicode characters (such as &#149;) using Neko
: library
: -get a unicode String in Java
: -Sent it to SOLR through XML created by SAX, with the right encoding
: (UTF-8) specified everywhere( writer, header etc...)
: -it apparently arrives clean on the SOLR side (verified in our logs).
: -In the query output from SOLR (XML message), the character is not
: encoded as an entity (not &#149;) but the character itself is used
: (character 149=95 hexadecimal).

Just because someone uses an html entity to display a character in a web
page doesn't mean it needs to be "escaped" in XML ... i think that in
theory we could use numeric entities to escape *every* character but
that would make the XML responses a lot bigger ... so in general Solr
only escapes the characters that need to be escaped to have a valid
UTF-8 XML response.

You may also be having some additional problems, since 149 (hex 95) is
not a printable Unicode character; it's a control character
(MESSAGE_WAITING) ... it sounds like you're dealing with HTML where
people were using the numeric value from the "Windows-1252" charset.

You may want to modify your parsing code to do some mappings between
"control" characters that you know aren't meant to be control characters
before you ever send them to Solr.  A quick search for "Neko
windows-1252" indicates that enough people have had problems with this
that it is a built-in feature...
http://people.apache.org/~andyc/neko/doc/html/settings.html
"http://cyberneko.org/html/features/scanner/fix-mswindows-refs
 Specifies whether to fix character entity references for Microsoft
 Windows characters as described at
 http://www.cs.tut.fi/~jkorpela/www/windows-chars.html."

(I've run into this a number of times over the years when dealing with
content created by windows users, as you can see from my one and only
thread on "JavaJunkies" ...
  http://www.javajunkies.org/index.pl?node_id=3436
)


-Hoss



Re: Unicode characters

2007-04-27 Thread Chris Hostetter

: -fetch a web page
: -decode entities and unicode characters (such as &#149;) using Neko
: library
: -get a unicode String in Java
: -Sent it to SOLR through XML created by SAX, with the right encoding
: (UTF-8) specified everywhere( writer, header etc...)
: -it apparently arrives clean on the SOLR side (verified in our logs).
: -In the query output from SOLR (XML message), the character is not
: encoded as an entity (not &#149;) but the character itself is used
: (character 149=95 hexadecimal).

Just because someone uses an html entity to display a character in a web
page doesn't mean it needs to be "escaped" in XML ... i think that in
theory we could use numeric entities to escape *every* character but that
would make the XML responses a lot bigger ... so in general Solr only
escapes the characters that need to be escaped to have a valid UTF-8 XML
response.

You may also be having some additional problems, since 149 (hex 95) is not
a printable Unicode character; it's a control character (MESSAGE_WAITING)
... it sounds like you're dealing with HTML where people were using the
numeric value from the "Windows-1252" charset.

You may want to modify your parsing code to do some mappings between
"control" characters that you know aren't meant to be control characters
before you ever send them to Solr.  A quick search for "Neko
windows-1252" indicates that enough people have had problems with this
that it is a built-in feature...
http://people.apache.org/~andyc/neko/doc/html/settings.html
"http://cyberneko.org/html/features/scanner/fix-mswindows-refs
 Specifies whether to fix character entity references for Microsoft
 Windows characters as described at
 http://www.cs.tut.fi/~jkorpela/www/windows-chars.html."
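
If you would rather do the mapping yourself than rely on the Neko feature, something along these lines might work. It is only a sketch: the class and method names (Cp1252Fixer, fix) are invented for illustration, and it assumes every character in the U+0080-U+009F range was really meant as a Windows-1252 byte, which may not hold for all of your content.

```java
import java.nio.charset.Charset;

public class Cp1252Fixer {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Hypothetical helper: reinterprets C1 control characters
     * (U+0080-U+009F) as the Windows-1252 characters they were
     * probably meant to be, and leaves everything else untouched.
     */
    public static String fix(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                // Decode the single byte as Windows-1252. Bytes that the
                // charset leaves undefined (0x81, for example) may come
                // back as the replacement character U+FFFD.
                out.append(new String(new byte[] { (byte) c }, CP1252));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Character 149 (hex 95) is the bullet in Windows-1252.
        String fixed = fix("a\u0095b");
        System.out.println(fixed.equals("a\u2022b")); // prints "true"
    }
}
```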

(I've run into this a number of times over the years when dealing with
content created by windows users, as you can see from my one and only
thread on "JavaJunkies" ...
  http://www.javajunkies.org/index.pl?node_id=3436
)


-Hoss



Re: Unicode characters

2007-04-27 Thread Yonik Seeley

On 4/27/07, HUYLEBROECK Jeremy RD-ILAB-SSF

-In the query output from SOLR (XML message), the character is not
encoded as an entity (not &#149;) but the character itself is used
(character 149=95 hexadecimal).


That's fine, as they are equivalent representations, and that
character is directly representable in UTF-8 (which Solr uses for its
output).
Is this causing a problem for you somehow?
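
To illustrate the equivalence, here is a small sketch using the XML parser that ships with the JDK (not Solr's own code). The class and method names (CharRefDemo, textOf) and the <str> element are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class CharRefDemo {
    /** Hypothetical helper: parses a tiny XML document and returns the root element's text. */
    public static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // Same character, once as a numeric reference and once as itself.
        String viaReference = textOf("<str>&#149;</str>");
        String viaLiteral = textOf("<str>\u0095</str>");
        // A conforming parser hands the application the same string both ways.
        System.out.println(viaReference.equals(viaLiteral));
    }
}
```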

-Yonik