RE: Solr exception when parsing XML

2013-01-16 Thread Zhang, Lisheng
slightly to call URLDecoder on text. Thanks and best regards, Lisheng -----Original Message----- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, January 16, 2013 2:41 PM To: solr-user@lucene.apache.org Subject: RE: Solr exception when parsing XML In Apache Nutch we strip

RE: Solr exception when parsing XML

2013-01-16 Thread Markus Jelsma
solr-user@lucene.apache.org > Subject: RE: Solr exception when parsing XML > > Hi Alex, > > Thanks very much for the help! I switched to (I am using PHP on the client side) > > createTextNode(urlencode($value)) > > so the CTRL character problem is avoided, but I noticed that some
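
The workaround described here (URL-encode on the client, URL-decode before indexing) can be sketched as a round trip. This is only an illustration in Python, not the poster's PHP or Solr-side code: `quote_plus` and `unquote_plus` stand in for PHP's `urlencode` and Java's `URLDecoder`, and the sample value is made up.

```python
from urllib.parse import quote_plus, unquote_plus

# Hypothetical field value containing the control character from the
# original report (code 31) plus a space.
value = "a\x1fb c"

# Client side: percent-encode so that no raw control character ends up
# in the XML text node.
encoded = quote_plus(value)

# Server side: decode before indexing to recover the original text.
decoded = unquote_plus(encoded)

print(repr(encoded))
print(decoded == value)
```

Note that the decoded text contains the control character again, so this only moves the character past the XML parser; it does not remove it from the indexed value.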

RE: Solr exception when parsing XML

2013-01-16 Thread Zhang, Lisheng
@lucene.apache.org Subject: Re: Solr exception when parsing XML Interesting point. Looks like CDATA is more limiting than I thought: http://en.wikipedia.org/wiki/CDATA#Issues_with_encoding . Basically, the recommendation is to avoid CDATA and automatically encode characters such as yours, as well as less

Re: Solr exception when parsing XML

2013-01-16 Thread Alexandre Rafalovitch
Looking at this a second time, maybe we have an X/Y problem (sp?). Why was that symbol in there in the first place? Was it a field separator instead of using multiple fields? Was it a character in an encoding other than UTF-8? My guess is that the character will not make sense to Solr during either

Re: Solr exception when parsing XML

2013-01-16 Thread Yonik Seeley
On Tue, Jan 15, 2013 at 3:55 PM, Alexandre Rafalovitch wrote: > Basically, the > recommendation is to avoid CDATA and automatically encode characters such > as yours, as well as less/more and ampersand. Unfortunately that doesn't even work. Just as a raw control character like a 0 byte is invali
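
The reply above is cut off, but it appears to be making the point that XML 1.0 rejects most control characters whether they appear raw or as numeric character references. A minimal sketch under that assumption, using Python's standard-library parser rather than the parser inside Solr; the `is_well_formed` helper and the `<field>` documents are hypothetical:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Raw control character (code 31, as in the original report).
raw = "<field>a\x1fb</field>"

# The same character written as a numeric character reference.
ref = "<field>a&#31;b</field>"

# Tab is one of the few permitted control characters, and an escaped
# ampersand is ordinary content.
ok = "<field>a&#9;b &amp; c</field>"

for doc in (raw, ref, ok):
    print(is_well_formed(doc))
```

Escaping helps with less-than, greater-than and ampersand, but it does not make a forbidden control character legal; such characters have to be removed or replaced before the XML is built.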

Re: Solr exception when parsing XML

2013-01-16 Thread Andre Bois-Crettez
Forgot the link: http://en.wikipedia.org/wiki/Valid_characters_in_XML André On 01/16/2013 02:24 PM, Andre Bois-Crettez wrote: It is worth noting that some characters are completely forbidden in XML, such as "chr(0)". When dealing with external text input, some cleanup might be necessary to avoid br

Re: Solr exception when parsing XML

2013-01-16 Thread Andre Bois-Crettez
It is worth noting that some characters are completely forbidden in XML, such as "chr(0)". When dealing with external text input, some cleanup might be necessary to avoid breaking indexing. For example you could replace each forbidden XML character with " ". André On 01/15/2013 09:55 PM, Alexandre
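
The cleanup suggested above (replace each forbidden XML character with a space) could look roughly like the following. It is a sketch in Python, not code from the thread: the function name `sanitize_for_xml` is made up, and the character ranges are those of the XML 1.0 `Char` production.

```python
import re

# Characters allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# Anything outside these ranges is matched by the pattern below.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def sanitize_for_xml(text, replacement=" "):
    """Replace every character that XML 1.0 forbids with `replacement`."""
    return _INVALID_XML_CHARS.sub(replacement, text)

# Code 31 is the character from the original error report; chr(0) is the
# example given in the message above.
print(repr(sanitize_for_xml("a\x1fb")))
print(repr(sanitize_for_xml("a\x00b\tc")))
```

The function would be applied to each field value before it is placed in a text node; tab, line feed and carriage return are left untouched because XML 1.0 allows them.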

Re: Solr exception when parsing XML

2013-01-15 Thread Alexandre Rafalovitch
Interesting point. Looks like CDATA is more limiting than I thought: http://en.wikipedia.org/wiki/CDATA#Issues_with_encoding . Basically, the recommendation is to avoid CDATA and automatically encode characters such as yours, as well as less/more and ampersand. Regards, Alex.

Solr exception when parsing XML

2013-01-15 Thread Zhang, Lisheng
Hi, I got a SolrException when submitting XML for indexing (using Solr 3.6.1): Jan 15, 2013 10:22:42 AM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code 31)) at [row,col {unknown-source}]: [2,1169] at org.apac