Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Dawid Weiss wrote:
> ...
>> So, will I amend the patch in NUTCH-110 so it uses XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?
>
> Copy the method's contents. It doesn't really make sense to copy the entire class just for this method. Good luck.

Thanks Dawid. I've just uploaded a new patch that puts toValidXmlText into StringUtil, adds a few basic unit tests for the just-added method, and has OpenSearchServlet call StringUtil#toValidXmlText on all text added to DOM nodes.

St.Ack
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Chris Mattmann wrote:
> Hi, I'm not an XML expert by any means, but wouldn't it be simpler to just wrap any text where illegal chars are possible with a <![CDATA[ ]]> tag? That way, the offending characters won't be dropped and the process won't be lossy, no? If the CDATA method won't work, and there's no other way to solve the problem without losing text, then your patch has my +1.

We should not drop the offending characters, but escape them. Either the Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler). So, this is -1 for the patch.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Dawid Weiss wrote:
>> We should not drop the offending characters, but escape them. Either the Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).
>
> This isn't entirely true, Andrzej -- escaping a character and putting it in a CDATA section are just different ways of expressing the same character code in an XML structure. It is the same, and ILLEGAL, character code in terms of the XML spec (there is a fragment specifying legal character ranges there), so a conforming XML parser should throw an exception if it encounters anything outside of the legal range. The only way of transferring full binary data is to encode it as legal Unicode characters (using uuencode or such). I agree with the person who submitted this patch that it is a potential issue and should be addressed somehow.

Right, I didn't think about this... somehow I thought this was all about special characters like ' . Then we should take the best of both worlds - escape valid characters, and replace invalid ones with '?' or space, or nothing. I know a place where we could find some inspiration (Carrot2 XMLSerializerHelper.java ... ;-) )

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
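Dawid's point can be checked directly. The following is a minimal sketch, not taken from the thread or from either code base: the class name IllegalCharDemo and the helper parses are made up for illustration, and it assumes the JDK's default DOM parser. It feeds the parser a form feed (decimal 12) written three ways: raw, as a character reference, and inside a CDATA section.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.helpers.DefaultHandler;

public class IllegalCharDemo {
    /** Hypothetical helper: true if the JDK's default DOM parser accepts the document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // Covers SAXException (not well-formed), IOException and configuration errors.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<r>ok</r>"));              // legal text
        System.out.println(parses("<r>\f</r>"));              // raw form feed
        System.out.println(parses("<r>&#12;</r>"));           // character reference to form feed
        System.out.println(parses("<r><![CDATA[\f]]></r>"));  // form feed inside CDATA
    }
}
```

With a conforming parser only the first document should be accepted, which is why neither escaping nor CDATA helps for characters outside the legal range.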
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
> Right, I didn't think about this... somehow I thought this was all about special characters like ' .

Oh, believe me: this knowledge came from sour experience, not from book wisdom... I know for sure some XML parsers complain about invalid characters, while others don't.

> Then we should take the best of both worlds - escape valid characters, and replace invalid ones with '?' or space, or nothing. I know a place where we could find some inspiration (Carrot2 XMLSerializerHelper.java ... ;-) )

Feel free to take anything you need; I don't claim it's the best way to implement it, but it is certainly better than passing through incorrect character codes. Alternatively you could correct everything that is indexed so that it does not contain invalid characters (via a token filter?).

Dawid
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Andrzej Bialecki wrote:
> Then we should take the best of both worlds - escape valid characters, and replace invalid ones with '?' or space, or nothing. I know a place where we could find some inspiration (Carrot2 XMLSerializerHelper.java ... ;-) )

Thanks for the pointer. See starting at line 92, XMLSerializerHelper#toValidXmlText: http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html

The differences between this method and the patch supplied in NUTCH-110 are:

1. XMLSerializerHelper#toValidXmlText throws an exception when it encounters an invalid character whereas NUTCH-110 just drops it.
2. XMLSerializerHelper#toValidXmlText escapes all characters including the 5 xml 'special characters' whereas the NUTCH-110 patch only looks for the characters outside of the allowed XML character range.
3. NUTCH-110 first scans to see if the text has 'bad xml' before it goes about creating a new 'safe' string instance.

I think throwing an exception is inappropriate at search-results-drawing time. Dropping the character or replacing it with '?' or some such seems a better way to go. Should I change the NUTCH-110 patch to do entity escaping too, as XMLSerializerHelper#toValidXmlText does, because we can't depend on the underlying jdk parser instance doing the right thing?

Yours,
St.Ack
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
> The differences between this method and the patch supplied in NUTCH-110 are:

Take a closer look at the source code --

> 1. XMLSerializerHelper#toValidXmlText throws an exception when an invalid character whereas NUTCH-110 just drops it.

Not really, it is governed by a boolean flag. If the flag is set to true, it'll throw an exception, otherwise it silently ignores bad characters.

> 2. XMLSerializerHelper#toValidXmlText escapes all characters including the 5 xml 'special characters' whereas the NUTCH-110 patch only looks for the characters outside of the allowed XML character range.

If you intend to put this string in any XML text block (such as the content of an attribute, or between tags and not enclosed in a CDATA block), you'll have to deal with special characters such as &lt; and &gt;. If you don't, your XML will simply be incorrect.

> 3. NUTCH-110 first scans to see if text has 'bad xml' before it goes about creating new 'safe' string instance.

So does this routine, actually. The string buffer is only created if there are changes to be made to the string.

> XMLSerializerHelper#toValidXmlText does because we can't depend on the underlying jdk parser instance doing the right thing?

It's not really about the parser, it's about the XML. If you emit blocks like <searchresult>TEXT</searchresult> then if TEXT happens to be 2 + 2 3 then it has to be either escaped or put in a CDATA section. Otherwise any parser will complain (because it should).

Also keep in mind that the URL at searchmorph.com -- http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html -- shows incorrect source code (entities are replaced with their characters, which might be confusing), so:

108 case '<': // '<'
109 entity = "<";
110
111 break;

should actually read

108 case '<': // '<'
109 entity = "&lt;";
110
111 break;

and similarly for the other entities.

D.
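The three points discussed above (silently dropping illegal characters, escaping the five special characters, and allocating a buffer only when something changes) can be combined in a few lines. This is a rough sketch under stated assumptions, not the Carrot2 XMLSerializerHelper code and not the NUTCH-110 patch: the class name XmlTextSketch is hypothetical, the method borrows the name toValidXmlText only for readability, and it always drops bad characters rather than consulting a flag.

```java
public class XmlTextSketch {
    /**
     * Hypothetical sketch: escapes the five XML special characters and drops
     * code points outside the XML 1.0 Char range. Returns the input instance
     * itself when no change is needed.
     */
    static String toValidXmlText(String s) {
        StringBuilder out = null; // created lazily, on the first change
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String entity = null;
            boolean drop = false;
            switch (cp) {
                case '<':  entity = "&lt;";   break;
                case '>':  entity = "&gt;";   break;
                case '&':  entity = "&amp;";  break;
                case '"':  entity = "&quot;"; break;
                case '\'': entity = "&apos;"; break;
                default:
                    // XML 1.0 Char production; unpaired surrogates fall outside it.
                    boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                        || (cp >= 0x20 && cp <= 0xD7FF)
                        || (cp >= 0xE000 && cp <= 0xFFFD)
                        || (cp >= 0x10000 && cp <= 0x10FFFF);
                    drop = !legal;
            }
            if (entity != null || drop) {
                if (out == null) {
                    // First change: copy the untouched prefix into a new buffer.
                    out = new StringBuilder(s.length() + 16);
                    out.append(s, 0, i);
                }
                if (entity != null) out.append(entity);
            } else if (out != null) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out == null ? s : out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toValidXmlText("a < b & c")); // prints: a &lt; b &amp; c
    }
}
```

Replacing a bad character with '?' instead of dropping it would be a one-line change in the branch that handles drop; which behaviour is preferable is the open question in the thread.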
RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Hi,

I'm not an XML expert by any means, but wouldn't it be simpler to just wrap any text where illegal chars are possible with a <![CDATA[ ]]> tag? That way, the offending characters won't be dropped and the process won't be lossy, no? If the CDATA method won't work, and there's no other way to solve the problem without losing text, then your patch has my +1.

Cheers,
Chris

Chris A. Mattmann [EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B Mailstop: 171-246

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.

-----Original Message-----
From: [EMAIL PROTECTED] (JIRA) [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 12, 2005 5:19 PM
To: nutch-dev@incubator.apache.org
Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

[EMAIL PROTECTED] updated NUTCH-110:

Attachment: fixIllegalXmlChars.patch

Attached patch runs all xml text through a check for bad xml characters. This patch is brutal, silently dropping illegal characters. Patch was made after hunting through xalan, the jdk, and nutch itself for a method that would do the above filtering, but I was unable to find any such method -- perhaps an oversight on my part?

OpenSearchServlet outputs illegal xml characters

Key: NUTCH-110
URL: http://issues.apache.org/jira/browse/NUTCH-110
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7
Environment: linux, jdk 1.5
Reporter: [EMAIL PROTECTED]
Attachments: fixIllegalXmlChars.patch

OpenSearchServlet does not check text-to-output for illegal xml characters; depending on the search result, it's possible for OSS to output xml that is not well-formed. For example, if the text has the FF character in it -- i.e. the ascii character at position (decimal) 12 -- the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
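For reference, the Char production cited in the issue allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. A minimal sketch of a scan against those ranges follows; the class and method names (XmlCharScan, isXmlChar, firstIllegalIndex) are hypothetical and are not taken from the attached patch.

```java
public class XmlCharScan {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first illegal code point in s, or -1 if the text is clean. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(12));                              // false: FF is not a legal Char
        System.out.println(firstIllegalIndex("page one\fpage two"));    // 8
        System.out.println(firstIllegalIndex("tabs\tand\nnewlines"));   // -1
    }
}
```

A cheap scan like this lets a caller skip any copying for the common case of clean text and only build a filtered string when an illegal character is actually present.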