Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-14 Thread stack

Dawid Weiss wrote:


...

So, will I amend the patch in NUTCH-110 so it uses 
XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?



Copy the method's contents. It doesn't really make sense to copy the 
entire class just for this method. Good luck.  


Thanks Dawid.

I've just uploaded a new patch that puts toValidXmlText into StringUtil, 
adds a few basic unit tests for the just-added method, and has 
OpenSearchServlet call StringUtil#toValidXmlText on all text added to 
DOM nodes.


St.Ack



Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Andrzej Bialecki

Chris Mattmann wrote:

Hi,

 I'm not an XML expert by any means, but wouldn't it be simpler to just wrap
any text where illegal chars are possible in a <![CDATA[ ]]> section? That
way, the offending characters won't be dropped and the process won't be
lossy, no?

  If the CDATA method won't work, and there's no other way to solve the
problem without losing text, then your patch has my +1.


We should not drop the offending characters, but escape them. Either the 
Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).


So, this is -1 for the patch.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Andrzej Bialecki

Dawid Weiss wrote:


We should not drop the offending characters, but escape them. Either 
the Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).



This isn't entirely true, Andrzej -- escaping a character, or putting it 
in a CDATA section is just about different ways of expressing the same 
character code in an XML structure. The same and ILLEGAL character code 
in terms of XML spec (there is a fragment specifying legal character 
ranges there), so a conforming XML parser should throw an exception if 
it encounters anything outside of the legal range. The only way of 
transferring a full binary is to encode it to legal unicode characters 
(using uuencode or such).
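
The point above can be checked against the parser that ships with the JDK. The following is a minimal sketch, not code from the patch: the class and method names are made up for illustration, and it assumes the JDK's default DocumentBuilder. It feeds the parser a form feed once as a character reference and once as a raw character inside a CDATA section; a conforming parser is expected to reject both, although, as noted elsewhere in this thread, not every parser is that strict.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative class name; not part of Nutch or Carrot2.
class IllegalCharDemo {
    /** Returns true if the JDK's default parser accepts the document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler so nothing is printed to stderr;
            // fatal errors are still thrown as SAXException.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document is not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<r>plain text</r>"));
        // Form feed written as a character reference.
        System.out.println(parses("<r>&#12;</r>"));
        // Raw form feed inside a CDATA section.
        System.out.println(parses("<r><![CDATA[\f]]></r>"));
    }
}
```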


I agree with the person who submitted this patch that it is a potential 
issue and should be addressed somehow.


Right, I didn't think about this... somehow I thought this was all about 
special characters like ' < > &.


Then we should take the best of both worlds - escape valid characters, 
and replace invalid ones with '?' or space, or nothing. I know a place 
where we could find some inspiration (Carrot2 XMLSerializerHelper.java 
... ;-) )
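
The "replace invalid ones" half of this suggestion could look roughly like the sketch below. It is not the Carrot2 code and not the NUTCH-110 patch; the class and method names are hypothetical. The range check follows the Char production of the XML 1.0 specification, and the caller chooses the replacement ("?", " ", or "" to drop the character).

```java
// Hypothetical helper; names are illustrative only.
class XmlSanitizer {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces each illegal code point with the given replacement string. */
    static String replaceIllegal(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Covers control characters and unpaired surrogates.
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceIllegal("page\fbreak", "?"));
        System.out.println(replaceIllegal("page\fbreak", ""));
    }
}
```

Escaping the valid-but-special characters is a separate step and would still have to be applied to the result.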


--
Best regards,
Andrzej Bialecki 



Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Dawid Weiss


Right, I didn't think about this... somehow I thought this was all about 
special characters like ' < > &.


Oh, believe me: this knowledge came from sour experience not from book 
wisdom... I know for sure some XML parsers complain about invalid 
characters, while others don't.


Then we should take the best of both worlds - escape valid characters, 
and replace invalid ones with '?' or space, or nothing. I know a place 
where we could find some inspiration (Carrot2 XMLSerializerHelper.java 
... ;-) )


Feel free to take anything you need; I don't claim it's the best way to 
implement it, but it is certainly better than passing through incorrect 
character codes. Alternatively you could correct everything that is 
indexed not to contain invalid characters (via a token filter?).


Dawid


Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread stack

Andrzej Bialecki wrote:




Then we should take the best of both worlds - escape valid characters, 
and replace invalid ones with '?' or space, or nothing. I know a place 
where we could find some inspiration (Carrot2 XMLSerializerHelper.java 
... ;-) )


Thanks for the pointer. See starting at line 92, 
XMLSerializerHelper#toValidXmlText: 
http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html


The differences between this method and the patch supplied in NUTCH-110 are:

1. XMLSerializerHelper#toValidXmlText throws an exception when it 
encounters an invalid character whereas NUTCH-110 just drops it.
2. XMLSerializerHelper#toValidXmlText escapes all characters including 
the 5 xml 'special characters' whereas the NUTCH-110 patch only looks 
for the characters outside of the allowed XML character range.
3. NUTCH-110 first scans to see if text has 'bad xml' before it goes 
about creating new 'safe' string instance.


I think throwing an exception is inappropriate at search-results-drawing 
time. Dropping the character or replacing it with '?' or some such seems 
a better way to go.


Should I change the NUTCH-110 patch to do entity escaping too as 
XMLSerializerHelper#toValidXmlText does because we can't depend on the 
underlying jdk parser instance doing the right thing?


Yours,
St.Ack


Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Dawid Weiss



 The differences between this method and the patch supplied in NUTCH-110
 are:

Take a closer look at the source code --



1. XMLSerializerHelper#toValidXmlText throws an exception when it 
encounters an invalid character whereas NUTCH-110 just drops it.


Not really, it is governed by a boolean flag. If the flag is set to 
true, it'll throw an exception, otherwise it silently ignores bad 
characters.
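
A flag-controlled variant might be sketched as follows. This is only an illustration of the behaviour described above, not the actual Carrot2 signature: the class name, method name, parameter name, and exception type are all assumptions.

```java
// Hypothetical sketch of a strict/lenient switch; not the Carrot2 source.
class StrictOrLenient {
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Removes illegal XML characters. If failOnIllegal is true, the first
     * illegal character raises an exception instead of being dropped.
     */
    static String clean(String text, boolean failOnIllegal) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else if (failOnIllegal) {
                throw new IllegalArgumentException(
                    "Illegal XML character 0x" + Integer.toHexString(cp)
                        + " at index " + i);
            }
            // Lenient mode: fall through and silently skip the character.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("a\fb", false));
        try {
            clean("a\fb", true);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

For output produced at search-results time, the lenient mode is the one that matches the concern raised earlier about not throwing while rendering results.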


2. XMLSerializerHelper#toValidXmlText escapes all characters including 
the 5 xml 'special characters' whereas the NUTCH-110 patch only looks 
for the characters outside of the allowed XML character range.


If you intend to put this string in any XML text block (such as the 
content of an attribute, or between tags and not enclosed in a CDATA 
block), you'll have to deal with special characters such as &lt; and 
&gt;. If you don't, your XML will be simply incorrect.
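
A minimal sketch of such escaping, covering the five predefined XML entities, is shown here. The class and method names are illustrative and are not taken from Carrot2 or Nutch; the sketch deliberately does nothing about characters outside the legal XML range, which need separate handling.

```java
// Illustrative helper; escapes the five predefined XML entities only.
class XmlEscape {
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b && \"c\" > 'd'"));
    }
}
```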


3. NUTCH-110 first scans to see if text has 'bad xml' before it goes 
about creating new 'safe' string instance.


So does this routine, actually. The string buffer is only created if 
there are changes to be made to the string.
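
The allocate-only-when-needed idea can be sketched like this. It is an assumption about the general shape of such a routine, not a copy of either implementation; the names are hypothetical. The input string itself is returned when nothing has to change.

```java
// Hypothetical sketch of lazy buffer allocation; names are illustrative.
class LazyCopy {
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String dropIllegal(String text) {
        StringBuilder out = null; // stays null while the text is clean
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else if (out == null) {
                // First illegal character: allocate and copy the clean prefix.
                out = new StringBuilder(text.length());
                out.append(text, 0, i);
            }
            i += Character.charCount(cp);
        }
        return out == null ? text : out.toString();
    }

    public static void main(String[] args) {
        String clean = "nothing to fix";
        System.out.println(dropIllegal(clean) == clean); // same instance
        System.out.println(dropIllegal("a\fb\fc"));
    }
}
```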


Should I change the NUTCH-110 patch to do entity escaping too as 
XMLSerializerHelper#toValidXmlText does because we can't depend on the 
underlying jdk parser instance doing the right thing?


It's not really about the parser, it's about the XML. If you emit blocks 
like


<searchresult>TEXT</searchresult>

then if TEXT happens to be 2 + 2 < 3 then it has to be either escaped 
or put in a CDATA section. Otherwise any parser will complain (because 
it should).
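
When the output is built as a DOM tree and serialized by the JDK rather than concatenated by hand, markup characters in text nodes are expected to be escaped by the serializer. The sketch below illustrates that with a text node containing '<'. It assumes the JDK's default TransformerFactory; the class name is made up, and the element name is borrowed from the example above. Note that this only covers markup characters: as the NUTCH-110 report describes, a character outside the legal XML range is not fixed by serialization.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative class name; not part of Nutch.
class DomEscapeDemo {
    /** Puts the text in a single element and serializes the document. */
    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element el = doc.createElement("searchresult");
            el.appendChild(doc.createTextNode(text));
            doc.appendChild(el);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter w = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(w));
            return w.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("2 + 2 < 3"));
    }
}
```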


Also keep in mind that the URL at searchmorph.com --

http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html 



shows incorrect source code (entities are replaced with their characters, 
which might be confusing), so:


108    case '<': // '<'
109        entity = "<";
110
111        break;

should actually read

108    case '<': // '<'
109        entity = "&lt;";
110
111        break;

and similarly for the other entities.

D.


RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread Chris Mattmann
Hi,

 I'm not an XML expert by any means, but wouldn't it be simpler to just wrap
any text where illegal chars are possible in a <![CDATA[ ]]> section? That
way, the offending characters won't be dropped and the process won't be
lossy, no?

  If the CDATA method won't work, and there's no other way to solve the
problem without losing text, then your patch has my +1.

Cheers,
 Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 -Original Message-
 From: [EMAIL PROTECTED] (JIRA) [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, October 12, 2005 5:19 PM
 To: nutch-dev@incubator.apache.org
 Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml
 characters
 
  [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
 
 [EMAIL PROTECTED] updated NUTCH-110:
 
 
 Attachment: fixIllegalXmlChars.patch
 
 Attached patch runs all xml text through a check for bad xml characters.
 This patch is brutal, silently dropping illegal characters.  Patch was made
 after hunting xalan, jdk, and nutch itself for a method that would do the
 above filtering but was unable to find any such method -- perhaps an
 oversight on my part?
 
  OpenSearchServlet outputs illegal xml characters
  
 
   Key: NUTCH-110
   URL: http://issues.apache.org/jira/browse/NUTCH-110
   Project: Nutch
  Type: Bug
Components: searcher
  Versions: 0.7
   Environment: linux, jdk 1.5
  Reporter: [EMAIL PROTECTED]
   Attachments: fixIllegalXmlChars.patch
 
  OpenSearchServlet does not check text-to-output for illegal xml
 characters; depending on the search result, it's possible for OSS to output
 xml that is not well-formed.  For example, if text has the FF (form feed)
 character in it -- i.e. the ascii character at position (decimal) 12 --
 the produced XML will show the FF character as '&#12;'. The
 character/entity '&#12;' is not legal in XML according to
 http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
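
 The Char production referenced above allows #x9, #xA, #xD, #x20-#xD7FF,
 #xE000-#xFFFD and #x10000-#x10FFFF. A minimal sketch of that check follows;
 the class and method names are illustrative and are not taken from the
 attached patch.

```java
// Illustrative predicate for the XML 1.0 Char production.
class XmlCharRange {
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isLegalXmlChar(12));   // form feed: prints false
        System.out.println(isLegalXmlChar('\n')); // line feed: prints true
    }
}
```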
 