Andrzej Bialecki wrote:

....

Then we should take the best of both worlds - escape valid characters, and replace invalid ones with '?' or space, or nothing. I know a place where we could find some inspiration (Carrot2 XMLSerializerHelper.java ... ;-) )

Thanks for the pointer. See XMLSerializerHelper#toValidXmlText, starting at line 92: http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html

The differences between this method and the patch supplied in NUTCH-110 are:

1. XMLSerializerHelper#toValidXmlText throws an exception when it encounters an invalid character, whereas NUTCH-110 just drops it.
2. XMLSerializerHelper#toValidXmlText escapes all characters, including the 5 XML 'special characters', whereas the NUTCH-110 patch only looks for characters outside of the allowed XML character range.
3. NUTCH-110 first scans to see if the text has 'bad xml' before it goes about creating a new 'safe' string instance.
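
To make the "scan first, then drop" behaviour in points 1 and 3 concrete, here is a rough sketch. It is not the NUTCH-110 patch itself: the class and method names (XmlTextSketch, isValidXmlChar, stripInvalid) are made up for illustration, and the only thing taken as given is the character range from the XML 1.0 Char production.

```java
// Illustrative sketch only; names are hypothetical, not taken from NUTCH-110.
final class XmlTextSketch {
    private XmlTextSketch() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the input unchanged (same instance) when it contains no
     * invalid characters; otherwise returns a copy with them dropped.
     */
    public static String stripInvalid(String text) {
        if (text == null) {
            return null;
        }
        // First pass: find the first invalid character, if any.
        int firstBad = -1;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                firstBad = i;
                break;
            }
            i += Character.charCount(cp);
        }
        if (firstBad < 0) {
            return text; // common case: nothing to fix, no allocation
        }
        // Second pass: copy the clean prefix, then filter the rest.
        StringBuilder sb = new StringBuilder(text.length());
        sb.append(text, 0, firstBad);
        for (int i = firstBad; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

The point of the first pass is that most result text should be clean, so the usual case returns the original string without building a new one. Unpaired surrogates fall outside the allowed ranges and are dropped like any other invalid character.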

I think throwing an exception is inappropriate at search-results-drawing time. Dropping the character, or replacing it with '?' or some such, seems a better way to go.

Should I change the NUTCH-110 patch to do entity escaping as well, as XMLSerializerHelper#toValidXmlText does, since we can't depend on the underlying JDK parser instance doing the right thing?
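
If we did go the "best of both worlds" route, it might look something like the sketch below: escape the five special characters and substitute a replacement for anything outside the XML 1.0 character range. This is only an assumption about what the combined behaviour could be; XmlEscapeSketch and toSafeXmlText are hypothetical names, not code from Carrot2 or from the NUTCH-110 patch.

```java
// Illustrative sketch only; names are hypothetical.
final class XmlEscapeSketch {
    private XmlEscapeSketch() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    private static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Escapes the five XML special characters and replaces every
     * character that is not legal in XML 1.0 with the given replacement.
     */
    public static String toSafeXmlText(String text, char replacement) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (isValidXmlChar(cp)) {
                        sb.appendCodePoint(cp);
                    } else {
                        sb.append(replacement); // never throw at render time
                    }
            }
        }
        return sb.toString();
    }
}
```

With this, toSafeXmlText("a<b\u0001&c", '?') yields "a&lt;b?&amp;c". One caveat: escaping here is only correct if nothing downstream escapes the text a second time, which is exactly the question about the parser above.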

Yours,
St.Ack
