Andrzej Bialecki wrote:
....
> Then we should take the best of both worlds - escape valid characters,
> and replace invalid ones with '?' or space, or nothing. I know a place
> where we could find some inspiration (Carrot2 XMLSerializerHelper.java
> ... ;-) )
Thanks for the pointer. See starting at line 92,
XMLSerializerHelper#toValidXmlText:
http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html
The differences between this method and the patch supplied in NUTCH-110 are:
1. XMLSerializerHelper#toValidXmlText throws an exception when it
encounters an invalid character, whereas NUTCH-110 just drops it.
2. XMLSerializerHelper#toValidXmlText escapes all characters, including
the 5 XML 'special characters', whereas the NUTCH-110 patch only looks
for characters outside of the allowed XML character range.
3. NUTCH-110 first scans to see if the text has any 'bad xml' before it
goes about creating a new 'safe' string instance.
I think throwing an exception is inappropriate at search-results-drawing
time. Dropping the character or replacing it with '?' or some such seems
a better way to go.
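
To make the drop-or-replace idea concrete, here is a rough sketch. It is
not the NUTCH-110 patch itself and not the Carrot2 code; the class and
method names (XmlCharFilterSketch, isValidXmlChar, replaceInvalid) are
made up for illustration. It assumes the XML 1.0 'Char' ranges, scans
first, and only builds a new string if something invalid turns up:

```java
final class XmlCharFilterSketch {

    // XML 1.0 "Char" production: #x9 | #xA | #xD | [#x20-#xD7FF]
    // | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: replaces every invalid code point with
    // 'replacement' ("" drops it, "?" or " " substitutes it).
    static String replaceInvalid(String text, String replacement) {
        int len = text.length();
        int i = 0;
        // First pass: find the first invalid code point, if any.
        while (i < len) {
            int cp = text.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                break;
            }
            i += Character.charCount(cp);
        }
        if (i == len) {
            return text; // nothing to fix, no new string allocated
        }
        // Second pass: copy the clean prefix, then filter the rest.
        StringBuilder sb = new StringBuilder(len);
        sb.append(text, 0, i);
        while (i < len) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Passing "" as the replacement gives the current drop behaviour, and "?"
gives the substitution Andrzej suggested. Unpaired surrogates fall
outside the allowed ranges, so they are treated as invalid too.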
Should I change the NUTCH-110 patch to do entity escaping too, as
XMLSerializerHelper#toValidXmlText does, since we can't depend on the
underlying JDK parser instance doing the right thing?
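
For reference, entity escaping of the 5 special characters could look
roughly like the following. This is only a sketch with made-up names
(XmlEscapeSketch, escapeXml), not what XMLSerializerHelper actually
does, and it assumes invalid characters have already been dropped or
replaced by an earlier step:

```java
final class XmlEscapeSketch {

    // Hypothetical helper: escapes the five predefined XML entities
    // and copies every other character through unchanged.
    static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The catch is that if the serializer underneath already escapes, doing it
here as well would double-escape ('&amp;' turning into '&amp;amp;'), so
it should be done in one place or the other, not both.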
Yours,
St.Ack