Hi,

2015-06-16 9:28 GMT-04:00 Allison, Timothy B. <talli...@mitre.org>:
> So, is there a way to make the XMLParser more lenient?

I don't think so. XML is draconian by design.

> Or is there a way to configure the HtmlParser to add spaces for
> non-html tags?

One option that wouldn't require changes in Tika code would be to use
HtmlParser with the IdentityHtmlMapper and process the output using
TextContentHandler with the addSpaceBetweenElements option enabled.

> Or, is there a better solution?

The cleanest alternative would be to come up with more accurate
heuristics for detecting SGML. Are there some common file name patterns,
DOCTYPEs or other easily identifiable bits that could be used to improve
the accuracy of type detection?

Things like the <?xml ...?> header, the presence of xmlns attributes, the
.xml file extension, etc. can be used as highly reliable signals for XML
content, so the lack of them coupled with even some fairly weak SGML
detection signals (stuff like upper case element names?) might be enough
to get significant improvements in this area.

BR,

Jukka Zitting
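
For illustration only, the detection idea above could be sketched roughly
as follows. This is not Tika code: the SgmlHeuristic class, its method
names and the 0.8 threshold are all made up for this example, and upper
case element names are only the tentative weak signal mentioned above.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Tika): guesses whether a document prefix is SGML rather than XML. */
final class SgmlHeuristic {

    // Name of an opening tag such as <DOC> or <para id="1">; closing tags,
    // <?xml ...?> and <!DOCTYPE ...> are deliberately not matched.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([A-Za-z][A-Za-z0-9]*)[\\s>/]");

    // An xmlns or xmlns:prefix attribute.
    private static final Pattern XMLNS =
            Pattern.compile("\\sxmlns(:[A-Za-z_][\\w.-]*)?\\s*=");

    private SgmlHeuristic() {}

    /** True if any of the highly reliable XML signals is present. */
    static boolean hasStrongXmlSignal(String prefix, String fileName) {
        // Skip a leading byte order mark, if any, before looking for the header.
        String head = prefix.startsWith("\uFEFF") ? prefix.substring(1) : prefix;
        if (head.trim().startsWith("<?xml")) {
            return true;
        }
        if (XMLNS.matcher(prefix).find()) {
            return true;
        }
        return fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".xml");
    }

    /** Fraction of opening tags whose names have no lower case letters (0.0 if there are no tags). */
    static double upperCaseTagRatio(String prefix) {
        Matcher m = OPEN_TAG.matcher(prefix);
        int total = 0;
        int upper = 0;
        while (m.find()) {
            total++;
            String name = m.group(1);
            if (name.equals(name.toUpperCase(Locale.ROOT))) {
                upper++;
            }
        }
        return total == 0 ? 0.0 : (double) upper / total;
    }

    /** Weak SGML signal, used only when no strong XML signal is present. */
    static boolean looksLikeSgml(String prefix, String fileName) {
        if (hasStrongXmlSignal(prefix, fileName)) {
            return false;
        }
        // Assumption: 0.8 is an arbitrary cut-off chosen for this sketch.
        return upperCaseTagRatio(prefix) >= 0.8;
    }
}
```

With this sketch, a prefix like "<DOC><DOCNO> 1 </DOCNO></DOC>" with no
.xml extension would be flagged as likely SGML, while the same content
preceded by an <?xml ...?> header would not. Any real threshold or extra
signal would need tuning against actual sample files.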