Jukka,

Sorry for my delay. addSpaceBetweenElements is exactly what I was looking for. Thank you.
I'll send an update after further analysis of the incorrectly identified files to see if we can tweak our mimes.

Cheers,

Tim

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
Sent: Tuesday, June 16, 2015 10:26 AM
To: Tika Users
Subject: Re: xml vs html parser

Hi,

2015-06-16 9:28 GMT-04:00 Allison, Timothy B. <talli...@mitre.org>:
> So, is there a way to make the XMLParser more lenient?

I don't think so. XML is draconian by design.

> Or is there a way to configure the HtmlParser to add spaces for
> non-html tags?

One option that wouldn't require changes in Tika code could be to use HtmlParser with the IdentityHtmlMapper and process the output using TextContentHandler with the addSpaceBetweenElements option enabled.

> Or, is there a better solution?

The cleanest alternative would be to come up with more accurate heuristics for detecting SGML. Are there some common file name patterns, DOCTYPEs or other easily identifiable bits that could be used to improve the accuracy of type detection?

Things like the <?xml ...?> header, presence of xmlns attributes, the .xml file extension, etc. can be used as highly reliable signals for XML content, so the lack of them coupled with even some fairly weak SGML detection signals (stuff like upper case element names?) might be enough to get significant improvements in this area.

BR,

Jukka Zitting
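
To make the detection idea from the quoted message concrete, here is a minimal sketch in plain Java of how the signals might be combined: strong XML signals first (the <?xml header, xmlns attributes, the .xml extension), then a weak SGML signal (mostly upper-case element names). This is not Tika's detector. The class name MarkupGuess, the Kind enum, and the "more than half of the start tags are upper case" threshold are all assumptions made up for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: guesses XML vs SGML from a file
// name and the first chunk of the document text.
final class MarkupGuess {
    enum Kind { XML, SGML, UNKNOWN }

    // XML declaration at the very start of the content (leading whitespace allowed).
    private static final Pattern XML_DECL = Pattern.compile("^\\s*<\\?xml\\s");
    // An xmlns or xmlns:prefix attribute anywhere in the sampled text.
    private static final Pattern XMLNS =
            Pattern.compile("\\sxmlns(:[A-Za-z_][\\w.-]*)?\\s*=");
    // Start tags only: '<' followed directly by a letter (skips </x>, <?x?>, <!x>).
    private static final Pattern START_TAG =
            Pattern.compile("<([A-Za-z][A-Za-z0-9._-]*)");

    static Kind guess(String fileName, String head) {
        // Strong XML signals: any one of them is treated as decisive.
        if (XML_DECL.matcher(head).find()) return Kind.XML;
        if (XMLNS.matcher(head).find()) return Kind.XML;
        if (fileName != null
                && fileName.toLowerCase(Locale.ROOT).endsWith(".xml")) {
            return Kind.XML;
        }

        // Weak SGML signal: count start tags whose names are entirely upper case.
        int total = 0;
        int upper = 0;
        Matcher m = START_TAG.matcher(head);
        while (m.find()) {
            total++;
            String name = m.group(1);
            if (name.equals(name.toUpperCase(Locale.ROOT))) upper++;
        }
        // Assumed threshold: more than half of the start tags are upper case.
        if (total > 0 && upper * 2 > total) return Kind.SGML;
        return Kind.UNKNOWN;
    }
}
```

Calling MarkupGuess.guess(name, firstFewKilobytes) on the misidentified files would show how often the absence of XML signals plus upper-case tags lines up with the files that are really SGML. The threshold would need tuning against real data.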