All,

On govdocs1, the XML parser's exceptions accounted for nearly a quarter of all thrown exceptions at one point (Tika 1.7ish). Typically, a file was misidentified as XML when in fact it was SGML or some other text-based file with some markup that wasn't meant to be XML.
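
To make the failure mode concrete, here is a minimal sketch using only the JDK's own XML parser, not Tika's XMLParser. The class name StrictXmlDemo, the helper isWellFormed, and the sample strings are all made up for illustration. It only shows that SGML-style markup with unclosed tags is rejected outright by a strict XML parser, which is roughly the kind of exception described above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; this is the plain JDK parser, not Tika.
class StrictXmlDemo {

    // Returns true if the JDK XML parser accepts the text as well-formed XML.
    static boolean isWellFormed(String text) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                    text.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Not well-formed: this is the strict-parser failure.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up SGML-style snippet: TITLE and P are never closed.
        String sgmlish = "<DOC><TITLE>Report<P>the quick brown fox</DOC>";
        // The same content written as well-formed XML.
        String xml = "<doc><title>Report</title><p>the quick brown fox</p></doc>";

        System.out.println("SGML-style input accepted: " + isWellFormed(sgmlish));
        System.out.println("XML input accepted: " + isWellFormed(xml));
    }
}
```

A strict parser has no recovery path here: the first mismatched tag is a fatal error, so no text is extracted from the file at all.
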
For kicks, I switched the config to use the HtmlParser for files identified as XML. This got rid of the exceptions, but the extracted content was quite different: ballpark 6k out of 35k files had similarity < 0.95. This was mostly because of elisions, e.g. "the quick" -> "thequick", which I assume happen across tag boundaries.

So, is there a way to make the XMLParser more lenient? Or is there a way to configure the HtmlParser to add spaces for non-HTML tags? Or is there a better solution?

Thank you!

Best,

Tim
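
P.S. A toy sketch of the elision I suspect is happening. This is not how the HtmlParser works internally. It is a regex-based stand-in, and the class name TagElisionDemo, the method stripTags, and the sample markup are all hypothetical. It only shows that dropping a tag with nothing in its place joins adjacent words, while substituting a space keeps them apart.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not Tika code.
class TagElisionDemo {

    // Naive tag matcher, good enough for this toy example.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Replaces every tag with the given string, then normalizes whitespace.
    static String stripTags(String markup, String replacement) {
        String text = TAG.matcher(markup).replaceAll(replacement);
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample: two adjacent non-HTML elements with no whitespace between them.
        String markup = "<item>the</item><item>quick</item>";

        // Tags dropped with nothing in their place: the words run together.
        System.out.println(stripTags(markup, ""));

        // Tags replaced by a space: the words stay separate.
        System.out.println(stripTags(markup, " "));
    }
}
```

If that is what is going on, the fix on the HtmlParser side would amount to emitting whitespace at the boundaries of unknown elements, which is what the second call mimics.
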