[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217609#comment-17217609 ]
Karl Wright commented on CONNECTORS-1656:
-----------------------------------------

The issue, in my opinion, is that the document produced identifies itself as XML when it is not. The first line may therefore be all you need to change to keep Tika from blowing up on badly formed XML that comes from HTML.

If you want to research this, you should be able to find out fairly readily what Tika accepts and what it does not with some offline experimentation.

> HTML extractor produces invalid XML
> -----------------------------------
>
>                 Key: CONNECTORS-1656
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1656
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: HTML extractor
>    Affects Versions: ManifoldCF 2.17
>            Reporter: Julien Massiera
>            Assignee: Karl Wright
>            Priority: Major
>
> The HTML extractor connector produces a valid HTML document (when the
> 'Strip HTML' option is disabled) but invalid XML (some tags, such as img,
> have no closing tag), and in some cases this is problematic. For example,
> when Tika is used downstream, it processes the document as an XML document
> and most of the time a parse exception is raised.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
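For anyone trying the offline experimentation suggested above, here is a minimal sketch of the underlying problem. It uses only the JDK's built-in JAXP parser, not Tika, so it shows why an unclosed img tag is rejected by a strict XML parser rather than how Tika itself behaves. The class name `WellFormedCheck`, the method `isWellFormedXml`, and the sample markup are all hypothetical and chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether a string parses as well-formed XML
// using the JDK's JAXP parser (this is NOT Tika's parser).
class WellFormedCheck {

    static boolean isWellFormedXml(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser
            // from printing its own diagnostics to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                text.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Raised for any well-formedness violation, e.g. an unclosed tag.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String declaration = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

        // Valid HTML, but img is never closed, so it is not well-formed XML.
        String htmlStyle = declaration
            + "<html><body><img src=\"a.png\"></body></html>";

        // Same content with a self-closing img tag.
        String xmlStyle = declaration
            + "<html><body><img src=\"a.png\"/></body></html>";

        System.out.println("unclosed img:     " + isWellFormedXml(htmlStyle));
        System.out.println("self-closing img: " + isWellFormedXml(xmlStyle));
    }
}
```

Running `main` should report the first sample as not well-formed and the second as well-formed. Whether Tika chooses its XML parser because of the leading declaration is exactly what the comment proposes to verify, so that part would still need to be tested against Tika directly.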