[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML

Karl Wright (Jira) Tue, 20 Oct 2020 06:31:13 -0700


    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217609#comment-17217609
 ]


Karl Wright commented on CONNECTORS-1656:
-----------------------------------------

The issue, in my opinion, is that the document produced identifies itself as 
XML when it is not.  Changing the first line may therefore be all you need to 
keep Tika from blowing up on the badly formed XML that comes from HTML.

If you want to research this, some offline experimentation should show fairly 
readily what Tika accepts and what it does not.
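
As a starting point for that kind of offline experiment, here is a minimal 
sketch in Python. It does not call Tika at all: the standard-library XML parser 
stands in for any strict XML parser, and the sample markup, the file name 
logo.png, and both helper functions are hypothetical. It only shows that an 
unclosed img tag is rejected as XML and that the leading declaration can be 
removed; whether Tika then handles the result as HTML is an assumption that 
still has to be checked against Tika itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical first line that makes the output announce itself as XML.
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'


def is_well_formed_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        # Parse bytes so the encoding named in the declaration is honoured.
        ET.fromstring(text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


def strip_xml_declaration(text: str) -> str:
    """Drop a leading '<?xml ... ?>' declaration, if present."""
    stripped = text.lstrip()
    if stripped.startswith("<?xml"):
        end = stripped.find("?>")
        if end != -1:
            return stripped[end + 2:].lstrip()
    return text


# Valid HTML, but not well-formed XML: the img element is never closed.
html_body = '<html><body><img src="logo.png"><p>Hello</p></body></html>'
extractor_output = XML_DECLARATION + "\n" + html_body

if __name__ == "__main__":
    print("accepted as XML:", is_well_formed_xml(extractor_output))
    print("without declaration:", strip_xml_declaration(extractor_output))
```

Feeding both variants (with and without the declaration) to a local Tika 
instance would then show which one it parses without an exception.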



> HTML extractor produces invalid XML
> -----------------------------------
>
>                 Key: CONNECTORS-1656
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1656
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: HTML extractor
>    Affects Versions: ManifoldCF 2.17
>            Reporter: Julien Massiera
>            Assignee: Karl Wright
>            Priority: Major
>
> The HTML extractor connector produces a valid HTML document (when the 'Strip 
> HTML' option is disabled) but invalid XML (some tags, such as img, have no 
> closing tag), which is problematic in some cases. For example, when Tika is 
> used downstream, it processes the document as an XML document and, most of 
> the time, raises a parse exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
