[
https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217609#comment-17217609
]
Karl Wright commented on CONNECTORS-1656:
---
The issue, in my opinion, is that the document
[
https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karl Wright reassigned CONNECTORS-1656:
---
Assignee: Karl Wright
> HTML extractor produces invalid XML
>
Julien Massiera created CONNECTORS-1656:
---
Summary: HTML extractor produces invalid XML
Key: CONNECTORS-1656
URL: https://issues.apache.org/jira/browse/CONNECTORS-1656
Project: ManifoldCF
I've added the component as requested. As for the advice, I suggest you
create a ticket and we can discuss there.
Karl
On Tue, Oct 20, 2020 at 6:24 AM wrote:
> Hi,
>
> I noticed a problem with the HTML extractor connector. It produces valid
> HTML doc (when the 'Strip HTML' option is
Hi,
I noticed a problem with the HTML extractor connector. It produces a valid
HTML document (when the 'Strip HTML' option is disabled) but invalid XML (some
tags, such as img, have no closing tag), which is problematic in some cases.
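To illustrate the point, here is a minimal sketch using only the Python standard library. It is not ManifoldCF code: the helper names (is_well_formed_xml, close_void_elements) and the sample markup are hypothetical. It shows that an unclosed img tag is acceptable HTML but is rejected by an XML parser, and that self-closing the void elements is enough for this small sample. It does not repair other HTML-only constructs (doctype, comments, unclosed non-void tags).

```python
import xml.etree.ElementTree as ET
from html import escape
from html.parser import HTMLParser

# HTML void elements: valid without a closing tag in HTML, but not in XML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


def is_well_formed_xml(text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class _XmlSerializer(HTMLParser):
    """Re-emit parsed HTML with void elements written as self-closing tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _attrs(self, attrs):
        # Valueless attributes (e.g. "disabled") get their own name as value.
        return "".join(
            f' {name}="{escape(value if value is not None else name, quote=True)}"'
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        close = " /" if tag in VOID_ELEMENTS else ""
        self.parts.append(f"<{tag}{self._attrs(attrs)}{close}>")

    def handle_startendtag(self, tag, attrs):
        self.parts.append(f"<{tag}{self._attrs(attrs)} />")

    def handle_endtag(self, tag):
        # A stray </img> or </br> would be redundant after self-closing.
        if tag not in VOID_ELEMENTS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape for XML.
        self.parts.append(escape(data, quote=False))


def close_void_elements(html_text: str) -> str:
    """Hypothetical helper: serialize HTML so that void elements are closed."""
    parser = _XmlSerializer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


sample = '<html><body><p>Logo: <img src="logo.png" alt="A &amp; B"></p><br></body></html>'
fixed = close_void_elements(sample)
print(is_well_formed_xml(sample))  # the unclosed img and br break XML parsing
print(is_well_formed_xml(fixed))
```

The sketch makes the img and br elements self-closing and leaves the rest of the markup unchanged, so the output stays usable as HTML while also parsing as XML.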
For example, when Tika is used downstream, it processes the document