[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML

2020-10-20 Thread Karl Wright (Jira)
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217609#comment-17217609 ] Karl Wright commented on CONNECTORS-1656: The issue, in my opinion, is that the document

[jira] [Assigned] (CONNECTORS-1656) HTML extractor produces invalid XML

2020-10-20 Thread Karl Wright (Jira)
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1656: Assignee: Karl Wright > HTML extractor produces invalid XML

[jira] [Created] (CONNECTORS-1656) HTML extractor produces invalid XML

2020-10-20 Thread Julien Massiera (Jira)
Julien Massiera created CONNECTORS-1656: Summary: HTML extractor produces invalid XML Key: CONNECTORS-1656 URL: https://issues.apache.org/jira/browse/CONNECTORS-1656 Project: ManifoldCF

Re: HTML extractor produces invalid XML

2020-10-20 Thread Karl Wright
I've added the component as requested. As for the advice, I suggest you create a ticket and we can discuss there. Karl
On Tue, Oct 20, 2020 at 6:24 AM wrote:
> Hi,
>
> I noticed a problem with the HTML extractor connector. It produces valid HTML doc (when the 'Strip HTML' option is

HTML extractor produces invalid XML

2020-10-20 Thread julien.massiera
Hi, I noticed a problem with the HTML extractor connector. It produces a valid HTML document (when the 'Strip HTML' option is disabled) but invalid XML (some tags, such as img, have no closing tag), and in some cases this is problematic. For example, when Tika is used downstream, it processes the document
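
To illustrate the kind of mismatch described in the report, here is a minimal sketch in Python (not ManifoldCF code). It uses the standard-library XML parser to show that an HTML-style img tag with no closing tag is rejected as XML, while the self-closed form is accepted. The sample markup and the helper name is_well_formed_xml are assumptions made up for this illustration; they are not taken from the connector or from the ticket.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, invented for this illustration.
# Valid HTML: img is a void element and needs no closing tag.
html_style = '<html><body><img src="logo.png"></body></html>'
# Same content with the img element self-closed, as XML requires.
xhtml_style = '<html><body><img src="logo.png"/></body></html>'

print("HTML-style img is well-formed XML: ", is_well_formed_xml(html_style))
print("Self-closed img is well-formed XML:", is_well_formed_xml(xhtml_style))
```

Any consumer that expects XML would hit the same parse error on the first form, which is one plausible reason the extractor's output causes trouble further along the pipeline.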