[jira] [Issue Comment Deleted] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Massiera updated CONNECTORS-1656:

Comment: was deleted

(was: Hi, I am currently out of office until Sunday Feb 21st included and will be back on Monday, February 22, 2021. For any question, please contact cedric [point] Ulmer [att] francelabs [dot] com
Regards, Julien Massiera)

> HTML extractor produces invalid XML
> ---
>
> Key: CONNECTORS-1656
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1656
> Project: ManifoldCF
> Issue Type: Bug
> Components: HTML extractor
> Affects Versions: ManifoldCF 2.17
> Reporter: Julien Massiera
> Assignee: Karl Wright
> Priority: Major
> Fix For: ManifoldCF 2.19
>
> Attachments: patch-CONNECTORS-1656
>
> The HTML extractor connector produces valid HTML doc (when the 'Strip HTML'
> option is disabled) but invalid XML (some tags like img do not have closing
> tag), and in some cases it is problematic. For example, when Tika is used
> behind, it processes the document as an XML document and most of the time a
> parse exception is raised.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283929#comment-17283929 ]

Karl Wright commented on CONNECTORS-1656:

The patch is fine. I was not notified it was attached, for some reason.
[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283832#comment-17283832 ]

Julien Massiera commented on CONNECTORS-1656:

[~kwri...@metacarta.com], is the patch OK?
[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218358#comment-17218358 ]

Julien Massiera commented on CONNECTORS-1656:

Hi [~kwri...@metacarta.com],

The document produced identifies itself as XHTML. But even if it were HTML, the default HTML parser of Tika uses SAX to parse documents. Here is the default configuration of the Tika HTML parser:

HtmlParser
Class: org.apache.tika.parser.html.HtmlParser
Mime Types:
  text/html
  application/vnd.wap.xhtml+xml
  application/x-asp
  application/xhtml+xml

Since it handles both HTML and XHTML, the processed files have to be valid XML anyway.
[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217609#comment-17217609 ]

Karl Wright commented on CONNECTORS-1656:

The issue, in my opinion, is that the document produced identifies itself as XML when it is not. The first line therefore may be all you need to change to get Tika to not blow up on badly formed XML that comes from HTML. If you want to research this, you might be able to find out what Tika accepts and what it does not pretty readily with some offline experimentation.
[jira] [Assigned] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-1656:

Assignee: Karl Wright
[jira] [Created] (CONNECTORS-1656) HTML extractor produces invalid XML
Julien Massiera created CONNECTORS-1656:

Summary: HTML extractor produces invalid XML
Key: CONNECTORS-1656
URL: https://issues.apache.org/jira/browse/CONNECTORS-1656
Project: ManifoldCF
Issue Type: Bug
Components: HTML extractor
Affects Versions: ManifoldCF 2.17
Reporter: Julien Massiera

The HTML extractor connector produces valid HTML doc (when the 'Strip HTML' option is disabled) but invalid XML (some tags like img do not have closing tag), and in some cases it is problematic. For example, when Tika is used behind, it processes the document as an XML document and most of the time a parse exception is raised.
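To make the failure mode concrete, here is a minimal sketch of why an unclosed img tag is fine as HTML but is rejected by a strict XML parser. It uses the JDK's built-in SAX parser as a stand-in, which is an assumption: it is not Tika's actual parsing pipeline, and the class name StrictXmlCheck and the sample markup are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of ManifoldCF or Tika.
class StrictXmlCheck {

  // Returns true if the markup parses as well-formed XML with the JDK SAX parser.
  static boolean isWellFormed(String markup) {
    try {
      SAXParserFactory.newInstance()
          .newSAXParser()
          .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
      return true;
    } catch (SAXException e) {
      // DefaultHandler rethrows fatal errors, e.g. an element without an end tag.
      return false;
    } catch (Exception e) {
      // Parser configuration or I/O problems are not well-formedness results.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Valid HTML, but <img> is never closed, so it is not well-formed XML.
    System.out.println(isWellFormed("<p><img src=\"a.png\"></p>"));
    // Same content with a self-closed tag.
    System.out.println(isWellFormed("<p><img src=\"a.png\"/></p>"));
  }
}
```

A check like this could be run offline against sample extractor output to see which documents a strict XML consumer would reject.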
Re: HTML extractor produces invalid XML
I've added the component as requested. As for the advice, I suggest you create a ticket and we can discuss there.

Karl

On Tue, Oct 20, 2020 at 6:24 AM wrote:

> Hi,
>
> I noticed a problem with the HTML extractor connector. It produces valid
> HTML doc (when the 'Strip HTML' option is disabled) but invalid XML (some
> tags like img do not have closing tag), and in some cases it is
> problematic.
> For example, when Tika is used behind, it processes the document as an XML
> document and most of the time a parse exception is raised and the document
> content is lost.
>
> I would like to create a ticket for this issue and I would be glad to
> propose a patch and do the commit myself but I need two things:
>
> 1/ Create The "HTML extractor" component in Jira
>
> 2/ Your advise concerning the way to resolve the issue: Either we configure
> this connector to always output XML valid document (when the "Strip HTML"
> option is disabled), or we add a new option in the configuration to enforce
> XML output when enabled ?
>
> Regards,
> Julien
HTML extractor produces invalid XML
Hi,

I noticed a problem with the HTML extractor connector. When the 'Strip HTML' option is disabled, it produces a valid HTML document but invalid XML (some tags, such as img, have no closing tag), and in some cases this is problematic. For example, when Tika is used downstream, it processes the document as an XML document; most of the time a parse exception is raised and the document content is lost.

I would like to create a ticket for this issue, and I would be glad to propose a patch and do the commit myself, but I need two things:

1/ The "HTML extractor" component created in Jira

2/ Your advice on how to resolve the issue: either we configure this connector to always output valid XML documents (when the "Strip HTML" option is disabled), or we add a new configuration option that enforces XML output when enabled?

Regards,
Julien
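As a rough illustration of what "enforcing XML output" could mean for the img case, the sketch below self-closes HTML void elements with a regular expression. This is only an assumption for discussion: it is not the attached patch, the class and method names are hypothetical, and a regex does not cover everything needed for valid XML (unquoted attributes, entities, comments, script content).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF.
class VoidTagCloser {

  // HTML void elements, which never have an end tag in HTML.
  private static final Pattern VOID_TAG = Pattern.compile(
      "<(area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr)\\b([^>]*?)\\s*/?>",
      Pattern.CASE_INSENSITIVE);

  // Rewrites <img ...> and <img ... /> alike to the self-closed form <img .../>.
  static String closeVoidTags(String html) {
    return VOID_TAG.matcher(html).replaceAll("<$1$2/>");
  }

  public static void main(String[] args) {
    System.out.println(closeVoidTags("<p><img src=\"a.png\"><br></p>"));
  }
}
```

Either option from the message above could wrap such a step: always apply it when "Strip HTML" is disabled, or apply it only when a new option is enabled.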