HTML extractor produces invalid XML

julien.massiera Tue, 20 Oct 2020 03:25:00 -0700

Hi,


I noticed a problem with the HTML extractor connector. When the 'Strip HTML'
option is disabled, it produces a valid HTML document but not well-formed XML
(some tags, such as img, have no closing tag), and in some cases this is a
problem. For example, when Tika is used downstream, it processes the document
as an XML document, so most of the time a parse exception is raised and the
document content is lost.
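
To illustrate the failure mode, here is a minimal sketch in Python (standard
library only, not the connector's or Tika's actual code). The two sample
fragments and the helper name is_well_formed_xml are made up for this example.
It shows that an unclosed img tag is enough to make a strict XML parser
reject an otherwise valid HTML fragment:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample fragments: valid HTML vs. the same content as XML.
html_fragment = '<div><img src="logo.png"><p>Hello</p></div>'
xhtml_fragment = '<div><img src="logo.png"/><p>Hello</p></div>'


def is_well_formed_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The unclosed <img> is still open when </div> arrives,
        # which the XML parser reports as a mismatched tag.
        return False


print(is_well_formed_xml(html_fragment))
print(is_well_formed_xml(xhtml_fragment))
```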

 

I would like to create a ticket for this issue, and I would be glad to
propose a patch and commit it myself, but I need two things:

1/ The "HTML extractor" component to be created in Jira.

2/ Your advice on how to resolve the issue: either we configure this
connector to always output a valid XML document (when the "Strip HTML"
option is disabled), or we add a new configuration option that enforces
XML output when enabled?
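
As a rough sketch of the second approach (an opt-in option), the following
Python example re-serializes HTML so that void elements such as img and br
are self-closed. Everything here is an assumption made for illustration: the
names extract, enforce_xml and XmlSerializer are hypothetical and are not
part of the connector, which would need an equivalent written in Java. The
sketch only handles void elements, attribute quoting and escaping; it does
not repair other unclosed tags (for example p or li), and it drops comments
and the doctype.

```python
from html import escape
from html.parser import HTMLParser

# HTML void elements: they never have a closing tag in HTML.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class XmlSerializer(HTMLParser):
    """Hypothetical helper: re-emits parsed HTML with XML-style tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _render(self, tag, attrs, self_close):
        # Valueless attributes (e.g. "disabled") get their name as value,
        # because XML requires every attribute to have a quoted value.
        rendered = "".join(
            ' {}="{}"'.format(name, escape(name if value is None else value, quote=True))
            for name, value in attrs
        )
        return "<{}{}{}>".format(tag, rendered, "/" if self_close else "")

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, tag in VOID_ELEMENTS))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, True))

    def handle_endtag(self, tag):
        # Void elements were already self-closed when opened.
        if tag not in VOID_ELEMENTS:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Character references were decoded by the parser; escape again.
        self.parts.append(escape(data, quote=False))


def extract(html: str, enforce_xml: bool = False) -> str:
    """Return the HTML unchanged, or an XML-friendly version if requested."""
    if not enforce_xml:
        return html
    serializer = XmlSerializer()
    serializer.feed(html)
    serializer.close()
    return "".join(serializer.parts)


print(extract('<div><img src="a.png"><br>Hi &amp; bye</div>', enforce_xml=True))
```

Keeping the flag off by default would leave the current output untouched for
existing jobs, whereas the first approach amounts to always taking the
enforce_xml branch.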

 

Regards,
Julien
