Hi,
I noticed a problem with the HTML extractor connector. When the "Strip HTML" option is disabled, it produces a valid HTML document but not well-formed XML (some tags, such as img, have no closing tag), and in some cases this is a problem. For example, when Tika is used downstream, it processes the document as XML, and most of the time a parse exception is raised and the document content is lost.

I would like to create a ticket for this issue, and I would be glad to propose a patch and commit it myself, but I need two things:

1/ The "HTML extractor" component created in Jira.
2/ Your advice on how to resolve the issue: either we make this connector always output well-formed XML (when the "Strip HTML" option is disabled), or we add a new configuration option that enforces XML output when enabled.

Regards,
Julien
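
P.S. To illustrate the failure mode, here is a minimal sketch. It uses Python's standard XML parser as a stand-in for whatever XML parsing happens downstream; it does not call Tika or the connector. The sample markup and the helper name is_well_formed_xml are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of what the extractor might emit with "Strip HTML"
# disabled: fine as HTML, because img is a void element with no closing tag.
html_output = '<div><p>Logo</p><img src="logo.png"></div>'

# The same fragment with the void element self-closed, which an XML parser accepts.
xml_output = '<div><p>Logo</p><img src="logo.png"/></div>'


def is_well_formed_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The unclosed img tag ends up here as a mismatched-tag error.
        return False


if __name__ == "__main__":
    print(is_well_formed_xml(html_output))  # False
    print(is_well_formed_xml(xml_output))   # True
```

Either of the two options above would amount to emitting the second form instead of the first.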
