I've added the component as requested. As for the advice, I suggest you create a ticket and we can discuss there.

Karl
On Tue, Oct 20, 2020 at 6:24 AM <julien.massi...@francelabs.com> wrote:

> Hi,
>
> I noticed a problem with the HTML extractor connector. It produces valid
> HTML documents (when the "Strip HTML" option is disabled) but invalid XML
> (some tags, such as img, have no closing tag), which is problematic in
> some cases. For example, when Tika is used downstream, it processes the
> document as XML, and most of the time a parse exception is raised and
> the document content is lost.
>
> I would like to create a ticket for this issue, and I would be glad to
> propose a patch and do the commit myself, but I need two things:
>
> 1/ Creation of the "HTML extractor" component in Jira
>
> 2/ Your advice on how to resolve the issue: either we configure this
> connector to always output valid XML (when the "Strip HTML" option is
> disabled), or we add a new configuration option that enforces XML
> output when enabled?
>
> Regards,
> Julien
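
For anyone reproducing the problem described above, here is a minimal sketch of why an unclosed img tag is fine as HTML but fails as XML. It uses only the JDK's built-in JAXP parser, not Tika or the connector's actual code; the class name `XmlWellFormedCheck` and the method `isWellFormedXml` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of ManifoldCF or Tika.
public class XmlWellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for malformed input, e.g. an element that is never closed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        // Valid HTML (img is a void element) but not well-formed XML.
        System.out.println(isWellFormedXml("<p><img src=\"a.png\"></p>"));
        // Self-closed form, acceptable to an XML parser.
        System.out.println(isWellFormedXml("<p><img src=\"a.png\"/></p>"));
    }
}
```

Either of the two proposed fixes would amount to emitting the second form whenever XML-compatible output is wanted.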