Re: HTML extractor produces invalid XML

Karl Wright Tue, 20 Oct 2020 04:35:00 -0700

I've added the component as requested.  As for the advice, I suggest you
create a ticket and we can discuss there.
Karl



On Tue, Oct 20, 2020 at 6:24 AM <[email protected]> wrote:

> Hi,
>
>
>
> I noticed a problem with the HTML extractor connector. When the 'Strip HTML'
> option is disabled, it produces a valid HTML document but invalid XML (some
> tags, such as img, have no closing tag), which is problematic in some cases.
> For example, when Tika is used downstream, it processes the document as XML,
> and most of the time a parse exception is raised and the document content
> is lost.
>
>
>
> I would like to create a ticket for this issue and I would be glad to
> propose a patch and do the commit myself but I need two things:
>
> 1/ Create the "HTML extractor" component in Jira
>
>
>
> 2/ Your advice on how to resolve the issue: either we configure this
> connector to always output valid XML (when the "Strip HTML" option is
> disabled), or we add a new configuration option that enforces XML output
> when enabled?
>
>
>
> Regards,
> Julien
>
>
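
The failure described in the quoted message can be reproduced with a minimal Python sketch. It is not ManifoldCF or Tika code. It only assumes that a strict XML parser is given markup that is valid HTML but leaves a void element such as img unclosed. The sample markup and the helper name is_well_formed_xml are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: valid HTML, but the <img> element is never closed.
HTML_STYLE = '<html><body><p>Logo: <img src="logo.png"></p></body></html>'
# Same content with the void element self-closed, as XML requires.
XML_STYLE = '<html><body><p>Logo: <img src="logo.png"/></p></body></html>'


def is_well_formed_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # The unclosed <img> makes the following </p> a mismatched tag.
        return False


if __name__ == "__main__":
    print(is_well_formed_xml(HTML_STYLE))
    print(is_well_formed_xml(XML_STYLE))
```

Either resolution proposed in the message (always emitting well-formed XML, or doing so behind a new option) would amount to producing the second form of the markup rather than the first.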
