[ https://issues.apache.org/jira/browse/TIKA-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952838#comment-17952838 ]
Tim Allison commented on TIKA-4419:
-----------------------------------
I grepped through our cc-html corpus for self-closing tags. There's plenty of
noise. Regexes aren't perfect, nor are the files. I've attached the top 500
self-closing tags.
I did a crosswalk with https://www.w3schools.com/tags and came up with a
list of about 30 tags. I'll run with that on branch TIKA-4419 and see what we
find.
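
For anyone who wants to try a similar tally on their own files, here is a rough sketch of the count-then-crosswalk idea. It is not the script used for the attached list: the regex, the function names, the sample documents and the KNOWN_TAGS set are all assumptions made for illustration, and a regex like this will pick up the same kind of noise mentioned above (for example an attribute value that happens to end in "/").

```python
import re
from collections import Counter

# Heuristic match for tags written as <name ... />. This is a regex, not an
# HTML parser, so expect false positives and misses on messy markup.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9:-]*)(?:\s[^<>]*?)?\s*/>")


def count_self_closing(docs):
    """Count self-closing tag names (lowercased) across an iterable of HTML strings."""
    counts = Counter()
    for html in docs:
        for match in SELF_CLOSING.finditer(html):
            counts[match.group(1).lower()] += 1
    return counts


def crosswalk(counts, known_tags):
    """Keep only tags found in a reference list, most frequent first."""
    return [(tag, n) for tag, n in counts.most_common() if tag in known_tags]


# Hypothetical stand-ins for corpus files and for a reference list of tag names.
sample_docs = [
    "<p>a<br/>b<BR />c</p>",
    '<img src="x.png"/><foo:bar/><div class="a"/>',
]
KNOWN_TAGS = {"br", "img", "div", "p"}

counts = count_self_closing(sample_docs)
top = counts.most_common(500)        # analogous to a "top 500" listing
kept = crosswalk(counts, KNOWN_TAGS)  # drops names such as foo:bar
print(top)
print(kept)
```

On a real corpus the strings would come from reading files, and the reference set would be whatever tag list is being compared against.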
> Deal with self-closeable tags handling change in jsoup 1.20.1
> -------------------------------------------------------------
>
> Key: TIKA-4419
> URL: https://issues.apache.org/jira/browse/TIKA-4419
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: tags-top500.txt
>
>
> On TIKA-4411, [~tilman] found a significant change in behavior for how jsoup
> 1.20.1 is handling self-closing tags. We need to figure out how to deal with
> this in a reasonable way.
>
> Ref:
> https://issues.apache.org/jira/browse/TIKA-4411?focusedCommentId=17952615&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17952615