[ https://issues.apache.org/jira/browse/TIKA-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17768076#comment-17768076 ]
ASF GitHub Bot commented on TIKA-1599:
--------------------------------------

tballison commented on PR #1356:
URL: https://github.com/apache/tika/pull/1356#issuecomment-1731703997

   This leaves the tagsoup html parser where it was for now. We need to figure out if we want to delete it, keep it where it is or move it to a new module.


> Switch from TagSoup to JSoup
> ----------------------------
>
>                 Key: TIKA-1599
>                 URL: https://issues.apache.org/jira/browse/TIKA-1599
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.7, 1.8
>            Reporter: Kenneth William Krugler
>            Assignee: Kenneth William Krugler
>            Priority: Major
>         Attachments: TIKA-1599-crazy-files.tar.gz, consumentenbond.html,
> tagsoup_vs_jsoup_reports.zip
>
>
> There are several Tika issues related to how TagSoup cleans up HTML
> ([TIKA-381], [TIKA-985], maybe [TIKA-715]), but TagSoup doesn't seem to be
> under active development.
> On the other hand I know of several projects that are now using
> [JSoup|https://github.com/jhy/jsoup], which is an active project (albeit only
> one main contributor) under the MIT license.
> I haven't looked into how hard it would be to switch this dependency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)