[ https://issues.apache.org/jira/browse/TIKA-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler resolved TIKA-2539.
-------------------------------
    Resolution: Duplicate

> TagSoup HTML parser is project EOL
> ----------------------------------
>
>                 Key: TIKA-2539
>                 URL: https://issues.apache.org/jira/browse/TIKA-2539
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.16, 1.17
>         Environment: All
>            Reporter: Richard Jones
>
> The TagSoup HTML parser is project EOL, and the last update was to create the
> 1.2.1 version (that Tika references) back in Aug 2011.
> I cannot find any TagSoup forks that are still active but there are many
> alternative (and perhaps better if you believe the reviews and wikipedia
> comparisons) html parsers out there.
> Perhaps the most active is already pulled in by Tika as a transitive
> dependency of edu.ucar:grib, and that is jsoup with over 1,000 usages and
> updates as recent as a few months ago:
> https://mvnrepository.com/artifact/org.jsoup/jsoup
> https://jsoup.org/
> Requesting consideration of moving away from the long EOL'd TagSoup to an
> active and modern HTML parser like jsoup that is already a transitive Tika
> dependency.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)