[ https://issues.apache.org/jira/browse/ANY23-324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney resolved ANY23-324.
----------------------------------------
    Resolution: Fixed

> Replace net.sourceforge.nekohtml with jsoup
> --------------------------------------------
>
>                 Key: ANY23-324
>                 URL: https://issues.apache.org/jira/browse/ANY23-324
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core
>            Reporter: Lewis John McGibbney
>            Priority: Major
>             Fix For: 2.2
>
>
> A long-standing issue relates to the performance of the existing default
> [TagSoupParser.java|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TagSoupParser.java].
> A number of open issues now stem from limitations in the way nekohtml
> parses HTML5, for example
> [ANY23-317|https://issues.apache.org/jira/browse/ANY23-317],
> [ANY23-273|https://issues.apache.org/jira/browse/ANY23-273] and
> [ANY23-267|https://issues.apache.org/jira/browse/ANY23-267]; there are
> several others.
> I propose to mark the TagSoupParser.java implementation as @Deprecated for
> the next release (possibly making it configurable via
> default-configuration.properties). I also propose to replace it with
> jsoup (https://jsoup.org/). AFAIK, Apache Tika also did this several years ago.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)