[ 
https://issues.apache.org/jira/browse/ANY23-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337404#comment-16337404
 ] 

ASF GitHub Bot commented on ANY23-324:
--------------------------------------

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/58
  
    @lewismc Yeah, I just realized that this PR fixes none of the issues we 
thought it would, because the TagSoupParser is not what was causing the 
problem; the semargl parser is. Don't worry, I've got another PR coming 
shortly!


> Replace net.sourceforge.nekohtml with jsoup 
> --------------------------------------------
>
>                 Key: ANY23-324
>                 URL: https://issues.apache.org/jira/browse/ANY23-324
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core
>            Reporter: Lewis John McGibbney
>            Priority: Major
>             Fix For: 2.2
>
>
> A long-standing issue relates to the performance of the existing default 
> [TagSoupParser.java|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TagSoupParser.java].
> A number of issues now also relate to limitations in the way nekohtml 
> parses HTML5, for example 
> [ANY23-317|https://issues.apache.org/jira/browse/ANY23-317], 
> [ANY23-273|https://issues.apache.org/jira/browse/ANY23-273], 
> [ANY23-267|https://issues.apache.org/jira/browse/ANY23-267], and 
> several others.
> I propose to mark the TagSoupParser.java implementation as @Deprecated for 
> the next release (possibly making it configurable via 
> default-configuration.properties). I also propose to replace it with 
> https://jsoup.org/. AFAIK, Apache Tika also did this several years ago.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
