Richard Jones created TIKA-2539:
-----------------------------------

             Summary: TagSoup HTML parser is project EOL
                 Key: TIKA-2539
                 URL: https://issues.apache.org/jira/browse/TIKA-2539
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.17, 1.16
         Environment: All
            Reporter: Richard Jones


The TagSoup HTML parser has reached end of life (EOL) as a project. Its last 
release, version 1.2.1 (the one Tika references), dates back to Aug 2011.
I cannot find any TagSoup forks that are still active, but there are many 
alternative HTML parsers out there (perhaps better ones, if you believe the 
reviews and Wikipedia comparisons).
Perhaps the most active is jsoup, which Tika already pulls in as a transitive 
dependency of edu.ucar:grib. It has over 1,000 usages and updates as recent as 
a few months ago:
https://mvnrepository.com/artifact/org.jsoup/jsoup
https://jsoup.org/
I am requesting that we consider moving away from the long-EOL'd TagSoup to an 
active, modern HTML parser such as jsoup, which is already a transitive Tika 
dependency.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
