[ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392248#comment-14392248 ]
Rupert Westenthaler commented on TIKA-676:
------------------------------------------

FYI: This issue is still present. I recently hit it with http://www.mico-project.eu/mico_team/adam-dahlgren-lindstrom/

I implemented a simple workaround by wrapping boilerpipe with

{code}
import org.apache.tika.sax.ContentHandlerDecorator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TIKA676WorkaroundHandler extends ContentHandlerDecorator {

    private final Logger log = LoggerFactory.getLogger(getClass());

    public static final String A = "a";

    // true while an A element has been opened but not yet closed
    private boolean inLink = false;

    public TIKA676WorkaroundHandler(ContentHandler handler) {
        super(handler == null ? new DefaultHandler() : handler);
    }

    @Override
    public void startElement(String elemUri, String localName, String name,
            Attributes atts) throws SAXException {
        if (A.equalsIgnoreCase(localName)) {
            if (inLink) {
                // a nested A would make boilerpipe throw, so close the open one first
                log.warn(" - closing open link before next one is starting!");
                endElement(elemUri, localName, name);
            }
            inLink = true;
        }
        super.startElement(elemUri, localName, name, atts);
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        if (A.equalsIgnoreCase(localName)) {
            if (inLink) {
                super.endElement(uri, localName, name);
                inLink = false;
            } else {
                // no link is open (it was already closed above), so drop this end tag
                log.warn(" - ignoring closing link that was missing before");
            }
        } else {
            super.endElement(uri, localName, name);
        }
    }
}
{code}

> Boilerpipe fails
> ----------------
>
>                 Key: TIKA-676
>                 URL: https://issues.apache.org/jira/browse/TIKA-676
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>
> This is apparently a [boilerpipe issue|http://code.google.com/p/boilerpipe/issues/detail?id=24], which they fixed in the [Web API edition|http://boilerpipe-web.appspot.com/].
> {code}
> $ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
> 100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
>     at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
>     at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
>     at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
>     at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
>     at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
>     at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
>     at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>     at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
>     at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
>     at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
>     at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
>     at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
>     at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>     at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
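
For anyone who wants to see what the workaround in the comment does to the SAX event stream without pulling in Tika or boilerpipe, here is a minimal sketch that uses only JDK classes. It is not the handler from the comment: it swaps Tika's ContentHandlerDecorator for the JDK's XMLFilterImpl, and the names NestedLinkDemo, LinkBalancingFilter, Recorder and replayNestedLinks are made up for illustration. It replays a nested pair of A elements and records what reaches the downstream handler.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

final class NestedLinkDemo {

    /** Hypothetical stand-in for the workaround: forwards events but never lets two A elements nest. */
    static final class LinkBalancingFilter extends XMLFilterImpl {
        private boolean inLink = false;

        LinkBalancingFilter(ContentHandler downstream) {
            setContentHandler(downstream);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if ("a".equalsIgnoreCase(localName)) {
                if (inLink) {
                    // Close the link that is still open before the next one starts.
                    super.endElement(uri, localName, qName);
                }
                inLink = true;
            }
            super.startElement(uri, localName, qName, atts);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if ("a".equalsIgnoreCase(localName)) {
                if (!inLink) {
                    return; // no link is open, so drop the stray end tag
                }
                inLink = false;
            }
            super.endElement(uri, localName, qName);
        }
    }

    /** Records the element events that reach the downstream handler. */
    static final class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("<" + localName + ">");
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("</" + localName + ">");
        }
    }

    /** Replays the event sequence <a><a></a></a> through the filter. */
    static List<String> replayNestedLinks() {
        Recorder recorder = new Recorder();
        ContentHandler handler = new LinkBalancingFilter(recorder);
        Attributes none = new AttributesImpl();
        try {
            handler.startElement("", "a", "a", none);
            handler.startElement("", "a", "a", none); // nested link, as in the failing pages
            handler.endElement("", "a", "a");
            handler.endElement("", "a", "a");         // now a stray end tag
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.events;
    }

    public static void main(String[] args) {
        System.out.println(String.join("", replayNestedLinks()));
    }
}
```

The downstream handler sees two sibling links instead of a nested pair, which is the shape boilerpipe accepts. The price is that text after the inner link's end tag is no longer inside any link, which is probably acceptable for boilerplate removal but does change the document structure.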