[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720694#comment-13720694 ]
Ray Gauss II commented on TIKA-1154:
------------------------------------

I've been pushing the metadata-extractor Maven release through Sonatype thus far, but Mr. Noakes has been granted access there [1].

If there's no response to your Google code issue I can push a 2.6.2.1 release that upgrades xercesImpl to 2.11.0 which, on first look, compiles and has no test failures.

[1] https://issues.sonatype.org/browse/OSSRH-3948

> Tika hangs on format detection of malformed HTML file.
> ------------------------------------------------------
>
>                 Key: TIKA-1154
>                 URL: https://issues.apache.org/jira/browse/TIKA-1154
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.4
>            Reporter: Andrew Jackson
>            Priority: Minor
>         Attachments: tika-breaker.html
>
>
> We are using Tika on large web archives, which also happen to contain some
> malformed files. In particular, we found a HTML file with binary characters
> in the DOCTYPE declaration. This hangs Tika, either embedded or from the
> command line, during format detection.
> An example file is attached.