[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719594#comment-13719594 ]
Andrew Jackson commented on TIKA-1154:
--------------------------------------

We could exclude the package from coming in via the metadata-extractor dependency and include the later version as a top-level dependency, but if there have been significant API changes between 2.8.1 and 2.10.0 then this could cause problems.

I can submit an issue at https://code.google.com/p/metadata-extractor/issues/list and see if they're willing to upgrade?

> Tika hangs on format detection of malformed HTML file.
> ------------------------------------------------------
>
> Key: TIKA-1154
> URL: https://issues.apache.org/jira/browse/TIKA-1154
> Project: Tika
> Issue Type: Bug
> Components: mime
> Affects Versions: 1.4
> Reporter: Andrew Jackson
> Priority: Minor
> Attachments: tika-breaker.html
>
>
> We are using Tika on large web archives, which also happen to contain some
> malformed files. In particular, we found a HTML file with binary characters
> in the DOCTYPE declaration. This hangs Tika, either embedded or from the
> command line, during format detection.
> An example file is attached.