[ https://issues.apache.org/jira/browse/TIKA-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371900#comment-14371900 ]
Tyler Palsulich edited comment on TIKA-1154 at 3/20/15 7:19 PM:
----------------------------------------------------------------

Marking as Fixed, since the file is detected and parsed without issue. Not sure what was happening before! Thank you, [~anjackson]!

was (Author: tpalsulich):
Marking as Fixed, since the file is detected and parsed without issue. Not sure what was happening before! Thanks!

> Tika hangs on format detection of malformed HTML file.
> ------------------------------------------------------
>
>                 Key: TIKA-1154
>                 URL: https://issues.apache.org/jira/browse/TIKA-1154
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.4
>            Reporter: Andrew Jackson
>            Priority: Minor
>         Attachments: tika-breaker.html
>
>
> We are using Tika on large web archives, which also happen to contain some
> malformed files. In particular, we found a HTML file with binary characters
> in the DOCTYPE declaration. This hangs Tika, either embedded or from the
> command line, during format detection.
> An example file is attached.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)