[ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432218#comment-13432218 ]
Ken Krugler commented on TIKA-881:
----------------------------------

I've asked Jukka to look into this. From my email to tika-dev:

{quote}
The fix that Klaus provided avoids using reset() on the input stream. But I thought that Tika tries to wrap streams such that a reset() will work properly, as otherwise auto detection of content can fail.

I haven't had to dig into all of the tricky issues around stream management, so I'm hoping you can take a look at Klaus's report and provide commentary.
{quote}

> HtmlParser sometimes(!) throws IOException while determining Html-Encoding
> --------------------------------------------------------------------------
>
>                 Key: TIKA-881
>                 URL: https://issues.apache.org/jira/browse/TIKA-881
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows7, JDK1.5, JDK1.6
>            Reporter: Klaus v. Einem
>            Assignee: Ken Krugler
>              Labels: stability
>         Attachments: BugfixHtmlParser.java, HtmlParser.java
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Attention, this is an ugly one: A non deterministic Bug. It fails only 1 out
> of 10 (approximately).
> java.io.IOException: Resetting to invalid mark
>       at java.io.BufferedInputStream.reset(Unknown Source)
>       at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
>       at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
>       at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)
> In the getEncoding()-Method: To re-read() the input stream, the current read
> position is marked and the readlimit (maximum number of bytes to be read
> before the mark position gets invalidated) is given.
> So far so good, but then an InputStreamReader comes into play. When you check
> the API-Doc you see this:
>  * ...
>  * To enable the efficient conversion of bytes to characters, more bytes may
>  * be read ahead from the underlying stream than are necessary to satisfy the
>  * current read operation.
>  * ...
> Please notice the term "may"... So, when this happens the following reset()
> on the stream will throw the Exception because the mark position gets
> invalidated (the number of read bytes exceeds the readlimit).
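
The mechanism described in the report can be sketched in isolation. The following is a minimal, standalone illustration and not the Tika code or the attached fix: the class name MarkResetDemo, the constant READ_LIMIT, and both helper methods are hypothetical, and the buffer size is kept deliberately small so that the reader's read-ahead overruns the mark. Whether reset() actually fails depends on how many bytes the decoder pulls, which is why the first method reports the outcome instead of assuming it. The second method shows one possible way to stay within the readlimit, by reading the bytes directly instead of through an InputStreamReader.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class MarkResetDemo {

    // Hypothetical limit, kept small so the read-ahead exceeds it easily.
    static final int READ_LIMIT = 16;

    /**
     * Marks the stream, reads a single char through an InputStreamReader,
     * then tries to reset. Returns true if reset() threw an IOException.
     */
    static boolean resetFailsAfterReader(byte[] data) throws IOException {
        InputStream stream =
                new BufferedInputStream(new ByteArrayInputStream(data), READ_LIMIT);
        stream.mark(READ_LIMIT);
        Reader reader = new InputStreamReader(stream, StandardCharsets.US_ASCII);
        // Only one char is requested, but the decoder may read ahead
        // more bytes than READ_LIMIT from the underlying stream.
        reader.read();
        try {
            stream.reset();
            return false;
        } catch (IOException e) {
            // Typically "Resetting to invalid mark"
            return true;
        }
    }

    /**
     * Reads at most READ_LIMIT bytes directly from the stream, so the
     * mark cannot be invalidated, then resets and returns the bytes as text.
     */
    static String peek(InputStream stream) throws IOException {
        stream.mark(READ_LIMIT);
        byte[] head = new byte[READ_LIMIT];
        int total = 0;
        while (total < head.length) {
            int n = stream.read(head, total, head.length - total);
            if (n < 0) {
                break; // end of stream before the limit was reached
            }
            total += n;
        }
        stream.reset();
        return new String(head, 0, total, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<meta charset=utf-8> plus more content past the limit"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println("reset failed after reader: " + resetFailsAfterReader(data));

        InputStream stream =
                new BufferedInputStream(new ByteArrayInputStream(data), READ_LIMIT);
        System.out.println("peeked: " + peek(stream));
        System.out.println("next byte after reset: " + (char) stream.read());
    }
}
```

Decoding from a bounded byte array keeps the number of bytes consumed under the caller's control, at the cost of handling multi-byte encodings by hand if the limit falls in the middle of a character.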