[ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235527#comment-13235527 ]
Klaus v. Einem edited comment on TIKA-881 at 3/22/12 1:17 PM:
--------------------------------------------------------------

BugfixHtmlParser.java: This is my workaround... Sorry, the comments are in German.
The key is: no InputStreamReader, no cry! Read into a *byte* array and decode it (afterwards) with the String constructor.

      was (Author: v.einem):
    This is my solution... Sorry, the comments are in German.
The key is: no InputStreamReader, no cry! Read into a *byte* array and decode it (afterwards) with the String constructor.

> HtmlParser sometimes(!) throws IOException while determining Html-Encoding
> --------------------------------------------------------------------------
>
>                 Key: TIKA-881
>                 URL: https://issues.apache.org/jira/browse/TIKA-881
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows7, JDK1.5, JDK1.6
>            Reporter: Klaus v. Einem
>              Labels: stability
>         Attachments: BugfixHtmlParser.java, HtmlParser.java
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Attention, this is an ugly one: a non-deterministic bug. It fails only about
> 1 time out of 10.
> java.io.IOException: Resetting to invalid mark
>       at java.io.BufferedInputStream.reset(Unknown Source)
>       at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
>       at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
>       at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)
> In the getEncoding() method: so that the input stream can be re-read, the
> current read position is marked and a readlimit (the maximum number of bytes
> that can be read before the mark position gets invalidated) is given.
> So far so good, but then an InputStreamReader comes into play. When you check
> the API doc you see this:
>  * ...
>  * To enable the efficient conversion of bytes to characters, more bytes may
>  * be read ahead from the underlying stream than are necessary to satisfy the
>  * current read operation.
>  * ...
> Please notice the term "may"... So, when this happens, the following reset()
> on the stream will throw the exception because the mark position gets
> invalidated (the number of bytes read exceeds the readlimit).
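
The workaround described in the comment (read raw bytes, decode afterwards) might look roughly like the sketch below. It is not the attached BugfixHtmlParser.java. The class name PeekSketch, the method peekPrefix, the 16-byte limit, and the ISO-8859-1 charset are all assumptions made for illustration. The point is only that the code itself controls how many bytes are consumed between mark() and reset(), so it never reads past the readlimit.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper, not part of Tika.
class PeekSketch {

    // Reads at most 'limit' bytes, rewinds the stream, and decodes the
    // bytes afterwards. No InputStreamReader is involved, so nothing
    // reads ahead of what this method asks for.
    static String peekPrefix(InputStream in, int limit, Charset charset)
            throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buffer = new byte[limit];
        int total = 0;
        try {
            // read() may return fewer bytes than requested, so loop until
            // the buffer is full or the stream ends.
            while (total < limit) {
                int n = in.read(buffer, total, limit - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        } finally {
            // At most 'limit' bytes were consumed, so the mark is still valid.
            in.reset();
        }
        return new String(buffer, 0, total, charset);
    }

    public static void main(String[] args) throws IOException {
        Charset latin1 = Charset.forName("ISO-8859-1");
        byte[] html = "<html><meta charset=\"utf-8\"></html>".getBytes(latin1);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(html));
        System.out.println(peekPrefix(in, 16, latin1));
        // The stream is back at its start and can be parsed from the beginning.
        System.out.println((char) in.read());
    }
}
```

Decoding with a single-byte charset such as ISO-8859-1 is a choice made here so that a truncated prefix can never end in the middle of a multi-byte character; whether that is suitable for encoding detection depends on the caller.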