HtmlParser sometimes(!) throws IOException while determining Html-Encoding
--------------------------------------------------------------------------

                 Key: TIKA-881
                 URL: https://issues.apache.org/jira/browse/TIKA-881
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
         Environment: Windows7, JDK1.5, JDK1.6
            Reporter: Klaus v. Einem


Attention, this is an ugly one: a non-deterministic bug. It fails in roughly 
1 out of 10 runs. 

java.io.IOException: Resetting to invalid mark
        at java.io.BufferedInputStream.reset(Unknown Source)
        at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
        at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)

In the getEncoding() method, the current read position is marked so that the 
input stream can be re-read afterwards, and a readlimit (the maximum number of 
bytes that may be read before the mark becomes invalid) is passed to mark(). 
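
For illustration, the mark/reset contract can be sketched as follows. This is 
not the Tika code; the class MarkResetContract and the method peek() are 
made-up names. The sketch reads directly from the stream, so it never consumes 
more than readLimit bytes and the mark stays valid:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika.
class MarkResetContract {

    // Returns up to readLimit bytes from the current position and rewinds
    // the stream, so the caller can read the same bytes again.
    static byte[] peek(InputStream in, int readLimit) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported");
        }
        in.mark(readLimit);               // mark is valid for readLimit bytes
        byte[] buf = new byte[readLimit];
        int total = 0;
        try {
            int n;
            // Read directly from the stream: at most readLimit bytes in total.
            while (total < readLimit
                    && (n = in.read(buf, total, readLimit - total)) != -1) {
                total += n;
            }
        } finally {
            in.reset();                   // safe: the limit was not exceeded
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<html><head><meta charset=\"UTF-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 16);
        byte[] head = peek(in, 32);
        System.out.println("peeked " + head.length + " bytes");
        System.out.println("first byte after reset: " + (char) in.read());
    }
}
```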

So far so good, but then an InputStreamReader comes into play. Its API 
documentation says: 
 * ...
 * To enable the efficient conversion of bytes to characters, more bytes may
 * be read ahead from the underlying stream than are necessary to satisfy the
 * current read operation.
 * ...

Please note the word "may". When the reader does read ahead, it consumes more 
bytes from the stream than the readlimit allows, the mark is invalidated, and 
the subsequent reset() on the stream throws the exception.
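
The following sketch tries to make the effect visible and shows one possible 
way around it. It is not the Tika code and not necessarily the fix that should 
go into Tika; ReadAheadSketch, resetSurvivesReader() and peekAsText() are 
hypothetical names, and US-ASCII is only assumed as the charset for sniffing. 
Whether the first method actually loses the mark depends on how much the 
decoder reads ahead, which is JDK- and stream-dependent. The second method 
reads the raw bytes itself, resets, and only then decodes its private copy:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of Tika.
class ReadAheadSketch {

    // Fragile pattern: a reader is wrapped around the marked stream.
    // The decoder is free to pull more than readLimit bytes from 'in',
    // in which case reset() fails. Returns true if reset() succeeded.
    static boolean resetSurvivesReader(InputStream in, int readLimit)
            throws IOException {
        in.mark(readLimit);
        Reader reader = new InputStreamReader(in, StandardCharsets.US_ASCII);
        reader.read(new char[readLimit]); // may read ahead on the byte level
        try {
            in.reset();
            return true;
        } catch (IOException e) {         // e.g. "Resetting to invalid mark"
            return false;
        }
    }

    // Safer pattern: read at most readLimit bytes directly, rewind,
    // then decode the bytes that were copied out of the stream.
    static String peekAsText(InputStream in, int readLimit) throws IOException {
        in.mark(readLimit);
        byte[] buf = new byte[readLimit];
        int total = 0;
        try {
            int n;
            while (total < readLimit
                    && (n = in.read(buf, total, readLimit - total)) != -1) {
                total += n;
            }
        } finally {
            in.reset();                   // never more than readLimit consumed
        }
        return new String(buf, 0, total, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><head><meta charset=\"UTF-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);

        // A small buffer (16 bytes) is used to make a lost mark more likely.
        InputStream fragile = new BufferedInputStream(new ByteArrayInputStream(html), 16);
        System.out.println("reset survived reader: " + resetSurvivesReader(fragile, 16));

        InputStream safe = new BufferedInputStream(new ByteArrayInputStream(html), 16);
        System.out.println("peeked text: " + peekAsText(safe, 16));
        System.out.println("first byte after reset: " + (char) safe.read());
    }
}
```

The design choice in peekAsText() is simply that the code which calls mark() 
also controls how many bytes are consumed, instead of leaving that to the 
decoder.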




        
