[ 
https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus v. Einem updated TIKA-881:
--------------------------------

    Attachment: BugfixHtmlParser.java

This is my solution. Sorry, the comments are in German. The key is: no 
InputStreamReader, no cry! Read the bytes into a *byte* array and decode it 
afterwards with the String constructor.
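
For readers without the attachment, here is a minimal sketch of that idea. It 
is not the attached BugfixHtmlParser.java and not Tika's code; the class name 
PrefixSniffer, the method peek() and the ISO-8859-1 charset are assumptions 
made for illustration. It reads at most `limit` bytes itself, so the mark can 
never be overrun, and only afterwards turns the bytes into a String.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class PrefixSniffer {
    // Hypothetical helper: returns up to `limit` leading bytes as text and
    // leaves the stream positioned where it was before the call.
    static String peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        try {
            // Read bytes directly; never ask for more than `limit` in total.
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
        } finally {
            in.reset(); // still valid: at most `limit` bytes were consumed
        }
        // Decode afterwards. ISO-8859-1 maps every byte to one char, which is
        // enough to look for an ASCII charset declaration (an assumption here).
        return new String(buf, 0, total, "ISO-8859-1");
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><meta charset=utf-8>".getBytes("ISO-8859-1");
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(html));
        System.out.println(peek(in, 6));
        System.out.println((char) in.read()); // stream was rewound by peek()
    }
}
```

The String constructor with a charset name is used because it is available on 
the JDK versions listed in the issue.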
                
> HtmlParser sometimes(!) throws IOException while determining Html-Encoding
> --------------------------------------------------------------------------
>
>                 Key: TIKA-881
>                 URL: https://issues.apache.org/jira/browse/TIKA-881
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows7, JDK1.5, JDK1.6
>            Reporter: Klaus v. Einem
>              Labels: stability
>         Attachments: BugfixHtmlParser.java
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Attention, this is an ugly one: a non-deterministic bug. It fails only in 
> about 1 out of 10 runs. 
> java.io.IOException: Resetting to invalid mark
>       at java.io.BufferedInputStream.reset(Unknown Source)
>       at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
>       at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
>       at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)
> In the getEncoding() method: to be able to re-read the input stream, the 
> current read position is marked and a readlimit (the maximum number of bytes 
> that can be read before the mark position is invalidated) is given. 
> So far so good, but then an InputStreamReader comes into play. When you check 
> the API doc you see this: 
>  * ...
>  * To enable the efficient conversion of bytes to characters, more bytes may
>  * be read ahead from the underlying stream than are necessary to satisfy the
>  * current read operation.
>  * ...
> Please note the word "may". When this happens, the subsequent reset() on the 
> stream throws the exception, because the mark position has been invalidated 
> (the number of bytes read exceeds the readlimit).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
