[
https://issues.apache.org/jira/browse/ANY23-441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916526#comment-16916526
]
Anthony Pessy commented on ANY23-441:
-------------------------------------
Updating jsoup seems to resolve the issue too.
> TikaEncodingDetector: guessEncoding may throw an
> ArrayIndexOutOfBoundsException
> --------------------------------------------------------------------------------
>
> Key: ANY23-441
> URL: https://issues.apache.org/jira/browse/ANY23-441
> Project: Apache Any23
> Issue Type: Bug
> Components: encoding
> Affects Versions: 2.3
> Reporter: Anthony Pessy
> Priority: Major
>
> Using `TikaEncodingDetector.guessEncoding` may result in an
> `ArrayIndexOutOfBoundsException`.
>
> The following snippet:
> {noformat}
> String encoding = new TikaEncodingDetector().guessEncoding(
>     new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
> System.out.println(encoding);
> {noformat}
> Will result in the following exception:
> {noformat}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
>     at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
>     at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
>     at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
>     at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
>     at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
>     at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
>     at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
> {noformat}
> The expected result is `ISO-8859-15`.
> Note that the page contains a large amount of HTML after the closing `</html>` tag.
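Until a fix is in place, a caller could guard the detection call so that an unchecked exception such as the one above falls back to a default charset. The following is only a minimal sketch under assumptions: `SafeEncodingGuess` and `guessOrDefault` are hypothetical names, not part of Any23, and the detector is modelled as a plain `Function` so the example runs without Any23 or jsoup on the classpath. A real call to `guessEncoding` would most likely also need to handle `IOException`.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.function.Function;

public class SafeEncodingGuess {

    // Hypothetical helper (not an Any23 API): runs the detector and returns
    // the fallback if the detector fails with an unchecked exception or
    // returns null.
    public static String guessOrDefault(Function<InputStream, String> detector,
                                        InputStream input,
                                        String fallback) {
        try {
            String guessed = detector.apply(input);
            return guessed != null ? guessed : fallback;
        } catch (RuntimeException e) {
            // ArrayIndexOutOfBoundsException is a RuntimeException,
            // so the failure reported in this issue would land here.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Stand-in detector that fails the same way as the reported stack trace.
        Function<InputStream, String> failing = in -> {
            throw new ArrayIndexOutOfBoundsException(
                    "Index -1 out of bounds for length 32768");
        };
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        System.out.println(guessOrDefault(failing, empty, "UTF-8"));
    }
}
```

This only hides the symptom: the page would be reported with the fallback charset rather than the expected `ISO-8859-15`, so it is a stopgap and not a fix.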
>
> Replacing:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);
> {code}
> with:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(100);
> {code}
>
> fixes the issue. I am not sure why; perhaps at some point the tracked errors are
> too far back and the reader cannot reset far enough.
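For context, the argument to `ParseErrorList.tracking` is the maximum number of parse errors to keep, so the change above caps the list at 100 entries instead of leaving it effectively unbounded. Whether the cap itself is what avoids the exception is not established in this report. The toy sketch below only illustrates what capped recording means; `BoundedErrorLog` is a hypothetical stand-in written for this example and is not jsoup's class.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundedErrorLog {
    private final int maxSize;
    private final List<String> errors = new ArrayList<>();

    public BoundedErrorLog(int maxSize) {
        this.maxSize = maxSize;
    }

    // True while the log still has room for another entry.
    public boolean canAdd() {
        return errors.size() < maxSize;
    }

    // Records the message only while under the cap; returns whether it was kept.
    public boolean record(String message) {
        if (!canAdd()) {
            return false;
        }
        errors.add(message);
        return true;
    }

    public int size() {
        return errors.size();
    }

    public static void main(String[] args) {
        BoundedErrorLog log = new BoundedErrorLog(100);
        for (int i = 0; i < 250; i++) {
            log.record("parse error at position " + i);
        }
        // Only the first 100 errors are retained; later ones are dropped.
        System.out.println(log.size());
    }
}
```

With a cap in place, any work done only when an error is actually recorded stops happening once the limit is reached, which may be relevant to why the smaller limit changes the behaviour.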
>
>
--
This message was sent by Atlassian Jira
(v8.3.2#803003)