[ https://issues.apache.org/jira/browse/ANY23-441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938195#comment-16938195 ]
Hans Brende commented on ANY23-441:
-----------------------------------

FYI to anyone who's paying attention, this is the result of a jsoup bug which can be viewed here: [https://github.com/jhy/jsoup/issues/1251]

> TikaEncodingDetector: guessEncoding may throws an ArrayIndexOutOfBoundsException
> --------------------------------------------------------------------------------
>
>                 Key: ANY23-441
>                 URL: https://issues.apache.org/jira/browse/ANY23-441
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Anthony Pessy
>            Priority: Major
>             Fix For: 2.5
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> Using `TikaEncodingDetector.guessEncoding` may result in an `ArrayIndexOutOfBoundsException`.
>
> The following snippet:
> {noformat}
> String encoding = new TikaEncodingDetector().guessEncoding(new URL("https://www.streetpadel.com/overgrip-head-pro-grip-dz-negro-p-17233.html").openStream());
> System.out.println(encoding);
> {noformat}
> Will result in the following exception:
> {noformat}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 32768
>     at org.jsoup.parser.CharacterReader.consume(CharacterReader.java:100)
>     at org.jsoup.parser.TokeniserState$34.read(TokeniserState.java:556)
>     at org.jsoup.parser.Tokeniser.read(Tokeniser.java:57)
>     at org.jsoup.parser.TreeBuilder.runParser(TreeBuilder.java:64)
>     at org.jsoup.parser.HtmlTreeBuilder.parseFragment(HtmlTreeBuilder.java:126)
>     at org.jsoup.parser.Parser.parseFragment(Parser.java:140)
>     at org.apache.any23.encoding.TikaEncodingDetector.parseFragment(TikaEncodingDetector.java:184)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:95)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:159)
>     at org.apache.any23.encoding.TikaEncodingDetector.guessEncoding(TikaEncodingDetector.java:58)
> {noformat}
> Whereas the expected result is `ISO-8859-15`
> Note the bunch of HTML at the bottom of the page after the `</html>` tag.
>
> Replacing:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(Integer.MAX_VALUE);
> {code}
> By:
> {code:java}
> ParseErrorList htmlErrors = ParseErrorList.tracking(100);
> {code}
>
> Will fix the issue. Not quite sure why, maybe at one point the errors are too far and the reader cannot reset far enough...

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
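
For callers who cannot yet pick up a fixed version, one possible stopgap is to guard the detection call so the `ArrayIndexOutOfBoundsException` reported in this issue does not propagate. The following is only a minimal sketch and is not part of Any23 or jsoup: `SafeEncodingGuess`, `EncodingGuesser`, and `guessOrDefault` are hypothetical names introduced for illustration, and the fallback charset is an assumption the caller has to choose.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper; not part of Any23 or jsoup. */
final class SafeEncodingGuess {

    /** Hypothetical stand-in for a detector such as TikaEncodingDetector. */
    @FunctionalInterface
    interface EncodingGuesser {
        String guessEncoding(InputStream input) throws IOException;
    }

    private SafeEncodingGuess() {}

    /**
     * Runs the guesser and returns the caller-supplied fallback if it fails
     * with the ArrayIndexOutOfBoundsException described in this issue.
     * I/O failures are rethrown unchecked rather than hidden.
     */
    static String guessOrDefault(EncodingGuesser guesser, InputStream input, String fallback) {
        try {
            return guesser.guessEncoding(input);
        } catch (ArrayIndexOutOfBoundsException e) {
            // Assumption: a fallback charset is acceptable to the caller.
            return fallback;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the Any23 classes on the classpath, a call might look like `SafeEncodingGuess.guessOrDefault(in -> new TikaEncodingDetector().guessEncoding(in), stream, "UTF-8")`. This only masks the failure: for the page in this report the fallback would be returned instead of the expected `ISO-8859-15`, so it is no substitute for the upstream fix.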