[ https://issues.apache.org/jira/browse/TIKA-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975402#comment-15975402 ]
Shai Erera commented on TIKA-2328:
----------------------------------

Thanks [~talli...@apache.org]. I did look at that issue before. Just as an FYI: I tried to parse my HTML documents using Jsoup, and it fails when the document contains weird characters, whereas TagSoup parses them just fine. So I think that if Tika moves to Jsoup, it may solve this issue but raise other ones.

> HtmlParser fails when DOCTYPE has unbalanced quotes
> ---------------------------------------------------
>
>                 Key: TIKA-2328
>                 URL: https://issues.apache.org/jira/browse/TIKA-2328
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shai Erera
>
> When attempting to parse HTML documents that start like this:
> {noformat}
> <!DOCTYPE HTML PUBLIC ">
> <head>
> <HEAD>
> <title>PolClub - Polish Page on VicNet - Australia</title>
> {noformat}
> I receive the following exception:
> {noformat}
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>         at java.lang.String.substring(String.java:1967)
>         at org.ccil.cowan.tagsoup.Parser.trimquotes(Parser.java:881)
>         at org.ccil.cowan.tagsoup.Parser.decl(Parser.java:856)
>         at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:557)
>         at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>         at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
> {noformat}
> The problem seems to be in TagSoup's {{Parser.trimquotes}}:
> {code}
> private static String trimquotes(String in) {
>     if (in == null) return in;
>     int length = in.length();
>     if (length == 0) return in;
>     char s = in.charAt(0);
>     char e = in.charAt(length - 1);
>     if (s == e && (s == '\'' || s == '"')) {
>         in = in.substring(1, in.length() - 1);
>     }
>     return in;
> }
> {code}
> Instead of checking for a string of length 0, it should check {{if (length <= 1) return in;}}, as even if the string is of length 1, there's no point trimming the quotes.
> Or, if the desired behavior is to remove the leading quotes only, better protect against this case.
> I know the bug is in TagSoup, but it looks like the code hasn't been touched in 6 years. I hope it's OK to report the bug here.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
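
For reference, the guard suggested in the issue description can be sketched as below. With a one-character input such as a lone double quote, the first and last characters are the same quote, so the original code calls substring(1, 0), which throws the reported exception. This is a standalone illustration, not the TagSoup source: the class name TrimQuotesSketch is hypothetical, and the method body is the one quoted in the issue with only the length check changed.

```java
// Hypothetical standalone class for illustration; not part of TagSoup or Tika.
final class TrimQuotesSketch {

    // Same logic as the quoted trimquotes, except that strings of length 0
    // or 1 are returned unchanged, since they cannot hold a matching pair
    // of quotes.
    static String trimquotes(String in) {
        if (in == null) return in;
        int length = in.length();
        if (length <= 1) return in;
        char s = in.charAt(0);
        char e = in.charAt(length - 1);
        // Strip only when both ends carry the same quote character.
        if (s == e && (s == '\'' || s == '"')) {
            in = in.substring(1, length - 1);
        }
        return in;
    }

    public static void main(String[] args) {
        // A lone quote, as produced by the unbalanced DOCTYPE, no longer throws.
        System.out.println(trimquotes("\""));
        // A properly quoted string is still trimmed.
        System.out.println(trimquotes("\"-//W3C//DTD HTML 4.01//EN\""));
    }
}
```

Whether a lone quote should be returned as-is or dropped depends on the behavior the TagSoup maintainers intend; the sketch takes the first option, as proposed in the issue.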