[jira] [Commented] (TIKA-2328) HtmlParser fails when DOCTYPE has unbalanced quotes

Shai Erera (JIRA) Wed, 19 Apr 2017 12:57:57 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975402#comment-15975402
 ]


Shai Erera commented on TIKA-2328:
----------------------------------

Thanks [~talli...@apache.org]. I had looked at that issue before. Just as an 
FYI: I tried parsing my HTML documents with Jsoup, and it fails when a document 
contains weird characters, whereas TagSoup parses them just fine. 
So if Tika moves to Jsoup, that may solve this issue but raise other ones.

> HtmlParser fails when DOCTYPE has unbalanced quotes
> ---------------------------------------------------
>
>                 Key: TIKA-2328
>                 URL: https://issues.apache.org/jira/browse/TIKA-2328
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shai Erera
>
> When attempting to parse HTML documents that start like this:
> {noformat}
> <!DOCTYPE HTML PUBLIC ">
> <head>
>               <HEAD>
>         <title>PolClub - Polish Page on VicNet - Australia</title>
> {noformat}
> I receive the following exception:
> {noformat}
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>       at java.lang.String.substring(String.java:1967)
>       at org.ccil.cowan.tagsoup.Parser.trimquotes(Parser.java:881)
>       at org.ccil.cowan.tagsoup.Parser.decl(Parser.java:856)
>       at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:557)
>       at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
>       at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:122)
> {noformat}
> The problem seems to be in Tagsoup's {{Parser.trimquotes}}:
> {code}
>       private static String trimquotes(String in) {
>               if (in == null) return in;
>               int length = in.length();
>               if (length == 0) return in;
>               char s = in.charAt(0);
>               char e = in.charAt(length - 1);
>               if (s == e && (s == '\'' || s == '"')) {
>                       in = in.substring(1, in.length() - 1);
>                       }
>               return in;
>               }
> {code}
> Instead of checking only for a string of length 0, it should check 
> {{if (length <= 1) return in;}}, since a string of length 1 has no pair of 
> quotes to trim: when that single character is a quote, the code calls 
> {{substring(1, 0)}}, which throws. Or, if the desired behavior is to remove 
> only the leading quote, the code should guard against this case explicitly.
> I know the bug is in tagsoup, but it looks like the code hasn't been touched 
> in 6 years. I hope it's OK to report the bug here.
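
For illustration, here is a minimal sketch of the guard suggested above. It is 
not a patch against TagSoup: the class name TrimQuotesSketch and the method 
name trimQuotes are hypothetical, and the body only mirrors the quoted 
trimquotes logic with the length check changed from 0 to <= 1.

```java
// Hypothetical stand-alone class for illustration; not part of TagSoup or Tika.
final class TrimQuotesSketch {

    // Strips one pair of matching single or double quotes.
    // Strings shorter than two characters are returned unchanged, so a lone
    // quote character can no longer reach substring(1, 0).
    static String trimQuotes(String in) {
        if (in == null) return null;
        int length = in.length();
        if (length <= 1) return in;
        char s = in.charAt(0);
        char e = in.charAt(length - 1);
        if (s == e && (s == '\'' || s == '"')) {
            return in.substring(1, length - 1);
        }
        return in;
    }

    public static void main(String[] args) {
        // A lone quote, as left over from the unbalanced DOCTYPE above.
        System.out.println(trimQuotes("\""));
        // A normally quoted value still loses its surrounding quotes.
        System.out.println(trimQuotes("\"quoted\""));
    }
}
```

With this guard a lone quote is passed through as-is. Whether it should be 
dropped instead is a design choice the report leaves open.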



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
