[ 
https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218412#comment-15218412
 ] 

Ken Krugler commented on TIKA-1896:
-----------------------------------

Hi Tim - hmm, changing the type of the script tag from cdata to element 
seems...exciting. If we had that hefty corpus of gnarly HTML docs to compare 
before/after results, then I'd be less concerned about this "fix". 

In any case I wouldn't call this a major bug, more like a minor improvement to 
how Tika (via TagSoup) handles invalid HTML.

Personally I'd suggest better documentation (with this as an example) of how to 
customize the HTMLSchema by adding it to the parse context, and then adding this 
particular case to the list of things we'd want to try when/if we switch to 
JSoup (see [TIKA-1599]). But we'd only want to make that switch when we have 
the corpus (via TIKA-1302 and TIKA-1331) to validate against.
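
To illustrate the cdata-versus-element distinction discussed above, here is a 
toy model in plain Java. It is not TagSoup's scanner and uses neither Tika nor 
TagSoup; the names ScriptModeDemo, extractText and scriptIsCdata are 
hypothetical. It assumes that in "cdata" mode a script body ends only at an 
exact </script>, while in "element" mode any closing tag named script ends it. 
That assumption is a simplification made for the sketch.

```java
import java.util.Locale;

class ScriptModeDemo {

    // Toy text extractor: drops tags and script bodies, keeps everything else.
    // scriptIsCdata == true  : script body ends only at an exact "</script>".
    // scriptIsCdata == false : script is an ordinary element, so any closing
    //                          tag whose name is "script" ends it.
    static String extractText(String html, boolean scriptIsCdata) {
        StringBuilder text = new StringBuilder();
        boolean inScript = false; // only ever true in element mode
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c != '<') {
                if (!inScript) text.append(c);
                i++;
                continue;
            }
            int end = html.indexOf('>', i);
            if (end < 0) break; // unterminated tag: stop scanning
            String tag = html.substring(i + 1, end).trim().toLowerCase(Locale.ROOT);
            String name = tag.split("\\s+")[0];
            i = end + 1;
            if (inScript) {
                // Element mode tolerates junk after the name, e.g. "/script language".
                if (name.equals("/script")) inScript = false;
            } else if (name.equals("script")) {
                if (scriptIsCdata) {
                    int close = indexOfIgnoreCase(html, "</script>", i);
                    if (close < 0) break; // rest of the document is swallowed
                    i = close + "</script>".length();
                } else {
                    inScript = true;
                }
            }
        }
        // Collapse whitespace so results are easy to compare.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    // Case-insensitive indexOf, written out to keep the sketch stdlib-only.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<html><head><script lang=\"javascript\"></script language>"
                + "</head><body><p>This is a test.</p></body></html>";
        System.out.println("cdata:   [" + extractText(html, true) + "]");  // cdata:   []
        System.out.println("element: [" + extractText(html, false) + "]"); // element: [This is a test.]
    }
}
```

The element-mode branch recovers the paragraph text here, but in this toy 
model it would also treat any "<" inside a real script body as markup, which 
is the sort of side effect a before/after corpus comparison would be needed to 
catch.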

> Invalid closing script tag not handled gracefully by HtmlParser
> ---------------------------------------------------------------
>
>                 Key: TIKA-1896
>                 URL: https://issues.apache.org/jira/browse/TIKA-1896
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.12
>            Reporter: Matthew Caruana Galizia
>         Attachments: test.html
>
>
> When an HTML file contains an invalid closing script tag, all content after 
> that tag is interpreted as script data and therefore ignored.
> Reduced test case file attached.
> To reproduce:
> 1) create a file with the following HTML
> {code:html}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
> "http://www.w3.org/TR/html4/loose.dtd">
> <html>
>       <head>
>               <script lang="javascript"></script language>
>       </head>
>       <body>
>               <p>This is a test.</p>
>       </body>
> </html>
> {code}
> 2) {{java -jar tika-app-1.12.jar -t test.html}}
> Expected result:
> {{This is a test.}}
> What is actually returned:
> Nothing.


