[ https://issues.apache.org/jira/browse/TIKA-1896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218412#comment-15218412 ]
Ken Krugler commented on TIKA-1896:
-----------------------------------

Hi Tim - hmm, changing the type of the script tag from cdata to element seems...exciting.

If we had that hefty corpus of gnarly HTML docs to compare before/after results, then I'd be less concerned about this "fix".

In any case I wouldn't call this a major bug, more like a minor improvement to how Tika (via TagSoup) handles invalid HTML.

Personally I'd suggest better documentation (with this as an example) of how to customize the HTMLSchema via adding it to the context, and then add this particular case to the list of things we'd want to try when/if we switch to JSoup (see [TIKA-1599]). But we'd only want to make that switch when we have the corpus (via TIKA-1302 and TIKA-1331) to validate against.

> Invalid closing script tag not handled gracefully by HtmlParser
> ---------------------------------------------------------------
>
>                 Key: TIKA-1896
>                 URL: https://issues.apache.org/jira/browse/TIKA-1896
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>   Affects Versions: 1.12
>            Reporter: Matthew Caruana Galizia
>         Attachments: test.html
>
>
> When an HTML file contains an invalid closing script tag, all content after
> that tag is interpreted as script data and therefore ignored.
> Reduced test case file attached.
> To reproduce:
> 1) create a file with the following HTML
> {code:html}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
> "http://www.w3.org/TR/html4/loose.dtd">
> <html>
> <head>
> <script lang="javascript"></script language>
> </head>
> <body>
> <p>This is a test.</p>
> </body>
> </html>
> {code}
> 2) {{java -jar tika-app-1.12.jar -t test.html}}
> Expected result:
> {{This is a test.}}
> What is actually returned:
> Nothing.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)