Hi folks,

I introduced a regression in the HtmlParser in TIKA-1938, which added the ability to emit parsed <script src="..."> tags found in the HTML <head>. <script> is not currently included in the list of valid <head> child elements in XHTMLContentHandler.java, so when the first <script> tag is parsed the <head> is immediately closed.

Correcting that exposes a second problem: because my patch treats <script> in the same manner as <base> and <link>, empty <script> tags are emitted as <script src="..." />, which is invalid (empty <script> elements must have both opening and closing tags, e.g. <script src="..."></script>). Unfortunately I haven't yet found an easy fix, so:
1. Would it be best to open two tickets: one for reverting TIKA-1938, and another for correcting the issue?
2. Is anyone particularly familiar with the HtmlParser and able to take a look?

I'm finding it very difficult to add support for these <script> tags because of the way HtmlHandler and XHTMLContentHandler lazily handle the <head> element, but I believe it's a necessary feature.
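
To illustrate the empty-tag problem outside of Tika, here is a minimal sketch using only the JDK's standard TransformerHandler. It is not the Tika code path; the class name ScriptTagDemo and the "a.js" value are made up for the example. It feeds the same SAX events (a start and end of <script> with nothing in between) to a serializer in "xml" mode and in "html" mode. I would expect the "xml" mode to collapse the element into a self-closing tag and the "html" mode to write an explicit </script>, which is the difference described above.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class, not part of Tika.
final class ScriptTagDemo {

    // Serializes an empty <script src="a.js"> element with the given
    // output method ("xml" or "html") and returns the resulting text.
    static String serialize(String method) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, method);
        handler.getTransformer().setOutputProperty(
                OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // "a.js" is a placeholder value for the src attribute.
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "src", "src", "CDATA", "a.js");

        // Start and end events with no content in between: an empty element.
        handler.startDocument();
        handler.startElement("", "script", "script", attrs);
        handler.endElement("", "script", "script");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("xml:  " + serialize("xml").trim());
        System.out.println("html: " + serialize("html").trim());
    }
}
```

The point of the sketch is only that the handler receiving the events cannot tell an "empty by definition" element such as <link> from an "empty by accident" one such as <script>; the serializer has to know which elements need an explicit closing tag.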
