Hi folks,

I introduced a regression in the HtmlParser in TIKA-1938, which added the
ability to emit parsed <script src="..."> tags found in the HTML <head>.
<script> is not currently included in the list of valid <head> child
elements in XHTMLContentHandler.java, so when the first <script> tag is
parsed, the <head> is immediately closed. Correcting this exposes a second
problem: because my patch treats <script> in the same manner as <base> and
<link>, empty <script> tags are emitted as <script src="..." />, which is
invalid HTML (empty <script> elements must have both opening and closing
tags, e.g. <script src="..."></script>). Unfortunately I haven't yet found
an easy fix, so:

1. Would it be best to open two tickets: one for reverting TIKA-1938, and
another for correcting the issue?

2. Is anyone particularly familiar with the HtmlParser and able to take a
look?

I'm finding it very difficult to add support for these <script> tags due to
the way HtmlHandler and XHTMLContentHandler lazily handle the <head>
element, but I believe it's a necessary feature.
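
For anyone who wants to see the serialization problem in isolation, here is
a minimal sketch of the rule I think the output needs to follow. It is not
Tika code: the class EmptyElementSketch, the method emitEmpty, and the
VOID_HEAD_ELEMENTS set are all hypothetical names of my own, and the real
fix would have to live wherever the handlers decide how to close an
element.

```java
import java.util.Set;

// Hypothetical helper, NOT part of Tika: illustrates which empty <head>
// children may be self-closed and which need an explicit end tag.
class EmptyElementSketch {

    // HTML void elements that can appear in <head>; these never take an end tag.
    private static final Set<String> VOID_HEAD_ELEMENTS = Set.of("base", "link", "meta");

    // Serializes an element with one attribute and no content.
    // Attribute values are not escaped here; this is only a sketch.
    static String emitEmpty(String name, String attrName, String attrValue) {
        String open = "<" + name + " " + attrName + "=\"" + attrValue + "\"";
        if (VOID_HEAD_ELEMENTS.contains(name)) {
            return open + " />";             // e.g. <link href="..." />
        }
        return open + "></" + name + ">";    // e.g. <script src="..."></script>
    }

    public static void main(String[] args) {
        System.out.println(emitEmpty("link", "href", "style.css"));
        System.out.println(emitEmpty("script", "src", "app.js"));
    }
}
```

Running main prints <link href="style.css" /> followed by
<script src="app.js"></script>, which is the distinction my patch currently
fails to make.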
