Gerard Bouchar created TIKA-2709:
------------------------------------

             Summary: Invalid handling of <base> tags
                 Key: TIKA-2709
                 URL: https://issues.apache.org/jira/browse/TIKA-2709
             Project: Tika
          Issue Type: Bug
            Reporter: Gerard Bouchar


Currently, when the HTML parser encounters the following:

{code:html}
<base href='http://example.com/'>
{code}

it emits SAX events corresponding to the following:

{code:html}
<base />
<meta name='Content-Location' value='http://example.com/' />
{code}

Note that the emitted "base" element has no attributes, which is [not valid in 
HTML|https://html.spec.whatwg.org/multipage/semantics.html#the-base-element]: 
the specification requires an href attribute, a target attribute, or both.

Moreover, the [Content-Location HTTP 
header|https://tools.ietf.org/html/rfc7231#section-3.1.4.2] has a different 
meaning from a base URL, and Tika's behavior does not allow application code to 
distinguish between a base tag and an http-equiv meta tag that sets Content-Location.
 
See: 
https://github.com/apache/tika/blob/18f4e24451b1d835ab1897f49389788f78063a52/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java#L158-L164
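
To illustrate the problem from the application side, here is a minimal sketch that uses only the JDK SAX classes and does not call Tika. It replays the event sequence described above into a handler, and then replays the sequence I would expect instead. All names (BaseTagEvents, BaseCollector, attrs, replayReported, replayExpected) are hypothetical, and the "expected" sequence is only my assumption about what a fix could emit, not an existing Tika behavior.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class BaseTagEvents {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical application handler that wants the document's base URL. */
    static class BaseCollector extends DefaultHandler {
        String baseHref;        // taken from <base href="...">
        String contentLocation; // taken from <meta name="Content-Location" value="...">

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("base".equals(localName)) {
                baseHref = atts.getValue("href");
            } else if ("meta".equals(localName)
                    && "Content-Location".equals(atts.getValue("name"))) {
                contentLocation = atts.getValue("value");
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Nothing to record on element end.
        }
    }

    /** Builds SAX attributes from alternating name/value strings. */
    static AttributesImpl attrs(String... pairs) {
        AttributesImpl a = new AttributesImpl();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            a.addAttribute("", pairs[i], pairs[i], "CDATA", pairs[i + 1]);
        }
        return a;
    }

    /** Replays the events reported in this issue: an empty base plus a meta element. */
    static BaseCollector replayReported() {
        BaseCollector h = new BaseCollector();
        h.startElement(XHTML, "base", "base", attrs());
        h.endElement(XHTML, "base", "base");
        h.startElement(XHTML, "meta", "meta",
                attrs("name", "Content-Location", "value", "http://example.com/"));
        h.endElement(XHTML, "meta", "meta");
        return h;
    }

    /** Replays an assumed fixed sequence: base keeps its href, no synthetic meta. */
    static BaseCollector replayExpected() {
        BaseCollector h = new BaseCollector();
        h.startElement(XHTML, "base", "base", attrs("href", "http://example.com/"));
        h.endElement(XHTML, "base", "base");
        return h;
    }

    public static void main(String[] args) {
        BaseCollector reported = replayReported();
        BaseCollector expected = replayExpected();
        System.out.println("reported: href=" + reported.baseHref
                + ", Content-Location=" + reported.contentLocation);
        System.out.println("expected: href=" + expected.baseHref
                + ", Content-Location=" + expected.contentLocation);
    }
}
```

With the reported sequence the handler never sees an href on "base", and the only place the URL shows up is a meta element that could equally have come from the document itself.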



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
