Gerard Bouchar created TIKA-2709:
------------------------------------

             Summary: Invalid handling of <base> tags
                 Key: TIKA-2709
                 URL: https://issues.apache.org/jira/browse/TIKA-2709
             Project: Tika
          Issue Type: Bug
            Reporter: Gerard Bouchar

Currently, when the HTML parser encounters the following:
{code:html}
<base href='http://example.com/'>
{code}
it emits SAX events corresponding to the following:
{code:html}
<base />
<meta name='Content-Location' value='http://example.com/' />
{code}
Note that the emitted "base" tag has no attributes, which is [not valid in HTML|https://html.spec.whatwg.org/multipage/semantics.html#the-base-element]: a base element must have an href attribute, a target attribute, or both.

Moreover, the [Content-Location HTTP header|https://tools.ietf.org/html/rfc7231#section-3.1.4.2] has a different meaning, and Tika's behavior does not allow application code to distinguish between a base tag and an http-equiv meta tag that sets Content-Location.

See: https://github.com/apache/tika/blob/18f4e24451b1d835ab1897f49389788f78063a52/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java#L158-L164
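
To illustrate the consequence for application code, here is a minimal sketch that uses only the JDK's SAX API, not Tika itself. It feeds a handler the two event streams discussed above, written out as XHTML strings copied from this report, and looks for the href of a base element. The class name BaseHrefProbe and the method findBaseHref are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; it is not part of Tika.
final class BaseHrefProbe {

    /** Returns the href of the first <base> element that carries one, or null. */
    static String findBaseHref(String xhtml) throws Exception {
        final String[] href = {null};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The default factory is not namespace-aware, so match on qName.
                if ("base".equals(qName) && href[0] == null) {
                    href[0] = atts.getValue("href"); // null when the attribute is absent
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return href[0];
    }

    public static void main(String[] args) throws Exception {
        // Events as this report describes them being emitted today.
        String emitted =
            "<head><base /><meta name='Content-Location' value='http://example.com/' /></head>";
        // Events the reporter would expect: the href stays on the base element.
        String expected = "<head><base href='http://example.com/' /></head>";

        System.out.println(findBaseHref(emitted));  // prints "null"
        System.out.println(findBaseHref(expected)); // prints "http://example.com/"
    }
}
```

With the currently emitted events, a handler that looks for the href attribute on base finds nothing, and the only trace of the URL is a meta element that could just as well have come from an http-equiv declaration.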