[ https://issues.apache.org/jira/browse/TIKA-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760038#action_12760038 ]
Ken Krugler commented on TIKA-287:
----------------------------------
Hi Uwe,
One comment about the Pangaea code: per the HTML 4.01 spec, the href of a <base>
element has to be an absolute URL, not a relative one. That's how I was handling
it, but I see you're using URL(baseUrl, hrefUrl) to construct the new baseUrl.
Have you run into cases where the <base> element's href attribute was actually a
relative URL?
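For reference, here is a minimal sketch of how the two-argument URL constructor
treats both kinds of href. The names BaseHrefSketch and resolveBase and the
example.com URLs are made up for illustration and are not taken from the Pangaea
or Tika code. Falling back to the document URL on a malformed href is also just
an assumption.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only.
final class BaseHrefSketch {

    /**
     * Resolves the href of a <base> element against the document URL.
     * An absolute href replaces the document URL; a relative href is
     * resolved against it. A malformed href leaves the base unchanged
     * (an assumed policy, not something the issue specifies).
     */
    public static URL resolveBase(String documentUrl, String baseHref) {
        final URL doc;
        try {
            doc = new URL(documentUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad document URL: " + documentUrl, e);
        }
        try {
            return new URL(doc, baseHref);
        } catch (MalformedURLException e) {
            return doc;
        }
    }

    public static void main(String[] args) {
        String doc = "http://example.com/docs/page.html";
        // Absolute href: the context URL is effectively ignored.
        System.out.println(resolveBase(doc, "http://cdn.example.org/assets/"));
        // Relative href: resolved against the directory of the document.
        System.out.println(resolveBase(doc, "assets/"));
    }
}
```

So the constructor gives the same answer as the absolute-only approach for a
spec-conforming href, and differs only when a page uses a relative one.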
Thanks,
-- Ken
> HtmlParser should resolve relative paths in <a href="xxx"> elements
> -------------------------------------------------------------------
>
> Key: TIKA-287
> URL: https://issues.apache.org/jira/browse/TIKA-287
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 0.4
> Reporter: Ken Krugler
>
> Currently clients of the HtmlParser need to manually keep track of the
> appropriate base URL to use when resolving relative URLs in href="xxx"
> attributes.
> The parser should use the metadata RESOURCE_NAME_KEY value as the base.
> The parser should also watch for a <base> element in the <head> section, and
> use that to update the base URL.
> Note that special care must be taken to work around a known bug in the Java
> URL() class, when the relative URL is a query string and the base URL doesn't
> end with a '/'.
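
A minimal sketch of one possible workaround for the query-string case described
above, assuming a hierarchical base URL with an authority part (such as http).
The names QueryResolveSketch and resolve are hypothetical, and this is not
necessarily how the Tika patch handles it.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only.
final class QueryResolveSketch {

    /**
     * Resolves a relative URL against a base URL.
     *
     * When the relative URL is only a query string ("?a=b"), java.net.URL
     * drops the last path segment of the base. RFC 3986 instead keeps the
     * base path and replaces only the query, so that case is rebuilt by hand.
     * Assumes the base has an authority component (e.g. an http URL).
     */
    public static URL resolve(String baseUrl, String relativeUrl) {
        try {
            URL base = new URL(baseUrl);
            if (relativeUrl.startsWith("?")) {
                // getPath() excludes any query already present on the base.
                return new URL(base.getProtocol() + "://" + base.getAuthority()
                        + base.getPath() + relativeUrl);
            }
            return new URL(base, relativeUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(
                    "Cannot resolve '" + relativeUrl + "' against '" + baseUrl + "'", e);
        }
    }

    public static void main(String[] args) {
        // Query-only relative URL: the file name in the base path is kept.
        System.out.println(resolve("http://example.com/path/file.html", "?page=2"));
        // Ordinary relative path: delegated to java.net.URL unchanged.
        System.out.println(resolve("http://example.com/path/file.html", "other.html"));
    }
}
```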