[ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887523#action_12887523 ]
Ken Krugler commented on TIKA-463:
----------------------------------
The other issue I've run into with HtmlMapper is that there currently seems to
be no way to make it remap URLs correctly, even if I create my own custom
implementation of that interface.
The problem is that you specify the mapper via ParseContext(HtmlMapper.class,
my-custom-code.class). That means my-custom-code gets instantiated via a
no-args constructor, so it has no access to the metadata and therefore doesn't
know the base URL to use for normalizing URLs.
If I could, I'd change HtmlMapper to be an abstract class with a constructor
that takes Metadata and ParseContext arguments, and give it a "resolveUrl()"
method that the mapSafeAttribute() method could use, instead of baking that
logic into HtmlHandler.
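To make the abstract-class idea concrete, here is a rough, self-contained sketch of roughly what I have in mind. None of these types exist in Tika: UrlResolvingMapper, LinkAttributeMapper, isUrlAttribute() and mapAttributeValue() are hypothetical names, and the constructor takes a plain base URL string as a stand-in for the Metadata and ParseContext arguments so that the example compiles without Tika on the classpath. It only illustrates how a mapper that is handed the base URL at construction time could resolve relative links itself.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch only: these classes are NOT part of Tika's API.
abstract class UrlResolvingMapper {
    // Base URL of the document being parsed; null when it is unknown.
    private final URI baseUri;

    // Assumption: a real version would take Metadata and ParseContext
    // and pull the base URL out of the metadata instead.
    protected UrlResolvingMapper(String baseUrl) {
        this.baseUri = (baseUrl == null) ? null : URI.create(baseUrl);
    }

    // Resolves a possibly relative URL against the base URL.
    // Returns the input unchanged if there is no base or the URL is malformed.
    public String resolveUrl(String url) {
        if (baseUri == null || url == null) {
            return url;
        }
        try {
            return baseUri.resolve(new URI(url.trim())).toString();
        } catch (URISyntaxException e) {
            return url;
        }
    }

    // Subclasses decide which element/attribute pairs carry URLs.
    protected abstract boolean isUrlAttribute(String elementName, String attributeName);

    // Rewrites the value of URL-bearing attributes, leaves others alone.
    public String mapAttributeValue(String elementName, String attributeName, String value) {
        return isUrlAttribute(elementName, attributeName) ? resolveUrl(value) : value;
    }
}

// Minimal concrete mapper that treats href and src as URL attributes.
class LinkAttributeMapper extends UrlResolvingMapper {
    LinkAttributeMapper(String baseUrl) {
        super(baseUrl);
    }

    @Override
    protected boolean isUrlAttribute(String elementName, String attributeName) {
        return "href".equals(attributeName) || "src".equals(attributeName);
    }
}
```

With something along these lines, a mapper built as new LinkAttributeMapper("http://example.com/docs/index.html") could turn an href of "page2.html" into an absolute URL on its own, without HtmlHandler having to do it.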
> HtmlParser doesn't extract links from img, map, object, frame, iframe, area,
> link
> ---------------------------------------------------------------------------------
>
> Key: TIKA-463
> URL: https://issues.apache.org/jira/browse/TIKA-463
> Project: Tika
> Issue Type: Bug
> Reporter: Ken Krugler
> Assignee: Ken Krugler
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd
> want to extract those links, if possible.
> For elements that aren't valid in XHTML 1.0, it may be challenging to work
> out the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants,
> then all of the above are valid, and thus should be emitted by the parser,