[ 
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887523#action_12887523
 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

The other issue I've run into with HtmlMapper is that it currently seems 
impossible to have it do the right thing when remapping URLs, even if I create 
my own custom implementation of that interface.

The problem is that you specify the mapper via ParseContext(HtmlMapper.class, 
my-custom-code.class). So this means my-custom-code gets instantiated via a 
no-args constructor, and it doesn't have access to the metadata, so it doesn't 
know the base URL to use for normalizing URLs.

If I could, I'd change HtmlMapper from an interface to an abstract class with a 
constructor that takes Metadata and ParseContext arguments, and give it a 
"resolveUrl()" method that the mapSafeAttribute() method could use, rather than 
baking URL resolution into HtmlHandler.
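
A rough sketch of what I have in mind, not existing Tika API. The class names, 
the mapAttributeValue() hook, the plain Map standing in for Metadata, and the 
use of a "Content-Location" entry as the base URL are all assumptions made for 
illustration, and ParseContext is left out to keep it self-contained:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Map;

// Hypothetical sketch, not existing Tika API: a mapper base class that is
// constructed with the document metadata and can therefore resolve URLs.
abstract class UrlAwareMapper {
    private final URI baseUri; // null when the metadata carries no usable base URL

    // A plain Map stands in for Tika's Metadata; reading the base URL from a
    // "Content-Location" entry is an assumption of this sketch.
    protected UrlAwareMapper(Map<String, String> metadata) {
        this.baseUri = parseOrNull(metadata.get("Content-Location"));
    }

    // Resolves a possibly relative URL against the base URL. Returns the
    // input unchanged when there is no base URL or the input cannot be parsed.
    protected String resolveUrl(String url) {
        if (baseUri == null || url == null) {
            return url;
        }
        URI ref = parseOrNull(url.trim());
        return ref == null ? url : baseUri.resolve(ref).toString();
    }

    // Hypothetical hook: subclasses decide how an attribute value is emitted.
    public abstract String mapAttributeValue(String element, String attribute, String value);

    private static URI parseOrNull(String s) {
        if (s == null) {
            return null;
        }
        try {
            return new URI(s);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}

// Example subclass: normalizes the values of href and src attributes only.
class LinkResolvingMapper extends UrlAwareMapper {
    LinkResolvingMapper(Map<String, String> metadata) {
        super(metadata);
    }

    @Override
    public String mapAttributeValue(String element, String attribute, String value) {
        boolean isLink = "href".equals(attribute) || "src".equals(attribute);
        return isLink ? resolveUrl(value) : value;
    }
}
```

With a base of http://example.com/docs/index.html, calling 
mapAttributeValue("img", "src", "../img/logo.png") on a LinkResolvingMapper 
returns http://example.com/img/logo.png. The point is that the mapper has to be 
constructed with the metadata by the caller, which a no-args constructor cannot 
provide.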

> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, 
> link
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-463
>                 URL: https://issues.apache.org/jira/browse/TIKA-463
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd 
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges 
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, 
> then all of the above are valid, and thus should be emitted by the parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
