[ 
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887716#action_12887716
 ] 

Julien Nioche commented on TIKA-463:
------------------------------------

creating a LinksHtmlMapper : +1, that would be a nice intermediate between the 
default mapper and the identity mapper 

handling of links in mapper : mapSafeAttribute() returns a normalised 
representation of the attribute names that are allowed but does not affect the 
value of the attributes. Maybe we should change the method so that it returns 
BOTH the normalised name (or null if the attribute must be skipped) and the 
corresponding normalised value (e.g. the resolved URL) given a name/value 
pair. The mapper implementation could then manage the resolution of the URLs 
internally. This would also be useful for normalising the names and values of 
elements in the header such as http-equiv.
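
To make the proposal concrete, here is a rough sketch of a mapper whose 
mapSafeAttribute() takes the value as well and returns a name/value pair. 
Everything in it is an assumption for illustration only: the class name, the 
NameValue type, the three-argument signature and the lists of allowed 
attributes are not existing Tika API.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not part of Tika.
class ResolvingMapperSketch {

    /** Hypothetical result type: normalised attribute name plus normalised value. */
    static final class NameValue {
        final String name;
        final String value;

        NameValue(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // Attributes assumed to carry URLs (illustrative, not exhaustive).
    private static final Set<String> URL_ATTRIBUTES = Set.of("href", "src");
    // Other attributes assumed safe to pass through unchanged.
    private static final Set<String> PLAIN_ATTRIBUTES = Set.of("title", "alt");

    private final URI base;

    ResolvingMapperSketch(String baseUrl) {
        this.base = URI.create(baseUrl);
    }

    /** Returns the normalised pair, or null if the attribute must be skipped. */
    NameValue mapSafeAttribute(String elementName, String attributeName, String value) {
        String name = attributeName.toLowerCase(Locale.ROOT);
        if (URL_ATTRIBUTES.contains(name)) {
            try {
                // The mapper resolves relative URLs against the base internally.
                return new NameValue(name, base.resolve(value.trim()).toString());
            } catch (IllegalArgumentException e) {
                return null; // unparseable URL: skip the attribute
            }
        }
        if (PLAIN_ATTRIBUTES.contains(name)) {
            return new NameValue(name, value);
        }
        return null;
    }
}
```

With a base of http://example.com/docs/index.html, a call such as 
mapSafeAttribute("img", "SRC", "img/a.png") would give the name "src" and the 
value http://example.com/docs/img/a.png, while an attribute like onclick 
would come back as null and be dropped.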

HtmlParser as an abstract class : what about following Jukka's suggestion for 
Handlers in https://issues.apache.org/jira/browse/TIKA-458 and have a Factory?

As for frames, it raises another issue (see 
https://issues.apache.org/jira/browse/TIKA-457) which is that anything outside 
<body> and <head> is currently discarded by the HTMLMapper. This is why I 
considered doing TIKA-458 but maybe we could make the HTMLHandler more generic 
and delegate the decisions to the Mappers e.g. by adding a method isBody(). 

The body level is currently used to : 
a) distinguish the elements in the header
b) determine where characters should be added to the text of the document

Do we really need (a)? Are elements such as LINK, BASE or META found anywhere 
outside the HEAD? Should mapSafeElement() take into account the path of an 
element as well e.g. to allow a <link> only if it has <head> for parent?
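
As a rough illustration of the last two ideas, the sketch below shows a 
mapper that looks at the parent of an element and also exposes an isBody() 
decision. The class name, the two-argument mapSafeElement() signature, the 
element lists and the choice to treat <frameset> as a content root are all 
assumptions made for the example, not existing Tika behaviour.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not part of Tika.
class PathAwareMapperSketch {

    // Elements assumed to be meaningful only directly under <head>.
    private static final Set<String> HEAD_ONLY = Set.of("link", "base", "meta");
    // A small illustrative subset of elements allowed elsewhere.
    private static final Set<String> OTHER_SAFE = Set.of("p", "a", "img", "div", "frame");

    /**
     * Returns the normalised element name, or null if the element
     * should not be emitted at this position in the document.
     */
    String mapSafeElement(String name, String parentName) {
        String n = name.toLowerCase(Locale.ROOT);
        String parent = parentName == null ? "" : parentName.toLowerCase(Locale.ROOT);
        if (HEAD_ONLY.contains(n)) {
            // Allow e.g. <link> only if it has <head> for parent.
            return "head".equals(parent) ? n : null;
        }
        return OTHER_SAFE.contains(n) ? n : null;
    }

    /**
     * Lets the mapper, rather than the handler, decide which elements
     * open the part of the document whose content is kept.
     */
    boolean isBody(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        return "body".equals(n) || "frameset".equals(n);
    }
}
```

The handler would then only need to ask the mapper isBody(name) when an 
element starts, instead of hard-coding <body>, and a <link> found inside a 
<div> would be rejected while the same <link> under <head> would be kept.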




> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, 
> link
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-463
>                 URL: https://issues.apache.org/jira/browse/TIKA-463
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd 
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges 
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, 
> then all of the above are valid, and thus should be emitted by the parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
