[
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892958#action_12892958
]
Julien Nioche commented on TIKA-463:
------------------------------------
I am very tempted to push things one step further and delegate startElement()
and endElement() to the mappers, so that users can do whatever they fancy in
their custom mapper implementations. In that case we would probably no longer
need mapSafeElement and mapSafeAttribute. The patch above gives the
mappers access to the metadata.
For example, <a> gets special treatment in the HTMLHandler, and we currently
can't get the rel attribute from <a href="http://www.nutch.org"
rel="nofollow">, which for a crawler is quite an embarrassment. By delegating
the logic to the mappers instead, we get total control over what can be done
while still keeping the existing behaviour by default.
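To make the idea concrete, here is a rough sketch of what such delegation could look like. It is not taken from the patch: ElementMapper, DefaultMapper, RelAwareMapper and keptAttributes are hypothetical names invented for illustration, and the code uses only the JDK's SAX classes, not Tika's actual HtmlMapper or handler.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class DelegatingMapperSketch {

    /** Hypothetical interface; Tika's HtmlMapper does not define these methods. */
    interface ElementMapper {
        void startElement(String name, Attributes atts, ContentHandler out) throws SAXException;
        void endElement(String name, ContentHandler out) throws SAXException;
    }

    /** Stand-in for the existing default behaviour: <a> keeps only href. */
    static class DefaultMapper implements ElementMapper {
        /** Attribute names copied to the output for the given element (assumed rule). */
        protected List<String> keptAttributes(String name) {
            return "a".equals(name) ? List.of("href") : List.of();
        }

        @Override
        public void startElement(String name, Attributes atts, ContentHandler out) throws SAXException {
            AttributesImpl kept = new AttributesImpl();
            for (String attr : keptAttributes(name)) {
                String value = atts.getValue(attr);
                if (value != null) {
                    kept.addAttribute("", attr, attr, "CDATA", value);
                }
            }
            out.startElement("", name, name, kept);
        }

        @Override
        public void endElement(String name, ContentHandler out) throws SAXException {
            out.endElement("", name, name);
        }
    }

    /** Custom mapper: additionally keeps rel on <a>, as a crawler would want. */
    static class RelAwareMapper extends DefaultMapper {
        @Override
        protected List<String> keptAttributes(String name) {
            return "a".equals(name) ? List.of("href", "rel") : super.keptAttributes(name);
        }
    }

    /** Feeds one <a href rel> element through the mapper and records the emitted attributes. */
    static List<String> emit(ElementMapper mapper) {
        List<String> seen = new ArrayList<>();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                for (int i = 0; i < a.getLength(); i++) {
                    seen.add(qName + "@" + a.getQName(i) + "=" + a.getValue(i));
                }
            }
        };
        AttributesImpl in = new AttributesImpl();
        in.addAttribute("", "href", "href", "CDATA", "http://www.nutch.org");
        in.addAttribute("", "rel", "rel", "CDATA", "nofollow");
        try {
            mapper.startElement("a", in, recorder);
            mapper.endElement("a", recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(emit(new DefaultMapper()));
        System.out.println(emit(new RelAwareMapper()));
    }
}
```

In this sketch the default mapper still drops rel, so the existing behaviour is preserved, while a user who plugs in RelAwareMapper sees both href and rel on the emitted <a> element.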
Any reason not to delegate start/endElement to the mappers? It would be good to
get some feedback on this, as I really need to improve the handling of HTML
for Nutch :-)
> HtmlParser doesn't extract links from img, map, object, frame, iframe, area,
> link
> ---------------------------------------------------------------------------------
>
> Key: TIKA-463
> URL: https://issues.apache.org/jira/browse/TIKA-463
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Attachments: TIKA-463.patch
>
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants,
> then all of the above are valid, and thus should be emitted by the parser,