[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Julien Nioche (JIRA) Tue, 27 Jul 2010 15:13:41 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892958#action_12892958
 ]


Julien Nioche commented on TIKA-463:
------------------------------------

I am very tempted to push things one step further and delegate startElement() 
and endElement() to the mappers, so that users can do whatever they fancy in 
their custom mapper implementations. In that case we'd probably no longer need 
mapSafeElement and mapSafeAttribute. The patch above gives the mappers access 
to the metadata.

For example, <a> elements get special treatment in the HtmlHandler, and we 
currently can't get the rel attribute from <a href="http://www.nutch.org" 
rel="nofollow">, which for a crawler is quite an embarrassment. By delegating 
the logic to the mappers instead, we would get total control over what can be 
done while still keeping the existing behaviour by default.
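
To make the idea concrete, here is a rough sketch of what delegation could 
look like. It is not the patch and not the current Tika API: ElementMapper, 
DefaultLinkMapper, NoFollowAwareMapper and DelegatingHandler are hypothetical 
names, and a plain list of links stands in for the real handler output and 
metadata. Only the org.xml.sax types come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical interface (an assumption, not the existing HtmlMapper):
// the mapper receives the raw element events instead of only name lookups.
interface ElementMapper {
    void startElement(String name, Attributes atts, List<String> links);
    void endElement(String name, List<String> links);
}

// Stand-in for the default behaviour: record the href of every <a>.
class DefaultLinkMapper implements ElementMapper {
    @Override
    public void startElement(String name, Attributes atts, List<String> links) {
        String href = atts.getValue("href");
        if ("a".equals(name) && href != null) {
            links.add(href);
        }
    }

    @Override
    public void endElement(String name, List<String> links) {
        // Nothing to do on close in this simplified sketch.
    }
}

// A custom mapper a crawler might plug in: it can see rel="nofollow"
// and decide to skip the link, then fall back to the default otherwise.
class NoFollowAwareMapper extends DefaultLinkMapper {
    @Override
    public void startElement(String name, Attributes atts, List<String> links) {
        if ("a".equals(name) && "nofollow".equalsIgnoreCase(atts.getValue("rel"))) {
            return;
        }
        super.startElement(name, atts, links);
    }
}

// Stand-in for the handler: it only forwards the events to the mapper.
class DelegatingHandler {
    private final ElementMapper mapper;
    final List<String> links = new ArrayList<>();

    DelegatingHandler(ElementMapper mapper) {
        this.mapper = mapper;
    }

    void startElement(String name, Attributes atts) {
        mapper.startElement(name, atts, links);
    }

    void endElement(String name) {
        mapper.endElement(name, links);
    }

    // Small helper for trying the sketch: builds the attributes of
    // <a href="..." rel="...">.
    static Attributes anchor(String href, String rel) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "href", "href", "CDATA", href);
        atts.addAttribute("", "rel", "rel", "CDATA", rel);
        return atts;
    }
}
```

With this shape, a handler built with DefaultLinkMapper records 
http://www.nutch.org for the anchor above, while one built with 
NoFollowAwareMapper skips it, and neither case needs a change in the handler 
itself.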

Any reason not to delegate start/endElement to the mappers? It would be good to 
get some feedback on this, as I really need to improve the handling of HTML 
for Nutch :-)

> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, 
> link
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-463
>                 URL: https://issues.apache.org/jira/browse/TIKA-463
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>         Attachments: TIKA-463.patch
>
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd 
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges 
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants, 
> then all of the above are valid, and thus should be emitted by the parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
