[
https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892996#action_12892996
]
Ken Krugler commented on TIKA-463:
----------------------------------
I like the idea of being able to encapsulate all special processing into one
easily extensible class.
I'm trying to come to grips with what things should be done in HtmlParser vs.
HtmlHandler vs. HtmlMapper.
Since most of what we're talking about is moving code from HtmlHandler to
HtmlMapper, I agree that giving as much control as possible to HtmlMapper
(which can be overridden) makes sense. But when I look at what would be left
in HtmlHandler, it's not clear to me that we'd even need that class anymore.
I'd need to spend more time thinking about that, though, including why
HtmlHandler subclasses TextContentHandler rather than DefaultHandler.
In summary, it feels like we're heading down a path where HtmlHandler is the
extension point (there is no HtmlMapper), and it should have some methods
(beyond the standard ContentHandler methods) that can be overridden to adjust
behavior. Otherwise it would be a very thin shim, without much value, that
just adds complexity to the calling chain.
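To make the "handler as extension point" idea concrete, here is a minimal
sketch of what I have in mind. It is not Tika's actual HtmlHandler: the class
names (LinkCollectingHandler, AnchorsOnlyHandler), the linkAttribute() hook,
and the element-to-attribute table are all hypothetical, and the table is only
my assumption of the common HTML 4.01 URL attributes. It extends the plain SAX
DefaultHandler just to stay self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a SAX handler that is itself the extension point.
class LinkCollectingHandler extends DefaultHandler {

    // Assumed mapping of element name -> attribute that carries a URL.
    // Illustrative only; a real table would need more entries and review.
    private static final Map<String, String> LINK_ATTRIBUTES = Map.of(
            "a", "href",
            "area", "href",
            "link", "href",
            "img", "src",
            "frame", "src",
            "iframe", "src",
            "object", "data");

    private final List<String> links = new ArrayList<>();

    // Overridable hook beyond the standard ContentHandler methods:
    // returns the URL-bearing attribute for an element, or null to skip it.
    protected String linkAttribute(String elementName) {
        return LINK_ATTRIBUTES.get(elementName.toLowerCase(Locale.ROOT));
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Fall back to the qualified name when no local name is reported.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        String attribute = linkAttribute(name);
        if (attribute == null) {
            return;
        }
        String value = atts.getValue(attribute);
        if (value != null && !value.isEmpty()) {
            links.add(value);
        }
    }

    public List<String> getLinks() {
        return links;
    }
}

// A caller adjusts behavior by overriding the hook, not by swapping a mapper.
class AnchorsOnlyHandler extends LinkCollectingHandler {
    @Override
    protected String linkAttribute(String elementName) {
        return "a".equalsIgnoreCase(elementName) ? "href" : null;
    }
}
```

With this shape there is a single class to subclass, and the default behavior
lives in one overridable method rather than being split between a handler and
a separate mapper.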
> HtmlParser doesn't extract links from img, map, object, frame, iframe, area,
> link
> ---------------------------------------------------------------------------------
>
> Key: TIKA-463
> URL: https://issues.apache.org/jira/browse/TIKA-463
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Ken Krugler
> Assignee: Ken Krugler
> Attachments: TIKA-463.patch
>
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd
> want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, there might be some challenges
> in the right way to handle this.
> But if XHTML 1.0 means the union of "transitional and frameset" variants,
> then all of the above are valid, and thus should be emitted by the parser,