[ https://issues.apache.org/jira/browse/TIKA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764087#action_12764087 ]
Ken Krugler commented on TIKA-304:
----------------------------------
A few comments on this:
1. I think it's an improvement, not a bug :)
2. I agree that it would be great to be able to alter the behavior of
HtmlParser. Making subclassing easier is one approach; another might be the
ability (IoC model) to specify a different content handler.
3. Preserving attributes is very important - I had a todo on my list to file an
issue about this. E.g. with links, there can be attributes like the target
content language that you want to preserve.
4. I have some mods for HtmlParser that I need to turn into issues/patches,
e.g. link extraction from <img>, <link>, etc. tags. But I'd hate to put Jukka
into n-way merge hell. So I might wait for this patch to get rolled in first.
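To make points 2-4 a bit more concrete, here is a rough sketch of the kind of
handler a caller might want to plug in. It is not Tika code: the class name
LinkCollectingHandler and its collect() helper are made up for illustration, it
uses only the JDK's SAX classes, and the JDK XML parser stands in for HtmlParser
(so the input has to be well-formed XHTML). I'm also assuming hreflang is the
sort of "target content language" attribute worth preserving.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not part of Tika): records links and keeps hreflang. */
public class LinkCollectingHandler extends DefaultHandler {

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // <a> and <link> carry the URL in href; <img> carries it in src.
        String urlAttr = "img".equals(qName) ? "src"
                : ("a".equals(qName) || "link".equals(qName)) ? "href" : null;
        if (urlAttr == null) {
            return; // not a link-bearing element
        }
        String url = atts.getValue(urlAttr);
        if (url == null) {
            return; // element present but without a URL
        }
        // Preserve the language attribute when it is there (assumed: hreflang).
        String lang = atts.getValue("hreflang");
        links.add(qName + " " + url + (lang != null ? " [" + lang + "]" : ""));
    }

    public List<String> getLinks() {
        return links;
    }

    /** Parses well-formed XHTML with the JDK SAX parser, a stand-in for an HTML parser. */
    public static List<String> collect(String xhtml) {
        LinkCollectingHandler handler = new LinkCollectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.getLinks();
    }

    public static void main(String[] args) {
        String page = "<html><head><link href=\"style.css\"/></head><body>"
                + "<a href=\"/fr/\" hreflang=\"fr\">French</a>"
                + "<img src=\"logo.png\"/></body></html>";
        collect(page).forEach(System.out::println);
    }
}
```

The point of the sketch is only that the caller decides what to keep: if the
parser hands every element and its attributes to whatever handler it is given,
link extraction and attribute preservation need no changes to the parser itself.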
> HtmlParser could be easier to subclass
> --------------------------------------
>
> Key: TIKA-304
> URL: https://issues.apache.org/jira/browse/TIKA-304
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.4, 0.5
> Reporter: Benson Margulies
> Attachments: html-parser-subclass.diff
>
>
> It would be nice if one could subclass HtmlParser to change what it passes
> along, instead of having to copy it. I'll attach a first effort.
> It would also be good if attributes could be preserved (particularly id
> attributes) but let's see how you like my first patch.