Extension point with filters for Neko HTML parser (with patch)
--------------------------------------------------------------
Key: NUTCH-490
URL: https://issues.apache.org/jira/browse/NUTCH-490
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
Attachments: HtmlParser.java.diff
In my project I need to set filters for the Neko HTML parser. Instead of
hard-coding them, I added an extension point for defining Neko filters. I
followed the existing code for HtmlParser filters. In fact, I think the method
that gets the filters could be generalized to handle both cases, but I didn't
want to make too big a mess.
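
To illustrate the idea (this is NOT the code from HtmlParser.java.diff, only a
minimal self-contained sketch), the extension point boils down to a registry
that collects filter implementations and resolves them in a configured order.
All names here (NekoFilterRegistry, DocumentFilter, the space-separated order
string) are hypothetical; in the real parser the filters would presumably be
Xerces XMLDocumentFilter instances rather than the String-to-String stand-in
used below.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a filter extension point; not the attached patch. */
class NekoFilterRegistry {

    /** Stand-in for a parser filter; it just transforms text here. */
    public interface DocumentFilter {
        String apply(String html);
    }

    // Filters contributed by plugins, keyed by id, kept in registration order.
    private final Map<String, DocumentFilter> registered = new LinkedHashMap<>();

    public void register(String id, DocumentFilter filter) {
        registered.put(id, filter);
    }

    /**
     * Resolves the filter chain. A null or blank order means "all filters in
     * registration order"; otherwise only the listed ids are used, in the
     * listed order, and unknown ids are skipped.
     */
    public List<DocumentFilter> resolve(String order) {
        List<DocumentFilter> chain = new ArrayList<>();
        if (order == null || order.trim().isEmpty()) {
            chain.addAll(registered.values());
            return chain;
        }
        for (String id : order.trim().split("\\s+")) {
            DocumentFilter filter = registered.get(id);
            if (filter != null) {
                chain.add(filter);
            }
        }
        return chain;
    }

    /** Applies each filter in turn to the input. */
    public static String run(List<DocumentFilter> chain, String input) {
        String result = input;
        for (DocumentFilter filter : chain) {
            result = filter.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        NekoFilterRegistry registry = new NekoFilterRegistry();
        registry.register("lowercase", s -> s.toLowerCase(Locale.ROOT));
        registry.register("strip-comments", s -> s.replaceAll("(?s)<!--.*?-->", ""));

        // Configured order differs from registration order.
        List<DocumentFilter> chain = registry.resolve("strip-comments lowercase");
        System.out.println(run(chain, "<P>Hi<!-- note --></P>"));
    }
}
```

The same resolve-by-order method could serve both kinds of filters, which is
the generalization mentioned above.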
The attached patch is for Nutch 0.9. This part of the code hasn't changed in
trunk, so the patch should apply there easily.
BTW, I wonder whether it would be best to have HTML DOM parsing itself defined
by an extension point. Currently there are options for Neko and TagSoup. But
anyone who wants to use a different parser, or different settings for the
parser, has to modify the HtmlParser class instead of replacing a plugin.
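
As a rough sketch of what that could look like (purely illustrative, every
name below is an assumption and not existing Nutch API), the DOM parser itself
would become a pluggable backend looked up by a configured name, so a new
parser is added by registering a plugin rather than by editing HtmlParser:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of selecting a DOM parser backend by name. */
class DomParserSelector {

    /** Hypothetical plugin contract; a real one would return a DOM fragment. */
    public interface DomParserBackend {
        String parse(String html);
    }

    // Backends contributed by plugins, keyed by the name used in configuration.
    private final Map<String, DomParserBackend> backends = new HashMap<>();

    public void register(String name, DomParserBackend backend) {
        backends.put(name, backend);
    }

    /** Returns the backend for the configured name, or fails loudly. */
    public DomParserBackend select(String configuredName) {
        DomParserBackend backend = backends.get(configuredName);
        if (backend == null) {
            throw new IllegalArgumentException(
                "No DOM parser backend registered as '" + configuredName + "'");
        }
        return backend;
    }

    public static void main(String[] args) {
        DomParserSelector selector = new DomParserSelector();
        // Stub backends; real ones would wrap Neko, TagSoup, or another parser.
        selector.register("neko", html -> "parsed by neko stub");
        selector.register("tagsoup", html -> "parsed by tagsoup stub");

        System.out.println(selector.select("tagsoup").parse("<p>hello</p>"));
    }
}
```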
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers