Avoid parsing unnecessary links and get a more relevant outlink list
--------------------------------------------------------------------
Key: NUTCH-488
URL: https://issues.apache.org/jira/browse/NUTCH-488
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: Windows, Java 1.5
Reporter: Emmanuel Joke
The NekoHTML parser uses a method to extract all outlinks from an HTML page.
It extracts them from the HTML content based on the list of parameters defined
in the method setConf(). This list of links is then truncated to the maximum
number of outlinks processed per page, as defined in nutch-default.xml
(db.max.outlinks.per.page = 100 by default), and finally it is passed through
all of the configured URL filters.
Unfortunately, the list of outlinks can contain more than 100 entries. In that
case the list is truncated before filtering, which can remove relevant links.
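
To illustrate the ordering problem, here is a minimal standalone sketch. It is
not the parser's actual code: the class name, method name, example URLs, and
the filter predicate are all made up for illustration, and it uses Java 8
syntax for brevity. It only shows how truncating before filtering can drop a
relevant link that sits past the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from Nutch.
final class OutlinkTruncationSketch {

    // Mirrors the order described above: truncate to the limit first,
    // then run the surviving links through the URL filter.
    static List<String> truncateThenFilter(List<String> links, int max,
                                           Predicate<String> accept) {
        List<String> truncated = links.subList(0, Math.min(max, links.size()));
        List<String> kept = new ArrayList<>();
        for (String link : truncated) {
            if (accept.test(link)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        // 100 image links come first in the page...
        for (int i = 0; i < 100; i++) {
            links.add("http://example.com/img/" + i + ".png");
        }
        // ...and the one link we actually care about is number 101.
        links.add("http://example.com/article.html");

        // Stand-in for a URL filter that rejects images.
        Predicate<String> accept = url -> !url.endsWith(".png");

        // Prints [] because the article link was cut off by the truncation.
        System.out.println(truncateThenFilter(links, 100, accept));
    }
}
```

If the image links were never extracted in the first place, the article link
would fall inside the limit and survive.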
So I've added a few options to nutch-default.xml to enable or disable the
extraction of links from specific HTML tags in this parser (SCRIPT, IMG, FORM,
LINK).
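
As a rough idea of how such per-tag switches could be read, here is a sketch.
The property prefix "example.outlinks.extract." and the class below are
hypothetical and are not taken from the patch; java.util.Properties stands in
for the Nutch configuration object, and each tag is assumed to default to
enabled.

```java
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical sketch: property names and class are illustrative only.
final class TagToggleSketch {

    // Assumed prefix, NOT the real option names from the patch.
    static final String PREFIX = "example.outlinks.extract.";

    // The four tags named in the issue description.
    static final String[] OPTIONAL_TAGS = {"script", "img", "form", "link"};

    // Returns the tags whose links should be extracted.
    static Set<String> enabledTags(Properties conf) {
        Set<String> tags = new LinkedHashSet<>();
        tags.add("a"); // anchors are always extracted in this sketch
        for (String tag : OPTIONAL_TAGS) {
            // Each optional tag stays enabled unless explicitly set to false.
            String value = conf.getProperty(PREFIX + tag, "true");
            if (Boolean.parseBoolean(value)) {
                tags.add(tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(PREFIX + "script", "false");
        conf.setProperty(PREFIX + "img", "false");

        // Prints [a, form, link]
        System.out.println(enabledTags(conf));
    }
}
```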