Doug Cutting wrote:

Andrzej Bialecki wrote:

Another question is: what are the consequences of NOT running the "analyze" step? How does this affect the fetchlist generation, and the search scoring?


It works fine.  This is what the "crawl" command does.

The indexer can optionally use log(number of incoming links) as a simplified link analysis score that does not require running "analyze".
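For illustration, here is a minimal sketch of what a log(number of incoming links) score looks like in isolation. It is not the indexer's actual code: the class and method names are hypothetical, and clamping the count to at least 1 is an assumption made here so that a page with no known inlinks scores 0 rather than negative infinity.

```java
// Hypothetical helper for illustration only; not taken from the Nutch sources.
final class LinkCountBoost {
    private LinkCountBoost() {}

    /**
     * Returns the natural log of the incoming link count.
     * Counts below 1 are clamped to 1 (an assumption of this sketch),
     * because Math.log(0) would yield -Infinity.
     */
    static float boost(int inlinkCount) {
        return (float) Math.log(Math.max(1, inlinkCount));
    }

    public static void main(String[] args) {
        // The log grows slowly, so heavily linked pages are favoured
        // without completely dominating the ranking.
        for (int n : new int[] {0, 1, 10, 1000}) {
            System.out.println(n + " inlinks -> " + boost(n));
        }
    }
}
```

Because the logarithm flattens large counts, going from 10 to 1000 inlinks only triples the score, which is presumably why it serves as a cheap stand-in for full link analysis.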

This is very good news indeed. After watching the "nutch analyze" process fill up my 0.5TB disk, it's good to know that it can be avoided.



The "crawl" tool specifies this option (indexer.boost.by.link.count), and it works pretty well.

Actually, I don't think it specifies this at the moment. IndexSegment does support it, though not as a command-line option, and perhaps it shouldn't be one, because you want this setting to stay consistent across many runs.



It would also be useful if fetchlist generation could similarly prioritize by number of incoming links. This would be easy to add. Simply change line 501 of FetchListTool.java to something like:


   curScore.set(scoreByLinkCount
                ? (float)Math.log(results.length)
                : page.getScore());

where scoreByLinkCount is determined by a config file property.
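A rough, self-contained sketch of how the flag could be read from configuration and applied. Everything here is an assumption for illustration: java.util.Properties stands in for whatever config mechanism Nutch uses, the property name "fetchlist.score.by.link.count" is made up, and inlinkCount plays the role of results.length in the snippet above. The clamp to 1 is also an addition of this sketch, to avoid Math.log(0).

```java
import java.util.Properties;

// Hypothetical stand-in for the fetchlist scoring decision; not FetchListTool itself.
final class FetchScoreSketch {
    // Made-up property name, chosen only for this example.
    static final String SCORE_BY_LINK_COUNT = "fetchlist.score.by.link.count";

    private FetchScoreSketch() {}

    /** Reads the flag from the config; defaults to false when the property is absent. */
    static boolean scoreByLinkCount(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(SCORE_BY_LINK_COUNT, "false"));
    }

    /**
     * Picks the score used to prioritise a page in the fetchlist:
     * log(inlinks) when the flag is set, otherwise the stored page score.
     */
    static float fetchScore(boolean byLinkCount, int inlinkCount, float pageScore) {
        return byLinkCount
            ? (float) Math.log(Math.max(1, inlinkCount)) // clamp avoids log(0)
            : pageScore;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(SCORE_BY_LINK_COUNT, "true");
        boolean flag = scoreByLinkCount(conf);
        System.out.println(fetchScore(flag, 100, 1.0f));
    }
}
```

Reading the flag once from the config, rather than from the command line, keeps the setting consistent across runs, which matches the concern raised earlier about IndexSegment.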

If I get a chance I'll try to add this, or maybe you can first.

I can do that.

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
