Hello,
I'm experiencing performance problems with link analysis, and I would like to look into the DistributedLinkAnalysis tool to see if something can be improved.
The symptoms are that for a 10mln-page/15mln-link webDB (roughly 6GB in size), running "nutch analyze" produces temporary files roughly 600GB in size, which is a problem not only in terms of storage space, but also because of the write throughput of the disks! The process takes several hours to complete on a fast machine.
I noticed that the scoreEdits.* files contain the full URLs of pages from the webDB. I was wondering what the reason for that is - we could use the MD5Hash as a unique identifier, at a fixed length of 16 bytes, instead of using the full URL.
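To illustrate the size difference, here is a minimal sketch of deriving a fixed 16-byte key from a URL with standard MD5 (class and method names are hypothetical - this is not the actual DistributedLinkAnalysis or MD5Hash code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: keying scoreEdits records by the 16-byte MD5 of a
// page's URL instead of the full URL string.
public class ScoreEditKey {
    public static byte[] urlKey(String url) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // An MD5 digest is always 16 bytes, regardless of URL length.
        return md5.digest(url.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        String url = "http://www.example.com/some/fairly/long/path/to/a/page.html";
        byte[] key = urlKey(url);
        System.out.println("URL bytes: " + url.getBytes(StandardCharsets.UTF_8).length
                + ", key bytes: " + key.length);
    }
}
```

A typical URL easily runs 50-100 bytes, so a fixed 16-byte key could cut the per-edit record size by a large factor, which would shrink the temporary files proportionally.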
Since I don't quite understand the algorithm used in the tool, I cannot say whether that would be the right way to fix it, but as it stands I expect it to be completely unusable for a DB with 50mln pages.
Another question is: what are the consequences of NOT running the "analyze" step? How does this affect the fetchlist generation, and the search scoring?
-- Best regards, Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
