Hi,
By removing URL loops with a regex added to regex-urlfilter.txt, my crawler's throughput went from 40 pages/second with 120 threads to 74 pages/second, an increase of roughly 85%.
Spider traps are a real problem for whole-web crawling. For now the only solution is to observe the URLs being inserted into the db and write an appropriate regex by hand.
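For reference, here is a minimal sketch of what such entries in regex-urlfilter.txt could look like. The patterns are illustrative, not the ones actually used in this thread, and they assume the filter is backed by a regex engine that supports backreferences (as java.util.regex does); '-' lines exclude, '+' lines include, and the first matching rule wins:

    # generic loop breaker: skip URLs where a three-segment path group
    # repeats back-to-back, e.g. .../kepaloldal/m/2002/kepaloldal/m/2002/...
    -.*(/[^/]+/[^/]+/[^/]+)\1.*

    # site-specific rule derived from observing the db: on this site the
    # looping segment should never legitimately appear twice in one URL
    -.*kepaloldal.*kepaloldal.*

    # accept everything else (the customary last line of this file)
    +.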
Massimo
[EMAIL PROTECTED] wrote:
Dear Doug,
I tried your suggestion, and it works fine, but:
how can I eliminate pages like the following from the fetch (the content is identical after the first '2002/kepaloldal/m')?
http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/1121kisputekep.htm
Thanks for your help, Ferenc
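As a quick sanity check before adding a pattern to regex-urlfilter.txt, a candidate regex can be tested against a trapped URL in a standalone snippet. Everything below is made up for illustration: the class name and the non-looping test URL are hypothetical, and the trap URL is a shortened version of the one above:

    import java.util.regex.Pattern;

    public class TrapFilterCheck {
        // candidate pattern: reject URLs where a three-segment path group
        // repeats back-to-back
        private static final Pattern TRAP =
            Pattern.compile(".*(/[^/]+/[^/]+/[^/]+)\\1.*");

        public static void main(String[] args) {
            // shortened version of the trapped URL from this thread
            String trapped = "http://www.nb1.hu/galeria/Hun_Ita/reti/m"
                + "/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002"
                + "/kepaloldal/m/2002/kepaloldal/1121kisputekep.htm";
            // a plausible non-looping URL on the same site (hypothetical)
            String normal = "http://www.nb1.hu/galeria/Hun_Ita/reti/m"
                + "/kepaloldal/m/2001/1121kisputekep.htm";

            System.out.println(TRAP.matcher(trapped).matches()); // true  -> excluded
            System.out.println(TRAP.matcher(normal).matches());  // false -> kept
        }
    }

Note that backreference matching backtracks, so overly broad patterns can get slow on very long URLs; pinning the rule to the observed trap (e.g. matching 'kepaloldal' twice) is cheaper.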
