Hi,
By removing URL loops with a regex added to regex-urlfilter.txt, my crawler's throughput went from 40 pages/second with 120 threads to 74 pages/second, an increase of roughly 85%.
Spider traps are a real problem for whole-web crawling. For now the only solution is to observe the URLs being inserted into the db and write an appropriate regex by hand.
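For reference, here is a minimal sketch of what such entries in regex-urlfilter.txt could look like. The patterns are illustrative, not the ones actually used in this thread, and they assume the filter is backed by a regex engine that supports backreferences (as java.util.regex does); '-' lines exclude, '+' lines include, and the first matching rule wins:

    # generic loop breaker: skip URLs where a three-segment path group
    # repeats back-to-back, e.g. .../kepaloldal/m/2002/kepaloldal/m/2002/...
    -.*(/[^/]+/[^/]+/[^/]+)\1.*

    # site-specific rule derived from observing the db: on this site the
    # looping segment should never legitimately appear twice in one URL
    -.*kepaloldal.*kepaloldal.*

    # accept everything else (the customary last line of this file)
    +.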
Massimo
[EMAIL PROTECTED] wrote:
Dear Doug,
I tried your suggestion, and it works fine, but:
how can I eliminate pages like the following from the fetch (the content is identical after the first '2002/kepaloldal/m')?
http://www.nb1.hu/galeria/Hun_Ita/reti/m/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2001/kepaloldal/1121kisputekep.htm
Thanks for your help, Ferenc
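As a quick sanity check before adding a pattern to regex-urlfilter.txt, a candidate regex can be tested against a trapped URL in a standalone snippet. Everything below is made up for illustration: the class name and the non-looping test URL are hypothetical, and the trap URL is a shortened version of the one above:

    import java.util.regex.Pattern;

    public class TrapFilterCheck {
        // candidate pattern: reject URLs where a three-segment path group
        // repeats back-to-back
        private static final Pattern TRAP =
            Pattern.compile(".*(/[^/]+/[^/]+/[^/]+)\\1.*");

        public static void main(String[] args) {
            // shortened version of the trapped URL from this thread
            String trapped = "http://www.nb1.hu/galeria/Hun_Ita/reti/m"
                + "/kepaloldal/m/2001/kepaloldal/m/2002/kepaloldal/m/2002"
                + "/kepaloldal/m/2002/kepaloldal/1121kisputekep.htm";
            // a plausible non-looping URL on the same site (hypothetical)
            String normal = "http://www.nb1.hu/galeria/Hun_Ita/reti/m"
                + "/kepaloldal/m/2001/1121kisputekep.htm";

            System.out.println(TRAP.matcher(trapped).matches()); // true  -> excluded
            System.out.println(TRAP.matcher(normal).matches());  // false -> kept
        }
    }

Note that backreference matching backtracks, so overly broad patterns can get slow on very long URLs; pinning the rule to the observed trap (e.g. matching 'kepaloldal' twice) is cheaper.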
