----- Original Message -----
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
Sent: Sunday, June 10, 2007 5:48 PM
> Enzo Michelangeli wrote:
>> ----- Original Message -----
>> From: "Berlin Brown" <[EMAIL PROTECTED]>
>> Sent: Sunday, June 10, 2007 11:24 AM
>>
>>> Yes, but how do you crawl the actual pages like you would in an
>>> intranet crawl? For example, let's say that I have 20 URLs in my set
>>> from the DmozParser, and that I want to go 3 levels deep into those
>>> 20 URLs. Is that possible?
>>>
>>> With the intranet crawl I would start with some seed URL and then go
>>> to some depth. How would I do that with URLs fetched from, for
>>> example, dmoz?
>>
>> The only way I can imagine is doing it on a host-by-host basis,
>> restricting the hosts you crawl at various stages with a URLFilter,
>> e.g. by changing the content of regex-urlfilter.txt .
>
> One simple and efficient way to limit the maximum depth (i.e. the
> number of path elements) for any given site is to ... count the
> slashes ;) You can do it in a regex, or you can implement your own
> URLFilter plugin that does exactly this.

Well, it depends on what you mean by "depth": maybe Berlin wants to limit
the length of the chain of recursion (page1.html links to page2.html,
which links to page3.html - and we stop there). Also, these days many
sites, such as blogs or CMS-based ones, serve dynamically generated
content, with no relationship between '/' and the tree structure of the
server's filesystem.

Enzo
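
P.S. For the record, both suggestions can be expressed in
regex-urlfilter.txt. The snippet below is only a sketch - the
example.com/example.org hosts are placeholders for whatever you pulled
out of dmoz, and the segment count is arbitrary. Rules are tried top to
bottom, '+' accepts, '-' rejects, and the first matching rule wins:

    # Reject URLs with four or more path segments after the host name
    # (Andrzej's "count the slashes" idea).
    -^https?://[^/]+(/[^/]+){4,}

    # Accept only the hosts we seeded (placeholders).
    +^https?://([a-z0-9-]+\.)*example\.com/
    +^https?://([a-z0-9-]+\.)*example\.org/

    # Reject everything else.
    -.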
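
If you prefer the plugin route, a bare-bones URLFilter that counts path
segments could look roughly like this (a sketch only: the class name and
package are made up, the plugin.xml and build glue are omitted, and the
exact interface - in particular the Configurable methods - varies a bit
between Nutch versions):

    package org.example.nutch;  // hypothetical package

    import java.net.MalformedURLException;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    /**
     * Rejects URLs whose path has more than MAX_SEGMENTS elements,
     * i.e. the "count the slashes" approach. Returning the URL keeps
     * it; returning null discards it.
     */
    public class DepthURLFilter implements URLFilter {

      private static final int MAX_SEGMENTS = 3;  // tune to taste

      private Configuration conf;

      public String filter(String urlString) {
        try {
          String path = new URL(urlString).getPath();  // e.g. "/a/b/c.html"
          int segments = 0;
          for (String s : path.split("/")) {
            if (s.length() > 0) {
              segments++;
            }
          }
          return segments <= MAX_SEGMENTS ? urlString : null;
        } catch (MalformedURLException e) {
          return null;  // drop anything we cannot parse
        }
      }

      // Configurable plumbing expected by Hadoop-based Nutch versions.
      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

You still have to register it in plugin.includes, just like the stock
urlfilter-regex plugin.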
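
As for limiting the length of the link chain rather than the path depth:
that is not a URLFilter's job at all - it is simply the number of
generate/fetch/updatedb rounds you run. With the one-shot crawl tool it
is the -depth parameter, e.g. (directory names and topN are placeholders):

    bin/nutch crawl dmoz-urls -dir crawl -depth 3 -topN 1000

Seed dmoz-urls with the 20 URLs from the DmozParser; -depth 3 then runs
three fetch rounds, i.e. the seeds plus two further levels of links
(bump the number if you count depth differently), provided the URL
filters don't trim the frontier first.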
