----- Original Message ----- 
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
Sent: Sunday, June 10, 2007 5:48 PM

> Enzo Michelangeli wrote:
>> ----- Original Message ----- From: "Berlin Brown" 
>> <[EMAIL PROTECTED]>
>> Sent: Sunday, June 10, 2007 11:24 AM
>>
>>> Yeah, but how do you crawl the actual pages like you would in an
>>> intranet crawl? For example, let's say that I have 20 URLs in my set
>>> from the DmozParser, and that I want to go 3 levels deep into those
>>> 20 URLs.  Is that possible?
>>>
>>> With the intranet crawl, for example, I would start with some seed URL
>>> and then crawl to some depth.  How would I do that with URLs fetched
>>> from, say, dmoz?
>>
>> The only way I can imagine is doing it on a host-by-host basis, 
>> restricting the hosts you crawl at the various stages with a URLFilter, 
>> e.g. by changing the content of regex-urlfilter.txt.
>
> One simple and efficient way to limit the maximum depth (i.e. the number 
> of path elements) for any given site is to ... count the slashes ;) You 
> can do it in a regex, or you can implement your own URLFilter plugin that 
> does exactly this.
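
To make the regex-urlfilter.txt idea I mentioned above a bit more concrete:
you can add a "+" rule for each host you want to stay on, and finish with a
"-." catch-all that rejects everything else. A rough, untested sketch (the
host names are just placeholders for whatever the DmozParser gives you):

   # accept pages on the hosts we care about...
   +^http://([a-z0-9-]+\.)*host1\.example\.org/
   +^http://([a-z0-9-]+\.)*host2\.example\.org/
   # ...and reject everything else
   -.

As far as I remember the rules are tried in order and the first match wins,
so the final "-." only applies to URLs that matched none of the hosts.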

As for counting the slashes - well, it depends on what you mean by "depth":
maybe Berlin wants to limit the length of the chain of recursion (page1.html
links to page2.html, which links to page3.html, and we stop there). Also,
these days many sites, such as blogs or CMS-based ones, serve dynamically
generated content, with no relationship between the '/' characters in a URL
and any tree structure in the server's filesystem.
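
If it's the recursion depth he wants to limit, the whole-web tools already
give him that knob: each generate/fetch/updatedb round follows links one
more level out from the injected seeds, so three rounds give (roughly) a
depth-3 crawl of the dmoz URLs. From memory - the exact syntax and paths may
differ in your version, so please check the whole-web tutorial - something
along these lines:

   # dmoz/ holds the seed list produced by the DmozParser
   bin/nutch inject crawl/crawldb dmoz
   # repeat the next three commands once per level of depth
   bin/nutch generate crawl/crawldb crawl/segments
   s=`ls -d crawl/segments/2* | tail -1`
   bin/nutch fetch $s
   bin/nutch updatedb crawl/crawldb $s

combined, of course, with whatever URLFilter rules are needed to keep the
fetch lists from wandering off to the rest of the web.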

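That said, if limiting the number of path elements is good enough, Andrzej's
slash-counting trick fits into regex-urlfilter.txt as well; untested, and
the repetition count is something to tune (remember that "http://" already
contributes two slashes), but roughly:

   # reject URLs with 6 or more slashes (roughly, more than 3 path elements)
   -^([^/]*/){6}

A custom URLFilter plugin doing the same thing would simply count the '/'
characters in the URL string and return null (i.e. reject) when there are
too many - if I recall the URLFilter contract correctly.
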
Enzo

