Re: [Nutch-general] Crawling the web and going into depth

Andrzej Bialecki Tue, 12 Jun 2007 14:48:16 -0700

Enzo Michelangeli wrote:
> ----- Original Message ----- From: "Berlin Brown" <[EMAIL PROTECTED]>
> Sent: Sunday, June 10, 2007 11:24 AM
> 
>> Yeah, but how do I crawl the actual pages like I would in an intranet
>> crawl? For example, let's say that I have 20 URLs in my set from the
>> DmozParser. Let's also say that I want to go 3 levels deep into each
>> of the 20 URLs. Is that possible?
>>
>> For example, with the intranet crawl I would start with some seed URL
>> and then go to some depth. How would I do that with URLs fetched from,
>> for example, dmoz?
> 
> The only way I can imagine is doing it on a host-by-host basis,
> restricting the host you crawl at various stages with a URLFilter, e.g.
> by changing the content of regex-urlfilter.txt.


One simple and efficient way to limit the maximum depth (i.e. the number 
of path elements) for any given site is to ... count the slashes ;) You 
can do it in a regex, or you can implement your own URLFilter plugin 
that does exactly this.
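
As a rough illustration of the slash-counting idea, here is a minimal
sketch in plain Java. The class name PathDepthFilter and its maxDepth
parameter are made up for this example and are not part of Nutch. It
mirrors the URLFilter convention of returning the URL to accept it and
null to reject it, but it does not implement the Nutch plugin interface
or its configuration wiring, so treat it as a starting point only.

```java
import java.net.URI;

/** Hypothetical helper (not part of Nutch): rejects URLs whose path is too deep. */
final class PathDepthFilter {
    private final int maxDepth; // assumed limit on the number of path elements

    PathDepthFilter(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    /** Counts the non-empty path elements, e.g. "/a/b/c.html" has 3. */
    static int pathDepth(String urlString) {
        // URI.create throws IllegalArgumentException on malformed input.
        String path = URI.create(urlString).getPath();
        if (path == null || path.isEmpty()) {
            return 0; // no path at all, e.g. "http://example.com"
        }
        int depth = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                depth++; // skip the empty piece before the leading slash
            }
        }
        return depth;
    }

    /** Returns the URL unchanged if it is shallow enough, or null to reject it. */
    String filter(String urlString) {
        try {
            return pathDepth(urlString) <= maxDepth ? urlString : null;
        } catch (IllegalArgumentException e) {
            return null; // unparseable URLs are rejected in this sketch
        }
    }
}
```

With new PathDepthFilter(3), a URL such as http://example.com/a/b/c.html
passes, while http://example.com/a/b/c/d.html is rejected. Only the path
is counted, so slashes inside a query string do not add to the depth.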


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


