Enzo Michelangeli wrote:
> ----- Original Message ----- From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
> Sent: Sunday, June 10, 2007 5:48 PM
> 
>> Enzo Michelangeli wrote:
>>> ----- Original Message ----- From: "Berlin Brown" 
>>> <[EMAIL PROTECTED]>
>>> Sent: Sunday, June 10, 2007 11:24 AM
>>>
>>>> Yeah, but how do you crawl the actual pages like you would in an
>>>> intranet crawl? For example, let's say that I have 20 URLs in my set
>>>> from the DmozParser, and that I want to go 3 levels deep into those
>>>> 20 URLs. Is that possible?
>>>>
>>>> For example, with the intranet crawl I would start with some seed URL
>>>> and then go to some depth. How would I do that with URLs fetched
>>>> from, for example, DMOZ?
>>>
>>> The only way I can imagine is doing it on a host-by-host basis, 
>>> restricting the hosts you crawl at various stages with a URLFilter, 
>>> e.g. by changing the content of regex-urlfilter.txt.
>>
>> One simple and efficient way to limit the maximum depth (i.e. the 
>> number of path elements) for any given site is to ... count the 
>> slashes ;) You can do it in a regex, or you can implement your own 
>> URLFilter plugin that does exactly this.
> 
> Well, it depends on what you mean by "depth": maybe Berlin wants to 
> limit the length of the chain of recursion (page1.html links to 
> page2.html, which links to page3.html - and we stop there). Also, these 
> days many sites, such as blogs or CMS-based ones, have 
> dynamically-generated content, with no relationship between '/' and the 
> tree structure of the server's filesystem.

Yes, there could be different definitions of depth.

When it comes to depth in the sense of proximity, i.e. how many levels 
removed a page is from the starting point - no problem with that either 
;) Here's how you can do it: put a counter in CrawlDatum.metadata and 
pass it along to each newly discovered page, incrementing it by one at 
every level. Once the counter reaches your limit, you stop adding 
outlinks from such pages.
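A self-contained sketch of that counter idea, outside Nutch (plain maps 
stand in for CrawlDatum.metadata; the class name, page names, and link 
graph are made up for illustration):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Sketch of depth limiting by proximity: a counter is attached to each
// discovered URL and incremented by one as it is passed to outlinks.
// In Nutch the counter would live in CrawlDatum.metadata; here a plain
// map plays that role, and OUTLINKS stands in for each page's parsed links.
public class DepthLimitedCrawl {
    static final int MAX_DEPTH = 3; // illustrative limit
    static final Map<String, List<String>> OUTLINKS = new HashMap<>();

    static List<String> crawl(String seed) {
        Map<String, Integer> depth = new HashMap<>(); // the "metadata" counter
        Queue<String> frontier = new ArrayDeque<>();
        List<String> fetched = new ArrayList<>();
        depth.put(seed, 0);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            fetched.add(url);
            int d = depth.get(url);
            if (d >= MAX_DEPTH) continue; // limit reached: drop this page's outlinks
            for (String out : OUTLINKS.getOrDefault(url, List.of())) {
                if (!depth.containsKey(out)) { // newly discovered page
                    depth.put(out, d + 1);     // pass the counter on, plus one
                    frontier.add(out);
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        OUTLINKS.put("page1.html", List.of("page2.html"));
        OUTLINKS.put("page2.html", List.of("page3.html"));
        OUTLINKS.put("page3.html", List.of("page4.html"));
        OUTLINKS.put("page4.html", List.of("page5.html"));
        System.out.println(crawl("page1.html")); // page5.html is never reached
    }
}
```

In real Nutch the counter would travel in CrawlDatum.metadata and be 
copied to each outlink's datum during the update step.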

If I'm not mistaken, this could be handled throughout the whole crawl 
cycle if you use a scoring plugin.
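For the other definition quoted above (path depth, i.e. counting 
slashes), a standalone sketch could look like this; the class is 
hypothetical and is not a real URLFilter plugin, which would instead 
implement org.apache.nutch.net.URLFilter and return null from filter() 
to reject a URL:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Sketch of depth limiting by path length: count the path elements
// (roughly, the slashes) and reject URLs that are nested too deep.
// NOT a real Nutch plugin; class name and limit are illustrative.
public class DepthByPath {
    static final int MAX_PATH_ELEMENTS = 3; // illustrative limit

    // Returns true if the URL's path has at most MAX_PATH_ELEMENTS segments.
    static boolean accept(String url) {
        try {
            String path = new URI(url).getPath();
            if (path == null || path.isEmpty()) return true;
            int segments = 0;
            for (String s : path.split("/")) {
                if (!s.isEmpty()) segments++;
            }
            return segments <= MAX_PATH_ELEMENTS;
        } catch (URISyntaxException e) {
            return false; // reject unparsable URLs
        }
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/a/b/c.html"));     // true: 3 elements
        System.out.println(accept("http://example.com/a/b/c/d/e.html")); // false: 5 elements
    }
}
```

In regex-urlfilter.txt the same idea could be expressed as a deny rule, 
e.g. the hypothetical pattern -^https?://[^/]+/([^/]+/){3,} to reject 
URLs with three or more slash-terminated path elements.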


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
