Hello All,

With FTP and file crawls you can check the modification date of the file and compare it against the date you already have on record. HTTP does not give you that luxury. If this is an internal site of yours being generated by a CMS, or even by hand, I'm sure you can produce a list of pages that have been updated since the last crawl. As for a generic web page in the wild, no software (that I am aware of) can determine whether a page has been updated without actually downloading it and comparing it against its history.
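To illustrate the point, here is a minimal, Nutch-independent sketch in Java: download the page, hash it, and compare the digest against whatever you stored on the previous crawl. The PageChangeCheck class and its in-memory history map are made up for this example; as far as I recall, Nutch keeps a per-page signature in its crawldb for a similar comparison, but check the version you are running.

import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch: detect a change by downloading and hashing the page. */
public class PageChangeCheck {

    // Stand-in "history" of url -> last seen content digest (Nutch keeps
    // something comparable in its crawldb; this map is just for illustration).
    private static final Map<String, String> lastDigest = new HashMap<>();

    static boolean hasChanged(String url) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new URL(url).openStream()) {   // the full download is unavoidable
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        String digest = hex.toString();
        String previous = lastDigest.put(url, digest);        // remember it for the next crawl
        return previous == null || !previous.equals(digest);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasChanged("http://example.com/"));
    }
}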
Chad

Yoni Amir wrote:
> Hey Gal,
>
> Thanks for the excellent explanation. I am surprised that Nutch will
> re-fetch a page (assuming 30 days have passed) even if the page hasn't
> been updated on the server.
>
> Yoni
>
> On Tue, 2006-12-05 at 15:41 +0200, Gal Nitzan wrote:
>
>> The concept of keeping track of the crawl db is as follows.
>>
>> Every URL that is found during a crawl (parse) is updated into the crawldb
>> by updatedb. Of course, the URL must pass all filters and normalizers before
>> this happens.
>>
>> When a URL is entered into the crawldb, an object (CrawlDatum) is created
>> with information about that link. One of the fields of CrawlDatum is a
>> status, which indicates the state of the URL; initially it is "unfetched".
>> When generate is called, the generator will add links whose status is
>> unfetched. The CrawlDatum also records when the URL was last fetched; if
>> you didn't change the settings, it should be re-fetched in 30 days.
>>
>> HTH
>>
>> Gal Nitzan
>>
>> ------ Original Message ------
>> Received: Mon, 04 Dec 2006 01:26:54 PM IST
>> From: Yoni Amir <[EMAIL PROTECTED]>
>> To: [email protected]
>> Subject: Re: Re-crawl
>>
>>> I am struggling with the same questions. I don't understand how Nutch
>>> decides whether to re-fetch content that was not updated, and how/where
>>> to configure it?
>>>
>>> Any help will be greatly appreciated :)
>>>
>>> Yoni
>>>
>>> On Mon, 2006-11-27 at 07:27 -0800, karthik085 wrote:
>>>
>>>> The first time I let Nutch crawl, if some URLs are not fetched, Nutch
>>>> reports an error in the log file. Is there a way Nutch can re-crawl and
>>>> update the affected/non-fetched ones without doing any operations on the
>>>> valid ones?
>>>>
>>>> Also, if I wanted to recrawl again, say after a few days/months, on the
>>>> same website, and some content of the website was updated and some not,
>>>> what does Nutch do in this case? What operations does it do for the
>>>> 1. updated content
>>>> 2. not-updated content
>>>> in the current database (the local database from the previous crawl)?
>>>>
>>>> Does it just get the updated content? Does it get all of it?
>>>>
>>>> If Nutch gets everything (updated and non-updated), is there a way we can
>>>> ask it to get only the updated content?

--
Chad Savage
Active Athlete®
[EMAIL PROTECTED]
Active Athlete® Advertising Network
www.activeathletemedia.com
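To make the re-fetch behaviour Gal describes in the quoted thread a bit more concrete, here is an illustrative Java sketch (not Nutch source code) of that decision: a crawl-db record carries a status, the last fetch time, and a fetch interval, and the generator selects records that are still unfetched or whose interval has elapsed, whether or not the page actually changed. CrawlRecord and dueForFetch are made-up names standing in for Nutch's CrawlDatum and generator logic; the 30-day default corresponds to the fetch-interval property in nutch-default.xml (its exact name and unit vary between Nutch versions), which you can override in nutch-site.xml.

import java.time.Duration;
import java.time.Instant;

/**
 * Illustrative sketch (not Nutch source) of the re-fetch decision described
 * in the quoted thread: a crawl-db record carries a status, the last fetch
 * time, and a fetch interval; the generator picks records that are unfetched
 * or whose interval has elapsed, regardless of whether the page changed.
 */
public class RefetchDecision {

    enum Status { UNFETCHED, FETCHED }

    // Stands in for Nutch's CrawlDatum; field names are made up for the example.
    static class CrawlRecord {
        Status status = Status.UNFETCHED;
        Instant lastFetch;                             // null until the first successful fetch
        Duration fetchInterval = Duration.ofDays(30);  // the default interval Gal mentions
    }

    /** Would the generator put this record on the next fetch list? */
    static boolean dueForFetch(CrawlRecord r, Instant now) {
        if (r.status == Status.UNFETCHED) {
            return true;                               // never fetched: always eligible
        }
        // Re-fetch once the interval has elapsed, updated on the server or not.
        return !now.isBefore(r.lastFetch.plus(r.fetchInterval));
    }

    public static void main(String[] args) {
        CrawlRecord r = new CrawlRecord();
        System.out.println(dueForFetch(r, Instant.now()));   // true: still unfetched

        r.status = Status.FETCHED;
        r.lastFetch = Instant.now().minus(Duration.ofDays(31));
        System.out.println(dueForFetch(r, Instant.now()));   // true: 30 days have passed
    }
}

Shortening or lengthening that interval in your configuration is how you control how aggressively pages are re-fetched; Nutch does not skip the download just because the page is unchanged, which is exactly what surprised Yoni.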
