Re: [Nutch-general] Recrawling and Merging

John Reidy Sat, 14 Jul 2007 03:50:40 -0700

Kai_testing Middleton wrote:

>Anuradha brought this up on nutch-dev and I also have a lot of questions 
>regarding recrawling and merging.  Unfortunately, many of these questions are 
>not even clearly formulated yet.  
>
>I have been working on a new blog. I only have two posts on there so far but 
>this one:
>http://nutch.wordpress.com/2007/07/13/recrawling-and-merging/
>is about recrawling and merging.
>
>There are a couple things I'm trying to accomplish:
>
>How do I control the crawling--how can I set up crawl jobs so that I know how 
>long they will take, so that I can throttle or stop them if necessary (not 
>supported I think).  Do I want to have lots of disparate crawls and merge them?
>
>One clear question to ask is: how do I manage web crawls--are the few examples 
>in the FAQ all there is or do we have more fine grained control?
>
>Another thing that I'm having significant cognitive dissonance on is 
>NUTCH-230.  Do I understand that recrawls have some kind of penalty in terms 
>of scoring--older pages get a higher score?  There is a whole conversation 
>about "cash value" and "inflation" here: 
>http://www.mail-archive.com/[EMAIL PROTECTED]/msg12695.html
>
>Please advise.
>
>
>
>----- Forwarded Message ----
>From: anuradha (JIRA) <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Thursday, July 12, 2007 4:40:04 AM
>Subject: [jira] Created: (NUTCH-511) Recrawling
>
>Recrawling 
>-----------
>
>                 Key: NUTCH-511
>                 URL: https://issues.apache.org/jira/browse/NUTCH-511
>             Project: Nutch
>          Issue Type: Wish
>    Affects Versions: 0.9.0
>            Reporter: anuradha
>
>
>Hi,
>
>First I crawled one website.
>I then added one page to the crawled site and recrawled the same website.
>I copied the recrawl script from 
>"http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03"
>
>But I didn't get the results from the newly added page.
>
>I am using nutch 0.9.0 and jvm/java-1.5.0-sun
>
>Please guide me on how to recrawl the site.
>Thanks in advance,
>
>Regards,
>Anuradha
>
>  
>
Thanks, this looks like a great resource. I have posted on the blog and 
will add to it when I have more information.
As I posted to this list earlier, I am interested in the incremental 
merge scenario, where I add new URLs and want to merge them with the 
main index on an hourly or daily basis. I believe that v0.71 did what I 
wanted, and I will look to see how v0.9 can do the same.
Regards
John Reidy.
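
For readers following the thread, the incremental scenario described above could be sketched roughly as below. This is only an illustration under stated assumptions, not a confirmed recipe from this thread: it assumes the usual `bin/nutch` subcommands (inject, generate, fetch, updatedb, invertlinks, index, dedup, merge) and a `crawl/` directory layout with `crawldb`, `linkdb` and `segments`. The function names, directory names and segment names are hypothetical. The Python code only builds the command lines; it does not run Nutch.

```python
from typing import List

# Assumption: the Nutch launcher script, relative to the install directory.
NUTCH = "bin/nutch"


def prepare_commands(crawl_dir: str, urls_dir: str, top_n: int) -> List[List[str]]:
    """Commands that add new URLs to the crawldb and generate a fetch list."""
    crawldb = f"{crawl_dir}/crawldb"
    segments = f"{crawl_dir}/segments"
    return [
        [NUTCH, "inject", crawldb, urls_dir],
        [NUTCH, "generate", crawldb, segments, "-topN", str(top_n)],
    ]


def segment_commands(
    crawl_dir: str,
    new_segment: str,
    all_segments: List[str],
    new_indexes: str,
    merged_index: str,
) -> List[List[str]]:
    """Commands to run once the newly generated segment directory is known.

    The generate step creates a timestamped segment directory, so the
    caller has to look up its name and pass it in as new_segment.
    """
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    return [
        [NUTCH, "fetch", new_segment],
        [NUTCH, "updatedb", crawldb, new_segment],
        [NUTCH, "invertlinks", linkdb, new_segment],
        # Index old and new segments into a fresh indexes directory.
        [NUTCH, "index", new_indexes, crawldb, linkdb, *all_segments],
        [NUTCH, "dedup", new_indexes],
        # Merge the part indexes into a single output index.
        [NUTCH, "merge", merged_index, new_indexes],
    ]


if __name__ == "__main__":
    # Hypothetical paths and segment names, for illustration only.
    plan = prepare_commands("crawl", "urls-new", 1000) + segment_commands(
        "crawl",
        "crawl/segments/NEW",
        ["crawl/segments/OLD", "crawl/segments/NEW"],
        "crawl/indexes-new",
        "crawl/index-new",
    )
    for cmd in plan:
        print(" ".join(cmd))
```

Running such a plan from cron would give the hourly or daily cadence. Whether this matches what v0.71 did is not established here, and swapping the merged index in for the live one is left out of the sketch.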


