Anuradha brought this up on nutch-dev and I also have a lot of questions 
regarding recrawling and merging.  Unfortunately, many of these questions are 
not even clearly formulated yet.  

I have been working on a new blog.  It only has two posts so far, but this 
one is about recrawling and merging:
http://nutch.wordpress.com/2007/07/13/recrawling-and-merging/

There are a couple of things I'm trying to figure out:

How do I control crawling?  Specifically, how can I set up crawl jobs so that 
I know how long they will take and can throttle or stop them if necessary 
(which I think is not supported)?  Should I run lots of separate crawls and 
merge them?

One clear question to ask is: how do I manage web crawls?  Are the few examples 
in the FAQ all there is, or do we have more fine-grained control?

Another thing I'm struggling to make sense of is NUTCH-230.  Do I understand 
correctly that recrawls incur some kind of scoring penalty, i.e. that older 
pages get a higher score?  There is a whole conversation about "cash value" 
and "inflation" here: 
http://www.mail-archive.com/[EMAIL PROTECTED]/msg12695.html

Please advise.



----- Forwarded Message ----
From: anuradha (JIRA) <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, July 12, 2007 4:40:04 AM
Subject: [jira] Created: (NUTCH-511) Recrawling

Recrawling 
-----------

                 Key: NUTCH-511
                 URL: https://issues.apache.org/jira/browse/NUTCH-511
             Project: Nutch
          Issue Type: Wish
    Affects Versions: 0.9.0
            Reporter: anuradha


Hi,

First I crawled one website.
Then I added one page to the crawled site and recrawled the same website.
I copied the recrawl code from 
http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03

But I didn't get any results from the newly added page.

I am using nutch 0.9.0 and jvm/java-1.5.0-sun

Please guide me on how to recrawl the site.
Thanks in advance,

Regards,
Anuradha

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.








       
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
