Anuradha brought this up on nutch-dev and I also have a lot of questions
regarding recrawling and merging. Unfortunately, many of these questions are
not even clearly formulated yet.
I have been working on a new blog. It only has two posts so far, but this one
is about recrawling and merging:
http://nutch.wordpress.com/2007/07/13/recrawling-and-merging/
There are a couple of things I'm trying to figure out:
How do I control the crawling? How can I set up crawl jobs so that I know how
long they will take, and so that I can throttle or stop them if necessary (not
supported, I think)? Do I want to have lots of disparate crawls and merge them?
One clear question to ask is: how do I manage web crawls? Are the few examples
in the FAQ all there is, or do we have more fine-grained control?
Another thing I'm having significant cognitive dissonance about is NUTCH-230.
Do I understand correctly that recrawls carry some kind of penalty in terms of
scoring, so that older pages get a higher score? There is a whole conversation
about "cash value" and "inflation" here:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg12695.html
Please advise.
----- Forwarded Message ----
From: anuradha (JIRA) <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, July 12, 2007 4:40:04 AM
Subject: [jira] Created: (NUTCH-511) Recrawling
Recrawling
-----------
Key: NUTCH-511
URL: https://issues.apache.org/jira/browse/NUTCH-511
Project: Nutch
Issue Type: Wish
Affects Versions: 0.9.0
Reporter: anuradha
Hi,
First I crawled one website.
I then added one page to the crawled site and recrawled the same website.
I copied the recrawl code from
"http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03"
but I didn't get any results from the newly added page.
I am using Nutch 0.9.0 and jvm/java-1.5.0-sun.
Please guide me on how to recrawl the site.
Thanks in advance,
Regards,
Anuradha
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general