Kai_testing Middleton wrote:
>Anuradha brought this up on nutch-dev and I also have a lot of questions
>regarding recrawling and merging. Unfortunately, many of these questions are
>not even clearly formulated yet.
>
>I have been working on a new blog. I only have two posts on there so far but
>this one:
>http://nutch.wordpress.com/2007/07/13/recrawling-and-merging/
>is about recrawling and merging.
>
>There are a couple things I'm trying to accomplish:
>
>How do I control the crawling--how can I set up crawl jobs so that I know how
>long they will take, so that I can throttle or stop them if necessary (not
>supported I think). Do I want to have lots of disparate crawls and merge them?
>
>One clear question to ask is: how do I manage web crawls--are the few examples
>in the FAQ all there is or do we have more fine grained control?
>
>Another thing that I'm having significant cognitive dissonance on is
>NUTCH-230. Do I understand that recrawls have some kind of penalty in terms
>of scoring--older pages get a higher score? There is a whole conversation
>about "cash value" and "inflation" here:
>http://www.mail-archive.com/[EMAIL PROTECTED]/msg12695.html
>
>Please advise.
>
>
>
>----- Forwarded Message ----
>From: anuradha (JIRA) <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Thursday, July 12, 2007 4:40:04 AM
>Subject: [jira] Created: (NUTCH-511) Recrawling
>
>Recrawling
>-----------
>
>                 Key: NUTCH-511
>                 URL: https://issues.apache.org/jira/browse/NUTCH-511
>             Project: Nutch
>          Issue Type: Wish
>    Affects Versions: 0.9.0
>            Reporter: anuradha
>
>
>Hi,
>
>First I have crawled one website.
>I added one page to crawled site. After that I have recrawled the same website.
>I have copied the recrawling the code from
>"http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03"
>
>But I didn't get the results from the newly added page.
>
>I am using nutch 0.9.0 and jvm/java-1.5.0-sun
>
>Please guide me how to recrawl the site.
>Thanks in advance,
>
>Regards,
>Anuradha

Thanks, this looks like a great resource. I have posted on the blog and will add to it when I have information.

As I have posted to this list earlier, I am interested in the incremental merge scenario, where I add new URLs and want to merge them with the main index on an hourly/daily basis. I believe that v0.71 did what I wanted and will look to see how v0.9 can do the same.

Regards
John Reidy.
