I am not a Nutch expert at all, I am just experimenting a little,
but as far as I understand it
you can remove the indexes directory and re-index the segments.
In my case, after step 8 (see below), I have only one segment:
test/segments/20060522144050
after step 9 I have a second segment:
test/segments/20060522151957
Now we can remove the test/indexes directory and
re-index the two segments.
This is what I did:
hadoop dfs -rm test/indexes
nutch index test/indexes test/crawldb linkdb \
test/segments/20060522144050 test/segments/20060522151957
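To double-check which segments are there before re-indexing, a listing
should do:
hadoop dfs -ls test/segments
Note that I did not re-run invertlinks; if the linkdb should also cover
the second segment, I suppose something like this is needed first
(untested):
nutch invertlinks linkdb test/segments/20060522144050 \
test/segments/20060522151957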
Hope it helps.
-Corrado
Jacob Brunson wrote:
I looked at the referenced message at
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
but I am still having problems.
I am running the latest checkout from subversion.
These are the commands which I've run:
bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
bin/nutch generate crawl/crawldb crawl/segments -topN 500
lastsegment=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $lastsegment
bin/nutch updatedb crawl/crawldb $lastsegment
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment
This last command fails with a java.io.IOException saying: "Output
directory /home/nutch/nutch/crawl/indexes already exists"
So I'm confused because it seems like I did exactly what was described
in the referenced email, but it didn't work for me. Can someone help
me figure out what I'm doing wrong or what I need to do instead?
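I suppose the indexer simply refuses to overwrite an existing output
directory, so perhaps I am expected to remove it and rebuild the index
over all segments, something like (just a guess on my part):
rm -rf crawl/indexes
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
But I don't know if that is the intended procedure.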
Thanks,
Jacob
On 5/22/06, sudhendra seshachala <[EMAIL PROTECTED]> wrote:
Please follow the link below:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
I have been able to follow the thread as explained and merge
multiple crawls. It works like a champ.
Thanks
Sudhi
zzcgiacomini <[EMAIL PROTECTED]> wrote:
I am currently using the latest nightly nutch-0.8-dev build and
I am really confused about how to proceed after I have done two
different "whole web" incremental crawls.
The tutorial is not clear to me on how to merge the results of the
two crawls so that I can search across both of them.
Could someone please give me a hint on the right procedure?
Here is what I am doing:
1. create an initial urls file /tmp/dmoz/urls.txt
2. hadoop dfs -put /tmp/dmoz dmoz
3. nutch inject test/crawldb dmoz
4. nutch generate test/crawldb test/segments
5. nutch fetch test/segments/20060522144050
6. nutch updatedb test/crawldb test/segments/20060522144050
7. nutch invertlinks linkdb test/segments/20060522144050
8. nutch index test/indexes test/crawldb linkdb
test/segments/20060522144050
...and now I am able to search.
Now I run
9. nutch generate test/crawldb test/segments -topN 1000
and I end up with a new segment: test/segments/20060522151957
10. nutch fetch test/segments/20060522151957
11. nutch updatedb test/crawldb test/segments/20060522151957
From this point on I cannot make much progress.
A) I have tried to merge the two segments into a new one, with the
idea of re-running invertlinks and index on it, but with
nutch mergesegs test/segments -dir test/segments
whatever I specify as output dir or output segment, I get errors.
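(From the usage message I suspect the output directory must be
different from the input one, e.g. something like
nutch mergesegs test/segments_merged -dir test/segments
where test/segments_merged is just a name I made up, but I have not
managed to verify this.)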
B) I have also tried to run invertlinks on all of test/segments, with
the goal of running the nutch index command to produce a second
indexes directory, say test/indexes1, and finally merging the two
indexes into index2:
nutch invertlinks test/linkdb -dir test/segments
This has created a new linkdb directory *NOT* under test as specified,
but as /linkdb-1108390519
nutch index test/indexes1 test/crawldb linkdb \
test/segments/20060522151957
nutch merge index2 test/indexes test/indexes1
Now I am not sure what to do; if I remove test/indexes and then
rename the merged index2 to test/indexes,
I am no longer able to search.
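Concretely, what I tried was roughly (assuming dfs -mv does a rename):
hadoop dfs -rm test/indexes
hadoop dfs -mv index2 test/indexes
and after that the search returns no hits.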
-Corrado
Sudhi Seshachala
http://sudhilogs.blogspot.com/