Re: Indexing and Re-crawling site

Lukas Vlcek Mon, 04 Dec 2006 14:12:09 -0800

Hi,

I will try to use my outdated knowledge to answer (& confuse you on) your
items:


1.      Why does Nutch have to create a new index every time it indexes,
when it could just merge with the old existing index? I tried changing the
value in the IndexMerger class to 'false' when creating an index, so that
Lucene doesn't recreate a new index each time it indexes. The problem
with this is that I keep getting an exception when it tries to merge the
indexes: a lock timeout exception is thrown by the IndexMerger, which in
turn affects the index that gets created. Is it possible to let Nutch index
by merging with an existing index? I have to crawl about 100 GB of data, and
if only a few documents have changed, I don't want Nutch to recreate a new
index because of that, but rather to update the existing index by merging it
with the new one. I need some light on this.



This is more a question for the Nutch experts, but to me creating a new index
seems reasonable. Among other things, it means that the original index is
still searchable while the new index is being created (creating a new index
can take a long time, depending on your settings). Updating one document at a
time in a large index is not a very efficient approach, I think.
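
To illustrate the idea of keeping the old index searchable while a new one is
built, here is a minimal Java sketch. It assumes the new index is written to a
separate directory and only swapped into place once it is complete. The class
name IndexSwap and the directory layout are hypothetical and not part of
Nutch; only standard java.nio.file calls are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not a Nutch class: swaps a finished index
// directory into the place of the live one.
final class IndexSwap {

    /**
     * Moves the live index aside to 'backup', then moves the freshly
     * built index to the live location. All three paths are assumed to
     * be on the same filesystem, so each move is a plain rename.
     * Note: between the two moves there is a short window in which the
     * live directory does not exist.
     */
    static void swap(Path live, Path fresh, Path backup) throws IOException {
        if (Files.exists(backup)) {
            throw new IOException("Backup location already exists: " + backup);
        }
        if (Files.exists(live)) {
            // Keep the old index so it can be restored if needed.
            Files.move(live, backup, StandardCopyOption.ATOMIC_MOVE);
        }
        Files.move(fresh, live, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The searcher keeps reading the old directory for the whole time the new index
is being built; whether it has to be restarted or reopened after the swap
depends on your setup.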

2.      What is the best way to make Nutch re-crawl? I have implemented a
class that loops the crawl process; it has a crawl interval, which is set in
a property file, and a running status. The running status is a Boolean
variable which is set to true if the re-crawl process is ongoing or false if
it should stop. But with this approach, it seems that the index is not being
fully generated. The values in the index cannot be queried. The re-crawl is
written in Java and calls an underlying Ant script to run Nutch. I know most
re-crawls are written as batch scripts, but can you tell me which one you
recommend? A batch script or a loop-based Java program?



I used to use a batch script and was happy with it.

3.      What is the best way of implementing Nutch as a Windows service or
Unix daemon?


Sorry - what do you mean by this?

Regards,
Lukas
