Hi guys,

 

I want to extend Nutch to do real-time indexing of a local file system. I
have been through the source code looking for ways to modify the values
stored in the CrawlDB. The idea is simple:

 

I have an external program (or a script) which checks for changes in a
directory (whose url is injected in the crawldb). When new changes are
recorded, the program will update the status in the crawldb and generate a
new fetch list for the fetcher to fetch. I do not want to make major changes
to the Nutch source code, as I want the program to stay compatible with
future releases. I know the CrawlDatum is saved in the crawldb along with the
url. I am not too sure, but I think the url is the key used to retrieve the
CrawlDatum. For my program to work, I need to know the following:

 

*         How to read data from the crawldb: what data structure does it use,
and how do I reference an entry in it?

*         How to write back to the crawldb: either updating entries in place,
or creating a new crawldb containing both the changed and unchanged values.
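
To make the two questions concrete, here is a rough sketch of the read-modify-write cycle I have in mind. It is only an in-memory model under my own assumptions: `CrawlDbModel`, `Entry`, `markChanged` and `isDue` are names I made up, not Nutch classes or methods, and I am assuming that the url is the key and that an entry becomes eligible for fetching once its fetch time is no longer in the future. I am not sure whether the status would have to change as well.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a url-keyed store; NOT a Nutch class. */
final class CrawlDbModel {

    /** Hypothetical value type holding two of the fields seen in a crawldb entry. */
    static final class Entry {
        final int status;        // e.g. 2 for DB_fetched
        final long fetchTimeMs;  // time from which the url is due for fetching

        Entry(int status, long fetchTimeMs) {
            this.status = status;
            this.fetchTimeMs = fetchTimeMs;
        }
    }

    private final Map<String, Entry> entries;

    CrawlDbModel(Map<String, Entry> entries) {
        // Sorted, read-only copy: the caller's map is never modified.
        this.entries = Collections.unmodifiableMap(new TreeMap<>(entries));
    }

    /** Read: look up one entry by its url key; null if the url is unknown. */
    Entry get(String url) {
        return entries.get(url);
    }

    /** Write: return a NEW model in which the given url is due at nowMs. */
    CrawlDbModel markChanged(String url, long nowMs) {
        Entry old = entries.get(url);
        if (old == null) {
            return this; // unknown url: nothing to update
        }
        Map<String, Entry> copy = new TreeMap<>(entries);
        copy.put(url, new Entry(old.status, nowMs));
        return new CrawlDbModel(copy);
    }

    /** An entry is due when its fetch time is not in the future. */
    boolean isDue(String url, long nowMs) {
        Entry e = entries.get(url);
        return e != null && e.fetchTimeMs <= nowMs;
    }
}
```

In this model, "writing back" never edits the old store; it produces a new one that carries the unchanged entries over and replaces only the changed one, which is the second of the two options in my question.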

 

This is an extract from the crawldb:

 

http://some-url.com/    Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:44:05 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0323955
Signature: f4c14c46074b66aad8829b8aa84cd636
Metadata: null
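
As a fallback, an external program could at least read a text record like the one above. The following is a minimal sketch that only assumes the "Name: value" layout visible in the extract; `CrawlDbDumpParser`, its methods and the "url" map key are names of my own choosing, not part of Nutch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Parses one text record of the form shown in the extract; names are illustrative. */
final class CrawlDbDumpParser {

    /**
     * Turns a record into field-name -> value pairs. The url on the first
     * line is stored under the made-up key "url".
     */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String raw : record.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines between fields
            }
            // The first line carries the url followed by the "Version:" field.
            int versionAt = line.indexOf("Version:");
            if (versionAt > 0) {
                fields.put("url", line.substring(0, versionAt).trim());
                line = line.substring(versionAt);
            }
            // Split at the first ": " so times such as 12:44:05 stay intact.
            int colon = line.indexOf(": ");
            if (colon > 0) {
                fields.put(line.substring(0, colon), line.substring(colon + 2).trim());
            }
        }
        return fields;
    }

    /** Extracts the numeric code from a value such as "2 (DB_fetched)". */
    static int statusCode(Map<String, String> fields) {
        String status = fields.get("Status");
        return Integer.parseInt(status.split("\\s+")[0]);
    }
}
```

This only covers reading; it says nothing about how the values are stored on disk or how to write them back, which is the part I still need help with.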

 

How can I get this information from an external program and modify or update
it? Once I know how to implement that part, I can call Nutch in the usual
sequence of generate - fetch - updatedb - updatelinkdb - index, etc., so that
generate will pick up the new values for the urls I want re-indexed. This will
stop the fetcher from fetching a long list of urls (changed or unchanged, but
fetched anyway because their next_fetch_time is due). The program is notified
by the underlying OS about any changes to the files and folders being
monitored. Once the program is working and sufficiently tested, I will be
willing to share the source code; it is written in Java and doesn't need any
script to launch Nutch.
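
For the monitoring side, this is roughly the shape I have in mind, sketched with the standard `java.nio.file.WatchService` API. `ChangedUrlCollector` and its methods are hypothetical names, and writing the changed urls to a plain one-url-per-line text file is only my assumption about how they could be handed over to the next step.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Collects urls of changed files; class and method names are illustrative. */
final class ChangedUrlCollector {

    private final Set<String> pending = new LinkedHashSet<>();

    /** Remembers a changed path as a file: url; repeats are ignored. */
    synchronized void record(Path changed) {
        pending.add(changed.toAbsolutePath().normalize().toUri().toString());
    }

    /** Writes the pending urls one per line, clears them, and returns what was written. */
    synchronized List<String> flushTo(Path urlListFile) throws IOException {
        List<String> urls = new ArrayList<>(pending);
        Files.write(urlListFile, urls, StandardCharsets.UTF_8);
        pending.clear();
        return urls;
    }

    /** Blocks on OS notifications for one directory and records each change. */
    void watch(Path dir) throws IOException, InterruptedException {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY,
                    StandardWatchEventKinds.ENTRY_DELETE);
            while (true) {
                WatchKey key = watcher.take(); // waits until something changes
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue; // events were lost; a full rescan would be needed
                    }
                    record(dir.resolve((Path) event.context()));
                }
                if (!key.reset()) {
                    break; // the directory is no longer accessible
                }
            }
        }
    }
}
```

A single registration covers only the directory itself, not its subdirectories, so a real version would have to register each folder being monitored.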

 

I will be looking forward to your kind support.

 

Armel

 

-------------------------------------------------

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

 
