Hi guys,
I want to extend Nutch to do real-time indexing of a local file system. I have been through the source code looking for ways to modify the values stored in the CrawlDB.

The idea is simple: I have an external program (or a script) which checks for changes in a directory (a url injected into the crawldb). When changes are recorded, the program will update the status in the crawldb and generate a new fetch list for the fetcher to fetch. I do not want to make great changes to the Nutch source code, as I want the program to stay compatible with future releases.

Now, I know the crawldatum is saved in the crawldb together with the url. I am not too sure, but I think the url is the key used to retrieve the crawldatum. For my program to work, I need to know the following:

* How to read data from the crawldb: what data structure does it use, and how do I reference it?
* How to write back to the crawldb: either updating the information in place, or creating a new crawldb containing both the changed and the unchanged values.

This is an extract from the crawldb:

http://some-url.com/ Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:44:05 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0323955
Signature: f4c14c46074b66aad8829b8aa84cd636
Metadata: null

How can I get this information from an external program and modify/update it? Once I know how to implement that part, I can call Nutch in the usual way (generate, fetch, updatedb, updatelinkdb, index, etc.), so that generate picks up the new values for the urls I want re-indexed. This will stop the fetcher from fetching a long list of urls, changed or unchanged, merely because their next_fetch_time is due. The program is notified by the underlying OS of any changes to the files and folders being monitored.
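To make the question more concrete, here is a minimal sketch of the only approach I can see so far: parsing the text form of a record, like the extract above, into key/value pairs. The class and method names (CrawlDbDumpRecord, parse, putField) are my own and hypothetical, not part of Nutch. It assumes one field per line with the url leading the first line, and it only reads the text; it cannot write anything back, which is exactly the part I need help with.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: turns the text form of one
// crawldb record into an ordered map of field name -> raw string value.
final class CrawlDbDumpRecord {

    // Assumes the url is the first token of the first line and that every
    // remaining field has the form "Name: value" on its own line.
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        String[] lines = record.trim().split("\\R");

        // First line: "<url> Version: N" (separated by any whitespace).
        String[] head = lines[0].trim().split("\\s+", 2);
        fields.put("URL", head[0]);
        if (head.length > 1) {
            putField(fields, head[1]);
        }
        for (int i = 1; i < lines.length; i++) {
            putField(fields, lines[i]);
        }
        return fields;
    }

    // Splits at the first ": " so values such as "12:44:05" stay intact.
    private static void putField(Map<String, String> fields, String line) {
        int sep = line.indexOf(": ");
        if (sep < 0) {
            return; // not a "Name: value" line, ignore it
        }
        fields.put(line.substring(0, sep).trim(),
                   line.substring(sep + 2).trim());
    }

    public static void main(String[] args) {
        String record = String.join("\n",
            "http://some-url.com/ Version: 4",
            "Status: 2 (DB_fetched)",
            "Fetch time: Thu Feb 22 12:44:05 GMT 2007",
            "Retry interval: 30.0 days");
        Map<String, String> fields = parse(record);
        System.out.println(fields.get("URL") + " -> " + fields.get("Status"));
    }
}
```

Going through a text dump like this seems fragile, so I would much prefer to read and write the crawldatum objects directly if someone can point me to the right classes.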
Once the program is working and has been sufficiently tested, I will be willing to share the source code; it is written in Java and doesn't need any script to launch Nutch.

I look forward to your kind support.

Armel

-------------------------------------------------
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com