Piotr Kosiorowski wrote:


To keep the index up to date you can simply start the crawl from scratch or, if that takes too much time, use the 'Whole web crawling' method described in the tutorial to do it incrementally.


If I crawl from scratch with 'nutch crawl' I can't crawl to the same directory - it fails, telling me that the directory already exists, which means I have to restart Tomcat on the new directory after crawling. Is it possible with the "whole web crawling" method to crawl to an existing directory while the web interface is live on the very same directory?

I haven't found any reference to how people actually set up a search engine using Nutch and keep it running and fresh. There are only descriptions like this: 1) crawl with 'nutch crawl' or use the whole web crawl 2) copy nutch-x.y.x.war to webapps/ROOT.war and go. Where is step 3) - how to keep the index up to date? I mean, do people run the crawler in a cron job, or keep it running slowly all the time, or what? I'm specifically crawling an intranet.
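To make the "fresh directory each time" workaround concrete, here is a minimal sketch of what a scheduled job could do: crawl into a new timestamped directory and then repoint a symlink at it. The helper names (crawl_and_swap, nutch_crawl_cmd), the "current" link name and the directory layout are hypothetical, and the command line is only modelled on the tutorial's intranet crawl. Whether the web interface picks up the new index without a restart is not something this sketch settles.

```python
import os
import subprocess
import time


def crawl_and_swap(base_dir, link_name, build_cmd, stamp=None):
    """Crawl into a fresh directory, then repoint a symlink at it.

    build_cmd is a callable that takes the new directory path and
    returns the command (argument list) that fills it.
    """
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    new_dir = os.path.abspath(os.path.join(base_dir, "crawl-" + stamp))
    # The crawl refuses to reuse a directory, so insist on a new one.
    if os.path.exists(new_dir):
        raise FileExistsError(new_dir)
    subprocess.run(build_cmd(new_dir), check=True)

    link = os.path.join(base_dir, link_name)
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_dir, tmp_link)
    # Renaming over the old link replaces it in one step on POSIX systems.
    os.replace(tmp_link, link)
    return new_dir


def nutch_crawl_cmd(target_dir):
    # Assumed invocation; adjust the URL file, depth and paths to your setup.
    return ["bin/nutch", "crawl", "urls", "-dir", target_dir, "-depth", "3"]
```

A cron job could then call crawl_and_swap("/data/nutch", "current", nutch_crawl_cmd), with the web application configured to read from the "current" link; the old crawl directories are left in place and would need to be cleaned up separately.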


Things work well with html, msword and pdf, but I'd like to index zip-archives, tar.gz archives, rpm-files and openoffice documents as well. Are plugins for these file types available?

I think the zip plugin is already committed in trunk, but I am not sure whether it is part of the Nutch 0.7.1 distribution. There is a JIRA issue about OpenOffice docs; it is not committed yet, but there is a chance it will be at some point. As for tar.gz and rpm plugins, I have never heard of such attempts; you can try to write your own.


Okay - I see there is a parse-ext plugin, which I might use to index the what-provides, changelog and info output from an rpm - the simple, easily accessible stuff that doesn't require extracting anything from the rpm.

Thanks for your help so far!

Thomas Sondergaard


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
