[Nutch-general] Re: is the nutch shell script only used for initial crawling

Thomas Sondergaard Tue, 03 Jan 2006 00:35:04 -0800

Piotr Kosiorowski wrote:

To keep the index up to date you can simply start the crawl from scratch or, if that takes too much time, you can use the 'Whole web crawling' method described in the tutorial to do it incrementally.

If I crawl from scratch with 'nutch crawl' I can't crawl to the same directory - it fails, telling me that the directory already exists, which means I have to restart tomcat on the new directory after crawling. Is it possible with the "whole web crawling" method to crawl to an existing directory while the web interface is live on the very same directory?
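The crawl-to-a-new-directory workflow described above could be scripted roughly as follows. This is only a sketch under stated assumptions, not something taken from the Nutch documentation: the names `fresh_crawl_dir`, `switch_current` and `recrawl`, the `current` symlink and the `{dir}` placeholder are all made up for illustration, and the crawl and restart commands are supplied by the caller. It also assumes the web application is configured to read from the `current` link, and it still restarts the servlet container afterwards.

```python
import os
import subprocess
import time


def fresh_crawl_dir(base, now=None):
    """Return a timestamped directory name under base (the directory is not created)."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    return os.path.join(base, "crawl-" + stamp)


def switch_current(link, target):
    """Repoint the symlink `link` at `target` by renaming a temporary link over it."""
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    # rename() replaces the old link in one step on POSIX systems
    os.replace(tmp, link)


def recrawl(base, crawl_cmd, restart_cmd):
    """Crawl into a new directory, repoint base/current, then restart the web app.

    crawl_cmd is the caller's own crawl command line as a list, with the
    hypothetical placeholder "{dir}" wherever the output directory belongs.
    restart_cmd is whatever restarts the servlet container on this machine.
    """
    target = fresh_crawl_dir(base)
    argv = [arg.replace("{dir}", target) for arg in crawl_cmd]
    subprocess.run(argv, check=True)       # abort here if the crawl fails
    switch_current(os.path.join(base, "current"), target)
    subprocess.run(restart_cmd, check=True)
    return target
```

Run from cron, something like this would keep the old index in service until the new crawl has finished. It does not answer whether the whole web crawling method can update a directory in place while the web interface is live.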

I haven't found any reference to how people actually set up a search engine using Nutch and keep it running and fresh. There are only descriptions like this: 1) crawl with 'nutch crawl' or use the whole web crawl, 2) copy nutch-x.y.x.war to webapps/ROOT.war and go. Where is step 3) - how to keep the index up to date? I mean, do people run the crawler in a cron job, or keep it running slowly all the time, or what? I'm specifically crawling an intranet.

Things work well with html, msword and pdf, but I'd like to index zip archives, tar.gz archives, rpm files and openoffice documents as well. Are plugins for these file types available?
I think the zip plugin is already committed in trunk, but I am not sure if it is part of the nutch 0.7.1 distribution. There is a JIRA issue about openoffice docs - it is not committed yet, but there are chances it will be in some time. For tar.gz and rpm plugins - I never heard of such attempts - you can try to write your own.

Okay - I see there is a parse-ext plugin, which I might use to index the what-provides, changelog and info stuff from an rpm - the simple, easily accessible stuff, without having to extract anything from the rpm.
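As a rough sketch of the rpm idea above: the provides, changelog and info text can be read from a package file with the standard `rpm -qp` query options, without unpacking it. The helper names `rpm_query_argv` and `rpm_text` are hypothetical, and how a script like this would be hooked into parse-ext is not shown here, since that depends on the plugin's own configuration.

```python
import subprocess

# Standard rpm query options: -q queries, -p reads a package file
# instead of the installed-package database.
RPM_QUERIES = {
    "info": ["-i"],
    "provides": ["--provides"],
    "changelog": ["--changelog"],
}


def rpm_query_argv(rpm_path, what):
    """Build the rpm command line for one kind of metadata query."""
    return ["rpm", "-qp"] + RPM_QUERIES[what] + [rpm_path]


def rpm_text(rpm_path):
    """Run the three queries and join their output into one block of text."""
    parts = []
    for what in ("info", "provides", "changelog"):
        result = subprocess.run(
            rpm_query_argv(rpm_path, what),
            capture_output=True, text=True, check=True,
        )
        parts.append(result.stdout)
    return "\n".join(parts)
```

Calling `rpm_text` requires the `rpm` tool to be installed on the machine doing the parsing; `rpm_query_argv` only builds the argument list.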


Thanks for your help so far!

Thomas Sondergaard

