Piotr Kosiorowski wrote:
> For keeping index up to date you can simply start crawl from scratch
> or if it takes too much time you can use 'Whole web crawling' method
> described in tutorial to do it incrementally.
If I crawl from scratch with 'nutch crawl' I can't crawl to the same
directory - it fails, telling me that the directory already exists, which
means I have to restart tomcat on the new directory after crawling. Is
it possible with the "whole web crawling" method to crawl to an existing
directory while the web interface is live on the very same directory?
I haven't found any reference to how people actually set up a
search engine using Nutch and keep it running and fresh. There are only
descriptions like this: 1) crawl with 'nutch crawl' or use the whole web
crawl 2) copy nutch-x.y.x.war to webapps/ROOT.war and go. Where is step
3) - how to keep the index up to date? I mean, do people run the crawler
in a cron job, or keep it running slowly all the time, or something else?
I'm specifically crawling an intranet.
>> Things work well with html, msword and pdf, but I'd like to index
>> zip-archives, tar.gz archives, rpm-files and openoffice documents as
>> well. Are plugins for these file types available?
> I think zip plugin is already committed in trunk but I am not sure if
> it is a part of nutch 0.7.1 distribution. There is a JIRA issue about
> openoffice docs - it is not committed yet but there are chances it
> will be in some time. For tar.gz and rpm plugins - I never heard of
> such attempts - you can try to write your own.
Okay - I see there is a parse-ext plugin, which I might use to index the
what-provides, changelog and info output from an rpm - the simple, easily
accessible metadata - without having to extract the contents of the rpm.
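As a rough sketch of what such an external helper might look like: the script below collects the three kinds of metadata as plain text by calling the rpm command-line tool. It assumes "what-provides" means the capabilities a package file provides (rpm's --provides query), and the function names are hypothetical. How the helper would be wired into parse-ext is not shown and would have to be checked against the plugin itself.

```python
import subprocess
from typing import List


def rpm_query_commands(rpm_path: str) -> List[List[str]]:
    """Return the rpm invocations for provides, changelog and info.

    -q queries, -p reads an uninstalled package file, so nothing is
    extracted or installed.
    """
    return [
        ["rpm", "-qp", "--provides", rpm_path],
        ["rpm", "-qp", "--changelog", rpm_path],
        ["rpm", "-qp", "--info", rpm_path],
    ]


def rpm_metadata_text(rpm_path: str) -> str:
    """Run each query and join the output into one indexable text blob."""
    parts = []
    for cmd in rpm_query_commands(rpm_path):
        # check=True raises CalledProcessError if rpm rejects the file.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        parts.append(result.stdout)
    return "\n".join(parts)
```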
Thanks for your help so far!
Thomas Sondergaard
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general