[Nutch-general] Re: is the nutch shell script only used for initial crawling

Piotr Kosiorowski Mon, 02 Jan 2006 13:04:12 -0800

Thomas Sondergaard wrote:

Hi,
I've installed Nutch on my machine and convinced it to crawl our intranet, ie the local NFS and samba shares via the local filesystem and our local intranet web servers and I'm quite impressed with how well it works. One thing I'm not sure about though, is how the index is kept up to date. Is the "nutch crawl" command only used for creating the initial index/db? What do I need to do to keep the index/db up to date?

To keep the index up to date you can simply run the crawl again from scratch or, if that takes too much time, you can use the 'Whole web crawling' method described in the tutorial to update it incrementally.
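
As a rough illustration of one incremental round, here is a minimal Python sketch that drives the nutch script. It assumes the subcommand names and argument order from the 'Whole web crawling' section of the 0.7 tutorial (generate, fetch, updatedb, index), so check the usage output of bin/nutch for your version before relying on it. The helper names (newest_segment, refresh_round) and the default paths "db" and "segments" are made up for this example and are not part of Nutch.

```python
import subprocess
from pathlib import Path


def newest_segment(segments_dir):
    """Return the lexicographically last subdirectory of segments_dir.

    Assumption: segment directories are named by timestamp, so the last
    name in sorted order is the most recently generated segment.
    """
    dirs = sorted(p for p in Path(segments_dir).iterdir() if p.is_dir())
    if not dirs:
        raise FileNotFoundError(f"no segments under {segments_dir}")
    return dirs[-1]


def refresh_round(nutch="bin/nutch", db="db", segments_dir="segments",
                  run=subprocess.check_call):
    """Run one generate/fetch/updatedb/index round and return the commands issued.

    The subcommands are assumed from the 0.7 tutorial; `run` is injectable
    so the sequence can be inspected without a Nutch installation.
    """
    issued = []

    def step(*args):
        cmd = [nutch, *map(str, args)]
        issued.append(cmd)
        run(cmd)

    step("generate", db, segments_dir)   # create a new segment with a fetch list
    seg = newest_segment(segments_dir)   # the segment that generate just created
    step("fetch", seg)                   # fetch the pages listed in that segment
    step("updatedb", db, seg)            # feed the fetch results back into the db
    step("index", seg)                   # index the freshly fetched segment
    return issued
```

Calling refresh_round() from a scheduled job in the Nutch directory would repeat this round periodically. The tutorial's duplicate-removal step is left out here and would still need to be run before searching.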

Things work well with html, msword and pdf, but I'd like to index zip-archives, tar.gz archives, rpm-files and openoffice documents as well. Are plugins for these file types available?

I think the zip plugin is already committed in trunk, but I am not sure whether it is part of the nutch 0.7.1 distribution. There is a JIRA issue about openoffice docs; it is not committed yet, but there is a chance it will be at some point. As for tar.gz and rpm plugins, I have never heard of any such attempts, so you could try to write your own.

Regards
Piotr

Regards,

Thomas Sondergaard




