[Nutch-general] Indexing the file system / best approach

Bruno Thiel Tue, 17 Oct 2006 17:22:18 -0700

All,

I want to get Nutch to index a file system. My first approach was to
NFS-mount the file system and let Nutch crawl through the hierarchy over
HTTP/Apache. This turned out to be fairly slow, at ~3,000 fetches per hour.
My next approach was to go via file:/// and generate a list of files
to be crawled. This file list is fairly big (~200,000 entries), and with the
current 0.8.1 release of Nutch the fetcher just freezes right at the end of
a crawl. Splitting the file list into smaller parts of ~20,000 entries and
subsequently merging the indexes still fails for the same reason.
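
For reference, the file-list generation and splitting step described above can be sketched roughly as follows. This is only an illustration, not the exact script used: the function names, the `urls-NNNN.txt` naming scheme, and the output directory layout are all assumptions made for the example, and it does nothing about the fetcher freeze itself.

```python
import os
from pathlib import Path


def iter_file_urls(root):
    """Yield a file:// URL for every file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # resolve() makes the path absolute, which as_uri() requires;
            # as_uri() also percent-encodes spaces and similar characters.
            yield Path(dirpath, name).resolve().as_uri()


def _write_chunk(out_dir, index, chunk):
    """Write one chunk of URLs, one per line (hypothetical naming scheme)."""
    path = out_dir / f"urls-{index:04d}.txt"
    path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
    return path


def write_seed_chunks(urls, out_dir, chunk_size=20000):
    """Split urls into files of at most chunk_size entries; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    chunk = []
    for url in urls:
        chunk.append(url)
        if len(chunk) == chunk_size:
            written.append(_write_chunk(out, len(written), chunk))
            chunk = []
    if chunk:  # leftover entries that did not fill a whole chunk
        written.append(_write_chunk(out, len(written), chunk))
    return written
```

Calling `write_seed_chunks(iter_file_urls("/some/mount"), "seeds")` would then leave one URL list per ~20,000 files in `seeds/`, each of which could be crawled separately before merging the indexes (both paths here are placeholders).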


Is anybody in the community doing an extensive crawl through the file
system with Nutch? If so, what's your setup?

Cheers, Bruno


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
