> Hi, I am trying to find the best combination of settings for topN and
> depth for running the crawl script on a very large internal filesystem.
>
> I have tried setting the depth to a very high number (1000), but I fail
> to complete the crawl. The main reason for this is the number of "bad"
> PowerPoint and PDF files that we have. Some of the PDF files are causing
> the script to hang and consume all the memory on the machine.

Try the patch in https://issues.apache.org/jira/browse/NUTCH-696 against
the SVN trunk; it should definitely help.
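The patch runs each parse in a separate thread with a timeout, so a single
bad PDF can no longer hang the whole crawl. Once it is applied, the timeout
is set in nutch-site.xml along these lines (property name and default as in
the NUTCH-696 patch; adjust the value to taste):

    <!-- nutch-site.xml: give up on parsing any single document
         after 30 seconds; set to -1 to disable the timeout -->
    <property>
      <name>parser.timeout</name>
      <value>30</value>
    </property>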
> Once I resolve where those bad files are (I wish the script would exit
> from that condition more cleanly, though) and remove them, I am
> wondering what those parameters really should be?
>
> I also plan to add other sources to my crawl once I have completed my
> filesystem crawling. Most of them are Oracle databases, websites, etc.
> that I need to crawl, but I can't start on those until this large shared
> filesystem is completely crawled.
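As for depth and topN: they are simply the arguments passed to the crawl
command, so once the bad documents no longer block you, you can tune them
per run. For example (illustrative values for a Nutch 1.x style
invocation):

    # depth bounds the number of generate/fetch/parse rounds;
    # topN caps the URLs fetched in each round (values illustrative)
    bin/nutch crawl urls -dir crawl -depth 10 -topN 50000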
HTH

Julien

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
