Hi

> I am trying to find a combination of the best settings for topN and depth
> for running the crawl script on a very large internal filesystem.
>
> I have tried setting the depth to a very high number (1000), but I fail to
> complete the crawl. The main reason for this is the number of "bad"
> PowerPoint and PDF files that we have. Some of the PDF files are causing
> the script to hang and consume all the memory on the machine.
>

Try the patch in https://issues.apache.org/jira/browse/NUTCH-696 against the
SVN trunk; it should definitely help.
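
In case it helps, here is roughly what the relevant bit of conf/nutch-site.xml
could look like once the patch is applied. I am assuming it exposes a parser
timeout property along the lines of parser.timeout; check nutch-default.xml in
your patched checkout for the exact name and default:

  <property>
    <name>parser.timeout</name>
    <!-- abort parsing of a single document after this many seconds,
         so one bad PDF/PowerPoint file cannot hang the whole crawl -->
    <value>30</value>
  </property>

With something like that in place a hanging PDF should just be recorded as a
parse failure and the crawl should move on to the next document.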


>
> Once I find where those bad files are (I wish the script would exit from
> that condition more cleanly, though) and remove them, I am wondering what
> those parameters really should be.
>
> I also plan to add other sources to my crawl once I have completed my
> filesystem crawling. Most of them are Oracle databases, websites, etc.,
> that I also need to crawl, but I can't add them until I get this large
> shared filesystem completely crawled.
>
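On topN and depth themselves: with the one-shot crawl command they are just
the per-round URL cap and the number of fetch/parse rounds, e.g. (values
purely illustrative, not a recommendation):

  bin/nutch crawl urls -dir crawl -depth 10 -topN 100000 -threads 20

There is no single right combination for a large filesystem: depth only needs
to be roughly as large as your deepest directory nesting, and topN bounds how
many URLs are fetched in each round, so a smaller topN with more rounds tends
to be easier on memory than one huge round.
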
HTH

Julien

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
