Is truncating content not a possibility? By default, parsing is skipped for truncated docs IIRC.
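
For reference, a minimal sketch of what that could look like in conf/nutch-site.xml. It assumes the PDFs are fetched over HTTP, and the 1 MB limit is only an illustrative value, not a recommendation; check nutch-default.xml in your version for the actual defaults and descriptions.

```xml
<!-- conf/nutch-site.xml (fragment): values below are illustrative assumptions -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Maximum number of bytes downloaded per document; larger content is truncated -->
    <value>1048576</value>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <!-- Skip parsing of documents that were truncated by the limit above -->
    <value>true</value>
  </property>
</configuration>
```

With a limit like this, large PDFs would only be partially downloaded and, if truncated documents are skipped, never parsed, which should keep the fetched segments from growing without bound.
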
On Fri, Feb 8, 2013 at 4:18 PM, Eyeris Rodriguez Rueda <eru...@uci.cu> wrote:

> I have an idea of what the problem was: there is a URL that contains a
> repository of PDF documents, and Nutch keeps getting delayed on this domain.
> I am running the crawl without the topN parameter, and for that reason Nutch
> was trying to fetch all the PDFs on that site.
> Is it possible to configure Nutch to crawl without topN and still restrict
> the number of URLs fetched? I am thinking of fetching one block per cycle to
> limit the amount of space used in /tmp.
> That would be great, because if Nutch finds a collection of PDFs bigger than
> our hard disk, it will fail.