Is truncating content not an option? IIRC, parsing is skipped for truncated
documents by default.
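
Roughly what I have in mind, as a sketch for conf/nutch-site.xml. The property names are from memory, so please check them against the nutch-default.xml of your version. The values are only illustrative assumptions, not recommendations.

```xml
<!-- Sketch of overrides for conf/nutch-site.xml; values are illustrative. -->
<configuration>
  <!-- Truncate anything fetched over HTTP beyond this many bytes,
       so large PDFs are not downloaded in full. -->
  <property>
    <name>http.content.limit</name>
    <value>65536</value>
  </property>
  <!-- Skip parsing of documents that were truncated by the limit above. -->
  <property>
    <name>parser.skip.truncated</name>
    <value>true</value>
  </property>
  <!-- Cap the number of URLs per host in each generated fetch list,
       which bounds a cycle even without topN. -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>
</configuration>
```

With a content limit in place, the PDF repository should cost at most the limit per document in each cycle, and the per-host cap keeps one site from dominating a fetch list.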



On Fri, Feb 8, 2013 at 4:18 PM, Eyeris Rodriguez Rueda <eru...@uci.cu> wrote:

> I think I know what the problem was: one URL contains a repository of PDF
> documents, and Nutch keeps getting delayed on that domain. I am running the
> crawl without the topN parameter, so Nutch was trying to fetch every PDF on
> that site.
> Is it possible to configure Nutch to crawl without topN and still restrict
> the number of URLs fetched? I am thinking of fetching in blocks per cycle to
> limit the amount of space used in /tmp.
> That would be great, because if Nutch finds a collection of PDFs bigger than
> our hard disk, it will fail.
>
>
