Hi Lewis,
Can you try the patch attached to this issue:
https://issues.apache.org/jira/browse/NUTCH-1483
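
A side note on the two-slash seed quoted below, independent of the patch: in a URL of the form file://home/..., the segment right after the two slashes is parsed as the host, not as the first directory. The sketch below only shows how java.net.URI splits the two seed forms. The class and method names are made up for illustration, and it is an assumption (not checked against the plugin source here) that protocol-file resolves only the path component. It does not explain the 404s seen with the one-slash form.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: shows how the two seed forms are split.
public class FileUrlParts {

    // Host (authority) component of the URL, or "" when the URL has none.
    static String host(String spec) {
        String h = URI.create(spec).getHost();
        return h == null ? "" : h;
    }

    // Path component of the URL, i.e. the part that could map to a local file.
    static String path(String spec) {
        return URI.create(spec).getPath();
    }

    public static void main(String[] args) {
        String[] seeds = {
            "file://home/law/Downloads/asf/solr-4.3.1/example/e001", // two slashes
            "file:/home/law/Downloads/asf/solr-4.3.1/example/e001"   // one slash
        };
        for (String seed : seeds) {
            System.out.println(seed + " -> host='" + host(seed)
                    + "', path='" + path(seed) + "'");
        }
    }
}
```

With two slashes the host comes out as "home" and the path starts at /law/..., so a lookup by path alone would miss the intended directory. With one slash the path is the full /home/law/... location.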

Thanks,
Tejas


On Tue, Aug 6, 2013 at 7:24 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi,
> I am now using Nutch trunk (1.8-SNAPSHOT, HEAD) and was back at this
> tonight. When attempting to fetch
>
> file://home/law/Downloads/asf/solr-4.3.1/example/e001 (notice two slashes)
>
> which contains loads of HTML files, I get the error as below.
>
>
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=1
> org.apache.nutch.protocol.file.FileError: File Error: 404
>     at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
> fetch of file://home/law/Downloads/asf/solr-4.3.1/example/e001 failed with:
> org.apache.nutch.protocol.file.FileError: File Error: 404
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2013-08-06 18:59:00, elapsed: 00:00:02
>
> I then deleted the crawldb and changed the seed URL to
>
> file:/home/law/Downloads/asf/solr-4.3.1/example/e001 (notice one slash)
>
> But when fetching eventually starts after a few rounds of generate, fetch,
> parse, updatedb, I end up with
>
> fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
> (queue crawl delay=500ms)
> org.apache.nutch.protocol.file.FileError: File Error: 404
>     at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
> fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
> failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
> fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
> (queue crawl delay=500ms)
> org.apache.nutch.protocol.file.FileError: File Error: 404
>     at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
> fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
> failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
>
> Same as before... this happens with every single URL in the directory I am
> trying to crawl.
>
> Any advice here please?
> Thanks
> Lewis
>
> --
> *Lewis*
>
