Hi,
I am now using Nutch trunk (1.8-SNAPSHOT, HEAD) and am back at this tonight.
When attempting to fetch

file://home/law/Downloads/asf/solr-4.3.1/example/e001 (notice two slashes)

which contains loads of HTML files, I get the error below.


Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
fetch of file://home/law/Downloads/asf/solr-4.3.1/example/e001 failed with:
org.apache.nutch.protocol.file.FileError: File Error: 404
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-08-06 18:59:00, elapsed: 00:00:02

I then deleted the crawldb and changed the seed URL to

file:/home/law/Downloads/asf/solr-4.3.1/example/e001 (notice one slash)
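
As a side note on the slash count, here is a minimal sketch of how the two
seed forms split into host and path under plain java.net.URI. The class and
method names are hypothetical, and I am only assuming (not confirming) that
the Nutch file protocol plugin reads the path in a comparable way.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: shows how java.net.URI splits
// the two seed URL forms into host and path components.
public class FileUrlParts {

    // Returns the host component, or "" when the URI has no authority.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h;
    }

    // Returns the path component of the URI.
    static String path(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String[] seeds = {
            "file://home/law/Downloads/asf/solr-4.3.1/example/e001",
            "file:/home/law/Downloads/asf/solr-4.3.1/example/e001"
        };
        for (String seed : seeds) {
            System.out.println(seed);
            System.out.println("  host = '" + host(seed) + "'");
            System.out.println("  path = '" + path(seed) + "'");
        }
    }
}
```

With two slashes, "home" is parsed as the host and the path starts at /law;
with one slash there is no host and the path starts at /home.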

But when I eventually get to fetching after a few rounds of generate, fetch,
parse, updatedb, I end up with

fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
(queue crawl delay=500ms)
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
(queue crawl delay=500ms)
org.apache.nutch.protocol.file.FileError: File Error: 404
    at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

Same as before... this happens with every single URL in the directory I am
trying to crawl.
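
One thing I notice in the log above is that the failing URLs sit directly
under .../example/ rather than under .../example/e001/. The sketch below
shows how plain java.net.URI resolves a relative file name against the seed
with and without a trailing slash. The class and method names are
hypothetical, and it is only an assumption on my part that the outlinks from
the directory are relative names resolved in this way.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: resolves a relative link against
// a base URL using standard URI reference resolution.
public class RelativeResolve {

    // Resolves 'relative' against 'base' and returns the result as a string.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String seed = "file:/home/law/Downloads/asf/solr-4.3.1/example/e001";

        // No trailing slash: the last path segment (e001) is replaced.
        System.out.println(resolve(seed, "5428_03.html"));

        // Trailing slash: the file name is appended under e001/.
        System.out.println(resolve(seed + "/", "5428_03.html"));
    }
}
```

Without the trailing slash the resolved URL loses the e001 segment, which
matches the shape of the URLs in the failing fetches.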

Any advice here please?
Thanks
Lewis

-- 
*Lewis*
