Hi,
I am now using Nutch trunk (1.8-SNAPSHOT, HEAD) and am back at this tonight.
When attempting to fetch
file://home/law/Downloads/asf/solr-4.3.1/example/e001 (notice the two slashes),
a directory which contains loads of HTML files, I get the error below.
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
org.apache.nutch.protocol.file.FileError: File Error: 404
at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
fetch of file://home/law/Downloads/asf/solr-4.3.1/example/e001 failed with:
org.apache.nutch.protocol.file.FileError: File Error: 404
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2013-08-06 18:59:00, elapsed: 00:00:02
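For reference, here is a minimal sketch of how standard URI parsing splits
the two-slash form compared with a one-slash form. It uses plain java.net.URI
rather than anything from Nutch; the class and method names are made up for
illustration, and I am only assuming that the file protocol plugin looks the
file up by the path part.

```java
import java.net.URI;

public class FileUrlParts {
    // Returns "host|path" as java.net.URI parses them.
    // A missing host (null) is shown as an empty string.
    static String parts(String spec) {
        URI u = URI.create(spec);
        String host = (u.getHost() == null) ? "" : u.getHost();
        return host + "|" + u.getPath();
    }

    public static void main(String[] args) {
        // Two slashes: the first segment after "//" is parsed as the host.
        System.out.println(parts("file://home/law/Downloads/asf/solr-4.3.1/example/e001"));
        // One slash: no host, the whole thing is the path.
        System.out.println(parts("file:/home/law/Downloads/asf/solr-4.3.1/example/e001"));
    }
}
```

With two slashes, "home" is read as the host and the path starts at /law, so
a lookup by path alone would not find the directory.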
I then deleted the crawldb and changed the seed URL to
file:/home/law/Downloads/asf/solr-4.3.1/example/e001 (notice the one slash).
But when fetching eventually starts after a few rounds of generate, fetch,
parse and updatedb, I end up with:
fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
(queue crawl delay=500ms)
org.apache.nutch.protocol.file.FileError: File Error: 404
at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5428_03.html
failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
fetching file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
(queue crawl delay=500ms)
org.apache.nutch.protocol.file.FileError: File Error: 404
at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:118)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:703)
fetch of file:/home/law/Downloads/asf/solr-4.3.1/example/5094_08.html
failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
Same as before. This happens with every single URL in the directory I am
trying to crawl.
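One thing I notice in the log is that the failing URLs have no e001 segment.
Below is a minimal sketch, again with plain java.net.URI and made-up names, of
how a relative link resolves against a base with and without a trailing
slash. Whether Nutch builds the outlinks for a directory listing the same way
is an assumption on my part.

```java
import java.net.URI;

public class RelativeResolve {
    // Resolves a relative link against a base URI using standard
    // RFC 3986 reference resolution, as implemented by java.net.URI.
    static String resolve(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        String dir = "file:/home/law/Downloads/asf/solr-4.3.1/example/e001";
        // No trailing slash: the last segment (e001) is replaced by the link.
        System.out.println(resolve(dir, "5428_03.html"));
        // Trailing slash: the link is appended below e001/.
        System.out.println(resolve(dir + "/", "5428_03.html"));
    }
}
```

Without the trailing slash the result matches the URLs that return 404 above,
so the seed may need to end in a slash.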
Any advice here, please?
Thanks
Lewis