Hi Bob,

> I am not seeing the http status codes though??
Sorry, yes, you're right. The headers are recorded, but parsechecker does not
print them if the fetch fails. The server responds with "400 Bad Request" if
the user-agent string contains "nutch", reproducible by:

  wget --header 'User-Agent: nutch' -d https://www.avalonpontoons.com/
  ...
  ---response begin---
  HTTP/1.1 400 Bad Request
  ...

You could set the user-agent string:

  bin/nutch parsechecker \
    -Dhttp.agent.name=somethingelse \
    -Dhttp.agent.version='' \
    ...

and this site should work. It is recommended to send a meaningful user-agent
string in any case.

Best,
Sebastian

On 12/17/19 9:43 PM, Robert Scavilla wrote:
> Thank you Sebastian. I added the run-time parameters and the output is
> identical. I am not seeing the http status codes though??
>
> The log file shows:
>
> 2019-12-17 15:37:36,602 INFO parse.ParserChecker - fetching: https://www.avalonpontoons.com/
> 2019-12-17 15:37:36,872 INFO protocol.RobotRulesParser - robots.txt whitelist not configured.
> 2019-12-17 15:37:36,872 INFO http.Http - http.proxy.host = null
> 2019-12-17 15:37:36,872 INFO http.Http - http.proxy.port = 8080
> 2019-12-17 15:37:36,873 INFO http.Http - http.proxy.exception.list = false
> 2019-12-17 15:37:36,873 INFO http.Http - http.timeout = 10000
> 2019-12-17 15:37:36,873 INFO http.Http - http.content.limit = -1
> 2019-12-17 15:37:36,873 INFO http.Http - http.agent = FFDevBot/Nutch-1.14 (fourfront.us)
> 2019-12-17 15:37:36,873 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2019-12-17 15:37:36,873 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2019-12-17 15:37:36,873 INFO http.Http - http.enable.cookie.header = true
>
> the command line shows:
>
>> $NUTCH/bin/nutch parsechecker -Dstore.http.headers=true -Dstore.http.request=true https://www.avalonpontoons.com/
>
> fetching: https://www.avalonpontoons.com/
> robots.txt whitelist not configured.
> Fetch failed with protocol status: gone(11), lastModified=0:
> https://www.avalonpontoons.com/
>
> On Tue, Dec 17, 2019 at 11:53 AM Sebastian Nagel
> <[email protected]> wrote:
>
>> Hi Bob,
>>
>> the relevant Javadoc comment stands before the declaration of a variable
>> (here a constant):
>>
>>   /** Resource is gone. */
>>   public static final int GONE = 11;
>>
>> In more detail, GONE results from one of the following HTTP status codes:
>>   400 Bad Request
>>   401 Unauthorized
>>   410 Gone (*forever* gone, as opposed to 404 Not Found)
>> See
>> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
>>
>> My guess would be that "www.sitename.com" requires authentication.
>>
>> Just repeat the request as
>>
>>   bin/nutch parsechecker \
>>     -Dstore.http.headers=true \
>>     -Dstore.http.request=true \
>>     ... <url>
>>
>> (I guess you're already using parsechecker or indexchecker.)
>> This will show the HTTP headers, where you'll find the exact HTTP status
>> code.
>>
>> Best,
>> Sebastian
>>
>> On 12/17/19 4:36 PM, Robert Scavilla wrote:
>>> Hi again, and thanks in advance for your kind help.
>>>
>>> Nutch 1.14
>>>
>>> I am getting the following error message when crawling a site:
>>> *Fetch failed with protocol status: gone(11), lastModified=0:
>>> https://www.sitename.com*
>>>
>>> The only documentation I can find says:
>>>
>>>> public static final int GONE = 11;
>>>> /** Resource has moved permanently. New url should be found in args. */
>>>
>>> I'm not sure what this means. When I load the page in my browser it shows
>>> status codes 200 or 304 for all resources.
>>>
>>> The problem only exists on a single site - other sites crawl fine.
>>>
>>> I saved a page from the site locally and that page fetches successfully.
>>>
>>> Can you please steer me in the right direction. Many thanks,
>>> ...bob
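[Editor's note] The -D properties Sebastian shows only apply to that single
parsechecker run; note that Robert's log shows http.agent = FFDevBot/Nutch-1.14,
where the default http.agent.version ("Nutch-1.14") itself contains "nutch". To make
the change permanent, the same two properties can be set in conf/nutch-site.xml — a
hedged sketch, where "somethingelse" is just the placeholder value from the thread:

```xml
<!-- conf/nutch-site.xml fragment (sketch): pick an agent name that does
     not contain "nutch", and clear http.agent.version, whose default
     "Nutch-<version>" would otherwise re-introduce the blocked token. -->
<property>
  <name>http.agent.name</name>
  <value>somethingelse</value>
</property>
<property>
  <name>http.agent.version</name>
  <value></value>
</property>
```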
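[Editor's note] Sebastian's explanation of how several distinct HTTP status codes all
collapse into Nutch's single protocol status gone(11) can be sketched as a small shell
function. This is a simplified illustration, not the actual Nutch source (which lives
in HttpBase.java); the success(1) value is an assumption about ProtocolStatus, and
anything not named in the thread is lumped into "other":

```shell
#!/bin/sh
# Sketch (not actual Nutch code): how lib-http's HttpBase maps the HTTP
# response code onto Nutch's coarse protocol status. Per the thread,
# 400, 401 and 410 all end up as GONE(11), which is why a "400 Bad
# Request" surfaces as "gone(11)" in the fetch failure message.
to_protocol_status() {
  case "$1" in
    400|401|410) echo "gone(11)" ;;        # all three collapse into GONE
    200)         echo "success(1)" ;;      # assumption: SUCCESS = 1 in ProtocolStatus
    *)           echo "other" ;;           # not covered by the thread
  esac
}

to_protocol_status 400   # what www.avalonpontoons.com returns for a "nutch" UA
to_protocol_status 410
```

Because the mapping is many-to-one, only the stored HTTP headers
(-Dstore.http.headers=true) reveal which of the three codes actually occurred.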

