Hi Bob,

> I am not seeing the http status codes though??

Sorry, yes, you're right. The headers are recorded, but parsechecker
does not print them if the fetch fails.

The server responds with a "400 Bad Request" if the user-agent string
contains "nutch". This is reproducible with:
  wget --header 'User-Agent: nutch' -d https://www.avalonpontoons.com/
  ...
  ---response begin---
  HTTP/1.1 400 Bad Request
  ...

You could set the user-agent string:

 bin/nutch parsechecker \
   -Dhttp.agent.name=somethingelse \
   -Dhttp.agent.version='' ...

and this site should work. It is recommended, though, to send a meaningful user-agent string.
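
For regular crawls, the same properties can be set persistently in
conf/nutch-site.xml instead of passing -D options on every invocation.
A minimal sketch (the agent name "MyCrawler" is just a placeholder, use
something identifying your crawler):

```xml
<configuration>
  <!-- Identifies the crawler in the HTTP User-Agent header;
       some sites block requests containing "nutch". -->
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <!-- Leave the version empty if you don't want it appended. -->
  <property>
    <name>http.agent.version</name>
    <value></value>
  </property>
</configuration>
```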

Best,
Sebastian


On 12/17/19 9:43 PM, Robert Scavilla wrote:
> Thank you Sebastian. I added the run-time parameters and the output is
> identical. I am not seeing the http status codes though??
> 
> The log file shows:
> 
> 2019-12-17 15:37:36,602 INFO  parse.ParserChecker - fetching:
> https://www.avalonpontoons.com/
> 2019-12-17 15:37:36,872 INFO  protocol.RobotRulesParser - robots.txt
> whitelist not configured.
> 2019-12-17 15:37:36,872 INFO  http.Http - http.proxy.host = null
> 2019-12-17 15:37:36,872 INFO  http.Http - http.proxy.port = 8080
> 2019-12-17 15:37:36,873 INFO  http.Http - http.proxy.exception.list = false
> 2019-12-17 15:37:36,873 INFO  http.Http - http.timeout = 10000
> 2019-12-17 15:37:36,873 INFO  http.Http - http.content.limit = -1
> 2019-12-17 15:37:36,873 INFO  http.Http - http.agent = FFDevBot/Nutch-1.14 (
> fourfront.us)
> 2019-12-17 15:37:36,873 INFO  http.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2019-12-17 15:37:36,873 INFO  http.Http - http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2019-12-17 15:37:36,873 INFO  http.Http - http.enable.cookie.header = true
> 
> the command line shows:
>> $NUTCHl/bin/nutch parsechecker     -Dstore.http.headers=true
> -Dstore.http.request=true https://www.avalonpontoons.com/
> fetching: https://www.avalonpontoons.com/
> robots.txt whitelist not configured.
> Fetch failed with protocol status: gone(11), lastModified=0:
> https://www.avalonpontoons.com/
> 
> 
> On Tue, Dec 17, 2019 at 11:53 AM Sebastian Nagel
> <[email protected]> wrote:
> 
>> Hi Bob,
>>
>> the relevant Javadoc comment stands before the declaration of a variable
>> (here a constant):
>>   /** Resource is gone. */
>>   public static final int GONE = 11;
>>
>> In more detail, GONE results from one of the following HTTP status codes:
>>  400 Bad Request
>>  401 Unauthorized
>>  410 Gone   (gone *forever*, as opposed to 404 Not Found)
>> See
>> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
>>
>> My guess would be that "www.sitename.com" requires authentication.
>>
>> Just repeat the request as
>>  bin/nutch parsechecker \
>>     -Dstore.http.headers=true \
>>     -Dstore.http.request=true \
>>     ... <url>
>>
>> (I guess you're already using parsechecker or indexchecker)
>> This will show the HTTP headers where you'll find the exact HTTP status
>> code.
>>
>> Best,
>> Sebastian
>>
>>
>>
>> On 12/17/19 4:36 PM, Robert Scavilla wrote:
>>> Hi again, and thanks in advance for your kind help.
>>>
>>> Nutch 1.14
>>>
>>> I am getting the following error message when crawling a site:
>>> *Fetch failed with protocol status: gone(11), lastModified=0:
>>> https://www.sitename.com <https://www.sitename.com>*
>>>
>>> The only documentation I can find says:
>>>
>>>> public static final int GONE = 11;
>>>> /** Resource has moved permanently. New url should be found in args. */
>>>>
>>> I'm not sure what this means. When I load the page in my browser it shows
>>> status codes 200 or 304 for all resources.
>>>
>>> The problem only exists on a single site - other sites crawl fine.
>>>
>>> I saved a page from the site locally and that page fetches successfully.
>>>
>>> Can you please steer me in the right direction? Many thanks,
>>> ...bob
>>>
>>
>>
> 