Thank you, Sebastian. I added the run-time parameters, but the output is
identical: I still do not see the HTTP status codes.

The log file shows:

2019-12-17 15:37:36,602 INFO  parse.ParserChecker - fetching:
https://www.avalonpontoons.com/
2019-12-17 15:37:36,872 INFO  protocol.RobotRulesParser - robots.txt
whitelist not configured.
2019-12-17 15:37:36,872 INFO  http.Http - http.proxy.host = null
2019-12-17 15:37:36,872 INFO  http.Http - http.proxy.port = 8080
2019-12-17 15:37:36,873 INFO  http.Http - http.proxy.exception.list = false
2019-12-17 15:37:36,873 INFO  http.Http - http.timeout = 10000
2019-12-17 15:37:36,873 INFO  http.Http - http.content.limit = -1
2019-12-17 15:37:36,873 INFO  http.Http - http.agent = FFDevBot/Nutch-1.14 (
fourfront.us)
2019-12-17 15:37:36,873 INFO  http.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2019-12-17 15:37:36,873 INFO  http.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2019-12-17 15:37:36,873 INFO  http.Http - http.enable.cookie.header = true

The command line shows:
>$NUTCHl/bin/nutch parsechecker     -Dstore.http.headers=true
-Dstore.http.request=true https://www.avalonpontoons.com/
fetching: https://www.avalonpontoons.com/
robots.txt whitelist not configured.
Fetch failed with protocol status: gone(11), lastModified=0:
https://www.avalonpontoons.com/
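
One way to see the raw status code independently of Nutch is to repeat
the request with the same User-Agent, Accept, and timeout values that the
log above reports. The following is only a minimal sketch: the class name
StatusProbe and the method prepare are hypothetical, the agent string is
reassembled from the wrapped log line, and java.net.HttpURLConnection is
not the HTTP client Nutch uses, so the server may still respond
differently to it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical stand-alone probe; not part of Nutch. */
public class StatusProbe {
    // Values copied from the http.Http log lines in this message.
    static final String AGENT = "FFDevBot/Nutch-1.14 (fourfront.us)";
    static final String ACCEPT =
        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    static final String ACCEPT_LANGUAGE = "en-us,en-gb,en;q=0.7,*;q=0.3";
    static final int TIMEOUT_MS = 10000;

    /** Builds a GET request with crawler-like headers; does not connect yet. */
    static HttpURLConnection prepare(String url) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            // Do not follow redirects, so a 3xx answer is reported as such.
            conn.setInstanceFollowRedirects(false);
            conn.setConnectTimeout(TIMEOUT_MS);
            conn.setReadTimeout(TIMEOUT_MS);
            conn.setRequestProperty("User-Agent", AGENT);
            conn.setRequestProperty("Accept", ACCEPT);
            conn.setRequestProperty("Accept-Language", ACCEPT_LANGUAGE);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "https://www.avalonpontoons.com/";
        HttpURLConnection conn = prepare(url);
        try {
            // getResponseCode() sends the request and returns the HTTP status.
            System.out.println(conn.getResponseCode() + " " + conn.getResponseMessage());
        } finally {
            conn.disconnect();
        }
    }
}
```

If this prints 400, 401, or 410 with the crawler's agent string while a
browser gets 200, the server is probably treating the two clients
differently.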


On Tue, Dec 17, 2019 at 11:53 AM Sebastian Nagel
<[email protected]> wrote:

> Hi Bob,
>
> the relevant Javadoc comment stands before the declaration of a variable
> (here a constant):
>   /** Resource is gone. */
>   public static final int GONE = 11;
>
> In more detail, GONE results from one of the following HTTP status codes:
>  400 Bad request
>  401 Unauthorized
>  410 Gone   (*forever* gone, as opposed to 404 Not Found)
> See
> src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
>
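
The mapping described in the quoted lines above can be sketched roughly
as follows. This is an illustration only, not the code in HttpBase.java:
the class name, the method name, and the OTHER placeholder are
hypothetical, and the three status codes are simply the ones listed in
the reply, not checked against the Nutch source.

```java
/** Hypothetical illustration; see HttpBase.java for the real logic. */
public class GoneStatusSketch {
    // Value of the GONE constant as quoted in this thread.
    static final int GONE = 11;
    // Placeholder for every other outcome; not a real Nutch constant.
    static final int OTHER = -1;

    /** Returns GONE for the HTTP codes listed in the reply, OTHER otherwise. */
    static int toProtocolStatus(int httpCode) {
        switch (httpCode) {
            case 400: // Bad request
            case 401: // Unauthorized
            case 410: // Gone (permanently, unlike 404 Not Found)
                return GONE;
            default:
                return OTHER;
        }
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 400, 401, 404, 410}) {
            String label = toProtocolStatus(code) == GONE ? "gone(11)" : "other";
            System.out.println(code + " -> " + label);
        }
    }
}
```

So "gone(11)" on its own does not say which of these codes the server
sent; the exact code has to come from the response headers.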
> My guess would be that "www.sitename.com" requires authentication.
>
> Just repeat the request as
>  bin/nutch parsechecker \
>     -Dstore.http.headers=true \
>     -Dstore.http.request=true \
>     ... <url>
>
> (I guess you're already using parsechecker or indexchecker)
> This will show the HTTP headers where you'll find the exact HTTP status
> code.
>
> Best,
> Sebastian
>
>
>
> On 12/17/19 4:36 PM, Robert Scavilla wrote:
> > Hi again, and thanks in advance for your kind help.
> >
> > Nutch 1.14
> >
> > I am getting the following error message when crawling a site:
> > *Fetch failed with protocol status: gone(11), lastModified=0:
> > https://www.sitename.com*
> >
> > The only documentation I can find says:
> >
> >> public static final int GONE = 11;
> >> /** Resource has moved permanently. New url should be found in args. */
> >>
> > I'm not sure what this means. When I load the page in my browser it shows
> > status codes 200 or 304 for all resources.
> >
> > The problem only exists on a single site - other sites crawl fine.
> >
> > I saved a page from the site locally and that page fetches successfully.
> >
> > Can you please steer me in the right direction? Many thanks,
> > ...bob
> >
>
>
