I was going to inject a single simple page that redirects to another
page, with zero links on the pages themselves, and then fetch/update it
against a clean crawldb. After that I will check what the dump output
of the crawldb looks like once it has been updated.
If it works the way I think it does then there should be mult
Dennis,
> I will need to look deeper but I think there is a subtle logic bug in
> Fetcher.
>
> Redirect statuses, both temp and perm, get output by the fetchers, even
> when redirecting immediately, so if you have multiple redirects you
> would have multiple outputs in the crawl_fetch output fr
I will need to look deeper but I think there is a subtle logic bug in
Fetcher.
Redirect statuses, both temp and perm, get output by the fetchers, even
when redirecting immediately, so if you have multiple redirects you
would have multiple outputs in the crawl_fetch output from segments.
The ou
On Thu, 04 Dec 2008, Dennis Kubes wrote:
> Forget my last email. I went back and read your original email. What
> type of webpages are you trying to fetch? This doesn't seem like a
> configuration issue to me.
Most of this particular url set are redirects.
The pages are dynamic pages, all fr
Forget my last email. I went back and read your original email. What
type of webpages are you trying to fetch? This doesn't seem like a
configuration issue to me.
Dennis
Dennis Kubes wrote:
Hi John,
If the http.redirect.max config variable in nutch-*.xml is set to 0 then
any redirect i
Hi John,
If the http.redirect.max config variable in nutch-*.xml is set to 0, then
any redirect is queued to be fetched during the next fetching round,
similar to new urls we parse off of a webpage. Try setting it to 3 and
the number of redirects should go down.
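
For illustration, an override along these lines is roughly what that
change would look like. This is only a sketch: it assumes local overrides
live in conf/nutch-site.xml, the description text is my own wording, and
the value 3 is just the number suggested above, not a tuned setting.

```xml
<!-- Sketch of an override, assumed to live in conf/nutch-site.xml -->
<property>
  <name>http.redirect.max</name>
  <!-- 0 queues redirects for a later fetch round; 3 is the value suggested above -->
  <value>3</value>
  <description>Illustrative wording: maximum number of redirects the
  fetcher follows immediately when fetching a page.</description>
</property>
```

The property has to sit inside the top-level <configuration> element of
that file, and the fetch needs to be rerun for it to take effect.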
Dennis
John Mendenhall wrote:
We are
> We are using nutch version nutch-2008-07-22_04-01-29.
> We have a crawldb with over 500k urls.
>
> The status breakdown was as follows:
>
> status 1 (db_unfetched): 19261
> status 2 (db_fetched): 71628
> status 4 (db_redir_temp): 274899
> status 5 (db_redir_perm): 148220
> s
We are using nutch version nutch-2008-07-22_04-01-29.
We have a crawldb with over 500k urls.
The status breakdown was as follows:
status 1 (db_unfetched): 19261
status 2 (db_fetched): 71628
status 4 (db_redir_temp): 274899
status 5 (db_redir_perm): 148220
status 6 (db_notmodif