Re: nutch fetch of redirects not ending up in index

2008-12-04 Thread Dennis Kubes
I was going to inject a single simple page that redirects to another page, with zero links on the pages themselves, then fetch/update this against a clean crawldb and see what the dump output of the crawldb is after being updated. If it works the way I think it does then there should be mult

Re: nutch fetch of redirects not ending up in index

2008-12-04 Thread John Mendenhall
Dennis,
> I will need to look deeper but I think there is a subtle logic bug in
> Fetcher.
>
> Redirect statuses, both temp and perm get output by the fetchers, even
> when redirecting immediately, so if you have multiple redirects you
> would have multiple outputs in the crawl_fetch output fr

Re: nutch fetch of redirects not ending up in index

2008-12-04 Thread Dennis Kubes
I will need to look deeper, but I think there is a subtle logic bug in Fetcher. Redirect statuses, both temp and perm, get output by the fetchers even when redirecting immediately, so if you have multiple redirects you would have multiple outputs in the crawl_fetch output from segments. The ou

Re: nutch fetch of redirects not ending up in index

2008-12-04 Thread John Mendenhall
On Thu, 04 Dec 2008, Dennis Kubes wrote:
> Forget my last email. I went back and read your original email. What
> type of webpages are you trying to fetch? This doesn't seem like a
> configuration issue to me.
Most of this particular url set are redirects. The pages are dynamic pages, all fr

Re: nutch fetch of redirects not ending up in index

2008-12-04 Thread Dennis Kubes
Forget my last email. I went back and read your original email. What type of webpages are you trying to fetch? This doesn't seem like a configuration issue to me.
Dennis
Dennis Kubes wrote:
Hi John, If the http.redirect.max config variable in nutch-*.xml is set to 0 then any redirect i

Re: nutch fetch of redirects not ending up in index

2008-12-04 Thread Dennis Kubes
Hi John,
If the http.redirect.max config variable in nutch-*.xml is set to 0, then any redirect is queued to be fetched during the next fetching round, similar to new URLs we parse off of a webpage. Try setting it to 3 and your redirects should go down.
Dennis
John Mendenhall wrote: We are
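The difference Dennis describes for http.redirect.max can be illustrated with a toy model. The sketch below is not Nutch code: the function name `simulate_rounds`, the dictionary of redirects, and the URLs are all hypothetical, and it only assumes the behavior stated in the message (a value of 0 queues each redirect target for the next fetching round; a positive value lets the fetcher follow up to that many redirects within the same round). What the real fetcher does when the limit is exceeded is not modeled.

```python
from collections import deque


def simulate_rounds(redirects, start_url, redirect_max):
    """Toy model of fetch rounds for one injected URL (not Nutch code).

    redirects:    hypothetical map of URL -> redirect target
    redirect_max: stand-in for the http.redirect.max setting
    Returns a list of rounds; each round lists the URLs fetched in it.
    """
    if redirect_max < 0:
        raise ValueError("negative values are not modeled in this sketch")

    rounds = []
    queue = deque([start_url])
    seen = {start_url}  # guards against redirect loops when queuing

    while queue:
        url = queue.popleft()
        fetched = [url]
        hops = 0
        while url in redirects:
            target = redirects[url]
            if hops < redirect_max:
                # Follow the redirect immediately, within the same round.
                hops += 1
                url = target
                fetched.append(url)
                continue
            if redirect_max == 0 and target not in seen:
                # Assumed from the message: the target waits for the next round.
                seen.add(target)
                queue.append(target)
            # Limit reached: this sketch simply stops following.
            break
        rounds.append(fetched)
    return rounds


# Hypothetical chain: a redirects to b, b redirects to c.
chain = {"http://example.com/a": "http://example.com/b",
         "http://example.com/b": "http://example.com/c"}
deferred = simulate_rounds(chain, "http://example.com/a", 0)
immediate = simulate_rounds(chain, "http://example.com/a", 3)
print(len(deferred), "rounds with redirect_max=0")
print(len(immediate), "round(s) with redirect_max=3")
```

In this model a two-hop chain needs three rounds when the setting is 0 and a single round when it is 3, which is one way to read the advice that raising the value should make the redirect counts go down.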

Re: nutch fetch of redirects not ending up in index

2008-12-03 Thread John Mendenhall
> We are using nutch version nutch-2008-07-22_04-01-29.
> We have a crawldb with over 500k urls.
>
> The status breakdown was as follows:
>
> status 1 (db_unfetched): 19261
> status 2 (db_fetched): 71628
> status 4 (db_redir_temp): 274899
> status 5 (db_redir_perm): 148220
> s

nutch fetch of redirects not ending up in index

2008-12-01 Thread John Mendenhall
We are using nutch version nutch-2008-07-22_04-01-29.
We have a crawldb with over 500k urls.
The status breakdown was as follows:
status 1 (db_unfetched): 19261
status 2 (db_fetched): 71628
status 4 (db_redir_temp): 274899
status 5 (db_redir_perm): 148220
status 6 (db_notmodif