Hi there,

I am crawling a site manually with Nutch 1.8, following the Nutch tutorial, and have completed about 6 rounds so far. However, when I run the readdb -stats command, I can see that the number of URLs is not increasing much, and a large share of them (around 50%?) are counted under the "redirect temp" status.

I noticed a property that I left at its default value, which might be the reason:

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

The interesting part is "it will record them for later fetching". I suspect that Nutch never actually fetches those redirect targets while http.redirect.max stays at the default of 0.

Is there a way to force Nutch to go and fetch those redirects? How would I do it? Should I change the value from 0 to some number, say 3, and then run generate -> fetch -> parse -> updatedb again?

Best regards,
Bin
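
P.S. To make the proposed change concrete, here is a sketch of the override I have in mind. I am assuming it belongs in conf/nutch-site.xml (rather than editing nutch-default.xml directly), and the value 3 is only an example, not a recommendation:

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- 3 is an example value; the shipped default is 0 -->
    <value>3</value>
  </property>
</configuration>
```

Going by the property description, a positive value should make the fetcher follow up to that many redirects immediately during the fetch step, instead of recording them for a later round.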

