Force Nutch to fetch the redirected URLs that are in db_redir_temp

Bin Wang Sat, 12 Jul 2014 21:06:25 -0700

Hi there,

I am crawling a site manually with Nutch 1.8, following the Nutch tutorial,
and have completed about 6 rounds so far. However, when I run the
readdb -stats command, I can see that the number of URLs is not increasing
dramatically, and a large share of the URLs (about 50%?) have the status
db_redir_temp.


I noticed that there is a property that I left at its default value, which
might be the reason.

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

The interesting part is "it will record them for later fetching". I suspect
that Nutch never actually fetches those redirect targets as long as I leave
http.redirect.max at its default of 0.

Is there a way to force Nutch to go and fetch those redirects? How could I
do it? Would it be enough to change the setting from 0 to some number, say 3,
and then run generate -> fetch -> parse -> updatedb ... again?
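
To make the question concrete, the change I have in mind is an override in
conf/nutch-site.xml along these lines. This is only a sketch: the value 3 is
an arbitrary example, and I am assuming the override belongs in
nutch-site.xml rather than in nutch-default.xml.

<!-- Sketch of an override placed inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>http.redirect.max</name>
  <!-- 3 is an example value; the default is 0 -->
  <value>3</value>
</property>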

Best regards,

Bin
