How about introducing these changes in an effort to force the Nutch
admins to properly edit the bot identity strings?
1. Add the http.agent.* entries to nutch-site.xml, each with the value
"EDITME". The description of each entry should clearly state that the
value *must* be edited to reflect the true identity of the site.
2. Add a piece of code to the HTTP crawler that checks the
configuration. If any of the http.agent.* entries is still "EDITME",
the code logs an error and exits.
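
A rough sketch of the check in step 2 might look like the following.
It is only an illustration: it uses a plain Map in place of the real
Nutch configuration object, and the class name AgentConfigCheck, the
method findUnedited, and the sample keys and values in main are all
made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical helper, not part of Nutch: finds http.agent.* entries
// that still carry the "EDITME" placeholder.
public class AgentConfigCheck {
    private static final Logger LOG =
        Logger.getLogger(AgentConfigCheck.class.getName());

    static final String PLACEHOLDER = "EDITME";
    static final String PREFIX = "http.agent.";

    // Returns the names of all http.agent.* entries whose value is
    // still the placeholder, in the iteration order of the map.
    static List<String> findUnedited(Map<String, String> conf) {
        List<String> unedited = new ArrayList<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            String value = entry.getValue();
            if (entry.getKey().startsWith(PREFIX)
                    && value != null
                    && PLACEHOLDER.equals(value.trim())) {
                unedited.add(entry.getKey());
            }
        }
        return unedited;
    }

    public static void main(String[] args) {
        // Sample settings standing in for what would be read from
        // nutch-site.xml; the keys and values are assumptions.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("http.agent.name", "EDITME");
        conf.put("http.agent.url", "http://example.org/bot.html");

        List<String> unedited = findUnedited(conf);
        if (!unedited.isEmpty()) {
            // Log the error and exit, as proposed in step 2.
            LOG.severe("Bot identity not configured; edit these entries"
                + " in nutch-site.xml: " + unedited);
            System.exit(1);
        }
    }
}
```

Returning the list of offending entries, rather than exiting inside the
check itself, keeps the check easy to test and lets the caller decide
how to log and shut down.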
-kuro
P.S. I'm subscribed to the digest version of the ML. If the same or a
better idea has already been raised, please ignore this.