Ken Schweigert wrote:

On Fri, Feb 11, 2005 at 07:21:00PM -0000, Aengus wrote:


On Friday, February 11, 2005 7:14 PM [GMT],
Ken Schweigert <[EMAIL PROTECTED]> wrote:



Or ... to regenerate it at your convenience:


$ wget http://www.robotstxt.org/wc/active/all.txt
$ grep "robot-name:" all.txt | awk -F: '{print $2}' |
    sed 's/^ *//g' | sort | awk '{print "ROBOTINCLUDE \"" $1 "*\""}'


grep "robot-name:" or grep "robot-useragent:"?




I used robot-name because there were entries for robot-useragent that had stuff like:

robot-useragent:                Due to a deficiency in Java it's not currently possible to set the User-Agent.
robot-useragent:None
robot-useragent: no
robot-useragent:

This kind of messed up the list, and using robot-name produces a list
more like Jeremy's.  Maybe he can chime in and let us know the correct
way.




My script uses Perl because it's a little more complicated (but not much). I used the user-agent string because it's the correct field to match on, but you have to clean it. First, special cases like the Java message have to be removed. Then there are blank entries, and things that look like wildcards and might accidentally match all user agents. Then you want to strip version numbers, and finally you don't want duplicates in your list.
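For anyone who wants to stay in the shell, the cleaning steps above can be roughly sketched as a pipeline. This is not the Perl script itself, only an approximation of the steps described. The function name clean_agents, the exact filter patterns, and the rule for what counts as a version number are all assumptions made for illustration. The extra "Mozilla" filter is also an assumption: it is there because a bare "Mozilla*" pattern would match most ordinary browsers.

```shell
#!/bin/sh
# Sketch only: read all.txt-style records on stdin and print ROBOTINCLUDE
# lines built from the robot-useragent field. Patterns are illustrative.
clean_agents() {
    grep '^robot-useragent:' |
    # keep only the value, trimming surrounding whitespace
    sed -e 's/^robot-useragent:[[:space:]]*//' -e 's/[[:space:]]*$//' |
    # drop blank entries and special cases such as "None", "no", the Java note
    grep -v -i -e '^$' -e '^none$' -e '^no$' -e '^due to a deficiency' |
    # drop anything that already looks like a wildcard
    grep -v '[*?]' |
    # cut at the first version number (assumed form: "/2.1", " v1.0", " 3")
    sed -e 's|[/ ][vV]\{0,1\}[0-9][0-9.]*.*$||' |
    # drop names that became empty or would match ordinary browsers
    grep -v -i -e '^$' -e '^mozilla$' |
    # remove duplicates
    sort -u |
    awk '{print "ROBOTINCLUDE \"" $0 "*\""}'
}
```

With that function defined, something like "wget -q -O - http://www.robotstxt.org/wc/active/all.txt | clean_agents > robots.cfg" would write the list to a file (robots.cfg is just a placeholder name). The output is still worth checking by eye before it goes into a configuration file.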


I have moved the script to my own site: http://www.wadsack.com/robot-list.html. As before, this should update weekly.


-- Jeremy Wadsack Seven Simple Machines

+------------------------------------------------------------------------
|  TO UNSUBSCRIBE from this list:
|    http://lists.meer.net/mailman/listinfo/analog-help
|
|  Usenet version: news://news.gmane.org/gmane.comp.web.analog.general
|  List archives:  http://www.analog.cx/docs/mailing.html#listarchives
+------------------------------------------------------------------------
