On 7/4/2008 12:30 AM, Michael Crawford wrote:
On Thu, Jul 3, 2008 at 8:37 AM, Jeremy Wadsack
<[EMAIL PROTECTED]> wrote:
The robots list from which that page was built no longer exists. The group that was 
maintaining it decided that it didn't make sense to maintain a database of "known 
robots" any more as anyone can make a robot.

In my personal case, it's not so much that I want to watch all the
bots, as to monitor my progress at getting a new site indexed by the
search engines.

While Google, Yahoo and MSN together provide the vast majority of
search engine referrals, there are still a few small, independent
players such as JGDO.

There are lots of reasons for running a bot, some good, some bad.  I'd
be happy if I could get a report of visits by the bots belonging to, say,
the top half-dozen search engines.

Note that with new sites it often happens that a search engine
spider doesn't visit at all for months, and even when it finally
does, it may only fetch the home page.  By creating config files for
each of my pages, I hope to monitor spider visits throughout my site.

If this isn't yet possible with Analog, I don't think it would be
hard to implement. It would be very popular, and so would get Analog
a lot more users, and maybe some consulting fees for Analog experts.

It all comes down to the same simple question - how do you decide that any given request is from a spider/bot rather than a real person? If you rely on the User-Agent string, you then have to decide how to identify the relevant strings - either assume that everything that isn't a "well known browser" is a spider, or assume that everything that asks for /robots.txt is a spider.

Unfortunately, there's nothing to stop a bot using a "well known browser" User-Agent (see recent controversy about the AVG LinkScanner, for example), and there's nothing to stop an ordinary user from requesting /robots.txt. That means that there's no simple way to automate the identification of spiders - it requires some judgement, and Analog doesn't do judgement :-).
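
For what it's worth, if you wanted to try the first of those rules anyway, a minimal Analog configuration along these lines would do it. Treat it purely as a sketch - the log file path and the User-Agent patterns are only illustrative, and you'd have to maintain your own list of "well known browsers":

    # Sketch of rule 1: treat anything whose User-Agent doesn't look like a
    # mainstream browser as a candidate robot.
    # The path and patterns below are examples only.
    LOGFILE /var/log/apache/access.log
    OUTFILE candidate-robots.html
    # Most mainstream browsers send a User-Agent beginning "Mozilla/..." -
    # but so do plenty of robots, which is exactly the problem described above.
    BROWEXCLUDE Mozilla*
    BROWEXCLUDE Opera*
    # The Browser Report and Host Report then show what's left over.
    FULLBROWSER ON
    HOST ON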

Once you come up with a set of rules that work for you (or for the set of log files that you're working with at the moment), it's not difficult to use Analog to delve deeper into the robot traffic. You can use FILEINCLUDE /robots.txt to get a list of IP addresses or Browser strings that have requested /robots.txt. You can then use that information with HOSTINCLUDE or with BROWINCLUDE to get a view of the rest of the traffic from either one specific spider, or all of the spiders as a whole. Bear in mind that the job of spidering your site might be spread between a number of different machines, so you might need to HOSTINCLUDE a range of machines if you use that technique.
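
To make that concrete, here's the sort of two-pass approach I have in mind - run Analog once with each of the configurations below (for example via the +g option). Again, this is only a sketch: the file names, the host pattern and the browser pattern are placeholders that you'd replace with whatever the first pass actually shows you.

    # Pass 1: which hosts and User-Agent strings have asked for /robots.txt?
    LOGFILE /var/log/apache/access.log
    OUTFILE robots-pass1.html
    FILEINCLUDE /robots.txt
    # The Host Report and Browser Report list the requesting machines and
    # User-Agent strings.
    HOST ON
    FULLBROWSER ON

    # Pass 2: everything else that one of those spiders did. Use whichever of
    # HOSTINCLUDE or BROWINCLUDE fits what pass 1 showed you; a single spider
    # is often spread across many machines, hence the wildcard. Note that
    # HOSTINCLUDE matches the host field as it appears in your log - IP
    # addresses unless you're resolving them.
    LOGFILE /var/log/apache/access.log
    OUTFILE robots-pass2.html
    HOSTINCLUDE crawl-*.googlebot.com
    # or, instead: BROWINCLUDE *Googlebot*
    # The Request Report and Monthly Report show which pages they fetched
    # and when.
    REQUEST ON
    MONTHLY ON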

So you can certainly use Analog to watch this type of traffic - indeed Analog's configurability makes it an ideal tool for the job. But because there are no black and white rules for deciding what is or is not a robot/spider, this functionality can't be built into Analog. The decisions that you might make today to do this analysis on your site might be different for someone else, and might be different again in a few months' time, as the list of search engines changes.

Aengus

