On 7/4/2008 12:30 AM, Michael Crawford wrote:
On Thu, Jul 3, 2008 at 8:37 AM, Jeremy Wadsack
<[EMAIL PROTECTED]> wrote:
The robots list from which that page was built no longer exists. The group that was 
maintaining it decided that it didn't make sense to maintain a database of "known 
robots" any more as anyone can make a robot.

In my personal case, it's not so much that I want to watch all the
bots, as to monitor my progress at getting a new site indexed by the
search engines.

While Google, Yahoo and MSN together provide the vast majority of
search engine referrals, there are still a few small, independent
players such as JGDO.

There are lots of reasons for running a bot, some good, some bad.  I'd
be happy if I could get a report of visits by the bots belonging to, say,
the top half-dozen search engines.

Note that with new sites it often happens that a search engine
spider doesn't visit at all for months, and even when it finally
does, it may only fetch the home page.  By creating config files for
each of my pages, I hope to monitor spider visits throughout my site.

If this isn't yet possible with Analog, I don't think it would be
hard to implement. It would be very popular, and so would get Analog
a lot more users, and maybe some consulting fees for Analog experts.

It all comes down to the same simple question - how do you decide that any given request is from a spider/bot rather than a real person? If you rely on the User-Agent string, you then have to decide how to identify the relevant strings - either assume that everything that isn't a "well known browser" is a spider, or assume that everything that asks for /robots.txt is a spider.

Unfortunately, there's nothing to stop a bot using a "well known browser" User-Agent (see recent controversy about the AVG LinkScanner, for example), and there's nothing to stop an ordinary user from requesting /robots.txt. That means that there's no simple way to automate the identification of spiders - it requires some judgement, and Analog doesn't do judgement :-).
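
For what it's worth, if you wanted to try the first of those rules anyway, a minimal Analog configuration along these lines would do it. Treat it purely as a sketch - the log file path and the User-Agent patterns are only illustrative, and you'd have to maintain your own list of "well known browsers":

    # Sketch of rule 1: treat anything whose User-Agent doesn't look like a
    # mainstream browser as a candidate robot.
    # The path and patterns below are examples only.
    LOGFILE /var/log/apache/access.log
    OUTFILE candidate-robots.html
    # Most mainstream browsers send a User-Agent beginning "Mozilla/..." -
    # but so do plenty of robots, which is exactly the problem described above.
    BROWEXCLUDE Mozilla*
    BROWEXCLUDE Opera*
    # The Browser Report and Host Report then show what's left over.
    FULLBROWSER ON
    HOST ON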

Once you come up with a set of rules that work for you (or for the set of log files that you're working with at the moment), it's not difficult to use Analog to delve deeper into the robot traffic. You can use FILEINCLUDE /robots.txt to get a list of IP addresses or Browser strings that have requested /robots.txt. You can then use that information with HOSTINCLUDE or with BROWINCLUDE to get a view of the rest of the traffic from either one specific spider, or all of the spiders as a whole. Bear in mind that the job of spidering your site might be spread between a number of different machines, so you might need to HOSTINCLUDE a range of machines if you use that technique.
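
To make that concrete, here's the sort of two-pass approach I have in mind - run Analog once with each of the configurations below (for example via the +g option). Again, this is only a sketch: the file names, the host pattern and the browser pattern are placeholders that you'd replace with whatever the first pass actually shows you.

    # Pass 1: which hosts and User-Agent strings have asked for /robots.txt?
    LOGFILE /var/log/apache/access.log
    OUTFILE robots-pass1.html
    FILEINCLUDE /robots.txt
    # The Host Report and Browser Report list the requesting machines and
    # User-Agent strings.
    HOST ON
    FULLBROWSER ON

    # Pass 2: everything else that one of those spiders did. Use whichever of
    # HOSTINCLUDE or BROWINCLUDE fits what pass 1 showed you; a single spider
    # is often spread across many machines, hence the wildcard. Note that
    # HOSTINCLUDE matches the host field as it appears in your log - IP
    # addresses unless you're resolving them.
    LOGFILE /var/log/apache/access.log
    OUTFILE robots-pass2.html
    HOSTINCLUDE crawl-*.googlebot.com
    # or, instead: BROWINCLUDE *Googlebot*
    # The Request Report and Monthly Report show which pages they fetched
    # and when.
    REQUEST ON
    MONTHLY ON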

So you can certainly use Analog to watch this type of traffic - indeed Analog's configurability makes it an ideal tool for the job. But because there are no black and white rules for deciding what is or is not a robot/spider, this functionality can't be built into Analog. The decisions that you might make today to do this analysis on your site might be different for someone else, and might be different again in a few months' time, as the list of search engines changes.

Aengus

