I'm not sure if it's the best way to do things, but I may be able to help
with the user agents.  Basically what I've done is capture all the user
agents that have hit my sites over the past few years.  I go through
periodically and (using a bit column in the table) mark whether or not each
agent is a bot.
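The capture itself is nothing fancy - roughly something like this on each
request (the table and column names here are just for illustration, not
necessarily what's in my schema):

<!--- Record any user agent we haven't seen before; isBot is a bit
      column that gets set by hand later when the table is reviewed. --->
<cfquery name="qAgent" datasource="myDSN">
    SELECT agentID
    FROM   userAgents
    WHERE  agentString = <cfqueryparam value="#CGI.HTTP_USER_AGENT#" cfsqltype="cf_sql_varchar">
</cfquery>

<cfif qAgent.recordCount EQ 0>
    <cfquery datasource="myDSN">
        INSERT INTO userAgents (agentString, isBot)
        VALUES (<cfqueryparam value="#CGI.HTTP_USER_AGENT#" cfsqltype="cf_sql_varchar">, 0)
    </cfquery>
</cfif>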

I'm not saying it's 100% accurate (or complete), but what is?  I can send
you the data if you like (let me know how you'd like it).  It's sizable -
many thousands of rows.

I use the table to determine which sessions on the application are generated
by bots and prevent those sessions from being stored in my metrics
application (which reduces clutter significantly).
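The check on the metrics side is equally simple - something along these
lines, where logPageView() is just a stand-in for whatever your metrics
logging call happens to be:

<!--- Only store metrics for sessions whose agent isn't flagged as a bot. --->
<cfquery name="qCheck" datasource="myDSN">
    SELECT isBot
    FROM   userAgents
    WHERE  agentString = <cfqueryparam value="#CGI.HTTP_USER_AGENT#" cfsqltype="cf_sql_varchar">
</cfquery>

<cfif qCheck.recordCount EQ 0 OR NOT qCheck.isBot>
    <cfset logPageView(session.sessionID, CGI.SCRIPT_NAME)>
</cfif>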

If you want more accuracy/completeness, you might also consider checking out
BrowserHawk (I forget the company name) - it's a user-agent parsing component
that works from a regularly updated database of agent information.  It'll
give you much more than just "isBot", but it will also cost you.

Let me know if you want my database.

Jim Davis

  _____  

From: Mark A. Kruger - CFG [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 04, 2004 2:06 PM
To: CF-Talk
Subject: user agent checking and spidering...

CF talkers,

I have a client with many, many similar sites on a single server running
CFMX.  Each of the sites is part of a "network" of sites that all link
together - about 150 to 200 sites in all.  Each home page has links to other
sites in the network.  Periodically, it appears that Google or a similar
search engine hits a home page and spiders the links - which of course leads
it to other sites on the server and other links.  This generates (again, this
is my hypothesis from examining the logs and behaviour) concurrent requests
for similar pages that all hit the same "news" database (in Access).
SequeLink (the Access service for JRun, I think) locks up quickly trying to
service hundreds of requests at once against the same Access file.  The
result is a request queue that climbs into the thousands and requires a
restart of the CFMX services.

To fix this issue I am migrating the databases over to SQL Server, which will
help greatly with stability, but that will take a little time, and there is
still the problem of trying to avoid letting a spider hit this single server
with so many requests at once.  Each site has a pretty well-thought-out
robots.txt file, but it doesn't help because the links in question are to
external sites - not pages on THIS site (even though those external sites are
virtuals on the same server).

I'm considering suggesting that a "mask" be installed for spider agents that
eliminates the absolute links and only exposes the "internal" links - which
are controlled by the robots.txt (there's a rough sketch of what I mean after
the questions below).  I'd like to know:

A) In anyone's experience, might my hypothesis be correct?

B) Is there anything I should watch out for in masking these links?

C) Does anyone know of a link that lists the string values of the various
user-agents I should be looking for?
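Here's the sketch of the mask I mentioned - just a rough idea, and the agent
strings in the regular expression are only examples of the sort of thing I'd
need to match:

<!--- Hide the cross-site (absolute) links from known spiders so they only
      ever see internal links, which robots.txt already controls. --->
<cfset isSpider = REFindNoCase("(googlebot|slurp|msnbot|crawler|spider)", CGI.HTTP_USER_AGENT) GT 0>

<cfif NOT isSpider>
    <!--- normal visitors get the absolute links to the rest of the network --->
    <a href="http://www.some-other-network-site.com/">Another network site</a>
<cfelse>
    <!--- spiders only get internal, robots.txt-controlled links --->
    <a href="/news/index.cfm">News</a>
</cfif>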

Any help will be appreciated - thanks!

-Mark

Mark A. Kruger, MCSE, CFG
www.cfwebtools.com
www.necfug.com
http://blog.mxconsulting.com
...what the web can be!
