I'm not sure if it's the best way to do things, but I may be able to help
with the user agents. Basically, what I've done is capture all the user
agents that have hit my sites over the past few years. I go through
periodically and (using a bit column in the table) mark whether the agents
are bots or not.
I'm not saying it's 100% accurate (or complete), but what is? I can send
you the data if you like (let me know how you'd like it). It's sizable:
there are many thousands of rows.
I use the table to determine which sessions on the application are generated
by bots, and I prevent those sessions from being stored in my metrics
application (this reduces clutter significantly).
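Roughly, the approach sketches out like this (illustrative Python with an in-memory table; the agents, flags, and function names below are placeholders, not my actual rows or code):

```python
import sqlite3

# In-memory stand-in for the user-agent table; a bit column marks bots.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_agents (agent TEXT PRIMARY KEY, is_bot INTEGER)")
conn.executemany(
    "INSERT INTO user_agents (agent, is_bot) VALUES (?, ?)",
    [
        ("Googlebot/2.1 (+http://www.google.com/bot.html)", 1),  # flagged as bot
        ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", 0),
    ],
)

def is_bot(agent: str) -> bool:
    """Look the agent up in the table; unknown agents are treated as non-bots."""
    row = conn.execute(
        "SELECT is_bot FROM user_agents WHERE agent = ?", (agent,)
    ).fetchone()
    return bool(row and row[0])

def record_session(agent: str, sessions: list) -> None:
    """Store the session in the metrics store only if the agent isn't a bot."""
    if not is_bot(agent):
        sessions.append(agent)
```

The periodic marking pass is just an UPDATE setting the bit column on rows you recognize as bots; the filter above then keeps those sessions out of the metrics.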
If you want more accuracy/completeness, you might also consider checking out
BrowserHawk (I forget the company name) - it's a user-agent parsing component
that works from a regularly updated database of agent information. It'll
give you much more than just "isBot", but it will also cost you.
Let me know if you want my database.
Jim Davis
_____
From: Mark A. Kruger - CFG [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 04, 2004 2:06 PM
To: CF-Talk
Subject: user agent checking and spidering...
Cf talkers,
I have a client with many, many similar sites on a single server using CFMX.
Each of the sites is part of a "network" of
sites that all link together - about 150 to 200 sites in all. Each home
page has links to other sites in the network.
Periodically, it appears that Google or a similar search engine hits a home
page and spiders the links - which of
course leads it to other sites on the server and still more links. This
generates (again, this is my hypothesis from
examining the logs and behaviour) concurrent requests for similar pages that
all hit the same "news" database (in
Access). SequeLink (the Access service for JRun, I think) locks up quickly
trying to service hundreds of requests at once
to the same Access file. The result is a request queue that climbs into the
thousands and requires a restart
of the CFMX services.
To fix this issue I am migrating the databases over to SQL Server, which will
help greatly with stability, but this will
take a little time, and there is still the problem of trying to keep a
spider from hitting this single server with so
many requests at once. Each site has a pretty well-thought-out robots.txt
file, but it doesn't help, because the links
in question are to external sites - not to pages on THIS site (even though
these external sites are virtuals on the same
server).
I'm considering suggesting that a "mask" be installed for spider agents that
eliminates the absolute links and exposes only
the "internal" links - which are controlled by the robots.txt. I'd like to
know:
A) whether, in anyone's experience, my hypothesis may be correct;
B) whether there is anything I should watch out for in masking these links;
and
C) whether anyone knows of a link that gives the string values of the various
user-agents I'm trying to look for.
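The mask I have in mind could be sketched roughly like this (illustrative Python, not CFMX; the bot signatures and link lists are made-up placeholders, and in practice this logic would sit in the templates that render the home pages):

```python
# Hypothetical sketch: serve only the relative, robots.txt-controlled links
# to known spiders, and the full cross-network links to everyone else.
BOT_SIGNATURES = ("googlebot", "slurp", "msnbot")  # example substrings only

def looks_like_spider(user_agent: str) -> bool:
    """Crude check: does the user-agent contain a known spider signature?"""
    ua = user_agent.lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

def links_for(user_agent, internal_links, network_links):
    """Mask the absolute cross-site links when the requester looks like a spider."""
    if looks_like_spider(user_agent):
        return list(internal_links)  # robots.txt governs these pages
    return list(internal_links) + list(network_links)
```

One thing this makes obvious: any spider not in the signature list still sees the full link set, so the mask is only as good as the agent list behind it.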
Any help will be appreciated - thanks!
-Mark
Mark A. Kruger, MCSE, CFG
www.cfwebtools.com
www.necfug.com
http://blog.mxconsulting.com
...what the web can be!