stopping robots

2006-07-25 Thread prad
what is the best way to stop those robots and spiders from getting in?

.htaccess?
robots.txt and apache directives?
find them on the access_log and block with pf?

i should also ask whether it is a good idea to block robots in the first place 
since some do help to increase presence on the web.
which are good robots and which are bad?


-- 
In friendship,
prad

  ... with you on your journey
Towards Freedom
http://www.towardsfreedom.com (website)
Information, Inspiration, Imagination - truly a site for soaring I's



Re: stopping robots

2006-07-25 Thread Jack J. Woehr
On Jul 25, 2006, at 3:45 PM, prad wrote:

> which are good robots and which are bad?

The good ones are Asenion robots, the bad ones are non-Asenion robots.
But that's not a hard-and-fast rule; remember the Nestor series.

---
Jack J. Woehr
Director of Development
Absolute Performance, Inc.
[EMAIL PROTECTED]
303-443-7000 ext. 527



Re: stopping robots

2006-07-25 Thread Darrin Chandler
On Tue, Jul 25, 2006 at 02:45:28PM -0700, prad wrote:
> what is the best way to stop those robots and spiders from getting in?
> 
> .htaccess?
> robots.txt and apache directives?
> find them on the access_log and block with pf?
> 
> i should also ask whether it is a good idea to block robots in the first 
> place 
> since some do help to increase presence on the web.
> which are good robots and which are bad?

Almost all "real" robots will obey robots.txt, and that should be your
first attempt.

The ones that do not obey robots.txt will probably not obey anything
else, either. If you block them with pf, try "block return" instead of
"block drop" and maybe they'll give up quicker.

As for whether you should block them or not, that's up to you. I am
currently blocking Yahoo Slurp in robots.txt (yes, it works) because
Yahoo has always been irrelevant to my traffic, their robot is
incredibly obnoxious, and every one of their referrals leaves off the
trailing slash for directories. Other than that I let them all come and
for the most part they behave well.
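
The robots.txt entry for that is just something like this (Yahoo's crawler identifies itself as "Slurp"):

User-agent: Slurp
Disallow: /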

And... this is *REALLY* not an OpenBSD topic, and there's a *LOT* that's
been written about this topic in other places.

-- 
Darrin Chandler|  Phoenix BSD Users Group
[EMAIL PROTECTED]   |  http://bsd.phoenix.az.us/
http://www.stilyagin.com/  |



Re: stopping robots

2006-07-25 Thread Rogier Krieger

On 7/25/06, prad <[EMAIL PROTECTED]> wrote:

> what is the best way to stop those robots and spiders from getting in?


The sure way to stop robots and spiders is to shut down your web
server. I don't suppose that's the answer you're looking for.

Treat malicious robots as malicious/unwelcome users, for whatever your
definition of malicious. Do not expect to be able to easily discern
between regular human users and robots: user-agent strings and the like
are too easy to alter to be relied upon without precautions (as with
all client-generated input).



> .htaccess?


That might help, but it won't solve your problem of discerning between
human and automated clients. Also, the usual problems/threats regarding
credentials will of course apply. Mind you, automated processes
(robots) can use credentials too.

You could also use a CAPTCHA. Various modules (PHP, Perl) exist that
make these easy to integrate. Whether (or when) robots will be able to
fool such tests is another matter.



> robots.txt and apache directives?


Well-behaved robots will adhere to measures such as (x)html meta tags,
robots.txt files, etc. Other robots may not.
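
For reference, the page-level meta tag form (again honored only by well-behaved robots) is:

<meta name="robots" content="noindex,nofollow">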



> find them on the access_log and block with pf?


Using access_log means you're acting on information gathered after the fact.



> which are good robots and which are bad?


Apart from robots/spiders potentially being excellent friends, allowing
them (e.g. Google) may also have undesirable side effects. Such effects
range from out-dated information being displayed to search engine
users, to sensitive data being stored on servers outside your
influence. I'm sure there are many more.

I'd recommend you think about your threat model first and use that to
determine which information you deem sensitive and to what lengths you
will go to secure that information.

Cheers,

Rogier

--
If you don't know where you're going, any road will get you there.



Re: stopping robots

2006-07-25 Thread Han Boetes
I got these tips from an old message on this list; I hope they help
you as well.

# rule-based rewriting engine to rewrite requested URLs on the fly
LoadModule rewrite_module /usr/lib/apache/modules/mod_rewrite.so

#
# Redirect allows you to tell clients about documents which used to exist in
# your server's namespace, but do not anymore. This allows you to tell the
# clients where to look for the relocated document.
# Format: Redirect old-URI new-URL
#

# This is a tip from jont #openbsd.org
RedirectMatch ^.*\.(ida|exe|dll).* http://support.microsoft.com/

### Special section for stopping bad traffic, DDoS-type attacks, etc.
# These are global rewrite rules that can be turned on per virtual
# server.

RewriteCond %{HTTP_REFERER}     ^$
RewriteCond %{HTTP_USER_AGENT}  ^$
RewriteCond %{REQUEST_URI}      ^/$
RewriteRule ^/.*                        http://support.microsoft.com/$1  [L,E=nolog:1]

RewriteRule (.*)cmd.exe(.*)$            http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)root.exe(.*)$           http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)shell.exe(.*)$          http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/_vti_bin\/(.*)$       http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/_vti_cnf\/(.*)$       http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/_vti_inf\/(.*)$       http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/scripts\/\.\.(.*)$    http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/_mem_bin\/(.*)$       http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/msadc\/(.*)$          http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/MSADC\/(.*)$          http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/c\/winnt\/(.*)$       http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/d\/winnt\/(.*)$       http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/x80\/(.*)$            http://support.microsoft.com/$1  [L,E=nolog:1]
RewriteRule (.*)\/x90\/(.*)$            http://support.microsoft.com/$1  [L,E=nolog:1]


# Various attempts found in my logfiles, always eager to please:
RewriteRule (.*)php(.*)$                http://www.php.net/$1              [L,E=nolog:1]
RewriteRule (.*)awstats(.*)$            http://awstats.sourceforge.net/$1  [L,E=nolog:1]

# Against phpbb exploit searchers
RewriteRule (.*)forum(.*)$              http://www.phpbb.com/$1  [L,E=nolog:1]
RewriteRule (.*)discussion(.*)$         http://www.phpbb.com/$1  [L,E=nolog:1]
RewriteRule (.*)foros(.*)$              http://www.phpbb.com/$1  [L,E=nolog:1]
# RewriteRule (.*)nar(.*)$              http://www.phpbb.com/$1  [L,E=nolog:1]
RewriteRule (.*)portal(.*)$             http://www.phpbb.com/$1  [L,E=nolog:1]

# Complex rules: bb.gif is ok, anything else with bb is sent to phpbb
RewriteRule (.*)bb(.*)$                 - [C]
RewriteRule !(.*)bb(.*)\.gif$           http://www.phpbb.com/$1  [L,E=nolog:1]

RewriteRule (.*)board(.*)$              - [C]
RewriteRule !(.*)board(.*)\.pl(.*)$     http://www.phpbb.com/$1  [L,E=nolog:1]

SetEnvIf Referer 0 nolog=1
SetEnvIf Request_URI 0 nolog=1

RewriteEngine  on



# Han



Re: stopping robots

2006-07-25 Thread Mike Erdely

prad wrote:

> what is the best way to stop those robots and spiders from getting in?


Someone on this list (who can reveal themselves if they want) has a 
pretty good setup to block "disrespectful" robots.


They have a robots.txt file that specifies a "Disallow: /somedir/". 
Anyone that actually GOES into that directory gets blocked by PF.


It'd be pretty easy to parse your /var/www/logs/access_log for accesses 
of "/somedir/" and have them added to a table.


-ME



Re: stopping robots

2006-07-25 Thread Spruell, Darren-Perot
From: [EMAIL PROTECTED] 
> what is the best way to stop those robots and spiders from getting in?
> 
> .htaccess?
> robots.txt and apache directives?
> find them on the access_log and block with pf?
> 
> i should also ask whether it is a good idea to block robots 
> in the first place 
> since some do help to increase presence on the web.
> which are good robots and which are bad?

And here I've never considered them a threat. Do you have information in a
robots.txt that you shouldn't? What's your concern with them?

DS



Re: stopping robots

2006-07-26 Thread Nick Guenther

On 7/25/06, Mike Erdely <[EMAIL PROTECTED]> wrote:

> prad wrote:
> > what is the best way to stop those robots and spiders from getting in?
>
> Someone on this list (who can reveal themselves if they want) has a
> pretty good setup to block "disrespectful" robots.
>
> They have a robots.txt file that specifies a "Disallow: /somedir/".
> Anyone that actually GOES into that directory gets blocked by PF.
>
> It'd be pretty easy to parse your /var/www/logs/access_log for accesses
> of "/somedir/" and have them added to a table.
>
> -ME



Arxiv dumps massive amounts of data at you and then blocks you if you
access a special robot-trap page. See
http://arxiv.org/RobotsBeware.html.

If you're using a CGI/template-based or frame-based site, it would not
be difficult to generate a new trap page every day and link it on all
pages for robots to fall into. You could even make the URL sound
reasonable by using a wordbank, so that statistical analysis of the
characters can't pick out the real links from the trap ones. You'd
also have to make sure to move the link around (i.e. just having the
trap as the last link on every page is obvious).
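
Something along these lines could generate the rotating trap path (the wordbank location and /notes/ prefix are made up for illustration):

#!/bin/sh
# Pick two words from a wordbank, seeded by today's date, so the trap
# link changes daily but stays stable within the day.
seed=$(date +%Y%m%d)
pick() {
        awk -v s="$1" 'BEGIN { srand(s) } { w[NR] = $1 }
                END { print w[int(rand() * NR) + 1] }' /var/www/conf/wordbank
}
echo "/notes/$(pick "$seed")-$(pick "$((seed + 1))").html"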

However the above is probably excessive; robot authors really aren't
that unlazy (that's the whole reason they are running a robot in the
first place).

-Nick



Re: stopping robots

2006-07-31 Thread Marc Espie
I've got a robots.txt, and a script that loops to infinity.
Actually, it's a useful page on the server, there's a list that can be
ordered two ways, and switching from one to the other increments a parameter
at the end of the invocation.

A robot has no business reading that specific page in the first place (in
fact, they're disallowed to), and after a small number of loops (10 or 15),
the webserver becomes very unresponsive, thus ensuring the robot writer will
lose a lot of time on that page.

Assuming reasonable technologies (e.g., mason), the URL does not even have
to look like a script...
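
Not the actual page, but the shape of the trick fits in a dozen lines of CGI (the parameter name and loop threshold here are only illustrative):

#!/bin/sh
# Each response links back to itself with the sort parameter bumped;
# once a client has followed the link a handful of times, responses
# get slower and slower, so only a runaway robot pays the price.
n=${QUERY_STRING#order=}
case "$n" in ""|*[!0-9]*) n=0 ;; esac
[ "$n" -gt 10 ] && sleep $((n * 2))
printf 'Content-Type: text/html\r\n\r\n'
printf '<html><body><p>the list, sorted</p>'
printf '<a href="?order=%d">sort the other way</a></body></html>\n' $((n + 1))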