Dear Apache developers,
This is a suggestion relative to the code of the Apache httpd webserver, and a possible new default option in the standard distribution of Apache httpd.
It also touches on WWW security, which is why I felt that it belongs on this list, rather than on the general users' list. Please correct me if I am mistaken.
According to Netcraft, there are currently some 600 Million webservers on the WWW, with more than 60% of those identified as "Apache".
I currently administer about 25 of these webservers (Apache httpd/Tomcat), unremarkable in any way (business applications for medium-sized companies).
In the logs of these servers, every day, there are episodes like the following:
209.212.145.91 - - [03/Apr/2013:00:52:32 +0200] "GET /muieblackcat HTTP/1.1" 404 362 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/index.php HTTP/1.1" 404 365 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/pma/index.php HTTP/1.1" 404 369 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:36 +0200] "GET //admin/phpmyadmin/index.php HTTP/1.1" 404 376 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //db/index.php HTTP/1.1" 404 362 "-" "-"
209.212.145.91 - - [03/Apr/2013:00:52:37 +0200] "GET //dbadmin/index.php HTTP/1.1" 404 367 "-" "-"
... etc.
Such lines are the telltale trace of a "URL-scanning bot", or of the "URL-scanning" part of a bot, and I am sure that you are all familiar with them. Obviously, these bots are trying to find webservers which exhibit poorly-designed or poorly-configured applications, with the aim of identifying hosts which can be subjected to various kinds of attacks, for various purposes. As far as I can tell from my own unremarkable servers, I would surmise that many or most webservers facing the Internet are subjected to this type of scan every day.
Hopefully, most webservers are not really vulnerable to this type of scan.
But the fact is that *these scans are happening, every day, on millions of webservers*. And they are at least a nuisance, and at worst a serious security problem when, as a result of poorly configured webservers or applications, they lead to break-ins and compromised systems.
It is basically a numbers game, like malicious emails: it costs very little to do this, and if even a tiny proportion of webservers exhibit one of these vulnerabilities, then because of the numbers involved, it is worth doing.
If there are 600 Million webservers, 50% of them are scanned every day, and 0.01% of these webservers are vulnerable because of one of these URLs, then every day, 30,000 (600,000,000 x 0.5 x 0.0001) vulnerable servers will be identified.
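To make this arithmetic easy to re-run with different assumptions, here is the same estimate as a few lines of Python (the figures are just the rough estimates used above):

    # "Numbers game" estimate: vulnerable servers identified per day.
    servers_total   = 600_000_000   # webservers on the WWW
    scanned_share   = 0.5           # fraction scanned every day
    vulnerable_rate = 0.0001        # 0.01% vulnerable via one of these URLs

    print(servers_total * scanned_share * vulnerable_rate)   # 30,000 per day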
About the "cost" aspect: from the data in my own logs, such bots seem to be scanning about 20-30 URLs per pass, at a rate of about 3-4 URLs per second.
Since my Apache httpd servers take approximately 10 ms on average to respond (with a 404 Not Found) to one of these requests, and the bots only request 1 URL per 250 ms, I would imagine that they have some built-in rate-limiting mechanism, to avoid being "caught" by various webserver-protection tools. Maybe they are also smart, and scan several servers in parallel, so as to limit the rate at which they "burden" any server in particular. (In this rough calculation, I am ignoring network latency for now.)
So if we imagine a smart bot which is scanning 10 servers in parallel, issuing 4 requests per second to each of them, for a total of 20 URLs per server, and we assume that all these requests result in 404 responses with an average response time of 10 ms, then it "costs" this bot only about 2 seconds of aggregated response time to complete the scan of 10 servers.
If there are 300 Million servers to scan, then the total cost for scanning all the servers, by any number of such bots working cooperatively, is an aggregated 60 Million seconds. And if one such "botnet" has 10,000 bots, that boils down to only 6,000 seconds per bot.
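For anyone who wants to check or vary these numbers, the same back-of-the-envelope calculation in Python:

    # Aggregated cost of one full scan, using the assumptions above.
    servers  = 300_000_000   # servers scanned (50% of 600 Million)
    urls     = 20            # URLs probed per server
    resp_404 = 0.010         # average 404 response time, in seconds
    bots     = 10_000        # bots in the "botnet"

    total_seconds   = servers * urls * resp_404   # 60,000,000 s aggregated
    seconds_per_bot = total_seconds / bots        # 6,000 s per bot (< 2 hours)
    print(total_seconds, seconds_per_bot)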
It is scary that 50% of all Internet webservers can be scanned for vulnerabilities in less than 2 hours, and that such a scan may result in "harvesting" several thousand hosts, candidates for takeover.
Now, how about making it so that, without any special configuration or add-on software or skills on the part of webserver administrators, it would cost these same bots *about 100 times as long (several days)* to do their scan? (With universal adoption, each 404 would take an average 1000 ms instead of 10 ms, i.e. roughly 100 times as long per URL probed.)
The only cost would be a relatively small change to the Apache webservers, which is what my suggestion consists of: adding a variable delay (say, between 100 ms and 2000 ms) to any 404 response.
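To illustrate how simple the mechanism is, here is a minimal sketch that approximates the idea on a stock Apache today, without touching the C code: a (purely hypothetical) CGI script installed as the 404 ErrorDocument, which sleeps for a random 100-2000 ms before returning the error page. The path and filename are of course just examples:

    #!/usr/bin/env python3
    # Hypothetical stand-in for the proposed built-in behaviour:
    # delay every 404 response by a random 100-2000 ms.
    #
    # Assumed httpd.conf line (with CGI enabled):
    #   ErrorDocument 404 /cgi-bin/delay404.py
    import random
    import sys
    import time

    time.sleep(random.uniform(0.1, 2.0))   # the variable delay

    # Preserve the 404 status across the internal redirect
    # (the CGI "Status" header, RFC 3875).
    sys.stdout.write("Status: 404 Not Found\r\n")
    sys.stdout.write("Content-Type: text/html\r\n\r\n")
    sys.stdout.write("<html><body><h1>Not Found</h1></body></html>\n")

A native implementation inside httpd would of course avoid the CGI overhead, and could be enabled or tuned by a simple configuration directive; the script above is only meant to show that the mechanism itself is trivial.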
The suggestion is based on the observation that there is a dichotomy between this kind of access by bots and the kind of access made by legitimate HTTP users/clients: legitimate users/clients (including the "good bots") mostly access links "which work", so they rarely get "404 Not Found" responses. Malicious URL-scanning bots, on the other hand, by the very nature of what they are scanning for, get many "404 Not Found" responses.
As a general principle then, anything which increases the delay to obtain a 404 response should impact these bots much more than it impacts legitimate users/clients.
How much?
Let us imagine for a moment that this suggestion is implemented in the Apache webservers, and is enabled in the default configuration. And let's imagine that after a while, 20% of the Apache webservers deployed on the Internet have this feature enabled, and are now delaying any 404 response by an average of 1000 ms.
And let's re-use the numbers above, and redo the calculation.
The same "botnet" of 10,000 bots is thus still scanning 300 Million webservers, each bot scanning 10 servers at a time, for 20 URLs per server. Previously, this took about 6,000 seconds per bot.
However now, instead of an average delay of 10 ms to obtain a 404 response, in 20% of the cases (60 Million webservers) they will experience an average 1000 ms additional delay per URL scanned.
This adds (60,000,000 servers / 10 scanned in parallel x 20 URLs x 1 second) 120,000,000 seconds to the scan. Divided by 10,000 bots, this is 12,000 additional seconds per bot (roughly 3 hours 20 minutes).
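Again as a few lines of Python, so that the assumptions can be varied:

    # Additional scan cost once 20% of the servers delay their 404s.
    delayed_servers = 60_000_000   # 20% of the 300 Million scanned servers
    urls            = 20           # URLs probed per server
    delay_s         = 1.0          # average added delay per 404, in seconds
    parallel        = 10           # servers each bot scans in parallel
    bots            = 10_000       # bots in the "botnet"

    extra_total   = delayed_servers * urls * delay_s / parallel   # 120,000,000 s
    extra_per_bot = extra_total / bots                            # 12,000 s per bot
    print(extra_total, extra_per_bot)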
So with a small change to the code, no add-ons, no special configuration skills on the part of the webserver administrator, no firewalls, no filtering, no need for updates to any list of URLs or bot characteristics, little inconvenience to legitimate users/clients, and only a very partial adoption over time, it seems that this scheme could roughly triple the cost for bots to acquire the same number of targets (from 6,000 to 18,000 seconds per bot, in the example above). Or, seen another way, it could cut to a third the number of webservers being scanned every day.
I know that this is a hard sell. The basic idea sounds a bit too simple to be effective.
It will not kill the bots, and it will not stop them from scanning Internet servers in the other ways that they use. It does not miraculously protect any single server against such scans, and the benefit of any one server implementing this is diluted over all webservers on the Internet.
But it is also not meant as an absolute weapon. It is targeted specifically at a particular type of scan done by a particular type of bot for a particular purpose, and is just a scheme to make this more expensive for them. It may or may not discourage these bots from continuing with this type of scan (if it does, that would be a very big result).
But at the same time, compared to any other kind of tool that can be used against these scans, this one seems really cheap to implement, does not seem easy to circumvent, and seems to have at least the potential of bringing big benefits to the WWW at large.
If there are reasonable objections to it, I am quite prepared to accept that, and drop it. I have already floated the idea in a couple of other places, and gotten what could be described as "tepid" responses. But it seems to me that most of the negative-leaning responses which I have received so far were more of the a priori "it will never work" kind, rather than real objections based on real facts.
So my hope here is that someone has the patience to read through this, and would have the additional patience to examine the idea "professionally".