Re: using proxy/cache for apache mirrors

2005-12-07 Thread Colm MacCarthaigh
On Wed, Dec 07, 2005 at 01:18:32AM -0600, William A. Rowe, Jr. wrote:
 Do mirrors even validate any server signature for rsync?  If not this
 argument is blowing smoke.  For that matter, we could even endorse the
 use of ssl privately to our mirrors on the backend, with server cert
 validation to avoid exactly what you describe above, as well as any
 number of man in the middle attacks.  In fact, it seems this would be
 much more robust than today's rsync, in terms of security.

Yep, if we could do the pull over https, that would solve this. 

 I generally discourage ftp mirrors.  But yes, they would continue to 
 need to do rsync.
 
 Why?  I'm not certain, but expect there are ways to play with wget to
 fetch only new/changed files.  If not, perhaps it's time to teach wget
 some new tricks :)

If you dropped rsync, we'd lose most of the mirrors. They absolutely
won't be interested in that kind of poking. 

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]


using proxy/cache for apache mirrors

2005-12-06 Thread Joshua Slive
This is really an infrastructure topic, not an httpd-dev one, but I'd 
like the caching experts to look over this to make sure this simple 
configuration looks reasonable.  (The main issue being, it is almost 
impossible to get a mirror to change its configuration after they've 
been accepted into our list; so this needs to be future-proof.)


http://www.apache.org/info/mirror-proxy.html

Joshua.


Re: using proxy/cache for apache mirrors

2005-12-06 Thread Graham Leggett

Joshua Slive wrote:

This is really an infrastructure topic, not an httpd-dev one, but I'd 
like the caching experts to look over this to make sure this simple 
configuration looks reasonable.  (The main issue being, it is almost 
impossible to get a mirror to change its configuration after they've 
been accepted into our list; so this needs to be future-proof.)


In that case, get just one or two sites to implement it for a while to 
check whether the config is correct. Once it's been battle tested, we 
can give it a wider audience.


Regards,
Graham


Re: using proxy/cache for apache mirrors

2005-12-06 Thread Colm MacCarthaigh
On Tue, Dec 06, 2005 at 04:16:17PM -0500, Joshua Slive wrote:
 This is really an infrastructure topic, not an httpd-dev one, but I'd 
 like the caching experts to look over this to make sure this simple 
 configuration looks reasonable.  

I think this is a terrible, terrible, terrible idea in general. I've
always hated this recommendation. I've kept quiet, but no longer! ;-)

"We highly recommend that you mirror using a cached reverse proxy in
place of rsync" makes me wince with pain.

Problems:

* It's vastly more complicated than necessary and adds a burden
  to what admins have to manage. Why should they have to worry
  about managing a cache? They're busy enough trying to give us
  free resources in the first place.

* It adds massive dependencies to what a mirror server needs to run.
  Adding modules, especially proxy, is not resource-free. These
  things eat memory, research time and security work.

* It defeats a huge part of the point of having a mirroring
  system in the first place. Mirroring isn't just a way of
  decreasing bandwidth usage on the primary, it's also a means
  of building content resilience. When www.apache.org goes down, users 
  want their mirror to work. And worst of all, in the case of
  infrequently used mirrors, this is exactly when they'll
  suddenly get a lot of queries - all of which will end up in IOWAIT
  land, with a boat-load of back-end TCP connections, and no 
  content served. That really sucks, for both them and their
  users.

* mod_cache + mod_proxy is trivially vulnerable to all of the latest
  DNS cache-poisoning trickery, with no easy fix. At the very
  least we should recommend that admins hard-code www.apache.org
  in their /etc/hosts file, and that INFRA get some PI-space and
  guarantee availability at a particular IP address for
  eternity. Or deploy DNSSEC, and insist that mirrors verify the
  records.

* We haven't fixed all of the thundering herd problems :/

* It's HTTP only. A lot of users use rsync and FTP to fetch
  content from a local mirror.

* Next time www.apache.org gets compromised, the exposure
  will be two to four times as great compared to the rsync
  mirrors. CacheMaxExpire can fix this problem though.

Personally, especially because of the potential for cache poisoning, I'd
strongly advise against using this kind of technique and recommend
staying with KISS.

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]


Re: using proxy/cache for apache mirrors

2005-12-06 Thread Pau Garcia i Quiles

Quoting Colm MacCarthaigh [EMAIL PROTECTED]:


* It defeats a huge part of the point of having a mirroring
  system in the first place. Mirroring isn't just a way of
  decreasing bandwidth usage on the primary, it's also a means
  of building content resilience. When www.apache.org goes down, users
  want their mirror to work. And worst of all, in the case of
  infrequently used mirrors, this is exactly when they'll
  suddenly get a lot of queries - all of which will end up in IOWAIT
  land, with a boat-load of back-end TCP connections, and no
  content served. That really sucks, for both them and their
  users.


A few months ago I developed a patch for 2.0.54 that added a new directive
(CacheGatewayUnreachable) to prevent exactly this.

When CacheGatewayUnreachable was set to true, Apache would show the latest
cached version of the resource when the gateway is unreachable. When
CacheGatewayUnreachable was set to false, Apache would show the same error it
currently shows.

I have not tested/ported it to 2.2.0, though. If I find some time, I will.
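
The behaviour described above (serve the last cached copy when the origin cannot be reached) can be sketched roughly as follows. This is only an illustration of the idea in Python, not the patch itself: the class name, the `fetch` callable and the `serve_stale_on_error` flag are all hypothetical, and the flag merely stands in for the on/off switch that CacheGatewayUnreachable is described as providing.

```python
import time


class StaleOnErrorCache:
    """Toy cache that falls back to the last good copy when the origin fails."""

    def __init__(self, fetch, max_age, serve_stale_on_error=True):
        self.fetch = fetch            # hypothetical callable(url) -> bytes; raises OSError on failure
        self.max_age = max_age        # seconds a stored copy counts as fresh
        self.serve_stale_on_error = serve_stale_on_error
        self.store = {}               # url -> (stored_at, body)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(url)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]           # still fresh: the origin is not contacted
        try:
            body = self.fetch(url)
        except OSError:
            if entry is not None and self.serve_stale_on_error:
                return entry[1]       # origin unreachable: serve the stale copy
            raise                     # nothing cached, or fallback disabled
        self.store[url] = (now, body)
        return body
```

With the flag off, an expired entry plus an unreachable origin produces the same error a client would see today; with it on, a mirror keeps serving whatever it last fetched successfully.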




Re: using proxy/cache for apache mirrors

2005-12-06 Thread Joshua Slive

[ This really should be on infrastructure; oh well.]

Perhaps I should have mentioned off the top that I envision setting 30+ 
day expiry times on all .gz/.zip/.msi/.jar/etc files under dist/.  These 
files should never change without being renamed.


Colm MacCarthaigh wrote:

* It's vastly more complicated than necessary and adds a burden
  to what admins have to manage. Why should they have to worry
  about managing a cache? They're busy enough trying to give us
  free resources in the first place.


Either you manage the cache or you manage rsync.  I don't really see why 
one is easier than the other.



* It adds massive dependencies to what a mirror server needs to run.
  Adding modules, especially proxy, is not resource-free. These
  things eat memory, research time and security work.


We're not going to get rid of rsync mirroring, so mirrors that don't 
want to use this don't need to.




* It defeats a huge part of the point of having a mirroring
  system in the first place. Mirroring isn't just a way of
  decreasing bandwidth usage on the primary, it's also a means
  of building content resilience. When www.apache.org goes down, users 
  want their mirror to work. And worst of all, in the case of
  infrequently used mirrors, this is exactly when they'll
  suddenly get a lot of queries - all of which will end up in IOWAIT
  land, with a boat-load of back-end TCP connections, and no 
  content served. That really sucks, for both them and their
  users.


Certainly for mirrors that see themselves as providing large-scale 
backups, this would not be a good technique.  From the apache.org point 
of view, people have no way of even finding our recommended mirrors if 
we are down, so it doesn't really help.  And for frequently requested 
files, the long expiry time will allow the mirrors to continue to serve 
them.



* mod_cache + mod_proxy is trivially vulnerable to all of the latest
  DNS cache-poisoning trickery, with no easy fix. At the very
  least we should recommend that admins hard-code www.apache.org
  in their /etc/hosts file, and that INFRA get some PI-space and
  guarantee availability at a particular IP address for
  eternity. Or deploy DNSSEC, and insist that mirrors verify the
  records.


I don't really see how the situation with mod_proxy is any worse than 
the existing situation in that regard.  It could even be better given 
that cache expiry times will far exceed rsync frequencies.



* We haven't fixed all of the thundering herd problems :/


Again, with long expiry times, I don't see this as a problem.



* It's HTTP only. A lot of users use rsync and FTP to fetch
  content from a local mirror.


I generally discourage ftp mirrors.  But yes, they would continue to 
need to do rsync.



* Next time www.apache.org gets compromised, the exposure
  will be two to four times as great compared to the rsync
  mirrors. CacheMaxExpire can fix this problem though.


Again, long expiry times seem to make this problem less severe than with 
rsync.


Just to explain the reasoning behind this a little: our dist/ directory 
is rapidly approaching 10GB.  Although I don't have any statistics to 
back this up, I strongly suspect that a very small portion of that 
accounts for a very large portion of the downloads.  The rest gets 
rsynced to our hundreds of mirrors for no good reason (other than 
backups; but we don't need hundreds of backups).  In addition, our 
projects are always clamoring for faster releases -- they don't want to 
delay their announcements to wait for mirrors to sync.  I know you have 
push ideas for how to solve that, but the proxy technique works as well.


(There are other ways to address these issues, of course.  We could stop 
recruiting mirrors and limit ourselves to a dozen or so more reliable 
mirrors.  But that would be a major change in thinking.)


Joshua.


Re: using proxy/cache for apache mirrors

2005-12-06 Thread Colm MacCarthaigh
On Tue, Dec 06, 2005 at 08:16:07PM -0500, Joshua Slive wrote:
 Perhaps I should have mentioned off the top that I envision setting 30+ 
 day expiry times on all .gz/.zip/.msi/.jar/etc files under dist/.  These 
 files should never change without being renamed.

This is a double-edged sword, see below ...

 Colm MacCarthaigh wrote:
  * It's vastly more complicated than necessary and adds a burden
    to what admins have to manage. Why should they have to worry
    about managing a cache? They're busy enough trying to give us
    free resources in the first place.
 
 Either you manage the cache or you manage rsync.  I don't really see why 
 one is easier than the other.

Hundreds of projects use rsync; there's one that recommends a cache. For
the most part the either-or is a fallacy: for most mirror operators you
add to their workload :/ It doesn't affect Apache-only mirrors, but I'm
not sure how many of those there are.

 We're not going to get rid of rsync mirroring, so mirrors that don't 
 want to use this don't need to.

Sure, but why recommend one method over another so strongly?

 Certainly for mirrors that see themselves as providing large-scale 
 backups, this would not be a good technique.  From the apache.org point 
 of view, people have no way of even finding our recommended mirrors if 
 we are down, so it doesn't really help.  And for frequently requested 
 files, the long expiry time will allow the mirrors to continue to serve 
 them.

The default max expiry respected by mod_cache is 1 day, and this
really needs to be lowered in this case :)

  * mod_cache + mod_proxy is trivially vulnerable to all of the latest
    DNS cache-poisoning trickery, with no easy fix. At the very
    least we should recommend that admins hard-code www.apache.org
    in their /etc/hosts file, and that INFRA get some PI-space and
    guarantee availability at a particular IP address for
    eternity. Or deploy DNSSEC, and insist that mirrors verify the
    records.
 
 I don't really see how the situation with mod_proxy is any worse than 
 the existing situation in that regard.  It could even be better given 
 that cache expiry times will far exceed rsync frequencies.

That's the problem!  ...

  * Next time www.apache.org gets compromised, the exposure
    will be two to four times as great compared to the rsync
    mirrors. CacheMaxExpire can fix this problem though.
 
 Again, long expiry times seem to make this problem less severe than with 
 rsync.

They make the problem worse. Compromised binaries hang around for 30
days if you do that. And we'd have to track them all down. And we don't
have useful logs for many of the mirrors: their fetches look just like
regular HTTP requests. This is why it's much more dangerous compared to
rsync.

If we ask operators to increase the CacheMaxExpire to 30 days, that
means that my one-time cache-poisoning now gets me a dodgy binary on
their mirror for a full month. With rsync, only a few hours - if at all,
as it's *considerably* harder to set up a full rsync repository that
will get through most mirroring systems. 

We need to make the mirrors much more aware of the risk involved.
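
To put rough numbers on the month-versus-hours comparison, here is a small Python sketch. The 30-day figure comes from this thread and CacheMaxExpire takes its value in seconds; the 6-hour rsync interval is only an assumption standing in for "a few hours", and the model assumes a cache does not revalidate an entry before it expires.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def worst_case_stale_seconds(refresh_interval_seconds):
    """Longest time a mirror can keep serving a bad copy after the origin is fixed.

    A copy picked up just before the fix is only replaced at the next
    refresh, so the worst case is one full refresh interval.
    """
    return refresh_interval_seconds


# 30-day expiry from the thread, in the seconds CacheMaxExpire expects.
cache_max_expire = 30 * SECONDS_PER_DAY
# Assumption: mirrors rsync every 6 hours ("a few hours" in the thread).
rsync_interval = 6 * SECONDS_PER_HOUR

cache_window = worst_case_stale_seconds(cache_max_expire)
rsync_window = worst_case_stale_seconds(rsync_interval)

print(f"CacheMaxExpire value: {cache_max_expire} seconds")
print(f"cache window is {cache_window // rsync_window}x the rsync window")
```

Under those assumptions a poisoned entry can outlive the fix by 120 times as long on a caching mirror as on an rsync mirror; a shorter assumed rsync interval only widens the gap.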

 Just to explain the reasoning behind this a little: our dist/ directory 
 is rapidly approaching 10GB.  Although I don't have any statistics to 
 back this up, I strongly suspect that a very small portion of that 
 accounts for a very large portion of the downloads.  

A tiny proportion :) Out of that 10GB, we see about 150MB being pulled
daily. 10GB is fairly small, in the context of project archives.  You'd
be surprised just how many projects are at least 10 times that number!

 The rest gets rsynced to our hundreds of mirrors for no good reason
 (other than backups; but we don't need hundreds of backups). In
 addition, our projects are always clamoring for faster releases --
 they don't want to delay their announcements to wait for mirrors to
 sync.  I know you have push ideas for how to solve that, but the
 proxy technique works as well.

We'd have to go all-proxy for that, don't really have an answer
though.

 (There are other ways to address these issues, of course.  We could
 stop recruiting mirrors and limit ourselves to a dozen or so more
 reliable mirrors.  But that would be a major change in thinking.)

It would, and less community involvement too :/ 

-- 
Colm MacCárthaigh                        Public Key: [EMAIL PROTECTED]


Re: using proxy/cache for apache mirrors

2005-12-06 Thread William A. Rowe, Jr.

Joshua Slive wrote:

[ This really should be on infrastructure; oh well.]

Perhaps I should have mentioned off the top that I envision setting 30+ 
day expiry times on all .gz/.zip/.msi/.jar/etc files under dist/.  These 
files should never change without being renamed.


OK, it must be 24 hours at most.  Although they should not change, we
should see a HEAD/If-Modified-Since query at least once per day per file
requested to **ENSURE** that if we strike a file that we find to be
corrupt/viral/invalid, it in fact goes poof from the mirrors in some
reasonable amount of time.

In fact I'd set it initially to 1 hr, measure, relax it to 24 hrs and
see if we save any bandwidth/load from not testing it as frequently.  Any
correctly configured machine should not burden us with anything more than
a "still valid?" ping.
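
For anyone unfamiliar with what such a "still valid?" ping looks like on the wire, here is a minimal Python sketch of a conditional HEAD request. It illustrates the HTTP mechanism only, not what mod_cache does internally; the function names are hypothetical, and `cached_mtime` is assumed to be the Unix timestamp of the cached copy.

```python
from email.utils import formatdate
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_revalidation_request(url, cached_mtime):
    """Build a HEAD request asking whether the copy cached at cached_mtime is current."""
    request = Request(url, method="HEAD")
    # HTTP dates are always GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    request.add_header("If-Modified-Since", formatdate(cached_mtime, usegmt=True))
    return request


def is_still_valid(url, cached_mtime, timeout=10):
    """Return True if the origin answers 304 Not Modified, False on a 2xx reply."""
    try:
        with urlopen(build_revalidation_request(url, cached_mtime), timeout=timeout):
            return False          # 2xx: the origin has a different copy
    except HTTPError as err:
        if err.code == 304:
            return True           # urllib reports 304 as an HTTPError
        raise                     # e.g. 404: the file was struck from the origin
```

A 304 reply carries no body, so each check costs the origin only a few hundred bytes of headers; a 404 propagates to the caller, which is the signal to drop the cached file.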

Remember that the mod_autoindex results, themselves, can be invalidated
more frequently, which will let mirrors 'catch up' quicker than they do
today.


Colm MacCarthaigh wrote:


* mod_cache + mod_proxy is trivially vulnerable to all of the latest
  DNS cache-poisoning trickery, with no easy fix. At the very
  least we should recommend that admins hard-code www.apache.org
  in their /etc/hosts file, and that INFRA get some PI-space and
  guarantee availability at a particular IP address for
  eternity. Or deploy DNSSEC, and insist that mirrors verify the
  records.



I don't really see how the situation with mod_proxy is any worse than 
the existing situation in that regard.  It could even be better given 
that cache expiry times will far exceed rsync frequencies.


Do mirrors even validate any server signature for rsync?  If not this
argument is blowing smoke.  For that matter, we could even endorse the
use of ssl privately to our mirrors on the backend, with server cert
validation to avoid exactly what you describe above, as well as any
number of man in the middle attacks.  In fact, it seems this would be
much more robust than today's rsync, in terms of security.


* We haven't fixed all of the thundering herd problems :/


Again, with long expiry times, I don't see this as a problem.


Or fix it?


* It's HTTP only. A lot of users use rsync and FTP to fetch
  content from a local mirror.


I generally discourage ftp mirrors.  But yes, they would continue to 
need to do rsync.


Why?  I'm not certain, but expect there are ways to play with wget to
fetch only new/changed files.  If not, perhaps it's time to teach wget
some new tricks :)


* Next time www.apache.org gets compromised, the exposure
  will be two to four times as great compared to the rsync
  mirrors. CacheMaxExpire can fix this problem though.


Again, long expiry times seem to make this problem less severe than with 
rsync.


But the converse is an issue, see my first point.

Just to explain the reasoning behind this a little: our dist/ directory 
is rapidly approaching 10GB.  Although I don't have any statistics to 
back this up, I strongly suspect that a very small portion of that 
accounts for a very large portion of the downloads.  The rest gets 
rsynced to our hundreds of mirrors for no good reason (other than 
backups; but we don't need hundreds of backups).  In addition, our 
projects are always clamoring for faster releases -- they don't want to 
delay their announcements to wait for mirrors to sync.  I know you have 
push ideas for how to solve that, but the proxy technique works as well.


(There are other ways to address these issues, of course.  We could stop 
recruiting mirrors and limit ourselves to a dozen or so more reliable 
mirrors.  But that would be a major change in thinking.)


Or ship more things to archive.apache.org more quickly.