Re: mod_mbox helper scripts and programs

2006-01-10 Thread Paul A Houle

Justin Erenkrantz wrote:

On Tue, Jan 10, 2006 at 06:55:37PM +0100, Mads Toftum wrote:
  

On Tue, Jan 10, 2006 at 09:51:36AM -0800, Paul Querna wrote:


Python!
  

Excellent choice - at least that way I won't have to even consider
trying ;)



Even Perl would be an improvement over zsh.  =)  -- justin

  
   I'd lean towards Perl.  You're more likely to find Perl installed 
on a system than Python.




Re: [Fwd: [EMAIL PROTECTED] [announce] Apache HTTP Server 2.2.0 Released]

2005-12-02 Thread Paul A Houle

William A. Rowe, Jr. wrote:


If the user must rewrite their content handlers to pay attention to thread
jumps, and must use some sort of yield beyond apr_socket_poll/select(), then
I'd agree it becomes 3.0.

I wouldn't worry about protocol/mpm authors, who are a breed unto ourselves
and should be able to handle any bump.  It's the joe who put together a nifty
auth or content module who has to jump through hoops to learn the new API who
will suffer if we don't yell out **3.0**!!!

   Well,  I get back to thinking about the cultural difference between 
open source and commercial software here.  That is,  the difference 
between a conceptually simple system (Unix) and a system with 
optimization for every special case,  co-design,  et al. (OS/360).


   When Microsoft decided that threading would be ~the way~ in 
Windows,  they put a lot of effort into letting people keep running old 
components,  by tagging them as threadsafe or not and automatically 
serializing the ones that aren't.


   The prefork model has contributed to the reality and perception of 
Apache as a reliable product.  Threading is great for a system that's 
built like a Swiss watch,  but a disaster for the tangle of scripts (or 
tangle of objects) that most web sites run.


   In the days of MS-DOS,  I did a lot of downloading over a 2400 baud 
modem,  so I wrote an xmodem/ymodem client that ran in the background as 
you did other things -- this was your classic interrupt-driven state 
machine.  I never got it to work 100% right.  It's not easy to do this 
kind of programming AND when you push the hardware and OS in this 
direction you find out things you wish you hadn't.


   I've had the same experience with single-threaded web servers such 
as thttpd and boa.  I'd do a quick evaluation and I'd be like "damn 
that's fast" and then I'd put it into production and find that they 
didn't work 100% right.  I'd go back to Apache because,  even if it ate 
more RAM than I liked,  Apache worked correctly.


   As Apache has been moving in the direction of more efficient 
concurrency models,  IIS has been moving in the direction of more 
isolation.  I'd love more efficiency,  but I don't want to give up 
reliability -- if the world's web apps are going to need to be designed 
like Swiss watches in the future,  the world is going to pass Apache 3 by.




  


Re: OT: performance FUD

2005-11-30 Thread Paul A Houle

Nick Kew wrote:


That looks a lot like Windows' market position.  And I suspect it's no 
accident: both products have heaped on new 'goodies', all too often

at the expense of other considerations.  It's IMO also no accident
that PHP is moving towards a Windows-like security track record.
  

   You'll find skeletons if you go looking in CPAN.

   Market share is a lot of the reason why people target malware at 
Windows.  If you wrote an email virus for the Mac,  one Mac would infect 
the other Mac and that would be the end of your fun.


   The real trouble with PHP is that it's sparked a revolution in web 
server software:  code reuse.  Before PHP,  you couldn't find affordable 
web hosting for dynamic sites:  cgi-bin was so expensive and problematic 
that mass hosting facilities couldn't afford to host it.  Mod_perl would 
be out of the question.


   If you wanted to start a weblog or a wiki four years ago,  you 
couldn't find reliable software that would hold up in the real world 
unless you were willing to put a lot of work into it.  Today you can 
download Drupal,  Wordpress or any of a large number of packages.  So 
now there are tens of thousands of sites running the same software with 
predictable URLs that people can mess around with and find bugs in the 
underlying software.  If there were any Perl or Java apps of the same 
popularity,  we'd be seeing the same thing.


   The difference is you can get a shared web hosting account for 
$10/month if you want to run a Wordpress site on PHP,  but you really 
want a dedicated server,  more like $200/month,  if you want to run 
mod_perl or Java.


   If you wanted to match the functionality of PHP,  in mod_perl or 
Java,  you'd have to install twenty or so framework modules -- everybody 
is going to pick a different set of modules,  so attackers aren't going 
to have a consistent profile to hit,  but on the other hand,  this 
inconsistency makes it harder to incorporate other people's code into 
your site.





Re: OT: performance FUD

2005-11-30 Thread Paul A Houle

Colm MacCarthaigh wrote:



Is there anything we can do in 2.4/3.0 that will help gain that trust?
 

   It's not Apache's fault.  It's not even PHP's fault.  It's a much 
bigger problem with the open source libraries that people link into 
PHP,  Perl,  Python and the like.


   The problem is particularly perceived as a PHP problem because (1) 
PHP is the market leader,  and (2) the PHP developers are a lot more 
responsible than,  say,  the Python developers,  who tell you to go 
ahead and write threaded apps in Python anyway.


   I suppose that the PHP developers could set up some system where 
extensions are marked as being threadsafe or not,  and there's a lock on 
every untrusted module,  then do a program of certifying modules as 
safe,  but that's a ~big~ project:  race conditions and deadlocks are a 
bitch to debug,  particularly when the problems are in somebody else's code.


   PHP's market position is as a product that any idiot can download 
and install,  just following the instructions,  and get a system with 
good reliability and performance -- a painful phase of shaking out 
threading bugs would endanger that perception.


   The best thing I can see Apache doing is some kind of hybrid model 
which works like mod_event for static pages,  and passes off (some or 
all) dynamic requests to something like prefork.  Dynamic requests would 
eat more memory than worker,  but you don't have the  problem of using a 
heavyweight mod_perl or mod_php process spending two hours blasting bits 
out of a file to somebody on dialup.


   A process-based model is always going to be more reliable than a 
thread-based model.  A hand grenade can go off in a server process,  a 
server process can hemorrhage memory terribly,  and nobody gets hurt.  The 
user on the other end just hits 'reload' and goes on his way.


Re: OT: performance FUD

2005-11-30 Thread Paul A Houle

Jess Holle wrote:



So if one uses worker and few processes (i.e. lots of threads per), 
then Solaris should be fine?



   That's what people think,  but I'd like to see some numbers.

   I've never put a worker Apache into production because most of our 
systems depend on PHP or something else which I wouldn't trust 100% in a 
threaded configuration.


   Now that I think about it,  there is a common situation where people 
with modest web sites (at the 50,000 ranking in Alexa) have performance 
problems with Apache...  That's the case of people serving downloads of 
big (>> 1 MB) files.  Conventional benchmarking,  which fetishizes a large 
and constant number of connections on a LAN,  doesn't model the situation 
well (it doesn't model any real-world situation well).


   The trouble is that you have a population of people with really bad 
connections that take forever to download things...  Back when I had 
dialup,  I used to download ISO images:  I'd just use a download manager 
and leave my computer running overnight to do it.  For one project I work 
on,  we have people uploading files that are sometimes in the ~1 MB 
range,  then we do processing on the files that is sometimes 
extensive.  We were worried that some processes were running for 20, 30, 
40 minutes,  but we discovered that many of our users have horrible 
connections.


   The result is that a site with a modest number of "hits" per day can 
have > 1000 simultaneous connections.  With prefork you end up burning a 
lot more RAM than really seems fair -- although it's not so bad if you 
can afford to load your machine with 8G.





  


Re: OT: performance FUD

2005-11-29 Thread Paul A Houle

Justin Erenkrantz wrote:



If it's on equivalent hardware (i.e. Linux/Intel vs. Solaris/Intel on 
the same box), I doubt there will be an extreme performance gap.  In 
fact, I've often seen Solaris outperform Linux on certain types of 
loads.  In my experience, a lot of Linux network card drivers are 
sub-standard; if it's supported by Solaris, there's a fair chance the 
driver takes full advantage of the hardware.  (Netgear GigE drivers on 
Linux are abysmal.)  -- justin


   I think the issue with Apache/Solaris is that process switches take 
a long time on Solaris.


   I've got a computer that I need to rehabilitate for the family this 
Christmas.  I think I'll put Solaris 10 on it and do some web server 
benching before I put Ubuntu on it.


  


Re: OT: performance FUD

2005-11-29 Thread Paul A Houle

Joshua Slive wrote:



Suggestions to improve
http://httpd.apache.org/docs/2.1/misc/perf-tuning.html
are very welcome.  Suggestions backed by data are even better.

   Basically there's nothing quantitative there.  There's a lot of talk 
about "some operating systems" and not a lot of talk about specifics.


One issue is that this page was written for (and, in fact, by) the 
Dean Gaudet-type performance freak who was looking to squeeze every 
last ounce of performance when serving static pages.  All you need to 
do is add one CGI script or php app to your site and everything on 
that page after the hardware section gets lost in the noise.  So when 
people mail [EMAIL PROTECTED] asking how to fix performance problems, the 
answer is almost always "fix your database" or "rewrite your web app" 
and not "change your apache configuration" or "get a faster web server".


   For me,  that's the reason why quantitative information is so important.

   I did extensive performance testing on the new server we 
commissioned precisely because of the situation you describe:  we had 
people saying "rewriting is slow",  "ExtendedStatus On is slow" -- 
people were making decisions based on qualitative statements about 
performance,  not quantitative measurements.


   After doing those tests,  I learned that I had nothing to fear if I 
wanted to put in 500 rewriting rules,  but that 50,000 is too much.


Re: OT: performance FUD

2005-11-29 Thread Paul A Houle

Brian Akins wrote:

This is probably way off topic for this list.  I was searching for 
something related to php this morning (I know, I know... But some 
people here need php) and the majority of the google hits were FUD 
sites. Most of them generally say "Apache is bloated and slow, you 
should use X."  I know we have several people on this list who run 
Apache on very high traffic sites.  While we cannot answer every 
single piece of FUD out there, do we need a general page to answer 
some of them?  Maybe "testimonials" or something.  I know, with my 
config, I can easily saturate multiple gig interfaces and have a 
rather full feature installation.


   Apache isn't the fastest web server -- at least without mod_event.  
I've seen data corruption with all of the free single-process web 
servers,  although I'd assume that products like Zeus do better.


   Looking at Alexa,  the logs from a few sites I run,  and 
benchmarking I've done,  there are probably only a few thousand web 
sites in the world that push the limits of a single Apache web server.  
Perhaps 100x as many PHB's ~might~ pick a web server because of numbers 
in a glossy ad.  The real competition is with IIS,  and people don't 
choose Apache or IIS based on performance numbers -- they choose it 
because they are familiar with Unix or familiar with Windows.  Other web 
servers are at the 1% market share level:


http://news.netcraft.com/archives/2005/11/07/november_2005_web_server_survey.html



   Don't make it a "fudbusting" site,  make it an "Apache performance 
tuning" site.


   There are all of these statements in the Apache docs,  such as:

* .htaccess is slow
* ExtendedStatus On reduces performance

   We did a round of performance testing on a server that we 
commissioned last year and took measurements of these things,  and found 
that we'd need to put in more than 1,000 rewriting rules to harm performance 
noticeably,  and that the overhead of ExtendedStatus On is negligible for a 
site that gets 500 hits/sec,  etc.
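
   For concreteness,  the status side of that test is just the stock 
mod_status setup -- a minimal sketch,  with the location name and the 
allowed address purely illustrative:

# needs mod_status;  the /server-status location and address are illustrative
ExtendedStatus On

<Location /server-status>
  SetHandler server-status
  Order deny,allow
  Deny from all
  Allow from 127.0.0.1
</Location>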


   I might see if I can find my report on this and put it 
online -- there are some things that I know,  and even more that I don't...


* prefork and worker seem to be about equally fast on Linux?
* is that the case on Solaris?
* MacOS X?
* Solaris 9 is embarrassingly slow running Apache compared to Linux -- is 
the same the case with Solaris 10?


Re: pgp trust for https?

2005-11-09 Thread Paul A Houle

Peter J. Cranstone wrote:


Currently Windows, Linux and Unix only use two levels of privilege - Ring 3
and Ring 0. Everybody and their uncle's code wants to run at Ring 0. Another
really bad idea, as once I introduce a network/video/keyboard/whatever
driver at that level I can execute malicious code. From there I can control
the machine.

 

   You'd need a new hardware architecture for ring 1 drivers to be 
worth it.  The trouble is that drivers can initiate DMA operations 
against physical memory.  Unless you devise some system where the OS can 
veto DMA operations,  protection in the CPU is worthless.


Re: httpd has no reason to improve...

2005-08-22 Thread Paul A Houle

Joe Schaefer wrote:


Paul A Houle <[EMAIL PROTECTED]> writes:

IMO market-share doesn't relate to project activity.  The
word I most associate with apache development is empowerment;
cheifly to empower users to build better web stuff.  Users that
need to tweak the software in order to make that happen become
committers, who are further empowered to make commits, and eventually
make decisions about the project as a whole when they join the pmc.

 


   Market share does matter.  It's a function of a product having a niche.

   A project that's thinking about market share is thinking about end 
users.  A project that isn't thinking about end users is a project of 
people working on something for their own amusement.



What I personally enjoy seeing on [EMAIL PROTECTED] is "piling on", where
one person implements something, then two or three other people
(not always committers) jump on the bandwagon and take it in a better
direction.  It's good to see more of that happening lately.

 

   Yeah,  soon httpd will have a second- or third-rate implementation 
of every protocol under the sun.


   mod_ftp will underperform mainstream ftp servers so long as it's 
running under prefork.  Similarly,  dedicated caching proxies like squid 
will outperform an httpd-based one.


   I don't see end users clamoring for mod_ftp,  or mod_snmpd.  What's 
the point of writing a squid replacement unless you can actually make 
something better?



According to the statistics, most users are content with 1.3. But
that is a very short-sighted way to measure progress, because the
situation will change as the *nix distros move away from 1.3.  

   And it's got nothing to do with how content users are with 1.3,  but 
with how,  after 54 revisions,  Apache has finally produced a server 
that doesn't have serious problems running in production environments.


Exploring 
new protocols within httpd will almost certainly make the internal 
architecture better, even if none of those modules you mention are 
distributed with httpd. IMO mod_smtpd will be a fine example of that, 
because unlike httpd, almost all of the real action will be on the 
input_filter side (which IMO hasn't received the same amount of polish 
that the output_filter side has).


 

   A server framework that can implement any protocol under the sun 
might be a nice PhD thesis,  but what does it do for end users?


   I don't see excellence coming from "Swiss army knife" frameworks 
that do everything,  but from systems that are developed from a whole 
system viewpoint,  that have a good amount of co-design between layers of 
the system -- if you build a system that lets you plug in arbitrary 
junk,  you're going to get arbitrary performance and reliability.


   (I think of the old days of Java Applet programming,  where people 
who were just learning OO techniques were writing great complicated 
classes that were trying to be infinitely reusable,  and it took them a 
long time to realize that people were going to have to download every byte 
of the class files they were generating...  But it's not just the old 
"bloat" argument,  but the burden of maintaining superfluous code.)



To sum it up:

better server architecture => better modules => more toys for users 
=> more interest in the 2.x internals => more patches => more activity

=> better server architecture ...
 

   Sure,  I appreciate mod_ssl and mod_dav in Apache 2 -- but the 
reason why so many people have stuck with 1.3 for so long is that 2.0 
has had very little to offer end users.  A few people thought it would 
be great to have pluggable MPMs,  and a few other people introduced 
half-baked systems such as mod_cache and filters.  You know a tree by 
its fruit,  and the fruit I care about is performance and reliability.  
Apache 2.0.54 is a great server,  but the fact that it took 54 revisions 
to get there isn't a good indication about the design and development 
process.


   Were Apache development targeting real problems that real users 
have,  I think things would be quite different. 

   A big part of the problem is that the Apache project has settled 
into a local equilibrium -- this explains the paradox of a product that 
obviously satisfies end users' needs well (no competition has emerged) 
but has a moribund development process.  Any real innovation in the web 
server space will need to be disruptive,  to break things.  Apache,  as 
we know it,  just can't do that.


httpd has no reason to improve...

2005-08-22 Thread Paul A Houle

Ben Collins-Sussman wrote:

I see a lot of frustration going on.  The thing is, httpd's  
development process is nearly identical to Subversion's process... we  
stole most of it from you folks!  So why all the angst in httpd-land,  
but not in Subversion-land?


   It's really a lack of direction.

   Apache already is the market leader in the web server space and has 
no potential for further growth -- Windows users aren't going to switch 
to Apache in significant numbers.  (They never were,  so developing a 
good NT port was a waste of resources that has helped paralyze the 
development of Apache for its core users.)  This is in contrast to svn,  
which has a long way to go in growth.


   There is very little going on in Apache to make it a better web 
server -- the frontiers of development are trying to make it an ftp 
server,  an smtp server,  an ntp server,  an snmp server,  a bgp 
server...  Efforts that 99.44% of httpd users are completely indifferent 
to (OK,  maybe a few of them will want to turn on an ftp server because 
they're not getting enough support calls from people who are behind NAT 
or firewalls.)


   The module mechanism means that significant improvements in Apache 
can be made without making changes to the Apache distribution -- people 
who are serious about improving Apache as a web server are taking 
this route and not adding things to the Apache distribution.  (Any 
change which affects the core use of Apache is politically controversial 
and gets an automatic -1.)


   Pluggable MPMs sound like a good idea in theory,  but in practice 
they're a failure.  It's hard enough writing code that runs in a single 
concurrent environment,  but impossible if the concurrency environment 
is unknown and unknowable.  It's hard enough keeping the modules that 
are part of the distribution working under all MPMs,  even harder to 
support third-party modules (mod_php) and even harder to support all the 
libraries that link against third-party modules (kiss of death for 
mod_php and mod_perl in worker.)


---

   People who want to wake Apache up would do best to try a fork.  
Throw things that are irrelevant to running httpd on Posix overboard.  
Pick ONE MPM that offers a significant advance in functionality 
(mod_event,  or a mod_perchild that actually works) and qualify everything 
to run against it.  Add mod_macro.


   But why?  Apache is good enough for most people,  and you can fix 
any real problems for a particular project by adding modules or making 
little source code patches.  There are 1,000 or so sites in the world 
that need an MPM faster than prefork,  and they'd be better off going to 
a cluster strategy for availability anyway.


   Apache 2 was a rough ride,  but Apache 2.0.54 runs smoothly on every 
installation I have...  Why change anything?







Re: httpd-2.x servertype inetd

2005-08-05 Thread Paul A Houle

Nick Kew wrote:


Jan Kratochvil wrote:
 



In what sense low-cost?

Apache's high startup cost is self-reinforcing.  We know it's a
once-only thing, so we have every module do expensive things at
startup rather than per-request.  I don't see how inetd would
affect that.  The only thing you'd save is the spawning of
children, which you already noted ...

 

   Funnily enough,  the case where I'd think that partial restarts would 
come in handy would be changing the configuration of one virtual 
host on a production system without bringing them all down.


   Practically,  it isn't a big issue:  I've run systems in the 10-50 
virtual host range,  and startup time seems to be a few seconds.  Our 
Apache is basically stateless,  so probably one person has to reload and 
a few people see a transient broken image:  no apps screw up.  We add a 
new virtual host or make a config change maybe once or twice a week,  so 
it's not terrible.


   I wonder what mass hosting places with 1000's of hosts do,  but I 
guess they use dynamic methods of implementing vhosts,  batch config 
file changes,  and are a little more tolerant of downtime.


   Contrast this with the ColdFusion MX server,  which back-ends one Apache 
system I look after.  It runs very well once it's up,  but startup takes 
about 30 seconds.  Internal state kept on the server has two 
consequences:  we have to restart it often,  when we make configuration 
changes in an app or after an app transitions to a bad state,  and a 
restart breaks sessions,  seriously inconveniencing users.


   I always laugh when people tell me that web apps need to keep state 
in RAM in order to be scalable...




macro facility/default conf file/auth groups

2005-08-04 Thread Paul A Houle
I'm in the middle of setting up (selective) WebDAV access for 
particular directories on a set of (almost) identical vhosts with 
complicated requirements for access control.  I thought of a bunch of 
Apache gripes that I have often:


(1) I've evolved a nice system for implementing virtual hosts (directory 
layouts,  using Include directives to segment configuration files) that 
benefits from years of mistakes that I and others have made.  I'm 
wondering if there's any interest in refactoring the Apache conf file 
into a number of separate files to reflect "better practices" than the 
current default conf file.  The idea isn't to force people to change,  
but to encourage future installations to be more manageable.  I'll be 
happy to give more specifics if people are interested.


(2) This particular system has a production and a test instance,  so I'd 
love to have a way to set variables that I can interpolate into 
arbitrary strings.  For instance,  everything connected with the 
production system may be under


/production/

in the filesystem

and the test system stuff is under

/test/

so I want ServerRoot to be

ServerRoot {mybasedir}/apache2

I have repetitive stuff all over my conf files;  for instance,  a 
virtual host is stored at


/{mybasedir}/sites/{vhost}

and the DocumentRoot is at

/{mybasedir}/sites/{vhost}/htdocs

so it would be nice to write something like

set vhost myvhost.com

  DocumentRoot /{mybasedir}/sites/{vhost}/htdocs


I'd have to think about how scoping works,  because I'd obviously like 
to reuse the same variable name for different vhosts.  
VirtualDocumentRoot doesn't really cut it,  because there are other 
things,  such as log files,  <Directory> containers,  rewrite rules 
(already using SetEnvIf cheats) and such that need to be configured as well.


(3) Along these lines I can imagine that a macro facility could be 
useful (one answer to the scoping problem),  but if I did my <VirtualHost> 
containers as macros,  it would be a little ugly.
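
   The "ugly" version looks roughly like this with the third-party 
mod_macro -- just a sketch;  the macro name,  parameters and paths are 
made up for illustration:

# macro name,  parameters and paths below are illustrative
<Macro VHost $base $host>
  <VirtualHost *:80>
    ServerName   $host
    DocumentRoot $base/sites/$host/htdocs
    ErrorLog     $base/sites/$host/logs/error_log
  </VirtualHost>
</Macro>

Use VHost /production myvhost.com
Use VHost /production othervhost.org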


   Another case where I might like macros (but settled for using an 
Include) is filling out all the auth configuration for <Directory> 
sections.  A whole bunch of vhosts will share a users and group file across 
a number of directories.  It's really a drag to specify four 
configuration parameters per-<Directory> that are the same everywhere,  
and some way to repeat this would be welcome.
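
   The Include workaround looks something like this (file names and paths 
are made up);  the four repeated parameters are the usual 
AuthType/AuthName/AuthUserFile/AuthGroupFile set:

# conf/auth-common.conf -- shared by all the vhosts (paths illustrative)
AuthType      Basic
AuthName      "Project areas"
AuthUserFile  /www/conf/users
AuthGroupFile /www/conf/groups

# then,  in each vhost that needs it:
<Directory /production/sites/myvhost.com/htdocs/private>
  Include conf/auth-common.conf
  Require group staff
</Directory>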


(4) Once again I'm stung by the inability to make AND criteria stick.  
For instance,  we have 2 kinds of people working on N projects:  
developers and designers.  All of these projects have certain 
directories that developers are interested in and others that designers 
are interested in.  I'd like to say something like


require group projectA AND group developer

for the developer-only directories.  I've similarly had cases where I've 
wanted to have AND criteria involving IP address AND user agent,  IP 
address and authentication,  etc.  I've heard talk about overhauling 
this area,  but I don't know what the status is.
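
   Purely as a sketch of the semantics I'm after -- this is hypothetical 
syntax,  not something the current authz code accepts:

<Directory /production/sites/projectA/htdocs/dev>
  # hypothetical syntax:  both conditions must hold for access
  <RequireAll>
    Require group projectA
    Require group developer
  </RequireAll>
</Directory>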


-

   If I were to make a more concrete proposal for (2),  is there any chance 
of it getting into Apache 2.1?


Re: Initial mod_smtpd code.

2005-07-20 Thread Paul A Houle

Jem Berkes wrote:



I could also start work on a mod_smtpd_dnsbl if the mentors feel that is 
worthwhile? This would look up a connecting IP address against a blacklist 
and return a descriptive string to mod_smtpd if the client should be 
rejected with an error: "550 5.7.1 Email rejected because 127.0.0.2 is 
listed by sbl-xbl.spamhaus.org"


I'd also like to include support for RHSBL, a newer type of listing by 
domain names from the envelope sender address. That's used by a growing 
number of projects.
 

   Overall,  blacklists aren't that effective and cause a lot of false 
positives.  They may make sense in the case of something like 
SpamAssassin,  which uses a blacklist in conjunction with other tests,  
but by themselves they really aren't a responsible way of 
dealing with the spam problem.  I think it's better to discourage "worst 
practices" than to succumb to plugin mania.


Re: mod_mbox user interface

2005-07-12 Thread Paul A Houle

Maxime Petazzoni wrote:



As promised : a first try at XHTML mockup. Comments about DOM
structure and XHTML semantics are welcome !

http://skikda.bulix.org/~sam/ajax/jsless/
 

   On Firefox/Win32,  the "box list" and "message lists" aren't 
vertically aligned correctly (at least for me).  Other than that,  it's 
quite pretty.  You could reduce the whitespace margins considerably,  
maybe even lose the box around the year/month listings on the side.


   From a usability perspective,  I like 'three-pane' sorts of 
interfaces,  but I'm often infuriated by things that waste a lot of 
space so I can't read the actual message.  The huge space allocated to 
the logo at the top could be reduced (that will probably be easy to 
template away,  but if you set a bad example for people,  they'll follow 
it) -- also I get the feeling that the message list takes up space that I 
could use to read the message...


Re: [VOTE] mod_ftp for HTTP Server Project

2005-07-07 Thread Paul A Houle

Jim Jagielski wrote:



I therefore Call A Vote on whether we should support mod_ftp for
inclusion into the Incubator and if we should accept mod_ftp upon
graduation from the Incubator.


   I don't know if I get a vote,  but it's

-1

   This would have been an exciting project in 1989,  but ftp doesn't 
work well with today's internet:  today it's just a way to make systems 
that "just don't work" for people.
  



Re: mod_smtpd project planning

2005-06-30 Thread Paul A Houle

Jem Berkes wrote:

This is the problem encountered by many spam filters, as to be most 
effective they really need to be _involved_ in the SMTP transaction and not 
just stage 2, after receipt happens. Think greylisting as an example.


 


   You read this?

http://www.acme.com/mail_filtering/

   One thing that's critical isn't just having access to information 
from early stages of mail processing,  but being able to intervene at 
early stages in the processing so as to avoid the CPU and bandwidth waste 
at advanced stages.  This particularly matters during a computer virus 
outbreak:  I remember hitting on many of Jeff's solutions when a mail 
server I managed was getting hammered by an incredible volume of 
viruses,  and I wrote scripts that picked up bad addresses from the 
virus filter output and put them into the software firewall.






Re: mod_smtpd project planning

2005-06-30 Thread Paul A Houle
Luo Gang wrote:

>
>   mod_smtpd is a SMTP protocol handler, used to receive mails by SMTP, 
> maybe it will use sendmail as its MTA(not sure). Somebody hope it could also 
> include a spam filter.
>
>  
>
Hooks for a spam/virus filter aren't optional if it's an autoresponder:
running an autoresponder that doesn't filter is about the same as
sending spam in the first place.


Re: mod_smtpd project planning

2005-06-30 Thread Paul A Houle

Jem Berkes wrote:


Hi all, I'm another student working on mod_smtpd

Been running httpd 2.x since it appeared, but am new to development.
 

   What does mod_smtpd do?  Is it a sendmail replacer or does it let 
people request content via smtp or what?


Re: how do i debug a segfault?

2005-06-29 Thread Paul A Houle

Akins, Brian wrote:


Sorry if I missed it, which mpm are you using?

 


   prefork



how do i debug a segfault?

2005-06-29 Thread Paul A Houle
   This weekend we had the kind of experience with Apache httpd which 
we expect from Microsoft IIS or Tomcat.


   We're running a self-compiled 2.0.54 on RHEL 4 on x86_64 on a 4-way 
machine.


   Our server got kicked around midnight to rotate logs,  but around 
5AM we started getting a large volume (> 1 /sec) of messages like


[Tue Jun 28 14:45:53 2005] [notice] child pid 28182 exit signal 
Segmentation fault (11)
[Tue Jun 28 14:45:53 2005] [notice] child pid 28183 exit signal 
Segmentation fault (11)
[Tue Jun 28 14:45:53 2005] [notice] child pid 28184 exit signal 
Segmentation fault (11)


   This server isn't very heavily loaded,  it's lucky if it's getting 
1 hits/day at this point.  The site still uses CGI extensively:  
some CGIs worked just fine,  but others failed with a zero-length 
document and,  I think,  nothing in the log.


   Kicking the server resolved the problem,  at least for now.

   It has ExtendedStatus on and my hunches are:  (i) the problem is 
x86_64 specific (haven't seen this on a heavily loaded x86 machine) and 
(ii) the underlying problem is in server global state.


   One obvious step is to set up monitoring of stderr (as has been 
discussed) to page me and maybe auto-kick the server if this happens 
again -- but I'd like to see a real fix.
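
   The one concrete thing I can do in the meantime is let the children 
leave core files and pull a backtrace out of gdb -- a minimal sketch,  
with the directory path made up (the core ulimit also has to be raised 
before httpd starts):

# httpd.conf:  give the children a dedicated,  writable place to dump core
# (the path here is illustrative)
CoreDumpDirectory /var/tmp/apache-cores

# after the next segfault,  roughly:
#   gdb /usr/local/apache2/bin/httpd /var/tmp/apache-cores/core.28182
#   (gdb) bt full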


Re: mod_smtpd project planning

2005-06-29 Thread Paul A Houle

Paul Querna wrote:

As some of you might be aware, one of the Summer of Code Projects is 
an SMTP protocol module for httpd 2.x.



   Huh?


Re: Monitoring HTTP error logs

2005-06-28 Thread Paul A Houle

William A. Rowe, Jr. wrote:



Offhand, no, but I'd suggest looking at Piped Log scripts.
This would be pretty trivial to do (even looking for very
specific messages or masking out other common occurrences.)
The messages can then be written to one or more log file, 
as well.


See the ErrorLog documentation for pipe syntax, and rotatelogs
or logresolve for additional examples.
 

   Another possibility is to,  more or less,  write a script that does 
the same thing as 'tail -f',  or alternatively a script that runs 
periodically and keeps track of the position it left off at in the log.




Re: stress testing of Apache server

2005-05-04 Thread Paul A. Houle
On Tue, 03 May 2005 13:51:55 -0700, Paul Querna <[EMAIL PROTECTED]>  
wrote:

Sergey Ten wrote:
Hello all,
SourceLabs is developing a set of tests (and appropriate workload data)  
to
perform stress testing of an Apache server using requests for static  
HTML
pages only. We are interested in getting feedback on our plans from the
Apache server community, which has a lot of experience in developing,
testing and using the Apache server.

	Although Apache is hardly the fastest web server,  it's fast enough at  
serving static pages that there are only about 1000 sites in the world  
that would be concerned with its performance in that area...

	Ok,  there's one area where I've had trouble with Apache performance,   
and that's in serving very big files.  If you've got a lot of people  
downloading 100 MB files via dialup connections,  the process count can  
get uncomfortably high.  I've tried a number of the 'single process' web  
servers like thttpd and boa,  and generally found they've been too glitchy  
for production work -- a lot of that may involve spooky problems like  
sendfile() misbehavior on Linux.


Information available on the Internet, as well as our own experiments,  
make
it clear that stressing a web server with requests for static HTML pages
requires special care to avoid situations when either network bandwidth  
or
disk IO become a limiting factor. Thus simply increasing the number of
clients (http requests sent) alone is not the appropriate way to stress  
the
server. We think that use of a special workload data (including  
httpd.conf
and .htaccess files) will help to execute more code, and as a result,  
better
stress the server.
  If you've got a big working set,  you're in trouble -- you might be  
able to get a factor of two by software tweaking,  but the answers are:

(i) 64-bit (or PAE) system w/ lots of RAM.
(ii) good storage system:  Ultra320 or Fibre Channel.  Think seriously  
about your RAID configuration.

  Under most circumstances,  it's not difficult to get Apache to  
saturate the Ethernet connection,  so network configuration turns out to  
be quite important.  We've had a Linux system that's been through a lot of  
changes,  and usually when we changed something,  the GigE would revert to  
half duplex mode.  We ended up writing a script that checks that the GigE  
is in the right state after boot completes and beeps my cell phone if it  
isn't.

==
	Whenever we commission a new server we do some testing on the machine to  
get some idea of what it's capable of.  I don't put a lot of effort into  
'realistic' testing,  but rather do some simple work with ApacheBench.   
Often the answers are pretty ridiculous:  for instance,  we've got a site  
that ranks around 30,000 in Alexa that does maybe 10 hits per second at  
peak times...  We've clocked it doing 4000+ static hits per second w/  
small files,  fewer hits per second for big files because we were  
saturating the GigE.

	What was useful,  however,  was quantifying the performance effects of  
configuration changes.  For instance,  the Apache documentation warns that  
"ExtendedStatus On" hurts performance.  A little testing showed the effect  
was minor enough that we don't need to worry about it with our workload.

	Similarly,  we found we could put ~1000 rewriting rules in the httpd.conf  
file w/o really impacting our system performance.  We found that simple PHP  
scripts ran about 10x faster than our CGI's,  and that static pages are  
about 10x faster than that.

	We've found tactical microbenchmarking quite useful at resolving our pub  
table arguments about engineering decisions that affect Apache performance.

	Personally,  I'd love to see a series of microbenchmarks that address  
issues like

* Solaris/SPARC vs. Linux/x86 vs. Mac OS X/PPC w/ different MPMs
* Windows vs Linux on the same hardware
* configuration in .htaccess vs. httpd.conf
* working set smaller/larger than RAM
* cgi vs. fastcgi vs. mod-perl
* SATA vs. Ultra320 SCSI for big working sets
	and so on...  It would be nice to have an "Apache tweakers guide" that  
would give people the big picture of what affects Apache performance under  
a wide range of conditions -- really I don't need precise numbers,  but a  
feel of around 0.5 orders of magnitude or so.

	It would be nice to have a well-organized website with canned numbers,   
plus tools so I can do these benchmarks easily on my own systems.

===
	Speaking of performance,  the most frustrating area I've dealt with is  
performance of reverse DNS lookups.  This is another area where the Apache  
manual is less than helpful -- it tells you to "not do it" rather than  
give constructive help in solving problems.

	We had a server that had heisenbug problems running RHEL 3,  things  
stabilized with a 2.6 mainline kernel -- in the process of dealing with  
those problems,  we developed diagnostic tools that picked up glitches in  
our system that people 

Re: simple-conf branch

2005-04-05 Thread Paul A. Houle
On Mon, 4 Apr 2005 15:01:34 -0700, Greg Stein <[EMAIL PROTECTED]> wrote:

Sorry, but I very much disagree. I think back to the old days of
access.conf, httpd.conf, and srm.conf. As an administrator, I absolutely
detested that layout. I could NEVER figure out which file a given
configuration was in. I always had to search, then edit.
We've been to the "multiple .conf world" before. It sucked. We pulled
everything back into a single .conf to get the hell outta there.
Small examples are fine. The default configuration should remain as a
single .conf file.
	After a few years of running moderate-sized virtual hosting servers
(2 to a few hundred) I've settled on a multiple-file organization for  
virtual hosts.

	I use the usual httpd.conf for server-wide settings,  but that includes  
"vhost.conf" which contains a bunch of virtual host containers that then  
include configuration files for the various virtual hosts.  These days I  
tend to create a set of directories,  like

/www/sites/vhost-1.com
which then have logs,  htdocs,  and other supporting directories.  What's  
nice about this is that vhosts are easily portable from one server to  
another.  I've got a script that automatically punches in a new vhost,  so  
I can have one up and running in two minutes.
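
	Concretely,  the layout looks something like this (file names and paths  
are illustrative):

# httpd.conf -- server-wide settings only,  then:
Include conf/vhost.conf

# conf/vhost.conf -- one container per site,  each pulling in its own file
<VirtualHost *:80>
  ServerName vhost-1.com
  Include /www/sites/vhost-1.com/conf/httpd.include
</VirtualHost>

# /www/sites/vhost-1.com/conf/httpd.include  (names illustrative)
DocumentRoot /www/sites/vhost-1.com/htdocs
ErrorLog     /www/sites/vhost-1.com/logs/error_log
CustomLog    /www/sites/vhost-1.com/logs/access_log combined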

	My big project these days is a site with a database that's still useful in  
read-only mode when the database goes down;  it's got mirror sites and all  
kinds of funny details,  and we have a "runlevel.conf" symlink that points  
to one of several files that let us adapt the system to various degraded  
states such as database maintenance,  software upgrade,  etc.

	That same site also has several test instances,  and we have a single  
configuration file that has all the variables that change between  
different instances,  so it's easy to maintain the conf files in CVS.

	There are good operational reasons to split up configuration in different  
files -- if the Apache install can encourage good practices,  based on the  
decade of experience we've had with it,  that's a good thing.


Re: Do these broken clients still exist?

2005-04-04 Thread Paul A. Houle
On Sun, 3 Apr 2005 13:58:56 -0400 (Eastern Daylight Time), Joshua Slive  
<[EMAIL PROTECTED]> wrote:

Does someone with a high-traffic, general-interest web site want to take  
a look through their logs for these user-agent strings.  I don't mind  
keeping them if they make up even 1/100 of a percent of the traffic, but  
it seems silly to keep these extra regexes on every single request if  
these clients don't exist anymore in the wild.


Regexes are pretty cheap for a 'normal' apache setup.
	In the initial testing of a production server (2x 3.2 GHz Xeon,  6 GB  
RAM) we found that,  serving static pages,  the overhead of processing  
regexes didn't become noticeable until we had >1000 rewriting rules.  Even  
then,  at least 30% of the hits on this server are cgi-scripts,  so the  
overhead of regexes is really nothing compared to the other ways we abuse  
our machine.

	In doing this testing I did notice that Apache's handling of regexes is  
pretty simplistic.  Much of the time you can consolidate a large stack of  
regexes into a single state machine,  and that could give vast (factors of  
hundreds or thousands) improvements in performance for handling large rule  
sets.  On the other hand,  it doesn't really matter.

	The people we've inherited this server from left us several very large  
regexes with a few hundred pipe symbols each that match UAs of  
non-browser clients that we don't want using our service.  The trouble is  
that inevitably this kind of regex starts mutating into malignant forms as  
people start using parens;  also,  we have no documentation for the rules.  
On slow days I think about breaking these up into 500-1000 rules,  which  
we could in principle comment one-by-one...  This wouldn't really impact  
the performance of our machine under 'real' circumstances,  but we could  
measure the impact under specialized testing.
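
	Broken out,  each rule can at least carry its own comment -- a sketch  
with made-up UA strings:

# (the UA strings below are made up)
# ExampleBot/1.x:  hammers the search CGI (added 2004-11)
SetEnvIfNoCase User-Agent "ExampleBot/1\." block_ua
# SomeOfflineTool:  ignores robots.txt
SetEnvIfNoCase User-Agent "SomeOfflineTool" block_ua

<Location />
  Order Allow,Deny
  Allow from all
  Deny from env=block_ua
</Location>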


Re: Puzzling News

2005-03-02 Thread Paul A. Houle
On Tue, 1 Mar 2005 16:18:17 +0200 (SAST), Graham Leggett  
<[EMAIL PROTECTED]> wrote:

The trouble with the authentication problem is that the credentials used
for authentication are often used for way more than just finding out
whether a user has access. That said, this is definitely a very useful
addition.
That's exactly the point.
	I think of this in terms of "user management",  as I find the terms  
"authentication" and "authorization" have too much baggage and often lead  
people to make the same mistakes too many times.

Something like an auth module that can do "form based" auth, in addition
to "basic" and "digest" etc would probably be very useful.
	Well,  on one level I think the API for authentication is basically  
inadequate.  One major problem is that many applications (if they're going  
to be USABLE rather than to just exist) need "optional" authentication --  
Amazon.com's personalization is an obvious case,  but a more common  
"intranet" application is as follows:  we have an event calendar which has  
some events that are open to the public and some of which are intended for  
staff.  We've got an apache auth module that plugs into our campus  
kerberos system,  and in a lot of ways it's pretty good.  But a staff  
person who wants to see what events are coming up needs to log in and then  
has to click two more times (the fault of our designer) to actually see  
the private events.

	Let's face it -- few users are going to bother to do that,  so the event  
calendar is going to get a small fraction of the usage that it should.

  If on the other hand,  the system could remember the identity of the  
user,  they could just go to the URL and see the events;  that improves  
the chance that they'll actually look at the calendar on a regular basis.


  The other trouble with the Apache authentication system is that the  
whole API between Apache and most programming environments is inflexible.   
You can do better in mod_perl,  but with PHP,  JSP and most other  
environments,  you get a lump of data from Apache at the beginning of the  
request and you send back a lump of data,  but you don't get to call  
subroutines on Apache and get results;  for instance,  you can't ask  
Apache what the name of the user is,  but instead Apache has to pass  
everything that it's going to pass in notes or environment variables.

  (One of the things I'd love,  for instance,  is lazy evaluation and  
caching for reverse DNS lookups,  something that would mean big changes...)


	On top of that,  the real problem with user management isn't the actual  
authentication (I've got an "authentication module" that supports  
cookie-based authentication that I've written again and again in different  
languages and that's about 100-200 lines of code),  but rather a domain  
that Apache has never concerned itself with:  user interface.

	For instance,  public web sites that let anybody join need to verify the  
e-mail address of people who join.  There's basically two ways to do this:

(1) Send a randomly generated password to the user
(2) Let the user pick a password (that they've got a fighting chance to  
remember) and send them a token that they need to supply to activate their  
account

	Empirical research shows that approach 2 has less than half the failure  
rate of 1,  and helps in turning registrants into active users because  
their relationship with the site doesn't begin with a password change,  or,   
more likely,  a password reset.

	Over the course of a few years,  we made observations and found we could  
cut 2's failure rate in half again by making little improvements in the  
details.

	Similarly,  if a site has passwords,  there are going to be password  
resets,  and you need a good interface for users to reset their own  
passwords.

  Big sites will spend what it takes to get a good UI,  but most web  
app developers still see user management as an afterthought,  and will  
quickly pound out something that will cause 20-50% of users to drop out  
before they've participated in the system.


	And then there's the question of user interface for the administrators:   
one of the great pains of Apache administration is getting phone calls  
from non-technical people who need changes made to an .htpasswd file.   
Make that a public site with 10^3 or 10^6 users,  and the problem takes on  
a whole new dimension.

  UNLESS people adopt a user-management framework that's already  
written,  they're just not going to give this problem the resources it  
deserves,  and they're going to suffer with half-assed solutions.


	I've been thinking about this a lot,  and my answer to this problem is  
here

http://www.honeylocust.com/x/products/tum/
	I have to apologize that I haven't made a public release in a while;  the  
version that's there is pretty far behind what I've got running on some  
production sites.  The version that's up there is pretty battle tested --  
it's run 

Re: Puzzling News

2005-03-01 Thread Paul A Houle
> William A. Rowe, Jr. wrote:
>> At 03:17 PM 2/28/2005, Paul A. Houle wrote:
>>
>>>On Mon, 28 Feb 2005 21:09:55 +, Wayne S. Frazee <[EMAIL PROTECTED]>
>>>wrote:
>
>> Oh boy - you don't know *what* you are missing :)  Threads on
>> Linux barely differ from distinct processes, while on Solaris
>> they are truly lightweight.
>
> Well.. thats true for LinuxThreads.. but NPTL is a different and better
> story. (Requires newer glibc and a 2.6 kernel)
>

  Yeah,  but that's because Solaris has embarrassingly bad performance
running Apache/prefork;   I've seen Apache run an order of magnitude
faster on Pentium II machines that I've fished out of the dumpster
than on Sun servers that we spent $30,000 for.

  (Not to bash Sun by any means...  Solaris will embarrass Linux about
as badly running most MTAs;  it's not so much that threads are bad
on Linux,  but that processes are incredibly fast on Linux -- NPTL
doesn't so much speed up threads on Linux as make it possible for
them to scale.)

  I think there is a reason why it matters that there is uptake on
Apache 2,  and that's so development can really move forward.  Don't
take it personally,  but I'm quite depressed that Apache is still
the dominant web server after all these years;  the chief
competition,  Microsoft IIS,  isn't all that different functionally
from Apache.

  I think of all the features that web site authors and developers
need that still don't exist in mainstream web servers;  part of this
is in the area of "content management" and another major area is
authentication -- pretty much any serious interactive web site needs
a cookie-based authentication system with the features seen on big
sites like amazon.com and yahoo!  and one of the reasons there is so
little code reuse on the web is that every application winds up
implementing its own authentication system;  if there was something
really good built into a market-leading web server,  this picture
would change completely.

 A lot of the problem is that the market for advanced web servers is
tiny.  Looking at Alexa and applying a little guesswork,  I get the
sense that there are probably 10,000 or so sites that get more than
one million hits per day;  and if you're doing static pages,  you
could probably multiply that by 30x before Apache starts becoming
performance limited on reasonable hardware.

The remaining sites that need to push the limits of Apache all have
different circumstances,  and they'll find different answers;  if
you're running a dynamic site with PHP and you're hitting the CPU
wall,  really the only good answer is to build a load-balancing
cluster.  People who are running mod_perl often really do have memory
problems (but there's enough instability in mod_perl,  using processes
instead of threads makes all the difference between a server that
gives a 500 from time to time and a server that goes down regularly at
2 AM.)  Java's got it's own bunch of problems.

In that sense,  the market for higher-end web systems is going to be
fragmented enough that getting enough sites running the same software
to really standardize and make progress will be difficult.

The one 'generic' area where I think Apache/prefork isn't so
satisfactory is in serving large numbers of large files...  On some
platforms you might get a factor of 2 or 3 better with worker,  but
you'll do that again (and maybe more) with something along the lines
of mod_event.

Another area that fascinates me is the 'perchild' MPM.  Having the
ability to segregate different parts of the server to different users
could be useful for some systems I run.  (One of my projects is a web
server with 25 virtual hosts,  one of which has about 100
subdirectories managed by I don't know how many different people; 
some of these sites are running PHP apps,  some are using the accounts
to archive old files.)  Needless to say,  between the limited UNIX
permissions model and running Apache under one UID,  we've had to make
some compromises between security and having to answer support calls
continuously from people because "it doesn't work".

   If perchild could be made smart enough to have different classes of
server:  little lean ones that can handle static pages,  other bigger
ones that can do mod_perl,  that could be useful.  Perhaps we could
have some hybrid between worker and prefork that make it possible to
run threadsafe and nonthreadsafe apps in the same server...

   The critical thing is that Apache shouldn't forget that "prefork" has
its strengths.  I think the main reason that Apache is so reliable is
that a hand grenade can go off in one process and it doesn't hurt the
server at all.  It took Microsoft IIS many iterations to find a
reasonable balance between performance and stability,  and the
evolution of Apache MPM's is likely to face similar pitfalls.









Re: Puzzling News

2005-02-28 Thread Paul A. Houle
On Mon, 28 Feb 2005 21:31:19 +, Wayne S. Frazee <[EMAIL PROTECTED]>  
wrote:

Correct me if I am wrong, but I have seen much that would purport the  
worker MPM to deliver gains in terms of capacity handling and  
capacity-burst-handling as well as slimming down the resource footprint  
of the Apache 2 server on a running system under normal load conditions.
	Well,  our big production machine has 6G of RAM and never gets close to  
running out even in testing when we stacked it up to the (compiled in)  
limit of 255 processes.  Under normal operations we have 50 running,   
mostly because of keep-alive (helps a lot with the performance of our  
cookie-based authentication system) and people downloading moderately big  
(>100k) files.

	Even though RAM is pretty cheap,  there probably are people who are more  
constrained.

I would also like to point out I too have seen inconclusive evidence on  
MPM "advantage".  I think that is part of the problem... without a clear  
business-case-defendable advantage to the features implemented in Apache  
2... why upgrade?
	Altruism.  If people don't use Apache 2,  then Apache development will  
keep going sideways forever.


Re: Puzzling News

2005-02-28 Thread Paul A. Houle
On Mon, 28 Feb 2005 21:09:55 +, Wayne S. Frazee <[EMAIL PROTECTED]>  
wrote:

A move to 2.0 or 2.1 will take place gradually over time, I think, once  
PHP can be used with some expectation of stability on a  
non-prefork-MPM.  Note: I am not insinuating PHP is not thread safe, but  
rather many of the elements it works with or relies on are not.  I also  
want to make it clear, I am not blaming the PHP community either. I  
think it is a broader based problem in that commercial projects such as  
company web server implementations and hosting infrastructure packages  
are just not seeing enough value in the move to 2.0 to justify the  
development expenditures.
	We've got production instances of Apache 2 running on Linux and Solaris,   
all of which are running PHP on prefork.

	Honestly,  I don't see a huge advantage in going to worker.  On Linux  
performance is about the same as prefork,  although I haven't done  
benchmarking on Solaris.

	On some of these machines I have a lot of control,  on others I have  
people calling me every week wanting me to link something new into PHP.

	If mod_event becomes mainstream,  however,  I'll have a reason to really  
want threaded PHP,  since that will give real performance improvements for  
static files.  On the other hand,  it might be possible to make a version  
of mod_event that uses processes for PHP.





Re: [VOTE] 2.1.3 as beta

2005-02-23 Thread Paul A. Houle
On Wed, 23 Feb 2005 16:57:20 +0100, Matthieu Estrade <[EMAIL PROTECTED]>  
wrote:

Justin Erenkrantz wrote:
I think it's the best way.
Maybe we could also provide two packages, httpd-with-apr and another one  
without apr

	No,  they should be separate,  and having two different packages is just  
a recipe for trouble.  (When you have to deal with a support problem,   
the guy who's having the problem will never know which package he  
downloaded.)

	Remember you're dealing with two audiences here.  There's an audience of  
people who compile from source,  there's also the audience of people who  
use packaged libraries.

	People in the second camp aren't going to suffer much hardship.  Rpm,   
deb and other package managers know about dependencies,  and newer tools  
(yum) will automatically track them down.

	I'm in the first camp.  Apache is an important enough piece of software  
that I'm not going to switch to some other web server because it's a  
little more work to compile.  When compiling software,  it's common to  
have to compile some libraries to compile the app,  and this kind of split  
is even done with trivial software such as mp3-taggers.

	The one thing I'd worry about is maintenance.  Suppose there's a  
security flaw in APR...  Well,  Apache is in the front of your mind,  but  
APR isn't,  so it might be easy to overlook the advisory.  Upgrading APR  
might be a bit more confusing because the system might spend some time in  
a state where APR is upgraded and Apache isn't -- what effects will that  
have?  (Probably none,  but "probably" isn't a good answer for a  
production server that 30,000 people depend on.)

	Also the stability of APR is going to matter.  For a long time you had to  
run Subversion on APR out of CVS,  and I'd often update svn and then find  
that I had to update APR because svn had changed to depend on the  
latest APR.

	If APR is going to be reasonably stable in the future,  then I feel OK  
thinking of it as a system library.  If,  on the other hand,  it's really  
joined-at-the-hip to Apache and lots of app developers are building things  
that need the CVS version of APR,  it had better not be.




Re: UNIX MPMs

2005-02-10 Thread Paul A. Houle
On Thu, 10 Feb 2005 11:56:47 +, Nick Maynard  
<[EMAIL PROTECTED]> wrote:


UNIX MPMs that actually _work_ in Apache 2:
worker
prefork (old)
Yeah,  but what if you want to run PHP or mod_perl?
	Sure,  PHP or mod_perl ~might~ work for you if you're lucky and you don't  
compile in the wrong third-party library,  but you'll be in a world of  
pain if you do.

	It's very hard to bolt threading onto an existing system that links to  
legacy libraries.  Java managed to provide a thread-safe environment by  
having a painful API to link to C code,  "100% Pure Java" xenophobia --  
and it still took them 5 years to make a JVM which was reliable enough for  
government work.

	On Linux I've done some benchmarking and found that worker isn't any  
faster than prefork at serving static pages.  (Is it any different on  
other platforms,  such as Solaris?)  In principle you might save RAM by  
running worker,  but in this day and age you can fit 16 GB in a 1U pretty  
easily and it's cheaper than hiring a programmer who doesn't know how to  
track down race conditions,  never mind one that does.