Re: mod_mbox helper scripts and programs
Justin Erenkrantz wrote: On Tue, Jan 10, 2006 at 06:55:37PM +0100, Mads Toftum wrote: On Tue, Jan 10, 2006 at 09:51:36AM -0800, Paul Querna wrote: Python! Excellent choice - at least that way I won't have to even consider trying ;) Even Perl would be an improvement over zsh. =) -- justin I'd lean towards Perl. You're more likely to find Perl installed on a system than Python.
Re: [Fwd: [EMAIL PROTECTED] [announce] Apache HTTP Server 2.2.0 Released]
William A. Rowe, Jr. wrote: If the user must rewrite their content handlers to pay attention to thread jumps, and must use some sort of yield beyond apr_socket_poll/select(), then I'd agree it becomes 3.0. I wouldn't worry about protocol/mpm authors, who are a breed unto ourselves and should be able to handle any bump. It's the joe who put together a nifty auth or content module who has to jump through hoops to learn the new API who will suffer if we don't yell out **3.0**!!! Well, I get back to thinking about the cultural difference between open source and commercial software here. That is, the difference between a conceptually simple system (Unix) and a system with optimization for every special case, co-design, et al. (OS/360). When Microsoft decided that threading would be ~the way~ in Windows, they put a lot of effort into letting people run old components by tagging old components as threadsafe or not, and automatically serializing components that aren't threadsafe. The prefork model has contributed to the reality and perception of Apache as a reliable product. Threading is great for a system that's built like a swiss watch, but a disaster for the tangle of scripts (or tangle of objects) that most web sites run. In the days of MS-DOS, I did a lot of downloading over a 2400 baud modem, so I wrote an xmodem/ymodem client that ran in the background as you did other things -- this was your classic interrupt-driven state machine. I never got it to work 100% right. It's not easy to do this kind of programming, AND when you push the hardware and OS in this direction you find out things you wish you hadn't. I've had the same experience with single-threaded web servers such as thttpd and boa. I'd do a quick evaluation and I'd be like "damn that's fast" and then I'd put it into production and find that they didn't work 100% right. I'd go back to Apache because, even if it ate more RAM than I liked, Apache worked correctly. As Apache has been moving in the direction of more efficient concurrency models, IIS has been moving in the direction of more isolation. I'd love more efficiency, but I don't want to give up reliability -- if the world's web apps are going to need to be designed like swiss watches in the future, the world is going to pass Apache 3 by.
Re: OT: performance FUD
Nick Kew wrote: That looks a lot like Windows' market position. And I suspect it's no accident: both products have heaped on new 'goodies', all too often at the expense of other considerations. It's IMO also no accident that PHP is moving towards a Windows-like security track record. You'll find skeletons if you go looking in CPAN. Market share is a lot of the reason why people target malware at Windows. If you wrote an email virus for the Mac, one Mac would infect the other Mac and that would be the end of your fun. The real trouble with PHP is that it's sparked a revolution in web server software: code reuse. Before PHP, you couldn't find affordable web hosting for dynamic sites: cgi-bin was so expensive and problematic that mass hosting facilities couldn't afford to host it. Mod_perl would be out of the question. If you wanted to start a weblog or a wiki four years ago, you couldn't find reliable software that would hold up in the real world unless you were willing to put a lot of work into it. Today you can download Drupal, Wordpress or any of a large number of packages. So now there are tens of thousands of sites running the same software with predictable URLs that people can mess around with and find bugs in the underlying software. If there were any Perl or Java apps of the same popularity, we'd be seeing the same thing. The difference is you can get a shared web hosting account for $10/month if you want to run a Wordpress site on PHP, but you really want a dedicated server, more like $200/month, if you want to run mod_perl or Java. If you wanted to match the functionality of PHP in mod_perl or Java, you'd have to install twenty or so framework modules -- everybody is going to pick a different set of modules, so attackers aren't going to have a consistent profile to hit, but on the other hand, this inconsistency makes it harder to incorporate other people's code into your site.
Re: OT: performance FUD
Colm MacCarthaigh wrote: Is there anything we can do in 2.4/3.0 that will help gain that trust? It's not Apache's fault. It's not even PHP's fault. It's a much bigger problem with the open source libraries that people link into PHP, Perl, Python and the like. The problem is particularly perceived as a PHP problem because (1) PHP is the market leader, and (2) the PHP developers are a lot more responsible than, say, the Python developers, who tell you to go ahead and write threaded apps in Python anyway. I suppose that the PHP developers could set up some system where extensions are marked as being threadsafe or not, and there's a lock on every untrusted module, then do a program of certifying modules as safe, but that's a ~big~ project: race conditions and deadlocks are a bitch to debug, particularly when the problems are in somebody else's code. PHP's market position is as a product that any idiot can download and install, just following the instructions, and get a system with good reliability and performance -- a painful phase of shaking out threading bugs would endanger that perception. The best thing I can see Apache doing is some kind of hybrid model which works like mod_event for static pages, and passes off (some or all) dynamic requests to something like prefork. Dynamic requests would eat more memory than worker, but you don't have the problem of a heavyweight mod_perl or mod_php process spending two hours blasting bits out of a file to somebody on dialup. A process-based model is always going to be more reliable than a thread-based model. A hand grenade can go off in a server process, a server process can hemorrhage memory terribly, and nobody gets hurt. The user on the other end just hits 'reload' and goes on his way.
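To make the hybrid idea concrete, here's a rough sketch of one way to fake it today with two httpd instances; the /app/ prefix, port, and numbers are placeholders, not a tested configuration, and it assumes mod_proxy/mod_proxy_http are loaded on the front end:

    # Front-end instance (worker or, eventually, event MPM): serves static
    # files itself and reverse-proxies anything dynamic to a second,
    # prefork+mod_php instance on a local port.
    ProxyPass        /app/ http://127.0.0.1:8080/app/
    ProxyPassReverse /app/ http://127.0.0.1:8080/app/

    # Back-end instance: plain prefork with mod_php in its own httpd.conf,
    # bound to loopback only, so a crash or leak there never touches the
    # front end.
    Listen 127.0.0.1:8080

The idea being that the front end absorbs the slow dialup clients while the prefork children turn dynamic requests around quickly and go back to work.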
Re: OT: performance FUD
Jess Holle wrote: So if one uses worker and few processes (i.e. lots of threads per), then Solaris should be fine? That's what people think, but I'd like to see some numbers. I've never put a worker Apache into production because most of our systems depend on PHP or something else which I wouldn't trust 100% in a threaded configuration. Now that I think about it, there is a common situation where people with modest web sites (at the 50,000 ranking in Alexa) have performance problems with Apache... That's the case of people doing downloads of big (>>1 MB) files. Conventional benchmarking, which fetishizes a large and constant number of connections on a LAN, doesn't model the situation well (it doesn't model any real-world situation well.) The trouble is you have a population of people with really bad connections that take forever to download things... Back when I had dialup, I used to download ISO images; I'd just use a download manager and have my computer running overnight to do it. For one project I work on, we have people uploading files that sometimes are in the ~1 MB range, then we do processing on the files that is sometimes extensive. We were worried that some processes were running for 20, 30, 40 minutes, but we discovered that many of our users have horrible connections. The result is that a site with a modest number of "hits" per day can have > 1000 simultaneous connections. With prefork you end up burning a lot more RAM than really seems fair -- although it's not so bad if you can afford to load your machine with 8 GB.
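For reference, these are the prefork knobs that end up mattering in that situation; the numbers below are only meant to illustrate the >1000-connection case, not to be recommendations:

    <IfModule prefork.c>
        # One child per connection, so the limits have to cover the peak
        # number of slow, mostly idle downloads; RAM per child is the cost.
        StartServers          50
        MinSpareServers       25
        MaxSpareServers      100
        ServerLimit         1200
        MaxClients          1200
        MaxRequestsPerChild 10000
    </IfModule>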
Re: OT: performance FUD
Justin Erenkrantz wrote: If it's on equivalent hardware (i.e. Linux/Intel vs. Solaris/Intel on the same box), I doubt there will be an extreme performance gap. In fact, I've often seen Solaris outperform Linux on certain types of loads. In my experience, a lot of Linux network card drivers are sub-standard; if it's supported by Solaris, there's a fair chance the driver takes full advantage of the hardware. (Netgear GigE drivers on Linux are abysmal.) -- justin I think the issue with Apache/Solaris is that process switches take a long time on Solaris. I've got a computer that I need to rehabilitate for the family this Christmas. I think I'll put Solaris 10 on it and do some web server benching before I put Ubuntu on it.
Re: OT: performance FUD
Joshua Slive wrote: Suggestions to improve http://httpd.apache.org/docs/2.1/misc/perf-tuning.html are very welcome. Suggestions backed by data are even better. Basically there's nothing quantitative there. There's a lot of talk about "some operating systems" and not a lot of talk about specifics. One issue is that this page was written for (and, in fact, by) the Dean Gaudet-type performance freak who was looking to squeeze every last ounce of performance when serving static pages. All you need to do is add one CGI script or PHP app to your site and everything on that page after the hardware section gets lost in the noise. So when people mail [EMAIL PROTECTED] asking how to fix performance problems, the answer is almost always "fix your database" or "rewrite your web app" and not "change your apache configuration" or "get a faster web server". For me, that's the reason why quantitative information is so important. I did extensive performance testing on the new server we commissioned precisely because of the situation you describe: we had people saying "rewriting is slow", "ExtendedStatus On is slow" -- people were making decisions based on qualitative statements about performance, not quantitative measurements. After doing those tests, I learned that I had nothing to fear if I wanted to put in 500 rewriting rules, but that 50,000 is too many.
Re: OT: performance FUD
Brian Akins wrote: This is probably way off topic for this list. I was searching for something related to php this morning (I know, I know... But some people here need php) and the majority of the Google hits were FUD sites. Most of them generally say "Apache is bloated and slow, you should use X." I know we have several people on this list who run Apache on very high traffic sites. While we cannot answer every single piece of FUD out there, do we need a general page to answer some of them? Maybe "testimonials" or something. I know, with my config, I can easily saturate multiple gig interfaces and have a rather full-featured installation. Apache isn't the fastest web server -- at least without mod_event. I've seen data corruption with all of the free single-process web servers, although I'd assume that products like Zeus do better. Looking at Alexa, the logs from a few sites I run, and benchmarking I've done, there are probably only a few thousand web sites in the world that push the limits of a single Apache web server. Perhaps 100x as many PHB's ~might~ pick a web server because of numbers in a glossy ad. The real competition is with IIS, and people don't choose Apache or IIS based on performance numbers -- they choose it because they are familiar with Unix or familiar with Windows. Other web servers are at the 1% market share level: http://news.netcraft.com/archives/2005/11/07/november_2005_web_server_survey.html Don't make it a "fudbusting" site, make it an "Apache performance tuning" site. There are all of these statements in the Apache docs that * .htaccess is slow * ExtendedStatus On reduces performance We did a round of performance testing on a server that we commissioned last year and took measurements of these things, and found that we'd need to put >1000 rewriting rules to harm performance noticeably, that the overhead of ExtendedStatus On is negligible for a site that gets 500 hits/sec, etc. I might see if I can find my report on this and put it online -- there are some things that I know, and even more that I don't... * prefork and worker seem to be about equally fast on Linux? * is that the case on Solaris? * MacOS X? * Solaris 9 is embarrassingly slow running Apache compared to Linux -- is the same the case with Solaris 10?
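On the ExtendedStatus point, this is roughly the setup we measured against (and left on permanently, since the overhead turned out to be negligible at our ~500 hits/sec); the allowed network below is a placeholder:

    # mod_status with full per-request bookkeeping.
    ExtendedStatus On
    <Location /server-status>
        SetHandler server-status
        Order Deny,Allow
        Deny from all
        Allow from 192.0.2.0/24
    </Location>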
Re: pgp trust for https?
Peter J. Cranstone wrote: Currently Windows, Linux and Unix only use two levels of privilege - Ring 3 and Ring 0. Everybody and their uncle's code wants to run at Ring 0. Another really bad idea, as once I introduce a network/video/keyboard/whatever driver at that level I can execute malicious code. From there I can control the machine. You'd need a new hardware architecture for ring 1 drivers to be worth it. The trouble is that drivers can initiate DMA operations against physical memory. Unless you devise some system where the OS can veto DMA operations, protection in the CPU is worthless.
Re: httpd has no reason to improve...
Joe Schaefer wrote: Paul A Houle <[EMAIL PROTECTED]> writes: IMO market-share doesn't relate to project activity. The word I most associate with apache development is empowerment; chiefly to empower users to build better web stuff. Users that need to tweak the software in order to make that happen become committers, who are further empowered to make commits, and eventually make decisions about the project as a whole when they join the pmc. Market share does matter. It's a function of a product having a niche. A project that's thinking about market share is thinking about end users. A project that isn't thinking about end users is a project of people working on something for their own amusement. What I personally enjoy seeing on [EMAIL PROTECTED] is "piling on", where one person implements something, then two or three other people (not always committers) jump on the bandwagon and take it in a better direction. It's good to see more of that happening lately. Yeah, soon httpd will have a second- or third-rate implementation of every protocol under the sun. mod_ftp will underperform mainstream ftp servers so long as it's running under prefork. Similarly, a caching proxy built on httpd isn't going to outperform squid. I don't see end users clamoring for mod_ftp, or mod_snmpd. What's the point of writing a squid replacement unless you can actually make something better? According to the statistics, most users are content with 1.3. But that is a very short-sighted way to measure progress, because the situation will change as the *nix distros move away from 1.3. And it's got nothing to do with how content users are with 1.3, but with how, after 54 revisions, Apache has finally produced a server that doesn't have serious problems running in production environments. Exploring new protocols within httpd will almost certainly make the internal architecture better, even if none of those modules you mention are distributed with httpd. IMO mod_smtpd will be a fine example of that, because unlike httpd, almost all of the real action will be on the input_filter side (which IMO hasn't received the same amount of polish that the output_filter side has). A server framework that can implement any protocol under the sun might be a nice PhD thesis, but what does it do for end users? I don't see excellence coming from "swiss army knife" frameworks that do everything, but from systems that are developed from a whole system viewpoint, that have a good amount of co-design between layers of the system -- if you build a system that lets you plug in arbitrary junk, you're going to get arbitrary performance and reliability. (I think of the old days of Java Applet programming, where people who were just learning OO techniques were writing great complicated classes that were trying to be infinitely reusable, and it took them a long time to realize that people were going to have to download every byte of the class files they were generating... But it's not just the old "bloat" argument, but the burden of maintaining superfluous code.) To sum it up: better server architecture => better modules => more toys for users => more interest in the 2.x internals => more patches => more activity => better server architecture ... Sure, I appreciate mod_ssl and mod_dav in Apache 2 -- but the reason why so many people have stuck with 1.3 for so long is that 2.0 has had very little to offer end users. A few people thought it would be great to have pluggable MPM's, and a few other people introduced half-baked systems such as mod_cache and filters. 
You know a tree by its fruit, and the fruit I care about is performance and reliability. Apache 2.0.54 is a great server, but the fact that it took 54 revisions to get there isn't a good indication about the design and development process. Were Apache development targeting real problems that real users have, I think things would be quite different. A big part of the problem is that the Apache project has settled into a local equilibrium -- this explains the paradox of a product that obviously satisfies end users' needs well (no competition has emerged) but has a moribund development process. Any real innovation in the web server space will need to be disruptive, to break things. Apache, as we know it, just can't do that.
httpd has no reason to improve...
Ben Collins-Sussman wrote: I see a lot of frustration going on. The thing is, httpd's development process is nearly identical to Subversion's process... we stole most of it from you folks! So why all the angst in httpd-land, but not in Subversion-land? It's really a lack of direction. Apache already is the market leader in the web server space and has no potential for further growth -- Windows users aren't going to switch to Apache in significant numbers. (They never were, so developing a good NT port was a waste of resources that has helped paralyze the development of Apache for its core users.) This is in contrast to svn, which has a long way to go in growth. There is very little going on in Apache to make it a better web server -- the frontiers of development are trying to make it an ftp server, an smtp server, an ntp server, an snmpd server, a bgp server... Efforts that 99.44% of httpd users are completely indifferent to (Ok, maybe a few of them will want to turn on an ftp server because they're not getting enough support calls from people who are behind NAT or firewalls.) The module mechanism means that significant improvements in Apache can be made without making changes to the Apache distribution -- people who are serious about improving Apache as a web server are taking this route and not adding things to the Apache distribution. (Any change which affects the core use of Apache is politically controversial and gets an automatic -1.) Pluggable MPMs sound like a good idea in theory, but in practice they're a failure. It's hard enough writing code that runs in a single concurrent environment, but impossible if the concurrency environment is unknown and unknowable. It's hard enough keeping the modules that are part of the distribution working under all MPMs, even harder to support third-party modules (mod_php) and even harder to support all the libraries that link against third-party modules (kiss of death for mod_php and mod_perl in worker.) --- People who want to wake Apache up would be best to try a fork. Throw things that are irrelevant to running httpd on Posix overboard. Pick ONE MPM that offers a significant advance in functionality (mod_event or mod_perchild that actually works) and qualify everything to run against it. Add mod_macro. But why? Apache is good enough for most people, and you can fix any real problems for a particular project by adding modules or making little source code patches. There are 1,000 or so sites in the world that need an MPM faster than prefork, and they'd be better off going to a cluster strategy for availability anyway. Apache 2 was a rough ride, but Apache 2.0.54 runs smoothly on every installation I have... Why change anything?
Re: httpd-2.x servertype inetd
Nick Kew wrote: Jan Kratochvil wrote: In what sense low-cost? Apache's high startup cost is self-reinforcing. We know it's a once-only thing, so we have every module do expensive things at startup rather than per-request. I don't see how inetd would affect that. The only thing you'd save is the spawning of children, which you already noted ... Funny enough, the case where I'd think that partial restarts would come in handy would be to be able to change configuration on one virtual host on a production system without bringing them all down. Practically, it isn't a big issue: I've run systems in the 10-50 virtual host range, and startup time seems to be a few seconds. Our Apache is basically stateless, so probably one person has to reload and a few people see a transient broken image: no apps screw up. We add a new virtual host or make a config change maybe once or twice a week, so it's not terrible. I wonder what mass hosting places with 1000's of hosts do, but I guess they use dynamic methods of implementing vhosts, batch config file changes, and are a little more tolerant of downtime. Contrast this with the ColdFusion MX server, which sits behind one Apache system I look after. It runs very well once it's up, but startup takes about 30 seconds. Internal state kept on the server has two consequences: we have to restart it often when we make configuration changes in an app or after an app transitions to a bad state; also, a restart breaks sessions, seriously inconveniencing users. I always laugh when people tell me that web apps need to keep state in RAM in order to be scalable...
macro facility/default conf file/auth groups
I'm in the middle of setting up (selective) WebDAV access for particular directories on a set of (almost) identical vhosts with complicated requirements for access control. I thought of a bunch of Apache gripes that I have often: (1) I've evolved a nice system for implementing virtual hosts (directory layouts, using Include directives to segment configuration files) that benefits from years of mistakes that I and others have made. I'm wondering if there's any interest in refactoring the Apache conf file into a number of separate files to reflect "better practices" than the current default conf file. The idea isn't to force people to change, but to encourage future installations to be more manageable. I'll be happy to give more specifics if people are interested. (2) This particular system has a production and a test instance, so I'd love to have a way to set variables that I can interpolate into arbitrary strings. For instance, everything connected with the production system may be under /production/ in the filesystem and the test system stuff is under /test/, so I want ServerRoot to be ServerRoot {mybasedir}/apache2 I have repetitive stuff all over my conf files, for instance, a virtual host is stored at /{mybasedir}/sites/{vhost} and the DocumentRoot is at /{mybasedir}/sites/{vhost}/htdocs so it would be nice to write something like set vhost myvhost.com DocumentRoot /{mybasedir}/sites/{vhost}/htdocs I'd have to think about how scoping works, because I'd obviously like to reuse the same variable name for different vhosts. VirtualDocumentRoot doesn't really cut it, because there are other things, such as log files, containers, rewrite rules (already using SetEnvIf cheats) and such that need to be configured as well. (3) Along these lines I can imagine that a macro facility could be useful (one answer to the scoping problem) but if I did my containers as macros, it would be a little ugly. Another case where I might like macros (but settled for using an Include) is filling out all the auth configuration for files. A whole bunch of vhosts will share a users and group file across a number of directories. It's really a drag to specify four configuration parameters per directory that are the same everywhere, and some way to repeat this would be welcome. (4) Once again I'm stung by the inability to make AND criteria stick. For instance, we have 2 kinds of people working on N projects: developers and designers. All of these projects have certain directories that developers are interested in and others that designers are interested in. I'd like to say something like require group projectA AND group developer for the developer-only directories. I've similarly had cases where I've wanted to have AND criteria involving IP address AND user agent, IP address and authentication, etc. I've heard talk about overhauling this area, but I don't know what the status is. If I were to make a more concrete proposal for (2), is there any chance of it getting into Apache 2.1?
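To make (2) and (3) concrete, here's roughly what this looks like with the third-party mod_macro module (not part of the distribution); the macro name and paths are placeholders, and the base directory argument is how I'd handle the production/test split:

    # Sketch using mod_macro: one macro per vhost "shape", instantiated with
    # the host name and the base directory (production vs. test).
    <Macro StdVHost $host $base>
        <VirtualHost *:80>
            ServerName $host
            DocumentRoot $base/sites/$host/htdocs
            ErrorLog $base/sites/$host/logs/error_log
            CustomLog $base/sites/$host/logs/access_log combined
        </VirtualHost>
    </Macro>

    Use StdVHost myvhost.com      /production
    Use StdVHost test.myvhost.com /test

It doesn't answer the scoping question or (4), but it kills most of the repetition.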
Re: Initial mod_smtpd code.
Jem Berkes wrote: I could also start work on a mod_smtpd_dnsbl if the mentors feel that is worthwhile? This would look up a connecting IP address against a blacklist and return a descriptive string to mod_smtpd if the client should be rejected with an error: "550 5.7.1 Email rejected because 127.0.0.2 is listed by sbl-xbl.spamhaus.org" I'd also like to include support for RHSBL, a newer type of listing by domain names from the envelope sender address. That's used by a growing number of projects. Overall, blacklists aren't that effective and cause a lot of false positives. They may make sense in the case of something like SpamAssassin, which uses a blacklist in conjunction with other tests, but by themselves they really aren't a responsible way of dealing with the spam problem. I think it's better to discourage "worst practices" than to succumb to plugin mania.
Re: mod_mbox user interface
Maxime Petazzoni wrote: As promised: a first try at an XHTML mockup. Comments about DOM structure and XHTML semantics are welcome! http://skikda.bulix.org/~sam/ajax/jsless/ On Firefox/Win32, the "box list" and "message lists" aren't vertically aligned correctly (at least for me.) Other than that, it's quite pretty. You could reduce the whitespace margins considerably, maybe even lose the box around the year/month listings on the side. From a usability perspective, I like 'three-pane' sorts of interfaces, but I'm often infuriated by things that waste a lot of space so I can't read the actual message. The huge space allocated to the logo at the top could be reduced (that will probably be easy to template away, but if you set a bad example for people, they'll follow it) -- also I get the feeling that the message list takes up space that I could use to read the message...
Re: [VOTE] mod_ftp for HTTP Server Project
Jim Jagielski wrote: I therefore Call A Vote on whether we should support mod_ftp for inclusion into the Incubator and if we should accept mod_ftp upon graduation from the Incubator. I don't know if I get a vote, but it's -1. This would have been an exciting project in 1989, but ftp doesn't work well with today's internet: today it's just a way to make systems that "just don't work" for people.
Re: mod_smtpd project planning
Jem Berkes wrote: This is the problem encountered by many spam filters, as to be most effective they really need to be _involved_ in the SMTP transaction and not just stage 2, after receipt happens. Think greylisting as an example. You read this? http://www.acme.com/mail_filtering/ One thing that's critical isn't just having access to information from early stages of mail processing, but being able to intervene at early stages in the processing so as to avoid the CPU and bandwidth waste at advanced stages. This particularly matters during a computer virus outbreak: I remember hitting on many of Jeff's solutions when a mail server I managed was getting hammered by an incredible volume of viruses, and I wrote scripts that picked up bad addresses from the virus filter output and put them into the software firewall.
Re: mod_smtpd project planning
Luo Gang wrote: > > mod_smtpd is a SMTP protocol handler, used to receive mails by SMTP, > maybe it will use sendmail as its MTA(not sure). Somebody hope it could also > include a spam filter. > > > Hooks for a spam/virus filter aren't optional if it's an autoresponder: running an autoresponder that doesn't filter is about the same as sending spam in the first place.
Re: mod_smtpd project planning
Jem Berkes wrote: Hi all, I'm another student working on mod_smtpd. Been running httpd 2.x since it appeared, but am new to development. What does mod_smtpd do? Is it a sendmail replacer or does it let people request content via smtp or what?
Re: how do i debug a segfault?
Akins, Brian wrote: Sorry if I missed it, which mpm are you using? prefork
how do i debug a segfault?
This weekend we had the kind of experience with Apache httpd which we expect from Microsoft IIS or Tomcat. We're running a self-compiled 2.0.54 on RHEL 4 on x86_64 on a 4-way machine. Our server got kicked around midnight to rotate logs, but around 5AM we started getting a large volume (>1/sec) of messages like [Tue Jun 28 14:45:53 2005] [notice] child pid 28182 exit signal Segmentation fault (11) [Tue Jun 28 14:45:53 2005] [notice] child pid 28183 exit signal Segmentation fault (11) [Tue Jun 28 14:45:53 2005] [notice] child pid 28184 exit signal Segmentation fault (11) This server isn't very heavily loaded; it's lucky if it's getting 1 hits/day at this point. The site still uses CGI extensively: some CGIs worked just fine, but other CGIs failed with a zero-length document and, I think, nothing in the log. Kicking the server resolved the problem, at least for now. It has ExtendedStatus on and my hunches are: (i) the problem is x86_64 specific (haven't seen this on a heavily loaded x86 machine) and (ii) the underlying problem is in server global state. One obvious step is to set up monitoring of stderr (as has been discussed) to page me and maybe auto-kick the server if this happens again -- but I'd like to see a real fix.
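For what it's worth, the first thing I plan to try is getting core files out of the children so there's something to point gdb at; the directory below is a placeholder, it has to exist and be writable by the child User/Group, and the core-size ulimit has to be raised in the init script before httpd starts:

    # Let crashing children leave cores somewhere the child UID can write.
    CoreDumpDirectory /tmp/apache-cores

Then running gdb against the httpd binary and one of the cores and typing 'bt' should at least say which module the children are dying in.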
Re: mod_smtpd project planning
Paul Querna wrote: As some of you might be aware, one of the Summer of Code Projects is an SMTP protocol module for httpd 2.x. Huh?
Re: Monitoring HTTP error logs
William A. Rowe, Jr. wrote: Offhand, no, but I'd suggest looking at Piped Log scripts. This would be pretty trivial to do (even looking for very specific messages or masking out other common occurrences.) The messages can then be written to one or more log files, as well. See the ErrorLog documentation for pipe syntax, and rotatelogs or logresolve for additional examples. Another possibility is to, more or less, write a script that does the same thing as 'tail -f', or alternatively a script that runs periodically and keeps track of the position it left off at in the log.
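A sketch of the piped-log variant; the script path is a placeholder, and the script would read the error log on stdin, copy it through to disk, and page somebody when it matches the interesting patterns:

    # Pipe the error log through a watcher script instead of writing it
    # directly; the script is responsible for writing the file itself.
    ErrorLog "|/usr/local/bin/errorlog-watch.pl /var/log/httpd/error_log"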
Re: stress testing of Apache server
On Tue, 03 May 2005 13:51:55 -0700, Paul Querna <[EMAIL PROTECTED]> wrote: Sergey Ten wrote: Hello all, SourceLabs is developing a set of tests (and appropriate workload data) to perform stress testing of an Apache server using requests for static HTML pages only. We are interested in getting feedback on our plans from the Apache server community, which has a lot of experience in developing, testing and using the Apache server. Although Apache is hardly the fastest web server, it's fast enough at serving static pages that there are only about 1000 sites in the world that would be concerned with its performance in that area... Ok, there's one area where I've had trouble with Apache performance, and that's in serving very big files. If you've got a lot of people downloading 100 MB files via dialup connections, the process count can get uncomfortably high. I've tried a number of the 'single process' web servers like thttpd and boa, and generally found they've been too glitchy for production work -- a lot of that may involve spooky problems like sendfile() misbehavior on Linux. Information available on the Internet, as well as our own experiments, make it clear that stressing a web server with requests for static HTML pages requires special care to avoid situations when either network bandwidth or disk IO become a limiting factor. Thus simply increasing the number of clients (http requests sent) alone is not the appropriate way to stress the server. We think that use of a special workload data (including httpd.conf and .htaccess files) will help to execute more code, and as a result, better stress the server. If you've got a big working set, you're in trouble -- you might be able to get a factor of two by software tweaking, but the answers are: (i) 64-bit (or PAE) system w/ lots of RAM. (ii) good storage system: Ultra320 or Fibre Channel. Think seriously about your RAID configuration. Under most circumstances, it's not difficult to get Apache to saturate the Ethernet connection, so network configuration turns out to be quite important. We've had a Linux system that's been through a lot of changes, and usually when we changed something, the GigE would revert to half-duplex mode. We ended up writing a script that checks that the GigE is in the right state after boot completes and beeps my cell phone if it isn't. == Whenever we commission a new server we do some testing on the machine to get some idea of what it's capable of. I don't put a lot of effort into 'realistic' testing, but rather do some simple work with ApacheBench. Often the answers are pretty ridiculous: for instance, we've got a site that ranks around 30,000 in Alexa that does maybe 10 hits per second at peak times... We've clocked it doing 4000+ static hits per second w/ small files, fewer hits per second for big files because we were saturating the GigE. What was useful, however, was quantifying the performance effects of configuration changes. For instance, the Apache documentation warns that "ExtendedStatus On" hurts performance. A little testing showed the effect was minor enough that we don't need to worry about it with our workload. Similarly, we found we could put ~1000 rewriting rules in the httpd.conf file w/o really impacting our system performance. We found that simple PHP scripts ran about 10x faster than our CGI's, and that static pages are about 10x faster than that. We've found tactical microbenchmarking quite useful at resolving our pub table arguments about engineering decisions that affect Apache performance. 
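On the sendfile point, the knobs I'd measure first for the big-file case are below; if memory serves they're available from 2.0.44 on, and whether they help at all is entirely platform- and filesystem-dependent:

    # Work around sendfile()/mmap() glitches on some kernel/filesystem
    # combinations when serving large static files; benchmark before and
    # after, since on a healthy platform this only costs performance.
    EnableSendfile Off
    EnableMMAP Off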
Personally, I'd love to see a series of microbenchmarks that address issues like * Solaris/SPARC vs. Linux/x86 vs. Mac OS X/PPC w/ different MPMs * Windows vs Linux on the same hardware * configuration in .htaccess vs. httpd.conf * working set smaller/larger than RAM * cgi vs. fastcgi vs. mod_perl * SATA vs. Ultra320 SCSI for big working sets and so on... It would be nice to have an "Apache tweaker's guide" that would give people the big picture of what affects Apache performance under a wide range of conditions -- really I don't need precise numbers, but a feel of around 0.5 orders of magnitude or so. It would be nice to have a well-organized website with canned numbers, plus tools so I can do these benchmarks easily on my own systems. === Speaking of performance, the most frustrating area I've dealt with is performance of reverse DNS lookups. This is another area where the Apache manual is less than helpful -- it tells you to "not do it" rather than give constructive help in solving problems. We had a server that had heisenbug problems running RHEL 3, things stabilized with a 2.6 mainline kernel -- in the process of dealing with those problems, we developed diagnostic tools that picked up glitches in our system that people
Re: simple-conf branch
On Mon, 4 Apr 2005 15:01:34 -0700, Greg Stein <[EMAIL PROTECTED]> wrote: Sorry, but I very much disagree. I think back to the old days of access.conf, httpd.conf, and srm.conf. As an administrator, I absolutely detested that layout. I could NEVER figure out which file a given configuration was in. I always had to search, then edit. We've been to the "multiple .conf world" before. It sucked. We pulled everything back into a single .conf to get the hell outta there. Small examples are fine. The default configuration should remain as a single .conf file. After a few years of running moderate-sized virtual hosting servers (2 to a few hundred vhosts), I've settled on a multiple-file organization for virtual hosts. I use the usual httpd.conf for server-wide settings, but that includes "vhost.conf" which contains a bunch of virtual host containers that then include configuration files for the various virtual hosts. These days I tend to create a set of directories, like /www/sites/vhost-1.com, which then have logs, htdocs, and other supporting directories. What's nice about this is that vhosts are easily portable from one server to another. I've got a script that automatically punches in a new vhost, so I can have one up and running in two minutes. My big project these days is a site with a database that's still useful in read-only mode when the database goes down; it's got mirror sites and all kinds of funny details, and we have a "runlevel.conf" symlink that points to one of several files that let us adapt the system to various degraded states such as database maintenance, software upgrade, etc. That same site also has several test instances, and we have a single configuration file that has all the variables that change between different instances, so it's easy to maintain the conf files in CVS. There are good operational reasons to split up configuration in different files -- if the Apache install can encourage good practices, based on the decade of experience we've had with it, that's a good thing.
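Concretely, the layout looks something like this; the file names are just my convention, not anything httpd mandates:

    # httpd.conf: server-wide settings only, then pull in the vhosts.
    Include conf/vhost.conf

    # conf/vhost.conf: one container per site, each Including its own file.
    <VirtualHost *:80>
        ServerName vhost-1.com
        Include /www/sites/vhost-1.com/conf/site.conf
    </VirtualHost>

    # /www/sites/vhost-1.com/conf/site.conf: everything specific to the site,
    # so the whole directory can be rsynced to another server as a unit.
    DocumentRoot /www/sites/vhost-1.com/htdocs
    ErrorLog     /www/sites/vhost-1.com/logs/error_log
    CustomLog    /www/sites/vhost-1.com/logs/access_log combined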
Re: Do these broken clients still exist?
On Sun, 3 Apr 2005 13:58:56 -0400 (Eastern Daylight Time), Joshua Slive <[EMAIL PROTECTED]> wrote: Does someone with a high-traffic, general-interest web site want to take a look through their logs for these user-agent strings? I don't mind keeping them if they make up even 1/100 of a percent of the traffic, but it seems silly to keep these extra regexes on every single request if these clients don't exist anymore in the wild. Regexes are pretty cheap for a 'normal' Apache setup. In the initial testing of a production server (2x 3.2 GHz Xeon, 6 GB RAM) we found that, serving static pages, the overhead of processing regexes didn't become noticeable until we had >1000 rewriting rules. Even then, at least 30% of the hits on this server are cgi-scripts, so the overhead of regexes is really nothing compared to the other ways we abuse our machine. In doing this testing I did notice that Apache's handling of regexes is pretty simplistic. Much of the time you can consolidate a large stack of regexes into a single state machine, and that could give vast (factors of hundreds or thousands) improvements in performance for handling large rule sets. On the other hand, it doesn't really matter. The people we've inherited this server from left us several very large regexes with a few hundred pipe symbols each that match UA's of non-browser clients that we don't want using our service. The trouble is that inevitably this kind of regex starts mutating into malignant forms as people start using parens; also, we have no documentation for the rules; on slow days I think about breaking these up into 500-1000 rules, which we could in principle comment one-by-one... This wouldn't really impact the performance of our machine under 'real' circumstances, but we could measure the impact under specialized testing.
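If I ever do break them up, it would be into something like the following, one rule per client so each can carry its own comment; the user-agent substrings here are invented, not our real list:

    # One SetEnvIf rule per unwanted client instead of a giant alternation.
    # "BadBot" and "EvilHarvester" are placeholder names.
    SetEnvIfNoCase User-Agent "BadBot" block_ua
    SetEnvIfNoCase User-Agent "EvilHarvester" block_ua

    <Location />
        Order Allow,Deny
        Allow from all
        Deny from env=block_ua
    </Location>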
Re: Puzzling News
On Tue, 1 Mar 2005 16:18:17 +0200 (SAST), Graham Leggett <[EMAIL PROTECTED]> wrote: The trouble with the authentication problem is that the credentials used for authentication are often used for way more than just finding out whether a user has access. That said, this is definitely a very useful addition. That's exactly the point. I think of this in terms of "user management", as I find the terms "authentication", "authorization" have too much baggage and often lead people to make the same mistake too many times. Something like an auth module that can do "form based" auth, in addition to "basic" and "digest" etc would probably be very useful. Well, on one level I think the API for authentication is basically inadequate. One major problem is that many applications (if they're going to be USABLE rather than to just exist) need "optional" authentication -- Amazon.com's personalization is an obvious case, but a more common "intranet" application is as follows: we have an event calendar which has some events that are open to the public and some of which are intended for staff. We've got an Apache auth module that plugs into our campus Kerberos system, and in a lot of ways it's pretty good. But a staff person who wants to see what events are coming up needs to log in and then has to click two more times (the fault of our designer) to actually see the private events. Let's face it -- few users are going to bother to do that, so the event calendar is going to get a small fraction of the usage that it should. If, on the other hand, the system could remember the identity of the user, they could just go to the URL and see the events; that improves the chance that they'll actually look at the calendar on a regular basis. The other trouble with the Apache authentication system is that the whole API between Apache and most programming environments is inflexible. You can do better in mod_perl, but with PHP, JSP and most other environments, you get a lump of data from Apache at the beginning of the request and you send back a lump of data, but you don't get to call subroutines on Apache and get results; for instance, you can't ask Apache what the name of the user is, but instead Apache has to pass everything that it's going to pass in notes or environment variables. (One of the things I'd love, for instance, is lazy evaluation and caching for reverse DNS lookups, something that would mean big changes...) On top of that, the real problem with user management isn't the actual authentication (I've got an "authentication module" that supports cookie-based authentication that I've written again and again in different languages that's about 100-200 lines of code), but rather a domain that Apache has never concerned itself with: user interface. For instance, public web sites that let anybody join need to verify the e-mail address of people who join. There are basically two ways to do this: (1) Send a randomly generated password to the user (2) Let the user pick a password (that they've got a fighting chance to remember) and send them a token that they need to supply to activate their account Empirical research shows that approach 2 has less than half the failure rate of 1, and helps in turning registrants into active users because their relationship with the site doesn't begin with a password change, or, more likely, a password reset. Over the course of a few years, we made observations and found we could cut 2's failure rate in half again by making little improvements in the details. 
Similarly, if a site has passwords, there are going to be password resets, and you need a good interface for users to reset their own passwords. Big sites will spend what it takes to get a good UI, but most web app developers still see user management as an afterthought, and will quickly pound out something that will cause 20-50% of users to drop out before they've participated in the system. And then there's the question of user interface for the administrators: one of the great pains of Apache administration is getting phone calls from non-technical people who need changes made to an .htpasswd file. Make that a public site with 10^3 or 10^6 users, and the problem takes on a whole new dimension. UNLESS people adopt a user-management framework that's already written, they're just not going to give this problem the resources it deserves, and they're going to suffer with half-assed solutions. I've been thinking about this a lot, and my answer to this problem is here http://www.honeylocust.com/x/products/tum/ I have to apologize that I haven't made a public release in a while; the version that's there is pretty far behind what I've got running on some production sites. The version that's up there is pretty battle tested -- it's run
Re: Puzzling News
> William A. Rowe, Jr. wrote: >> At 03:17 PM 2/28/2005, Paul A. Houle wrote: >> >>>On Mon, 28 Feb 2005 21:09:55 +0000, Wayne S. Frazee <[EMAIL PROTECTED]> >>>wrote: > >> Oh boy - you don't know *what* you are missing :) Threads on >> Linux barely differ from distinct processes, while on Solaris >> they are truly lightweight. > > Well.. thats true for LinuxThreads.. but NPTL is a different and better > story. (Requires newer glibc and a 2.6 kernel) > Yeah, but that's because Solaris has embarrassingly bad performance running Apache/prefork; I've seen Apache run an order of magnitude faster on Pentium II machines that I've fished out of the dumpster than on Sun servers that we spent $30,000 for. (Not to bash Sun by any means... Solaris will embarrass Linux about as badly running most MTA's; it's not so much that threads are bad on Linux, but that processes are incredibly fast on Linux -- NPTL doesn't so much speed up threads on Linux as make it possible for them to scale.) I think there is a reason why it matters that there is uptake on Apache 2, and that's so development can really move forward. Don't take it personally, but I'm quite depressed that Apache is still the dominant web server after all these years; the chief competition, Microsoft IIS, isn't all that different functionally from Apache. I think of all the features that web site authors and developers need that still don't exist in mainstream web servers; part of this is in the area of "content management" and another major area is authentication -- pretty much any serious interactive web site needs a cookie-based authentication system with the features seen on big sites like amazon.com and yahoo!, and one of the reasons there is so little code reuse on the web is that every application winds up implementing its own authentication system; if there was something really good built into a market-leading web server, this picture would change completely. A lot of the problem is that the market for advanced web servers is tiny. Looking at Alexa and applying a little guesswork, I get the sense that there are probably 10,000 or so sites that get more than one million hits per day; and if you're doing static pages, you could probably multiply that by 30x before Apache starts becoming performance limited on reasonable hardware. The remaining sites that need to push the limits of Apache all have different circumstances, and they'll find different answers; if you're running a dynamic site with PHP and you're hitting the CPU wall, really the only good answer is to build a load-balancing cluster. People who are running mod_perl often really do have memory problems (but there's enough instability in mod_perl that using processes instead of threads makes all the difference between a server that gives a 500 from time to time and a server that goes down regularly at 2 AM.) Java's got its own bunch of problems. In that sense, the market for higher-end web systems is going to be fragmented enough that getting enough sites running the same software to really standardize and make progress will be difficult. The one 'generic' area where I think Apache/prefork isn't so satisfactory is in serving large numbers of large files... On some platforms you might get a factor of 2 or 3 better with worker, but you'll do that again (and maybe more) with something along the lines of mod_event. Another area that fascinates me is the 'perchild' MPM. Having the ability to segregate different parts of the server to different users could be useful for some systems I run. 
(One of my projects is a web server with 25 virtual hosts, one of which has about 100 subdirectories managed by I don't know how many different people; some of these sites are running PHP apps, some are using the accounts to archive old files.) Needless to say, between the limited UNIX permissions model and running Apache under one UID, we've had to make some compromises between security and having to answer support calls continuously from people because "it doesn't work". If perchild could be made smart enough to have different classes of server: little lean ones that can handle static pages, other bigger ones that can do mod_perl, that could be useful. Perhaps we could have some hybrid between worker and prefork that makes it possible to run threadsafe and nonthreadsafe apps in the same server... The critical thing is that Apache shouldn't forget that "prefork" has its strengths. I think the main reason that Apache is so reliable is that a hand grenade can go off in one process and it doesn't hurt the server at all. It took Microsoft IIS many iterations to find a reasonable balance between performance and stability, and the evolution of Apache MPM's is likely to face similar pitfalls.
Re: Puzzling News
On Mon, 28 Feb 2005 21:31:19 +0000, Wayne S. Frazee <[EMAIL PROTECTED]> wrote: Correct me if I am wrong, but I have seen much that would purport the worker MPM to deliver gains in terms of capacity handling and capacity-burst-handling as well as slimming down the resource footprint of the Apache 2 server on a running system under normal load conditions. Well, our big production machine has 6 GB of RAM and never gets close to running out even in testing when we stacked it up to the (compiled-in) limit of 255 processes. Under normal operations we have 50 running, mostly because of keep-alive (helps a lot with the performance of our cookie-based authentication system) and people downloading moderately big (>100k) files. Even though RAM is pretty cheap, there probably are people who are more constrained. I would also like to point out I too have seen inconclusive evidence on MPM "advantage". I think that is part of the problem... without a clear business-case-defendable advantage to the features implemented in Apache 2... why upgrade? Altruism. If people don't use Apache 2, then Apache development will keep going sideways forever.
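For context, the sort of settings in play here; the numbers below are illustrative rather than what is literally in our httpd.conf:

    # Keep-alive holds a child per idle client, which is most of why ~50
    # processes stay busy; MaxClients matches the limit mentioned above.
    KeepAlive On
    KeepAliveTimeout 15
    MaxKeepAliveRequests 100
    MaxClients 255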
Re: Puzzling News
On Mon, 28 Feb 2005 21:09:55 +, Wayne S. Frazee <[EMAIL PROTECTED]> wrote: A move to 2.0 or 2.1 will take place gradually over time, I think, once PHP can be used with some expectation of stability on a non-prefork-MPM. Note: I am not insinuating PHP is not thread safe, but rather many of the elements it works with or relies on are not. I also want to make it clear, I am not blaming the PHP community either. I think it is a broader based problem in that commercial projects such as company web server implementations and hosting infrastructure packages are just not seeing enough value in the move to 2.0 to justify the development expenditures. We've got production instances of Apache 2 running on Linux and Solaris, all of which are running PHP on prefork. Honestly, I don't see a huge advantage in going to worker. On Linux performance is about the same as prefork, although I haven't done benchmarking on Solaris. On some of these machines I have a lot of control, on others I have people calling me every week wanting me to link something new into PHP. If mod_event becomes mainstream, however, I'll have a reason to really want threaded PHP, since that will give real performance improvements for static files. On the other hand, it might be possible to make a version of mod_event that uses processes for PHP.
Re: [VOTE] 2.1.3 as beta
On Wed, 23 Feb 2005 16:57:20 +0100, Matthieu Estrade <[EMAIL PROTECTED]> wrote: Justin Erenkrantz wrote: I think it's the best way. Maybe we could also provide two packages, httpd-with-apr and another one without apr. No, they should be separate, and having two different packages is just a recipe for trouble. (When you have to deal with a support problem, the guy who's having the problem will never know which package he downloaded.) Remember you're dealing with two audiences here. There's an audience of people who compile from source; there's also the audience of people who use packaged libraries. People in the second camp aren't going to suffer much hardship. RPM, deb and other package managers know about dependencies, and newer tools (yum) will automatically track them down. I'm in the first camp. Apache is an important enough piece of software that I'm not going to switch to some other web server because it's a little more work to compile. When compiling software, it's common to have to compile some libraries to compile the app, and this kind of split is even done with trivial software such as mp3-taggers. The one thing I'd worry about is maintenance. Suppose there's a security flaw in APR... Well, Apache is in the front of your mind, but APR isn't, so it might be easy to overlook the advisory. Upgrading APR might be a bit more confusing because the system might spend some time in a state where APR is upgraded and Apache isn't -- what effects will that have? (Probably none, but "probably" isn't a good answer for a production server that 30,000 people depend on.) Also, the stability of APR is going to matter. For a long time you had to run Subversion on APR out of CVS and I'd often update svn and then find that I had to update APR because they'd changed it so it depends on the latest APR. If APR is going to be reasonably stable in the future, then I feel OK thinking of it as a system library. If, on the other hand, it's really joined-at-the-hip to Apache and lots of app developers are building things that need the CVS version of APR, it had better not be.
Re: UNIX MPMs
On Thu, 10 Feb 2005 11:56:47 +0000, Nick Maynard <[EMAIL PROTECTED]> wrote: UNIX MPMs that actually _work_ in Apache 2: worker prefork (old) Yeah, but what if you want to run PHP or mod_perl? Sure, PHP or mod_perl ~might~ work for you if you're lucky and you don't compile in the wrong third-party library, but you'll be in a world of pain if you do. It's very hard to bolt threading onto an existing system that links to legacy libraries. Java managed to provide a thread-safe environment by having a painful API for linking to C code and "100% Pure Java" xenophobia -- and it still took them 5 years to make a JVM which was reliable enough for government work. On Linux I've done some benchmarking and found that worker isn't any faster than prefork at serving static pages. (Is it any different on other platforms, such as Solaris?) In principle you might save RAM by running worker, but in this day and age you can fit 16 GB in a 1U pretty easily and it's cheaper than hiring a programmer who doesn't know how to track down race conditions, never mind one that does.