We all have moments where we do things we regret.  Sometimes when
we're drunk, or under the influence of bad friends, or lonely.  In my
case, I guess it was probably sleep deprivation that led me to think,
"You know what?  Writing a web server in assembly language sounds like
a fun thing to do."

It turned out pretty well and revealed some surprising things about
modern Linux systems.  You might even find it useful.  You probably
didn't expect you could write a useful HTTP server in under two
kilobytes of machine code.

1. What httpdito is useful for
2. What httpdito is
3. How to build httpdito
4. How to configure httpdito
5. Performance
6. Security
7. What it tells us about Linux
8. What it suggests about operating system design
9. Are you insane?
10. You got the bonus package: here's the executable

What httpdito is useful for
---------------------------

httpdito can serve static files from your filesystem on Linux.  It's
probably the easiest way to serve static files, and like any static
web server written in a language faster than Tcl, it's plenty fast for
just about anything.  And it's smaller than 2K, so you can copy it
anywhere easily.

It serves up the files in the directory where you run it.  It won't
serve up files elsewhere, unless you have a symlink in there.

It doesn't serve up files whose names begin with ".", such as
".htaccess" or ".htpasswd".  This could be a problem if your intended
use is serving up non-bare Git repositories.  You can serve up a bare
clone, though:

    $ git clone --bare calculusvaporis calculusvaporis.git
    Cloning into bare repository 'calculusvaporis.git'...
    done.
    $ ./server &
    [1] 29425
    $ (cd calculusvaporis.git; git update-server-info)
    $ time git clone http://localhost:8086/calculusvaporis.git cv-clone
    Cloning into 'cv-clone'...

    real        0m0.259s
    user        0m0.016s
    sys         0m0.048s
    $

It *does* %-decode URLs, so it should work fine to serve up files
whose names are in UTF-8, not just ASCII.

It won't run on non-x86 machines; it doesn't do PHP, CGI, or any
embedded scripting language; it doesn't automatically generate
directory indexes; it doesn't do HTTPS or content negotiation; it
doesn't log.

With a little hacking, you could probably build it into some other
application as a low-overhead embedded HTTP server.

What httpdito is
----------------

It's for fun.

It's a full, useful web server in under two kilobytes.

It's written in self-contained i386 assembly language; while it uses a
lot of macros, all of those macros are defined in the source file
itself.  Similarly, although it calls a couple of functions, those
functions are defined in httpdito, not loaded from a library.  It
makes system calls, though.

It's well-commented: it's more intended to be read than to be
executed.  One commenter exclaimed that it looked more like Python
than assembly.

The system calls it uses are open(), close(), read(), write(), fork(),
_exit(), waitpid(), alarm(), socket(), bind(), setsockopt(), listen(),
and accept().  I think that none of them have changed their semantics
since 4.3BSD, released in 1986, although in Linux some of the constants
have different values.  In that sense, httpdito is unreasonably
portable.  I think it's probably binary-compatible with even very
early Linux versions, although I haven't tried it.

How to build httpdito
---------------------

I'll write a Makefile soon, but for now:

    gcc -m32 -nostdlib server.s -o server.bloated

And then optionally

    objcopy -S -R .note.gnu.build-id server.bloated server

If you don't do that part, it will be over 4kB instead of under 2kB.
If you really care about that, you're probably insane.  (Take note
that I've spent many hours achieving its current size.)

How to configure httpdito
-------------------------

Before starting it, cd to the directory containing the files you want
it to serve up.

If you want it to run on a different port, edit the line near the top
of server.s that says

        .equiv port_number, 8086

and set it to the port you want.

In its current configuration, httpdito will run up to 2048 concurrent
child processes.  This sounds like an absurdly huge number, since
setting Apache's MaxClients directive to more than about 100 or 1000 is
likely to start giving you OOM kills.  But that's because Apache
process working sets are typically a few megabytes, like 4 to 16
megabytes.  By contrast, ps usually displays httpdito RSS as between 4
and 16 *kilo*bytes.  That means you can have a thousand times as many
processes before you run out of memory.

In practice, 2048 processes on my netbook is no big deal.  This is
the same netbook where Firefox struggles to render Twitter, often
begging me for permission to kill the JavaScript that is bogging it
down.

Performance
-----------

httpdito typically performs a little bit worse than a default Debian
configuration of Apache on the same hardware.  For example, on my
8-core laptop, both Apache and httpdito manage in the neighborhood of
20000 to 30000 requests per second, according to apachebench (ab), but
while httpdito tops out at about 1.8 gigabits per second of outbound
bandwidth, Apache manages closer to 2.5.  But these numbers aren't
really trustworthy because the limit on them (the bottleneck) is
probably ab, not Apache or httpdito.

These numbers are not noticeably worsened by increasing the number of
concurrent clients up to 1000.

The bigger point, though, is that even httpdito is able to max out the
gigabit Ethernet network card in the machine.  So you probably
shouldn't worry about performance with httpdito.

Security
--------

I fucking hate dealing with security, but it's a necessity.

Assembly language is not the language of choice for writing software
that safely handles untrusted input.  It's easy to write bugs that
provide an attacker with remote arbitrary code execution.

However, httpdito is extremely simple (296 CPU instructions as of this
writing), and it's intended to be secure.  It is expected that an
attacker on a remote machine can:

- Use arbitrary amounts of outgoing bandwidth by requesting files from
  httpdito.

- Deny service to other clients by opening 2048 concurrent connections
  to httpdito and continuing to open 64 new connections per second.  
  (This could be mitigated in a variety of ways: a
  scoreboard by hashed IP address, and leaving SYN cookies enabled.)

- Use arbitrary amounts of CPU time by flooding httpdito with
  requests.

- Read the contents of the files in the directory where httpdito is
  running and its descendant directories, including via symlinks.

It is not expected that a remote attacker can:

- Exhaust your memory or your process table, causing problems for
  other processes on the machine where httpdito is running.

- Execute code they provide, for example via a buffer overflow.

- Cause any data to be written to your filesystem.

- Read any files outside the directory tree httpdito is serving up.

- Provide a link to a third party that causes attacker-provided data
  to be interpolated into HTML, JS, PDF, or other executable data sent
  by the server, in an executable way.  (That is, httpdito is intended
  to be XSS-proof.)

- Receive any information provided by other users of the server,
  except a crude measure of server load.

- Deny service to other clients without continually sending new
  requests.

Please let me know if you find a weakness (a way that a remote
attacker can do something unexpected), or something I should have
thought about.

What it tells us about Linux
----------------------------

1. Hardware is fast now.

    Apache became popular in 1995 and 1996, in large part because it
    performed better than its ancestor, NCSA httpd, while being a drop-in
    replacement for it.  Philip Greenspun [memorably explained in 1997, 
    speaking of CGI][0], "Computers do not like to fork 500 000 times a day."
    Spawning a new process is a fairly expensive operation.  Apache's
    "preforking" model, in which a pool of worker processes sits
    around waiting to handle incoming requests, each handling many
    requests (say, 100) before it dies, essentially eliminated that
    expense.

[0]: http://philip.greenspun.com/panda/server-programming

    But, at the time, a Linux web server might have been running on a
    133MHz Pentium.  As a rough estimate, that machine could do a
    32-bit add every cycle, 133 million times a second, while my
    laptop, a 2.5 GHz Core i7, can do ten 64-bit adds every cycle on
    each of its four cores, or 100 billion 64-bit adds per second,
    roughly equivalent to 200 billion 32-bit adds.  So it's about 1500
    times as fast.  One second for my laptop is like 25 minutes for the
    1994 Linux web server.  So some things that were impractical then
    are practical now.

    That is, in 1997, forking 500 000 times a day was expensive.  Now
    my laptop can fork 500 000 times in about 20 seconds.

2. Processes are cheap now.

    But it's not all about hardware.  If you had thousands of
    concurrently runnable processes on 1994 Linux, it would spend most
    of its time in the scheduler.  But, in recent years, the scheduler
    has improved a lot; the current CFS scheduler is O(log n) in the
    number of runnable processes, which in practice is as good as
    constant.  That means it's now practical to
    have truly ungodly numbers of processes awake, as long as they're
    doing useful work.

    In a sense, this is "back to the future": cheap processes were one
    of the interesting novelties of Unix.  The idea of spawning off
    multiple time-shared processes for a single command line was
    rather bizarre for the 1970s, but made it easy to do computations
    that did not fit comfortably into the 16-bit memory address space
    of the time, by breaking them into multiple cooperating processes.

    But there's cheap, and then there's *really* cheap.

3. System calls are fast now.

    Henry (now Alexia) Massalin's 1992 dissertation, on the
    experimental operating system "Synthesis", passionately advocates
    making system calls cheaper, and describes several novel ways of
    achieving this.  Among other things, expensive system calls force
    you to add complexity to your userspace programs to reduce the
    number of calls to them: I/O buffering, caching, attempts to
    duplicate kernel logic in userspace, and so on; and they create
    incentives for "batch" system calls like writev() and poll().  And
    this complexity creates its own performance problems.

    Synthesis was noticeably faster than traditional Unixes on the
    same hardware, and dramatically faster when measuring latency
    rather than bandwidth --- when your write() calls are fast enough
    to efficiently pass a single audio sample through a pipe, you can
    get much lower end-to-end latency on an audio processing pipeline
    of several processes.

    Synthesis has been very influential on OS implementors since the
    dissertation was published, although to my knowledge, the code has
    never been published, and the results have never been reproduced.
    But many of the techniques for reducing system call costs have
    been incorporated into Linux.

    Partly as a consequence, httpdito can use a one-kilobyte buffer
    for reading file data and passing it to the network stack, yet
    still push out gigabits per second; that means each read()/write()
    pair is taking less than 16 microseconds.  If you increase the
    size of this buffer to reduce the number of calls, it doesn't
    help, because apachebench is still the bottleneck.

    And this helps a lot with making processes cheap.  Whenever you
    fork() a process, you inherently need to make a copy of all the
    memory pages that process writes to.  Every time you can use a 1K
    buffer instead of a 32K buffer, you can use one page instead of
    eight pages, which means several times as many processes can fit
    into your L2 cache.

4. Most of the work of an HTTP server is in the kernel.

    httpdito is currently a 1928-byte executable, of which 1019 bytes
    are its 296 instructions of machine code.  If you look at this
    code, most of it looks like this:

         80482c3:       05 e0 96 04 08          add    $0x80496e0,%eax
         80482c8:       89 c1                   mov    %eax,%ecx
         80482ca:       89 eb                   mov    %ebp,%ebx
         80482cc:       b8 03 00 00 00          mov    $0x3,%eax
         80482d1:       cd 80                   int    $0x80
         80482d3:       c7 05 b8 96 04 08 c3    movl   $0x80495c3,0x80496b8
         80482da:       95 04 08 
         80482dd:       83 f8 00                cmp    $0x0,%eax
         80482e0:       7d 13                   jge    0x80482f5
         80482e2:       e8 d1 fd ff ff          call   0x80480b8
         80482e7:       89 eb                   mov    %ebp,%ebx
         80482e9:       b8 06 00 00 00          mov    $0x6,%eax
         80482ee:       cd 80                   int    $0x80

    That is, it consists largely of system calls (`int $0x80`) and
    setting up arguments for them.

    According to the time command, httpdito spends about 40 times as
    much "system" time as it does "user" time.  That is, for every
    cycle spent in httpdito's code itself, another 40 cycles are spent
    in the system calls httpdito is invoking.

    In essence, httpdito is just scripting the Linux kernel's capabilities
    in assembly language.  It could be even more extreme; there's a
    sendfile() system call which might save a loop of about 10
    instructions reading and writing file contents.  I chose not to
    use sendfile(), because although it would probably be faster,
    httpdito is already fast enough, and it's not clear that it would
    be simpler.

What it suggests about operating system design
----------------------------------------------

The Linux kernel is already capable of efficiently creating tens of
thousands of processes per second on modern hardware, if not hundreds
of thousands, and of hosting at least thousands, if not millions, of
live processes.  So maybe it's possible, even without changing
kernels, to break up new software systems into many tiny processes
like httpdito, with a working set of under 100 kilobytes.  Then the
kernel can provide fault isolation between the processes, and
integrating objects written in different languages should be little
harder than integrating objects written in the same language.

Erlang has been exploring this kind of design, and the results have
been promising.  But Erlang is slow, even with HIPE, and you have to
write your programs in Erlang's weird language (which, to be fair, is
pretty nice, even if it's weird.)

Imagine, for example, a feature-complete web server built this way.  A
front-end process, functionally equivalent to a load balancer, parses
the initial HTTP request and sends it to a router process, encoded
with something like protobufs or Thrift.  The router process examines
the path and decides whether to send it to a filesystem-server backend
like httpdito, a PHP server, or something else.  httpdito sends the
response back to the load balancer, which posts a message to the
access-logger process and sends the response back to the HTTP client.

Perhaps the parts of the system rendezvous with one another using the
filesystem, for example with Unix-domain sockets or named pipes; and
perhaps a supervisor process is responsible for starting up enough
backend processes to handle the current demand.

Rather than shared libraries, in this world, you would have server
processes that accept and respond to requests.

In essence, this is the original Unix architecture from the 1970s,
adapted to the much larger machines of today by using many more
processes, rather than only slightly more processes, each containing
many times the memory of a PDP-11.

This kind of architecture has major potential pitfalls, as the GNU
Hurd team found.  It's almost a distributed system, with many of the
difficulties of debugging distributed systems and making them robust.
On the plus side, we have a lot more knowledge now about how to do
that, thanks to Roy Fielding, the Erlang folks, Jeff Dean, and so on.

Are you insane?
---------------

Yes.  It's 2013.  You have to be insane to write application software
in assembly language.

You got the bonus package: here's the executable
------------------------------------------------

Feed this text to the uudecode command.

begin 755 httpdito.gz
M'XL("(NYM5("`W-E<G9E<@"556]H&V48?]/<=5TD=K6K3KK95YC80G))I!_&
MF@S7KG7BG]6NPJ;@N"17&YO>97?O,0M9TQ$+AK@/@E88HAO=ARJE#AQR26!F
MG;H6K+A^$&'##P6A)>Y+/[2K0\_GN21-VDW$%^Y][_D]O^=WS_L\[]TENE_N
ML=ELI#QJB)V@]=HYKJX=UGQM$6\G%#RMA"<<V62/<75X?<T1@A>/V"YB^:^"
M??4CK@ZONP#L!9W:DI\C#Q\<^7??]K'^:V\Z<'F"J_/ESQ\U5V?6;[V8FDO-
MIIT6=KNKRT"EA;'L6[!D,(LJ\,3R;EA]BYV9QS[>XKCA6\PX4'UC/,]V]O8M
M_V&:YALGQF:][S=-H3"$3&U]P(WET\`!P+!9=J81)'-X;PQ8P$T>4VH!-'F/
MQ&^EOD^Y76!D<,H]NIWW?(GW1<I]$EDXY6JVLX9+K&3*S9"%4X[;SHH76?4?
MG(0,4^YQ9.*4XRO,R>D)BW2VH1CS(?B7/__;-`L_^1:3@0M@$L9G<3^^>=]M
M8X<5!0'LJ7B=R:._\$ORWAJK+\9_AO&G(-ZH*3,3#>S9HG,*G9W@3._#QQJU
M%L5THLK*FY!DVHUXCF)^S:6=?`<(=OP2WF1-.R%ML_P<W*>NIPJ&O6J_7Q7W
M>[9Q^;>_8,.%DOZ*#919F]U2FL3)<#J<CFNHD7:BG5RQCU[LJ&5\S])&X9/"
MT]D9/#3?@!R(%-M<46-0"RA$,;UD`./Y^((O?X5^N["IN#HS/L^,'UXPR34$
M4^O3_F=T3Z;&.G:^_/3U)<[EC??Z=X[N=NWP.^(]_OI1AXO"G=]?GSA@FUVZ
MD_KYR_?R^LV+&.\7V'Z_A]&!K$=8XQ$QV1Y+>:W'9C+G0,?CS-Y36%WIA*UF
ML%HY=!I\N0?QWR?G+D#Y6G!CX8FJC9T/+%3>HN3&1F+_N<3]038<U9O'$G\*
M>E-V#\;<J2Y&H3'[)(+):J$L+IE/MT#X.F>L.N_#QY>Z!?F,-D_^"$"Z"6>+
M41&?+Y%KMQ8>:EY\P\A!&E64(:K':$2F'EU3/1$Y%-7#DD?4AMUO2[*D1D(>
M255E11BDHAS^#Y([*&H2,"TCX,`FE3]QFA(:DEAKVR9@L[Z1A+1"/3!?36+(
M46*L]?BQ4WW=KQ_O/GSD2%^;19:`M,L*:)E`.QB1PZU%%W8F&M&8))>`*U"#
M/%QB*"3%\(D$3OT9,<)B$8P94-0A6%1)1.MH?W^OQR=X:;NWG;ZJP,%0=#GL
M='0I,B@R-QN)20<ID]YE'FQE!PT-BBJD&M#9@/L`GGX_B["H=`CCY7(\G#`+
M=/ACAPZ'F"Y&HR,NR@8E5:)G1(V^HVN,:LJP1$4U&&&JJ(Y@S115H/05<20H
M.4847:6J=%J7@(DA03$L.#:S?<[KI<=>JJ397TDS%A4C\D/R)$30!C6F,C%(
M!&02(2PRD0A!32/_8SQ2ZAV>)0/^60:L]^U;_STX&DH\=%G_+W#<M3W(>X)8
?#;)X^.VY!,(N[D%>6:L\D+>WRBY+_P-R"FIKB`<`````
`
end
-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol
