Brian,

I can't write much right now, but we are running our HTTP server on AFS
with no problems.  I am running on a Sun SPARC 5 with a 2GB cache under
Solaris 2.5 and AFS 3.4.  Again, no problems at this time...Mic

----------
> From: Brian W. Spolarich <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: AFS/HTTP Server Performance
> Date: Tuesday, September 17, 1996 8:44 AM
> 
> 
>   I'm attaching a writeup of some tests I did a few months ago using AFS
> to serve data to a heavily-loaded HTTP server.  At that time I did not
> have an opportunity to follow up very much on some of the unanswered
> questions, and didn't feel like I had characterized the situation clearly
> enough.
> 
>   The problems that we saw happened when we started the test scenario
> against a "cold" cache with a moderate (60Mb) amount of data to retrieve.
> In this scenario, the tests would proceed for a few minutes, and then the
> AFS client/HTTP server would lose contact with the AFS server.  Adding a
> second database server did not solve the problem.
> 
>   If anyone has any thoughts on this, I'd appreciate hearing them.  Some
> differences between these somewhat informal tests that I ran back in May
> and what we're going to do now include a change in operating system
> (Solaris 2.5.1 instead of 2.4), and Ultras (on 10Mb/sec ethernet) instead
> of Sparc 20s.  If we can get it, we'll try to run the enhanced release of
> 2.5.1 (for Web servers) and see what happens.
> 
>   I believe tcp_max_conn_req was set to 128 during this test, but as I
> said, I did this somewhat informally and didn't collect all of the data
> that I should have. :-]
> 
>   What I'm looking for is responses like "Yeah, we saw similar behaviour
> when we did something like this and fixed it by <blah>", or "You might
> try bumping up <bleh>".
> 
>   Transarc didn't really provide much help (although they tried to be
> helpful) as the support guy didn't feel like he had enough info to really
> understand the problem. 
> 
>   -brian
> 
> ---------------------------------------------------------------------------
> | APPENDIX A.  AFS/HTTP SERVER PERFORMANCE EVALUATION RESULTS
> |
> ---------------------------------------------------------------------------
> 
> 
> Overview
> --------
> This test scenario is designed to do some stress testing of an HTTP (Web)
> server reading its data out of AFS versus local disk.  The HTTP client is
> running some home-grown software (webbash) which allows us to fork off
> multiple simultaneous threads which act as HTTP clients, requesting
> documents from a list.  The document testbed is a set of files containing
> random ASCII characters ranging in size from 0k to 1Mb, and totals
> initially 2.2Mb.  Copying this testbed into subdirectories [a..z] yields
> a testbed of 60Mb.
> 
> Test Environment 
> ---------------- 
> The test environment consists of three machines isolated from the rest of
> the network via an ethernet hub.  I connect to the machines via a Cisco
> terminal server which is not isolated from the local network.  All
> machines are running Solaris 2.4 w/o the recommended patches. :-]
> 
>         [prod-1b]       AFS Fileserver  Sparc20/128Mb
>                         AFS Client
> 
>         [  log  ]       AFS Client /    Sparc20/64Mb
>                         HTTP Server
> 
>         [prod-2a]       HTTP Client     Sparc20/128Mb
>                         ("bash" program)
>                         AFS Client
>                         (later becomes an additional AFS file/dbserver)
> 
> 
> Explanation of the Fields Below
> -------------------------------
> Data Source - local disk or AFS
> Client Threads - number of simultaneous threads created by "webbash".
>         Each thread reads the file list and randomizes it.  Each thread
>         will time out after ten seconds if it does not receive data from
>         the HTTP server.
> Iterations - number of times the test suite iterates over the list of
>         files to retrieve.
> Cache/Data Ratio - the ratio of AFS cache size to data testbed size.
> Daemons (afsd) - number of extra afsd processes to run to handle service
>         requests.
> Volume Type - ReadWrite or ReadOnly.  Lack of per-file callbacks on
>         ReadOnly volumes should reduce AFS client-server traffic.
> Throughput (Mb/Hr) - As reported by webbash in megabytes per hour.
> HTTP Ops/Sec - Number of HTTP operations serviced per second.  This is
>         probably the real "throughput benchmark".
> Comments - Describes various events that may have happened during the
>         test.  See the key below the table for details on this.
> 
>                         Cache/                          Throug-
> Data    Client  Itera-  Data    Data    Daemons Volume  hput    HTTP    Com-
> Source  Threads tions   Ratio   Size    (afsd)  Type    (Mb/Hr) Ops/Sec ments
> -------------------------------------------------------------------------------
> local   10      10      n/a     2.2Mb   n/a     n/a     4048    15.01
> afs     10      10      34/1    2.2Mb   3       RW      4053    15.04
> afs     20      10      34/1    2.2Mb   3       RW      3959    14.69   #
> afs     50      10      34/1    2.2Mb   3       RW      3795    13.98   #+
> afs     100     10      34/1    2.2Mb   3       RW      4306    23.06   #+
> afs     200     10      34/1    2.2Mb   3       RW      5208    31.31   #+
> afs     10      5       1.25    60Mb    3       RW      3793    37.59   !@
> afs     1       5       1.25    60Mb    3       RW      1504     5.57
> afs     3       3       1.25    60Mb    3       RW      3155    11.74   $%
> afs     5       2       1.25    60Mb    5       RO      2927    12.53   !@
> 
> # A second file/db server was added to the AFS cell at this point.
> (these were warm cache reads)
> afs     10      1       1.25    60Mb    5       RO      3957    14.68   #$
> # Then I flushed the volume from the cache (i.e. cold reads)
> afs     10      2       1.25    60Mb    5       RO      1524     8.70   !+#+@+
> 
> Comments Key:
> ! - Cell "Thrashing":  AFS client loses contact with fs and volserver.
>         AFS client freezes.
> @ - HTTP Server returned "403 Not Found" Errors (i.e. file did not
>         exist).
> # - Large files (500K+) took more than 10 seconds to transfer.
> $ - Timeout (>10 sec) trying to read data.
> % - HTTP Server returned "504" Error.
> 
> + - Modifier to above.  This event occurred many (>10, generally) times.
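For reference, the two throughput columns in the table above are straightforward to derive from raw totals. A minimal sketch (the exact counters webbash reports are not shown in the writeup, so these function and parameter names are assumptions):

```python
# Derive the table's two throughput columns from raw totals
# (hypothetical helper names; webbash's real output format is not shown).

def throughput_mb_per_hr(total_bytes, elapsed_sec):
    """Megabytes transferred per hour, as in the Throughput (Mb/Hr) column."""
    return (total_bytes / 1e6) / (elapsed_sec / 3600.0)

def http_ops_per_sec(total_requests, elapsed_sec):
    """Completed HTTP operations per second, as in the HTTP Ops/Sec column."""
    return total_requests / elapsed_sec
```

So, for example, a run that moves one million bytes in an hour scores 1.0 Mb/Hr, and 150 requests completed in 10 seconds scores 15.0 ops/sec, roughly the local-disk baseline in the table.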
> 
> Problems Observed
> -----------------
> 
> Interestingly, reading data from a small to moderate local AFS cache
> appears to be slightly faster than local disk.  This seems to differ from
> work done by Michael Stolarchuck at U-M which suggested ways to improve
> AFS cache read performance.  The difference is small, and may be a
> statistical anomaly.
> 
> The biggest problem we've seen with this situation has been timeouts
> between the AFS client and server(s) as the data is being fetched into
> the cache.  Generally what happens is that I can observe an initial burst
> of traffic between the AFS client and server, during which I see a large
> number of network collisions.  After a period of time, the traffic will
> come to a halt (I'm observing the lights on the hub).  The AFS
> client/HTTP server will at this point freeze for a while, and will
> eventually report that the server(s) for the cell are unavailable.  This
> will cause the HTTP server to report that the files the various threads
> are trying to fetch do not exist, which will generate the "403 Not Found"
> errors.
> 
> AFS timeouts generally look like this:
> 
> afs: Lost contact with file server 198.83.22.104 in cell test.ans.net
> (all multi-homed ip addresses down for the server)
> afs: Lost contact with file server 198.83.22.104 in cell test.ans.net
> (all multi-homed ip addresses down for the server)
> afs: file server 198.83.22.104 in cell test.ans.net is back up
> (multi-homed address; other same-host interfaces may still be down)
> afs: file server 198.83.22.104 in cell test.ans.net is back up
> (multi-homed address; other same-host interfaces may still be down)
> 
> Once the data is in the cache, the timeouts do not happen.  With a larger
> number of clients (threads), the timeouts seem to happen more frequently.
> 
> When the timeouts do happen, the AFS client machine appears frozen at the
> serial console.  After a second file/dbserver was added to the cell,
> timeouts occurred for both AFS servers, although the initial timeouts
> were only to the new fileserver, which held one of the ReadOnly copies of
> the data I was trying to fetch.
> 
> Only the AFS client itself seems to pause during this time.  The main AFS
> server and HTTP client do not seem to be affected.
> 
> During one test, the AFS client became unpingable and had to be rebooted
> by sending a break to the console.
> 
> Conclusions
> -----------
> The artificiality of the test scenario makes it hard to determine whether
> these problems would occur in Real Life.  Multiple, parallel HTTP GET
> streams coming at this rate may not be a realistic scenario.  I don't
> know "how much" traffic each thread really represents in real terms.
> 
> Still, we did observe some definite problems with the AFS client that may
> or may not be tunable.  Adding a second file/dbserver initially appeared
> to help, but after flushing the volume from the cache it turned out to
> have made no real difference.  Increasing the number of afsd processes
> helped somewhat (it seemed to take longer for the client to lose contact
> with the server), but didn't improve the situation very much.
> 
> I know that AFS is in use under some relatively heavy usage conditions at
> U-M and NCSA.  I know that U-M's WWW servers are considered "slow" by
> their user populace (I helped administer them for a short time), but this
> is probably due to a number of factors, including network congestion,
> overburdened fileservers, and the growing pains of a very large AFS cell.
> On the other hand, I don't generally consider NCSA's site to be
> particularly slow.
> 
> Perhaps a more rational way of getting the cache seeded beforehand would
> help.
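One simple way to seed the cache, as suggested above, is to walk the testbed through the AFS mount point and read every file once before starting the run, so the HTTP server's first requests already hit a warm cache. A minimal sketch (the testbed path and buffer size are assumptions, not anything from the report):

```python
# Sketch: pre-warm the AFS cache by reading the whole testbed once.
# The root path would be something like the testbed's AFS directory
# (hypothetical here); bufsize is an arbitrary read granularity.
import os

def warm_cache(root, bufsize=64 * 1024):
    """Read every file under root once; returns total bytes read."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            with open(os.path.join(dirpath, name), "rb") as f:
                while True:
                    chunk = f.read(bufsize)
                    if not chunk:
                        break
                    total += len(chunk)
    return total  # bytes pulled through the cache
```

Run against the 60Mb testbed this would still trigger the same cold-fetch traffic, but sequentially and before the HTTP load starts, rather than under dozens of parallel client threads.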
> 
> I'd like to see some response from Transarc on this. 
> 
> 
> --
>        Brian W. Spolarich - ANS - [EMAIL PROTECTED] - (313)677-7311
>                 Look both ways before crossing the Net.
> 
> 
