Do you notice any problems with such a large cache? AFS cache
performance is going to be degraded by unhashed directory lookups on
standard UFS filesystems if the directory is very large (i.e. the number
of files > 10,000 or so).
-b.
On Thu, 19 Sep 1996, Mickey Beddingfield wrote:
> Brian,
>
> I can't write much right now, but we are running our http server on AFS
> with no problems. I am running on a Sun SPARC 5 with a 2GB cache under
> Solaris 2.5 and AFS 3.4. Again, no problems at this time...Mic
>
> ----------
> > From: Brian W. Spolarich <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Subject: AFS/HTTP Server Performance
> > Date: Tuesday, September 17, 1996 8:44 AM
> >
> >
> > I'm attaching a writeup of some tests I did a few months ago using AFS
> > to serve data to a heavily-loaded HTTP server. At that time I did not
> > have an opportunity to follow up very much on some of the unanswered
> > questions, and didn't feel like I had characterized the situation
> > clearly enough.
> >
> > The problems that we saw happened when we started the test scenario
> > against a "cold" cache with a moderate (60Mb) amount of data to
> > retrieve. In this scenario, the tests would proceed for a few minutes,
> > and then the AFS client/HTTP server would lose contact with the AFS
> > server. The addition of a second database server did not solve the
> > problem.
> >
> > If anyone has any thoughts on this, I'd appreciate hearing them. Some
> > differences between these somewhat informal tests that I ran back in
> > May and what we're going to do now include a change in operating
> > system (Solaris 2.5.1 instead of 2.4), and Ultras (on 10Mb/sec
> > ethernet) instead of Sparc 20s. If we can get it, we'll try to run
> > the enhanced release of 2.5.1 (for Web servers) and see what happens.
> >
> > I believe tcp_max_conn_req was set to 128 during this test, but as I
> > said, I did this somewhat informally and didn't collect all of the data
> > that I should have. :-]
> >
> > What I'm looking for is responses like "Yeah, we saw similar behaviour
> > when we did something like this and fixed it by <blah>", or "You might
> > try bumping up <bleh>".
> >
> > Transarc didn't really provide much help (although they tried to be
> > helpful) as the support guy didn't feel like he had enough info to really
> > understand the problem.
> >
> > -brian
> >
> >
> > ---------------------------------------------------------------------------
> > | APPENDIX A. AFS/HTTP SERVER PERFORMANCE EVALUATION RESULTS             |
> > ---------------------------------------------------------------------------
> >
> >
> > Overview
> > --------
> > This test scenario is designed to do some stress testing of an HTTP
> > (Web) server reading its data out of AFS versus local disk. The HTTP
> > client is running some home-grown software (webbash) which allows us
> > to fork off multiple simultaneous threads which act as HTTP clients,
> > requesting documents from a list. The document testbed is a set of
> > files containing random ASCII characters ranging in size from 0k to
> > 1Mb, and totals initially 2.2Mb. Copying this testbed into
> > subdirectories [a..z] yields a testbed of 60Mb.
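webbash itself is home-grown and not shown in the post, but the behaviour described above (N forked threads, each shuffling a shared document list and fetching each URL with a ten-second timeout) can be sketched roughly as follows. All names here are hypothetical, and this is an approximation of the described client, not webbash itself:

```python
# Rough sketch of a webbash-style stress client: N threads, each shuffling
# the shared document list and fetching every URL, timing out after 10 s.
import random
import threading
import urllib.request

def run_client(base_url, paths, iterations, timeout=10.0, stats=None):
    """One simulated HTTP client: shuffle the document list, fetch each URL."""
    order = list(paths)
    for _ in range(iterations):
        random.shuffle(order)
        for path in order:
            try:
                with urllib.request.urlopen(base_url + path,
                                            timeout=timeout) as resp:
                    body = resp.read()
                if stats is not None:
                    stats.append(len(body))      # bytes transferred
            except OSError:
                if stats is not None:
                    stats.append(-1)             # timeout or connect failure

def stress(base_url, paths, threads=10, iterations=10):
    """Fork off `threads` simultaneous clients; return per-request byte counts."""
    stats = []                                   # list.append is thread-safe
    workers = [threading.Thread(target=run_client,
                                args=(base_url, paths, iterations, 10.0, stats))
               for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return stats
```

A run like `stress(server, paths, threads=10, iterations=10)` corresponds to one row of the results table below.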
> >
> > Test Environment
> > ----------------
> > The test environment consists of three machines isolated from the rest of
> > the network via an ethernet hub. I connect to the machines via a Cisco
> > terminal server which is not isolated from the local network. All
> > machines are running Solaris 2.4 w/o the recommended patches. :-]
> >
> > [prod-1b] AFS Fileserver Sparc20/128Mb
> > AFS Client
> >
> > [ log ] AFS Client / Sparc20/64Mb
> > HTTP Server
> >
> > [prod-2a] HTTP Client Sparc20/128Mb
> > ("bash" program)
> > AFS Client
> > (later becomes an additional AFS file/dbserver)
> >
> >
> > Explanation of the Fields Below
> > -------------------------------
> > Data Source - local disk or AFS
> > Client Threads - number of simultaneous threads created by "webbash".
> > Each thread reads the file list and randomizes it. Each thread will
> > time out after ten seconds if it does not receive data from the
> > HTTP server.
> > Iterations - number of times the test suite iterates over the list of
> > files to retrieve.
> > Cache/Data Ratio - the ratio of AFS cache size to data testbed size.
> > Daemons (afsd) - number of extra afsd processes to run to handle service
> > requests.
> > Volume Type - ReadWrite or ReadOnly. Lack of per-file callbacks on
> > ReadOnly volumes should reduce AFS client-server traffic.
> > Throughput (Mb/Hr) - As reported by webbash, in megabytes per hour.
> > HTTP Ops/Sec - Number of HTTP operations serviced per second. This is
> > probably the real "throughput benchmark".
> > Comments - Describes various events that may have happened during the
> > test. See the key below the table for details.
> >
> >                        Cache/                          Through-
> > Data   Client  Itera-  Data    Data   Daemons  Volume  put      HTTP     Com-
> > Source Threads tions   Ratio   Size   (afsd)   Type    (Mb/Hr)  Ops/Sec  ments
> > ----------------------------------------------------------------------------
> > local    10      10     n/a    2.2Mb   n/a      n/a     4048     15.01
> > afs      10      10     34/1   2.2Mb    3       RW      4053     15.04
> > afs      20      10     34/1   2.2Mb    3       RW      3959     14.69   #
> > afs      50      10     34/1   2.2Mb    3       RW      3795     13.98   #+
> > afs     100      10     34/1   2.2Mb    3       RW      4306     23.06   #+
> > afs     200      10     34/1   2.2Mb    3       RW      5208     31.31   #+
> > afs      10       5     1.25   60Mb     3       RW      3793     37.59   !@
> > afs       1       5     1.25   60Mb     3       RW      1504      5.57
> > afs       3       3     1.25   60Mb     3       RW      3155     11.74   $%
> > afs       5       2     1.25   60Mb     5       RO      2927     12.53   !@
> >
> > # A second file/db server was added to the AFS cell at this point.
> > # (these were warm cache reads)
> > afs      10       1     1.25   60Mb     5       RO      3957     14.68   #$
> > # Then I flushed the volume from the cache (i.e. cold reads)
> > afs      10       2     1.25   60Mb     5       RO      1524      8.70   !+#+@+
> >
> > Comments Key:
> > ! - Cell "Thrashing": AFS client loses contact with fs and volserver.
> > AFS client freezes.
> > @ - HTTP Server returned "404 Not Found" errors (i.e. file did not
> > exist).
> > # - Large files (500K+) took more than 10 seconds to transfer.
> > $ - Timeout (>10 sec) trying to read data.
> > % - HTTP Server returned "504" errors.
> >
> > + - Modifier to the above. This event occurred many (>10, generally)
> > times.
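As a quick consistency check on the table's two throughput columns: Mb/Hr divided by (ops/sec x 3600 sec/hr) gives the average document size per HTTP operation, which can be compared against the 0k-1Mb testbed. A small sketch (the helper name is mine, not from the post):

```python
# Cross-check the two throughput columns: average megabytes per HTTP
# operation, converted to kilobytes, from a row's Mb/Hr and ops/sec figures.
def avg_kb_per_op(mb_per_hr, ops_per_sec):
    """Average kilobytes transferred per HTTP operation."""
    return mb_per_hr / (ops_per_sec * 3600.0) * 1024.0

# First row ("local ... 4048 15.01"):
print(round(avg_kb_per_op(4048, 15.01)))   # ~77 KB per request
```

Roughly 77 KB per request, which is plausible for a testbed of files ranging from 0k to 1Mb.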
> >
> > Problems Observed
> > -----------------
> >
> > Interestingly, reading data from a small to moderate local AFS cache
> > appears to be slightly faster than local disk. This seems to differ from
> > work done by Michael Stolarchuck at U-M which suggested ways to improve
> > AFS cache read performance. The difference is small, and may be a
> > statistical anomaly.
> >
> > The biggest problem we've seen with this situation has been timeouts
> > between the AFS client and server(s) as the data is being fetched into
> > the cache. Generally what happens is that I can observe an initial
> > burst of traffic between the AFS client and server, during which I see
> > a large number of network collisions. After a period of time, the
> > traffic will come to a halt (I'm observing the lights on the hub). The
> > AFS client/HTTP server will at this point freeze for a while, and will
> > eventually report that the server(s) for the cell are unavailable.
> > This will cause the HTTP server to report that the files the various
> > threads are trying to fetch do not exist, which will generate the "404
> > Not Found" errors.
> >
> > AFS timeouts generally look like this:
> >
> > afs: Lost contact with file server 198.83.22.104 in cell test.ans.net
> > (all multi-homed ip addresses down for the server)
> > afs: Lost contact with file server 198.83.22.104 in cell test.ans.net
> > (all multi-homed ip addresses down for the server)
> > afs: file server 198.83.22.104 in cell test.ans.net is back up
> > (multi-homed address; other same-host interfaces may still be down)
> > afs: file server 198.83.22.104 in cell test.ans.net is back up
> > (multi-homed address; other same-host interfaces may still be down)
> >
> > Once the data is in the cache, the timeouts do not happen. With a
> > larger number of clients (threads), the timeouts seem to happen more
> > frequently.
> >
> > When the timeouts do happen, the AFS client machine appears frozen on
> > the serial console. After a second file/dbserver was added to the
> > cell, timeouts occurred for both AFS servers, although the initial
> > timeouts were only to the new fileserver, which held one of the
> > ReadOnly copies of the data I was trying to fetch.
> >
> > Only the AFS client itself seems to pause during this time. The main AFS
> > server and HTTP client do not seem to be affected.
> >
> > During one test, the AFS client became unpingable and had to be rebooted
> > by sending a break to the console.
> >
> > Conclusions
> > -----------
> > The artificiality of the test scenario makes it hard to determine
> > whether or not these problems would occur in Real Life. Multiple,
> > parallel HTTP GET streams coming at this rate may not be a realistic
> > scenario. I don't know "how much" traffic each thread really
> > represents in real terms.
> >
> > Still, we did observe some definite problems with the AFS client that
> > may or may not be tunable. Adding a second file/dbserver initially
> > appeared to help the situation, but flushing the volume from the cache
> > showed that in reality it had done nothing. Increasing the number of
> > afsd processes helped somewhat (it seemed to take longer for the
> > client to lose contact with the server), but didn't improve the
> > situation very much.
> >
> > I know that AFS is in use under some relatively heavy usage conditions
> > at U-M and NCSA. I know that U-M's WWW servers are considered "slow"
> > by their user populace (I helped administer them for a short time),
> > but this is probably due to a number of factors, including network
> > congestion, overburdened fileservers, and the growing pains of a very
> > large AFS cell. On the other hand, I don't generally consider NCSA's
> > site to be particularly slow.
> >
> > Perhaps a more rational way of getting the cache seeded beforehand would
> > help.
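One simple way to seed the cache, sketched here as an assumption rather than anything from the tests: walk the testbed once, single-threaded, reading every file so the AFS client pulls each one into its cache before the parallel HTTP load begins. The function name and chunk size are my own choices:

```python
# Minimal cache pre-seeding sketch: read every regular file under `root`
# once, in fixed-size chunks, so the AFS client caches it before the
# stress run starts.
import os

def seed_cache(root, chunk=64 * 1024):
    """Read every file under `root`; return (files_read, bytes_read)."""
    files = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                while True:
                    block = f.read(chunk)
                    if not block:
                        break
                    total += len(block)
            files += 1
    return files, total
```

For the 60Mb testbed this would mean one slow sequential pass (e.g. `seed_cache("/afs/test.ans.net/testbed")`, path hypothetical) instead of ten threads all faulting cold data in at once.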
> >
> > I'd like to see some response from Transarc on this.
> >
> >
> > --
> > Brian W. Spolarich - ANS - [EMAIL PROTECTED] - (313)677-7311
> > Look both ways before crossing the Net.
> >
> >
>
--
Brian W. Spolarich - ANS - [EMAIL PROTECTED] - (313)677-7311
And if I die before I learn to speak,
Will money pay for all the days I lived awake but half asleep?