Paul Richards said in "Re: patches for test / review":

> Richard, do you want to post a summary of your tests?

Well, I'd best post the working draft of my report on the issues
I've seen, as I'm not going to have time to work on it in the near
future, and it raises serious performance issues that are best
looked at soon.  Note that none of these detailed results are from
-current, but Paul Richards has checked that these issues are still
present there.

There are still issues to be explored, so this report isn't
complete or polished.  It has grown in three stages:

- initial Berkeley DB (random I/O) performance problem analysis
- side-issue of ATA outperforming SCSI systems at my synthetic benchmark
- interesting dramatic performance changes from changing the seek
  multiple and I/O block size by one byte from 8192

Note I've cc'd freebsd-fs, as this raises issues in the filesystem
area.  I've also changed the subject since I think there are broader
issues here than the clustering algorithm, and this email is rather
large to drop into an ongoing discussion.

The benchmark program source code is available and easy to run;
links are at the bottom of the report.

I don't have an explanation for the behaviour I have been measuring,
but I hope these quite extensive results will enable someone to
explain and perhaps suggest improvements.

        Richard.


Folks,

I appear to have found a serious performance problem with random
access file I/O in FreeBSD, and have a simple C benchmark program
which reproducibly demonstrates it.  Since the benchmark shows
very poor non-async performance, this touches on the age-old
sync/async filesystem argument, and the FreeBSD vs Linux debates.

I originally observed this problem with perl DB_File (Berkeley DB),
and with the help of truss have synthesised this benchmark as a
much simplified model of heavy Berkeley DB update behaviour.  Quite
probably other database-like software will have similar performance
issues.

This issue appears to be related to the traditional BSD behaviour
of immediately scheduling full disc block writes.  I think this
benchmark must be showing up a related bug.  But it is conceivable
that this is intended noasync behaviour, in which case the implications
need to be thought through.

The program does simple random I/O within a 64KB file, which I
hope should be fully cached, so that hardly any real I/O need be
done.  Other than mtime, this program makes no file meta-data or
directory changes, and the file remains the same size.

The file is treated as 8 8KB blocks, and 10,000 lseek/read/lseek/write
block updates are done, visiting the blocks in the order
0,5,2,7,4,1,6,3,0,..., much like updating 10,000 non-localised
Berkeley DB file records.
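
For concreteness, here is a minimal sketch of what that core loop
does, reconstructed from the description above.  It is NOT the
actual seekreadwrite.c (that's linked at the end); the file name
and setup code are illustrative only.  BLOCKSIZE and WRITESIZE
mirror the compile-time knobs used later in this report:

/*
 * Sketch of the benchmark core: 10,000 lseek/read/lseek/write
 * updates over a 64KB file treated as 8 8KB blocks.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef BLOCKSIZE
#define BLOCKSIZE 8192          /* seek multiple */
#endif
#ifndef WRITESIZE
#define WRITESIZE 8192          /* bytes read & written per update */
#endif
#define NBLOCKS   8             /* 8 x 8KB = 64KB file */
#define NUPDATES  10000

static char buf[WRITESIZE];

int
main(void)
{
        char block8k[8192];
        int fd, i, block = 0;

        if ((fd = open("testfile", O_RDWR | O_CREAT, 0644)) < 0) {
                perror("open");
                exit(1);
        }

        /* size the file to 64KB up front */
        memset(block8k, 'x', sizeof(block8k));
        for (i = 0; i < NBLOCKS; i++)
                write(fd, block8k, sizeof(block8k));

        /* visit the blocks in the order 0,5,2,7,4,1,6,3,0,... */
        for (i = 0; i < NUPDATES; i++) {
                off_t off = (off_t)block * BLOCKSIZE;

                lseek(fd, off, SEEK_SET);
                read(fd, buf, WRITESIZE);
                lseek(fd, off, SEEK_SET);
                write(fd, buf, WRITESIZE);
                block = (block + 5) % NBLOCKS;
        }
        close(fd);
        return 0;
}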

Using a tiny 64KB file is just to simplify and make a point.  My
original perl performance problems were with multi-megabyte files,
but still small enough to be fully cached.

I ran this on a large range of lightly loaded or idle machines,
which gave reproducible results.  Results and a summary of the
machines, which unless otherwise noted use SCSI 7200 RPM discs and
Adaptec controllers, are given in descending performance order
below.


  OS                                           Elapsed secs, system

  FreeBSD 3.2-RELEASE, async mount              <1  (cheap ATA C433, 5400 RPM)
  Linux 2.2.13                                  <1  (Dell 1300, PIII 450MHz)
  Linux 2.0.36                                  3   (old ATA P200, 5400 RPM)
  Linux 2.0.36, sync [meta-data] mount          3   (old ATA P200, 5400 RPM)
  SunOS 5.5.1 (Solaris 2.5.1)                   7   (old SS4/110, 5400 RPM)
  FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=5       15  (PII 450MHz, 512MB, 10k RPM)
  FreeBSD 2.2.7-RELEASE+CAM                     21  (PII 400MHz, 512MB)
  FreeBSD 2.1.6.1-RELEASE                       32  (old P100, 64MB)
  FreeBSD 2.2.7-RELEASE+CAM, ccd stripe=2       39  (PII 400MHz, 512MB)
  FreeBSD 3.4-STABLE, vinum stripe+mirr=4       41  (dual PIII 500MHz, 1GB)
  FreeBSD 3.4-STABLE                            41  (dual PIII 500MHz, 1GB)
  FreeBSD 2.1.6.1-RELEASE, ccd stripe=2         52  (old P100, 64MB)
  FreeBSD 3.3-RELEASE, ccd stripe=2             53  (Dell 1300, PIII 450MHz)
  FreeBSD 3.2-RELEASE                           55  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.2-RELEASE, noatime mount            55  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.2-RELEASE, noclusterr mount         55  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.2-RELEASE, noclusterw mount         58  (cheap ATA C433, 5400 RPM)
  FreeBSD 3.3-RELEASE                           63  (Dell 1300, PIII 450MHz)
  FreeBSD 3.3-RELEASE, softupdates              63  (Dell 1300, PIII 450MHz)
  FreeBSD 3.2-RELEASE, sync mount               105 (cheap ATA C433, 5400 RPM)


I also have a range of results from a cheap deskside ATA (IDE)
Dell system running FreeBSD 3.3-RELEASE, with a range of wd(4)
flags.  This system exhibits much better performance at this
benchmark than the SCSI systems above, perhaps related to better
DMA ability.

ATA being faster than SCSI on this benchmark is a bit of a side-issue
to the thrust of this report, but the performance numbers may give
hints for diagnosing the problem.

    Dell Dimension XPS T450 440BX
    IBM-DPTA-372730 (Deskstar 34GXP, 7200RPM, 2MB buffer)
    default mount options

        wd(4) flags                            Elapsed secs

        0x0000                                  19
        0x00ff, multi-sector transfer mode      17
        0x8000, 32bit transfers                 13
        0x2000, bus-mastering DMA               4
        0xa0ff, BM-DMA+32bit+multi-sector       4
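
For reference: wd(4) flags like these are set on the wdc controller
line in the kernel config file (or at boot time via userconfig).
From memory the 2.2.x/3.x config line looks something like this,
though the exact syntax varies between releases; check LINT for
your version:

        controller      wdc0    at isa? port "IO_WD1" bio irq 14 flags 0xa0ff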


Note that Linux performs about the same for [meta-data] sync &
async mounts, which is as I'd expect for this program.  But FreeBSD
performance is hugely affected by async, sync or default (meta-data
sync) filesystem mounts, with noclusterw unsurprisingly making it
somewhat worse.

One interesting observation is that for mounts that are none of
sync, async or noclusterw (i.e. default mounts), ~8,750 I/O
operations are done, which is 7/8ths of the 10,000 writes.  If I
change the program to use 16 blocks there are ~9,375 I/O operations,
which is 15/16ths of the 10,000 writes.  Guessing, this is as if
writes are forced for all blocks but one.

With async filesystem mounts very little I/O occurs, and with
noclusterw there are ~10,000 operations, matching the number of
writes.

With sync it's ~20,000 operations, matching the total of reads and
writes.  This demonstrates another aspect of the bug: sync behaviour
should cause only ~10,000 operations; the reads aren't being cached.

A quick test suggests softupdates makes no difference, as would
be expected.

Looking at mount output on FreeBSD 3, the substantial part of the
I/O is async in all cases other than sync mounts, as expected.


Another aspect of this issue is the effect of changing the seek
block size and write block size by 1 byte either way from 8192,
thus doing block-unaligned I/O.  In some cases this changes the
amount of I/O recorded by getrusage to zero, and drops the elapsed
time from half a minute or so to less than 1 second.
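
The I/O counts here and below come from getrusage(2); csh's time
builtin prints the ru_inblock and ru_oublock fields as the
"NNN+NNNio" column in the results that follow.  Should anyone want
to instrument the benchmark directly, a sketch (report_io() is a
hypothetical helper, not part of my test program):

/* Print the block I/O counters csh's time shows as "in+out io". */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>

void
report_io(const char *tag)
{
        struct rusage ru;

        if (getrusage(RUSAGE_SELF, &ru) == 0)
                printf("%s: %ld block input ops, %ld block output ops\n",
                    tag, ru.ru_inblock, ru.ru_oublock);
}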

Thanks to Paul Richards for noticing this.  I've not spent much
time researching this, so can only present my small set of
measurements.  To do these tests you have to recompile my test
program each time, e.g.

        gcc -O4 -DBLOCKSIZE=8191 -DWRITESIZE=8193 seekreadwrite.c
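
and then run the result under csh's time builtin to get output
like that below.  Assuming a program that takes no arguments, as
in the sketch earlier, that's:

        time ./a.out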

Sorry it's that crude.  These results are from a FreeBSD
2.2.7-RELEASE+CAM, ccd stripe=2 (PII 400MHz, 512MB) system,
though exactly the same pattern is apparent with 3.4-STABLE.
"****" indicate sub-second "zero I/O" results.

BLOCKSIZE   WRITESIZE   csh 'time' output

8191        8191        0.0u 1.5s 0:34.10 4.6% 5+186k 0+7500io 0pf+0w
8191        8192        0.0u 1.3s 0:31.52 4.5% 5+178k 0+7500io 0pf+0w
8191        8193        0.0u 1.4s 0:32.63 4.4% 5+189k 0+7500io 0pf+0w

8192        8191        0.0u 0.7s 0:01.97 37.5% 8+178k 0+0io 0pf+0w    ****
8192        8192        0.0u 1.3s 0:39.30 3.4% 7+196k 0+8750io 0pf+0w
8192        8193        0.0u 1.3s 0:40.09 3.4% 5+187k 0+8750io 0pf+0w

8193        8191        0.0u 1.4s 0:46.22 3.2% 5+192k 0+8750io 0pf+0w
8193        8192        0.0u 1.6s 0:40.48 4.0% 5+182k 0+8750io 0pf+0w
8193        8193        0.0u 1.5s 0:40.57 3.8% 5+175k 0+8750io 0pf+0w


8191        4095        0.0u 1.2s 0:33.79 3.6% 5+193k 0+7500io 0pf+0w
8191        4096        0.0u 1.2s 0:34.00 3.8% 5+190k 0+7500io 0pf+0w
8191        4097        0.0u 1.1s 0:33.58 3.6% 4+165k 0+7500io 0pf+0w

8192        4095        0.0u 0.5s 0:00.76 75.0% 5+189k 0+0io 0pf+0w    ****
8192        4096        0.0u 0.5s 0:00.58 100.0% 5+183k 0+0io 0pf+0w   ****
8192        4097        0.0u 0.5s 0:00.74 78.3% 5+181k 0+0io 0pf+0w    ****

8193        4095        0.0u 0.6s 0:01.00 67.0% 5+177k 0+0io 0pf+0w    ****
8193        4096        0.0u 0.6s 0:01.05 63.8% 5+179k 0+0io 0pf+0w    ****
8193        4097        0.0u 0.6s 0:01.02 66.6% 5+183k 0+0io 0pf+0w    ****



Any views gratefully received.  A fix would be much better :-)

Test program source, including compile & run instructions, is
available at:

        http://www.netcraft.com/freebsd/random-IO/seekreadwrite.c

Detailed notes on the test system configurations are at:

        http://www.netcraft.com/freebsd/random-IO/results-notes.txt

Thanks,
        Richard
-
Richard Wendland                                [EMAIL PROTECTED]

