Øystein Grøvlen wrote:
Some test runs we have done show very long transaction response times
during checkpointing.  This has been seen on several platforms.  The
load is TPC-B-like transactions, and the write cache is turned off, so
the system is I/O bound.  There seem to be two major issues:

1. Derby does checkpointing by writing all dirty pages with
   RandomAccessFile.write() and then doing a file sync when the entire
   cache has been scanned.  When the page cache is large, the file
   system buffer will overflow during checkpointing, and occasionally
   the writes will take very long.  I have observed single write
   operations that took almost 12 seconds.  What is even worse is that
   during this period, read performance on other files can also be
   very bad.  For example, reading an index page from disk can take
   close to 10 seconds while the base table is being checkpointed.
   Hence, transactions are severely slowed down.

   I have managed to improve response times by syncing each file after
   every 100th write.  Is this something we should consider including
   in the code?  Do you have better suggestions?
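
   (A minimal sketch of the idea, with hypothetical names standing in
   for the real cache-scan loop; only the periodic sync is the point:)

        import java.io.FileDescriptor;
        import java.io.IOException;
        import java.io.RandomAccessFile;
        import java.util.List;

        // Hypothetical sketch: write dirty pages as today, but sync
        // the file descriptor after every 100th write instead of only
        // once at the end, so the file system buffer never accumulates
        // an entire page cache worth of dirty data.
        class ThrottledCheckpointWriter {
            private static final int SYNC_INTERVAL = 100;

            void writeDirtyPages(RandomAccessFile file,
                                 List<byte[]> pages, long[] offsets,
                                 int pageSize) throws IOException {
                FileDescriptor fd = file.getFD();
                for (int i = 0; i < pages.size(); i++) {
                    file.seek(offsets[i]);
                    file.write(pages.get(i), 0, pageSize);
                    if ((i + 1) % SYNC_INTERVAL == 0) {
                        fd.sync();  // bound the unflushed data
                    }
                }
                fd.sync();          // flush the tail of the batch
            }
        }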

Probably the first thing to do is make sure we are doing a reasonable
number of checkpoints; most people who run these benchmarks configure
the system so that it does either 0 or 1 checkpoints during the run.
This goes to the ongoing discussion of how best to automatically
configure the checkpoint interval - the current defaults don't make
much sense for an OLTP system.
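
For reference, the interval can already be tuned by hand; assuming I
have the property name right, something like the following in
derby.properties would make checkpoints much less frequent (the value
is only illustrative):

    # derby.properties - bytes of log written between checkpoints
    # (the default is roughly 10 MB; this value is illustrative)
    derby.storage.checkpointInterval=100000000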

I had hoped that with the current checkpoint design, by the time the
file sync happened, all the pages would usually have already made it
to disk.  The hope was that while holding the write semaphore we would
not do any I/O and thus not cause much interruption to the rest of the
system.

What OS/filesystem are you seeing these results on?  Any idea why a
write would take 10 seconds?  Do you think the write blocks when the
sync is called?  If so, do you think the block is at a Derby sync
point or an OS-internal sync point?

We moved away from the write-then-sync approach for log files because
we found that on some OS/filesystems the performance of the sync was
linearly related to the size of the file rather than to the number of
modified pages.  I left it in for checkpoint as it seemed an easy way
to do asynchronous writes, which I thought would provide the OS with
basically the equivalent of many concurrent writes to do.

Another approach may be to change checkpoint to use the direct sync
write, but have it get its own open on the file, similar to what you
describe below - that would mean other readers/writers would never
block on a checkpoint read/write, at least at the Derby level.
Whether this would increase or decrease overall checkpoint elapsed
time is probably system dependent - I am pretty sure it would increase
the time on Windows, but I continue to believe the elapsed time of a
checkpoint is not important - as you point out, it is more important
to make sure it interferes with "real" work as little as possible.
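
Roughly, that might look like the sketch below; "rwd" mode makes each
write synchronous to the device, and the private handle keeps
checkpoint I/O off the handle that readers synchronize on.  Class and
method names are invented for illustration:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch: checkpoint gets its own handle opened in "rwd" mode so
    // each write reaches the device synchronously, while ordinary
    // readers/writers keep using the container's regular handle and
    // never queue behind checkpoint I/O at the Derby level.
    class CheckpointWriterHandle {
        private final RandomAccessFile syncWriter;

        CheckpointWriterHandle(File containerFile) throws IOException {
            syncWriter = new RandomAccessFile(containerFile, "rwd");
        }

        void writePage(long pageOffset, byte[] pageData, int pageSize)
                throws IOException {
            synchronized (syncWriter) { // serializes checkpoint writes only
                syncWriter.seek(pageOffset);
                syncWriter.write(pageData, 0, pageSize);
            }
        }

        void close() throws IOException {
            syncWriter.close();
        }
    }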

2. What makes things even worse is that only a single thread at a time
   can read a page from a given file.  (Note that Derby has one file
   per table.)  This is because the implementation of
   RAFContainer.readPage is as follows:

        synchronized (this) {  // 'this' is a FileContainer, i.e. a file object
            fileData.seek(pageOffset);  // fileData is a RandomAccessFile
            fileData.readFully(pageData, 0, pageSize);
        }

   During checkpointing, when I/O is slow, this creates long queues of
   readers.  In my run with 20 clients, I observed read requests that
   took more than 20 seconds.

   This behavior will also limit throughput and can partly explain why
   I get low CPU utilization with 20 clients.  All my TPC-B clients
   are serialized since most will need 1-2 disk accesses (the index
   leaf page and one page of the account table).

   Generally, in order to allow the OS to optimize I/O, one should
   have many outstanding I/O calls at a time.  (See Hall and Bonnet:
   "Getting Priorities Straight: Improving Linux Support for Database
   I/O", VLDB 2005.)

   I have attached a patch where I have introduced several file
   descriptors (RandomAccessFile objects) per RAFContainer.  These are
   used for reading.  The principle is that when all readers are busy,
   a readPage request will create a new reader.  (There is a maximum
   number of readers; a sketch follows below.)  With this patch,
   throughput was improved by 50% on Linux.  The combination of this
   patch and syncing for every 100th write reduced maximum transaction
   response times by 90%.

   The patch is not ready for inclusion into Derby, but I would like
   to hear whether you think this is a viable approach.
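
   (A sketch of that principle, with invented names; this is not the
   actual patch:)

        import java.io.File;
        import java.io.IOException;
        import java.io.RandomAccessFile;
        import java.util.ArrayDeque;
        import java.util.Deque;

        // Keep a small pool of read-only RandomAccessFile handles per
        // container so concurrent readPage calls can issue I/O in
        // parallel instead of queueing on a single descriptor.
        class ReaderPool {
            private final File containerFile;
            private final int maxReaders;
            private final Deque<RandomAccessFile> idle =
                    new ArrayDeque<RandomAccessFile>();
            private int created = 0;

            ReaderPool(File containerFile, int maxReaders) {
                this.containerFile = containerFile;
                this.maxReaders = maxReaders;
            }

            void readPage(long pageOffset, byte[] pageData, int pageSize)
                    throws IOException, InterruptedException {
                RandomAccessFile reader = checkOut();
                try {
                    reader.seek(pageOffset);
                    reader.readFully(pageData, 0, pageSize);
                } finally {
                    checkIn(reader);
                }
            }

            private synchronized RandomAccessFile checkOut()
                    throws IOException, InterruptedException {
                while (idle.isEmpty() && created >= maxReaders) {
                    wait();        // all readers busy and at the cap
                }
                if (!idle.isEmpty()) {
                    return idle.pop();
                }
                created++;         // all busy: open another handle
                return new RandomAccessFile(containerFile, "r");
            }

            private synchronized void checkIn(RandomAccessFile reader) {
                idle.push(reader);
                notify();
            }
        }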

I now see what you were talking about; I was thinking at too high a
level.  In your test, is the data spread across more than a single
disk?  Especially with data spread across multiple disks, it would
make sense to allow multiple concurrent reads.  That configuration was
just not the target of the original Derby code, so especially as we
target more processors and more disks, changes will need to be made.

I wonder if Java's newer asynchronous I/O interfaces may be more
appropriate; maybe we just need to change every read into an async
read followed by a wait, and the same for writes.  I have not used
those interfaces - does anyone have experience with them, and is there
any downside to using them vs. the current RandomAccessFile
interfaces?
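
For what it's worth, with NIO's AsynchronousFileChannel - an assumption
on my part about which interfaces are meant, and not available in older
JDKs - the "async read followed by a wait" pattern would look roughly
like this:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;

    // The channel takes an explicit position, so reads need no seek
    // and no per-file synchronization; many reads can be in flight at
    // once for the OS to schedule.
    class AsyncPageReader {
        private final AsynchronousFileChannel channel;

        AsyncPageReader(String path) throws IOException {
            channel = AsynchronousFileChannel.open(
                    Paths.get(path), StandardOpenOption.READ);
        }

        void readPage(long pageOffset, byte[] pageData, int pageSize)
                throws IOException, ExecutionException,
                       InterruptedException {
            ByteBuffer buf = ByteBuffer.wrap(pageData, 0, pageSize);
            Future<Integer> pending = channel.read(buf, pageOffset);
            // Wait for completion; a real implementation would loop
            // until pageSize bytes have been read.
            pending.get();
        }
    }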

Your approach may be fine; one consideration is the number of file
descriptors necessary to run the system.  On some very small platforms
the only way to run the original Cloudscape was to shrink the
container cache to limit the number of file descriptors.

