>>>>> "MM" == Mike Matrigali <[EMAIL PROTECTED]> writes:
MM> user thread initiated read
MM> o should be high priority and should be "fair" with other user
MM> initiated reads.
MM> o These happen anytime a read of a row causes a cache miss.
MM> o Currently only one I/O operation to a file can happen at a time,
MM>   which could be a big problem for some types of multi-threaded,
MM>   highly concurrent applications that use a small number of
MM>   tables. I think the path here should be to increase the number
MM>   of concurrent I/O's allowed to be outstanding by allowing each
MM>   thread to have 1 (assuming sufficient open file resources). 100
MM>   outstanding I/O's to a single file may be overkill, but in Java
MM>   we can't know that the file is not actually 100 disks
MM>   underneath. The number of I/O's should grow as the actual
MM>   application load increases. Note that I still think the max
MM>   number of I/O's should be tied to the number of user threads,
MM>   plus maybe a small number for background processing.
There was an interesting paper at the last VLDB conference that
discussed the virtue of having many outstanding I/O requests:
http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf (paper)
http://www.vldb2005.org/program/slides/wed/s1116-hall.pdf (slides)
The basic message is that many outstanding requests are good. The
SCSI controller they used in their study was able to handle 32
concurrent requests. One reason database systems have been
conservative with respect to outstanding requests is that they want to
control the priority of the I/O requests. We would like user thread
initiated requests to have priority over checkpoint initiated writes.
(The authors suggest building priorities into the file system to solve
this.)
I plan to start working on a patch for allowing more concurrency
between readers within a few weeks. The main challenge is to find the
best way to organize the open file descriptors (reuse, limiting the
maximum number, etc.). I will file a JIRA for this.
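To make the descriptor discussion concrete, here is the kind of thing
I have in mind (only a sketch; FileHandlePool and its methods are
invented names, not existing Derby code). Each concurrent reader gets
its own RandomAccessFile, up to a cap, so reads no longer serialize
on a single descriptor, and descriptors are created lazily so their
number only grows with actual load:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.LinkedList;

    // Hypothetical pool of descriptors for one container file.
    class FileHandlePool {
        private final File file;
        private final int maxOpen;   // e.g. tied to number of user threads
        private final LinkedList<RandomAccessFile> idle =
            new LinkedList<RandomAccessFile>();
        private int open = 0;        // descriptors created so far

        FileHandlePool(File file, int maxOpen) {
            this.file = file;
            this.maxOpen = maxOpen;
        }

        synchronized RandomAccessFile acquire()
                throws IOException, InterruptedException {
            while (idle.isEmpty() && open >= maxOpen) {
                wait();              // all descriptors busy; wait for a release
            }
            if (!idle.isEmpty()) {
                return idle.removeFirst();
            }
            open++;                  // grow lazily with actual load
            return new RandomAccessFile(file, "r");
        }

        synchronized void release(RandomAccessFile raf) {
            idle.addFirst(raf);
            notify();
        }
    }

A reader would then bracket each page read with acquire()/release()
(release in a finally block) and do its seek() plus readFully() on a
private descriptor.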
I also think we should consider mechanisms for read ahead.
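A simple form of read ahead could look something like this (again
just a sketch; PageCache stands in for whatever buffer manager
interface we end up with). When a scan reads page n, an asynchronous
read of page n+1 is scheduled, so that a sequential scan usually
finds the next page already in the cache:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical read-ahead wrapper around the buffer manager.
    class ReadAhead {
        private final ExecutorService prefetcher =
            Executors.newSingleThreadExecutor();
        private final PageCache cache;   // placeholder interface

        ReadAhead(PageCache cache) {
            this.cache = cache;
        }

        byte[] readWithReadAhead(final long pageNo) {
            byte[] page = cache.readPage(pageNo);   // synchronous read
            prefetcher.submit(new Runnable() {
                public void run() {
                    cache.readPage(pageNo + 1);     // just warms the cache
                }
            });
            return page;
        }
    }

    interface PageCache {
        byte[] readPage(long pageNo);
    }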
MM> user thread initiated write
MM> o same issues as user initiated read.
MM> o happens way less than reads, as it should only happen on a cache
MM>   miss that can't find a non-dirty page in the cache. The
MM>   background cache cleaner should keep this from happening, though
MM>   apps that only do updates and cause cache hits are the worst
MM>   case.
MM> checkpoint initiated write:
MM> o sometimes too many checkpoints happen in too short a time.
MM> o needs an improved scheduling algorithm; currently it just
MM>   triggers after N bytes have been written to the log file, no
MM>   matter what the speed of log writes is.
MM> o currently it may flood the I/O system, causing user reads/writes
MM>   to stall - on some OS/JVM's this stall is amazingly long, like
MM>   tens of seconds.
MM> o It is not important that checkpoints run fast; it is more
MM>   important that they proceed methodically to conclusion while
MM>   causing little interruption to "real" work by user threads.
MM> Various approaches to this were discussed, but no patches yet.
For the scheduling of checkpoints, I was hoping Raymond would come up
with something. Raymond, are you still with us?
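In the meantime, one possible direction is to adapt the byte
threshold to the observed log write rate, so that checkpoints fire at
a roughly constant rate in wall-clock time rather than in bytes. A
sketch (all names invented):

    // Hypothetical adaptive trigger: aim for one checkpoint about
    // every targetIntervalMillis, by rescaling the byte threshold to
    // the observed log write rate instead of using a fixed "N bytes
    // written" threshold.
    class CheckpointScheduler {
        private final long targetIntervalMillis;       // e.g. 5 * 60 * 1000
        private long lastCheckpointTime = System.currentTimeMillis();
        private long bytesSinceCheckpoint = 0;
        private long byteThreshold = 10 * 1024 * 1024; // initial guess

        CheckpointScheduler(long targetIntervalMillis) {
            this.targetIntervalMillis = targetIntervalMillis;
        }

        // Called by the logger after each log write.
        synchronized boolean logWritten(int bytes) {
            bytesSinceCheckpoint += bytes;
            return bytesSinceCheckpoint >= byteThreshold;
        }

        // Called when a checkpoint completes: rescale the threshold
        // so that, at the current write rate, the next checkpoint
        // fires after about targetIntervalMillis.
        synchronized void checkpointDone() {
            long now = System.currentTimeMillis();
            long elapsed = Math.max(1, now - lastCheckpointTime);
            double bytesPerMilli = (double) bytesSinceCheckpoint / elapsed;
            byteThreshold =
                Math.max(1, (long) (bytesPerMilli * targetIntervalMillis));
            lastCheckpointTime = now;
            bytesSinceCheckpoint = 0;
        }
    }

A real policy would also need bounds on the threshold so that a quiet
system still checkpoints eventually.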
I have discussed our I/O architecture with Solaris engineers, and I
was told that our approach of doing buffered writes followed by an
fsync is the worst possible approach on Solaris. They recommended
using direct I/O instead. I guess there will be situations where
single-threaded direct I/O for checkpointing will give too low
throughput. In that case, we could consider a pool of writers. The
challenge would then be how to give priority to user-initiated
requests over multi-threaded checkpoint writes, as discussed above.
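To make that last point concrete, the writer pool could drain a
shared priority queue in which user-initiated requests always sort
ahead of checkpoint writes. Again just a sketch with invented names:

    import java.util.concurrent.PriorityBlockingQueue;

    // Hypothetical prioritized I/O queue: user-initiated requests
    // are always served before checkpoint writes, so a flood of
    // checkpoint I/O cannot stall user reads and writes.
    class IoScheduler {
        static final int USER = 0;        // lower value = higher priority
        static final int CHECKPOINT = 1;

        static class IoRequest implements Comparable<IoRequest> {
            final int priority;
            final Runnable work;          // the actual read or write
            IoRequest(int priority, Runnable work) {
                this.priority = priority;
                this.work = work;
            }
            public int compareTo(IoRequest other) {
                return priority - other.priority;
            }
        }

        private final PriorityBlockingQueue<IoRequest> queue =
            new PriorityBlockingQueue<IoRequest>();

        void submit(IoRequest r) {
            queue.put(r);
        }

        // Each thread in the writer pool runs this loop.
        void writerLoop() throws InterruptedException {
            while (true) {
                // take() always returns the highest-priority request.
                queue.take().work.run();
            }
        }
    }

The obvious risk is starvation of checkpoint writes under constant
user load, so a real implementation would probably need some aging of
old requests.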
--
Øystein