Hi everyone!
I haven't grokked the Sedna source code for years.
When I finally fetched the sources from git this week (to check it on OS X
Lion) I was struck with a desire to contribute to Sedna :)
So here is my enhancement proposal #1.
=== ABSTRACT ===
The world of computer hardware has drastically changed. Multicore CPUs are our
reality.
Many software designs require tweaks or even a major overhaul to take the full
advantage of multicores.
One particular design needing attention badly is the framework to reclaim
buffer memory.
I don't pretend to have the slightest idea how to pick a buffer to reclaim in a
smarter way than is currently done.
However, the machinery that notifies transactions of a buffer being gone and
protects the buffer a transaction is currently working with from being
reclaimed can be improved.
The problem is that the current design doesn't scale well with the increase in
the number of active transactions.
The problem is actually threefold.
First, the core algorithm processes transactions sequentially, and each
transaction requires multiple IPC calls to carry out the processing. It would
be better if we could process transactions in parallel, or make the processing
much faster so that staying sequential would be acceptable.
Second, the algorithm may restart from the very beginning if it discovers that
the buffer it attempts to reclaim is currently in use by a transaction. I
believe this is a minor problem, but it is nevertheless worth mentioning.
Finally, the algorithm that processes a transaction is synchronous: basically
we send a message to a transaction process and wait for the reply. It would be
better if we didn't have to wait for the reply at all, i.e. go asynchronous.
THE SOLUTION: move more state into shared memory and use a lock-free approach.
THE COST: one InterlockedCompareExchange or similar intrinsic in CHECKP, one
conditional in CHECKP, and slightly more complex code due to the indirection
required to access the state in shared memory from CHECKP.
The implied overhead is expected to be low: Andrew Fomitchev measured roughly
a 7% slowdown when evaluating a different design with a spinlock in CHECKP,
which is very similar to the proposed design both in terms of code size
increase and the use of an interlocked intrinsic (my memory may not serve me
well here, so treat the 7% figure as approximate).
=== CURRENT DESIGN OVERVIEW ===
If the buffer manager (BM) runs short of buffer memory it has to reclaim a
buffer. The buffer eviction policy was FIFO a few years ago, but the policy
doesn't matter at all in the scope of the current discussion.
So BM picks a buffer somehow and attempts to reclaim it; unmap_block_in_trs()
does the job. If it fails to reclaim the buffer because the buffer was in use
when the call was made, BM picks another buffer and retries. These steps
repeat until unmap_block_in_trs() finally succeeds.
unmap_block_in_trs() basically calls unmap_block_in_tr() for each registered
transaction.
unmap_block_in_tr() wakes the VMM thread in the corresponding transaction's
process; there is a semaphore for this purpose. Then unmap_block_in_tr()
waits for the reply (another semaphore). The VMM thread suspends the main
transaction thread and checks the current XPTR to decide whether it is OK to
unmap the requested block. If the decision is positive, the VMM thread unmaps
the block. Finally it resumes the main thread, passes the decision result
back to unmap_block_in_tr() through shared memory, and wakes
unmap_block_in_tr().
The design is slightly different on UNIX: instead of
SuspendThread()/ResumeThread(), the VMM thread sends a signal to the main
transaction thread and most of the code executes in the signal handler.
=== NEW DESIGN PROPOSAL ===
So what if the buffer manager could check the current XPTR directly and send
unmap requests without
the need to wait for completion?
The following discussion omits some technical details inherent to the
distinct address spaces of TR and SM. For simplicity we describe the design
as if everything happened in one address space; the real implementation will
be somewhat more complicated.
So we need to move the current XPTR to shared memory, huh?
struct VMMState
{
    Xptr currentXPTR;
};

VMMState *pVMMState; /* allocated in the shared memory */
Wait! We won't be able to set pVMMState->currentXPTR atomically.
struct VMMState
{
    struct VMMStateData
    {
        Xptr currentXPTR;
    };

    VMMStateData *currentState;
    VMMStateData  statesPool [2];
};
Let's apply the common approach: when you need to modify a complex structure
atomically, allocate and initialize a new one and then swap the pointers
atomically. Since there is a single thread modifying currentXPTR, two
instances of VMMStateData are enough, so we use statesPool instead of
malloc().
So far so good. But there is currently no way for the unmap request to step in.
struct VMMState
{
    struct VMMStateData
    {
        Xptr currentXPTR;
    };

    VMMStateData *currentStateOrSpecial;
    VMMStateData *currentState;
    VMMStateData  statesPool [2];

    Mutex        mutex;
    VMMUnmapCmd  queuedUnmapCmds [42];
    bool         didQueueOverflow;
};
currentStateOrSpecial stores a pointer to the current state, much like
currentState. However, the buffer manager can swap the current state pointer
in currentStateOrSpecial with a special value to indicate that VMMState has
been externally modified. CHECKP will update both currentState (regular
assignment) and currentStateOrSpecial (InterlockedCompareExchange).
Most of the time InterlockedCompareExchange will succeed, since the buffer
manager didn't step in and currentStateOrSpecial == currentState.
When the buffer manager processes a VMMState it locks the mutex first and then
swaps currentStateOrSpecial with the 'special' value. The operation yields the
former currentStateOrSpecial value, which we use to check whether the buffer
we intend to unmap is currently in use. We then update the
queuedUnmapCmds/didQueueOverflow fields and unlock the VMMState.
When InterlockedCompareExchange in CHECKP fails it means that the VMMState was
modified by the buffer manager. In the latter case the "slow path" is
executed: we lock the mutex and execute whatever unmap commands are queued
(or reset the VMM completely if the queue has overflowed), then store a
pointer to the current state in currentStateOrSpecial and unlock the mutex.
The "slow path" is obviously going to be a separate function (CHECKP code is
inlined).
=== Q&A ===
Q1: Is it safe for the buffer manager to proceed without waiting for unmap
command to complete?
A1: There must be a CHECKP before the unmapped memory is accessed, and CHECKP
performs all queued commands before processing. Don't write buggy code.
Q2: Is it safe for the buffer manager to access currentState data which is
being concurrently modified?
A2: Sure. There are two instances of the state, and the pattern is as follows:
modify B, replace the pointer to A with a pointer to B; modify A, replace the
pointer to B with a pointer to A. A state instance is never modified while it
is accessible via the currentState pointer.
Assume we have just swapped currentStateOrSpecial with 'special' and found
out that the current state is A:
A <- (currentStateOrSpecial <- SPECIAL)
A is not going to be modified since it is stored in currentState; currentState
must point to B before A can be modified. But replacing A with B in
currentState requires a currentStateOrSpecial update as well, which will fail
since we've stored the special value there. The failure triggers the "slow
path", which locks pVMMState->mutex. But the buffer manager locks the
VMMState mutex before touching currentStateOrSpecial, so the slow path will
block until we are done with the currentState data and have released the
mutex.
=== CONCLUSION ===
The presented design enables reclaiming a buffer efficiently and scales well
with the number of active transactions.
Transactions are still processed sequentially; this is a non-issue, however,
since much lighter synchronization primitives are used and unmap requests are
asynchronous. The expected performance degradation in single-user scenarios
due to the CHECKP complications is negligible.
=== CONCERNING NON CHECKP-INTENSIVE CODE ===
The presented design assumes that CHECKP is executed quite often, so queued
unmap commands are processed in a timely fashion. Though a queue overflow is
not fatal (we "only" need to reset the entire VMM), it implies non-trivial
overhead.
We obviously need a CHECKP-less mode to activate when, for example, waiting
for a lock on an XML document to be granted or talking to a client through a
socket.
The interesting property of CHECKP-less mode is that no CHECKPs are executed,
hence the VMM is not being accessed, hence there is no need to check the
current XPTR. There is still the need to send asynchronous unmap commands.
Let's introduce a secondary thread to process asynchronous unmap commands
sent in CHECKP-less mode. To avoid overcomplicating things, let's assume
that the thread receives commands through a pipe.
The write end of the pipe is accessible both to the buffer manager (it sends
asynchronous unmap commands) and to the transaction's main thread (to request
secondary thread shutdown and when leaving CHECKP-less mode, see below).
Let's extend VMMState.
struct VMMState
{
    struct VMMStateData
    {
        Xptr currentXPTR;
    };

    VMMStateData *currentStateOrSpecial;
    VMMStateData *currentState;
    VMMStateData  statesPool [2];

    Mutex        mutex;
    VMMUnmapCmd  queuedUnmapCmds [42];
    bool         didQueueOverflow;

    bool         isInCHECKPLessMode;
    int          pipeFD;
    uint64_t     lastAsyncCmdID;
};
To enter CHECKP-less mode, just set isInCHECKPLessMode.
If isInCHECKPLessMode is set, the buffer manager will send asynchronous unmap
commands using pipeFD instead of writing to queuedUnmapCmds; lastAsyncCmdID
records the ID of the last async command sent.
When leaving CHECKP-less mode, unset isInCHECKPLessMode.
To complete the transition from CHECKP-less to normal mode we must ensure
that all asynchronous commands sent to the secondary thread have completed.
We ask the secondary thread to notify the main thread as soon as the command
with lastAsyncCmdID has been processed; this is implemented by sending a
command through the same pipe the buffer manager uses.
OBVIOUS IDEA 1: Reset lastAsyncCmdID to 0 when entering CHECKP-less mode. If
it is still 0 when we leave the mode, there is no need to synchronize with
the secondary thread.
CLEVER IDEA 1: If there are many transactions in CHECKP-less mode, the buffer
manager will slow down considerably due to the need to write to many pipes.
Arrange the transactions in a tree-like fashion: the buffer manager sends
commands to the root node (1 write), the root node retransmits commands to
its immediate child nodes, and so forth.