Hello, Nicklai. We are very grateful for your proposal and very glad that you are ready to contribute to Sedna again.
However, the question that bothers me is: what outcome do we expect to see after
implementing this technique? I have seen many bottlenecks in Sedna processing coming
from the architecture, but it was very rarely buffer management (surprisingly!). E.g.
the most outstanding bottleneck I have seen was the executor waiting to send and
receive a message from a socket client before processing the next tuple. That was
really fascinating: about 10 tuples per second, and 1% CPU load. I understand that
there were some problems with sockets on that particular machine, so I would rather
call it a bug. So are you really sure that the issue you describe is the primary one
we have to worry about?

Currently we state our primary goal as reliability, but if your proposal is safe to
implement without any threat to the latter, it is a great idea. I have always wanted
our VMM techniques to be somewhat like you describe; I have just never seen a real
need to do it. We are very limited in human resources right now, so we cannot do this
ourselves, but if you are willing to implement your idea, we look forward to helping
you in any way possible. We only ask that you do this with as few changes to
unrelated code as possible.

--
Thanks in advance,
Ilya Taranov.

On Wed, Aug 24, 2011 at 5:06 PM, Nick Zavaritsky <[email protected]> wrote:
> Hi everyone!
>
> I haven't grokked the Sedna source code for years. When I finally fetched the
> sources from git this week (to check it on OS X Lion), I was struck by a desire to
> contribute to Sedna :) So here is my enhancement proposal #1.
>
>
> === ABSTRACT ===
>
> The world of computer hardware has drastically changed. Multicore CPUs are our
> reality. Many software designs require tweaks or even a major overhaul to take full
> advantage of multicore machines. One particular design badly in need of attention is
> the framework for reclaiming buffer memory. I don't pretend to have the slightest
> idea how to pick a buffer to reclaim in a smarter way than it is currently done.
> However, the machinery that notifies transactions of a reclaimed buffer, and that
> protects the buffer a transaction currently works with from being reclaimed, can be
> improved.
>
> The problem is that the current design doesn't scale well with the increase in the
> number of active transactions. The problem is actually threefold.
>
> First, the core algorithm processes transactions sequentially, each transaction
> requiring multiple IPC calls to carry out the processing. It would be better if we
> could process transactions in parallel. Or we could make the processing much faster,
> so that it would be OK to stay sequential.
>
> Second, the algorithm may restart from the very beginning if it discovers that the
> buffer it attempts to reclaim is currently in use by a transaction. I believe this
> is a minor problem, but it is nevertheless worth mentioning.
>
> Finally, the algorithm for processing a transaction is synchronous: basically, we
> send a message to a transaction process and wait for a reply. It would be better if
> we didn't have to wait for a reply at all, i.e. go asynchronous.
>
> THE SOLUTION: move more state to shared memory and use a lock-free approach.
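>
> To make the intended outcome concrete before diving into the details, here is a
> rough sketch in C of the shape the buffer manager's reclaim pass could take under
> this proposal: one pass over the registered transactions, a direct inspection, and
> an asynchronous post, with no semaphore round-trips. This is only a sketch of the
> target shape; bm_block_in_use, bm_post_unmap and the Block type are hypothetical
> placeholders for the machinery developed below, not existing Sedna APIs.
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     typedef struct VMMState VMMState;        /* per-transaction shared state, see below */
>     typedef struct { uint64_t id; } Block;   /* hypothetical block handle */
>
>     /* Hypothetical: true iff the transaction's published XPTR lies in blk;
>      * realized below as a lock-free read of shared state. */
>     static bool bm_block_in_use(VMMState *tr, const Block *blk)
>     {
>         (void)tr; (void)blk;
>         return false;                        /* stub for the sketch */
>     }
>
>     /* Hypothetical: queue an unmap command for the transaction and return
>      * immediately; no reply is awaited. */
>     static void bm_post_unmap(VMMState *tr, const Block *blk)
>     {
>         (void)tr; (void)blk;                 /* stub for the sketch */
>     }
>
>     /* Target shape of unmap_block_in_trs(): one pass, no waiting. */
>     static bool try_unmap_block_in_trs(VMMState *trs[], int ntrs, const Block *blk)
>     {
>         for (int i = 0; i < ntrs; i++) {
>             if (bm_block_in_use(trs[i], blk))
>                 return false;                /* in use: BM simply picks another buffer */
>             bm_post_unmap(trs[i], blk);      /* asynchronous unmap request */
>         }
>         return true;
>     }
>
> Unmap commands already posted to earlier transactions in a failed pass should be
> harmless under the rules below: those transactions were not using the block, and
> their next CHECKP remaps it on demand.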
>
> THE COST: one InterlockedCompareExchange or similar intrinsic in CHECKP, one
> conditional in CHECKP, and slightly more complex code due to the indirection
> required to access the state in shared memory from CHECKP.
>
> The implied overhead is expected to be low: Andrew Fomitchev got somewhere near a 7%
> slowdown when evaluating a different design with a spinlock in CHECKP, which is very
> similar to the proposed design both in terms of code size increase and the use of an
> interlocked intrinsic. Or does my memory serve me badly, and 7% is just the
> gay/total population ratio?
>
>
> === CURRENT DESIGN OVERVIEW ===
>
> If the buffer manager (BM) runs short of buffer memory, it has to reclaim a buffer.
> The buffer eviction policy was FIFO a few years ago; however, it doesn't matter at
> all in the scope of the current discussion. So BM picks a buffer somehow and
> attempts to reclaim it. Unmap_block_in_trs() does the job. If it fails to reclaim
> the buffer because the buffer was in use when the call was made, BM picks another
> buffer and retries. These steps repeat until unmap_block_in_trs() finally succeeds.
>
> Unmap_block_in_trs() basically calls unmap_block_in_tr() for each registered
> transaction.
>
> Unmap_block_in_tr() wakes the VMM thread in the corresponding transaction's process;
> there is a semaphore for this purpose. Then unmap_block_in_tr() waits for a reply
> (another semaphore). The VMM thread suspends the main transaction thread and checks
> the current XPTR to decide whether it is OK to unmap the requested block. If the
> decision is positive, the VMM thread unmaps the block. Finally, it resumes the main
> thread, passes the decision back to unmap_block_in_tr() through shared memory, and
> wakes unmap_block_in_tr().
>
> The design is slightly different on UNIX: instead of SuspendThread()/ResumeThread(),
> the VMM thread sends a signal to the main transaction thread, and most of the code
> executes in the signal handler.
>
>
> === NEW DESIGN PROPOSAL ===
>
> So what if the buffer manager could check the current XPTR directly and send unmap
> requests without the need to wait for completion?
>
> The following discussion omits some technical details inherent to the distinct
> address spaces of TR and SM. For simplicity we describe the design as if everything
> happened in one address space. The real implementation will be somewhat more
> complicated.
>
> So we need to move the current XPTR to shared memory, huh?
>
> struct VMMState
> {
>     Xptr currentXPTR;
> };
>
> VMMState *pVMMState; /* allocated in the shared memory */
>
> Wait! We won't be able to set pVMMState->currentXPTR atomically.
>
> struct VMMState
> {
>     struct VMMStateData
>     {
>         Xptr currentXPTR;
>     };
>
>     VMMStateData *currentState;
>     VMMStateData statesPool[2];
> };
>
> Let's apply the common approach: if you need to modify a complex structure
> atomically, allocate and initialize a new instance and then swap pointers
> atomically. Since there is a single thread modifying currentXPTR, two instances of
> VMMStateData are enough, so we use statesPool instead of malloc().
>
> So far so good. But there is currently no way for the unmap request to step in.
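>
> Before adding a way for the unmap request to step in, here is a minimal sketch of
> the double-buffered publication just described, using C11 atomics for portability
> (the proposal itself speaks of InterlockedCompareExchange and similar intrinsics;
> the CAS only becomes necessary with currentStateOrSpecial below, so a plain atomic
> pointer store suffices here, and all names and the Xptr layout are illustrative):
>
>     #include <stdatomic.h>
>     #include <stdint.h>
>
>     /* Hypothetical stand-in for Sedna's xptr: wider than one machine word,
>      * hence a plain assignment is not atomic. */
>     typedef struct { uint32_t layer; uint64_t addr; } Xptr;
>
>     typedef struct { Xptr currentXPTR; } VMMStateData;
>
>     typedef struct {
>         VMMStateData *_Atomic currentState;  /* always points at a fully written slot */
>         VMMStateData  statesPool[2];         /* single writer => two slots suffice */
>     } VMMState;
>
>     /* TR side, called from CHECKP: write the idle slot, then swap pointers. */
>     static void vmm_set_current_xptr(VMMState *s, Xptr p)
>     {
>         VMMStateData *cur  = atomic_load_explicit(&s->currentState,
>                                                   memory_order_relaxed);
>         VMMStateData *next = (cur == &s->statesPool[0]) ? &s->statesPool[1]
>                                                         : &s->statesPool[0];
>         next->currentXPTR = p;               /* not yet visible to any reader */
>         atomic_store_explicit(&s->currentState, next, memory_order_release);
>     }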
>
> struct VMMState
> {
>     struct VMMStateData
>     {
>         Xptr currentXPTR;
>     };
>
>     VMMStateData *currentStateOrSpecial;
>     VMMStateData *currentState;
>     VMMStateData statesPool[2];
>
>     Mutex mutex;
>     VMMUnmapCmd queuedUnmapCmds[42];
>     bool didQueueOverflow;
> };
>
> CurrentStateOrSpecial stores a pointer to the current state, much like currentState.
> However, the buffer manager can swap the current state pointer in
> currentStateOrSpecial with a special value to indicate that the VMMState has been
> externally modified. CHECKP will update both currentState (regular assignment) and
> currentStateOrSpecial (InterlockedCompareExchange). Most of the time
> InterlockedCompareExchange will succeed, since the buffer manager didn't step in and
> currentStateOrSpecial == currentState.
>
> When the buffer manager processes a VMMState, it locks the mutex first and then
> swaps currentStateOrSpecial with the 'special' value. The operation yields the
> former currentStateOrSpecial value, which we use to check whether the buffer we
> intend to unmap is currently in use. We then update the
> queuedUnmapCmds/didQueueOverflow fields and unlock the VMMState.
>
> When InterlockedCompareExchange in CHECKP fails, it means that the VMMState was
> modified by the buffer manager. In that case the "slow path" is executed: we lock
> the mutex, execute whatever unmap cmds are queued (or reset the VMM completely if
> the queue has overflowed), then store a pointer to the current state in
> currentStateOrSpecial and unlock the mutex.
>
> The "slow path" is obviously going to be a separate function (CHECKP code is
> inlined).
>
>
> === Q&A ===
>
> Q1: Is it safe for the buffer manager to proceed without waiting for the unmap
> command to complete?
>
> A1: There must be a CHECKP before the unmapped memory is accessed, and CHECKP
> performs all queued commands before proceeding. Don't write buggy code.
>
> Q2: Is it safe for the buffer manager to access currentState data which is being
> concurrently modified?
>
> A2: Sure. There are two instances of the state. The pattern is as follows: modify B,
> replace the pointer to A with a pointer to B; modify A, replace the pointer to B
> with a pointer to A. A state instance is never modified while it is accessible via
> the currentState pointer.
>
> Assume we have just swapped currentStateOrSpecial with 'special' and found out that
> the current state is A:
>
>     A <- (currentStateOrSpecial <- SPECIAL)
>
> A is not going to be modified, since it is stored in currentState. CurrentState must
> point to B before A can be modified. But replacing A with B in currentState requires
> a currentStateOrSpecial update as well, which will fail since we have stored the
> special value there. The failure triggers the "slow path" execution, and the slow
> path locks pVMMState->mutex. But the buffer manager locks the VMMState mutex before
> touching currentStateOrSpecial, so the slow path will block until we are done with
> the currentState data and have released the mutex.
>
>
> === CONCLUSION ===
>
> The presented design enables reclaiming a buffer efficiently, and it scales well
> with the number of active transactions. Transactions are still processed
> sequentially; this is a non-issue, however, since much lighter synchronization
> primitives are used and the unmap requests are asynchronous. The expected
> performance degradation in single-user scenarios due to the CHECKP complications is
> negligible.
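>
> To pull the fast and slow paths together, here is a sketch of the
> currentStateOrSpecial handshake in C11 atomics (again, the proposal names
> InterlockedCompareExchange; SPECIAL, checkp_publish and bm_intercept are
> illustrative names, and the mutex operations are elided to comments):
>
>     #include <stdatomic.h>
>
>     typedef struct { int layer; long addr; } Xptr;       /* placeholder */
>     typedef struct { Xptr currentXPTR; } VMMStateData;
>
>     #define SPECIAL ((VMMStateData *)1)    /* sentinel: "BM has stepped in" */
>
>     typedef struct {
>         VMMStateData *_Atomic currentStateOrSpecial;
>         VMMStateData *currentState;
>         /* Mutex mutex; VMMUnmapCmd queuedUnmapCmds[42]; bool didQueueOverflow; */
>     } VMMState;
>
>     /* Slow path (a separate, non-inlined function): under s->mutex, execute
>      * the queued unmap cmds (or reset the VMM entirely on overflow), then
>      * re-publish the current state. */
>     static void vmm_slow_path(VMMState *s)
>     {
>         /* lock(s->mutex); drain queuedUnmapCmds / reset VMM on overflow; */
>         atomic_store(&s->currentStateOrSpecial, s->currentState);
>         /* unlock(s->mutex); */
>     }
>
>     /* CHECKP fast path: publish `next` as the new current state. */
>     static void checkp_publish(VMMState *s, VMMStateData *next)
>     {
>         VMMStateData *expected = s->currentState;  /* what BM should still see */
>         s->currentState = next;                    /* regular assignment */
>         if (!atomic_compare_exchange_strong(&s->currentStateOrSpecial,
>                                             &expected, next))
>             vmm_slow_path(s);   /* CAS failed => BM swapped in SPECIAL */
>     }
>
>     /* BM side, with s->mutex held: the returned former value tells us which
>      * block (if any) the transaction is touching right now. */
>     static VMMStateData *bm_intercept(VMMState *s)
>     {
>         return atomic_exchange(&s->currentStateOrSpecial, SPECIAL);
>     }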
>
> === CONCERNING NON CHECKP-INTENSIVE CODE ===
>
> The presented design assumes that CHECKP is executed quite often, so that queued
> unmap commands are processed in a timely fashion. Though a queue overflow is
> non-fatal (we "only" need to reset the entire VMM), it implies non-trivial overhead.
>
> We obviously need a CHECKP-less mode to activate when, e.g., waiting for a lock on
> an XML document to be granted, or talking with a client through a socket.
>
> The interesting property of CHECKP-less mode is that no CHECKPs are being executed,
> hence the VMM is not being accessed, hence there is no need to check the current
> Xptr. There is still the need to send asynchronous unmap commands.
>
> Let's introduce a secondary thread to process asynchronous unmap commands sent in
> CHECKP-less mode. To avoid overcomplicating things, let's assume that the thread
> receives commands through a pipe. The write end of the pipe is accessible both to
> the buffer manager (it sends asynchronous unmap commands) and to the transaction's
> main thread (to request secondary thread shutdown and when leaving CHECKP-less mode,
> see below).
>
> Let's extend VMMState:
>
> struct VMMState
> {
>     struct VMMStateData
>     {
>         Xptr currentXPTR;
>     };
>
>     VMMStateData *currentStateOrSpecial;
>     VMMStateData *currentState;
>     VMMStateData statesPool[2];
>
>     Mutex mutex;
>     VMMUnmapCmd queuedUnmapCmds[42];
>     bool didQueueOverflow;
>     bool isInCHECKPLessMode;
>     int pipeFD;
>     uint64_t lastAsyncCmdID;
> };
>
> To enter CHECKP-less mode, just set isInCHECKPLessMode. If isInCHECKPLessMode is
> set, the buffer manager will send asynchronous unmap commands using pipeFD instead
> of writing to queuedUnmapCmds. LastAsyncCmdID records the ID of the last async
> command sent. When leaving CHECKP-less mode, unset isInCHECKPLessMode.
>
> To complete the transition from CHECKP-less to normal mode, we must ensure that all
> asynchronous commands sent to the secondary thread have completed. We ask the
> secondary thread to notify the main thread as soon as the command with
> lastAsyncCmdID has been processed. This is implemented by sending a command through
> the same pipe the buffer manager uses.
>
> OBVIOUS IDEA 1: Reset lastAsyncCmdID to 0 when entering CHECKP-less mode. If it is
> still 0 when we leave the mode, there is no need to synchronize with the secondary
> thread.
>
> CLEVER IDEA 1: If there are many transactions in CHECKP-less mode, the buffer
> manager will slow down considerably due to the need to write to many pipes. Arrange
> transactions in a tree-like fashion: the buffer manager sends commands to the root
> node (one write), the root node retransmits commands to its immediate child nodes,
> and so forth.
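>
> To make the CHECKP-less machinery concrete, here is a sketch in C of the secondary
> thread and the leave-mode handshake (the AsyncCmd wire format and every name here
> are assumptions, and the isInCHECKPLessMode / lastAsyncCmdID accesses are presumed
> to happen under the VMMState mutex shared with the buffer manager):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>     #include <unistd.h>
>
>     /* Hypothetical wire format of a command sent down the pipe. */
>     typedef struct {
>         uint64_t id;            /* monotonically increasing command ID */
>         uint64_t block;         /* which block to unmap (placeholder) */
>         bool     isSyncMarker;  /* true for the "drain" marker, not an unmap */
>     } AsyncCmd;
>
>     typedef struct {
>         bool     isInCHECKPLessMode;
>         int      pipeFD;        /* write end, shared by BM and the main thread */
>         uint64_t lastAsyncCmdID;
>     } VMMAsyncState;
>
>     /* Secondary thread body: executes unmap commands; on the sync marker it
>      * wakes the main thread instead. Since the pipe preserves order, every
>      * command sent before the marker has completed by then. */
>     static void *async_unmap_thread(void *arg)
>     {
>         int readFD = *(int *)arg;
>         AsyncCmd cmd;
>         while (read(readFD, &cmd, sizeof cmd) == (ssize_t)sizeof cmd) {
>             if (cmd.isSyncMarker) {
>                 /* notify the main thread that cmd.id has been reached */
>                 continue;
>             }
>             /* unmap the block named by cmd.block */
>         }
>         return NULL;
>     }
>
>     /* Main thread, leaving CHECKP-less mode. */
>     static void leave_checkpless_mode(VMMAsyncState *s)
>     {
>         s->isInCHECKPLessMode = false;   /* BM reverts to queuedUnmapCmds */
>         if (s->lastAsyncCmdID == 0)
>             return;                      /* OBVIOUS IDEA 1: nothing was sent */
>         AsyncCmd marker = { .id = s->lastAsyncCmdID, .isSyncMarker = true };
>         (void)write(s->pipeFD, &marker, sizeof marker);
>         /* ...block until the secondary thread acknowledges the marker... */
>     }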
