Hello,

A fast concurrent FIFO queue, version 1.3
Author: Amine Moulay Ramdane

Description: A very fast concurrent FIFO queue that satisfies many requirements: it has more parallelism than the two-lock algorithm, it is FIFO fair, it is starvation-free, and it efficiently minimizes cache-coherence traffic. It is also energy efficient on the pop() side when you set the "wait" parameter to true in the constructor: when there are no items in the queue it will not spin-wait, it will instead block on my SemaMonitor. When the "wait" parameter of the constructor is set to false, it uses only one atomic increment on the push() side and one atomic increment on the pop() side, so it is very fast.

The number of threads on the pop() side is limited by the length of the queue; the length of the queue must be greater than or equal to 2^10, I have set it like that.

You have 3 options for setting the kind of lock, just look inside defines.inc: if you want my scalable lock called scalable MLock, uncomment the MLock option inside defines.inc; if you want the Ticket Spinlock, uncomment the TicketSpinlock option; if you want the Spinlock, uncomment the Spinlock option. The Ticket Spinlock option scored 12.5 million transactions per second on my 2.4 GHz quad-core.

The size of the queue must be passed to the constructor and it must be a power of 2.

Here is my explanation of my algorithm; you will find the source code here:

https://sites.google.com/site/aminer68/concurrent-fifo-queue-1
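The power-of-2 size requirement is what makes the cell indexing cheap: with size = 2^k, reducing a ticket counter modulo the queue length becomes a single bitwise AND with size-1, which is what expressions like "lastHead and fMask" compute in the Pascal source. A small Python illustration (the names size, mask and cell_index are mine, for illustration only):

```python
# Why the queue size must be a power of 2: with size = 2**k,
# "counter mod size" reduces to a single bitwise AND with
# mask = size - 1, which is what "lastHead and fMask" computes
# in the Pascal source.
size = 1024          # 2**10, the minimum length mentioned above
mask = size - 1      # plays the role of fMask

def cell_index(counter):
    return counter & mask

assert cell_index(0) == 0
assert cell_index(1023) == 1023
assert cell_index(1024) == 0      # wraps around like counter % size
assert all(cell_index(c) == c % size for c in range(5000))
```

This is why the constructor rejects sizes that are not powers of 2: the AND trick only equals the modulo for such sizes.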


We begin with the push() method; its source code looks like this:

===

function TWQueue.push(tm:long):boolean;
var lastHead,newtemp:long;
    k:integer;
begin
  result:=true;
  newTemp:=LockedIncLong(head);
  lastHead:=newtemp-1;
  while getlength1(newtemp) > fsize do
    {$IFDEF FPC} ThreadSwitch; {$ENDIF}
    {$IFDEF Delphi} sleep(0); {$ENDIF}
  repeat
    if fcount1^[lastHead and fMask].flag1=0 then break;
    {$IFDEF FPC} ThreadSwitch; {$ENDIF}
    {$IFDEF Delphi} sleep(0); {$ENDIF}
  until false;
  setObject(lastHead,tm);
  fcount1^[lastHead and fMask].flag1:=1;
end;

===
Now we look at the rest of the code...

Every cell of the array-based queue looks like this:

type cell = record
  obj:long;
  flag1:long;
  flag2:long;
  {$IFDEF CPU32} cache:typecache2; {$ENDIF CPU32}
  {$IFDEF CPU64} cache:typecache3; {$ENDIF CPU64}
end;

Each cell is padded to a cache-line size and the array is aligned on 64 bytes so as to avoid false sharing...

When a thread enters the push() method it will increment the "head" with an atomic increment, like this:

newTemp:=LockedIncLong(head);
lastHead:=newtemp-1;

After that the thread tests whether "getlength1(newtemp) > fsize", that is, whether we have gone beyond the length of the queue. If we have gone beyond the length of the queue the thread spin-waits; otherwise it continues and tests "fcount1^[lastHead and fMask].flag1=0", that is, whether the cell is empty. If fcount1^[lastHead and fMask].flag1 is not equal to 0 the thread spin-waits; if it is equal to zero the cell is empty, so the thread writes the item into the cell with "setObject(lastHead,tm);" and then sets the flag of the cell to 1, like this: "fcount1^[lastHead and fMask].flag1:=1;", so that pop() can read from it. So as you have noticed, I have reasoned about and explained the push() side, and I think my reasoning is correct here.

Now here is the new pop() method...

==

function TWQueue.pop(var obj:long):boolean;
var lastTail,newtemp:long;
    k:integer;
begin
  result:=true;
  newTemp:=LockedIncLong(temp);
  lastTail:=newtemp-1;
  repeat
    if fcount1^[lastTail and fMask].flag1=1 then break;
    {$IFDEF FPC} ThreadSwitch; {$ENDIF}
    {$IFDEF Delphi} sleep(0); {$ENDIF}
  until (false);
  obj:=getObject(lastTail);
  repeat
    if fcount1^[lastTail and fMask].flag2=1 then break;
    {$IFDEF FPC} ThreadSwitch; {$ENDIF}
    {$IFDEF Delphi} sleep(0); {$ENDIF}
  until (false);
  fcount1^[lastTail and fMask].flag2:=0;
  fcount1^[lastTail and fMask].flag1:=0;
  tail:=newtemp;
  fcount1^[newtemp and fMask].flag2:=1;
end;

==
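Here is a matching Python sketch of the pop() protocol, again as an illustration only, not the author's code. The lock-based _inc_temp emulates LockedIncLong, and I am assuming the constructor initializes flag2 of cell 0 to 1 so that the very first popper can complete its hand-off:

```python
import threading

class PopSide:
    def __init__(self, size):                # size must be a power of 2
        self.size = size
        self.mask = size - 1                 # fMask
        self.temp = 0                        # pop ticket counter
        self.tail = 0
        self.obj = [None] * size
        self.flag1 = [0] * size              # set to 1 by push()
        self.flag2 = [0] * size              # hand-off token between poppers
        self.flag2[0] = 1                    # assumed constructor initialization
        self._lock = threading.Lock()        # emulates LockedIncLong

    def _inc_temp(self):
        with self._lock:                     # atomic fetch-and-increment
            self.temp += 1
            return self.temp

    def pop(self):
        new_temp = self._inc_temp()          # newTemp:=LockedIncLong(temp)
        last_tail = new_temp - 1             # lastTail:=newtemp-1
        i = last_tail & self.mask
        while self.flag1[i] != 1:
            pass                             # wait until push() published an item
        item = self.obj[i]                   # obj:=getObject(lastTail)
        while self.flag2[i] != 1:
            pass                             # wait for the previous popper's hand-off
        self.flag2[i] = 0
        self.flag1[i] = 0                    # cell is free for push() again
        self.tail = new_temp                 # advance the shared tail
        self.flag2[new_temp & self.mask] = 1 # signal the next popping thread
        return item

q = PopSide(1024)
q.obj[0], q.flag1[0] = "a", 1                # pretend push() stored two items
q.obj[1], q.flag1[1] = "b", 1
assert q.pop() == "a"
assert q.pop() == "b"
assert q.tail == 2 and q.flag1[0] == 0
```

The flag2 chain is what serializes the final hand-off between poppers, which is how the queue stays FIFO even though each popper reads its item as soon as flag1 is set.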
So as you have noticed, we have to reason about this algorithm to prove that all is correct, so follow with me please...

When a thread enters the pop() method it will atomically increment the "temp" variable, and after that it will wait until fcount1^[lastTail and fMask].flag1 is equal to 1, that is, until an item is available in cell number "lastTail and fMask". Then it gets the item from the cell, and after that it waits until fcount1^[lastTail and fMask].flag2 equals 1, that is, it waits for the previous pop thread on lastTail-1 to set fcount1^[lastTail and fMask].flag2 to 1. After that the thread sets flag2 to 0, and then sets flag1 to 0 so that the push side will be able to put an item in the cell. Even though the thread sets flag1 to 0 before incrementing "tail", the algorithm is still correct, since the push threads are limited in how much they can push by the length of the queue. After that the thread increments tail with "tail:=newtemp", and then signals the next popping thread by setting its flag2 to 1. So I think my algorithm is correct and efficient.

Note: The number of threads on the pop() side is limited by the length of the queue; the length of the queue must be greater than or equal to 2^10, I have set it like that.

So as you have noticed, I have reasoned about my algorithm and I think my algorithm is correct now. Finally, I have benchmarked my algorithm without using my SemaMonitor and it has scored 12.5 million transactions per second on my 2.4 GHz quad-core, and it efficiently minimizes the cache-coherence traffic and reduces contention, so that it can better scale with more and more cores.

You can download my new algorithm of a very fast concurrent FIFO queue from:

https://sites.google.com/site/aminer68/concurrent-fifo-queue-1
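To make the whole protocol concrete, here is a hypothetical single-file Python sketch that puts both sides together and exercises them from a concurrent producer and consumer. It is an illustration of the protocol, not the author's implementation: the lock-based _fetch_inc emulates the hardware atomic LockedIncLong, getlength1 is assumed to be head minus tail, and Python cannot express the cache-line padding that the real queue relies on:

```python
import threading

class FifoQueue:
    def __init__(self, size):                 # size must be a power of 2
        self.size, self.mask = size, size - 1
        self.head = 0                         # push ticket counter
        self.temp = 0                         # pop ticket counter
        self.tail = 0
        self.obj = [None] * size
        self.flag1 = [0] * size               # 1 = cell holds an item
        self.flag2 = [0] * size               # hand-off token between poppers
        self.flag2[0] = 1                     # first popper may complete
        self._lock = threading.Lock()

    def _fetch_inc(self, name):               # emulates LockedIncLong
        with self._lock:
            value = getattr(self, name) + 1
            setattr(self, name, value)
            return value

    def push(self, item):
        new_temp = self._fetch_inc('head')
        i = (new_temp - 1) & self.mask
        while new_temp - self.tail > self.size:
            pass                              # spin-wait: queue is full
        while self.flag1[i] != 0:
            pass                              # spin-wait: cell still occupied
        self.obj[i] = item
        self.flag1[i] = 1                     # publish to the pop side

    def pop(self):
        new_temp = self._fetch_inc('temp')
        i = (new_temp - 1) & self.mask
        while self.flag1[i] != 1:
            pass                              # wait for an item
        item = self.obj[i]
        while self.flag2[i] != 1:
            pass                              # wait for the previous popper
        self.flag2[i] = 0
        self.flag1[i] = 0                     # free the cell for push()
        self.tail = new_temp
        self.flag2[new_temp & self.mask] = 1  # signal the next popper
        return item

q = FifoQueue(1024)
out = []
consumer = threading.Thread(target=lambda: out.extend(q.pop() for _ in range(100)))
consumer.start()
for n in range(100):
    q.push(n)
consumer.join()
assert out == list(range(100))               # items come out in FIFO order
```

In this sketch the Python GIL stands in for the memory ordering that the real Pascal code gets from its atomic operations; a faithful native port would need explicit fences or sequentially consistent atomics.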







Thank you,
Amine Moulay Ramdane.



-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/b07d982f-4779-474f-b4ae-e0911241fff2%40googlegroups.com.
