Re: [lock-free] Work-stealing with LIFO local pop and FIFO steal

2017-08-04 Thread bdtrif


On Thursday, August 3, 2017 at 11:13:37 PM UTC-7, Dmitry Vyukov wrote:
>
> On Mon, Jul 31, 2017 at 9:14 PM,   wrote: 
> > 2) What's the most efficient way to prevent busy spinning when there's no
> > work left? This occurs often since my typical workloads come from
> > real-time data streams. Right now, I have workers doing exponential
> > back-off after failed steal attempts (with victims chosen randomly). I was
> > thinking of maybe using SPSC queues from each worker to periodically send
> > its status (sleeping or not) so that new work is sent first to these
> > threads (rather than simple round-robin as I'm doing now, using SPSC per
> > worker already for sending new work).
>
> Search this group and comp.programming.threads archives for 
> "eventcount". And also this: 
>
> https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/299245
>  
> Eventcounts are the generic way to wait for arbitrary predicates in 
> lock-free algorithms. 
Looks like there's an efficient eventcount implementation at
https://locklessinc.com/articles/obscure_synch/ about two-thirds of the way
down the page. It uses futexes, but on Windows (8+) it can be done with
WaitOnAddress() etc. If converting to C++11 atomics, there's no standard
library operation that will issue "lock bts"; I think atomic_fetch_or()
would work, but I wonder about any differences in efficiency.
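
For that last point, a minimal sketch of the fetch_or route (the bit layout
and all names here are my assumptions, not the locklessinc code):

#include <atomic>
#include <cstdint>

// Hypothetical eventcount word: low bits count notifications, the top
// bit flags "waiters present".
constexpr uint32_t kWaiterBit = 1u << 31;
std::atomic<uint32_t> ec{0};

// Waiter side: set the flag and learn whether it was already set. On x86
// a compiler may turn this single-bit fetch_or test into "lock bts", or
// fall back to a "lock cmpxchg" loop; the semantics are the same either way.
bool AnnounceWaiter()
{
    auto prev = ec.fetch_or(kWaiterBit, std::memory_order_seq_cst);
    return (prev & kWaiterBit) == 0; // true if we are the first waiter
}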



Re: [lock-free] Work-stealing with LIFO local pop and FIFO steal

2017-08-04 Thread Dmitry Vyukov
On Mon, Jul 31, 2017 at 9:14 PM,   wrote:
> Hi all,
>
> I've been using the following lock-free work-stealing deque for a while:
> https://blog.molecular-matters.com/2015/09/25/job-system-2-0-lock-free-work-stealing-part-3-going-lock-free/
> I'm not using an MPMC queue since pops by the owning worker thread should be
> from the end of the queue it's pushing from (opposite from the stealing
> end), as LIFO makes it more likely the working set will be in cache.
> I'm posting here with a couple of questions:
> 1) To inquire about whether there's a more efficient algorithm, as the pop
> in this implementation has both a full barrier and a CAS, and so it's
> expensive on the 2nd most common operation (after the push, assuming good
> initial task distribution that makes steals less frequent).

Hi,

This one looks good.
Note that pop contains only a full barrier in the common case (when the
deque is not close to empty). And that's as cheap as you can get, because
pop needs to reach consensus with steal as to who gets which elements.
Just don't schedule a single integer addition as a parallel task.
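
For reference, the common-case path looks roughly like this (a sketch
following the linked blog's algorithm, using the member names from the code
quoted below; not a drop-in implementation):

template<typename T>
T WorkStealQ<T>::Pop(void)
{
    // Optimistically reserve the item at the owner's end.
    auto head{_head.load(std::memory_order_relaxed) - 1};
    _head.store(head, std::memory_order_relaxed);
    // The full barrier: publish the decremented head before reading tail,
    // so Pop and Steal agree on who owns the contested item.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    auto tail{_tail.load(std::memory_order_relaxed)};
    if (static_cast<int32_t>(head - tail) > 0)
        return _tasks[head & _mask]; // common case: no CAS needed
    if (head == tail) // exactly one item left: race with Steal via CAS
    {
        T task{_tasks[head & _mask]};
        if (!_tail.compare_exchange_strong(tail, tail + 1))
            task = T{}; // a thief won the race; return an empty task
        _head.store(tail + 1, std::memory_order_relaxed);
        return task;
    }
    _head.store(head + 1, std::memory_order_relaxed); // deque was empty
    return T{};
}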


> 2) What's the most efficient way to prevent busy spinning when there's no
> work left? This occurs often since my typical workloads come from real-time
> data streams. Right now, I have workers doing exponential back-off after
> failed steal attempts (with victims chosen randomly). I was thinking of
> maybe using SPSC queues from each worker to periodically send its status
> (sleeping or not) so that new work is sent first to these threads (rather
> than simple round-robin as I'm doing now, using SPSC per worker already for
> sending new work).

Search this group and comp.programming.threads archives for
"eventcount". And also this:
https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/299245
Eventcounts are the generic way to wait for arbitrary predicates in
lock-free algorithms.
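
The usage pattern is roughly the following (prepare_wait/cancel_wait/
commit_wait/notify_one are illustrative names in the spirit of the TBB
discussion linked above, not a specific library's API; TrySteal/Run are
your own functions):

// Worker side -- the re-check between registering and sleeping is what
// makes the scheme race-free:
for (;;)
{
    if (Task *t = TrySteal()) { Run(t); continue; }
    auto ticket = ec.prepare_wait(); // announce intent to sleep
    if (Task *t = TrySteal())        // re-check: work may have just arrived
    {
        ec.cancel_wait(ticket);      // it did; don't block
        Run(t);
        continue;
    }
    ec.commit_wait(ticket);          // block until a producer notifies
}

// Producer side, after the new work is visible in the queue:
queue.Push(std::move(task));
ec.notify_one(); // wake at most one sleeping worker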


> 3) To ask whether in general there's any benefit to doing strided indexing
> of a buffer in this sort of data structure, so that operations by different
> threads on different ends of the deque would tend not to read/write the
> same cache line when the buffer has only a few items. Related is whether,
> given that both head and tail are accessed by the different operations in
> this algorithm, there's any benefit to putting padding between them.

Yes.
Try benchmarking it. It will also help you answer the other questions and
compare with alternatives.
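
For the padding half of question 3, a minimal sketch (64 bytes is an
assumption about the target CPU's cache-line size; C++17 has
std::hardware_destructive_interference_size for this):

// Keep the owner-side index and the thief-side index on different cache
// lines so Push/Pop and Steal don't false-share:
alignas(64) std::atomic<uint32_t> _head{0};
alignas(64) std::atomic<uint32_t> _tail{0};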



> 4) Is there a performance benefit to making the size a template parameter as
> well, given that this class is already templated on item type?

See above.
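
Concretely, a hypothetical fixed-capacity variant of the class quoted below
would look like this; the mask becomes a compile-time constant (saving a
member load per operation) and the buffer can live inline in the object
instead of behind std::vector's pointer:

#include <array>
#include <atomic>
#include <cstdint>

template<typename T, uint32_t kSize>
class WorkStealQFixed
{
    static_assert(kSize >= 2 && (kSize & (kSize - 1)) == 0,
                  "kSize must be a power of two");
    static constexpr uint32_t kMask = kSize - 1; // known at compile time
    std::array<T, kSize> _tasks;                 // no heap indirection
    std::atomic<uint32_t> _head{0}, _tail{0};
};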


> #include <atomic>
> #include <cstdint>
> #include <limits>
> #include <stdexcept>
> #include <vector>
>
> class Task;
>
> // Restricted deque for work-stealing parallelism
> // Based on
> // https://blog.molecular-matters.com/2015/09/25/job-system-2-0-lock-free-work-stealing-part-3-going-lock-free/
> // Push and LIFO pop at one end by a single thread, and FIFO pop from the
> // other end by other threads
>
> template<typename T>
> class WorkStealQ
> {
> public:
> WorkStealQ(void) = delete;
> inline WorkStealQ(uint32_t bufferSize); // Must be a positive power-of-two
> WorkStealQ(WorkStealQ const &) = delete;
> inline WorkStealQ(WorkStealQ &&) noexcept = default;
> WorkStealQ &operator=(WorkStealQ const &) = delete;
> inline WorkStealQ &operator=(WorkStealQ &&) = default;
> // PushUnsafe() will not check for a full queue, in the case where task
> // counts are externally constrained
> // This is more efficient due to the lack of need to reference the atomic
> // head variable
> inline void PushUnsafe(T &&task);
> inline void PushUnsafe(T const &task);
> inline bool Push(T &&task);
> inline bool Push(T const &task);
> inline T Pop(void);
> inline T Steal(void);
> inline uint32_t Space(void) const noexcept; // Slow as it checks atomic
> // head and tail
> private:
> const uint32_t _mask;
> std::vector<T> _tasks;
> std::atomic<uint32_t> _head, _tail; /// TODO: Try padding
> };
>
> template<typename T>
> inline WorkStealQ<T>::WorkStealQ(uint32_t bufferSize) :
> _mask(bufferSize - 1), _tasks(bufferSize), _head{0}, _tail{0}
> {
> if (bufferSize < 2
> || bufferSize > static_cast<uint32_t>(std::numeric_limits<int32_t>::max())
> // Limit is needed for the signed comparison cast in Push()
> || bufferSize & _mask)
> throw std::invalid_argument("bufferSize must be a positive power-of-two"
> " that fits in a signed 32-bit integer");
> }
>
> template<typename T>
> inline void WorkStealQ<T>::PushUnsafe(T &&task)
> {
> auto head{_head.load(std::memory_order_relaxed)};
> _tasks[head & _mask] = std::move(task);
> _head.store(head + 1, std::memory_order_release);
> }
>
> template<typename T>
> inline void WorkStealQ<T>::PushUnsafe(T const &task)
> {
> auto head{ _head.load(std::memory_order_relaxed) };
> _tasks[head & _mask] = task;
> _head.store(head + 1, std::memory_order_release);
> }
>
> template<typename T>
> inline bool WorkStealQ<T>::Push(T &&task)
> {
> auto head{_head.load(std::memory_order_relaxed)};
> if (static_cast<int32_t>(head - _tail.load(std::memory_order_relaxed)) >=
> static_cast<int32_t>(_tasks.size())) return false;
> _tasks[head & _mask] = std::move(task);
> _head.store(head + 1, std::memory_order_release);
> return true;
> }