Michael,

On Fri, 4 Sep 2020, Ferguson, Michael Paul Pratt (Chapel Developer) wrote:

Here I am interpreting "locking" as a software lock that protects a critical section and causes other tasks trying to enter the critical section to wait. (Like ` pthread_mutex_lock`).

That was my understanding although I thought it was simpler than a
pthread_mutex_lock.

Atomics don't use locks on any normal configurations of Chapel. Atomics are generally pretty fast and are directly supported by the processor (or possibly the network).

They use special CPU instructions to ensure atomicity. One potential source of confusion is that on x86 these might involve the `lock` prefix; but AFAIK that instruction prefix has this name for historical reasons and might more reasonably be named something else like `atomic`.

That expands my understanding greatly.

Your excellent explanation belongs in the documentation somewhere.

You can use a relaxed atomic to get pretty much the same effect
as a ``volatile`` in your program and I would expect it to have
similarly low overhead.

Ditto. But I need to learn more about all the various types of atomic.
And clear documentation on that seems very thin on the ground. Well, it was the last time I looked for a C++ content.

If you do some research on volatile, you might learn that it's not able to control the way the CPU processor itself optimizes loads/stores (rather; it only prevents the compiler from doing so). While that might be OK on x86 it would not function on other platforms like ARM for example. (x86 uses TSO - Total Store Order - which is a stronger constraint than other platforms).

OK. Stopping the compiler doing naughty things is a good start as I have found when trying to interrogating the Floating Point Control/Status Register.

In your example, a task calling `putStage` could, on a non-x86 system,
end up storing the value of `stage[i]` in a per-core cache somewhere,
(let's say, L1 cache, or a write reorder buffer), and put
off committing it to memory indefinitely. As a result, the parallel
program would have a load imbalance problem since the current value
of this variable isn't being communicated to the other tasks.

Understood. If I update a column of a matrix in one core in some parallel task T, I would still hope that if subsequently, i.e. in a serial sense and after the task T has finished, I try and read that same column of the matrix in another core in another parallel task T', I see only the updated column.

Or, worse than that, the write in `putStage` might store something in memory that is neither the previous value nor the new value but some mix of the two. (For example, maybe only the low byte is set to the new value on a platform only able to write 1 byte at a time to memory). It is hard to see how the algorithm could function correctly in this situation.

Agreed.  See my earlier comment.

These cases are part of the reason that processors support atomics -
they allow different tasks/threads/cores to communicate in a
reasonable manner.

As I said, I assume that a program spawing 2 tasks, say T1 and T2 updating columns 1 and 2 respectively of a matrix, can be assured on completion of both T1 and T2 that any subsequent tasks will see only the values written by T1 and T2 into the matrix.

But you don't have to just take it from me

I believe you. Your knowledge is good enough for me.

The atomic section of the spec is here:

https://chapel-lang.org/docs/master/language/spec/task-parallelism-and-synchronization.html#atomic-variables

I have read that numerous times and it found it less than useful.

However the information you probably need is here:

You are very correct.

https://chapel-lang.org/docs/language/spec/memory-consistency-model.html#relaxed-atomic-operations

I had seen/read this - many times. Every time I try to read it, my head hurts!

However the specification does not currently describe acquire, release, or acqRel orderings. I will add an open issue note so that it is clearer that this is missing from the spec and not just something one isn't finding.

Great.

I will update the link to refer to the right section (which BTW we
could not do before with the PDF spec).

Thanks.

PR #16341 will make the documentation improvements I mentioned here.

Perfect.

Thanks for the insight - Damian

P.S. If you want to know how useful your advice is, read on.

For my problem, I will now use atomics. I believe relaxed ordering will do the job based on my (very limited) understanding of the details of memory consistency.

This is for a Jacobi SVD.

My algorithm, now using atomics, with a matrix of order 36, i.e. a 36x36 real(w)'s. on my 6-core machine, would have 35 tasks created by a forall over 1..N-1. On average, almost 6 tasks going concurrently all the time. Each such task would on average, process N/2 columns of the matrix one after the other.

You can alternatively reorganize the algorithm into M (sequential) steps. Each one of those M steps, say the I'th, would involve multiple Jacobi rotations on a full column pair within the matrix. Each such column pair operation can be run in parallel as each I'th step has only operations which are independent of each other. No need for atomics. There are 70 such groups for the 36x36 matrix, requiring a total of 630 tasks where there are on average between 1 and 18 tasks per group. Obviouls you can merge some of the column pairs to reduce the parallelization load. But the code to do this seriously obscures the logic of the underlying algorithm and hence the program's readability, and makes the code parallel-centric which goes against all of our programming KPIs. There are several other reordering algorithms but I do not understand them well enough to program them in Chapel nor can I find a remotely readable parallel implementation of them in any reference in any language.

Apart from the readability concerns, it is the nominal 600+ tasks for the algorithm which avoids atomics as opposed to 36 tasks for one which needs atomics. I will run with the code which uses 'atomics'. There are also only 6 extra LOC (lines of code within the algorithm) over the serial version and the algorithm details are unchanged.

The discepancy just gets worse for even larger matrices.

Cache misses do occur in both approaches because they need to process a column-major matrix by columns. Unavoidable sadly.

For reasonably sized matrices, the Jacobi algorithm is arguably now the algorithm of preference for SVD. It is far more accurate than a 1970s Golub+Reinsch (Householder reduction-based) SVD so as that which I have provided as a reference point for the GSoC project. On the down side, it has more floating point computations than the Golub+Reinsch approach. But on the plus side from a parallelization perspective, the Jacobi algorithm (with atomics as I describe) will I think have a much reduced parallelism overhead compared to my Golub+Reinsch code.

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer


_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Reply via email to