Michael,
On Fri, 4 Sep 2020, Ferguson, Michael Paul Pratt (Chapel Developer) wrote:
Here I am interpreting "locking" as a software lock that protects a
critical section and causes other tasks trying to enter the critical
section to wait. (Like ` pthread_mutex_lock`).
That was my understanding although I thought it was simpler than a
pthread_mutex_lock.
Atomics don't use locks on any normal configurations of Chapel. Atomics
are generally pretty fast and are directly supported by the processor
(or possibly the network).
They use special CPU instructions to ensure atomicity. One potential
source of confusion is that on x86 these might involve the `lock`
prefix; but AFAIK that instruction prefix has this name for historical
reasons and might more reasonably be named something else like `atomic`.
That expands my understanding greatly.
Your excellent explanation belongs in the documentation somewhere.
You can use a relaxed atomic to get pretty much the same effect
as a ``volatile`` in your program and I would expect it to have
similarly low overhead.
Ditto. But I need to learn more about all the various types of atomic.
And clear documentation on that seems very thin on the ground. Well, it
was the last time I looked for a C++ content.
If you do some research on volatile, you might learn that it's not able
to control the way the CPU processor itself optimizes loads/stores
(rather; it only prevents the compiler from doing so). While that might
be OK on x86 it would not function on other platforms like ARM for
example. (x86 uses TSO - Total Store Order - which is a stronger
constraint than other platforms).
OK. Stopping the compiler doing naughty things is a good start as I have
found when trying to interrogating the Floating Point Control/Status
Register.
In your example, a task calling `putStage` could, on a non-x86 system,
end up storing the value of `stage[i]` in a per-core cache somewhere,
(let's say, L1 cache, or a write reorder buffer), and put
off committing it to memory indefinitely. As a result, the parallel
program would have a load imbalance problem since the current value
of this variable isn't being communicated to the other tasks.
Understood. If I update a column of a matrix in one core in some parallel
task T, I would still hope that if subsequently, i.e. in a serial sense
and after the task T has finished, I try and read that same column of the
matrix in another core in another parallel task T', I see only the updated
column.
Or, worse than that, the write in `putStage` might store something in
memory that is neither the previous value nor the new value but some mix
of the two. (For example, maybe only the low byte is set to the new
value on a platform only able to write 1 byte at a time to memory). It
is hard to see how the algorithm could function correctly in this
situation.
Agreed. See my earlier comment.
These cases are part of the reason that processors support atomics -
they allow different tasks/threads/cores to communicate in a
reasonable manner.
As I said, I assume that a program spawing 2 tasks, say T1 and T2 updating
columns 1 and 2 respectively of a matrix, can be assured on completion of
both T1 and T2 that any subsequent tasks will see only the values written
by T1 and T2 into the matrix.
But you don't have to just take it from me
I believe you. Your knowledge is good enough for me.
The atomic section of the spec is here:
https://chapel-lang.org/docs/master/language/spec/task-parallelism-and-synchronization.html#atomic-variables
I have read that numerous times and it found it less than useful.
However the information you probably need is here:
You are very correct.
https://chapel-lang.org/docs/language/spec/memory-consistency-model.html#relaxed-atomic-operations
I had seen/read this - many times. Every time I try to read it, my head
hurts!
However the specification does not currently describe acquire, release,
or acqRel orderings. I will add an open issue note so that it is clearer
that this is missing from the spec and not just something one isn't
finding.
Great.
I will update the link to refer to the right section (which BTW we
could not do before with the PDF spec).
Thanks.
PR #16341 will make the documentation improvements I mentioned here.
Perfect.
Thanks for the insight - Damian
P.S. If you want to know how useful your advice is, read on.
For my problem, I will now use atomics. I believe relaxed ordering will do
the job based on my (very limited) understanding of the details of memory
consistency.
This is for a Jacobi SVD.
My algorithm, now using atomics, with a matrix of order 36, i.e. a 36x36
real(w)'s. on my 6-core machine, would have 35 tasks created by a forall
over 1..N-1. On average, almost 6 tasks going concurrently all the time.
Each such task would on average, process N/2 columns of the matrix one
after the other.
You can alternatively reorganize the algorithm into M (sequential) steps.
Each one of those M steps, say the I'th, would involve multiple Jacobi
rotations on a full column pair within the matrix. Each such column pair
operation can be run in parallel as each I'th step has only operations
which are independent of each other. No need for atomics. There are 70
such groups for the 36x36 matrix, requiring a total of 630 tasks where
there are on average between 1 and 18 tasks per group. Obviouls you can
merge some of the column pairs to reduce the parallelization load. But the
code to do this seriously obscures the logic of the underlying algorithm
and hence the program's readability, and makes the code parallel-centric
which goes against all of our programming KPIs. There are several other
reordering algorithms but I do not understand them well enough to program
them in Chapel nor can I find a remotely readable parallel implementation
of them in any reference in any language.
Apart from the readability concerns, it is the nominal 600+ tasks for the
algorithm which avoids atomics as opposed to 36 tasks for one which needs
atomics. I will run with the code which uses 'atomics'. There are also
only 6 extra LOC (lines of code within the algorithm) over the serial
version and the algorithm details are unchanged.
The discepancy just gets worse for even larger matrices.
Cache misses do occur in both approaches because they need to process a
column-major matrix by columns. Unavoidable sadly.
For reasonably sized matrices, the Jacobi algorithm is arguably now the
algorithm of preference for SVD. It is far more accurate than a 1970s
Golub+Reinsch (Householder reduction-based) SVD so as that which I have
provided as a reference point for the GSoC project. On the down side, it
has more floating point computations than the Golub+Reinsch approach. But
on the plus side from a parallelization perspective, the Jacobi algorithm
(with atomics as I describe) will I think have a much reduced parallelism
overhead compared to my Golub+Reinsch code.
Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers