Re: [Chapel-developers] Parallellism (subject to a constraint)

Damian McGuckin Sat, 05 Sep 2020 23:43:13 -0700


Michael,

On Fri, 4 Sep 2020, Ferguson, Michael Paul Pratt (Chapel Developer) wrote:

Here I am interpreting "locking" as a software lock that protects acritical section and causes other tasks trying to enter the criticalsection to wait. (Like ` pthread_mutex_lock`).


That was my understanding although I thought it was simpler than a
pthread_mutex_lock.

Atomics don't use locks on any normal configurations of Chapel. Atomicsare generally pretty fast and are directly supported by the processor(or possibly the network).

They use special CPU instructions to ensure atomicity. One potentialsource of confusion is that on x86 these might involve the `lock`prefix; but AFAIK that instruction prefix has this name for historicalreasons and might more reasonably be named something else like `atomic`.


That expands my understanding greatly.

Your excellent explanation belongs in the documentation somewhere.

You can use a relaxed atomic to get pretty much the same effect
as a ``volatile`` in your program and I would expect it to have
similarly low overhead.


Ditto. But I need to learn more about all the various types of atomic.

And clear documentation on that seems very thin on the ground. Well, itwas the last time I looked for a C++ content.

If you do some research on volatile, you might learn that it's not ableto control the way the CPU processor itself optimizes loads/stores(rather; it only prevents the compiler from doing so). While that mightbe OK on x86 it would not function on other platforms like ARM forexample. (x86 uses TSO - Total Store Order - which is a strongerconstraint than other platforms).

OK. Stopping the compiler doing naughty things is a good start as I havefound when trying to interrogating the Floating Point Control/StatusRegister.

In your example, a task calling `putStage` could, on a non-x86 system,
end up storing the value of `stage[i]` in a per-core cache somewhere,
(let's say, L1 cache, or a write reorder buffer), and put
off committing it to memory indefinitely. As a result, the parallel
program would have a load imbalance problem since the current value
of this variable isn't being communicated to the other tasks.

Understood. If I update a column of a matrix in one core in some paralleltask T, I would still hope that if subsequently, i.e. in a serial senseand after the task T has finished, I try and read that same column of thematrix in another core in another parallel task T', I see only the updatedcolumn.

Or, worse than that, the write in `putStage` might store something inmemory that is neither the previous value nor the new value but some mixof the two. (For example, maybe only the low byte is set to the newvalue on a platform only able to write 1 byte at a time to memory). Itis hard to see how the algorithm could function correctly in thissituation.


Agreed.  See my earlier comment.

These cases are part of the reason that processors support atomics -
they allow different tasks/threads/cores to communicate in a
reasonable manner.

As I said, I assume that a program spawing 2 tasks, say T1 and T2 updatingcolumns 1 and 2 respectively of a matrix, can be assured on completion ofboth T1 and T2 that any subsequent tasks will see only the values writtenby T1 and T2 into the matrix.

But you don't have to just take it from me


I believe you. Your knowledge is good enough for me.

The atomic section of the spec is here:

https://chapel-lang.org/docs/master/language/spec/task-parallelism-and-synchronization.html#atomic-variables


I have read that numerous times and it found it less than useful.


However the information you probably need is here:


You are very correct.

https://chapel-lang.org/docs/language/spec/memory-consistency-model.html#relaxed-atomic-operations

I had seen/read this - many times. Every time I try to read it, my headhurts!

However the specification does not currently describe acquire, release,or acqRel orderings. I will add an open issue note so that it is clearerthat this is missing from the spec and not just something one isn'tfinding.


Great.

I will update the link to refer to the right section (which BTW we
could not do before with the PDF spec).


Thanks.

PR #16341 will make the documentation improvements I mentioned here.


Perfect.

Thanks for the insight - Damian

P.S. If you want to know how useful your advice is, read on.

For my problem, I will now use atomics. I believe relaxed ordering will dothe job based on my (very limited) understanding of the details of memoryconsistency.


This is for a Jacobi SVD.

My algorithm, now using atomics, with a matrix of order 36, i.e. a 36x36real(w)'s. on my 6-core machine, would have 35 tasks created by a forallover 1..N-1. On average, almost 6 tasks going concurrently all the time.Each such task would on average, process N/2 columns of the matrix oneafter the other.

You can alternatively reorganize the algorithm into M (sequential) steps.Each one of those M steps, say the I'th, would involve multiple Jacobirotations on a full column pair within the matrix. Each such column pairoperation can be run in parallel as each I'th step has only operationswhich are independent of each other. No need for atomics. There are 70such groups for the 36x36 matrix, requiring a total of 630 tasks wherethere are on average between 1 and 18 tasks per group. Obviouls you canmerge some of the column pairs to reduce the parallelization load. But thecode to do this seriously obscures the logic of the underlying algorithmand hence the program's readability, and makes the code parallel-centricwhich goes against all of our programming KPIs. There are several otherreordering algorithms but I do not understand them well enough to programthem in Chapel nor can I find a remotely readable parallel implementationof them in any reference in any language.

Apart from the readability concerns, it is the nominal 600+ tasks for thealgorithm which avoids atomics as opposed to 36 tasks for one which needsatomics. I will run with the code which uses 'atomics'. There are alsoonly 6 extra LOC (lines of code within the algorithm) over the serialversion and the algorithm details are unchanged.


The discepancy just gets worse for even larger matrices.

Cache misses do occur in both approaches because they need to process acolumn-major matrix by columns. Unavoidable sadly.

For reasonably sized matrices, the Jacobi algorithm is arguably now thealgorithm of preference for SVD. It is far more accurate than a 1970sGolub+Reinsch (Householder reduction-based) SVD so as that which I haveprovided as a reference point for the GSoC project. On the down side, ithas more floating point computations than the Golub+Reinsch approach. Buton the plus side from a parallelization perspective, the Jacobi algorithm(with atomics as I describe) will I think have a much reduced parallelismoverhead compared to my Golub+Reinsch code.


Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer


_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers

Re: [Chapel-developers] Parallellism (subject to a constraint)

Reply via email to