Re: [Chapel-developers] Parallellism (subject to a constraint)

Ferguson, Michael Paul Pratt (Chapel Developer) Fri, 04 Sep 2020 06:22:21 -0700

Hi Damian -

(Just responding to the 2nd part of your mail).

See reply inline below.

> So, to avoid treading on the heels of its predecessor,
> Task#I+1 is continually reading/polling stage[I], which it must read
> from memory to ensure it knows what Task#I is doing. Again no locking is
> required. This is your garden variety C volatile variable. How do I handle
> it in Chapel? An atomic variable seems like massive overkill as I assume
> locking is involved with an atomic. I can access stage[??] through a pair
> of custom (tiny) external C routines
>
> void putStage(long *stage, long i, long k) { stage[i] = k; }
> and
> long getStage(long *stage, long i) { return stage[i]; }
>
> But that seems like I am sticking my head in the sand and avoiding the
> underlying problem. And it probably cripples any optimization being done
> in the code which is calling these routines.
>
> I still contend that Chapel needs a 'vol[atile]' declaration concept or
> something like it. The identifier so declared is much like a 'var[iable]'
> but it must always be accessed through, i.e. written-to/read-from, memory.
> But I do not know enough of Chapel's big picture so I could be talking
> through my hat! I wish I had known enough of Chapel to be a productive
> participant in the conversation in 2012 when volatile got removed from
> Chapel 1.15.0. Then again, maybe I am talking about peek/poke for atomics
> which appears to have lesser overhead than other types of memory ordering.
> But again, I have no idea what the overhead for these really is but it all
> seems way too high for what I want.

> An atomic variable seems like massive overkill as I assume
> locking is involved with an atomic

Here I am interpreting "locking" as a software lock that
protects a critical section and causes other tasks trying
to enter the critical section to wait. (Like `pthread_mutex_lock`).

Atomics don't use locks in any normal configuration of
Chapel. Atomics are generally pretty fast and are directly
supported by the processor (or possibly the network).

They use special CPU instructions to ensure atomicity.
One potential source of confusion is that on x86 these
might involve the `lock` prefix; but AFAIK that instruction prefix
has this name for historical reasons and might more reasonably
be named something else like `atomic`.

You can use a relaxed atomic to get pretty much the same effect
as a `volatile` in your program and I would expect it to have
similarly low overhead.
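
As a rough illustration of that idea, here is a minimal C11 sketch of
what relaxed-atomic versions of your two routines could look like. It
is only a sketch under assumptions: it presumes the `stage` array can be
declared `_Atomic long`, and `waitForStage` is a hypothetical helper
that is not part of your code. In Chapel the analogous approach would be
relaxed reads and writes on an atomic variable rather than extern C.

```c
#include <stdatomic.h>

/* Relaxed store: atomic (no torn writes), but no ordering is imposed
   on the surrounding non-atomic memory accesses. */
void putStage(_Atomic long *stage, long i, long k) {
    atomic_store_explicit(&stage[i], k, memory_order_relaxed);
}

/* Relaxed load: always re-reads the shared location, like volatile,
   and is also guaranteed to see a whole value. */
long getStage(_Atomic long *stage, long i) {
    return atomic_load_explicit(&stage[i], memory_order_relaxed);
}

/* Hypothetical helper: poll until task i has reached at least stage k.
   Returns the stage value that was actually observed. */
long waitForStage(_Atomic long *stage, long i, long k) {
    long seen;
    while ((seen = getStage(stage, i)) < k) {
        /* busy-wait; a real program would probably yield here */
    }
    return seen;
}
```

On x86 a relaxed load or store of a naturally aligned `long` normally
compiles to an ordinary move instruction, which is why I would not
expect it to cost more than the volatile version.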

If you do some research on volatile, you might learn that it cannot
control the way the processor itself reorders loads/stores; it only
prevents the compiler from optimizing them. While that might be OK
on x86, it would not work on other platforms such as ARM. (x86 uses
TSO - Total Store Order - which is a stronger ordering guarantee than
other platforms provide).

In your example, a task calling `putStage` could, on a non-x86 system,
end up storing the value of `stage[i]` in a per-core cache somewhere
(let's say, L1 cache or a write reorder buffer) and put
off committing it to memory indefinitely. As a result, the parallel
program would have a load imbalance problem since the current value
of this variable isn't being communicated to the other tasks.

Or, worse than that, the write in `putStage` might store something
in memory that is neither the previous value nor the new value
but some mix of the two. (For example, maybe only the low
byte is set to the new value on a platform only able to write
1 byte at a time to memory). It is hard to see how the algorithm
could function correctly in this situation.
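
To make the "mix of the two" case concrete, here is a small
single-threaded C simulation. It is purely illustrative: the function
name `tornAfterLowByte` and the 16-bit, two-byte layout are assumptions
chosen for the example, not a model of any particular platform.

```c
#include <stdint.h>

/* Simulates a 16-bit value stored as two separate bytes (low byte
   first) that can only be written one byte at a time. Returns what a
   reader would observe after the low byte has been updated but before
   the high byte has been written. */
uint16_t tornAfterLowByte(uint16_t oldValue, uint16_t newValue) {
    uint8_t bytes[2] = { (uint8_t)(oldValue & 0xFF),
                         (uint8_t)(oldValue >> 8) };
    bytes[0] = (uint8_t)(newValue & 0xFF);  /* first half of the write */
    /* the reader looks here, before bytes[1] is updated */
    return (uint16_t)(bytes[0] | (bytes[1] << 8));
}
```

Going from 0x01FF (511) to 0x0200 (512), the reader in this simulation
sees 0x0100 (256), which is neither the old stage nor the new one.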

These cases are part of the reason that processors support atomics -
they allow different tasks/threads/cores to communicate in a
reasonable manner.

But you don't have to just take it from me - the Linux kernel developers
also frown upon using volatile; see

https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

> By the way, the online Chapel documentation on atomics does not appear to
> explain (or link to an explanation of) memory ordering types. In an older
> PDF document, it refers to C11 which then refers to the C++ definition and
> other documentation.

The atomic section of the spec is here:

https://chapel-lang.org/docs/master/language/spec/task-parallelism-and-synchronization.html#atomic-variables

However the information you probably need is here:

https://chapel-lang.org/docs/language/spec/memory-consistency-model.html#relaxed-atomic-operations

Note that the specification does not currently describe
acquire, release, or acqRel orderings. I will add an open
issue note so that it is clearer that this is missing from the spec
and not just something one isn't finding.

> If you click on the Chapel Language Specification you land on a page which
> has no reference whatsoever about atomics. If I actually had any solid
> grasp of the subject, I would offer to rewrite it, but I do not, which is
> why I am reading about it in the first place. I find that by the time I
> have gone through all the links to links, I have long forgotten what my
> precise original Chapel problem was. Just my 2c. Might be worth putting
> onto the to-do list.

I will update the link to refer to the right section (which BTW we
could not do before with the PDF spec).

PR #16341 will make the documentation improvements I mentioned
here.

Best,

-michael
