Re: RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)

Peter Levart Wed, 29 Apr 2020 00:40:23 -0700



On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:

On 28/04/2020 21:27, Peter Levart wrote:
Hi,
The problem with current implementation of MemoryScope is that if achild scope is frequently acquired and closed (which increments andthen decrements the parent scope counter atomically using CAS), andthat is performed from multiple concurrent threads, contention mightbecome prohibitive. And I think that is precisely what happens when aparallel pipeline is such that it might short-circuit the stream:
final boolean forEachWithCancel(Spliterator<P_OUT> spliterator,Sink<P_OUT> sink) {
        boolean cancelled;
do { } while (!(cancelled = sink.cancellationRequested()) &&spliterator.tryAdvance(sink));
        return cancelled;
    }
1st spliterators are created by trySplit (all of them inherit thesame MemoryScope) and then FJPool threads are busy concurrentlyexecuting above method which calls tryAdvance for each element of theparticular spliterator which does the following:
public boolean tryAdvance(Consumer<? super MemorySegment>action) {
            Objects.requireNonNull(action);
            if (currentIndex < elemCount) {
                AbstractMemorySegmentImpl acquired = segment.acquire();
                try {
action.accept(acquired.asSliceNoCheck(currentIndex * elementSize,elementSize));
                } finally {
                    acquired.closeNoCheck();
                    currentIndex++;
                    if (currentIndex == elemCount) {
                        segment = null;
                    }
                }
                return true;
            } else {
                return false;
            }
        }
... acquire/close at each call. If the Stream is played to the end(i.e. it can't short-circuit), then forEachRemaining is used whichperforms just one acquire/close for the whole remaining spliterator.So for short-circuiting streams it might be important to have aMemoryScope that is scalable. Here's one such attempt using a pair ofscalable counters (just one pair per root memory scope):
The current implementation has performances that are on par with theprevious acquire-based implementation, and also on par with what canbe achieved with Unsafe. We do have a micro benchmark in the patch(see ParallelSum (**)) which tests this, and I get _identical_ numberseven if I _comment_ the body of acquire/release - so that nocontention can happen; so, I'm a bit skeptical overall that contentionon acquire/release is the main factor at play here - but perhaps weneed more targeted benchmarks.

Right, summing is typically not a short-circuiting operation, so I betforEachRemaining is used in the leaf spliterators. FindAny or findFirstmight be the ones to test. I'll prepare a test and see whatexperimenting with alternative MemoryScope can do...

(**) - your email caused me to look deeper at the ParallelSumbenchmark which, as currently written seems to favor Unsafe over theMemorySegment API - but in reality, as I discovered, that is down toan issue in the implementation of the unsafe spliterator, whichdoesn't sum all the elements; I will fix the benchmark in an upcomingiteration
So, while I'm open to suggestion as to how to reduce contention on theacquire counter, I think we need more evidence that this is indeed anissue (or the _main_ issue, when it comes to parallel computation).That said, your implementation looks interesting - some questionsinline and also below:


answers inline...



import java.util.concurrent.atomic.LongAdder;

/**
 * @author Peter Levart
 */
public abstract class MemoryScope {

public static MemoryScope create(Object ref, RunnablecleanupAction) {

        return new Root(ref, cleanupAction);
    }

    MemoryScope() {}

    public abstract MemoryScope acquire();

    public abstract void close();

    private static class Root extends MemoryScope {
        private final LongAdder enters = new LongAdder();
        private final LongAdder exits = new LongAdder();
        private volatile boolean closed;

        private final Object ref;
        private final Runnable cleanupAction;

        Root(Object ref, Runnable cleanupAction) {
            this.ref = ref;
            this.cleanupAction = cleanupAction;
        }

        @Override
        public MemoryScope acquire() {
            // increment enters 1st
            enters.increment();
            // check closed flag 2nd
            if (closed) {
                exits.increment();

throw new IllegalStateException("This scope isalready closed");

            }

            return new MemoryScope() {
                @Override
                public MemoryScope acquire() {
                    return Root.this.acquire();
                }

                @Override
                public void close() {
                    exits.increment();

Here -- don't you mean Root.this.exits? Otherwise Root.exists is gonnaremain != from Root.enters?

'exits' field is declared in Root only, so 'exits' in anonymous innerclass actually refers to Root.this.exits...

}
            };
        }

        private final Object lock = new Object();

        @Override
        public void close() {
            synchronized (lock) {
Why the lock? If we are here we're already in the owner thread - e.g.it's not like multiple threads can call this at the same time. Or areyou trying to make the code more robust in the case a segment iscreated w/o a confinement thread (e.g. via the unsafe API) ?


That lock is unnecessary. A leftover.

// modify closed flag 1st
                closed = true;
                // check for no more active acquired children 2nd
                // IMPORTANT: 1st sum exits, then sum enters !!!
                if (exits.sum() != enters.sum()) {
throw new IllegalStateException("Cannot closethis scope as it has active acquired children");
                }
            }
            if (cleanupAction != null) {
                cleanupAction.run();
            }
        }
    }
}
This MemoryScope is just 2-level. The root is the one that is to becreated when the memory segment is allocated. A child is always achild of the root and has no own children. So a call tochild.acquire() gets forwarded to the Root. The Root.acquire() 1stincrements 'enters' scalable counter then checks the 'closed' flag.The child.close() just increments the 'exits' scalable counter. TheRoot.close() 1st modifies the 'closed' flag then checks to see thatthe sum of 'exits' equals the sum of 'enters' - the important thinghere is that 'exits' are summed 1st and then 'enters'. Theseorderings guarantee that either a child scope is successfullyacquired or the root scope is successfully closed but never both.
I guess what you mean here is that, by construction, exits <= enters.
So, we first read exists, then we read enters - and there can be onlytwo cases:
* exits < enters, in which case it means some other thread hasacquired but not closed (possibly even *after* the call to exits.sum())* exits == enters, in which case there's no pending acquire and we'regolden
While this all seems very clever there are some things that don't 100%convince me - for instance, I note that `closed` will stay set even ifwe later throw an ISE during close(). I suppose we *could* resetclosed = false in the throwing code path, but then there's apossibility of having generated spurious ISE in MemoryScope::acquirein the time span where `closed = true`.

Right, as you saw in a private Email, I did exactly that in a revisedversion (posted below). The spurious ISE may happen but only when closeis called prematurely relative to all child scope close(s) that were orare still active. So we could say the other way: if close was not calledprematurely, the ISE on acquire would not be spurious - so ordering ofclose relative to later acquire was wrong anyway and the exception is an"indication" of that wrong ordering

Unless we want to use close() to "probe" the scope whether it is stillactive or not, I don't think this should present a problem.

In other words, one of the big realization of the currentsynchronization mechanism behind acquire() is that if we fold the"closed" state with the "count" state, we then have to worry onlyabout one access, which makes it easier to reason about theimplementation. Here it seems that races between updates toexits/enters/closed would be possible, and I'm not sure we can fullyprotect against those w/o adding more locks?

The only problem I see with (the revised) version is this spurious ISEthat might be thrown on acquire executed concurrently with prematureclose. I see no other races. For any interesting readers, I'm postingthis revised version here so it doesn't look like I'm talking about somesecret:


import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.LongAdder;

/**
 * @author Peter Levart
 */
public abstract class MemoryScope {

    public static MemoryScope create(Object ref, Runnable cleanupAction) {
        return new Root(ref, cleanupAction);
    }

    boolean closed;
    private static final VarHandle CLOSED;

    static {
        try {

CLOSED = MethodHandles.lookup().findVarHandle(Root.class,"closed", boolean.class);

        } catch (Throwable ex) {
            throw new ExceptionInInitializerError(ex);
        }
    }

    MemoryScope() {}

    public abstract MemoryScope acquire();

    public abstract void close();

    public final boolean isAliveThreadSafe() {
        return !((boolean) CLOSED.getVolatile(this));
    }

    public final boolean isAliveConfined() {
        return !closed;
    }

    private static final class Root extends MemoryScope {
        private final LongAdder acquires = new LongAdder();
        private final LongAdder releases = new LongAdder();

        private final Object ref;
        private final Runnable cleanupAction;

        private Root(Object ref, Runnable cleanupAction) {
            this.ref = ref;
            this.cleanupAction = cleanupAction;
        }

        @Override
        public MemoryScope acquire() {
            // increment acquires 1st
            acquires.increment();
            // check closed flag 2nd
            if ((boolean) CLOSED.getVolatile(this)) {
                releases.increment();

throw new IllegalStateException("This scope is alreadyclosed");

            }
            return new Child();
        }

        @Override
        public void close() {
            if (closed) {

throw new IllegalStateException("This scope is alreadyclosed");

            }
            // modify closed flag 1st
            CLOSED.setVolatile(this, true);
            // check for no more active acquired children 2nd
            // IMPORTANT: 1st sum releases, then sum acquires !!!
            if (releases.sum() != acquires.sum()) {
                CLOSED.setVolatile(this, false); // undo before failing

throw new IllegalStateException("Cannot close thisscope as it has active acquired children");

            }
            if (cleanupAction != null) {
                cleanupAction.run();
            }
        }

        private final class Child extends MemoryScope {

            private Child() { }

            @Override
            public MemoryScope acquire() {
                return Root.this.acquire();
            }

            @Override
            public void close() {
                if (closed) {

throw new IllegalStateException("This scope isalready closed");

                }
                closed = true;

// following acts as a volatile write after plain writeabove so // plain write gets flushed too (which is important forisAliveThreadSafe())

                Root.this.releases.increment();
            }
        }
    }
}


Maurizio


Regards, Peter

WDYT?

Regards, Peter

On 4/28/20 6:12 PM, Peter Levart wrote:
Hi Maurizio,
I'm checking out the thread-confinement in the parallel stream case.I see the Spliterator.trySplit() is callingAbstractMemorySegmentImpl's:
102 private AbstractMemorySegmentImpl asSliceNoCheck(longoffset, long newSize) {
 103         return dup(offset, newSize, mask, owner, scope);
 104     }
...so here the "owner" of the slice is still the same as that ofparent segment...
But then later in tryAdvance or forEachRemaining, the segment isacquired/closed for each element of the stream (in case oftryAdvance) or for the whole chunk to the end of spliterator (incase of forEachRemaining). So some pipelines will be more optimalthan others...
So I'm thinking. Would it be possible to "lazily" acquire scope justonce in tryAdvance and then re-use the scope until the end?Unfortunately Spliterator does not have a close() method to becalled when the pipeline is done with it. Perhaps it could be addedto the API? This is not the 1st time I wished Spliterator had aclose method. I had a similar problem when trying to create aSpliterator with a database backend. When using JDBC API a separatetransaction (Connection) is typically required for each thread ofexecution since several frameworks bind it to the ThreadLocal.
WDYT?

Regards, Peter


On 4/23/20 10:33 PM, Maurizio Cimadamore wrote:
Hi,
time has come for another round of foreign memory access APIincubation (see JEP 383 [3]). This iteration aims at polishing someof the rough edges of the API, and adds some of the functionalitiesthat developers have been asking for during this first round ofincubation. The revised API tightens the thread-confinementconstraints (by removing the MemorySegment::acquire method) andinstead provides more targeted support for parallel computation viaa segment spliterator. The API also adds a way to create a customnative segment; this is, essentially, an unsafe API point, verysimilar in spirit to the JNI NewDirectByteBuffer functionality [1].By using this bit of API, power-users will be able to add support,via MemorySegment, to *their own memory sources* (e.g. think of acustom allocator written in C/C++). For now, this API point iscalled off as "restricted" and a special read-only JDK propertywill have to be set on the command line for calls to this method tosucceed. We are aware there's no precedent for something like thisin the Java SE API - but if Project Panama is to remain true aboutits ultimate goal of replacing bits of JNI code with (low level)Java code, stuff like this has to be *possible*. We anticipatethat, at some point, this property will become a true launcherflag, and that the foreign restricted machinery will be integratedmore neatly into the module system.
A list of the API, implementation and test changes is providedbelow. If you have any questions, or need more detailedexplanations, I (and the rest of the Panama team) will be happy topoint at existing discussions, and/or to provide the feedbackrequired.
Thanks
Maurizio

Webrev:

http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev

Javadoc:

http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc

Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary.html
CSR:

https://bugs.openjdk.java.net/browse/JDK-8243496



API changes
===========

* MemorySegment
- drop support for acquire() method - in its place now you canobtain a spliterator from a segment, which supports divide-and-conquer - revamped support for views - e.g. isReadOnly - now segmentshave access modes - added API to do serial confinement hand-off(MemorySegment::withOwnerThread) - added unsafe factory to construct a native segment out of anexisting address; this API is "restricted" and only available ifthe program is executed using the -Dforeign.unsafe=permit flag.
  - the MemorySegment::mapFromPath now returns a MappedMemorySegment
* MappedMemorySegment
- small sub-interface which provides extra capabilities formapped segments (load(), unload() and force())
* MemoryAddress
- added distinction between *checked* and *unchecked* addresses;*unchecked* addresses do not have a segment, so they cannot bedereferenced
  - added NULL memory address (it's an unchecked address)
- added factory to construct MemoryAddress from long value(result is also an unchecked address) - added API point to get raw address value (where possible - e.g.if this is not an address pointing to a heap segment)
* MemoryLayout
- Added support for layout "attributes" - e.g. store metadatainside MemoryLayouts
  - Added MemoryLayout::isPadding predicate
- Added helper function to SequenceLayout to rehape/flattensequence layouts (a la NDArray [4])
* MemoryHandles
- add support for general VarHandle combinators (similar to MHcombinators) - add a combinator to turn a long-VH into a MemoryAddress VH (theresulting MemoryAddress is also *unchecked* and cannot bedereferenced)
Implementation changes
======================

* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as aset of method handles (one for each access mode supported by thevar handle) that are lazily linked. This gives us a relativelysimple idea upon which to build support for custom var handleadapters: we could create a VarHandle by passing an existing varhandle and also specify the set of adaptations that should beapplied to the method handle for a given access mode in theoriginal var handle. The result is a new VarHandle which mightsupport a different carrier type and more, or less coordinatetypes. Adding this support was relatively easy - and it onlyrequired one low-level surgery of the lambda forms generated foradapted var handle (this is required so that the "right" var handlereceiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are reallydefined inside VarHandles) are really just a bunch of MH adaptersthat are stitched together into a brand new VH. The only caveat isthat, we could have a checked exception mismatch: the VarHandle APImethods are specified not to throw any checked exception, whereasmethod handles can throw any throwable. This means that,potentially, calling get() on an adapted VarHandle could result ina checked exception being thrown; to solve this gnarly issue, wedecided to scan all the filter functions passed to the VHcombinators and look for direct method handles which throw checkedexceptions. If such MHs are found (these can be deeply nested,since the MHs can be adapted on their own), adaptation of thetarget VH fails fast.
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First,we have added support for retrieval of "mapped" propertiesassociated with a ByteBuffer (e.g. the file descriptor, etc.). Thisis crucial if we want to be able to turn an existing byte bufferinto the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffersgiven existing parameters - which is needed when going from(mapped) segment to a buffer. These two pieces together allow us togo from segment to buffer and back w/o losing any information aboutthe underlying memory mapping (which was an issue in the previousimplementation).
Lastly, to support the new MappedMemorySegment abstraction, all thememory mapped supporting functionalities have been moved into acommon helper class so that MappedMemorySegmentImpl can reuse that(e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. Inthis round we aimed at splitting the various implementation classesso that we have a class for heap segments (HeapMemorySegmentImpl),one for native segments (NativeMemorySegmentImpl) and one formemory mapped segments (MappedMemorySegmentImpl, which extends fromNativeMemorySegmentImpl). Not much to see here - although oneimportant point is that, by doing this, we have been able to speedup performances quite a bit, since now e.g. native/mapped segmentsare _guaranteed_ to have a null "base". We have also done fewtricks to make sure that the "base" accessor for heap segment issharply typed and also NPE checked, which allows C2 to speculatemore and hoist. With these changes _all_ segment types havecomparable performances and hoisting guarantees (unlike in the oldimplementation).
* Add workarounds in MemoryAddressProxy, AbstractMemorySegmentImplto special case "small segments" so that VM can apply bound checkelimination
This is another important piece which allows to get very goodperformances out of indexes memory access var handles; as you mightknow, the JIT compiler has troubles in optimizing loops where theloop variable is a long [2]. To make up for that, in this round weadd an optimization which allows the API to detect whether asegment is *small* or *large*. For small segments, the API realizesthat there's no need to perform long computation (e.g. to performbound checks, or offset additions), so it falls back to integerlogic, which in turns allows bound check elimination.
* renaming of the various var handle classes to conform to "memoryaccess var handle" terminology
This is mostly stylistic, nothing to see here.

Tests changes
=============
In addition to the tests for the new API changes, we've also addedsome stress tests for var handle combinators - e.g. there's a flagthat can be enabled which turns on some "dummy" var handleadaptations on all var handles created by the runtime. We've usedthis flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wiredthe new segment spliterator with the existing spliterator testharness.
We have also added several micro benchmarks for the memory segmentAPI (and made some changes to the build script so that nativelibraries would be handled correctly).
[1] -https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newdirectbytebuffer
[2] - https://bugs.openjdk.java.net/browse/JDK-8223051
[3] - https://openjdk.java.net/jeps/383
[4] -https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#numpy.reshape

Re: RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)

Reply via email to