Hi,

The problem with the current implementation of MemoryScope is that if a child scope is frequently acquired and closed (which atomically increments and then decrements the parent scope's counter using CAS) from multiple concurrent threads, contention might become prohibitive. And I think that is precisely what happens when a parallel pipeline might short-circuit the stream:

    final boolean forEachWithCancel(Spliterator<P_OUT> spliterator, Sink<P_OUT> sink) {
        boolean cancelled;
        do { } while (!(cancelled = sink.cancellationRequested()) && spliterator.tryAdvance(sink));
        return cancelled;
    }

First, spliterators are created by trySplit() (all of them inherit the same MemoryScope), and then FJPool threads concurrently execute the above method, which calls tryAdvance for each element of its spliterator; tryAdvance does the following:

        public boolean tryAdvance(Consumer<? super MemorySegment> action) {
            Objects.requireNonNull(action);
            if (currentIndex < elemCount) {
                AbstractMemorySegmentImpl acquired = segment.acquire();
                try {
                    action.accept(acquired.asSliceNoCheck(currentIndex * elementSize, elementSize));
                } finally {
                    acquired.closeNoCheck();
                    currentIndex++;
                    if (currentIndex == elemCount) {
                        segment = null;
                    }
                }
                return true;
            } else {
                return false;
            }
        }

... acquire/close at each call. If the Stream is run to completion (i.e. it can't short-circuit), then forEachRemaining is used, which performs just one acquire/close for the whole remaining spliterator. So for short-circuiting streams it might be important to have a MemoryScope that is scalable. Here's one such attempt, using a pair of scalable counters (just one pair per root memory scope):


import java.util.concurrent.atomic.LongAdder;

/**
 * @author Peter Levart
 */
public abstract class MemoryScope {

    public static MemoryScope create(Object ref, Runnable cleanupAction) {
        return new Root(ref, cleanupAction);
    }

    MemoryScope() {}

    public abstract MemoryScope acquire();

    public abstract void close();

    private static class Root extends MemoryScope {
        private final LongAdder enters = new LongAdder();
        private final LongAdder exits = new LongAdder();
        private volatile boolean closed;

        private final Object ref;
        private final Runnable cleanupAction;

        Root(Object ref, Runnable cleanupAction) {
            this.ref = ref;
            this.cleanupAction = cleanupAction;
        }

        @Override
        public MemoryScope acquire() {
            // increment enters 1st
            enters.increment();
            // check closed flag 2nd
            if (closed) {
                exits.increment();
                throw new IllegalStateException("This scope is already closed");
            }

            return new MemoryScope() {
                @Override
                public MemoryScope acquire() {
                    return Root.this.acquire();
                }

                @Override
                public void close() {
                    exits.increment();
                }
            };
        }

        private final Object lock = new Object();

        @Override
        public void close() {
            synchronized (lock) {
                // modify closed flag 1st
                closed = true;
                // check for no more active acquired children 2nd
                // IMPORTANT: 1st sum exits, then sum enters !!!
                if (exits.sum() != enters.sum()) {
                    throw new IllegalStateException("Cannot close this scope as it has active acquired children");
                }
            }
            if (cleanupAction != null) {
                cleanupAction.run();
            }
        }
    }
}


This MemoryScope is just 2-level. The root is the one that is created when the memory segment is allocated. A child is always a child of the root and has no children of its own, so a call to child.acquire() gets forwarded to the root. Root.acquire() first increments the 'enters' scalable counter and then checks the 'closed' flag. child.close() just increments the 'exits' scalable counter. Root.close() first sets the 'closed' flag and then checks that the sum of 'exits' equals the sum of 'enters' - the important thing here is that 'exits' is summed first and then 'enters'. These orderings guarantee that either a child scope is successfully acquired or the root scope is successfully closed, but never both.
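
To make the intended usage concrete, here's a minimal driver sketch (the worker plumbing is hypothetical; only the MemoryScope calls come from the code above): the owner creates the root scope when the memory is allocated, worker threads bracket each access with acquire()/close() of a child scope, and the owner finally closes the root:

import java.util.concurrent.CompletableFuture;

public class MemoryScopeUsage {
    public static void main(String[] args) {
        // root scope created when the (imaginary) memory is allocated
        MemoryScope root = MemoryScope.create(null, () -> System.out.println("memory freed"));

        // many workers acquire/close child scopes concurrently; this path only
        // increments LongAdder cells, so there is no CAS retry contention
        CompletableFuture<?>[] workers = new CompletableFuture<?>[8];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = CompletableFuture.runAsync(() -> {
                for (int j = 0; j < 1_000_000; j++) {
                    MemoryScope child = root.acquire();
                    try {
                        // ... access the memory guarded by the root scope ...
                    } finally {
                        child.close();
                    }
                }
            });
        }
        CompletableFuture.allOf(workers).join();

        // the owner closes the root: this either succeeds (and runs the cleanup
        // action) or throws because some child is still active - never both
        root.close();
    }
}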

WDYT?

Regards, Peter

On 4/28/20 6:12 PM, Peter Levart wrote:
Hi Maurizio,

I'm checking out thread confinement in the parallel stream case. I see that Spliterator.trySplit() is calling AbstractMemorySegmentImpl's:

    private AbstractMemorySegmentImpl asSliceNoCheck(long offset, long newSize) {
        return dup(offset, newSize, mask, owner, scope);
    }

...so here the "owner" of the slice is still the same as that of the parent segment...

But then later, in tryAdvance or forEachRemaining, the segment is acquired/closed for each element of the stream (in the case of tryAdvance) or once for the whole chunk to the end of the spliterator (in the case of forEachRemaining). So some pipelines will be more efficient than others...

So I'm thinking: would it be possible to "lazily" acquire the scope just once in tryAdvance and then re-use it until the end? Unfortunately, Spliterator does not have a close() method to be called when the pipeline is done with it. Perhaps it could be added to the API? This is not the first time I have wished Spliterator had a close method. I had a similar problem when trying to create a Spliterator with a database backend: when using the JDBC API, a separate transaction (Connection) is typically required for each thread of execution, since several frameworks bind it to a ThreadLocal.
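
For illustration, here's a rough sketch of what I mean by lazy acquisition - purely hypothetical, since it assumes some close()/end-of-traversal hook that Spliterator currently doesn't have (field names are borrowed from the tryAdvance snippet quoted above):

        // hypothetical: acquire once, lazily, and keep the acquired segment
        // until the traversal ends or the (non-existent) close hook is called
        public boolean tryAdvance(Consumer<? super MemorySegment> action) {
            Objects.requireNonNull(action);
            if (currentIndex < elemCount) {
                if (acquired == null) {
                    acquired = segment.acquire();
                }
                action.accept(acquired.asSliceNoCheck(currentIndex * elementSize, elementSize));
                currentIndex++;
                if (currentIndex == elemCount) {
                    close();    // traversal finished naturally
                }
                return true;
            } else {
                return false;
            }
        }

        // would have to be called by the pipeline when it abandons the spliterator early
        public void close() {
            if (acquired != null) {
                acquired.closeNoCheck();
                acquired = null;
                segment = null;
            }
        }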

WDYT?

Regards, Peter


On 4/23/20 10:33 PM, Maurizio Cimadamore wrote:
Hi,
The time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionality that developers have been asking for during the first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator.

The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power-users will be able to add support, via MemorySegment, for *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is cordoned off as "restricted", and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true to its ultimate goal of replacing bits of JNI code with (low-level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.

A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.

Thanks
Maurizio

Webrev:

http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev

Javadoc:

http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc

Specdiff:

http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary.html

CSR:

https://bugs.openjdk.java.net/browse/JDK-8243496



API changes
===========

* MemorySegment
  - drop support for acquire() method - in its place you can now obtain a spliterator from a segment, which supports divide-and-conquer
  - revamped support for views - e.g. isReadOnly - now segments have access modes
  - added API to do serial confinement hand-off (MemorySegment::withOwnerThread) - see the sketch after this list
  - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag.
  - the MemorySegment::mapFromPath now returns a MappedMemorySegment
* MappedMemorySegment
  - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force())
* MemoryAddress
  - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced
  - added NULL memory address (it's an unchecked address)
  - added factory to construct MemoryAddress from long value (result is also an unchecked address)
  - added API point to get raw address value (where possible - e.g. if this is not an address pointing to a heap segment)
* MemoryLayout
  - Added support for layout "attributes" - e.g. store metadata inside MemoryLayouts
  - Added MemoryLayout::isPadding predicate
  - Added helper function to SequenceLayout to reshape/flatten sequence layouts (a la NDArray [4])
* MemoryHandles
  - add support for general VarHandle combinators (similar to MH combinators)
  - add a combinator to turn a long-VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
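
As a usage illustration of the serial confinement hand-off mentioned in the MemorySegment list above (a sketch only - the exact factory/method shapes and the try-with-resources idiom are my assumptions, not taken from the webrev):

import jdk.incubator.foreign.MemorySegment;
import java.util.concurrent.atomic.AtomicReference;

public class HandOffSketch {
    public static void main(String[] args) throws InterruptedException {
        MemorySegment segment = MemorySegment.allocateNative(100);
        // ... the current (owner) thread fills the segment ...

        AtomicReference<MemorySegment> box = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            try (MemorySegment owned = box.get()) {
                // 'worker' is now the owner: it may access and finally close the segment
            }
        });
        // hand confinement over to 'worker' before it starts using the segment
        box.set(segment.withOwnerThread(worker));
        worker.start();
        worker.join();
    }
}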

Implementation changes
======================

* add support for VarHandle combinators (e.g. IndirectVH)

The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we could create a VarHandle by passing an existing var handle and also specify the set of adaptations that should be applied to the method handle for a given access mode in the original var handle. The result is a new VarHandle which might support a different carrier type and more, or fewer, coordinate types. Adding this support was relatively easy - it only required one piece of low-level surgery on the lambda forms generated for adapted var handles (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
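
The "set of lazily linked method handles" view can be made concrete with plain java.lang.invoke API - a standalone illustration, not the memory access implementation itself:

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class VarHandleAsMethodHandles {
    static int field;

    public static void main(String[] args) throws Throwable {
        VarHandle vh = MethodHandles.lookup()
                .findStaticVarHandle(VarHandleAsMethodHandles.class, "field", int.class);

        // each access mode of the var handle can be viewed as a method handle
        MethodHandle setter = vh.toMethodHandle(VarHandle.AccessMode.SET);
        MethodHandle getter = vh.toMethodHandle(VarHandle.AccessMode.GET);

        setter.invoke(42);
        System.out.println((int) getter.invoke());   // prints 42

        // adapting these per-access-mode method handles (filters, permutations, ...)
        // and stitching them back into a new VarHandle is, in a nutshell, what the
        // combinator support described above does
    }
}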

All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are just a bunch of MH adapters that are stitched together into a brand new VH. The only caveat is that we could have a checked exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown; to solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (they can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
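
For reference, a rough sketch of how such a scan could look - an assumed helper, not the actual implementation (cracking a method handle requires a suitably privileged lookup, and only direct method handles can be cracked at all):

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandleInfo;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Method;

final class CheckedExceptionScanner {
    // returns true if 'filter' is a direct method handle whose underlying
    // method declares a checked exception
    static boolean throwsChecked(MethodHandles.Lookup lookup, MethodHandle filter) {
        try {
            MethodHandleInfo info = lookup.revealDirect(filter);
            Method m = info.reflectAs(Method.class, lookup);
            for (Class<?> ex : m.getExceptionTypes()) {
                if (!RuntimeException.class.isAssignableFrom(ex)
                        && !Error.class.isAssignableFrom(ex)) {
                    return true;
                }
            }
        } catch (IllegalArgumentException | ClassCastException e) {
            // not a direct method handle, or not backed by a Method - nothing to flag here
        }
        return false;
    }
}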


* More ByteBuffer implementation changes

Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.

Conversely, we also have to allow creation of mapped byte buffers given existing parameters - which is needed when going from (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back w/o losing any information about the underlying memory mapping (which was an issue in the previous implementation).

Lastly, to support the new MappedMemorySegment abstraction, all the memory mapped supporting functionalities have been moved into a common helper class so that MappedMemorySegmentImpl can reuse that (e.g. for MappedMemorySegment::force).

* Rewritten memory segment hierarchy

The old implementation had a monomorphic memory segment class. In this round we aimed at splitting the various implementation classes so that we have a class for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory mapped segments (MappedMemorySegmentImpl, which extends NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed up performance quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also done a few tricks to make sure that the "base" accessor for heap segments is sharply typed and also NPE checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performance and hoisting guarantees (unlike in the old implementation).
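
A hedged sketch of the shape of that hierarchy (illustrative only, not the actual webrev code - class names shortened):

import java.util.Objects;

abstract class SegmentSketch {
    // used by the memory access machinery; each subclass pins down the
    // type/nullness of "base", which is what lets C2 speculate and hoist
    abstract Object base();

    static final class HeapSegmentSketch extends SegmentSketch {
        private final byte[] base;                  // never null, so the null check can fold away
        HeapSegmentSketch(byte[] base) { this.base = Objects.requireNonNull(base); }
        @Override byte[] base() { return base; }    // sharply (covariantly) typed accessor
    }

    static class NativeSegmentSketch extends SegmentSketch {
        @Override Object base() { return null; }    // guaranteed null base for off-heap memory
    }

    static final class MappedSegmentSketch extends NativeSegmentSketch {
        // mapped segments are just native segments with extra capabilities (force, load, ...)
    }
}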

* Add workarounds in MemoryAddressProxy, AbstractMemorySegmentImpl to special case "small segments" so that VM can apply bound check elimination

This is another important piece, which allows us to get very good performance out of indexed memory access var handles; as you might know, the JIT compiler has trouble optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the implementation to detect whether a segment is *small* or *large*. For small segments, the implementation realizes that there's no need to perform long computation (e.g. for bound checks or offset additions), so it falls back to integer logic, which in turn allows bound check elimination.
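
To illustrate the trick with an assumed helper (not the actual JDK code): when a segment is known to fit in an int, offsets and sizes can be kept as ints end-to-end, so the bound check is pure int arithmetic that the JIT is much better at eliminating:

import java.util.Objects;

final class BoundsSketch {

    // fast path for *small* segments: everything is an int, BCE-friendly
    static void checkSmall(int offset, int length, int segmentSize) {
        Objects.checkFromIndexSize(offset, length, segmentSize);
    }

    // general path for *large* segments: long arithmetic, which C2 struggles
    // to optimize when it drives a loop [2]
    static void checkLarge(long offset, long length, long segmentSize) {
        if (offset < 0 || length < 0 || offset > segmentSize - length) {
            throw new IndexOutOfBoundsException(
                    "Out of bounds access: offset=" + offset + ", length=" + length);
        }
    }
}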

* renaming of the various var handle classes to conform to "memory access var handle" terminology

This is mostly stylistic, nothing to see here.

Tests changes
=============

In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.

To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.

We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).


[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newdirectbytebuffer
[2] - https://bugs.openjdk.java.net/browse/JDK-8223051
[3] - https://openjdk.java.net/jeps/383
[4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#numpy.reshape



