More test data...

Running the previously mentioned measurement (allocating direct buffers randomly sized between 256KB and 1MB for 60 seconds) with a single allocating thread, which presents allocation pressure that is still acceptable for the original code (two threads are already too much and lead to OOME), the results, using a 4-core i7 CPU and -XX:MaxDirectMemorySize=128m, are:

- 482403 allocations satisfied without helping ReferenceHandler
- 75 allocations satisfied while helping ReferenceHandler but before System.gc()
- 2373 allocations satisfied while helping ReferenceHandler, after System.gc() but before any sleeps
- no sleeps

avg. allocation: 0.12 ms/op (original code: 0.6 ms/op)

This test may be regarded as an edge case. No current real-world application exhibits substantially higher allocation pressure; if one did, it would throw OOME and be unusable. The above numbers show that the majority of allocations (99.5%) happen without helping the ReferenceHandler thread. Only every 200th allocation disturbs the ReferenceHandler thread, and it does exactly what the ReferenceHandler does asynchronously, and only because the ReferenceHandler thread is asleep. So it actually increases the promptness of Reference enqueueing - not something that could hurt. We can reasonably expect that in current real-world applications helping the ReferenceHandler would happen even less frequently and could not negatively impact application behaviour. What applications will see is up to a 5x improvement in allocation throughput (a result of using atomic operations for reservation and less sleeping).

For comparison, here are also the results for 2 allocating threads (higher allocation pressure than any current real-world application):

- 734916 allocations satisfied without helping ReferenceHandler
- 3112 allocations satisfied while helping ReferenceHandler but before System.gc()
- 3817 allocations satisfied while helping ReferenceHandler, after System.gc() but before any sleeps
- no sleeps

avg. allocation: 0.16 ms/op (per thread)

This is still 99.1% of allocations satisfied without disturbing the peaceful flow of the ReferenceHandler thread.


Regards, Peter

On 10/07/2013 12:56 AM, Peter Levart wrote:
Hi Again,

The result of my experimentation is as follows:

Letting the ReferenceHandler thread alone enqueue References and execute Cleaners is not enough to prevent OOMEs when allocation is performed in a large number of threads, even if I let Cleaners do only synchronous announcing of what will be freed (very fast), delegate the actual de-allocation to a background thread, and base reservation waiting on the announced free space (still waiting for the space to be deallocated and unreserved before satisfying the reservation request, but waiting as long as it takes if the announced free space is enough for the reservation request).

The ReferenceHandler thread, when it finds that it has no more pending References, parks and waits for notification from the VM. The VM promptly processes references (hooks them on the pending list), but with saturated CPUs, waking up the ReferenceHandler thread and re-gaining the lock takes too much time. During that time allocating threads can reserve the whole permitted space and an OOME must be thrown. So I'm back to strategy #1 - helping the ReferenceHandler thread.

It's not so much about helping to achieve better throughput (as I noted, deallocating cannot be effectively parallelized) but about overcoming the latency of waking up the ReferenceHandler thread. Here's my attempt at doing this:

http://cr.openjdk.java.net/~plevart/jdk8-tl/DyrectBufferAlloc/webrev.01/

This is much simplified from my first submission of a similar strategy. I tried to be as undisruptive to the current logic of Reference processing as possible, but of course you decide if this is still too risky for inclusion into JDK 8. Cleaner is unchanged - it processes its thunk synchronously and the ReferenceHandler thread invokes it directly. The ReferenceHandler logic is the same - I just factored out the content of the loop into a private method so that it can also be called from nio Bits, where the bulk of the change lies.
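
Roughly, the factored-out helper has the following shape (a sketch for illustration only, not the actual webrev code; it assumes the internal Reference fields pending, discovered and lock, and returns false when the pending list is empty):

    // Sketch only: one step of the ReferenceHandler loop, callable from
    // other threads (e.g. nio Bits) that want to help drain the pending list.
    private static boolean handlePendingReference() {
        Reference<Object> r;
        synchronized (lock) {
            if (pending == null) {
                return false;                  // nothing pending right now
            }
            r = pending;
            pending = r.discovered;
            r.discovered = null;
        }
        if (r instanceof Cleaner) {
            ((Cleaner) r).clean();             // run the thunk (e.g. the Deallocator)
        } else {
            ReferenceQueue<? super Object> q = r.queue;
            if (q != ReferenceQueue.NULL) {
                q.enqueue(r);                  // hand the Reference to its queue
            }
        }
        return true;
    }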

The (un)reservation logic is re-implemented with atomic operations - no locks. When a large number of threads compete for reservation, locking overhead can be huge and can slow down unreservation (which must use the same lock as reservation). The reservation retry logic first tries to satisfy the reservation request while helping the ReferenceHandler thread with enqueueing References and executing Cleaners, until the list of pending references is exhausted. If this does not succeed, it triggers the VM to process references (System.gc()) and then enters a similar retry loop, but introduces an exponentially increasing back-off delay every time the chain of pending references is exhausted, starting with a 1 ms sleep and doubling. This gives the VM time to process the references. The maximum number of sleeps is 9, giving a maximum accumulated sleep time of about 0.5 s. This means that a request that rightfully throws OOME will do so after about 0.5 s of sleeping.
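
To make the shape of this concrete, here is a rough sketch of the retry structure (again illustrative, not the actual webrev code; tryReserve is a hypothetical CAS-based helper on a java.util.concurrent.atomic.AtomicLong, MAX stands for the -XX:MaxDirectMemorySize limit, and handlePendingReference is the helper sketched above):

    static final AtomicLong RESERVED = new AtomicLong();   // currently reserved bytes
    static final long MAX = /* -XX:MaxDirectMemorySize */ 128L * 1024 * 1024;

    // CAS-based reservation against the limit - no locks
    static boolean tryReserve(long size) {
        for (;;) {
            long used = RESERVED.get();
            if (used + size > MAX) {
                return false;                               // over the limit
            }
            if (RESERVED.compareAndSet(used, used + size)) {
                return true;                                // reservation succeeded
            }
        }
    }

    static void reserveMemory(long size) {
        // 1) try to reserve, helping the ReferenceHandler until the pending
        //    list is exhausted
        do {
            if (tryReserve(size)) return;
        } while (handlePendingReference());

        // 2) trigger the VM to discover more unreachable direct buffers
        System.gc();

        // 3) retry, sleeping 1, 2, 4, ... 256 ms (at most 9 sleeps, ~0.5 s total)
        //    whenever the pending list is exhausted again
        long sleepTime = 1;
        for (int sleeps = 0; ; ) {
            if (tryReserve(size)) return;
            if (handlePendingReference()) continue;         // made progress, no sleep
            if (sleeps++ >= 9) break;
            try {
                Thread.sleep(sleepTime);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            sleepTime <<= 1;
        }
        throw new OutOfMemoryError("Direct buffer memory");
    }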

I did the following measurement: using LongAdders (to keep the counting from perturbing the measurement), I counted the various exit paths from Bits.reserveMemory() during a test that spawned 128 allocating threads on a 4-core i7 machine, allocating direct buffers randomly sized between 256KB and 1MB for 60 seconds, using -XX:MaxDirectMemorySize=512m:

A total of 909960 allocations were performed:

- 247993 were satisfied before attempting to help ReferenceHandler thread
- 660184 were satisfied while helping ReferenceHandler thread but before triggering System.gc()
- 1783 were satisfied after triggering System.gc() but before doing any sleep
- no sleeping has been performed

The same test, just changing to -XX:MaxDirectMemorySize=128m (that means 1MB per thread, with each thread allocating direct buffers randomly sized between 256KB and 1MB):

A total of 579943 allocations were performed:

- 131547 were satisfied before attempting to help ReferenceHandler thread
- 438345 were satisfied while helping ReferenceHandler thread but before triggering System.gc()
- 10016 were satisfied after triggering System.gc() but before doing any sleep
- 34 were satisfied after sleep(1)
- 1 was satisfied after sleep(1) followed by sleep(2)


That's it. I think this is good enough for testing on a larger scale. I have also included a modified DirectBufferAllocTest as a unit test, but I don't know if it's suitable since it takes 60 s to run. The run time could be lowered at the cost of a lower probability of catching OOMEs.
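
For reference, the shape of such a stress test is roughly as follows (a self-contained sketch of the approach, not the actual DirectBufferAllocTest; run with e.g. -XX:MaxDirectMemorySize=128m, an OOME in any allocating thread counts as a failure):

    import java.nio.ByteBuffer;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.atomic.LongAdder;

    public class DirectBufferAllocStress {
        static final int MIN = 256 * 1024, MAX = 1024 * 1024;
        static final LongAdder allocations = new LongAdder();

        public static void main(String[] args) throws InterruptedException {
            int threads = args.length > 0 ? Integer.parseInt(args[0]) : 128;
            long deadline = System.currentTimeMillis() + 60_000;   // run for 60 s
            Thread[] workers = new Thread[threads];
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(() -> {
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    while (System.currentTimeMillis() < deadline) {
                        // allocation either succeeds or throws OOME,
                        // which kills the thread and fails the test
                        ByteBuffer.allocateDirect(rnd.nextInt(MIN, MAX + 1));
                        allocations.increment();
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) t.join();
            System.out.println("Total of " + allocations.sum() + " allocations were performed");
        }
    }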

So what do you think? Is this still too risky for JDK8?


Regards, Peter

On 10/06/2013 01:19 PM, Peter Levart wrote:
Hi,

I agree the problem with de-allocation of native memory blocks should be studied deeply and this takes time.

What I have observed so far on Linux platform (other platforms may behave differently) is the following:

Deallocation of native memory with Unsafe.freeMemory(address) can take varying amounts of time. It can grow to a constant several milliseconds to free a 1MB block, for example, when there are already lots of blocks allocated and multiple threads are constantly allocating more. I'm not sure yet about the main reasons for that, but it could be contention with allocation from multiple threads, interaction with GC, or even the algorithm used in the native allocator. Deallocation is also not very parallelizable: my observation is that deallocating with 2 threads (on a 4-core CPU) does not help much.
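
This is easy to observe with a trivial single-threaded probe along these lines (a rough sketch, not the exact measurement I ran; it only shows the effect of having many live native blocks, without the concurrent allocation pressure):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class FreeMemoryTiming {
        public static void main(String[] args) throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long blockSize = 1024 * 1024;           // 1MB blocks
            int live = 512;                          // keep ~512MB of blocks allocated
            long[] addresses = new long[live];
            for (int i = 0; i < live; i++) {
                addresses[i] = unsafe.allocateMemory(blockSize);
            }
            // time freeing one more block while many others are still live
            long extra = unsafe.allocateMemory(blockSize);
            long t0 = System.nanoTime();
            unsafe.freeMemory(extra);
            System.out.println("freeMemory took " +
                (System.nanoTime() - t0) / 1_000_000.0 + " ms");
            for (long addr : addresses) unsafe.freeMemory(addr);
        }
    }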

The current scheme of deallocating in the ReferenceHandler thread means that a lot of "pending" Cleaner objects can accumulate: although the VM has promptly processed the Cleaner PhantomReferences (hooked them on the pending list), a lot of work still has to be done to actually free the native blocks. This clogs the ReferenceHandler thread and affects other Reference processing. It also presents difficulties for a back-off strategy when allocating native memory: the strategy has no information on which to base the decision whether to wait longer or to fail with OOME.

I'm currently experimenting with an approach where the Cleaner and ReferenceHandler code stays as is, but the Cleaner's thunk (the Deallocator in DirectByteBuffer) is modified so that it performs some actions synchronously (announcing what will be de-allocated) and delegates the actual deallocation and unreservation to a background thread. The reservation logic has more information to base its back-off strategy on that way. I'll let you know if I get some results from that.
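
Schematically, the modified thunk would look something like this (purely illustrative; the real code is the Deallocator in DirectByteBuffer and the names below are made up):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicLong;

    class AnnouncingDeallocator implements Runnable {
        // space announced as "about to be freed"; the reservation code can look
        // at this to decide between waiting a bit longer and throwing OOME
        static final AtomicLong announcedFree = new AtomicLong();
        static final ExecutorService deallocatorThread =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "DirectBuffer-Deallocator");
                t.setDaemon(true);
                return t;
            });

        private final long size;
        private final Runnable freeAndUnreserve;   // e.g. unsafe.freeMemory + Bits.unreserveMemory

        AnnouncingDeallocator(long size, Runnable freeAndUnreserve) {
            this.size = size;
            this.freeAndUnreserve = freeAndUnreserve;
        }

        @Override
        public void run() {                         // invoked by the Cleaner (ReferenceHandler thread)
            announcedFree.addAndGet(size);          // fast, synchronous announcement
            deallocatorThread.execute(() -> {       // slow work moved off the ReferenceHandler thread
                freeAndUnreserve.run();
                announcedFree.addAndGet(-size);     // freed for real, no longer just announced
            });
        }
    }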

Regards, Peter

On 10/04/2013 08:39 PM, mark.reinh...@oracle.com wrote:
2013/10/2 15:13 -0700, alan.bate...@oracle.com:
BTW: Is this important enough to attempt to do this late in 8? I just
wonder about a significant change like switching to weak references and
whether it would be more sensible to hold it back to do early in 9.
I share your concern.  This is extraordinarily sensitive code.
Now is not the time to rewrite it for JDK 8.

- Mark


