Hi Moh,

> However, I was hoping this would have the effect of improving
> (non-finalizable) reference handling. We've seen serious issues in
> WeakReference handling and have had to write some twisted code to deal
> with this.
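The "twisted code" in such situations is typically a reference-queue
draining loop. A minimal sketch of a weak-valued cache, using hypothetical
names and certainly not Moh's actual code:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A weak-valued cache that must drain its ReferenceQueue to avoid
// accumulating stale entries.
class WeakValueCache<K, V> {

    // A weak reference that remembers its key so the map entry can be
    // removed once the referent has been cleared by the GC.
    private static final class Entry<K, V> extends WeakReference<V> {
        final K key;
        Entry(K key, V value, ReferenceQueue<V> q) {
            super(value, q);
            this.key = key;
        }
    }

    private final Map<K, Entry<K, V>> map = new ConcurrentHashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    public V get(K key) {
        expungeStaleEntries();
        Entry<K, V> e = map.get(key);
        return e == null ? null : e.get();   // may still be null if just cleared
    }

    public void put(K key, V value) {
        expungeStaleEntries();
        map.put(key, new Entry<>(key, value, queue));
    }

    // Drain cleared references; must be called from every access path
    // (or a dedicated thread), which is where the bookkeeping gets twisted.
    @SuppressWarnings("unchecked")
    private void expungeStaleEntries() {
        Entry<K, V> e;
        while ((e = (Entry<K, V>) queue.poll()) != null) {
            map.remove(e.key, e);   // remove only if still the same entry
        }
    }
}

The subtlety is that a missed expungeStaleEntries() call leaks Entry
objects even after their referents are long gone.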
Better reference life-cycle handling would actually be beneficial IMHO, as
many cache implementations suffer because of certain aspects of the current
implementation. SoftReference is very difficult to use.

> So I guess the question I have to Kirk and David is: do you feel a GC
> load of 10K WeakReferences per cycle is also "doing something else
> wrong"?

Hard to say, as this really has to be evaluated on a case-by-case basis.
But I'd wonder if the WeakReference was actually needed if you are
recycling them so quickly.

Regards,
Kirk

> Sorry if this is going off-topic.
>
> Thanks
> Moh
>
>> -----Original Message-----
>> From: core-libs-dev [mailto:core-libs-dev-boun...@openjdk.java.net] On
>> Behalf Of Kirk Pepperdine
>> Sent: Thursday, May 28, 2015 11:58 PM
>> To: David Holmes <david.hol...@oracle.com>
>> Cc: hotspot-gc-...@openjdk.java.net; core-libs-d...@openjdk.java.net
>> Subject: Re: JEP 132: More-prompt finalization
>>
>> Hi Peter,
>>
>> It is a very interesting proposal, but to further David's comments, the
>> life-cycle costs of reference objects are horrendous, of which the
>> actual process of finalizing an object is only a fraction of the total
>> cost. Unfortunately your micro-benchmark focuses on only one aspect of
>> that cost. In other words, it isn't very representative of a real
>> concern. In the real world the finalizer *must* compete with mutator
>> threads, and since F-J is an "all threads on deck" implementation, it
>> doesn't play well with others. It creates a "tragedy of the commons": a
>> situation where everyone behaves rationally with a common resource, but
>> to the detriment of the whole group. In short, parallelizing (F-J-ing)
>> *everything* in an application is simply not a good idea. We do not
>> live in an infinite compute environment, which means we have to
>> consider the impact of our actions on the entire group.
>>
>> This was one of the points of my recent article in Java Magazine, which
>> I wrote to try to counter some of the rhetoric I was hearing at
>> conferences about the universal benefits of being able to easily
>> parallelize streams in Java 8. Yes, I agree it's a great feature, but
>> it must be used with discretion. Case in point: after I finished
>> writing the article, I started running into a couple of early adopters
>> that had swallowed the parallel message whole, indiscriminately
>> parallelizing all of their streams. As you can imagine, they were quite
>> surprised by the results and quickly worked to de-parallelize *all* of
>> the streams in the application.
>>
>> Adding some ability to parallelize the handling of reference objects
>> seems like a good idea if you are collecting large numbers of reference
>> objects (>10,000 per GC cycle). However, if you are collecting large
>> numbers of reference objects, you're most likely doing something else
>> wrong. IME, finalization is extremely useful, but really only for a
>> limited number of use cases, and none of them (to date) have resulted
>> in the app burning through 1000s of finalizable objects per second.
>>
>> It would be interesting to know why you picked on this particular
>> issue.
>>
>> Kind regards,
>> Kirk
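Kirk's caution is easy to reproduce: for small, cheap-per-element streams
the fork/join overhead dominates, and every parallel stream shares the one
common pool with the rest of the application. A minimal sketch, not a
rigorous benchmark (use JMH for real measurements):

import java.util.stream.IntStream;

// Sketch of indiscriminate parallelization: with tiny, cheap work per
// element, splitting and joining costs more than the work itself, and
// the parallel variant also steals CPU from other common-pool users.
public class ParallelCost {
    public static void main(String[] args) {
        int[] data = IntStream.range(0, 1_000).toArray();

        long t0 = System.nanoTime();
        long s1 = IntStream.of(data).map(x -> x + 1).sum();            // sequential
        long t1 = System.nanoTime();
        long s2 = IntStream.of(data).parallel().map(x -> x + 1).sum(); // parallel
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d ns, parallel: %d ns (sums %d/%d)%n",
                t1 - t0, t2 - t1, s1, s2);
    }
}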
>> On May 29, 2015, at 5:18 AM, David Holmes <david.hol...@oracle.com> wrote:
>>
>>> Hi Peter,
>>>
>>> I guess I'm very concerned about the premise that finalization should
>>> scale to millions of objects and be performed highly concurrently. To
>>> me that's sending the wrong message about finalization. It also isn't
>>> the most effective use of CPU resources - most people would want to do
>>> useful work on most CPUs most of the time.
>>>
>>> Cheers,
>>> David
>>>
>>> On 29/05/2015 3:12 AM, Peter Levart wrote:
>>>> Hi,
>>>>
>>>> Did you know that the following simple loop:
>>>>
>>>> public class FinalizableBottleneck {
>>>>     static boolean no;
>>>>
>>>>     @Override
>>>>     protected void finalize() throws Throwable {
>>>>         // an empty finalize() method does not make the object finalizable
>>>>         // (it is not even registered on the finalizer's list)
>>>>         if (no) {
>>>>             throw new AssertionError();
>>>>         }
>>>>     }
>>>>
>>>>     public static void main(String[] args) {
>>>>         while (true) {
>>>>             new FinalizableBottleneck();
>>>>         }
>>>>     }
>>>> }
>>>>
>>>> ...quickly fills the entire heap with FinalizableBottleneck and
>>>> internal Finalizer objects and brings the JVM to a halt? After a few
>>>> seconds of running the above program, jmap -histo:live reports:
>>>>
>>>>  num     #instances         #bytes  class name
>>>> ----------------------------------------------
>>>>    1:      50048325     2001933000  java.lang.ref.Finalizer
>>>>    2:      50048278      800772448  FinalizableBottleneck
>>>>
>>>> There are a couple of bottlenecks that make this happen:
>>>>
>>>> - The ReferenceHandler thread synchronizes with the VM to unhook
>>>>   Reference(s) from the pending chain one by one and dispatches them
>>>>   to their respective ReferenceQueue(s), which also use
>>>>   synchronization for enqueueing each Reference.
>>>> - Enqueueing synchronizes with the finalization thread, which removes
>>>>   the Finalizer(s) (FinalReferences) from the finalization queue and
>>>>   executes them.
>>>> - Executing the Finalizer(s) removes them from the doubly-linked list
>>>>   of all Finalizer(s), which is used to retain them until they are
>>>>   needed, and this synchronizes with the threads that link new
>>>>   Finalizer(s) into the doubly-linked list as new finalizable objects
>>>>   get registered.
>>>>
>>>> We see that the creation of a finalizable object takes only one
>>>> synchronization (registering into the doubly-linked list) and is
>>>> performed synchronously, while finalization takes 4 synchronizations
>>>> among 4 different threads (in pairs) and happens as the Finalizer
>>>> instance "travels" from the VM thread to the ReferenceHandler thread
>>>> and then to the finalization thread. No wonder finalization can not
>>>> keep up with allocation in a single thread. The situation is even
>>>> worse when finalize() methods do some actual work.
>>>>
>>>> I have experimented with various approaches to widen these
>>>> bottlenecks and found that I can not beat the ForkJoinPool when it is
>>>> combined with some improvements to the internal data structures used
>>>> in reference processing. Here's a prototype I came up with:
>>>>
>>>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/webrev.01/
>>>>
>>>> And this is the benchmark I use for measuring the throughput:
>>>>
>>>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/FinalizerThroughput.java
>>>>
>>>> The benchmark shows (results inline in the source) that using an
>>>> unpatched JDK, on my PC (i7-2700K, Linux, JDK 8) I can not construct
>>>> more than 1500 finalizable objects per ms in a single thread, and
>>>> that while doing so, finalization only manages to process approx.
>>>> 100-120 objects in the same time. Objects "in-flight" quickly
>>>> accumulate and bring the VM to a halt, where it is not doing anything
>>>> but full GC cycles.
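A cut-down illustration of the effect Peter describes, far simpler than
the linked FinalizerThroughput.java and entirely hypothetical: it reports
creation rate, finalization rate and the growing "in-flight" backlog once
per second.

import java.util.concurrent.atomic.AtomicLong;

// Counts how many finalizable objects are created vs. finalized per
// second; on an unpatched JDK the in-flight backlog grows without bound.
public class MiniFinalizerThroughput {
    static final AtomicLong created = new AtomicLong();
    static final AtomicLong finalized = new AtomicLong();

    static class Payload {
        Payload() { created.incrementAndGet(); }
        @Override protected void finalize() { finalized.incrementAndGet(); }
    }

    public static void main(String[] args) {
        Thread reporter = new Thread(() -> {
            long lastC = 0, lastF = 0;
            while (true) {
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
                long c = created.get(), f = finalized.get();
                System.out.printf("created %d/s, finalized %d/s, in-flight ~%d%n",
                        c - lastC, f - lastF, c - f);
                lastC = c; lastF = f;
            }
        });
        reporter.setDaemon(true);
        reporter.start();

        while (true) {           // allocate as fast as one thread can
            new Payload();
        }
    }
}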
>>>> When constructing in 4 threads, there's not much difference.
>>>> Construction of finalizable objects simply doesn't scale.
>>>>
>>>> The patched JDK shows something completely different. Single-thread
>>>> construction achieves a rate of 3600 objects/ms. The number of
>>>> "in-flight" objects is kept constant at about 5-6M instances, which
>>>> amounts to approx. 1.5 s of allocation. I think this is about the
>>>> rate of GC cycles, during which the VM also processes the references.
>>>> The benchmark also shows the ForkJoinPool statistics, which show that
>>>> the number of queued tasks is also kept low.
>>>>
>>>> Increasing the number of allocating threads to 4 increases the
>>>> allocation rate to about 4300 objects/ms, and finalization keeps up.
>>>> Increasing it to 8 further increases the allocation rate to about
>>>> 4600 objects/ms, and finalization still keeps up. The increase in
>>>> rate is not linear, but keep in mind that the i7 is a 4-core CPU.
>>>>
>>>> About the implementation...
>>>>
>>>> The 1st improvement I made was to the doubly-linked list of Finalizer
>>>> instances that is used to keep them alive until they are needed. I
>>>> ripped off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin
>>>> Buchholz and just kept the internal link/unlink methods while
>>>> specializing them to Finalizer entries (very straightforward). I
>>>> experimented with throughput and got some improvement, but throughput
>>>> increased much more when I used several instances of independent
>>>> lists and distributed registrations among them randomly (unlinking is
>>>> consequently also distributed randomly).
>>>>
>>>> I found that no matter how hard I tried to optimize ReferenceQueue
>>>> while keeping the API unchanged, I could only do so much, and that
>>>> was not enough. I have been surprised by how well ForkJoinPool
>>>> distributes tasks among threads, so I concluded that leveraging it is
>>>> the best choice. I re-designed the pending-list unhooking loop to
>>>> unhook pending references in chunks, which greatly improves the
>>>> throughput. Since unhooking can only be performed by a single thread
>>>> while holding a lock, which is mandated by the interface between the
>>>> VM and Java, I didn't employ multiple threads, but a single eternal
>>>> ForkJoinTask that unhooks in chunks and forks off other tasks that
>>>> process the chunks. When there are just a couple of References
>>>> pending at one time and a not-full chunk is unhooked, the processing
>>>> is performed by the same thread that unhooked the references, but
>>>> when there are more, worker tasks are forked off and the unhooking
>>>> thread continues undisturbed. This processing includes executing
>>>> Cleaners, forking the finalizer tasks and enqueueing other
>>>> references. Finalizer(s) are always executed as separate
>>>> ForkJoinTask(s).
>>>>
>>>> It's interesting how Runtime.runFinalizers() is implemented in this
>>>> patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
>>>>
>>>> I also tweaked the ReferenceQueue implementation a bit (it is still
>>>> used for other kinds of references) so that it avoids synchronization
>>>> with a monitor lock when there are no blocking waiters and uses CAS
>>>> to enqueue/dequeue. This improves throughput when the queue is not
>>>> empty. Since in the prototype multiple threads can enqueue into the
>>>> same queue, I thought this would improve throughput in such
>>>> situations.
>>>>
>>>> Comments, suggestions, criticism are welcome.
>>>>
>>>> Regards, Peter
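The lock-free enqueue/dequeue idea in that last tweak can be sketched as a
simple CAS-based Treiber stack. This is a generic shape, not the code from
the webrev; the real ReferenceQueue would still need to fall back to the
monitor to support threads blocked in remove():

import java.util.concurrent.atomic.AtomicReference;

// A lock-free LIFO queue: enqueue and poll both use a single CAS on the
// head, so no monitor is involved as long as nobody is blocked waiting.
class CasQueue<T> {
    private static final class Node<T> {
        final T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    public void enqueue(T item) {
        Node<T> n = new Node<>(item);
        Node<T> h;
        do {
            h = head.get();
            n.next = h;                        // link before publishing
        } while (!head.compareAndSet(h, n));   // retry on contention
    }

    public T poll() {
        Node<T> h;
        do {
            h = head.get();
            if (h == null) return null;        // empty; no blocking here
        } while (!head.compareAndSet(h, h.next));
        return h.item;
    }
}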