Hi Peter,

It is a very interesting proposal, but to further David’s comments: the
life-cycle cost of reference objects is horrendous, and the actual process
of finalizing an object is only a fraction of that total cost. Unfortunately,
your micro-benchmark focuses on only one aspect of that cost. In other words,
it isn’t very representative of a real concern. In the real world the finalizer
must compete with mutator threads, and since F-J is an “all threads on deck”
implementation, it doesn’t play well with others. It creates a “tragedy of the
commons”, that is, a situation where everyone behaves rationally with a common
resource but to the detriment of the whole group. In short, parallelizing
(F-Jing) *everything* in an application is simply not a good idea. We do not
live in an infinite compute environment, which means we have to consider the
impact of our actions on the entire group.

This was one of the points of my recent article in Java Magazine, which I wrote
to try to counter some of the rhetoric I was hearing at conferences about the
universal benefits of being able to easily parallelize streams in Java 8. Yes, I
agree it’s a great feature, but it must be used with discretion. Case in point:
after I finished writing the article, I started running into a couple of early
adopters who had swallowed the parallel message whole, indiscriminately
parallelizing all of their streams. As you can imagine, they were quite
surprised by the results and quickly worked to de-parallelize *all* of the
streams in the application.
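
To make the shared-resource point concrete, here is a minimal sketch (the class and method names are made up for illustration) that records which threads execute a parallel stream. By default the work lands on the calling thread plus the workers of the single JVM-wide ForkJoinPool.commonPool(), so every parallel stream in an application draws on the same small set of workers.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class SharedPoolDemo {
    // Runs a parallel stream and records the name of every thread
    // that executed at least one element of it.
    static Set<String> threadsUsed(int n) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream.range(0, n)
                 .parallel()
                 .forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }

    public static void main(String[] args) {
        // The common pool is shared by all parallel streams in the JVM.
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        System.out.println("threads used: " + threadsUsed(100_000));
    }
}
```

The exact set of thread names depends on the machine and on what else is running, so no particular output should be expected.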

Adding some ability to parallelize the handling of reference objects seems like
a good idea if you are collecting large numbers of reference objects (>10,000
per GC cycle). However, if you are collecting large numbers of reference objects,
you’re most likely doing something else wrong. IME, finalization is extremely
useful, but really only for a limited number of use cases, and none of them (to
date) have resulted in the app burning through 1000s of finalizable objects / sec.

It would be interesting to know why you picked on this particular issue.

Kind regards,
Kirk



On May 29, 2015, at 5:18 AM, David Holmes <david.hol...@oracle.com> wrote:

> Hi Peter,
> 
> I guess I'm very concerned about the premise that finalization should scale 
> to millions of objects and be performed highly concurrently. To me that's 
> sending the wrong message about finalization. It also isn't the most 
> effective use of CPU resources - most people would want to do useful work on 
> most CPUs most of the time.
> 
> Cheers,
> David
> 
> On 29/05/2015 3:12 AM, Peter Levart wrote:
>> Hi,
>> 
>> Did you know that the following simple loop:
>> 
>> public class FinalizableBottleneck {
>>     static boolean no;
>> 
>>     @Override
>>     protected void finalize() throws Throwable {
>>         // empty finalize() method does not make the object finalizable
>>         // (it is not even registered on the finalizer's list)
>>         if (no) {
>>             throw new AssertionError();
>>         }
>>     }
>> 
>>     public static void main(String[] args) {
>>         while (true) {
>>             new FinalizableBottleneck();
>>         }
>>     }
>> }
>> 
>> 
>> ...quickly fills the entire heap with FinalizableBottleneck and internal
>> Finalizer objects and brings the JVM to a halt? After a few seconds of
>> running the above program, jmap -histo:live reports:
>> 
>>  num     #instances         #bytes  class name
>> ----------------------------------------------
>>    1:      50048325     2001933000  java.lang.ref.Finalizer
>>    2:      50048278      800772448  FinalizableBottleneck
>> 
>> 
>> There are a couple of bottlenecks that make this happen:
>> 
>> - ReferenceHandler thread synchronizes with VM to unhook Reference(s)
>> from the pending chain one by one and dispatches them to their respective
>> ReferenceQueue(s) which also use synchronization for enqueueing each
>> Reference.
>> - Enqueueing synchronizes with the finalization thread which removes the
>> Finalizer(s) (FinalReferences) from the finalization queue and executes
>> them.
>> - Executing the Finalizer(s) removes them from the doubly-linked list of
>> all Finalizer(s) which is used to retain them until they are needed and
>> this synchronizes with the threads that link new Finalizer(s) into the
>> doubly-linked list as new finalizable objects get registered.
>> 
>> We see that the creation of a finalizable object only takes one
>> synchronization (registering into the doubly-linked list) and is
>> performed synchronously, while finalization takes 4 synchronizations
>> among 4 different threads (in pairs) and happens when the Finalizer
>> instance "travels" over from VM thread to ReferenceHandler thread and
>> then to finalization thread. No wonder that finalization can not keep up
>> with allocation in a single thread. The situation is even worse when
>> finalize() methods do some actual work.
>> 
>> I have experimented with various approaches to widen these bottlenecks
>> and found out that I can not beat the ForkJoinPool when combined with
>> some improvements to internal data structures used in reference
>> processing. Here's a prototype I came up with:
>> 
>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/webrev.01/
>> 
>> 
>> And this is the benchmark I use for measuring the throughput:
>> 
>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/FinalizerThroughput.java
>> 
>> 
>> The benchmark shows (results inline in source) that using unpatched JDK,
>> on my PC (i7-2700K, Linux, JDK8) I can not construct more than 1500
>> finalizable objects per ms in a single thread and that while doing so,
>> finalization only manages to process approx. 100 - 120 objects at the
>> same time. Objects "in-flight" quickly accumulate and bring the VM to a
>> halt, where it is not doing anything but full GC cycles.
>> 
>> When constructing in 4 threads, there's not much difference.
>> Construction of finalizable objects simply doesn't scale.
>> 
>> Patched JDK shows something completely different. Single thread
>> construction achieves a rate of 3600 objects / ms. Number of "in-flight"
>> objects is kept constant at about 5-6M instances which amounts to approx
>> 1.5 s of allocation. I think this is about the rate of GC cycles during
>> which VM also processes the references. The benchmark also shows the
>> ForkJoinPool statistics, which show that the number of queued tasks is
>> also kept low.
>> 
>> Increasing the allocation threads to 4 increases allocation rate to
>> about 4300 objects / ms and finalization keeps up. Increasing allocation
>> threads to 8 further increases allocation rate to about 4600 objects /
>> ms and finalization still keeps up. The increase in rate is not linear,
>> but keep in mind that the i7-2700K is a 4-core CPU.
>> 
>> About the implementation...
>> 
>> The first improvement I made was to the doubly-linked list of Finalizer
>> instances that is used to keep them alive until they are needed. I
>> ripped off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin
>> Buchholz and just kept the internal link/unlink methods while
>> specializing them to Finalizer entries (very straightforward). I
>> experimented with throughput and got some improvement, but throughput
>> increased much more when I used several independent lists and
>> distributed registrations among them randomly (unlinking consequently
>> is also distributed randomly).
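
The idea of spreading registrations over several independent lists can be sketched as follows. This is not the code from the webrev: StripedRegistry, Stripe and Node are hypothetical names, and each list here is guarded by a plain monitor lock rather than the lock-free link/unlink methods borrowed from ConcurrentLinkedDeque. It only illustrates how picking a random list per registration reduces contention on any single list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public final class StripedRegistry<T> {
    // Node of an intrusive doubly-linked list; remembers its owning stripe
    // so that unlinking needs no search.
    public static final class Node<T> {
        final T value;
        final Stripe<T> owner;
        Node<T> prev, next;
        boolean linked = true;          // guarded by the owner's lock
        Node(T value, Stripe<T> owner) { this.value = value; this.owner = owner; }
    }

    // One independent list with its own lock (assumption: a simple monitor).
    private static final class Stripe<T> {
        private Node<T> head;
        private int size;

        synchronized Node<T> link(T value) {
            Node<T> n = new Node<>(value, this);
            n.next = head;
            if (head != null) head.prev = n;
            head = n;
            size++;
            return n;
        }

        synchronized boolean unlink(Node<T> n) {
            if (!n.linked) return false; // already removed
            n.linked = false;
            if (n.prev != null) n.prev.next = n.next; else head = n.next;
            if (n.next != null) n.next.prev = n.prev;
            n.prev = n.next = null;
            size--;
            return true;
        }

        synchronized int size() { return size; }
    }

    private final List<Stripe<T>> stripes = new ArrayList<>();

    public StripedRegistry(int stripeCount) {
        for (int i = 0; i < stripeCount; i++) stripes.add(new Stripe<>());
    }

    // Registration picks a stripe at random, so concurrent registrations
    // usually contend on different locks.
    public Node<T> register(T value) {
        int i = ThreadLocalRandom.current().nextInt(stripes.size());
        return stripes.get(i).link(value);
    }

    // Unlinking goes to whichever stripe the node was registered in.
    public boolean unregister(Node<T> node) {
        return node.owner.unlink(node);
    }

    public int size() {
        int total = 0;
        for (Stripe<T> s : stripes) total += s.size();
        return total;
    }
}
```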
>> 
>> I found out that no matter how hard I try to optimize ReferenceQueue
>> while keeping the API unchanged, I can only do so much and that was not
>> enough. I have been surprised by how well ForkJoinPool distributes tasks
>> among threads, so I concluded that leveraging it is the best choice. I
>> re-designed the pending-list unhooking loop to unhook pending references
>> in chunks which greatly improves the throughput. Since unhooking can be
>> performed by a single thread while holding a lock which is mandated by
>> interface between VM and Java, I didn't employ multiple threads, but a
>> single eternal ForkJoinTask that unhooks in chunks and forks-off other
>> processing tasks that process chunks. When there are just a couple of
>> References pending at one time and a not-full chunk is unhooked, then
>> the processing is performed by the same thread that unhooked the
>> references, but when there are more, worker tasks are forked off and the
>> unhooking thread continues at full pace. This processing includes
>> execution of Cleaners, forking the finalizer tasks and enqueue-ing other
>> references. Finalizer(s) are always executed as separate ForkJoinTask(s).
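
The chunked unhooking scheme described above might look roughly like the sketch below. It is an illustration only, not the prototype's code: ChunkedDrain, Pending, CHUNK and the chunk size of 256 are assumptions, a plain linked list stands in for the VM's pending chain, and the sketch joins the forked tasks at the end so that it terminates, whereas the prototype is described as a single eternal task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

public class ChunkedDrain {
    // Hypothetical stand-in for a pending Reference on a singly-linked chain.
    public static final class Pending {
        final long payload;
        Pending next;
        Pending(long payload, Pending next) { this.payload = payload; this.next = next; }
    }

    static final int CHUNK = 256; // assumed chunk size

    // Builds a chain with payloads 1..n.
    public static Pending listOf(int n) {
        Pending head = null;
        for (long i = n; i >= 1; i--) head = new Pending(i, head);
        return head;
    }

    // A single thread unhooks nodes in chunks. Full chunks are handed to the
    // pool; a final, not-full chunk is processed by the unhooking thread itself.
    public static void drain(Pending head, ForkJoinPool pool, LongConsumer handler) {
        List<ForkJoinTask<?>> forked = new ArrayList<>();
        Pending cur = head;
        while (cur != null) {
            List<Pending> chunk = new ArrayList<>(CHUNK);
            while (cur != null && chunk.size() < CHUNK) {
                Pending next = cur.next;
                cur.next = null;        // unhook from the chain
                chunk.add(cur);
                cur = next;
            }
            if (chunk.size() == CHUNK) {
                forked.add(pool.submit(() -> process(chunk, handler)));
            } else {
                process(chunk, handler);
            }
        }
        // Only for this sketch: wait so the caller sees all work done.
        for (ForkJoinTask<?> t : forked) t.join();
    }

    private static void process(List<Pending> chunk, LongConsumer handler) {
        for (Pending p : chunk) handler.accept(p.payload);
    }

    public static void main(String[] args) {
        AtomicLong sum = new AtomicLong();
        drain(listOf(1000), ForkJoinPool.commonPool(), sum::addAndGet);
        System.out.println(sum.get()); // prints 500500
    }
}
```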
>> 
>> It's interesting how Runtime.runFinalization() is implemented in this
>> patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
>> 
>> I also tweaked the ReferenceQueue implementation a bit (it is still used
>> for other kinds of references) so that it avoids synchronization with a
>> monitor lock when there are no blocking waiters and uses CAS to
>> enqueue/dequeue. This improves throughput when the queue is not empty.
>> Since in the prototype multiple threads can enqueue into the same queue,
>> I thought this would improve throughput in such situations.
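
A CAS-based enqueue/dequeue of the kind mentioned above can be sketched with a Treiber-style linked stack. This is a generic illustration, not the patched ReferenceQueue: CasQueue is a hypothetical name, the LIFO order is an assumption made to keep the sketch short, and the fallback to a monitor lock when blocking waiters are present is omitted entirely.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CasQueue<T> {
    // Immutable node; a fresh node per enqueue keeps the CAS free of ABA
    // problems in a garbage-collected runtime.
    private static final class Node<T> {
        final T item;
        final Node<T> next;
        Node(T item, Node<T> next) { this.item = item; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    // Lock-free enqueue: retry until our node is installed as the new head.
    public void enqueue(T item) {
        Node<T> h, n;
        do {
            h = head.get();
            n = new Node<>(item, h);
        } while (!head.compareAndSet(h, n));
    }

    // Lock-free, non-blocking dequeue: returns null when the queue is empty.
    public T poll() {
        Node<T> h;
        do {
            h = head.get();
            if (h == null) return null;
        } while (!head.compareAndSet(h, h.next));
        return h.item;
    }

    public static void main(String[] args) {
        CasQueue<String> q = new CasQueue<>();
        q.enqueue("a");
        q.enqueue("b");
        System.out.println(q.poll()); // prints b (most recently enqueued first)
        System.out.println(q.poll()); // prints a
        System.out.println(q.poll()); // prints null
    }
}
```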
>> 
>> Comments, suggestions, criticism are welcome.
>> 
>> Regards, Peter
>> 
