Hi Moh,

> However, I was hoping this would have the effect of improving
> (non-finalizable) reference handling. We've seen serious issues in
> WeakReference handling and have had to write some twisted code to deal
> with this.
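The "twisted code" in such situations is typically a reference-queue
draining loop. A minimal sketch of a weak-valued cache, using hypothetical
names and certainly not Moh's actual code:

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A weak-valued cache that must drain its ReferenceQueue to avoid
// accumulating stale entries.
class WeakValueCache<K, V> {

    // A weak reference that remembers its key so the map entry can be
    // removed once the referent has been cleared by the GC.
    private static final class Entry<K, V> extends WeakReference<V> {
        final K key;
        Entry(K key, V value, ReferenceQueue<V> q) {
            super(value, q);
            this.key = key;
        }
    }

    private final Map<K, Entry<K, V>> map = new ConcurrentHashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    public V get(K key) {
        expungeStaleEntries();
        Entry<K, V> e = map.get(key);
        return e == null ? null : e.get();   // may still be null if just cleared
    }

    public void put(K key, V value) {
        expungeStaleEntries();
        map.put(key, new Entry<>(key, value, queue));
    }

    // Drain cleared references; must be called from every access path
    // (or a dedicated thread), which is where the bookkeeping gets twisted.
    @SuppressWarnings("unchecked")
    private void expungeStaleEntries() {
        Entry<K, V> e;
        while ((e = (Entry<K, V>) queue.poll()) != null) {
            map.remove(e.key, e);   // remove only if still the same entry
        }
    }
}

The subtlety is that a missed expungeStaleEntries() call leaks Entry
objects even after their referents are long gone.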
Better reference life-cycle handling would actually be beneficial IMHO, as
many cache implementations suffer because of certain aspects of the current
implementation. SoftReference is very difficult to use.

> So I guess the question I have to Kirk and David is: do you feel a GC
> load of 10K WeakReferences per cycle is also "doing something else
> wrong"?

Hard to say, as this really has to be evaluated on a case-by-case basis.
But I'd wonder if the WeakReference was actually needed if you are
recycling them so quickly.

Regards,
Kirk

> Sorry if this is going off-topic.
>
> Thanks
> Moh
>
>> -----Original Message-----
>> From: core-libs-dev [mailto:core-libs-dev-boun...@openjdk.java.net] On
>> Behalf Of Kirk Pepperdine
>> Sent: Thursday, May 28, 2015 11:58 PM
>> To: David Holmes <david.hol...@oracle.com>
>> Cc: hotspot-gc-...@openjdk.java.net; core-libs-d...@openjdk.java.net
>> Subject: Re: JEP 132: More-prompt finalization
>>
>> Hi Peter,
>>
>> It is a very interesting proposal, but to further David's comments, the
>> life-cycle costs of reference objects are horrendous, of which the
>> actual process of finalizing an object is only a fraction of the total
>> cost. Unfortunately your micro-benchmark focuses on only one aspect of
>> that cost. In other words, it isn't very representative of a real
>> concern. In the real world the finalizer *must* compete with mutator
>> threads, and since F-J is an "all threads on deck" implementation, it
>> doesn't play well with others. It creates a "tragedy of the commons": a
>> situation where everyone behaves rationally with a common resource, but
>> to the detriment of the whole group. In short, parallelizing (F-J-ing)
>> *everything* in an application is simply not a good idea. We do not
>> live in an infinite compute environment, which means we have to
>> consider the impact of our actions on the entire group.
>>
>> This was one of the points of my recent article in Java Magazine, which
>> I wrote to try to counter some of the rhetoric I was hearing at
>> conferences about the universal benefits of being able to easily
>> parallelize streams in Java 8. Yes, I agree it's a great feature, but
>> it must be used with discretion. Case in point: after I finished
>> writing the article, I started running into a couple of early adopters
>> that had swallowed the parallel message whole, indiscriminately
>> parallelizing all of their streams. As you can imagine, they were quite
>> surprised by the results and quickly worked to de-parallelize *all* of
>> the streams in the application.
>>
>> Adding some ability to parallelize the handling of reference objects
>> seems like a good idea if you are collecting large numbers of reference
>> objects (>10,000 per GC cycle). However, if you are collecting large
>> numbers of reference objects, you're most likely doing something else
>> wrong. IME, finalization is extremely useful, but really only for a
>> limited number of use cases, and none of them (to date) have resulted
>> in the app burning through 1000s of finalizable objects per second.
>>
>> It would be interesting to know why you picked on this particular
>> issue.
>>
>> Kind regards,
>> Kirk
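Kirk's caution is easy to reproduce: for small, cheap-per-element streams
the fork/join overhead dominates, and every parallel stream shares the one
common pool with the rest of the application. A minimal sketch, not a
rigorous benchmark (use JMH for real measurements):

import java.util.stream.IntStream;

// Sketch of indiscriminate parallelization: with tiny, cheap work per
// element, splitting and joining costs more than the work itself, and
// the parallel variant also steals CPU from other common-pool users.
public class ParallelCost {
    public static void main(String[] args) {
        int[] data = IntStream.range(0, 1_000).toArray();

        long t0 = System.nanoTime();
        long s1 = IntStream.of(data).map(x -> x + 1).sum();            // sequential
        long t1 = System.nanoTime();
        long s2 = IntStream.of(data).parallel().map(x -> x + 1).sum(); // parallel
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d ns, parallel: %d ns (sums %d/%d)%n",
                t1 - t0, t2 - t1, s1, s2);
    }
}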
>> On May 29, 2015, at 5:18 AM, David Holmes <david.hol...@oracle.com> wrote:
>>
>>> Hi Peter,
>>>
>>> I guess I'm very concerned about the premise that finalization should
>>> scale to millions of objects and be performed highly concurrently. To
>>> me that's sending the wrong message about finalization. It also isn't
>>> the most effective use of CPU resources - most people would want to do
>>> useful work on most CPUs most of the time.
>>>
>>> Cheers,
>>> David
>>>
>>> On 29/05/2015 3:12 AM, Peter Levart wrote:
>>>> Hi,
>>>>
>>>> Did you know that the following simple loop:
>>>>
>>>> public class FinalizableBottleneck {
>>>>     static boolean no;
>>>>
>>>>     @Override
>>>>     protected void finalize() throws Throwable {
>>>>         // an empty finalize() method does not make the object finalizable
>>>>         // (it is not even registered on the finalizer's list)
>>>>         if (no) {
>>>>             throw new AssertionError();
>>>>         }
>>>>     }
>>>>
>>>>     public static void main(String[] args) {
>>>>         while (true) {
>>>>             new FinalizableBottleneck();
>>>>         }
>>>>     }
>>>> }
>>>>
>>>> ...quickly fills the entire heap with FinalizableBottleneck and
>>>> internal Finalizer objects and brings the JVM to a halt? After a few
>>>> seconds of running the above program, jmap -histo:live reports:
>>>>
>>>>  num     #instances         #bytes  class name
>>>> ----------------------------------------------
>>>>    1:      50048325     2001933000  java.lang.ref.Finalizer
>>>>    2:      50048278      800772448  FinalizableBottleneck
>>>>
>>>> There are a couple of bottlenecks that make this happen:
>>>>
>>>> - The ReferenceHandler thread synchronizes with the VM to unhook
>>>>   Reference(s) from the pending chain one by one and dispatches them
>>>>   to their respective ReferenceQueue(s), which also use
>>>>   synchronization for enqueueing each Reference.
>>>> - Enqueueing synchronizes with the finalization thread, which removes
>>>>   the Finalizer(s) (FinalReferences) from the finalization queue and
>>>>   executes them.
>>>> - Executing the Finalizer(s) removes them from the doubly-linked list
>>>>   of all Finalizer(s), which is used to retain them until they are
>>>>   needed, and this synchronizes with the threads that link new
>>>>   Finalizer(s) into the doubly-linked list as new finalizable objects
>>>>   get registered.
>>>>
>>>> We see that the creation of a finalizable object takes only one
>>>> synchronization (registering into the doubly-linked list) and is
>>>> performed synchronously, while finalization takes 4 synchronizations
>>>> among 4 different threads (in pairs) and happens as the Finalizer
>>>> instance "travels" from the VM thread to the ReferenceHandler thread
>>>> and then to the finalization thread. No wonder finalization can not
>>>> keep up with allocation in a single thread. The situation is even
>>>> worse when finalize() methods do some actual work.
>>>>
>>>> I have experimented with various approaches to widen these
>>>> bottlenecks and found that I can not beat the ForkJoinPool when it is
>>>> combined with some improvements to the internal data structures used
>>>> in reference processing. Here's a prototype I came up with:
>>>>
>>>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/webrev.01/
>>>>
>>>> And this is the benchmark I use for measuring the throughput:
>>>>
>>>> http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/FinalizerThroughput.java
>>>>
>>>> The benchmark shows (results inline in the source) that using an
>>>> unpatched JDK, on my PC (i7-2700K, Linux, JDK 8) I can not construct
>>>> more than 1500 finalizable objects per ms in a single thread, and
>>>> that while doing so, finalization only manages to process approx.
>>>> 100-120 objects in the same time. Objects "in-flight" quickly
>>>> accumulate and bring the VM to a halt, where it is not doing anything
>>>> but full GC cycles.
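A cut-down illustration of the effect Peter describes, far simpler than
the linked FinalizerThroughput.java and entirely hypothetical: it reports
creation rate, finalization rate and the growing "in-flight" backlog once
per second.

import java.util.concurrent.atomic.AtomicLong;

// Counts how many finalizable objects are created vs. finalized per
// second; on an unpatched JDK the in-flight backlog grows without bound.
public class MiniFinalizerThroughput {
    static final AtomicLong created = new AtomicLong();
    static final AtomicLong finalized = new AtomicLong();

    static class Payload {
        Payload() { created.incrementAndGet(); }
        @Override protected void finalize() { finalized.incrementAndGet(); }
    }

    public static void main(String[] args) {
        Thread reporter = new Thread(() -> {
            long lastC = 0, lastF = 0;
            while (true) {
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
                long c = created.get(), f = finalized.get();
                System.out.printf("created %d/s, finalized %d/s, in-flight ~%d%n",
                        c - lastC, f - lastF, c - f);
                lastC = c; lastF = f;
            }
        });
        reporter.setDaemon(true);
        reporter.start();

        while (true) {           // allocate as fast as one thread can
            new Payload();
        }
    }
}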
>>>> When constructing in 4 threads, there's not much difference.
>>>> Construction of finalizable objects simply doesn't scale.
>>>>
>>>> The patched JDK shows something completely different. Single-thread
>>>> construction achieves a rate of 3600 objects/ms. The number of
>>>> "in-flight" objects is kept constant at about 5-6M instances, which
>>>> amounts to approx. 1.5 s of allocation. I think this is about the
>>>> rate of GC cycles, during which the VM also processes the references.
>>>> The benchmark also shows the ForkJoinPool statistics, which show that
>>>> the number of queued tasks is also kept low.
>>>>
>>>> Increasing the number of allocating threads to 4 increases the
>>>> allocation rate to about 4300 objects/ms, and finalization keeps up.
>>>> Increasing it to 8 further increases the allocation rate to about
>>>> 4600 objects/ms, and finalization still keeps up. The increase in
>>>> rate is not linear, but keep in mind that the i7 is a 4-core CPU.
>>>>
>>>> About the implementation...
>>>>
>>>> The 1st improvement I made was to the doubly-linked list of Finalizer
>>>> instances that is used to keep them alive until they are needed. I
>>>> ripped off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin
>>>> Buchholz and just kept the internal link/unlink methods while
>>>> specializing them to Finalizer entries (very straightforward). I
>>>> experimented with throughput and got some improvement, but throughput
>>>> increased much more when I used several instances of independent
>>>> lists and distributed registrations among them randomly (unlinking is
>>>> consequently also distributed randomly).
>>>>
>>>> I found that no matter how hard I tried to optimize ReferenceQueue
>>>> while keeping the API unchanged, I could only do so much, and that
>>>> was not enough. I have been surprised by how well ForkJoinPool
>>>> distributes tasks among threads, so I concluded that leveraging it is
>>>> the best choice. I re-designed the pending-list unhooking loop to
>>>> unhook pending references in chunks, which greatly improves the
>>>> throughput. Since unhooking can only be performed by a single thread
>>>> while holding a lock, which is mandated by the interface between the
>>>> VM and Java, I didn't employ multiple threads, but a single eternal
>>>> ForkJoinTask that unhooks in chunks and forks off other tasks that
>>>> process the chunks. When there are just a couple of References
>>>> pending at one time and a not-full chunk is unhooked, the processing
>>>> is performed by the same thread that unhooked the references, but
>>>> when there are more, worker tasks are forked off and the unhooking
>>>> thread continues undisturbed. This processing includes executing
>>>> Cleaners, forking the finalizer tasks and enqueueing other
>>>> references. Finalizer(s) are always executed as separate
>>>> ForkJoinTask(s).
>>>>
>>>> It's interesting how Runtime.runFinalizers() is implemented in this
>>>> patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
>>>>
>>>> I also tweaked the ReferenceQueue implementation a bit (it is still
>>>> used for other kinds of references) so that it avoids synchronization
>>>> with a monitor lock when there are no blocking waiters and uses CAS
>>>> to enqueue/dequeue. This improves throughput when the queue is not
>>>> empty. Since in the prototype multiple threads can enqueue into the
>>>> same queue, I thought this would improve throughput in such
>>>> situations.
>>>>
>>>> Comments, suggestions, criticism are welcome.
>>>>
>>>> Regards, Peter
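The lock-free enqueue/dequeue idea in that last tweak can be sketched as a
simple CAS-based Treiber stack. This is a generic shape, not the code from
the webrev; the real ReferenceQueue would still need to fall back to the
monitor to support threads blocked in remove():

import java.util.concurrent.atomic.AtomicReference;

// A lock-free LIFO queue: enqueue and poll both use a single CAS on the
// head, so no monitor is involved as long as nobody is blocked waiting.
class CasQueue<T> {
    private static final class Node<T> {
        final T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    public void enqueue(T item) {
        Node<T> n = new Node<>(item);
        Node<T> h;
        do {
            h = head.get();
            n.next = h;                        // link before publishing
        } while (!head.compareAndSet(h, n));   // retry on contention
    }

    public T poll() {
        Node<T> h;
        do {
            h = head.get();
            if (h == null) return null;        // empty; no blocking here
        } while (!head.compareAndSet(h, h.next));
        return h.item;
    }
}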