We've not-so-slightly hijacked Nils' thread here - apologies for that.
On 25/02/2013 8:05 AM, Peter Levart wrote:
Just looked at one way jstat accesses the counters. It runs in a separate VM and maps-in a file that is already mapped in the observing VM in the direct buffer. It then accesses it via a LongBuffer view (for long counters). So there's no synchronization between counter updater and counter reader. On ARM v6 jstat could see a "torn" long counter then, right?
Right. The current implementation of PerfLongCounter uses simple stores (not atomic ops).
The double-32bit-CAS updater that I presented would not make it worse than on such platforms, I suppose.
No change in tearing ability.
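For readers following along, the double-32bit-CAS idea can be sketched roughly as below. This is only an illustration under assumptions: it is not the code from the webrev, the class name SplitCounter is made up, and it uses two AtomicInteger fields in place of the two halves of a native-memory long. The low word is updated with a 32-bit CAS loop, and the high word is touched only when there is a carry or the added value has high bits.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not the code from the webrev.
final class SplitCounter {
    private final AtomicInteger lo = new AtomicInteger(); // low 32 bits
    private final AtomicInteger hi = new AtomicInteger(); // high 32 bits

    void add(long value) {
        int addLo = (int) value;
        int addHi = (int) (value >>> 32);

        // First 32-bit CAS: update the low word.
        int oldLo, newLo;
        do {
            oldLo = lo.get();
            newLo = oldLo + addLo;
        } while (!lo.compareAndSet(oldLo, newLo));

        // Unsigned wrap-around of the low word means a carry into the high word.
        int carry = Integer.compareUnsigned(newLo, oldLo) < 0 ? 1 : 0;
        int hiDelta = addHi + carry;

        // Second 32-bit atomic update, needed only rarely for small increments.
        if (hiDelta != 0) {
            hi.addAndGet(hiDelta);
        }
    }

    // Not atomic across the two halves: a concurrent reader can observe a
    // half-updated value. The result is exact once all add() calls are done.
    long get() {
        return ((long) hi.get() << 32) | (lo.get() & 0xFFFFFFFFL);
    }

    public static void main(String[] args) throws InterruptedException {
        SplitCounter c = new SplitCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int n = 0; n < 1_000_000; n++) c.add(0x10000L);
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get() == 4L * 1_000_000 * 0x10000L);
    }
}
```

As in the discussion above, the sum is only eventually correct when add() is the sole operation, and an observer reading between the two updates can still see a half-updated value.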
On the platforms that support 64bit atomic stores, there are no such problems. And I assume those same platforms also support 64bit CAS, or are there platforms with 64bit atomic stores and no 64bit CAS?
Most of them actually :) All Java platforms must support atomic load/store of 64-bit values to support volatile long and double variables. On 32-bit platforms this is done via a range of techniques - for example on x86 it is done via the FPU. But these atomic accesses are currently restricted to Java volatile field accesses via bytecode - they are not exposed via the Unsafe methods, nor are they made available via the Atomic:: class in the VM.
Some of these 32-bit platforms also support the 64-bit CAS, which is what supports_cx8() is intended to indicate.
If the PerfCounters were supposed to be thread-safe then they might use these alternate atomic access operations.
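To illustrate the Java-level guarantee mentioned above (and only that; this says nothing about Unsafe or the VM's Atomic:: class), here is a small sketch. The class and method names are invented for the example. A writer flips a volatile long between all-zero and all-one bits while a reader checks that it never sees a mix of the two halves. Without the volatile modifier the language specification permits such a mixed value on some 32-bit VMs.

```java
// Hypothetical demo class; the names are made up for illustration.
final class VolatileLongDemo {
    // Reads and writes of a volatile long are always atomic in Java.
    private static volatile long shared;

    // Returns how many reads saw a value that is neither 0 nor -1.
    static int countTornReads(int writes) {
        shared = 0L;
        Thread writer = new Thread(() -> {
            for (int i = 0; i < writes; i++) {
                shared = (i & 1) == 0 ? -1L : 0L;
            }
        });
        writer.start();

        int torn = 0;
        while (writer.isAlive()) {
            long v = shared;
            if (v != 0L && v != -1L) {
                torn++;
            }
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return torn;
    }

    public static void main(String[] args) {
        System.out.println("torn reads: " + countTornReads(10_000_000));
    }
}
```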
David
Regards, Peter

-----

If this is true and it is not that important, then instead of a synchronized update of 64bit counter, a 32bit CAS could be used, optionally (rarely) followed by a second 32bit CAS, like for example:

http://dl.dropbox.com/u/101777488/jdk8-tl/PerfCounter/webrev.01/index.html

I tried this on ARM v6 and it works much better than synchronized access, but I don't know if it's acceptable. It guarantees eventual correctness of summed value if the only operation performed is add() (no set() intermingled) and has the same possibility of incorrect half-half reads by observers as current PerfCounter has for unsynchronized observers.

Here's the comparison of unpatched/patched PerfCounter.increment() micro-benchmark on single-core ARM v6 (Raspberry Pi):

*** Original PerfCounter, ARM v6
#
# PerfCounter_increment: run duration: 5,000 ms, #of logical CPUS: 1
#
1 threads, Tavg =    269.34 ns/op (σ =     0.00 ns/op) [   269.34]
2 threads, Tavg =  7,170.48 ns/op (σ =   410.77 ns/op) [ 6,783.73,  7,603.95]
3 threads, Tavg = 12,034.82 ns/op (σ =   418.99 ns/op) [11,792.33, 11,714.67, 12,639.26]
4 threads, Tavg = 16,029.76 ns/op (σ = 1,411.44 ns/op) [15,592.04, 18,511.52, 15,642.52, 14,818.16]

*** Patched PerfCounter, ARM v6
#
# PerfCounter_increment: run duration: 5,000 ms, #of logical CPUS: 1
#
1 threads, Tavg = 166.21 ns/op (σ = 0.00 ns/op) [ 166.21]
2 threads, Tavg = 332.58 ns/op (σ = 0.12 ns/op) [ 332.45, 332.70]
3 threads, Tavg = 500.30 ns/op (σ = 0.22 ns/op) [ 500.04, 500.29, 500.58]
4 threads, Tavg = 667.95 ns/op (σ = 2.11 ns/op) [ 665.22, 667.18, 668.40, 671.04]

Regards, Peter

On 02/24/2013 11:31 AM, David Holmes wrote:

On 24/02/2013 6:50 PM, Peter Levart wrote:

Hi David,

I thought it was ok to pass null, but I don't know the "portability" issues in-depth. The javadoc for Unsafe says:

/"This method refers to a variable by means of two parameters, and so it provides (in effect) a double-register addressing mode for Java variables. When the object reference is null, this method uses its offset as an absolute address. This is similar in operation to methods such as getInt(long), which provide (in effect) a single-register addressing mode for non-Java variables. However, because Java variables may have a different layout in memory from non-Java variables, programmers should not assume that these two addressing modes are ever equivalent. Also, programmers should remember that offsets from the double-register addressing mode cannot be portably confused with longs used in the single-register addressing mode."/

That is the doc for getXXX but not for getAndAddXXX or compareAndSwapXXX. You can't have null here:

UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapLong(JNIEnv *env, jobject unsafe, jobject obj, jlong offset, jlong e, jlong x))
  UnsafeWrapper("Unsafe_CompareAndSwapLong");
  Handle p (THREAD, JNIHandles::resolve(obj));
  jlong* addr = (jlong*)(index_oop_from_field_offset_long(p(), offset));
  if (VM_Version::supports_cx8())
    return (jlong)(Atomic::cmpxchg(x, addr, e)) == e;
  else {
    jboolean success = false;
    ObjectLocker ol(p, THREAD);
    if (*addr == e) { *addr = x; success = true; }
    return success;
  }
UNSAFE_END

David

-----

Does anybody know the in-depth interpretation of the above? Is it only the particular Java/native type differences (for example, endianness of variables) that these two addressing modes might interpret differently or something else too?

Regards, Peter

On 02/24/2013 12:39 AM, David Holmes wrote:

Peter,

In your use of Unsafe you pass "null" as the object. I'm pretty certain you can't pass null here. Unsafe operates on fields or array elements.

David

On 24/02/2013 5:39 AM, Peter Levart wrote:

Hi Nils,

If the counters are updated frequently from multiple threads, there might be contention/scalability issues.
Instead of synchronization on updates, you might consider using atomic updates provided by sun.misc.Unsafe, like for example:

Index: jdk/src/share/classes/sun/misc/PerfCounter.java
===================================================================
--- jdk/src/share/classes/sun/misc/PerfCounter.java
+++ jdk/src/share/classes/sun/misc/PerfCounter.java
@@ -25,6 +25,8 @@

 package sun.misc;

+import sun.nio.ch.DirectBuffer;
+
 import java.nio.ByteBuffer;
 import java.nio.ByteOrder;
 import java.nio.LongBuffer;
@@ -50,6 +52,8 @@
 public class PerfCounter {
     private static final Perf perf =
         AccessController.doPrivileged(new Perf.GetPerfAction());
+    private static final Unsafe unsafe =
+        Unsafe.getUnsafe();

     // Must match values defined in hotspot/src/share/vm/runtime/perfdata.hpp
     private final static int V_Constant = 1;
@@ -59,12 +63,14 @@

     private final String name;
     private final LongBuffer lb;
+    private final DirectBuffer db;

     private PerfCounter(String name, int type) {
         this.name = name;
         ByteBuffer bb = perf.createLong(name, U_None, type, 0L);
         bb.order(ByteOrder.nativeOrder());
         this.lb = bb.asLongBuffer();
+        this.db = bb instanceof DirectBuffer ? (DirectBuffer) bb : null;
     }

     static PerfCounter newPerfCounter(String name) {
@@ -79,23 +85,44 @@
     /**
      * Returns the current value of the perf counter.
      */
-    public synchronized long get() {
+    public long get() {
+        if (db != null) {
+            return unsafe.getLongVolatile(null, db.address());
+        }
+        else {
+            synchronized (this) {
-        return lb.get(0);
-    }
+                return lb.get(0);
+            }
+        }
+    }

     /**
      * Sets the value of the perf counter to the given newValue.
      */
-    public synchronized void set(long newValue) {
+    public void set(long newValue) {
+        if (db != null) {
+            unsafe.putOrderedLong(null, db.address(), newValue);
+        }
+        else {
+            synchronized (this) {
-        lb.put(0, newValue);
-    }
+                lb.put(0, newValue);
+            }
+        }
+    }

     /**
      * Adds the given value to the perf counter.
      */
-    public synchronized void add(long value) {
-        long res = get() + value;
+    public void add(long value) {
+        if (db != null) {
+            unsafe.getAndAddLong(null, db.address(), value);
+        }
+        else {
+            synchronized (this) {
+                long res = lb.get(0) + value;
-        lb.put(0, res);
+                lb.put(0, res);
+            }
+        }
     }

     /**

Testing the PerfCounter.increment() method in a loop on multiple threads sharing the same PerfCounter instance, for example, on a 4-core Intel i7 machine produces the following results:

#
# PerfCounter_increment: run duration: 5,000 ms, #of logical CPUS: 8
#
1 threads, Tavg =  19.02 ns/op (σ =  0.00 ns/op)
2 threads, Tavg = 109.93 ns/op (σ =  6.17 ns/op)
3 threads, Tavg = 136.64 ns/op (σ =  2.99 ns/op)
4 threads, Tavg = 293.26 ns/op (σ =  5.30 ns/op)
5 threads, Tavg = 316.94 ns/op (σ =  6.28 ns/op)
6 threads, Tavg = 686.96 ns/op (σ =  7.09 ns/op)
7 threads, Tavg = 793.28 ns/op (σ = 10.57 ns/op)
8 threads, Tavg = 898.15 ns/op (σ = 14.63 ns/op)

With the presented patch, the results are a little better:

#
# PerfCounter_increment: run duration: 5,000 ms, #of logical CPUS: 8
#
# Measure:
1 threads, Tavg =   5.22 ns/op (σ =  0.00 ns/op)
2 threads, Tavg =  34.51 ns/op (σ =  0.60 ns/op)
3 threads, Tavg =  54.85 ns/op (σ =  1.42 ns/op)
4 threads, Tavg =  74.67 ns/op (σ =  1.71 ns/op)
5 threads, Tavg =  94.71 ns/op (σ = 41.68 ns/op)
6 threads, Tavg = 114.80 ns/op (σ = 32.10 ns/op)
7 threads, Tavg = 136.70 ns/op (σ = 26.80 ns/op)
8 threads, Tavg = 158.48 ns/op (σ =  9.93 ns/op)

The scalability is not much better, but the raw speed is, so it might present less contention when used in real-world code.

If you wanted even better scalability, there is a new class in JDK8, the java.util.concurrent.LongAdder. But that doesn't buy atomic "set()" - only "add()". And it can't update native-memory variables, so it could only be used for add-only counters and in conjunction with a background thread that would periodically flush the sum to the native memory....
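The LongAdder-plus-background-flush idea from the last paragraph could look roughly like the sketch below. Everything here is an assumption made for illustration: the class name FlushedAdderCounter, the flush period, and the use of a directly allocated ByteBuffer as a stand-in for the perf memory that perf.createLong() would return. It is add-only, and observers only see the value as of the last flush.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical add-only counter; not part of any proposed patch.
final class FlushedAdderCounter {
    private final LongAdder adder = new LongAdder();
    private final LongBuffer slot; // stand-in for the native counter slot

    FlushedAdderCounter(LongBuffer slot) {
        this.slot = slot;
    }

    // Scalable under contention; does not touch the native slot.
    void add(long value) {
        adder.add(value);
    }

    // Publishes the current sum with a plain store, so an unsynchronized
    // observer can still see a torn value on platforms without atomic
    // 64-bit stores.
    void flush() {
        slot.put(0, adder.sum());
    }

    // Starts a daemon thread that flushes the sum at a fixed period.
    ScheduledExecutorService startFlusher(long periodMillis) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "perf-counter-flusher");
            t.setDaemon(true);
            return t;
        });
        ses.scheduleAtFixedRate(this::flush, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return ses;
    }

    public static void main(String[] args) throws InterruptedException {
        LongBuffer slot = ByteBuffer.allocateDirect(8)
                .order(ByteOrder.nativeOrder())
                .asLongBuffer();
        FlushedAdderCounter c = new FlushedAdderCounter(slot);
        ScheduledExecutorService ses = c.startFlusher(10);
        for (int i = 0; i < 1000; i++) {
            c.add(1);
        }
        ses.shutdown();
        ses.awaitTermination(1, TimeUnit.SECONDS);
        c.flush(); // final flush so the slot holds the complete sum
        System.out.println(slot.get(0));
    }
}
```

The trade-off is staleness: a reader such as jstat would lag behind by up to one flush period, which may or may not matter for a diagnostic counter.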
Regards, Peter

On 02/08/2013 06:10 PM, Nils Loodin wrote:

It would be interesting to know the number of thrown throwables in the JVM, to be able to do some high level application diagnostics / statistics. A good way to put this number would be a performance counter, since it is accessible both from Java and from the VM.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8007806
http://cr.openjdk.java.net/~nloodin/8007806/webrev.00/

Regards, Nils Loodin