Re: Confusion regarding 'mark-sweep-compact' naming

2019-08-16 Thread Aleksey Shipilev
On 8/16/19 10:07 AM, Gil Tene wrote:
> Classification terms evolve as the art evolves.

What I want to emphasize here is that this discussion reinvents the meaning of "sweep". That
definition is not used in the way you describe in the sources I know of. Granted, definitions drift
over time, but we have to be careful to separate the "agreed upon" definition from whatever
definition we want to become the emergent one.

> On Thursday, August 15, 2019 at 1:42:23 AM UTC-7, Aleksey Shipilev wrote:
> I am trying to understand what your central argument is here. This seems to be it. Are you
> saying that "sweeping" is when you visit dead objects, and non-"sweeping" is when you do not?
> 
> Yes. I think that's a very good summary.

Would be helpful if we started from this next time around!

> Is walking _dead_ objects the discriminator for "sweeping" then? So in your book, if we take the
> same Lisp2 collector, and compute-new-addresses and adjust-pointers steps are walking the
> self-parsable heap (visiting dead objects along the way), it is M-S-C? But if it uses a marking
> bitmap (thus only visiting live objects), it becomes M-C? [This would be a weird flex, but okay].
> 
> 
> Exactly. "In my book", adding an efficient livemark bit vector with possible summary layers would
> convert the classic LISP2 GC from a Mark-Sweep-Compact to a Mark-Compact (with no sweep).

So this is what irks me here. In my toy mark-compact collector for Epsilon
(https://shipilev.net/jvm/diy-gc/), changing from mark bitmaps to self-parsable heap traversal is
about 5 lines of code. That does not feel like it passes the bar for reclassifying the collector
algorithm between major classes. It feels more like an implementation detail that can swing both
ways, depending on circumstances.

-Aleksey



Re: Confusion regarding 'mark-sweep-compact' naming

2019-08-15 Thread Aleksey Shipilev
On 8/14/19 9:21 PM, Gil Tene wrote:
> The fact that they visit and parse dead objects in order to identify 
> recycle-able memory is what makes them
> sweepers. Other techniques that don't visit dead objects do not perform a 
> sweep.

I am trying to understand what your central argument is here. This seems to be it. Are you saying
that "sweeping" is when you visit dead objects, and non-"sweeping" is when you do not?

>> I would leave "sweep" as something that reclaims the storage of the objects 
>> (i.e. "sweeps away" the
>> garbage), which makes M-C and M-S classes more clearly defined and easier to 
>> reason about.
> 
> When M-C implies "also does a full linear traversal that visits all dead objects" (aka "a
> sweep"), that terminology fails. Examples like C4, G1GC, Shenandoah, and ZGC are all Mark-Compact
> (NO Sweep) collectors, while ParallelGC and Serial GC are Mark-Compact (WITH Sweep) collectors.
> Since the math behind those varies dramatically in ways that affect everyday choices, using "M-C"
> as a classification that covers both leads people to misunderstand the interactions between e.g.
> heap size and GC work.

M-C does not imply "full linear traversal that visits all dead objects". As a counter-example, it
is very practical to use off-to-the-side marking bitmaps to walk only the live objects of the heap.
Careful design of such a marking bitmap can yield dramatically better results than any
self-parsable heap walk. You can even make it multi-level and quite dense to skip over large chunks
of sparse heap.
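
A minimal sketch of such a bitmap walk, assuming one mark bit per 8-byte heap granule (names are
made up for illustration):

  long[] bits;      // side mark bitmap
  long heapStart;   // heap base address

  void walkLive() {
      for (int i = 0; i < bits.length; i++) {
          long w = bits[i];
          if (w == 0) continue;  // one test skips 64 granules of dead/empty heap
          while (w != 0) {
              int bit = Long.numberOfTrailingZeros(w);
              visitLive(heapStart + (((long) i * 64 + bit) << 3));
              w &= w - 1;        // clear the lowest set bit
          }
      }
  }

A summary layer is then just another bitmap over the words of "bits", letting the walk skip
megabytes of sparse heap per test.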

When the marking phase is a separate phase, it stands to reason that the subsequent phases have to
process the marking results somehow. That naturally involves some sort of walk over the data
structure that holds the marks. If we call that walk "sweeping", then every Mark-* algorithm is
necessarily "sweeping" too. So we have to break this somehow to make the definition viable.

Is walking _dead_ objects the discriminator for "sweeping" then? So in your book, if we take the
same Lisp2 collector, and compute-new-addresses and adjust-pointers steps are walking the
self-parsable heap (visiting dead objects along the way), it is M-S-C? But if it uses a marking
bitmap (thus only visiting live objects), it becomes M-C? [This would be a weird flex, but okay].

-Aleksey




Re: Confusion regarding 'mark-sweep-compact' naming

2019-08-06 Thread Aleksey Shipilev
On 8/6/19 7:38 AM, Peter Veentjer wrote:
> There is still some confusion on my side. This time it is regarding the 
> algorithmic complexity of
> mark-sweep-compact.
> 
> The complexity of the mark phase is linear to the live set.
> The complexity of the sweep phase is linear to the heap size.
> The complexity of the compact phase is linear to the live set.

Well...

The multiplicative factors in either case might make one linearity perform better than the other on
wildly different heap sizes and live sets. An adversarial example: N large byte arrays of size M
each. Compaction might have to copy N*M bytes (especially when doing sliding compaction), while the
sweep might only need to visit N locations to check on object metadata.
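
To put illustrative (made-up) numbers on it: with N = 1,000 arrays of M = 8 MB each, sliding
compaction may move N*M = 8 GB of payload, while the sweep visits N = 1,000 headers -- a few KB of
metadata traffic. Both phases are "linear", but the constants differ by about six orders of
magnitude.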

There are second-order heap size effects for both mark and compact. Take two heaps with the same
live set, one very dense and one very sparse: the mere fact that you have to walk memory that is
far away adds a secondary heap-size effect. It would not linger for compacting collectors, though,
as the first few compaction cycles would collapse the heap into the denser shape.

That is all to say that algorithmic complexity is a seriously bad way to reason 
about (GC) performance.

> Why would you ever want to include a sweep phase, given that the complexity of
> mark-sweep-compact is now linear to the heap size; a lot worse compared to 'linear to the live
> set'. Get rid of the sweep and the complexity goes down to 'linear to the live set'. What is the
> added value of such an inefficient GC implementation?

I think the m-s-c nomenclature is confusing, as the actual implementations are doing either m-s or
m-c during their individual phases. The benefit of doing either is well covered.

The hard naming question is how to properly label a collector whose different phases employ
different algorithms. For example, in CMS the young collection is copying, the old collection is
mark-sweep, and the fall-back full GC is mark-compact. How would you label CMS then? I believe
calling it m-s-c is borderline tolerable, until users get confused thinking both sweep and compact
happen in the same phase.

Also, with implementations that manage backing storage in fine-grained chunks (for example, free
lists), "sweep" might mean dealing with them on a per-dead-object basis. For example, this might
mean preparing the storage for subsequent compaction by "sweeping" out old objects before
compaction kicks in, which requires free space. Practical implementations I know of opt to instead
reconstruct the backing storage metadata after massive compaction moves, though.

My $0.02c, anyway.

-Aleksey



Re: Release fence pollution in constructors and builders

2019-07-17 Thread Aleksey Shipilev
On 7/17/19 9:06 AM, Andriy Plokhotnyuk wrote:
> 1) Use of release (and sometimes store-store) fences in all constructors in GraalVM:
> https://github.com/oracle/graal/blob/e9dbc9d6d3080daf22dec38ce8c234417e1a8e3c/compiler/src/org.graalvm.compiler.replacements/src/org/graalvm/compiler/replacements/DefaultJavaLoweringProvider.java#L925

I don't follow. It puts LoadStore for all constructors that store final fields (as expected), and
StoreStore for everything else (as expected). StoreStore is needed to protect the freshly allocated
object's metadata from races. Note that you are allowed to see the "default" values of Java fields,
because publishing the reference to the object may appear "before" the constructor stores are
observed. But the same thing would be catastrophic for "metadata" stores like the initial mark and
klass words: in the _best_ case scenario, the JVM would just crash.
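
To illustrate with the canonical unsafe publication pattern (a sketch, not code from Graal):

  class Holder {
      int x;                  // no final fields
      Holder() { x = 42; }
  }

  Holder h;                   // published racily, not volatile

  // Thread 1:
  h = new Holder();           // VM writes mark/klass words, then constructor writes x

  // Thread 2:
  Holder local = h;
  if (local != null)
      System.out.println(local.x);  // may legally print 0, the "default" value

Seeing a default *field* value is a race the spec allows; observing a half-written mark or klass
word is not survivable, which is why the StoreStore covering the metadata stores is unconditional.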

Graal is not alone in this, see e.g. 
https://bugs.openjdk.java.net/browse/JDK-8032481

> 2) Use of release fences in constructors or builders of immutable collections in Scala 2.13:
> https://github.com/scala/scala/search?q=releaseFence%28%29=Code

I vaguely remember this being discussed, but cannot remember the conclusion. It tries to emulate
final fields without declaring them final? It does not work completely without a LoadLoad on the
reader side (https://shipilev.net/blog/2014/all-fields-are-final/#_implementation_support), but it
is safer than not doing the release stores at all.
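
The emulation itself looks roughly like this -- a hedged sketch using the Java 9+ VarHandle fence;
the Scala helper bottoms out in something similar:

  import java.lang.invoke.VarHandle;

  class Node {
      Object value;                 // deliberately not final
      Node(Object v) {
          value = v;
          VarHandle.releaseFence(); // emulates the final-field trailing barrier
      }
  }

The reader side stays plain, which is exactly the missing-LoadLoad caveat from the link above.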

> The question is: why can't all these fences be moved to the places that work with concurrency
> directly: GCs, static field initializers, concurrent data structures and APIs like futures,
> actors, etc.?

I don't follow this suggestion. These fences are about surviving races (which is what finals are
for), and thus every follow-up access to the object is exposed to those races on every path,
including the paths that do not deal with concurrency "immediately".

-Aleksey



Re: JMM- conservative and formal point of view.

2018-05-16 Thread Aleksey Shipilev
On 05/16/2018 09:19 PM, John Hening wrote:
> By the way, in your transcription you said:
>     int a;
>     volatile boolean ready = false;
> 
>     void actor(IntResult1 r) {
>         while (!ready) {}
>         r.r1 = a;
>     }
> 
>     void observer() {
>         a = 41;
>         a = 42;
>         ready = true;
>         a = 43;
>     }
> 
> 
> You said that possible values for r1 are 42 and 43. I cannot understand why 41 is impossible.
> What prevents write(a, 41) and write(a, 42) from being reordered?

Formally, there is no plausible JMM execution that yields both read(ready):true 
and read(a):41,
which means observing 41 is forbidden in any conforming implementation.

Proof sketch starts from the realization that in all executions where we 
observed r(ready):true it
is inevitable that:
  w(ready, true) --hb--> r(ready):true

...which transitively extends to:
  w(a, 41) --hb--> w(a, 42) --hb--> w(ready, true) --hb--> r(ready):true 
--hb--> r(a):?

Happens-before consistency mandates that reads see either the latest write in happens-before
order, or any other write that is not ordered by happens-before. So, in the execution above, you
can only have r(a):42 [latest in HB], or r(a):43 [race].

But all of that is too low-level for day-to-day work. What you need is a higher-level abstraction:
ready = true "released" the consistent view of the stores in the writing thread, and once the
reading thread saw "ready == true", it "acquired" that consistent view. That is what safe
publication (or release-acquire) basically is. Moreover, if there are no races, you cannot see
inconsistent (racy) values, and then your program behaves as if it were sequentially consistent
(with caveats, this is what SC-DRF says).

This high-level understanding plays well when you scale the thinking back to the single-threaded
case. What exactly prevents the single thread from seeing 41 in this example? What exactly prevents
the reordering of the two stores? ;)

void test() {
  a = 41;
  a = 42;
  r1 = a;
}

You can use a similar JMM proof to work out that only 42 is allowed. But at a high level, it is the
same "consistent view", and release-acquire just extends it to the multi-threaded case.

> JMM is a weak model. It means that JMM is allowed to reorder all normal memory operations,
> provided that the reordered program is equivalent in a one-threaded environment. In my example I
> see nothing that prevents reordering.

That's where the problem lies: once you think about the reorderings, you slip into thinking about
the implementation, then barriers show up, and everything gets too confusing. The trick is to slap
yourself every time "barrier" or "reordering" bubbles up :) Hardware and runtime speak to each
other to maintain the illusion that release-acquire works as you would expect.

Coming back to the original slide you mentioned, roach motel semantics basically says that it is
almost always safe for the implementation to move operations into the release-acquire region. It is
harder to prove that moving operations out is safe, but sometimes that is easily provable too.

It is an interesting exercise to understand the implementation for 
mechanical-sympathetic reasons
e.g. for performance modeling, but not so much for correctness proofs.

-Aleksey



Re: JMM- conservative and formal point of view.

2018-05-16 Thread Aleksey Shipilev
On 05/16/2018 07:33 PM, John Hening wrote:
> Subject: JMM- conservative and formal point of view.

There is no "conservative" JMM point of view. There is the JSR 133 Cookbook for Compiler Writers,
which describes the conservative implementation, but it is not the JMM itself. Reductio ad
absurdum: suppose a "JSR 133 Fanfic for Concurrency Experts" said that the easy way to be
JMM-conformant is to acquire a lock before entering any thread -- basically implementing a Global
Interpreter Lock. Would you be willing to take the existence of a GIL in some implementations as
guidance for writing reliable software? I hope not.

> As I understand this slide points that JMM does not constitute that a 
> volatile write does not work
> as a (Load/Store)Store barrier, and it doesn't in fact.

Double negation is confusing here. That slide points out that roach motel semantics basically means
some transformations are easy (putting stuff "into" the acq/rel block), and others are still
possible, but hard. There are cases where it is not hard, actually: for example, when we know the
access is provably thread-local.
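
A sketch of the two directions:

  x = 1;                  // may legally sink *into* the region below
  synchronized (lock) {   // acquire
      y = 2;              // hoisting this *out*, past the release/acquire,
  }                       // release -- needs a proof, e.g. that y is thread-local

Moving into the roach motel is the easy direction; moving out requires the optimizer to prove the
access is unobservable from other threads.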


> But the JVM is allowed to reorder such a situation only in special cases, when the execution
> would be *equivalent as if write(x,1) were not reordered with release(g)*. In other words, the
> reordering is allowed if and only if the JVM is able to prove that it is equivalent to the
> situation without reordering.

Caveat: when the *outcome* would be equivalent to some allowed JMM execution. 
It does not mean the
actual implementation should issue machine instructions in any given order.


> The question is:
> 
> So, for considering *correctness* of program (and only *correctness*) can we 
> assume that volatile
> store works as (Load/Store)Store barrier in fact?

No. The fluff about barriers is implementation details, which gets confusing 
very quickly as
implementations screw with your code.

I ranted about this a week ago at GeeCON:
 https://shipilev.net/talks/geecon-May2018-jmm.pdf

-Aleksey



Re: JITed very simple method.

2018-02-06 Thread Aleksey Shipilev
On 02/05/2018 09:58 PM, John Hening wrote:
> 1. Why do you find it is a bug in tiered compilation machinery? It is not, to my eye. The
> compilation level is: C2, level 4

Wait a minute, where exactly does it say "C2, level 4" for you?

I am saying the disassembly you have provided in the original message is 
probably level 2/3, and if
it is hot, that might be a bug in tiered compilation machinery: we are not 
supposed to spend a lot
of time in methods with profiling enabled.

But you haven't verified that this disassembly is actually on the hot path. Level 4 code would be
on the hot path, and it would be without profiling.

> 2. I still have a doubt: why is the profiling counter not synchronized? What if 2 or more
> threads are executing a function?

That is a race, so the profile is not very accurate. We (mostly) do not care: there is a tradeoff
between profiling overhead and profile accuracy. (Yes, I know racy updates are quirky and can
potentially lose an unbounded number of updates.)
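
The racy profile counter is essentially:

  int count;            // profile slot, deliberately unsynchronized

  void profiledPath() {
      count++;          // load; add; store -- not atomic
  }

Two threads can both load 41 and both store 42, losing one update; a thread stalled between its
load and store can later clobber any number of increments, which is the unbounded-loss quirk.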

-Aleksey



Re: Is there a way to print Escape Analysis?

2018-01-25 Thread Aleksey Shipilev
On 01/25/2018 03:44 AM, 'Carl Mastrangelo' via mechanical-sympathy wrote:
> Consider the following code:
> 
> public class Test {
>   static volatile Integer discard;
>   public static void main(String [] args) throws InterruptedException {
>     printMemory();
>     System.gc();
>     printMemory();
>     int iterations = 1000;
>     int[] vals = new int[100_000_000];
>     while (args.length == 0) {
>       printMemory();
>       System.gc();
>       Thread.sleep(200);
>       discard = iterations++;
>     }
>   }
> 
>   private static void printMemory() {
>     System.out.println(Runtime.getRuntime().totalMemory() - 
> Runtime.getRuntime().freeMemory());
>   }
> }
> 
> I am surprised to see the memory used by this code starts at about 200MB, 
> goes up to 600MB, but
> never seems to go back down.  The large int array accounts for the memory 
> usage jump, but it never
> seems to be garbage collected.  Why?   The variable is never read after it is 
> allocated.  It cannot
> be reordered by the compiler to be after the while loop, because I can see 
> the memory jump.  I am
> intentionally allocating memory in the loop by boxing the integer.  Enabling 
> +PrintCompilation and
> PrintInlining never shows main() being inlined; perhaps it is not being 
> compiled?

This is not escape analysis. It is more about the compiler being able to figure out the
reachability of a local variable, and letting the GC act on that:
  https://shipilev.net/jvm-anatomy-park/8-local-var-reachability/

It is predicated on the method actually being compiled and optimized accordingly. OSR would not cut
it here, I think, because the OSR version of the method would still treat that variable as alive.

$ java Test
10548688
260648
410809368
410673344
400260544
400260544
400260544
400260544
400260544
400260544

It changes if you compile the method before entering it:

$ java -Xcomp Test
73841024
296240
421393656
10580272
10580608
10580608
10580608
10580608
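
If you want the array reclaimed regardless of compilation level, making the unreachability explicit
should work (hedged suggestion, untested here):

  int[] vals = new int[100_000_000];
  // ... last use of vals ...
  vals = null;  // no frame, interpreted or compiled, holds the reference any longer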

Thanks,
-Aleksey



Re: JVM detection of thread at safepoint

2017-12-05 Thread Aleksey Shipilev
On 12/05/2017 09:26 AM, Mark Price wrote:
> I'm investigating some long time-to-safepoint pauses in oracle/openjdk. The 
> application in question
> is also suffering from some fairly nasty I/O problems where latency-sensitive 
> threads are being
> descheduled in uninterruptible sleep state due to needing a file-system lock.
> 
> My question: can the JVM detect that a thread is in signal/interrupt-handler 
> code and thus treat it
> as though it is at a safepoint (as I believe happens when a thread is in 
> native code via a JNI call)?
> 
> For instance, given the stack trace below, will the JVM need to wait for the 
> thread to be scheduled
> back on to CPU in order to come to a safepoint, or will it be treated as 
> "in-native"?
> 
>         7fff81714cd9 __schedule ([kernel.kallsyms])
>         7fff817151e5 schedule ([kernel.kallsyms])
>         7fff81717a4b rwsem_down_write_failed ([kernel.kallsyms])
>         7fff813556e7 call_rwsem_down_write_failed ([kernel.kallsyms])
>         7fff817172ad down_write ([kernel.kallsyms])
>         7fffa0403dcf xfs_ilock ([kernel.kallsyms])
>         7fffa04018fe xfs_vn_update_time ([kernel.kallsyms])
>         7fff8122cc5d file_update_time ([kernel.kallsyms])
>         7fffa03f7183 xfs_filemap_page_mkwrite ([kernel.kallsyms])
>         7fff811ba935 do_page_mkwrite ([kernel.kallsyms])
>         7fff811bda74 handle_pte_fault ([kernel.kallsyms])
>         7fff811c041b handle_mm_fault ([kernel.kallsyms])
>         7fff8106adbe __do_page_fault ([kernel.kallsyms])
>         7fff8106b0c0 do_page_fault ([kernel.kallsyms])
>         7fff8171af48 page_fault ([kernel.kallsyms])
>      java stack trace ends here 

I am pretty sure an out-of-band page fault in a Java thread does not yield a safepoint, if only
because safepoint polls happen at known locations in the generated code: we need the pointer map as
part of the machine state, and Hotspot generates that (only) around the safepoint polls. Page
faulting on random read/write instructions does not have that luxury. Even if the JVM intercepted
that fault, there is not enough metadata to work on.

The stacktrace above seems to say you have page faulted and this incurred disk 
I/O? This is
swapping, I think, and all performance bets are off at that point.

Thanks,
-Aleksey



Re: allocation memory in Java

2017-11-20 Thread Aleksey Shipilev
On 11/18/2017 01:45 PM, John Hening wrote:
> Thanks for your replies :)
> 
> @Aleksey, how to get that?
> 
> - 17.12%  0.00%  org.openjdk.All  perf-31615.map  0x7faaa3b2d125
>    - 16.59% OptoRuntime::new_instance_C
>       - 11.49% InstanceKlass::allocate_instance
>            2.33% BlahBlahBlahCollectedHeap::mem_allocate   <--- points to GC
>            0.35% AllocTracer::send_allocation_outside_tlab_event
> 
> I mean, how do I get such a stacktrace with profiling information? I see perf-31615.map, which
> points to perf.

With perf. OpenJDK might need to be configured with debug symbols: 
--with-native-debug-symbols=internal

-Aleksey





Re: allocation memory in Java

2017-11-18 Thread Aleksey Shipilev
On 11/17/2017 11:46 PM, John Hening wrote:
> To my eye, common objects in Java, like
> 
>     new String();
> 
> are CHeapObj. I looked at an implementation of the new operator in that class, and it looks like:

In related news, there is javaClasses.cpp, and therefore the Java class library is written in C++! /s

> No, I have no idea why it is said that allocation in Java is very fast -- that it is faster than
> C/C++.

Start from here:
  https://shipilev.net/jvm-anatomy-park/4-tlab-allocation/
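
The gist of that article, as a sketch: the common case is a bump of a thread-local pointer -- no
locks, no free-list searches (names illustrative):

  class Tlab {
      long top;   // next free byte in this thread's buffer
      long end;   // buffer limit

      long allocate(long size) {
          long obj = top;
          if (obj + size > end) {
              return slowPath(size);  // rare: request a new TLAB, maybe GC
          }
          top = obj + size;           // the entire fast path: one bump
          return obj;
      }

      long slowPath(long size) { /* refill TLAB from the shared heap */ return 0; }
  }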

Thanks,
-Aleksey





Re: Mystery: Why does the JVM show more latency for the same block of code after a busy spin pause?

2017-04-29 Thread Aleksey Shipilev
On 04/29/2017 07:12 PM, Aleksey Shipilev wrote:
> On 04/29/2017 06:56 PM, Aleksey Shipilev wrote:
>> On 04/29/2017 04:56 PM, J Crawford wrote:
>>> The mystery boils down to: 
>>>
>>> "/The exact same block of code becomes slower after a busy spin pause./"
>>>
>>> I posted a short source code that *unequivocally* proves and demonstrates 
>>> the
>>> problem: 
>>> http://stackoverflow.com/questions/43696948/why-does-the-jvm-show-more-latency-for-the-same-block-of-code-after-a-busy-spin
>>
>> I don't think this is much of a mystery, because wakeup from a long sleep causes a gradual
>> transition from low power/frequency/scheduling states to higher ones.
>>
>> This is clearly visible if you do several consecutive operations after the
>> sleep, e.g. by adding this:
>>
>>     while (count < results.length) {
>> +       if ((count & 7) > 0) interval = 0;
>>         double x = busyPause(interval);
> 
> Ah, I misread the test. busyPause is not powering down anything. What's 
> weirder,
> the first pair of nanoTime calls after exiting busyPause is hiccuping even
> without calculations in between.

Still flying blind for fun, trying different things:
  http://cr.openjdk.java.net/~shade/scratch/JvmPauseLatency.java

a) "Warming up" both branches in busyPause seems to help with the first hiccup. Notice that with a
small interval, one of the loop branches would be never taken, and C2 would put an uncommon trap
there, blowing up on the first access. So much for "avoiding the deoptimization"...
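
For (a), the warmup looks roughly like this (a sketch against the busyPause from the original test;
the iteration count and intervals are arbitrary):

  for (int i = 0; i < 20_000; i++) {
      // alternate short and long intervals, so C2 profiles *both* branches
      // of the interval check and plants no uncommon trap on either:
      busyPause((i & 1) == 0 ? 1 : 1_000);
  }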

b) Inlining busyPause seems to help avoid every hiccup but the first. I wonder if this is actually
a machine effect, where e.g. a branch mispredict on the loop exit, facing an unusual code shape
("ret" from the uninlined method?), makes the lag worse. When the method is inlined, only the first
trap would introduce lag, not any subsequent ones.

At this point you want to dive into the generated code, and there JMH perfasm is
your friend.

-Aleksey




Re: Concurrency question.

2017-03-03 Thread Aleksey Shipilev
On 03/03/2017 10:51 AM, Grzegorz Gierlach wrote:
> It passes on x86; however, for some reason I'm not 100% sure it is a correct implementation. Is
> there a better way of doing it?

The discussion about "correct" and "the way of doing it" is irrelevant if you don't describe what
your intent with this test is.

But mechanically, this is a Dekker test in disguise:

http://hg.openjdk.java.net/code-tools/jcstress/file/tip/tests-custom/src/main/java/org/openjdk/jcstress/tests/volatiles/DekkerTest.java

The only forbidden state is (0, 0) in plain old Dekker. It stands to reason that
gating the 0->1 transition with CAS makes (1, 1) forbidden too. So your test
should pass, modulo runtime and hardware bugs.

> @JCStressTest
> @State
> @Outcome(id = "1, 0", expect = ACCEPTABLE, desc = "first executed")
> @Outcome(id = "0, 1", expect = ACCEPTABLE, desc = "second executed")
> @Outcome(id = "1, 1", expect = FORBIDDEN, desc = "both executed!")
> @Outcome(id = "0, 0", expect = FORBIDDEN, desc = "none executed!")
> public class VolatileConcurrencyTest {
> 
>     volatile Object receiver;
>     volatile Object resolution;
> 
>     AtomicBoolean applied = new AtomicBoolean();
> 
>     @Actor
>     void actor1(IntResult2 r) {
>         receiver = new Object();
>         if (resolution != null && applied.compareAndSet(false, true)) {
>             r.r1 = 1;
>         } else {
>             r.r1 = 0;
>         }
>     }
> 
>     @Actor
>     void actor2(IntResult2 r) {
>         resolution = new Object();
>         if (receiver != null && applied.compareAndSet(false, true)) {
>             r.r2 = 1;
>         } else {
>             r.r2 = 0;
>         }
>     }
> }

-Aleksey





Re: HotSpot code cache and instruction prefetching

2017-02-03 Thread Aleksey Shipilev
On 02/03/2017 10:26 AM, Chris Newland wrote:
> Do you think the HotSpot designers took this into account, but found empirically that the simple
> algorithm is adequate (cost complexity outweighs gains, and hot methods are generally
> JIT-compiled together)?

Let's ask another question: do you have an example where that matters?

> Could there be any benefit in relocating blocks for hot call chains to match 
> the
> call pattern once the program has reached a steady state? (assuming inlining 
> has
> already succeeded as much as possible).

Well, the "hot path" is supposed to be inlined, and the critical path laid out sequentially within
the compilation unit, so it is not catastrophic.

> Since tiered compilation became the default, do you think the many (possibly
> unconnected) intermediate compilations have made prefetching worse?

...but yes, for tiered, there are versions of the code that are known to be
temporary (e.g. compilations on level 1,2,3), while the final compilation stays
around for longer (level 4). This is why Segmented Code Cache was implemented in
JDK 9: http://openjdk.java.net/jeps/197

IIRC, there were improvements on torturous workloads, and improvements in
nmethod scans.

Thanks,
-Aleksey



Re: Futex flood

2017-01-23 Thread Aleksey Shipilev
On 01/23/2017 05:42 AM, Дмитрий Пеньков wrote:
> Hello everyone. I have the following situation: multiple HotSpot JVMs with multiple GCs'
> (ParallelGC) threads on multiple CPUs. I've already seen recommendations to have a minimum of 2
> cores per JVM, but in my case it's 80 CPUs and about 200 JVMs. After working for about an hour, I
> have >95% of time in the futex_wait syscall. I tried to pin the number of parallel GC threads at
> 4, but it didn't have an effect. Maybe changing the GC can help? How to solve this situation?

$ perf record -g java ...

...and then

$ perf report

...to see where those futexes are called from.

It might be relatively benign, like the awaits from the GC worker taskqueue. But
maybe there are actual Java locks involved.

Thanks,
-Aleksey




Re: Operation Reordering

2017-01-17 Thread Aleksey Shipilev
(triggered again)

On 01/18/2017 12:33 AM, Sergey Melnikov wrote:
> mov (%rax), %rbx
> cmp %rbx, %rdx
> jxx Lxxx
> 
> But if you schedule them this way
> 
> mov (%rax), %rbx
> cmp %rbx, %rdx
> ... few instructions
> jxx Lxxx

...doesn't this give up on macro-fusion, and effectively set up a better chance of a "bottleneck in
instruction fetch/decode phases"? :)

-Aleksey



Re: Operation Reordering

2017-01-17 Thread Aleksey Shipilev
On 01/17/2017 12:55 PM, Vitaly Davidovich wrote:
> Atomicity of values isn't something I'd assume happens automatically.  Word
> tearing isn't observable from single threaded code.

On 01/17/2017 09:17 PM, Michael Barker wrote:
> That was my understanding too.  Normal load/stores on 32 bit JVMs would tear 
> 64
> bit values.  Although, I think object references are guaranteed to be written
> atomically. 

(Triggered by my pet peeve)

Please don't confuse "access atomicity" and "word tearing". Access atomicity
means reading/writing the value in full. Word tearing (at least per JLS 17.6)
means that fields are considered distinct, and the read/write of one field
cannot disturb neighbors.

Speaking of access atomicity, the upcoming value types would probably be non-access-atomic by
default too, because enforcing access atomicity for widths larger than the machine width is
painful. volatile longs/doubles had it easy on 32-bit VMs with 64-bit FPUs, compared to what
variable-length values would have to experience.
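
A jcstress-flavored illustration of an access atomicity failure (which is not word tearing):

  long v;  // plain long, no volatile

  // Actor:    v = -1L;     // on a 32-bit JVM, may issue as two 32-bit stores
  // Observer: long r = v;  // may observe a half-written value, e.g.
  //                        // 0x00000000_FFFFFFFFL or 0xFFFFFFFF_00000000L

Declaring v volatile restores access atomicity (JLS 17.7).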

Thanks,
-Aleksey




Re: Single writer counter: how expensive is a volatile read?

2016-10-30 Thread Aleksey Shipilev
On 10/30/2016 05:55 AM, Peter Veentjer wrote:
> Let me clarify.
> The discussion is around removing the volatile read in the inc method.

Ah, sorry I misinterpreted the question. It usually goes the other way:
the reads vastly outnumber the writes.

Well, since the writer is single-threaded, there is no need for volatile/atomic reads/updates in
inc(). But I think it is useless to optimize the volatile reads there, because you have the
impending volatile/release store, which would dominate the cost. So, with that in mind, the better
strategy might be to dispense with the atomic op on every inc(): the excess branch might be much
less hassle than a volatile/release store.

public class Counter {
    private long local;
    private volatile long publish;

    public void inc() {
        long n = local++;
        if ((n & 0xFF) == 0) {
            publish = n;
        }
    }

    public long get() {
        return publish;
    }
}

...with VarHandles, you can even mix and match the access modes:

public class Counter {
    // VarHandle for "local"; the usual lookup would be
    // MethodHandles.lookup().findVarHandle(Counter.class, "local", long.class)
    static final VarHandle VH = ...;

    private long local;

    public void inc() {
        long n = (long) VH.get(this);
        n++;
        if ((n & 0xFF) == 0) {
            VH.set{Volatile,Release,Opaque}(this, n); // publish every 256th
        } else {
            VH.set(this, n);                          // plain store otherwise
        }
    }

    public long get() {
        return (long) VH.get{Volatile,Acquire,Opaque}(this);
    }
}

...with some choice of the modes.

>  A counter could be incremented >100k times a second and we don't want a
>  counter to cause too much of a performance degradation. A counter will
>  always be incremented by the same thread.

There is a subtle difference between "a single writer to a Counter" and "each thread gets its own
Counter". I would venture to guess that (false) sharing should be the concern in your case, not the
cost of volatile ops.

Having a referenced AtomicLong instead of a padded-out "local" field would set you up for false
sharing very nicely :)
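
If false sharing is the concern, padding the writer-private state is the usual remedy. A sketch
with @Contended, which needs -XX:-RestrictContended for non-JDK classes (on JDK 8 the annotation
lives in sun.misc):

  import jdk.internal.vm.annotation.Contended;

  public class Counter {
      @Contended                 // isolate the hot field on its own cache line(s)
      private long local;
      private volatile long publish;
      // ... inc()/get() as above ...
  }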

Thanks,
-Aleksey



Re: Single writer counter: how expensive is a volatile read?

2016-10-29 Thread Aleksey Shipilev
On 10/29/2016 10:13 AM, Peter Veentjer wrote:
> So you get something like this:
> 
> public class Counter {
> 
>     private final AtomicLong c = new AtomicLong();
> 
>     public void inc() {
>         long newValue = c.get() + 1;
>         c.lazySet(newValue);
>     }
> 
>     public long get() {
>         return c.get();
>     }
> }

This does not work. With non-atomic updates, you set yourself up for losing a virtually unbounded
number of updates. See e.g.:
https://shipilev.net/blog/2016/close-encounters-of-jmm-kind/#pitfall-no-sync
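
For example, with two threads running that inc():

  T1: c.get() == 100           T2: c.get() == 100
  T1: c.lazySet(101)           T2: (descheduled before its lazySet)
  T1: ...continues, c reaches 10_000...
                               T2: c.lazySet(101)  // 9,900 increments gone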

> But in this case you still need to do a volatile read to read the actual
> value. In theory one could optimize this to:
> 
> public class Counter {
> 
>     private final AtomicLong c = new AtomicLong();
>     private long local;
> 
>     public void inc() {
>         long newValue = local + 1;
>         this.local = newValue;
>         c.lazySet(newValue);
>     }
> 
>     public long get() {
>         return c.get();
>     }
> }

I guess get() was meant to have "return local"? If so, this does not work. With no ordering implied
for the reads, you set yourself up for this trouble:

Counter c = ...;
while (c.get() < 100); // wait for it... forever!

Because optimizers are free to hoist your read out of the loop, reading the counter once and
reducing the predicated loop to an infinite one. I would *guess* the fastest (not sanest, mind you)
way out of this is doing VarHandles:

public class Counter {
    // VarHandle for "counter"; the usual lookup would be
    // MethodHandles.lookup().findVarHandle(Counter.class, "counter", long.class)
    static final VarHandle VH = ...;

    long counter;

    public void inc() {
        VH.getAndAdd(this, 1);
    }

    public long get() {
        return (long) VH.getOpaque(this);
    }
}

Still, removing volatiles from get() breaks whatever happens-before guarantees the counter gave
you. Which means, for example, you cannot use it to "version" objects:

Counter cnt = new Counter();

Thread 1:
  // assume cnt is zero
  x = 1;
  cnt.inc();

Thread 2:
  if (cnt.get() == 1) {
assert (x == 1); // oops, fails.
  }

Thanks,
-Aleksey
