Re: The "5-level Page Table" Default in Linux May Impact Throughput

2022-06-19 Thread 'Gil Tene' via mechanical-sympathy
Yes yes yes. I want LAM. I've wanted it (and have been asking for it or 
something like it on x86) for over 14 years. And *when* LAM becomes a real 
thing we can run on and manipulate from user mode, I'll absolutely prefer 
using it to what we do today with VM multi-mapping, for many many reasons. 
Including (but not limited to) not having to deal with vma count limits, 
virtual memory fragmentation (and "innovative" ways to cap it and 
defragment it), cost of creating/changing mappings, etc. etc., and not 
having to explain to people that Linux's RSS is one of those words they use 
a lot, and that I do not think it means what they think it means (but that 
e.g. cgroup memory use is a thing that tracks what they think RSS means but 
doesn't), etc. etc. 

And when LAM is available, I'd much prefer 4-level page tables to 5-level, 
because it would give me 9 more LAM-provided metadata bits to play with...

But until LAM becomes a thing (in hardware lots of people run on, with 
Linux kernel and glibc support in Linux distros people actually run their 
OS images and container hosts on, etc.), 5-level page tables currently add 
speed (when dealing with >256GB of heap) on modern JVMs running on actual 
production x86 servers. And I think there may be other (non-Java) workloads 
for which the same is true. So I personally like 5-level page tables being 
on by default (on hardware that supports them) and prefer that people not 
turn them off unless their OS images are provisioned with physical memory 
small enough that they have no value even there. I.e. for physical machines 
or VM images with <256GB of RAM, turning off 5-level page tables makes 
perfect sense. But above that, I like them on better than I like them off.

BTW, several other processors have (and had) this masking or "don't 
careness" stuff that LAM will finally be bringing to x86, and on those 
processors, we have (and had) better options than doing more than 2-way VM 
multi-mapping. But since the vast majority of the world runs on x86, we do 
what we must to make things work there, and that makes us do some funny 
stuff with virtual memory, and some a-little-slower funnier stuff (with 4 
level page tables) when heap sizes go above sizes that, even when not 
super-common, are very real today. 5-level page tables move the line where 
we need the a-little-slower funnier stuff out to well beyond the RAM sizes 
mortals can get their hands on today. Hopefully LAM arrives before 
real-world RAM grows too much even for that.

On Saturday, June 18, 2022 at 7:58:00 PM UTC+1 Mark E. Dawson, Jr. wrote:

> Thanks, Gil, for the compliment!
>
> Regarding the VM tricks used by certain GCs out there, wouldn't that be 
> better served instead by Intel's upcoming Linear Address Masking (LAM) in 
> Sapphire Rapids, which would allow you to apply all types of tricks to the 
> upper 15 bits of unused bits on 4-level Page Table systems (only the upper 
> 7 bits on 5-level Page Table systems)? With LAM, user applications no 
> longer need to worry about masking off updates to those bits when accessing 
> memory via those pointers since the system will ignore them (i.e., the CPU 
> no longer checks for canonicality).
>
> In that case, 4-level Page Tables maintain their advantage over 5-level 
> Page Tables provided an end user does not require the 64TB or more of RAM 
> which 5-level Page Tables enable.
>
>
> On Sat, Jun 18, 2022, 10:30 AM 'Gil Tene' via mechanical-sympathy <
> mechanica...@googlegroups.com> wrote:
>
>> A good read, and very well written. 
>>
>> I'm with you in general on most/all of it except for one thing: That 64TB 
>> line. 
>>
>> I draw that same line at 256GB. Or at least at a 256GB 
>> guest/pod/container/process. There are things out there that use virtual 
>> memory multi (and "many") mapping tricks for speed, and eat up a bunch of 
>> that seemingly plentiful 47 bit virtual user space in the process. The ones 
>> I know the most about (because I am mostly to blame for them) are high 
>> throughput concurrent GC mechanisms. Those have a multitude of 
>> implementation variants, all of which encode 
>> phases/generations/spaces/colors in higher-order virtual bits and use 
>> multi-mapping to efficiently recycle physical memory.
>>
>> For a concrete example, when running on a vanilla linux kernel with 
>> 4-level page tables, the current C4 collector in the Prime JVM (formerly 
>> known as "Zing"), uses different phase encoding, and different LVB barrier 
>> instruction encodings, depending on whether the heap size is above or below 
>> 256GB. Below 256GB, C4 gets to use sparse phase and generation encodings 
>> (using up 6 bits of virtual space) and a faster LVB test (test), and 
>> above 256GB, it uses denser enc

Re: The "5-level Page Table" Default in Linux May Impact Throughput

2022-06-19 Thread 'Gil Tene' via mechanical-sympathy
Yes yes yes. I want LAM. I've wanted it (and have been asking for it or 
something like it on x86) for over 14 years. And *when* LAM becomes a real 
thing we can run on and manipulate form user mode, I'll absolutely prefer 
using it to what we do today with VM multi-mapping, for many many reasons. 
Including (but not limited to) not having to deal with vma count limits, 
virtual memory fragmentation (and "innovative" ways to cap it and 
defragment it), cost of creating/changing mappings, etc. etc., and not 
having to explain to people that Linux's RSS is one of those words they use 
a lot, and that I do not think it mean what they think it means (but that 
e.g. cgroup memory use is a thing tracks what they think RSS means but 
doesn't), etc. etc. 

And when LAM is available, I''d much prefer 4-level page table to 5-level, 
because it would give me 9 more LAM-provided metadata bits to play with...

But until LAM becomes a thing (in hardware lots of people run on, with 
Linux kernel and glibc support in Linux distros people actually run their 
OS images and container hosts on, etc.), 5-level page tables currently add 
speed (when dealing with >256GB of heap) on modern JVMs running on actual 
production x86 servers. And I think there may be other (non-Java) workloads 
for which the same is true. So I personally like 5 level page tables being 
on by default (on hardware that supports them) and prefer that people not 
turn them off unless their OS images are provisioned with physical memory 
small enough that they have no value even there. I.e. for physical machines 
or VM images with <256GB or RAM, turning off level 5 page tables makes 
perfect sense. But above that, I like them on better than I like them off.

BTW, several other processors have (and had) this masking or "don't 
careness" stuff that LAM will finally be bringing to x86, and on those 
processors, we have (and had) better options than doing more than 2-way VM 
multi-mapping. But since the vast majority of the world runs on x86, we do 
what we must to makes things work there, and that makes us do some funny 
stuff with virtual memory, and some a-little-slower funnier stuff (with 4 
level page tables) when heap sizes go above sizes that, even when not 
super-common, are very real today. 5-level page tables move the line where 
we need the  a-little-slower funnier stuff out ot well beyond the RAM sizes 
mortals can get their hands on today. Hopefully LAM arrives before 
real-world RAM grows too much even for that.
 
On Saturday, June 18, 2022 at 7:58:00 PM UTC+1 Mark E. Dawson, Jr. wrote:

> Thanks, Gil, for the compliment!
>
> Regarding the VM tricks used by certain GCs out there, wouldn't that be 
> better served instead by Intel's upcoming Linear Address Masking (LAM) in 
> Sapphire Rapids, which would allow you to apply all types of tricks to the 
> upper 15 bits of unused bits on 4-level Page Table systems (only the upper 
> 7 bits on 5-level Page Table systems)? With LAM, user applications no 
> longer need to worry about masking off updates to those bits when accessing 
> memory via those pointers since the system will ignore them (i.e., the CPU 
> no longer checks for canonicality).
>
> In that case, 4-level Page Tables maintain their advantage over 5-level 
> Page Tables provided an end user does not require the 64TB or more of RAM 
> which 5-level Page Tables enable.
>
>
> On Sat, Jun 18, 2022, 10:30 AM 'Gil Tene' via mechanical-sympathy <
> mechanica...@googlegroups.com> wrote:
>
>> A good read, and very well written. 
>>
>> I'm with you in general on most/all of it except for one thing: That 64TB 
>> line. 
>>
>> I draw that same line at 256GB. Or at least at a 256GB 
>> guest/pod/container/process. There are things out there that use virtual 
>> memory multi (and "many") mapping tricks for speed, and eat up a bunch of 
>> that seemingly plentiful 47 bit virtual user space in the process. The ones 
>> I know the most about (because I am mostly to blame for them) are high 
>> throughput concurrent GC mechanisms. Those have a multitude of 
>> implementation variants, all of which encode 
>> phases/generations/spaces/colors in higher-order virtual bits and use 
>> multi-mapping to efficiently recycle physical memory.
>>
>> For a concrete example, when running on a vanilla linux kernel with 
>> 4-level page tables, the current C4 collector in the Prime JVM (formerly 
>> known as "Zing"), uses different phase encoding, and different LVB barrier 
>> instruction encodings, depending on whether the heap size is above or below 
>> 256GB. Below 256GB, C4 gets to use sparse phase and generation encodings 
>> (using up 6 bits of virtual space) and a faster LVB test (test), and 
>> above 256GB, it uses denser enc

Re: The "5-level Page Table" Default in Linux May Impact Throughput

2022-06-18 Thread 'Gil Tene' via mechanical-sympathy
A good read, and very well written. 

I'm with you in general on most/all of it except for one thing: That 64TB 
line. 

I draw that same line at 256GB. Or at least at a 256GB 
guest/pod/container/process. There are things out there that use virtual 
memory multi (and "many") mapping tricks for speed, and eat up a bunch of 
that seemingly plentiful 47 bit virtual user space in the process. The ones 
I know the most about (because I am mostly to blame for them) are high 
throughput concurrent GC mechanisms. Those have a multitude of 
implementation variants, all of which encode 
phases/generations/spaces/colors in higher-order virtual bits and use 
multi-mapping to efficiently recycle physical memory.

For a concrete example, when running on a vanilla linux kernel with 4-level 
page tables, the current C4 collector in the Prime JVM (formerly known as 
"Zing"), uses different phase encoding, and different LVB barrier 
instruction encodings, depending on whether the heap size is above or below 
256GB. Below 256GB, C4 gets to use sparse phase and generation encodings 
(using up 6 bits of virtual space) and a faster LVB test (test), and 
above 256GB, it uses denser encodings (using up only 3 bits) with slightly 
more expensive LVBs (test a bit in a mask & jmp). With 5-level page tables (on 
hardware that supports them) we can move that line out by 512x, which means 
that even many-TB Java heaps can use the same cheap LVB tests that the 
smaller ones do.
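
To make the "cheaper vs. slightly more expensive" distinction concrete, here is 
a rough sketch of the shape of the two kinds of reference checks (illustrative 
only, not the actual C4 code; the class name, bit positions, and constants 
below are made up):

final class LvbCheckSketch {
    // Sparse encoding (below the line): the expected phase/generation state
    // is a dedicated bit, so the check is a single test against that bit.
    static final long EXPECTED_PHASE_BIT = 1L << 50;

    // Dense encoding (above the line): the state is a packed multi-bit field,
    // so the check must mask the field out and then compare it.
    static final long PHASE_FIELD_MASK = 0x7L << 48;
    static final long EXPECTED_PHASE   = 0x5L << 48;

    // Cheaper check: one test of the expected bit, then a jump.
    static boolean sparseCheck(long ref) {
        return (ref & EXPECTED_PHASE_BIT) != 0;
    }

    // Slightly more expensive check: test bits under a mask, then a jump.
    static boolean denseCheck(long ref) {
        return (ref & PHASE_FIELD_MASK) == EXPECTED_PHASE;
    }
}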

I expect the (exact) same considerations will be true for ZGC in OpenJDK 
(once it adds a generational mode to be able to keep up with high 
throughput allocations and large live sets), as ZGC's virtual space 
encoding needs and resulting LVB test instruction encodings are identical 
to C4's.

So I'd say you can safely turn off 5-level tables on machines that physically 
have less than 256GB of memory, or on machines that are known to not run 
Java (now or in the future), or some other in-memory application technology 
that uses virtual memory tricks at scale. But above 256GB, I'd keep it 
on, especially if the thing is e.g. a Kubernetes node that may want to run 
some cool Java workload tomorrow with the best speed and efficiency.
On Friday, June 17, 2022 at 7:06:42 PM UTC+2 Mark E. Dawson, Jr. wrote:

> In the article below, I address just *one* of the areas where this new 
> default Linux build option in most recent distros can adversely impact 
> multithreaded performance - page fault handling. But any workload which 
> requires the kernel to mimic MMU page table walking to accomplish a task 
> could be impacted adversely, as well (e.g., pipe communication). You'd do 
> well to do your own testing:
>
> https://www.jabperf.com/5-level-vs-4-level-page-tables-does-it-matter/
>
> *NOTE*: 5-level Page Table can be disabled with "no5lvl" on the kernel 
> boot command line.
>



Re: Resource to learn JIT compiler

2022-04-29 Thread 'Gil Tene' via mechanical-sympathy
There are some recorded talks of mine under the title “Java at Speed” 
that cover some JIT things common across. For 
example: https://youtu.be/ot3PESmNXhE

HTH,

— Gil.
On Thursday, April 28, 2022 at 10:30:13 PM UTC-10 gros...@gmail.com wrote:

> Hello,
> do you know good resources to learn a bit about JIT compiler in Java? 
> (Beyond reading JDK sources ;) )
>



Re: Resource to learn JIT compiler

2022-04-29 Thread 'Gil Tene' via mechanical-sympathy
There are some recorded talks of mine under the title “Java at Speed” 
that cover some JIT things common across. For 
example: https://www.infoq.com/presentations/java-jvm-perf/

HTH,

— Gil.
On Thursday, April 28, 2022 at 10:30:13 PM UTC-10 gros...@gmail.com wrote:

> Hello,
> do you know good resources to learn a bit about JIT compiler in Java? 
> (Beyond reading JDK sources ;) )
>



Re: Megamorphic virtual call optimization in Java

2022-02-05 Thread 'Gil Tene' via mechanical-sympathy
You might want to give it a spin on Prime (Zing) 11 and see what you get: 
the Falcon JIT in Prime will do polymorphic guarded in-line caching for up 
to 6 implementers by default, I believe (and it is configurable to higher if 
needed). This is exactly the sort of thing that capability is meant for.

On Saturday, February 5, 2022 at 1:35:31 AM UTC-10 gros...@gmail.com wrote:

> JVM 11+ (OpenJDK / Zulu)
>
> sobota, 5 lutego 2022 o 12:30:38 UTC+1 gregor...@gmail.com napisał(a):
>
>> which jvm?
>>
>> On Sat, Feb 5, 2022 at 6:26 AM r r  wrote:
>>
>>> Hello,
>>> we know that there are some techniques that make virtual calls not so 
>>> expensive in JVM like Inline Cache or Polymorphic Inline Cache. 
>>>
>>> Let's consider the following situation:
>>>
>>> Base is an interface. 
>>>
>>> public void f(Base[] b) {
>>> for(int i = 0; i < b.length; i++) {
>>>   b[i].m();   
>>> }
>>> }
>>>
>>> I see from my profiler that calling virtual (interface) method m is 
>>> relatively expensive.
>>> f is on the hot path and it was compiled to machine code (C2) but I see 
>>> that call to m is a real virtual call. It means that it was not 
>>> optimised by JVM. 
>>>
>>> The question is, how to deal with a such situation? Obviously, I cannot 
>>> make the method m not virtual here because it requires a serious 
>>> redesign. 
>>>
>>> Can I do anything or I have to accept it? I was thinking how to "force" 
>>> or "convince" a JVM to 
>>>
>>> 1. use polymorphic inline cache here - the number of different types in 
>>> b is quite low - between 4-5 types.
>>> 2. to unroll this loop - length of b is also relatively small. After an 
>>> unroll it is possible that Inline Cache will be helpful here.
>>>
>>> Thanks in advance for any advices.
>>> Regards,
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "mechanical-sympathy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to mechanical-symp...@googlegroups.com.
>>> To view this discussion on the web, visit 
>>> https://groups.google.com/d/msgid/mechanical-sympathy/9b52b34e-6388-4cba-b89e-ce7521109cban%40googlegroups.com
>>>  
>>> 
>>> .
>>>
>>
>>
>> -- 
>> Studying for the Turing test
>>
>



Re: Reliably allocating large arrays

2020-10-02 Thread Gil Tene
Your stated working assumption (that while you have enough free memory, you 
don't have enough contiguous free memory)
is wrong. As a result, there is no need to seek a solution for that 
problem, because that is not the problem. At least on all production JVMs 
in the last 22 years or so, JVMs (through their garbage collectors) compact 
memory and create contiguous memory regions. Without doing that, most Java 
applications would generally stop working after running for a while.

Can you demonstrate (with specific output and numbers) your starting 
assertion? Show the situation in which you are getting an OOME under 
conditions where you think it should not happen, and how you are 
determining the available memory.
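
For example, something as small as the following sketch (class name and output 
format are arbitrary, not from your setup) would capture the relevant numbers 
at the moment the allocation actually fails:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class AllocOomeDemo {
    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);          // requested array length
        Runtime rt = Runtime.getRuntime();
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        try {
            byte[] b = new byte[n];
            System.out.println("allocated " + b.length + " bytes");
        } catch (OutOfMemoryError e) {
            // Report what the JVM itself thinks is available at failure time.
            System.out.printf("OOME for %d bytes: max=%d total=%d free=%d heapUsed=%d%n",
                    n, rt.maxMemory(), rt.totalMemory(), rt.freeMemory(),
                    mem.getHeapMemoryUsage().getUsed());
        }
    }
}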

In general the criterion for a JVM's OOME is not that you have no more empty 
bytes left in the heap. It is that you are thrashing so badly that “you’d 
probably rather be dead”. This is a subjective criterion which is often 
configurable via flags. There are other reasons for throwing OOME (like 
running out of non-Java-heap types of memory), but “running low enough 
on heap memory that the JVM is thrashing badly” is a common reason and the 
likely one in your case. 

A significant contributing portion of garbage collector efficiency (in the 
vast majority of GC algorithms used in real world VMs) is generally linear 
to 1/portion-of-heap-that-is-empty. E.g. regardless of GC algorithm choice, 
if you are repeatedly allocating (and forgetting at the same rate) 64 byte 
objects in a 1GB heap that has only 256 bytes empty (not occupied by 
currently reachable objects) on a steady basis, you would need to run a 
complete GC cycle on every 4th allocation, and that GC cycle would have to 
examine the entire 1GB of heap each time to find the empty 256 bytes and 
potentially compact the heap to make them contiguous. That would be so much 
work per allocation that you would never want the JVM to continue running 
under that condition (and no current JVM will AFAIK). On the other hand the 
exact same application and load would run very efficiently when there was 
more empty memory in the heap (improving semi-linearly as 
1/portion-of-heap-that-is-empty 
grows). Note that portion-of-heap-that-is-empty here refers to the portion 
of the heap that is not occupied by live, reachable objects (not the much 
smaller portion that may be currently unused until the GC gets rid of 
unreachable objects).

E.g. some (stop the world) collectors will throw an OOME when too much 
time (e.g. 98%) of a time interval (e.g. no less than 1 minute) during 
which multiple (e.g. at least N) full GC cycles have run was spent in 
stop-the-world GC. This gets more intricate with concurrent collectors, 
but the principle is the same.

IMO, the most likely explanation (given the available data) is that your 
heap is not large enough to continue to run your application given its live 
set and load, and that increasing the heap will resolve your problem 
(assuming your application doesn’t then react by increasing the live set to 
fill the heap up too far again). If this explanation applies, then OOME is 
a wholly appropriate indication of your situation.

HTH,

— Gil.

On Thursday, October 1, 2020 at 8:43:59 AM UTC-10 Shevek wrote:

> When I do new byte[N], I get OutOfMemoryError, despite that the VM 
> claims to have more than enough free space (according to MemoryMXBean, 
> Runtime.freeMemory, visualvm, etc).
>
> My working assumption is that while I have enough free memory, I don't 
> have enough contiguous free memory. Is there a solution to this? Will I 
> get better results from any of:
>
> * ByteBuffer.allocateDirect() - presumably yes, but has other issues 
> relating to overall memory usage on the system
> * G1GC (or other GC which allocates (relocatable?) regions) - this is a 
> deep hole I haven't yet explored.
> * Calling System.gc() before allocating a contiguous region [apparently 
> doesn't help].
> * Other?
>
> If we do follow a strategy using allocateDirect, will we end up with the 
> same fragmentation issue in the native heap, along with committed 
> off-heap memory which we can no longer effectively use, or is the 
> off-heap memory managed in some manner which avoids this problem?
>
> Thank you.
>
> S.
>



Re: Single digit ms latency with thread per core architecture

2020-09-02 Thread Gil Tene
Peter Lawrey created some cool tooling for Java thread affinity a while 
back. Look at https://github.com/OpenHFT/Java-Thread-Affinity
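
Usage is roughly like this (a sketch from memory of that project's README; 
verify the exact API and the reserved/isolated-CPU configuration against the 
repo):

import net.openhft.affinity.AffinityLock;

public class AffinityExample {
    public static void main(String[] args) {
        AffinityLock lock = AffinityLock.acquireLock();  // bind this thread to a CPU
        try {
            // ... latency-critical, thread-per-core style work runs pinned here ...
        } finally {
            lock.release();                              // free the CPU reservation
        }
    }
}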

On Wednesday, September 2, 2020 at 9:03:28 PM UTC-7 masquera...@gmail.com 
wrote:

> Hi there,
>
> Once in a while on twitter will pop-up with thread per core architecture 
> links.
>
> The only product on the market I know who does that is ScyllaDb.
> The underlaying infrastructure behind Scylla is Seastar/C++ 
> http://seastar.io/
>
> I know you can isolate cores and pin threads but was wondering if there is 
> something similar to Seastar in Java.
>
> Thank you,
> Ruslan
>
>



Re: Long unexplained time-to-safepoint (TTSP) in the JVM

2020-06-08 Thread Gil Tene
Here is an idea for a quick way to analyze this. Zing has a built-in TTSP 
profiler (since we pretty much coined the term "TTSP"). You could just 
download the (free) Zing trial version, run your application on it, see if 
the TTSP behavior is still there, and potentially get some quick clarity 
about what is causing it.

Zing will, by default, print out "Detected TTSP" warnings to stderr when it 
sees any thread experience a TTSP above a detection threshold (the warning 
threshold defaults to 1 second, but can be set with 
-XX:SafepointProfilerThresholdMS=). You don't have to wait for a real 
pause to hit you to find these. Zing runs all threads through a background 
checkpoint every 5 seconds, so code paths with long TTSPs have a hard time 
hiding from the detector. Each thread crosses the checkpoint individually 
at its next safepoint polling opportunity, with no actual pause, and no 
actual delay in execution. The only time a thread will bear a real cost and 
interact with IPC is when its TTSP is so long that it crosses the 
threshold, in which case signals are used to collect context and report it.

For an example of the kind of output you'd see when e.g. mapped file page 
faults are the thing causing long TTSPs, see the below [note the clear 
finger pointing at org.apache.lucene.store.ByteBufferIndexInput.readByte() 
as the culprit in this example].

-- Gil.

Detected TTSP issue: start: 41.768 wait: 1748.384
Dumping stack for thread 0x44001960
"elasticsearch[localhost][search][T#49]" id: 171575 prio: 5 os_prio: 0 sched
: SCHED_OTHER allowed_cpus: 0
lock_release: 43.516 last_cpu: 9 cpu_time: 25
  0 0x7fc63245f5bc SafepointProfilerBuf::record_sync_stack(JavaThread*)
  1 0x7fc6324817e5 JavaThread::poll_at_safepoint_static(JavaThread*)
  2 0x3000eae9 StubRoutines::safepoint_handler
  3 0x300dfc73 org.apache.lucene.store.DataInput.readVInt()I
  4 0x31fa91df org.apache.lucene.codecs.blocktree.
SegmentTermsEnumFrame.loadBlock()V
  5 0x31f99664 org.apache.lucene.codecs.blocktree.SegmentTermsEnum.
seekExact(Lorg/apache/lucene/util/BytesRef;)Z
  6 0x31fad584 org.apache.lucene.index.TermContext.build(Lorg/apache
/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)Lorg/apache/
lucene/index/TermContext;
  7 0x319a40f6 interpreter
  8 0x319a3cba interpreter


signal_sent: 42.768 signal_responded: 43.516 state: D wchan: 
wait_on_page_bit_killable last_cpu: 9 cpu_time: 25
  0 0x300926f2 org.apache.lucene.store.ByteBufferIndexInput.readByte
()B
  1 0x300dfacd org.apache.lucene.store.DataInput.readVInt()I
  2 0x31fa9119 org.apache.lucene.codecs.blocktree.
SegmentTermsEnumFrame.loadBlock()V
  3 0x31f99664 org.apache.lucene.codecs.blocktree.SegmentTermsEnum.
seekExact(Lorg/apache/lucene/util/BytesRef;)Z
  4 0x31fad584 org.apache.lucene.index.TermContext.build(Lorg/apache
/lucene/index/IndexReaderContext;Lorg/apache/lucene/index/Term;)Lorg/apache/
lucene/index/TermContext;
  5 0x319a40f6 interpreter
  6 0x319a3cba interpreter
  7 0x319a3cba interpreter
  8 0x319a3cba interpreter


On Thursday, June 4, 2020 at 2:07:10 PM UTC-7, Zac Blanco wrote:
>
> Hi All,
>
> I'm hoping someone might be able to point me in the right direction with 
> this or offer some ideas.
> I've done a lot of research but haven't found any other threads with 
> similar behavior. The closest
> one I could find was this 
>  
> which actually never seems like it was solved.
>
> I've been working on this issue for a few weeks. We have a JVM-based 
> server application which
> is experiencing long (anywhere from 10s to multiple minutes) pauses. Below 
> is an example of
> the safepoint statistics output:
>
>  vmop                    [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
> 298.236: no vm operation  [ 385  0  0 ]  [ 0 0 21638 1 0 ]  0
> 320.877: no vm operation  [ 336  0  0 ]  [ 0 0  1978 1 0 ]  0
> 336.859: no vm operation  [ 309  1  2 ]  [ 0 0  6497 1 0 ]  0
> 351.362: no vm operation  [ 294  0  0 ]  [ 0 0 32821 0 0 ]  0
>

Re: Why are page faults used for safe point polling instead of other signals?

2020-05-30 Thread Gil Tene
The fast-path overhead in this polling scheme (calling code at a specific 
memory location and returning from it, relying on remapping of the page the 
code is in to change its behavior from do-nothing to take-safepoint) is 
much higher than the currently-popular polling schemes of loading from a 
protect-able page or testing a bit in memory. It also has the downside of 
only working globally (no thread-specific safepointing capability like the 
one we use in Zing, or the one HotSpot introduced in recent versions).

On Friday, May 29, 2020 at 11:25:02 AM UTC-7, Steven Stewart-Gallus wrote:
>
> Behold!
>
> I don't think this actually safely works though on x86 at least.
> Pretty sure they use the virtual address for instruction caching and 
> debuggers have to do synchronisation when modifying from a different 
> address space.
>
> oh well.
>
> On Friday, May 29, 2020 at 9:46:38 AM UTC-7, Steven Stewart-Gallus wrote:
>>
>> Okay I have an idea.
>> I can't shake the idea you could do fun tricks with thread local 
>> executable pages.
>>
>> The theoretically fastest way of safepoint polling is inserting a trap 
>> instruction. But icache overheads dominate. If the icache is based on 
>> physical addresses and not virtual ones then it should be possible to remap 
>> the page without doing icache synchronization.
>> You should be able to have very fast safepoints by remapping the page.
>>
>> But I'm not sure there's a fast way to do a call/return from thread local 
>> storage. And a call/return from a constant page still might not be faster 
>> than just a load.
>>
>> TLDR:
>> Limited self modifying code without icache syncing stuff could be 
>> possible with memory management tricks as long as the icache and other 
>> stuff is based on physical addresses.
>>
>>



Re: Why are page faults used for safe point polling instead of other signals?

2020-05-30 Thread Gil Tene
HotSpot used to actually safepoint by patching the running code of threads, 
at some point ahead of where you suspended them. The notion was that this 
lets you completely avoid any polling instructions and keeps the fast path 
as fast as possible. HotSpot gave up on doing this almost 20 years 
ago, because of a myriad of annoying tail bugs, including ones that had to 
do with edge cases around how OSs deal with suspension, signal delivery, 
etc., and around the delicate challenges in patching running code safely 
and atomically. It’s not that such a scheme couldn’t be made to work (it 
actually ran in production versions for a while), it’s that it had too much 
hair on it to keep going, and ultimately was not worth it given the very 
low cost of actual polling instructions on modern OOO cpus.

On Friday, May 29, 2020 at 9:46:38 AM UTC-7, Steven Stewart-Gallus wrote:
>
> Okay I have an idea.
> I can't shake the idea you could do fun tricks with thread local 
> executable pages.
>
> The theoretically fastest way of safepoint polling is inserting a trap 
> instruction. But icache overheads dominate. If the icache is based on 
> physical addresses and not virtual ones then it should be possible to remap 
> the page without doing icache synchronization.
> You should be able to have very fast safepoints by remapping the page.
>
> But I'm not sure there's a fast way to do a call/return from thread local 
> storage. And a call/return from a constant page still might not be faster 
> than just a load.
>
> TLDR:
> Limited self modifying code without icache syncing stuff could be possible 
> with memory management tricks as long as the icache and other stuff is 
> based on physical addresses.
>
>



Re: Why are page faults used for safe point polling instead of other signals?

2020-05-18 Thread Gil Tene
This is an evolving and ever-explored field...

The "current" (and in typically used in production 8 and 11) versions of 
OpenJDK and HotSpot performs safepoint as an all-or-nothing, Stop-The-World 
(STW) event. Since the frequency of STW pauses will generally tend to be 
low (for obvious reasons, if it were high, you'd be facing much bigger 
complaints), the likelihood of a safepoint poll actually triggering will be 
VERY low (think "1 in a billion" or less for a practical order of magnitude 
feel). As such, code that accepts STW pauses tends to be optimized for trhe 
NOT triggering case, and a "blind load" from a potentially protected 
location has bee (empirically chosen as) the fastest way to perform the 
poll.

A way to look at a read from a potentially protected page is as

1) An implicit "predicted not taken" form of a check

CPUs will never predict that fault will be taken (that would be silly) so 
no branch prediction state is needed for this.

 

 

 
 


On Sunday, May 17, 2020 at 9:27:36 PM UTC-7, Steven Stewart-Gallus wrote:
>
> Hi
> As I understand it most VMs poll for safepoints by using memory management 
> tricks to page fault polling threads. A page is set aside to read from and 
> whenever a safepoint is desired the page is set to be unreadable.
>
> But can't a number of other hardware traps be used instead 
> https://wiki.osdev.org/Exceptions ?
>
> Not sure if a conditional jump to a trapping debug instruction would be 
> slow or not.
>
> Also why not read from a pointer to a page instead of reading directly? 
> That way only an atomic write to a pointer is needed instead of a whole 
> memory protection syscall.
>
> Also an atomically written boolean flag is only one of many possible types 
> of branches.
> You could have an indirect call or possibly jump and just overwrite a 
> function pointer to install your handler for example.
>
> The standard way of doing things seems pretty sensible but it's just I've 
> never actually seen a concrete comparison.
>
> Basically in pseudocode why
>
> void poll() {
> page[0];
> }
> void trap() {
> mprotect(page, 0);
> }
>
> over
>
> void poll() {
> page[0];
> }
> void trap() {
> page = 0;
> }
>
> or
>
> void poll() {
> 1 / bit;
> }
> void trap() {
> bit = 0;
> }
>
> And why
>
> void poll() {
> if (flag) dostuff();
> }
>
> over
>
> void poll() {
> f();
> }
> void trap() {
> f = handler;
> }
>
> Or
>
> void poll() {
> if (flag) __builtin_trap();
> }
>
> Probably makes sense to do safepoint polls the standard way but I've never 
> seen a detailed comparison exactly why.
>
>



Re: does call site polymorphism factor in method overrides?

2019-12-31 Thread Gil Tene
In the specific case below, of a concrete method implemented in a base 
class (abstract or not) which has no overriding methods in any loaded 
subclass, CHA will identify the fact that only a single possible target 
exists and get the speed you seek regardless of encountered polymorphism 
level at the call site (so yes, it works even in megamorphic call sites).

Getters and setters are great (and very common) examples of this.

And when CHA can prove a single target, the cost is no higher than an 
invokestatic (assuming the object reference is already known to be non-null 
for some other reason in the same frame, which tends to overwhelmingly be 
the case).
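
A tiny made-up illustration of that case (the class and method names below are 
arbitrary): Base.getValue() is concrete and has no overrides in any loaded 
subclass, so CHA can prove a single target and the call can be inlined even 
though many receiver types flow through the call site:

abstract class Base {
    private final int value;
    Base(int value) { this.value = value; }
    int getValue() { return value; }        // concrete, never overridden
}

final class Left extends Base  { Left()  { super(1); } }
final class Right extends Base { Right() { super(2); } }

public class ChaExample {
    static int sum(Base[] items) {
        int s = 0;
        for (Base b : items) {
            s += b.getValue();              // single possible target => inlinable via CHA
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new Base[] { new Left(), new Right(), new Left() }));
    }
}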

To Remi’s point, in cases that are monomorphic, guarded inlining can be so 
close to CHA in results that it’s hard to measure the difference, since the 
difference becomes a perfectly predicted cmp + jmp. When going megamorphic, 
CHA benefits (compared to guarded inlining) depend on the level of 
megamorphism and the JIT you end up using. E.g. HotSpot/OpenJDK will do 
bimorphic guarded inlining, and while the jmp is not perfectly 
predictable, it often is "very very predictable" (hardware branch 
predictors are amazing things). Falcon (in Zing) and the Graal JITs will 
both do guarded inlining on much more megamorphic cases (e.g. hexamorphic, 
septamorphic, or octamorphic guarded inlining have shown real benefits). 
But if your megamorphism level blows past the JIT's guarded inlining levels, 
you end up with a virtual call, and more importantly, lose the optimization 
benefits that come from inlining the target. In the case where there is a 
single target that CHA can identify, even that level of megamorphism will 
get the benefits of inlining. But in practice, those situations (where CHA 
is a win) are very rare these days (especially with the level of 
megamorphism that a Falcon and a Graal will put up with and still inline).

On Monday, December 30, 2019 at 8:13:53 AM UTC-8, Brian Harris wrote:
>
> Good to know Vitaly! 
>
> So a poor example then. Better example is an abstract class with a method 
> implementation that no subtypes override, yet multiple subtypes are found 
> to be the receiver of a particular call site. Should we expect a 
> monomorphic call site in that case.
>
> On Sun, Dec 29, 2019 at 12:42 PM Vitaly Davidovich  > wrote:
>
>>
>>
>> On Sun, Dec 29, 2019 at 10:22 AM Brian Harris > > wrote:
>>
>>> Hello!
>>>
>>> I was hoping to get one point of clarification about avoiding 
>>> megamorphic call sites, after reading these excellent articles:
>>>
>>>
>>> http://www.insightfullogic.com/2014/May/12/fast-and-megamorphic-what-influences-method-invoca/
>>>  
>>> https://shipilev.net/blog/2015/black-magic-method-dispatch/
>>> https://gist.github.com/djspiewak/464c11307cabc80171c90397d4ec34ef 
>>>
>>>
>>> When the runtime type of the call receiver is being observed, is it 
>>> considered whether that type actually overrides the method in question? For 
>>> example, when the method is an interface's default method that none of the 
>>> runtime call receivers override, can we expect to get a monomorphic call 
>>> site regardless of how many different receiver types are observed at 
>>> runtime, given there is only one method body to invoke?
>>>
>> In Hotspot, CHA is currently (AFAIK) disabled for default methods (
>> https://bugs.openjdk.java.net/browse/JDK-8036580).  So you have to be 
>> careful putting hot code into them.  Overriding the method in the impl and 
>> just calling super will at least restore some performance if type profiling 
>> at the callsite helps.
>>
>>>
>>> Thanks
>>> Brian
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "mechanical-sympathy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to mechanical-sympathy+unsubscr...@googlegroups.com 
>>> .
>>> To view this discussion on the web, visit 
>>> https://groups.google.com/d/msgid/mechanical-sympathy/CAFtUM9Zt%2BwJvkDix_ZEYA%2B5u6hA86VQW7d5ceEyvxetiLvq%2BfA%40mail.gmail.com
>>>  
>>> 
>>> .
>>>
>> -- 
>> Sent from my phone
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "mechanical-sympathy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to mechanical-sympathy+unsubscr...@googlegroups.com .
>> To view this discussion on the web, visit 
>> https://groups.google.com/d/msgid/mechanical-sympathy/CAHjP37EiQQKRfJvm3AVv87DiUpimxckyTVTtHRxgZDYUhfMC4g%40mail.gmail.com
>>  
>> 
>> .
>>
>


Re: JMeter and HdrHistogram Integration

2019-12-21 Thread Gil Tene


On Friday, December 20, 2019 at 1:18:12 PM UTC-8, Mark E. Dawson, Jr. wrote:
>
> So our company is evaluating a set of messaging platforms, and we're in 
> the process of defining non-functional requirements. In preparation for 
> evaluating performance, I was considering suggesting JMeter since it 
> appears to support testing messaging platforms (with several specific 
> tutorials online). However, these tutorials show the Response Time by 
> Percentile graphs from the tool, and they all appear to show evidence of 
> CO. 
>
> Does anyone know if the latest versions include support for HdrHistogram 
> either out-of-the-box or via extra configuration?
>

Yeah, JMeter is a good tool for generating load, but not a good tool for 
reporting what client-experienced latency or response time behaviors would 
be if you care about things other than averages or medians. If you want an 
“ok approximation”, you can generate load at a constant rate from JMeter 
(non of those cool rammup or think time things) so that you *know* what the 
expected intervals between logged events are supposed to be, and then take 
the detailed logs from JMeter and post process them with some 
coordinated-omission correction tooling. e.g. jHiccup’s -f flag can be used 
to ingest a stream of (timestamp, latency) tuple lines (rather than measure 
anything) and its -r parameter can then be used to control the expected 
interval. It will then produce a histogram log that is a good approximation 
corrected for coordinated omission (it is conservative: it may under 
correct, it will not over correct). (See 
https://github.com/giltene/jHiccup#using-jhiccup-to-process-latency-log-files). 
You can then plot those histogram logs using e.g. 
https://github.com/HdrHistogram/HistogramLogAnalyzer







Re: Volatile semantic for failed/noop atomic operations

2019-10-05 Thread Gil Tene
That's actually the common idiom for performant concurrent code that 
occasionally needs to do something atomic, but usually (the hot case) 
doesn't. Without the first read check, there is a good likelihood 
(depending on how the CPU and prefetchers work for CAS, but very probably) 
that you will run into serious false sharing contention on the line under 
high load on multiple cores, as the CAS will bring the line in exclusive 
and invalidate it in other caches, while the first read check will bring it 
in shared, and the invalidation will only happen if you actually need to 
change the value.
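
In code, the idiom looks roughly like this (a generic sketch, not the original 
poster's class):

import java.util.concurrent.atomic.AtomicLong;

class CheckThenCas {
    private final AtomicLong state = new AtomicLong();

    boolean tryAdvance(long expected, long next) {
        // Hot path: a read keeps the cache line in shared state and avoids
        // pulling it in exclusive when the CAS would fail anyway.
        if (state.get() != expected) {
            return false;
        }
        // Only when the check passes do we attempt the CAS, which needs
        // exclusive ownership of the line and invalidates other caches.
        return state.compareAndSet(expected, next);
    }
}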

On Saturday, October 5, 2019 at 8:41:26 AM UTC-7, Steven Stewart-Gallus 
wrote:
>
> Couldn't you do a compare and compare and swap? With VarHandles something 
> like
>
> if (ACTION.getOpaque(this) != expected) return false;
> return compareAndExchange(this, expected, newValue) == expected;
>
> Not sure I got this correct
>
> On Saturday, September 14, 2019 at 11:29:00 AM UTC-7, Vitaly Davidovich 
> wrote:
>>
>> Unlike C++, where you can specify mem ordering for failure and success 
>> separately, Java doesn’t allow that.  But, the mem ordering is the same for 
>> failure/success there.  Unfortunately it doesn’t look like the javadocs 
>> mention that, but I recall Doug Lea saying that’s the case on the 
>> concurrency-interest mailing list (btw that’s probably the more appropriate 
>> channel for this Java-centric question).
>>
>> For your case, it seems like an AtomicReference is more 
>> appropriate.  terminate() sets it, then checks the count via a volatile 
>> load (or maybe it can decrement() itself?); if zero, CAS null into the 
>> action field to take/claim the action.  decrement() likewise tries to claim 
>> the action via a CAS.  The snippet you have now would allow for concurrent 
>> action execution, which is likely unsafe/wrong.
>>
>> On Fri, Sep 13, 2019 at 3:08 AM Simone Bordet  
>> wrote:
>>
>>> Hi,
>>>
>>> I have an atomic counter that gets incremented and decremented over
>>> time (non monotonically).
>>> At a certain point, I would like to enter a termination protocol where
>>> increments are not possible anymore and I set an action to run if/when
>>> the counter reaches zero.
>>> Trivial when using synchronized/lock, but I'd like to give it a try
>>> without them.
>>>
>>> class A {
>>>   private final AtomicLong counter;
>>>   // Non-volatile
>>>   private Runnable action;
>>>
>>>   void terminate(Runnable action) {
>>> this.action = action;
>>> // Volatile write needed here for visibility.
>>> if (counter.addAndGet(0) == 0) {
>>>   action.run();
>>> }
>>>   }
>>>
>>>   void decrement() {
>>> // Volatile read required to see this.action.
>>> if (counter.decrementAndGet() == 0) {
>>>   Runnable a = this.action;
>>>   if (a != null) {
>>> a.run()
>>>   }
>>> }
>>>   }
>>> }
>>>
>>> Is addAndGet(0) a volatile write? Can the write be optimized away?
>>> Similarly (although not relevant for this particular example), a
>>> _failed_ compareAndSet() has the semantic of a volatile write even if
>>> the set part was not done because the compare part failed?
>>>
>>> Thanks!
>>>
>>> -- 
>>> Simone Bordet
>>> ---
>>> Finally, no matter how good the architecture and design are,
>>> to deliver bug-free software with optimal performance and reliability,
>>> the implementation technique must be flawless.   Victoria Livschitz
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "mechanical-sympathy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to mechanical-sympathy+unsubscr...@googlegroups.com.
>>> To view this discussion on the web, visit 
>>> https://groups.google.com/d/msgid/mechanical-sympathy/CAFWmRJ3qGJ_qqrXmAHNDZ6ro01BQwe8czHZP7b-SoZ%2BrULhJAw%40mail.gmail.com
>>> .
>>>
>> -- 
>> Sent from my phone
>>
>



Re: Confusion regarding 'mark-sweep-compact' naming

2019-08-18 Thread Gil Tene


On Friday, August 16, 2019 at 1:30:10 AM UTC-7, Aleksey Shipilev wrote:
>
> On 8/16/19 10:07 AM, Gil Tene wrote: 
> > Classification terms evolve as the art evolves. 
>
> What I want to emphasize here that this discussion reinvents the meaning 
> of "sweep". That definition 
> is not used in the way you describe in the sources I know of. Granted, 
> definitions drift over time, 
> but we have to be careful to separate what is the "agreed on" definition, 
> and what is whatever 
> definition we want to be the emergent one. 
>
> > On Thursday, August 15, 2019 at 1:42:23 AM UTC-7, Aleksey Shipilev 
> wrote: 
> > I am trying to understand what is your central argument here. This 
> seems to be it. Are you saying 
> > that "sweeping" is when you visit dead objects, and non-"sweeping" 
> is when you do not? 
> > 
> > Yes. I think that's a very good summary. 
>
> Would be helpful if we started from this next time around! 
>

True. Maybe this discussion will help others start from there next time. 
It's certainly helped me hone in 
on that specific summary.
 

>
> > Is walking _dead_ objects the discriminator for "sweeping" then? So 
> in your book, if we take the 
> > same Lisp2 collector, and compute-new-addresses and adjust-pointers 
> steps are walking the 
> > self-parsable heap (visiting dead objects along the way), it is 
> M-S-C? But if it uses marking 
> > bitmap 
> > (thus only visiting live objects), it becomes M-C? [This would be a 
> weird flex, but okay]. 
> > 
> > 
> > Exactly. "In my book" adding an efficient livemark bit vector with 
> possible summary layers would 
> > covert the classic LISP2 GC from a Mark-Sweep-Compact to a Mark-Compact 
> (with no sweep). 
>
> So this is what irks me here. In my Epsilon's mark-compact toy collector 
> (https://shipilev.net/jvm/diy-gc/), changing from mark bitmaps to 
> self-parsable heap traversal is 
> about 5 lines of code. This does not feel like passing the bar for 
> reclassification of the collector 
> algo between major classes.


Self-parsable heaps are expensive to traverse, and that expense grows with 
the size of the non-live
part of the heap. Being sensitive to live-occupancy % (or not) is a 
fundamental quality in
a collector.

If a "5 lines of code" change allows your collector to no longer be 
sensitive to heap size, and
stay sensitive only to the live set, it is well worth it.  And it also 
changes its classification.

E.g. sometimes changing from a linked list to an array list touches only 
one line of code, but
that change can dramatically alter the classification from a complexity and 
fit-for-use point
of view.
 

> It feels more like the implementation detail that can swing both ways, 
> depending on circumstances. 
>

ParallelGC probably wouldn't call the file that implements its
Mark-Compact (with sweep) algorithm "MarkSweep.cpp", if the authors
of that code thought of their implementation as a Mark Compact that
does not include a sweep, or if they thought of the sweeping as a
small implementation detail.

If it were just an implementation detail, there wouldn't be scores of 
academic
papers that seem to classify the "Mark Compact" (the historic kind that 
visits all
dead objects, so implies sweeping) as inherently less CPU-efficient, and 
back
that category-wide conclusion with analyses and measurements that all use
sweeping in their Mark-Compact implementations.

And there wouldn't be similar analyses that seem to classify all 
non-Mark-Compact
moving collectors (putting them into the only remaining categories of 
semi-space or 
regional evacuating) as less memory-efficient based on their assumed need
for space outside of the "in place" compaction that canonical Mark-Compact
(with implied sweeping) is known to be able to do.

These classifications (historically assuming that Mark-Compact includes
visiting all dead objects, and/or that anything else can't do in-place 
compaction)
both lead to historical conclusions that no longer apply to more modern
implementations that still clearly belong in the Mark-Compact category.

Here are two classifications of implementation-orthogonal characteristics
of any moving collector, which are fundamental to analyzing fit for use in
various common scenarios:

1. Do you visit all objects, or just the live ones?
[which I call "do you sweep?]

2. Do you need extra space in order to compact, or can you
perform the compaction within the memory previously occupied
by dead objects?
[which I don't have a great name for, but it is demonstrably not 
"do you do in-place compaction?"]

E.g. for an algorithm used in a young generation collector, which 
fundamentally 
hopes to leverage 

Re: Confusion regarding 'mark-sweep-compact' naming

2019-08-16 Thread Gil Tene
use the term 'defragmentation' to refer to the use of evacuation to free up 
space in a regional collector.

--Steve

On Friday, 16 August 2019 18:30:10 UTC+10, Aleksey Shipilev wrote:
On 8/16/19 10:07 AM, Gil Tene wrote:
> Classification terms evolve as the art evolves.

What I want to emphasize here that this discussion reinvents the meaning of 
"sweep". That definition
is not used in the way you describe in the sources I know of. Granted, 
definitions drift over time,
but we have to be careful to separate what is the "agreed on" definition, and 
what is whatever
definition we want to be the emergent one.

> On Thursday, August 15, 2019 at 1:42:23 AM UTC-7, Aleksey Shipilev wrote:
> I am trying to understand what is your central argument here. This seems 
> to be it. Are you saying
> that "sweeping" is when you visit dead objects, and non-"sweeping" is 
> when you do not?
>
> Yes. I think that's a very good summary.

Would be helpful if we started from this next time around!

> Is walking _dead_ objects the discriminator for "sweeping" then? So in 
> your book, if we take the
> same Lisp2 collector, and compute-new-addresses and adjust-pointers steps 
> are walking the
> self-parsable heap (visiting dead objects along the way), it is M-S-C? 
> But if it uses marking
> bitmap
> (thus only visiting live objects), it becomes M-C? [This would be a weird 
> flex, but okay].
>
>
> Exactly. "In my book" adding an efficient livemark bit vector with possible 
> summary layers would
> covert the classic LISP2 GC from a Mark-Sweep-Compact to a Mark-Compact (with 
> no sweep).

So this is what irks me here. In my Epsilon's mark-compact toy collector
(https://shipilev.net/jvm/diy-gc/), changing from mark bitmaps to self-parsable 
heap traversal is
about 5 lines of code. This does not feel like passing the bar for 
reclassification of the collector
algo between major classes. It feels more like the implementation detail that 
can swing both ways,
depending on circumstances.

-Aleksey





Re: Confusion regarding 'mark-sweep-compact' naming

2019-08-16 Thread Gil Tene
You make a good point about historical terminology classifying "mark/sweep" 
and
"mark/compact". But that terminology predates, and could not have 
anticipated, the
multitude of (sometimes fairly dramatic) variants within the universe of 
compacting
collectors that came in the decades that followed.

I'll remind folks that back when those terms were used to mean those 
specific
things, the term "parallel" was also used to describe what we've evolved to
calling "concurrent" collectors for the past 25 years or so.

Classification terms evolve as the art evolves.

On Thursday, August 15, 2019 at 1:42:23 AM UTC-7, Aleksey Shipilev wrote:
>
> On 8/14/19 9:21 PM, Gil Tene wrote: 
> > The fact that they visit and parse dead objects in order to identify 
> recycle-able memory is what makes them 
> > sweepers. Other techniques that don't visit dead objects do not perform 
> a sweep. 
>
> I am trying to understand what is your central argument here. This seems 
> to be it. Are you saying 
> that "sweeping" is when you visit dead objects, and non-"sweeping" is when 
> you do not? 
>

Yes. I think that's a very good summary.

To draw a specific line: If you end up visiting each individual dead 
object, you are sweeping.
If you only visit each "hole" once (which may comprise of 1 or more dead 
objects, and covers
everything between one line object and the next), the number of visits you 
make will be no
larger than the number of live objects, and you would not be sweeping.
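
As a toy illustration of that line (simplified, not taken from any real 
collector): a sweep parses the heap object by object and therefore touches 
every dead object, while a mark-bitmap walk jumps from one set mark bit to 
the next and never visits a dead object.

import java.util.BitSet;

final class SweepVsLiveWalk {
    // "Sweep": linear walk of a self-parsable heap. Cost grows with the
    // total number of objects, dead ones included.
    static long sweepVisits(int totalObjects) {
        long visited = 0;
        for (int i = 0; i < totalObjects; i++) {
            visited++;              // every object header parsed, live or dead
        }
        return visited;
    }

    // Mark-bitmap walk: cost grows only with the number of live (marked) objects.
    static long liveWalkVisits(BitSet markBits) {
        long visited = 0;
        for (int i = markBits.nextSetBit(0); i >= 0; i = markBits.nextSetBit(i + 1)) {
            visited++;              // dead objects are never visited
        }
        return visited;
    }
}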
 

>
> >> I would leave "sweep" as something that reclaims the storage of the 
> objects (i.e. "sweeps away" the 
> >> garbage), which makes M-C and M-S classes more clearly defined and 
> easier to reason about. 
> > 
> > When M-C implies "also does a full linear traversal that visits all dead 
> objects" (aka "a sweep"), that> terminology fails. Examples like C4, G1GC, 
> Shenandoah, and ZGC are all Mark-Compact (NO Sweep) 
> > collectors, while ParallelGC and Serial GC are Mark-Compact (WITH Sweep) 
> collectors. Since the math 
> > behind those varies dramatically in wasy that effect everyday choices, 
> using "M-C" as a classification 
> > that covers both leads people to misunderstand the interactions between 
> e.g. heap size and GC work. 
>
> M-C does not imply "full linear traversal that visits all dead objects". 


We agree on that.

I am saying that M-C (with sweep) does imply that, and M-C (with no sweep) 
does not.

And I am pointing out that the difference between the two is profound 
enough that
use of "M-C" with no qualification makes the following misquotation of 
Stanislaw Ulam
a good fit:

"Describing all non-copying relocating collectors as 'Mark Compact' is like 
referring to the
bulk of zoology as the study of non-elephant animals."
 

> As the counter-example, it is very practical to use off-side marking 
> bitmaps to walk only live objects of the heap. Careful design of such the 
> marking bitmap would yield dramatically better results than using whatever 
> self-parsable-heap walk. You can even make it multi-level and quite dense 
> to skip over large chunks of sparse heap. 


Yes. Both straight livemark bit vectors and livemark bit vectors with 
summaries allow you to avoid visiting individual dead objects, by efficiently 
jumping from one live object to another without ever visiting or parsing a 
dead object. And neither one would be "sweeping".


> When the marking phase is a separate phase, it stands to reason that the 
> subsequent phases would 
> have to process the marking results somehow. That would naturally do some 
> sort of walk over the data 
> structure that handles marks. If we call that walk "sweeping", then every 
> Mark-* algorithm is 
> necessarily "sweeping" too. So we have to break this somehow to make the 
> definition viable. 
>

Sweeping is not the processing of marking results, or the visiting or 
processing of the live objects. It is the visiting and processing of dead 
objects.
 

>
> Is walking _dead_ objects the discriminator for "sweeping" then? So in 
> your book, if we take the 
> same Lisp2 collector, and compute-new-addresses and adjust-pointers steps 
> are walking the 
> self-parsable heap (visiting dead objects along the way), it is M-S-C? But 
> if it uses marking bitmap 
> (thus only visiting live objects), it becomes M-C? [This would be a weird 
> flex, but okay]. 
>

Exactly. "In my book" adding an efficient livemark bit vector with possible 
summary layers would
covert the classic LISP2 GC from a Mark-Sweep-Compact to a Mark-Compact 
(with no sweep).

ParallelGC doesn't do that, and sticks with the 

Re: Confusion regarding 'mark-sweep-compact' naming

2019-08-14 Thread Gil Tene


> On Aug 14, 2019, at 10:40 AM, Aleksey Shipilev  
> wrote:
> 
> On 8/14/19 11:30 AM, Gil Tene wrote:
>> For an easy to follow code example of Mark, Sweep, and Compact, look at the 
>> (ahem...) psMarkSweep.cpp
>> code for the compacting OpenJDK ParallelGC oldgen implementation, which does 
>> 4 separate passes (
>> http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/7576bbd5a03c/src/share/vm/gc_implementation/parallelScavenge/psMarkSweep.cpp#l201
>>  ):
>> 
>> 1. A  tracing, pointer chasing pass that marks the live set.
>> Line 514: // Recursively traverse all live objects and mark them
>> 
>> 2. A sweeping pass that computes the new object addresses
>> Line 579: // Now all live objects are marked, compute the new object 
>> addresses.
>> (new addresses are computed to ”shift everything to one side”,
>>  which, when actually done in pass 4, will compact the heap in place)
>> 
>> 3. A sweeping pass to adjust references to point to the new object addresses:
>> Line 598: // Adjust the pointers to reflect the new locations
>> (this is sometimes referred to as a fixup pass)
>> 
>> 4. A compact pass that moves the objects to their previously chosen target 
>> locations.
>> Line 645: // All pointers are now adjusted, move objects accordingly
>> 
>> This is a classic Mark, Sweep, and Compact collector. Pass 1 only visits  
>> live objects, while passes
>> 2-4 linearly traverse the entire heap (live or dead), going forward one 
>> object at a time, and
>> skipping over the dead objects one by one.
> 
> I think the descriptions of steps (2) and (3) seriously stretch the 
> definition of "sweep". If we
> creatively redefine it to mean any (linear) heap walk, then most (all, except 
> for copying?)
> collectors should be marked as "sweep", which defies the purpose of the 
> definition.
> 
> This describes Lisp2-style collector, which is listed as (*drumroll*) 
> "Mark-Compact":
>  https://en.wikipedia.org/wiki/Mark-compact_algorithm#LISP2_algorithm
> 
> GC handbook mentions only "Mark-Compact" and "Mark-Sweep" styles, no 
> "Mark-Sweep-Compact".

Yup. Those parts were written before Quick Release was commonly used to make 
pure Mark/Compact (no sweep) practical and profitable (over the other two 
variants). I think that the G1GC evacuation stuff was also "too young" at the 
time of the writing to consider the implications on terminology there, as 
G1GC-style collectors (and I think Shenandoah as well, since they share the 
same SATB marker logic) tend to keep separate live marks in bits outside of 
the object header for the same avoid-the-sweep reasons, and don't visit dead 
objects when evacuating regions.

> The MM
> Glossary [https://www.memorymanagement.org/glossary/s.html#sweeping] says: 
> "Sweeping is the second
> phase (“the sweep phase”) of the mark-sweep algorithm. It performs a 
> sequential (address-order) pass
> over memory ***to recycle unmarked blocks.***" (emphasis mine)

Pass 2 above does exactly the "sequential (address-order) pass over memory to 
recycle unmarked blocks"
thing, and that makes it a classic sweep. It does a classic "find, visit, 
parse, and track all the dead stuff"
pass, and will use that information to recycle all of them (by compacting those 
holes). The way it tracks
the "free stuff" is just one of many possible ways to track the recycle-able 
holes identified in a sweep:
Pass 2 tracks them by reflecting the information about where the dead holes 
are in the newly computed
address for the next live object it finds (the "how much to shift the next live 
object's address" amount
grows by the size of each dead object encountered and parsed to determine that 
size). If you wanted to,
you could directly deduce or reconstruct the list of dead hole addresses and 
sizes by walking the newly
computed object addresses backwards. It's just that the way this "list" is 
stored is optimized for the next
passes in this collector.
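
As a deliberately simplified sketch of that pass (toy layout and names of my own, 
not the actual psMarkSweep code), the "recording" of holes as an accumulated 
shift looks like this:

// Simplified sketch of a LISP2-style "compute new addresses" pass: one forward walk
// that parses every object, dead or alive, and encodes the holes it finds as the
// accumulated shift applied to each live object's forwarding address.
static void computeNewAddresses(int[] heap, boolean[] liveBit, int[] forwardingAddr) {
    int cursor = 0;
    int shift = 0;                                    // total dead space seen so far, in words
    while (cursor < heap.length) {
        int size = heap[cursor];                      // dead objects are parsed too: a sweep
        if (liveBit[cursor]) {
            forwardingAddr[cursor] = cursor - shift;  // "shift everything to one side"
        } else {
            shift += size;                            // the hole is recorded as a larger shift
        }
        cursor += size;
    }
}

Walking forwardingAddr backwards would let you reconstruct the hole list, as noted 
above; the information found by the sweep is simply stored in the form the later 
fixup and compact passes want.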

The way different sweepers track the recycle-able holes they find varies 
dramatically (lists, trees, bucketed
lists, buddy lists, some magic encoding in other data) but the work done to 
find them is fundamentally the
same in all sweepers: they linearly parse the heap, one object at a time, 
figuring out where the next object
starts by looking at the material within the object (dead or alive) being 
visited, and store (or embed) the
information they find about where the holes are into some data structure.

The fact that they visit and parse dead objects in order to identify 
recycle-able memory is what makes them
sweepers. Other techniques that don't visit dead objects do not perform a sweep.

Re: Confusion regarding 'mark-sweep-compact' naming

2019-08-14 Thread Gil Tene

On Aug 13, 2019, at 11:34 PM, Peter Veentjer <pe...@hazelcast.com> wrote:

Thanks for your answer Aleksey,

comments inline.

On Tuesday, August 6, 2019 at 12:41:28 PM UTC+3, Aleksey Shipilev wrote:
On 8/6/19 7:38 AM, Peter Veentjer wrote:
> There is still some confusion on my side. This time it is regarding the 
> algorithmic complexity of
> mark-sweep-compact.
>
> The complexity of the mark phase is linear to the live set.
> The complexity of the sweep phase is linear to the heap size.
> The complexity of the compact phase is linear to the live set.

Well...

The multiplicative factors in either case might make one linearity perform 
better than the other on
wildly different heap sizes and live sets. The adversarial example of this is N 
large byte arrays of
size M. Compact might have to copy N*M bytes (especially when doing sliding 
compaction), and sweep
might only need to visit N locations to check on object metadata.

There are second-order heap size effects for both mark and compact. Take two 
heaps with the same
live set, one very dense and one very sparse, and the mere fact you have to 
walk memory far away
adds up the secondary effect of heap size. It would not linger for compacting 
collectors, though, as
first few compaction cycles would collapse the heap to denser variant.

That is all to say that algorithmic complexity is a seriously bad way to reason 
about (GC) performance.

I understand. I created a bunch of flashcards to get a better conceptual 
understanding of the different GC algorithms; I'll add some additional 
flashcards to refine my understanding.


> Why would you ever want to include the a sweep phase because the complexity 
> of mark-sweep-compact is
> now linear to the heap size; a lot worse compared to 'linear to the live 
> set'. Get rid of the sweep
> and complexity goes down to 'linear to the live set'. What is the added value 
> of such an inefficient
> GC implementation?

I think m-s-c nomenclature is confusing, as the actual implementations are 
doing either m-s or m-c
during their individual phases.

Exactly. And this was the source of my confusion, caused by an abuse of 
names.

There are very real and popular collectors that do Mark, Sweep, *and* Compact, 
e.g the OpenJDK oldgen collectors in both ParallelGC and SerialGC. There are 
very real and popular collectors that do only Mark and Sweep, e.g. the 
non-compacting oldgen of the CMS collector (as long as it does not fall back to 
the STW compacting fullGC). And there are very real and popular collectors that 
do only Mark and Compact (no sweep), e.g. C4, Pauseless, Compressor, and 
recently ZGC.

For an easy to follow code example of Mark, Sweep, and Compact, look at the 
(ahem...) psMarkSweep.cpp code for the compacting OpenJDK ParallelGC oldgen 
implementation, which does 4 separate passes (
http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/file/7576bbd5a03c/src/share/vm/gc_implementation/parallelScavenge/psMarkSweep.cpp#l201
 ):

1. A  tracing, pointer chasing pass that marks the live set.
Line 514: // Recursively traverse all live objects and mark them

2. A sweeping pass that computes the new object addresses
Line 579: // Now all live objects are marked, compute the new object 
addresses.
(new addresses are computed to ”shift everything to one side”,
 which, when actually done in pass 4, will compact the heap in place)

3. A sweeping pass to adjust references to point to the new object addresses:
Line 598: // Adjust the pointers to reflect the new locations
(this is sometimes referred to as a fixup pass)

4. A compact pass that moves the objects to their previously chosen target 
locations.
Line 645: // All pointers are now adjusted, move objects accordingly

This is a classic Mark, Sweep, and Compact collector. Pass 1 only visits live 
objects, while passes 2-4 linearly traverse the entire heap (live or dead), 
going forward one object at a time, and skipping over the dead objects one by 
one. For at least the first sweeping pass, when you arrive at a dead object’s 
header, you still need to follow its type indicator to determine the object 
size in order to know how far to skip ahead to the next object. So you visit 
and follow pointer information in each and every object header in the heap, 
dead or alive.

In contrast, a pure mark-compact can be built to never do a “move from one 
object to another, including the dead ones” pass. E.g. an evacuating mark 
compact can use the mark pass to produce an efficient way to move from one live 
object to the next when evacuating and doing pointer fixup.
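
For instance, with a per-heap-word live-mark bit vector of the kind described 
below, the evacuation and fixup work can enumerate the live objects of a region 
without ever touching a dead one. A minimal sketch (assumed layout and helper 
names, not actual collector code):

// Sketch: enumerate live objects in a region by scanning a live-mark bit vector
// with one bit per 64-bit heap word. For a 2MB region of 8-byte words that is
// 256K bits, i.e. 32KB of mark bits. Assumes the region starts on a 64-word
// boundary, so each markBits entry covers exactly 64 consecutive heap words.
static void forEachLiveObject(long[] markBits, int regionFirstWord, int regionWordCount,
                              java.util.function.IntConsumer visitLiveObjectAtWord) {
    for (int w = regionFirstWord; w < regionFirstWord + regionWordCount; w += 64) {
        long bits = markBits[w >>> 6];                   // mark bits for heap words w .. w+63
        while (bits != 0) {
            int bit = Long.numberOfTrailingZeros(bits);  // next live object start in this chunk
            visitLiveObjectAtWord.accept(w + bit);       // evacuate / fix up this live object
            bits &= bits - 1;                            // clear the lowest set bit
        }
        // all-dead chunks contribute no set bits, so no dead object is ever visited
    }
}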

There are various schemes for doing this efficiently without ever visiting a 
dead object. E.g. In C4 we keep live marks in a bit vector with one bit per 64 
bit word of heap (we mark in this bit vector, not in the object headers). When 
it comes time to evacuate e.g. a 2MB region, we just scan through the 
associated 32KB of memory that contains the live bits for that 2MB region, 

Re: how to replace Unsafe.objectFieldOffset in jdk 11

2019-06-18 Thread Gil Tene
Out of curiosity: What do you use Unsafe.objectFieldOffset for that 
VarHandle is lacking functionality for?

It is useful to gather use cases not covered by VarHandle so that we can 
propose additional capabilities there (preparing for a world with no Unsafe 
in it). Can you share an example of code?

As to your question: I assume it is Unsafe as a whole you are having issues 
with, and not just Unsafe.objectFieldOffset, is that correct? You can find 
examples of how to use Unsafe in a post-java-8 world e.g. 
here: http://gregluck.com/blog/archives/2017/03/using-sun-misc-unsafe-in-java-9/
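
For reference, the common Unsafe.objectFieldOffset + compareAndSwapObject pattern 
maps onto VarHandle roughly like this (a minimal sketch using the standard 
java.lang.invoke API; the Node class is just an illustration):

// Minimal sketch of replacing the Unsafe.objectFieldOffset + compareAndSwapObject
// pattern with the standard VarHandle API (Java 9+).
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class Node {
    volatile Node next;

    private static final VarHandle NEXT;
    static {
        try {
            // replaces Unsafe.objectFieldOffset(Node.class.getDeclaredField("next"))
            NEXT = MethodHandles.lookup().findVarHandle(Node.class, "next", Node.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    boolean casNext(Node expected, Node update) {
        // replaces an Unsafe.compareAndSwapObject(this, offset, expected, update) call
        return NEXT.compareAndSet(this, expected, update);
    }
}

Use cases that don't reduce to "a known field on a class you can obtain a lookup 
for" are exactly the kind of examples being asked for above.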

On Tuesday, June 18, 2019 at 5:48:14 AM UTC-7, xiaobai andrew wrote:
>
> I use Unsafe.objectFieldOffset in JDK 8, but when I want to update to JDK 
> 11, the IDE reports a compile error: it cannot find the symbol Unsafe.
> I have tried the VarHandle API, but it does not 
> support Unsafe.objectFieldOffset
>



Re: Faster System.nanotime() ?

2019-05-02 Thread Gil Tene


> On May 2, 2019, at 2:53 PM, dor laor  wrote:
> 
> On Thu, May 2, 2019 at 7:14 AM Gil Tene <g...@azul.com> wrote:
> 
> 
> Sent from my iPad
> 
> On May 1, 2019, at 1:38 PM, dor laor <dor.l...@gmail.com> wrote:
> 
>> On Wed, May 1, 2019 at 9:58 AM Gil Tene <g...@azul.com> wrote:
>> There are many ways for RDTSC to be made "wrong" (as in non-monotonic within 
>> a software thread, process, system, etc.) on systems, but AFAIK "most" 
>> modern x86-64 bare metal systems can be set up for good clean, monotonic 
>> system-wide TSC-ness. The hardware certainly has the ability to keep those 
>> TSCs in sync (enough to not have detectable non-sync effects) both within a 
>> socket and across multi-socket systems (when the hardware is "built right"). 
>> The TSCs all get reset together and move together unless interfered with...
>> 
>> Two ways I've seen this go wrong even on modern hardware include:
>> 
>> A) Some BIOSes resetting TSC on a single core or hyperthread on each socket 
>> (usually thread 0 of core 0) for some strange reason during the boot 
>> sequence. [I've conclusively shown this on some 4 socket Sandy Bridge 
>> systems.] This leads different vcores to have vastly differing TSC values, 
>> which gets bigger with every non-power-cycling reboot, with obvious negative 
>> effects and screams from anyone relying on TSC consistency for virtually any 
>> purpose.
>> 
>> B) Hypervisors virtualizing TSC. Some hypervisors (notably at least some 
>> versions of VMWare) will virtualize the TSC and "slew" the virtualized value 
>> to avoid presenting guest OSs with huge jumps in TSC values when a core was 
>> taken away for a "long" (i.e. many-msec) period of time. Instead, the 
>> virtualized TSC will incrementally move forward in small jumps until it 
>> catches up. The purpose of this appears to be to avoid triggering guest OS 
>> panics in code that watches TSC for panic-timeouts and other sanity checks 
>> (e.g. code in OS spinlocks). The effect of this "slewing" is obvious: TSC 
>> values can easily jump backward, even within a single software thread.
>> 
>> A hypervisor wouldn't take the TSC backwards, it can slow the TSC but not 
>> take it backward, unless they virtualize the cpu bits for stable tsc 
>> differently which
>> happens but I doubt VMware (and better hypervisors) take the TSC back
> 
> A hypervisor wouldn't take the TSC backwards within one vcore.
> 
> But vcores are scheduled individually, which means that any slewing done to 
> hide a long jump forward in the physical TSC in situations where a vcore was 
> not actually running on a physical core for a “long enough” period of time is 
> done individually within each vcore and its virtualized TSC. (synchronizing 
> the virtualized TSC slewing across vcores would require either synchronizing 
> their scheduling such that the entire VM would be either “on” or “off” cores 
> at the same time, or making the virtualized TSC only tick forward in large 
> quanta, or only when all vcores are actively running on physical cores,  
> all of which would cause some other dramatic strangeness).
> 
> Multiple vcores belonging to the same guest OS can (and usually will) end up 
> running simultaneously on multiple real cores, which obviously means that 
> during slewing periods they will be showing vastly differing virtualized TSC 
> values (with gaps of 10s of msec) until the “slewing” is done. All it takes 
> is a “lucky timing” context switch within the Guest OS, moving a thread from 
> one vcore to another (for whichever of the many reasons the guest OS might 
> decide to do that) for *your* program to observe the TSC “jumping backwards” 
> by 10s of msec between one RDTSC execution and another.
> 
> It's the same issue as a physical machine with multiple sockets, the tsc 
> isn't synced across those different sockets.

Except that since ~Nehalem, the hardware (if built to recommended specs) does 
keep the TSC on all cores sync'ed across all sockets. The only non-perfectly 
sync'ed TSCs I've ever seen on any modern Intel hardware were due to BIOS 
messing with them after the hardware reset that placed them all in sync had 
already happened. On those platforms where I observed out-of-sync TSCs, all 
hyper-threads except for one per socket were perfectly in sync, and the two 
hyperthreads on core 0 were out of sync with each other.

> The hypervisor keeps an offset per unscheduled vcore and makes sure it is 
> monotonic. Although we at KVM considered to
> slew/speed the TSC on vcores, primarily for live migration, we

Re: Faster System.nanotime() ?

2019-05-02 Thread Gil Tene


Sent from my iPad

On May 1, 2019, at 1:38 PM, dor laor <dor.l...@gmail.com> wrote:

On Wed, May 1, 2019 at 9:58 AM Gil Tene <g...@azul.com> wrote:
There are many ways for RDTSC to be made "wrong" (as in non-monotonic within a 
software thread, process, system, etc.) on systems, but AFAIK "most" modern 
x86-64 bare metal systems can be set up for good clean, monotonic system-wide 
TSC-ness. The hardware certainly has the ability to keep those TSCs in sync 
(enough to not have detectable non-sync effects) both within a socket and 
across multi-socket systems (when the hardware is "built right"). The TSCs all 
get reset together and move together unless interfered with...

Two ways I've seen this go wrong even on modern hardware include:

A) Some BIOSes resetting TSC on a single core or hyperthread on each socket 
(usually thread 0 of core 0) for some strange reason during the boot sequence. 
[I've conclusively shown this on some 4 socket Sandy Bridge systems.] This 
leads different vcores to have vastly differing TSC values, which gets bigger 
with every non-power-cycling reboot, with obvious negative effects and screams 
from anyone relying on TSC consistency for virtually any purpose.

B) Hypervisors virtualizing TSC. Some hypervisors (notably at least some 
versions of VMWare) will virtualize the TSC and "slew" the virtualized value to 
avoid presenting guest OSs with huge jumps in TSC values when a core was taken 
away for a "long" (i.e. many-msec) period of time. Instead, the virtualized TSC 
will incrementally move forward in small jumps until it catches up. The purpose 
of this appears to be to avoid triggering guest OS panics in code that watches 
TSC for panic-timeouts and other sanity checks (e.g. code in OS spinlocks). The 
effect of this "slewing" is obvious: TSC values can easily jump backward, even 
within a single software thread.

A hypervisor wouldn't take the TSC backwards, it can slow the TSC but not take 
it backward, unless they virtualize the cpu bits for stable tsc differently 
which
happens but I doubt VMware (and better hypervisors) take the TSC back

A hypervisor wouldn't take the TSC backwards within one vcore.

But vcores are scheduled individually, which means that any slewing done to 
hide a long jump forward in the physical TSC in situations where a vcore was 
not actually running on a physical core for a “long enough” period of time is 
done individually within each vcore and its virtualized TSC. (synchronizing the 
virtualized TSC slewing across vcores would require either synchronizing their 
scheduling such that the entire VM would be either “on” or “off” cores at the 
same time, or making the virtualized TSC only tick forward in large 
quanta, or only when all vcores are actively running on physical cores, all 
of which would cause some other dramatic strangeness).

Multiple vcores belonging to the same guest OS can (and usually will) end up 
running simultaneously on multiple real cores, which obviously means that 
during slewing periods they will be showing vastly differing virtualized TSC 
values (with gaps of 10s of msec) until the “slewing” is done. All it takes is 
a “lucky timing” context switch within the Guest OS, moving a thread from one 
vcore to another (for whichever of the many reasons the guest OS might decide 
to do that) for *your* program to observe the TSC “jumping backwards” by 10s of 
msec between one RDTSC execution and another.



The bottom line is that the TSC can be relied upon on bare metal (where there is 
no hypervisor scheduling of guest OS cores) if the system is set up right, but 
can do very wrong things otherwise. People who really care about low-cost time 
measurement (like System.nanoTime()) can control their systems to make this 
work and elect to rely on it (that's exactly what Zing's -XX:+UseRdtsc flag is 
for), but it can be dangerous to rely on it by default.

On Tuesday, April 30, 2019 at 3:07:11 AM UTC-7, Ben Evans wrote:
I'd assumed that the monotonicity of System.nanoTime() on modern
systems was due to the OS compensating, rather than any changes at the
hardware level. Is that not the case?

In particular, Rust definitely still seems to think that their
SystemTime (which looks to back directly on to a RDTSC) can be
non-monotonic: https://doc.rust-lang.org/std/time/struct.SystemTime.html

On Tue, 30 Apr 2019 at 07:50, dor laor  wrote:
>
> It might be since in the past many systems did not have a stable rdtsc and 
> thus if the instruction is executed
> on different sockets it can result in wrong answers and negative time. Today 
> most systems do have a stable tsc
> and you can verify it from userspace/java too.
> I bet it's easy to google the reason
>
> On Mon, Apr 29, 2019 at 2:36 PM 'Carl Mastrangelo' via mechanical-sympathy 
>  wrote:
>>
>> This may be a dumb question, but why (on Linux) is Sy

Re: Faster System.nanotime() ?

2019-05-01 Thread Gil Tene
There are many ways for RDTSC to be made "wrong" (as in non-monotonic 
within a software thread, process, system, etc.) on systems, but AFAIK 
"most" modern x86-64 bare metal systems can be set up for good clean, 
monotonic system-wide TSC-ness. The hardware certainly has the ability to 
keep those TSCs in sync (enough to not have detectable non-sync effects) 
both within a socket and across multi-socket systems (when the hardware is 
"built right"). The TSCs all get reset together and move together unless 
interfered with...

Two ways I've seen this go wrong even on modern hardware include:

A) Some BIOSes resetting TSC on a single core or hyperthread on each socket 
(usually thread 0 of core 0) for some strange reason during the boot 
sequence. [I've conclusively shown this on some 4 socket Sandy Bridge 
systems.] This leads different vcores to have vastly differing TSC values, 
which gets bigger with every non-power-cycling reboot, with obvious 
negative effects and screams from anyone relying on TSC consistency for 
virtually any purpose.

B) Hypervisors virtualizing TSC. Some hypervisors (notably at least some 
versions of VMWare) will virtualize the TSC and "slew" the virtualized 
value to avoid presenting guest OSs with huge jumps in TSC values when a 
core was taken away for a "long" (i.e. many-msec) period of time. Instead, 
the virtualized TSC will incrementally move forward in small jumps until it 
catches up. The purpose of this appears to be to avoid triggering guest OS 
panics in code that watches TSC for panic-timeouts and other sanity checks 
(e.g. code in OS spinlocks). The effect of this "slewing" is obvious: TSC 
values can easily jump backward, even within a single software thread.

The bottom line is that the TSC can be relied upon on bare metal (where there is 
no hypervisor scheduling of guest OS cores) if the system is set up right, but 
can do very wrong things otherwise. People who really care about low-cost 
time measurement (like System.nanoTime()) can control their systems to 
make this work and elect to rely on it (that's exactly what Zing's 
-XX:+UseRdtsc flag is for), but it can be dangerous to rely on it by 
default.

On Tuesday, April 30, 2019 at 3:07:11 AM UTC-7, Ben Evans wrote:
>
> I'd assumed that the monotonicity of System.nanoTime() on modern 
> systems was due to the OS compensating, rather than any changes at the 
> hardware level. Is that not the case? 
>
> In particular, Rust definitely still seems to think that their 
> SystemTime (which looks to back directly on to a RDTSC) can be 
> non-monotonic: https://doc.rust-lang.org/std/time/struct.SystemTime.html 
>
> On Tue, 30 Apr 2019 at 07:50, dor laor > 
> wrote: 
> > 
> > It might be since in the past many systems did not have a stable rdtsc 
> and thus if the instruction is executed 
> > on different sockets it can result in wrong answers and negative time. 
> Today most systems do have a stable tsc 
> > and you can verify it from userspace/java too. 
> > I bet it's easy to google the reason 
> > 
> > On Mon, Apr 29, 2019 at 2:36 PM 'Carl Mastrangelo' via 
> mechanical-sympathy > wrote: 
> >> 
> >> This may be a dumb question, but why (on Linux) is System.nanotime() a 
> call out to clock_gettime?It seems like it could be inlined by the JVM, 
> and stripped down to the rdtsc instruction.   From my reading of the vDSO 
> source for x86, the implementation is not that complex, and could be copied 
> into Java. 
> >> 
>



Re: Avoid stepping on page faults while writing to MappedByteBuffer

2019-03-30 Thread Gil Tene


On Saturday, March 30, 2019 at 10:17:15 AM UTC-7, Steven Stewart-Gallus 
wrote:
>
> I feel like this is just a bug in the JDK that should be patched.
>

And how would you "patch" it? Without the result being sucky performance, 
that is?

The tension is between the performance of mapped byte buffer access, and 
the wish to avoid being caught in a page fault while not at a safepoint. 
You can do one or the other "easily": either be at a safepoint on all 
buffer accesses, or don't. Being at a safepoint in e.g. every call to 
ByteBuffer.get() [in the actual memory accessing code that is susceptible 
to page faults] would certainly prevent the TTSP-due-to-page-fault 
problems. But it would also dramatically reduce the performance of loops 
with such access in them. Not only because of the need to poll for 
safepoint conditions on every access but [mostly] because many compiler 
optimizations are "hard" to do across safepoint opportunities. 
 

> Couldn't this all be solved by replacing UNSAFE.copyMemory with a call to 
> a different method that isn't a HotSpot intrinsic?
>
>
> https://hg.openjdk.java.net/jdk/jdk/file/235883996bc7/src/java.base/share/classes/java/nio/Direct-X-Buffer.java.template#l313
>  
> 
>
> Interestingly I don't believe that copySwapMemory is an intrinsic so an 
> ugly kludge might be to use a nonnative byte order deliberately.
>

Counting on things not being intrinsified (or treated as leaf functions 
that don't need safepoints) is a dangerous practice, since anything in the 
jdk may become intrinsified or otherwise optimized tomorrow.

But specifically to the above, copySwapMemory (via copySwapMemory0) is 
already treated as a known leaf method 
(http://hg.openjdk.java.net/jdk/jdk/file/235883996bc7/src/hotspot/share/prims/unsafe.cpp#l1095),
 
which (among other things) means that [unlike generic JNI calls] no 
safepoints will be taken on calling it.




Re: Avoid stepping on page faults while writing to MappedByteBuffer

2019-01-17 Thread Gil Tene
mincore is no more useful than pretouch. Unless you lock, a page that was 
resident at the time of the mincore/pretouch operation can be evicted a few 
microseconds later. A “this seems to be resident right now” quality does not 
prevent page faults or long TTSPs and STW pauses on mapped buffer access, it 
just improves the stats a bit.

Sent from my iPad

On Jan 16, 2019, at 9:41 PM, Steven Stewart-Gallus <stevenselectronicm...@gmail.com> wrote:

I think what you want is something like the mincore system call on Linux so 
your thread can write directly if the page is mapped but offload the work to 
another thread if it is not mapped. I don't have any experience with the system 
call though.




Re: Happens before between putting and getting from and to ConcurrentHashMap

2018-11-18 Thread Gil Tene
Well, I was wrong on the internet. (Again).

Vladimir is right, the happens-before is transitive. Regardless of the 
order of execution on either side, every field initialization (final or 
otherwise) happens-before the put(), and therefore happens-before any 
subsequent get() on any thread, such that the get()'ing thread cannot 
observe pre-initialization values in the value being put() in the CHM.

For others that might follow the confusing weekend logic that led me to the 
mistake: I need to remind myself that it is complicated (and error-prone) 
to deduce negatives from positive observations. E.g. "happened before" just 
means "there is no happens-before that prohibited it from being observed 
this way" (from JLS 1.4.5: "...Informally, a read r is allowed to see the 
result of a write w if there is no happens-before ordering to prevent that 
read.").

The process of making this mistake (for me) comes from figuring out too many 
negatives in a row, as in when trying to fully figure out the 
meaning of "anti-anti-anti-missile-missile" in your head, without unfolding 
it on paper. It can lead to falsely jumping from "I have an example where I 
definitely saw A happen before B happened" to hb(A, B) instead of just the 
!hb(B, A) it actually means. And/or something like that backward.

For me, these multi-deep-negatives thinking sequence often start with 
something informal like "so we know that another thread *can* see 
pre-initialized values of non-final fields if the reference to the object 
was published without an ordering operation of some sort" followed by "so 
since I know that is allowed, what is it in this stuff [like the CHM 
put()/get() example above] that actually prevents it from happening in this 
case?", then, after finding the concrete ordering operation [the volatile 
write and read] in the implementation but not stated *directly* in the 
contract (now we are in a two-deep negative assessment), taking the path of 
"I can construct a sequence where I *know* that an initialization of a 
field in an object happened before a publication of the object, and where 
without this ordering operation another thread could still see a value that 
predates that initialization, so if I remove this implementation-specific 
ordering operation this can still happen under this contract" takes me into 
a triple negative territory.

The best thing to do is to erase the board and start from the other end...

The contract's happens-before statement between the put() and subsequent 
non-null get()s, even while not directly saying anything about the 
relationships with program operations prior to the put(), does carry a 
transitive property that covers them. So the contract would require all 
future implementations to apply some ordering operation(s) that would still 
enforce the happens-before, including its transitive properties.  What that 
operation would be doesn't matter, it is the implementor's responsibility 
to make sure it is there.

On Sunday, November 18, 2018 at 2:37:27 AM UTC-8, Vladimir Sitnikov wrote:
>
> Gil>I'd be happy
>
> I wish you all the best. I truly adore the way you explain things.
>
> Gil>There is no contract I've found that establishes a 
> happens-before relationship between the initialization of non-final fields 
> in some object construction and the put of that object (or of some object 
> that refers to it) as a value into the CHM.
>
> In this case that contract is provided by JLS "17.4.3. Programs and 
> Program Order" and "17.4.5. Happens-before Order" .
> TL;DR: "If x and y are actions of the same thread and x comes before y in 
> program order, then hb(x, y)."
>
> Let me take an example:
>
> class Wrapper {
>   int value;
> }
> static CHM map;
>
> Thread1:
> val w = new Wrapper();
> w.value=42;
> map.put("wrap", w);
>
> Thread2:
> val u = map.get("wrap");
> if (u != null) { println(u.value); }
>
> 1) In thread 1 there's happens-before between write "value=42", and write 
> of "w" into the map since "program order implies happens-before"
> 2) CHM provides happens-before for non-null retrieval of the value
> 3) "retrieval of the value u" happens-before "read of u.value" since 
> "program order implies happens-before"
>
> The happens-before order is a partial order, so it is transitive, so 1+2+3 
> gives "write of value=42 happens-before read of u.value".
> The key point of having a CHM is to have 2 (happens-before across threads) 
> which is not provided automatically if simple HashMap is used.
>
> What do you think?
>
> Gil>It is true that* in current implementation* a put() involves a 
> volatile store
>
> JavaDoc contract is there, so CHM would provide "happens-before relation 
> between update and non-null retrieval" in future one way or another.
>
> Vladimir
>


Re: Happens before between putting and getting from and to ConcurrentHashMap

2018-11-16 Thread Gil Tene
Well, "yes, sort of".

Your example works for String, because it is immutable.

Specifically, there is a happens-before relationship between the 
constructor's initialization of the final fields of the String and the 
subsequent publishing store of the result of the constructor.

However, non-final fields have no such happens-before relationship with the 
publishing store. E.g. the cached hash field in String may or may not have 
been initialized when p.x is read. 

This [race on hash field initialization] has no visible side effect because 
of how the hash field is used in String: it caches the hash code, and will 
freshly compute the hash from the immutable char[] if the field holds a 0. 
So even if the initialization races with a call to hashCode() [e.g. 
a call to hashCode() happens on another thread before the hash field is 
initialized, and the initialization overwrites the cached value with a 0], 
all that would happen is a recomputation of the same hash. The value 
returned by hashCode() won't change.
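
The idiom in question looks roughly like this (a stripped-down sketch of the 
pattern, not the actual java.lang.String source):

// Stripped-down sketch of the benign-race caching idiom described above: the cached
// field is non-final, so a reader may observe 0 even after another thread has
// computed and stored the hash; it then just recomputes the same value, so the race
// has no visible effect.
final class CachedHash {
    private final char[] chars;  // immutable state, safely published via final-field semantics
    private int hash;            // non-final cache; may be observed as 0 under a race

    CachedHash(char[] chars) { this.chars = chars.clone(); }

    @Override
    public int hashCode() {
        int h = hash;            // single read of the racy field
        if (h == 0) {
            for (char c : chars) h = 31 * h + c;
            hash = h;            // racy write; losing the race just means recomputing the same value
        }
        return h;
    }
}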

But in other cases where non-final fields are involved, e.g. if p.x was a 
Foo with a non-final field y with a getter, p.x.getY() may return an 
uninitialized value after the get() returns a non-null p.

On Friday, November 16, 2018 at 3:59:47 AM UTC-8, John Hening wrote:
>
> Hello, let's look at:
>
>
>
> class P {  public String x;
> }
> 
> ConcurrentHashMap x = new ConcurrentHashMap<>();
>
> new Thread(() -> { // Thread 1
> x.put(1, new String("x")); // 1
> }).start();
> 
> new Thread(() -> { // Thread 2 
> P p = x.get(1); // 2
> if(p != null){ 
> print(p.x);// 4
> }
> }).start();
>
>
>
>
> If thread 2 observes p != null is it guaranteed by JMM that p.x is 
> initialized? For my eye, yes, because:
>
> Let's assume a such execution when p != null was true. It means that there 
> is a synchronize-with relation between x.put(1, new String("x")); --sw--> 
> x.get().
> Putting and getting an element from ConcurrentHashMap contain 
> synchronization access (and, actually, synchronization-with is between 
> them). 
>
> In a result, there is a chain of happens-before relation:
>
> tmp = new String("x") --hb--> x.put(1, tmp) --hb--> x.get(1) --hb--> 
> read(p.x)
>
>
>
>
> Yes?
>
>
>
>
>
>



Re: Free-before-allocate in Java

2018-11-15 Thread Gil Tene
This one is "tricky" in a fun way.

From a language semantics point of view, the compiler is probably allowed 
to perform the optimization suggested (removing the null assignment to a 
non-volatile field, knowing that it will soon be overwritten), which does 
raise an interesting point about the language semantics and OOM conditions.

However, from a practical implementation point of view, it is "hard" [my 
shorthand for "it may be impossible, but I can't prove it, or don't want to 
bother trying"] to perform this optimization in a way that would eliminate 
the null assignment from a GC perspective, because there is an object 
allocation between the two assignments. The difficulty comes from the fact 
that even in optimized code (in current JVM implementations), all 
allocation sites retain the ability to take a safepoint. Here is the logic:

- The allocation attempt *may* need to wait for a GC to complete (e.g. if a 
GC is needed in order to produce the empty memory that the allocation will 
use).

- A GC can't be guaranteed to complete (in all current practical JVM 
implementations) without transitioning all threads (at the very least 
temporarily and individually, if not simultaneously and globally) to a 
safepoint.

- Since the thread must be able to "come to a safepoint" at the allocation 
site (which sits between the first and second assignments), and since 
safepoints can end up being used for things other than GC (such as 
deoptimization, breakpoints, etc.), the JVM state at the safepoint must be 
completely reconstructible.

- If a safepoint is taken at the allocation site, the state of the JVM at 
that safepoint would include the memory outcome of the first assignment and 
*not* include the outcome of the second assignment.

- Therefore, *IF* the safepoint is taken, the null assignment must occur 
before it is taken.

Technically, this can defeat the optimization altogether (which is 
what it will do for most JITs). However, it is possible to keep the 
optimization in the fast path (allocation doesn't take a safepoint) if a 
JIT was able to push code into the path between the poll to determine if a 
safepoint is needed and actually reaching the safepoint. If the JIT has 
that ability, the null assignment code could be moved around such that it 
occurs only if a safepoint is actually taken, and is skipped if a safepoint 
is not taken.

Either way (regardless of whether the optimization is defeated, or the null 
is moved to happen only in the safepoint-taking path), the null assignment 
would occur before GC is ever forced at the allocation site. This will 
explain why you won't see an OOM on that allocation on current JVMs, even 
with JIT-optimized code, even if the heap is only large enough to 
accommodate one copy of the byte[].

On Tuesday, November 13, 2018 at 9:28:18 AM UTC-8, Shevek wrote:
>
> Given the following code: 
>
> byte[] data = ...; 
>
> { 
> data = null; 
> data = new byte[N]; 
> } 
>
> Is the compiler allowed to discard the assignment of null, because it's 
> "dead" in language terms? My argument is that it isn't dead because it 
> allows the garbage collector to respond to pressure within the 'new' and 
> reuse the space, but in language terms, this presumably isn't defined, 
> and it would seem to be legal for the first assignment to be removed. 
>
> Thank you. 
>
> S. 
>



Re: Sorting a very large number of objects

2018-11-11 Thread Gil Tene


On Sunday, November 11, 2018 at 9:53:45 PM UTC-8, Shevek wrote:
>
> I'm using the serialized approach with an ergonomics approach to heap 
> management. protobuf really sucks for a lot of reasons which are easily 
> remediable if one is a protobuf maintainer, but I'm not and I don't care 
> to be, and I don't have the time or inclination to write a new serializer. 
>
> However, and to Gil's point, following the ergonomics approach, when I 
> resize an array, I drop a garbage-collectable array which I no longer 
> "own", so from the PoV of the memory management code, it forms part of 
> the "other" part of the heap, i.e. I have to subtract it from the total 
> memory before computing the proportion of the heap I'm allowed to use. 
> Over time, given that GC isn't timely, I can lose all my heap to these 
> collectable arrays, and if I follow the rule about "Never call 
> System.gc()" this actually isn't a very efficient approach either.


The importance or impact of timeliness of GC is often misunderstood. In 
fact, from a GC cpu overhead perspective, the less timely it is, the better.

Once your code no longer holds any references to an array or object, you 
shouldn't count it as part of the live set, and shouldn't treat it as 
something that holds down any memory or that you are "losing all your heap 
to". From a GC efficiency (and GC cpu cost) perspective, if the array is no 
longer reachable, it is empty (non-live) heap regardless of whether or not 
GC collects it in a timely manner. The GC will get to recycling the memory 
it used to be in eventually, and the more of it there is (as a % of Xmx 
when GC actually triggers) the more efficient that GC will be. The thing to 
keep in mind when imagining/modeling this is that most GC algorithms work 
hard to avoid spending CPU cycles on dead stuff. It's the live (reachable) 
stuff that costs (and causes GC cpu spend). The larger the empty (dead when 
GC is triggered) parts of the heap are as a % of the overall Xmx, the less 
often that CPU cost (of dealing with the live stuff) needs to be paid.

When trying to evaluate the amount of empty heap available, in order to 
e.g. figure out how to size your temporary structures or caches, the things 
to estimate are not the "current" amount of heap use or "current" amount of 
empty heap. They are the live set (the sum of getUsed() calls on 
MemoryPoolMXBean.getCollectionUsage() 
<https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#getCollectionUsage()>
 for 
the various pools in the collector) and the empty heap after collection 
(the sum of the differences between the getMax() and getUsed() on the 
same), which will best approximate what you need to consider. It is 
critical to look at the values that reflect use levels immediately after GC 
is performed by using values obtained from 
MemoryPoolMXBean.getCollectionUsage() 
<https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#getCollectionUsage()>*
 as 
opposed to* using MemoryPoolMXBean.getUsage() 
<https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#getUsage()>.
 The 
latter simply show up as random numbers that fall somewhere between the 
live set and Xmx, depending on when you measure them. They make for nice 
plots for tracking GC activity but are generally useless for estimating 
available empty heap when trying to make sizing choices.
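
A minimal sketch of gathering those two estimates with the standard 
java.lang.management API (pool filtering and error handling kept deliberately 
naive):

// Sketch: estimate live set and post-GC empty heap from the collection usage of
// each memory pool, as described above. Pools that report no collection usage, or
// no defined max, are simply skipped.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class PostGcHeapEstimate {
    static void printEstimates() {
        long liveSetEstimate = 0;
        long emptyAfterGcEstimate = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage afterGc = pool.getCollectionUsage();  // usage right after the last collection of this pool
            if (afterGc == null) continue;                    // pool is not collected (e.g. some non-heap pools)
            liveSetEstimate += afterGc.getUsed();
            if (afterGc.getMax() >= 0) {                      // getMax() is -1 when undefined
                emptyAfterGcEstimate += afterGc.getMax() - afterGc.getUsed();
            }
        }
        System.out.println("approx. live set:       " + liveSetEstimate + " bytes");
        System.out.println("approx. empty after GC: " + emptyAfterGcEstimate + " bytes");
    }
}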
 

>
>
> The next best approach is to allocate all my memory in 1Mb blocks, make 
> each memory "region" be backed by some list of these blocks, so I never 
> actually hand them to the GC, and just reduce my overall memory usage by 
> dropping blocks, rather than totally resizing byte arrays. 
>
> My main disbelief is that I'm inventing all of this from scratch. It HAS 
> to have been done before, right? All I want to do is sort a load of 
> objects...! 
>
> S. 
>
> The presumed solution to this 
>
> On 11/10/18 8:51 AM, Gil Tene wrote: 
> > 
> > 
> > On Friday, November 9, 2018 at 7:08:23 AM UTC-8, Shevek wrote: 
> > 
> > Hi, 
> > 
> > I'm trying to sort/merge a very large number of objects in Java, and 
> > failing more spectacularly than normal. The way I'm doing it is 
> this: 
> > 
> > * Read a bunch of objects into an array. 
> > * Sort the array, then merge neighbouring objects as appropriate. 
> > * Re-fill the array, re-sort, re-merge until compaction is "not very 
> > successful". 
> > * Dump the array to file, repeat for next array. 
> > * Then stream all files through a final merge/combine phase. 
> > 
> > This is failing largely because I have no idea how large to make the 
> >

Re: Sorting a very large number of objects

2018-11-10 Thread Gil Tene


On Friday, November 9, 2018 at 7:08:23 AM UTC-8, Shevek wrote:
>
> Hi, 
>
> I'm trying to sort/merge a very large number of objects in Java, and 
> failing more spectacularly than normal. The way I'm doing it is this: 
>
> * Read a bunch of objects into an array. 
> * Sort the array, then merge neighbouring objects as appropriate. 
> * Re-fill the array, re-sort, re-merge until compaction is "not very 
> successful". 
> * Dump the array to file, repeat for next array. 
> * Then stream all files through a final merge/combine phase. 
>
> This is failing largely because I have no idea how large to make the 
> array. Estimating the ongoing size using something like JAMM is too 
> slow, and my hand-rolled memory estimator is too unreliable. 
>
> The thing that seems to be working best is messing around with the array 
> size in order to keep some concept of runtime.maxMemory() - 
> runtime.totalMemory() + runtime.freeMemory() within a useful bound. 
>
> But there must be a better solution. I can't quite think a way around 
> this with SoftReference because I need to dump the data to disk when the 
> reference gets broken, and defeating me right now. 
>
> Other alternatives would include keeping all my in-memory data 
> structures in serialized form, and paying the ser/deser cost to compare, 
> but that's expensive - my main overhead right now is gc. Serialization 
> is protobuf, although that's changeable, since it's annoying the hell 
> out of me (please don't say thrift - but protobuf appears to have no way 
> to read from a stream into a reusable object - it has to allocate the 
> world every single time). 
>

In general, whenever I see "my overhead is gc" and "unknown memory size" 
together, I see it as a sign of someone pushing heap utilization high and 
getting into the inefficient GC state. Simplistically, you should be able 
to drop the GC cost to an arbitrary % of overall computation cost by 
increasing the amount (or relative portion) of empty heap in your set. So 
GC should never be "a bottleneck" from a throughput point of view unless 
you have constraints (such as a minimum required live set and a maximum 
possible heap size) that force you towards a high utilization of the heap 
(in terms of LiveSet/HeapSize). The answer to such a situation is generally 
"get some more RAM for this problem" rather than "put in tons of work to fit 
this in". 

For most GC algorithms, or at least for the more costly parts of such 
algorithms, GC efficiency is roughly linear to EmptyHeap/LiveSet. Stated 
otherwise, GC cost grows with LiveSet/EmptyHeap or 
LiveSet/(HeapSize-LiveSet). As you grow the amount you try to cram into a 
heap of a given size, you increase the GC cost to the square of your 
cramming efforts. And for every doubling of the empty heap [for a given 
live set] you will generally halve the GC cost.
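
To put toy numbers on that relationship (a back-of-the-envelope model, not a 
measurement):

// Toy model of the relationship described above:
// relative GC cost ~ liveSet / (heapSize - liveSet), i.e. liveSet / emptyHeap.
static void printRelativeGcCost(double liveSetGb) {
    for (double emptyGb : new double[] {10, 20, 40, 80}) {  // double the empty heap each step
        double heapGb = liveSetGb + emptyGb;
        double relativeCost = liveSetGb / (heapGb - liveSetGb);
        System.out.printf("live=%.0fGB heap=%.0fGB -> relative GC cost ~%.3f%n",
                liveSetGb, heapGb, relativeCost);
    }
}
// For a 10GB live set this prints ~1.0, 0.5, 0.25, 0.125: each doubling of the
// empty heap roughly halves the modeled GC cost.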

This should [hopefully] make it obvious why using SoftReferences is a 
generally terrible idea.
 

>
> Issues: 
> * This routine is not the sole tenant of the JVM. Other things use RAM. 
>
 
You can try to establish what an "efficient enough" heap utilization level 
is for your use case (a level that keeps overall GC work as a % of CPU 
spend to e.g. below 10%), and keep your heap use to a related fraction of 
whatever heap size you get to have on the system you land on.
 

> * This has to be deployed and work on systems whose memory config is 
> unknown to me. 
>
> Can anybody please give me pointers? 
>
> S. 
>



Re: Getting clock difference of 2 servers upto single digit micros (as close as possible)

2018-10-23 Thread Gil Tene
The mean end-to-end (from writing to a socket to reading from a socket) 
round-trip latency across a modern 10G+ network can be brought down to 30-40usec 
on modern hardware with relatively low effort and no specialized equipment 
(e.g. https://blog.cloudflare.com/how-to-achieve-low-latency/), and can be 
driven as low as 3-5 usec with specialized hardware and software stacks 
(kernel bypass, etc.) (e.g. 
http://www.mellanox.com/related-docs/whitepapers/HP_Mellanox_FSI%20Benchmarking%20Report%20for%2010%20%26%2040GbE.pdf).
 

A trivial round trip ("what time do you have? [my time is X]" answered by "my 
clock shows Y for your request sent at X" [received at Z]) would allow you to 
measure the perceived wall clock difference between the two machines to within 
the round trip latency. E.g. the difference between the clocks (at the time 
measured) in the above sequence is known to be (Z-Y) +/- (Z-X). You can use 
various statistical techniques to more closely estimate the bound when 
repeating the round trip queries many times and across periods of time. E.g. 
the amazingly effective techniques used
(decades ago) by NTP to synchronize clocks to within milliseconds across 
wide geographical distances and slow/jittery networks still apply even at 
low latency scales (e.g. start with something 
like http://www.ntp.org/ntpfaq/NTP-s-algo.htm or 
https://www.cisco.com/c/en/us/about/press/internet-protocol-journal/back-issues/table-contents-58/154-ntp.html
 
and dig into references if interested).
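
A minimal sketch of a single-exchange estimate along those lines (taking the 
midpoint of the bound; exchangeWithRemote() is a hypothetical stand-in for 
whatever request/response transport carries the remote timestamp back):

// Sketch of one round-trip clock-offset measurement as described above.
final class ClockOffsetSample {
    final double offsetMillis;       // estimate of (remote clock - local clock)
    final double uncertaintyMillis;  // +/- bound: half the round-trip time

    private ClockOffsetSample(double offsetMillis, double uncertaintyMillis) {
        this.offsetMillis = offsetMillis;
        this.uncertaintyMillis = uncertaintyMillis;
    }

    static ClockOffsetSample measureOnce() {
        long x = System.currentTimeMillis();  // local time at send
        long y = exchangeWithRemote(x);       // remote wall-clock reading carried in the reply
        long z = System.currentTimeMillis();  // local time at receipt
        // The remote reading Y was taken at some local time between X and Z, so the
        // offset lies within [Y - Z, Y - X]; take the midpoint, with half the RTT as the bound.
        return new ClockOffsetSample(y - (x + z) / 2.0, (z - x) / 2.0);
    }

    static long exchangeWithRemote(long localSendTimeMillis) {
        throw new UnsupportedOperationException("stand-in for a real request/response exchange");
    }
}

Repeating this and keeping the samples with the smallest (z - x) is the simplest 
of the statistical refinements mentioned above.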

Keep in mind that at the levels you are looking at clock skew and drift are 
very real things. And then there is jitter...

On Tuesday, October 23, 2018 at 5:05:22 AM UTC-7, Himanshu Sharma wrote:
>
> As the title suggests, consider 2 servers connected via an L3 switch. How 
> can we find the absolute time difference between the clocks running on the 
> servers. I want to go as close as possible. 
>
> Actually syncing the clocks is not possible due to some constraints so I 
> want to know the time difference. Is there any opensource tool I can use 
> readily.
>
>
> Many thanks in advance
>



Re: Concurrent retrieval of statistics

2018-10-17 Thread Gil Tene


On Wednesday, October 17, 2018 at 7:47:05 PM UTC-4, Carl Mastrangelo wrote:
>
> Without forking this thread too much, I read your post a few months back 
> (the link you posted is dead though!).  
>

I fixed the link in the post. Thx.
 

> One of the questions I had is that WRP still synchronizes all of the 
> writers in writerCriticalSectionEnter, since they all will have to contend 
> on the long.
>

While this technically is synchronization, it is wait-free synchronization. 
The writers remain wait-free (on architectures that support atomic 
increments, e.g. x86) while contending for write access to the cache line 
containing the long epoch counter. One processor will push through at a 
time, but there are no retry loops or other wait situations involved.

> My question is if it's possible to shard that counter somehow, since 
> presumably the number of writers is high, and they enter the writer 
> critical section significantly more than the reader does?  (and if it is 
> possible, was this unnecessary for WRP?)
>

Yes, implementations that use wider forms of counters (e.g. using a 
LongAdder 
<https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/LongAdder.html> 
instead of volatile longs and atomic updaters) are certainly possible, and 
would not change the algorithm in any way. I found that in most use cases the 
number of writers is low enough and the rate of writing low enough 
(even when it is in the millions per second) that reducing contention on 
the counters is not worth the extra indirection cost of a LongAdder. But 
you can implement various equivalent-logic striped schemes that ar

>
> On Wednesday, October 17, 2018 at 5:34:09 AM UTC-7, Gil Tene wrote:
>>
>> See "WriterReaderPhaser: A story about a new (?) synchronization 
>> primitive" 
>> <http://stuff-gil-says.blogspot.com/2014/11/writerreaderphaser-story-about-new.html>.
>>  
>> WriterReaderPhaser (WRP) was designed for the common pattern of speed or 
>> latency critical writers writing and accumulating data into a data 
>> structure (e.g. a stats collecting like a histogram), and less-critical 
>> background threads occasionally or periodically gathering up that 
>> accumulated data is a lossless manner, and provides for wait-free writers 
>> and potentially blocking readers. It's a common pattern used and needed 
>> when e.g. logging of metrics or other stats is performed periodically, and 
>> I've used the pattern often enough that I decided it needs a new 
>> synchronization primitive.
>>
>> For a full code example of how to use this synchronization primitive in 
>> e.g. a classic double-buffered approach around a simple data structure, you 
>> can see how an implementation of WriterReaderPhaser 
>> <https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/WriterReaderPhaser.java>
>>  
>> is used in HdrHistogram's SingleWriterRecorder 
>> <https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/SingleWriterRecorder.java>
>>  and Recorder 
>> <https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/Recorder.java>
>>  classes, 
>> both of which record data into an interval histogram, and provide for 
>> lossless reading of that interval histogram. Those are classic examples of 
>> stats written by latency critical writers in a wait-free manner, and 
>> collected by a non-latency critical background thread, and 
>> WriterReaderPhaser can be similarly used to coordinate this sort of work 
>> around any sort of stats-gathering data structure. The WRP's current 
>> implementation has the writer use an atomic increment (which translates to 
>> a LOCK XADD on x86) to enter 
>> <https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/WriterReaderPhaser.java#L73>
>>  
>> and leave 
>> <https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/WriterReaderPhaser.java#L91>
>>  
>> the critical section. Single writer cases need no further synchronization, 
>> and multi-writer cases would need to coordinate on writing to the common 
>> data structure (e.g. an AtomicHistogram 
>> <https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/Recorder.java#L314>
>>  or 
>> ConcurrentHistogram 
>> <https://github.com/HdrHistogram/HdrHistogram/blob/master/src/main/java/org/HdrHistogram/Recorder.java#L326>
>>  
>> in Recorder, depending on whether or not auto-resizing of the histogram is 
>> needed).
>>
>> A java implementation of WRP c

Re: Concurrent retrieval of statistics

2018-10-17 Thread Gil Tene
See "WriterReaderPhaser: A story about a new (?) synchronization primitive" 
. 
WriterReaderPhaser (WRP) was designed for the common pattern of speed or 
latency critical writers writing and accumulating data into a data 
structure (e.g. a stats-collecting structure like a histogram), and less-critical 
background threads occasionally or periodically gathering up that 
accumulated data in a lossless manner, and provides for wait-free writers 
and potentially blocking readers. It's a common pattern used and needed 
when e.g. logging of metrics or other stats is performed periodically, and 
I've used the pattern often enough that I decided it needs a new 
synchronization primitive.

For a full code example of how to use this synchronization primitive in 
e.g. a classic double-buffered approach around a simple data structure, you 
can see how an implementation of WriterReaderPhaser is used in HdrHistogram's 
SingleWriterRecorder and Recorder classes, 
both of which record data into an interval histogram, and provide for 
lossless reading of that interval histogram. Those are classic examples of 
stats written by latency critical writers in a wait-free manner, and 
collected by a non-latency critical background thread, and 
WriterReaderPhaser can be similarly used to coordinate this sort of work 
around any sort of stats-gathering data structure. The WRP's current 
implementation has the writer use a CAS to enter and leave 
the critical section. Single writer cases need no further synchronization, 
and multi-writer cases would need to coordinate on writing to the common 
data structure (e.g. an AtomicHistogram or ConcurrentHistogram 
in Recorder, depending on whether or not auto-resizing of the histogram is 
needed).

A java implementation of WRP currently lives in HdrHistogram (which is 
available on maven central). It is too small for its own package and 
probably needs a better home. If interest in it grows, Martin can probably 
find it a home in Agrona. Other implementations have been built in e.g. 
other language versions of HdrHistogram (e.g. C, .NET).
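
As a usage sketch (using the Recorder and Histogram classes that ship in 
HdrHistogram on maven central; the class and method names below are my own): 
the writer path is a plain recordValue() call, and the background reader 
periodically swaps out a lossless interval histogram:

import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;

public class IntervalLoggingSketch {
    static final Recorder recorder = new Recorder(3); // 3 significant decimal digits

    // Latency-critical writer path: wait-free on architectures with atomic increment.
    static void recordLatency(long latencyNanos) {
        recorder.recordValue(latencyNanos);
    }

    // Background reader path: may block briefly; writers never do.
    public static void main(String[] args) throws InterruptedException {
        recordLatency(42_000);
        Thread.sleep(1000);
        Histogram interval = recorder.getIntervalHistogram(); // swap-and-read, losslessly
        System.out.println("interval max = " + interval.getMaxValue() + " ns");
    }
}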


On Tuesday, October 16, 2018 at 4:33:03 AM UTC-4, Mohan Radhakrishnan wrote:
>
> Hi,
> There is streaming data everywhere, like trading data and JVM 
> logs, etc. Retrieval of statistics from this data needs fast data structures. 
> Where can I find the literature on such fast data structures to store and 
> retrieve timestamps and data in O(1) time? Should this always be 
> low-level Java concurrent utilities?
>
> Thanks,
> Mohan
>



Re: JMM- synchronization access in a concrete example.

2018-09-25 Thread Gil Tene
As Tom noted, The Executor's submission happens-before promise prevents a 
reordering of (1) and (2) above.

Note that, as written, the reason you don't have data races between (2) 
and (2) is that executor is known to be a single-threaded executor (and 
will only run one task at a time). Without that quality, you would have 
plenty of (2) vs. (2) races. It is not that "doers contain different 
objects": your code submits executions of functions using the same x member 
of xs to all doers, and it is only the guaranteed serialization in your 
chosen executor implementation that prevents x.f()s from racing on the same 
x...
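
To make that point concrete, here is a minimal sketch (class names are mine, 
not from the original code): with the single-threaded executor the two 
submitted x.f() calls cannot overlap, while the same submissions to a 
multi-threaded pool race on the same x:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecutorSerializationSketch {
    static class X {
        int state;
        void f() { state++; } // not synchronized: racy if called concurrently on the same x
    }

    public static void main(String[] args) {
        X x = new X();

        ExecutorService single = Executors.newSingleThreadExecutor();
        single.submit(x::f); // (2)
        single.submit(x::f); // (2) -- runs strictly after the previous task completes

        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(x::f);   // (2)
        pool.submit(x::f);   // (2) -- may run at the same time as the line above: data race

        single.shutdown();
        pool.shutdown();
    }
}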

On Tuesday, September 25, 2018 at 8:52:14 AM UTC-7, John Hening wrote:
>
> public class Test {
>     ArrayList<X> xs;
>     ArrayList<Doer> doers;
>     Executor executor = Executors.newSingleThreadExecutor();
>
>     static class Doer {
>         public void does(X x) {
>             x.f();                              // (2)
>         }
>     }
>
>     void test() {
>         for (X x : xs) {
>             x.f();                              // (1)
>
>             for (Doer d : doers) {
>                 executor.execute(() -> d.does(x));
>             }
>         }
>     }
> }
>
>
>
>
> To my eye, if X.f is not synchronized it is incorrect because of two 
> facts (and only those two facts): 
>
> 1. Obviously, there is a data race between (1) and (2). There are no more 
> data races here. (doers contains different objects)
> 2. There is no guarantee that (1) will be executed before (2). Yes?
>
> If X.f were synchronized, that code would be correct because:
> 1. There is no data race.
> 2. There is a guarantee that (1) will be executed before (2) because (1) is 
> a synchronization action and Executor.execute is also a synchronization 
> access (not specifically execute itself)
>
> Yes?
>



Re: Pushing Terrabytes of files across a WAN

2018-09-16 Thread Gil Tene
Assuming your 40TB is spread across many files, the main thing I'd play 
with is the number of threads used in the copy (Robocopy has a /MT:n option 
which defaults to 8 but can be set as high as 128 
per 
https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/robocopy).
 
On the assumption that each thread uses blocking APIs on a TCP connection, 
using the highest number of threads (128) will allow you to keep the 
throughput and TCP window sizes of each TCP connection closer to the range 
that mere mortals work at, and still be able to saturate this cool 
10Gbps high-latency link.

To saturate the 10Gbps link, your 128 connections will average ~80Mbps per 
connection. With an e.g. ~50msec actual RTT, this means connections will 
need at least a ~512KB send buffer (and probably 2-3x that since you'll see 
variance across them, and some will need to be faster than the average). On 
older OSs this could be a challenge (and require tweaking default send and 
receive buffer limits), but it should not be a problem if RFC 1323 TCP 
auto-scaling is used and supported on both sides. I would like to assume 
that any Windows and CentOS setup that has a 10Gbps NIC also has TCP window 
scaling support on by default and that all network elements on the path 
(including e.g. firewalls) don't block TCP Window Scale Option in RFC 1323, 
but assumptions like that could be a bit presumptuous. Just in case, there 
is some useful discussion here.
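
The arithmetic behind the numbers above, as a back-of-the-envelope sketch 
(assuming the 10Gbps link, 128 connections, and ~50msec RTT already mentioned):

public class WanBufferMath {
    public static void main(String[] args) {
        double linkBitsPerSec = 10e9;   // 10Gbps link
        int connections = 128;          // Robocopy /MT:128
        double rttSec = 0.050;          // ~50msec round trip

        double perConnBitsPerSec = linkBitsPerSec / connections;   // ~78 Mbps each
        double bytesInFlight = perConnBitsPerSec * rttSec / 8.0;   // bandwidth-delay product
        System.out.printf("per connection: ~%.0f Mbps, needs ~%.0f KB in flight (before headroom)%n",
                perConnBitsPerSec / 1e6, bytesInFlight / 1024);    // ~78 Mbps, ~480 KB
    }
}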

One other obvious question to ask, since (as noted by Thomas) your average 
transfer speed will need to be at least 3.7Gbps in order to complete 40TB 
in each 24hr period, is whether anyone else is using that nice fat 10Gbps 
link... Is the link dedicated to you? You'll be using well over 30% of its 
capacity, and it's enough for two or more other users like you wanting to 
do the same for the math to never work out...


On Thursday, September 13, 2018 at 10:15:23 PM UTC+2, Jay Askren wrote:
>
> Todd Montgomvery,
>
> Correct, robocopy uses TCP.  We have a 10 Gbps terrestrial line and are 
> working on getting a second line for redundancy.
>
>
> Thomas, 
> Thanks for the link.  I will look at that page.
>
> Jay
>
>
>
> On Thursday, September 13, 2018 at 11:04:54 AM UTC-6, Todd L. Montgomery 
> wrote:
>>
>> Hi Jay.
>>
>> Going to assume Robocopy uses TCP
>>
>> As you had no real issues with things without a WAN, I would assume the 
>> TCP window sizes, etc. are all good for the rates you need.
>>
>> Latency will play a role, but more likely loss is a more impactful factor 
>> as congestion control will be more of a throttle than flow control. With 
>> TCP (low loss rate), RTT scales linearly with throughput. Well, as RTT goes 
>> up, throughput goes down, but it is linear. With loss, even low loss, 
>> throughput scales with sqrt(loss rate). After about 5%, TCP-Reno goes into 
>> stop-and-wait. 1 MSS per RTT. This scale is non-linear and in the < 5% loss 
>> rate area is really painful on throughput.
>>
>> In short, WANs will slow down with loss quite a lot. Latency will also 
>> have an impact, though. Just not as much potential.
>>
>> Running multiple TCP connections over the same path will mean that they 
>> will fight with one another via congestion control trying to find a 
>> fairness point that jumps around and can end up underutilizing the 
>> bandwidth at times. This is where things like TCP BBR can be helpful. But 
>> still, loss will cause quite a slow down.
>>
>> What can you do? Well, it depends on what your links between the areas 
>> actually are. terrestrial vs. satellite, etc. Lots of options.
>>
>> On Thu, Sep 13, 2018 at 9:41 AM Jay Askren  wrote:
>>
>>> We need to push 40 TB of images per day from our scanning department in 
>>> Utah to our storage servers in Virginia and then we download about 4 TB of 
>>> processed images per day back to Utah.  In our previous process we had no 
>>> problem getting the throughput we needed by using Robocopy which comes with 
>>> Windows, but our old storage servers were here in Utah.  We can get 
>>> Robocopy to work across the WAN but we have to run 3 or 4 Robocopy 
>>> processes under different Windows users which is somewhat fragile and feels 
>>> like a bad hack.  The files here in Utah are on a Windows server because of 
>>> the proprietary software needed to run the scanner.  All of our servers in 
>>> Virginia run Centos.
>>>
>>> Any thoughts on how to transfer files over long distance and still get 
>>> high throughput?  I believe the issue we are running into is high latency.
>>>

Re: jHiccup for .NET?

2018-08-29 Thread Gil Tene
If the .NET profiler api can be used to launch a managed code thread without 
modifying the application, then a simple port of the java agent code to C# 
should be possible. The hiccup observation must be done in managed code, and a 
separate thread, in order to observe the hiccups that an independent thread 
running such code would see.

Sent from my iPad

On Aug 29, 2018, at 5:49 AM, Greg Young <gregoryyou...@gmail.com> wrote:

If someone wants to work on it I have a profiler api implementation that could 
be a useful starting point. It supports both mono and the CLR (two separate 
implementations the mono one is in C the CLR C++)

On Wed, Aug 29, 2018 at 1:07 PM Remi Forax <fo...@univ-mlv.fr> wrote:


________
De: "Gil Tene" mailto:g...@azul.com>>
À: "mechanical-sympathy" 
mailto:mechanical-sympathy@googlegroups.com>>
Envoyé: Mardi 28 Août 2018 00:28:00
Objet: Re: jHiccup for .NET?
There is a great implementation of HdrHistogram for 
.NET<https://github.com/HdrHistogram/HdrHistogram.NET>, which makes the rest of 
what jHiccup does nearly trivial to do. I think the main thing keeping jHiccup 
itself from being ported to .NET is that its most common use mode is as a java 
agent (adding hiccup recording to a java program without modifying it in any 
way), and AFAIK .NET does not have a similar agent mechanism.

I believe the .NET Profiling API provides something equivalent to the Java 
agent API.


jHiccup itself is fairly simple and should be easy to port into a library you 
can invoke from within your application, and into a standalone program (for 
measuring control hiccups on an otherwise idle process). Its main 
class<https://github.com/giltene/jHiccup/blob/master/src/main/java/org/jhiccup/HiccupMeter.java>
 is only ~800 lines of code, over half of it in comments and parameter parsing 
logic. People have replicated some of its logic in their C# stuff before (e.g. 
Matt Warren used it 
here<http://mattwarren.org/2014/06/23/measuring-the-impact-of-the-net-garbage-collector-an-update/>).

-- Gil.

Rémi



On Monday, August 27, 2018 at 12:49:15 PM UTC-7, Mark E. Dawson, Jr. wrote:
Does there exist a port for, or a similar tool to, jHiccup for .NET?


--
Studying for the Turing test




Re: jHiccup for .NET?

2018-08-27 Thread Gil Tene
There is a great implementation of HdrHistogram for .NET, which makes the rest of 
what jHiccup does nearly trivial to do. I think the main thing keeping jHiccup 
itself from being ported to .NET is that its most common use mode is as a 
java agent (adding hiccup recording to a java program without modifying it 
in any way), and AFAIK .NET does not have a similar agent mechanism.

jHiccup itself is fairly simple and should be easy to port into a library 
you can invoke from within your application, and into a standalone program 
(for measuring control hiccups on an otherwise idle process). Its main 
class is only ~800 lines of code, over half of it in comments and parameter 
parsing logic. People have replicated some of its logic in their C# stuff 
before (e.g. Matt Warren used it here). 
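
The core of the idea is small enough to sketch (this is a simplification, not 
the actual HiccupMeter code): an otherwise-idle measurement thread repeatedly 
sleeps for a short, fixed interval and records how much longer than expected 
it took to get the CPU back:

import org.HdrHistogram.Histogram;

public class HiccupSketch implements Runnable {
    private final Histogram histogram = new Histogram(3600_000_000_000L, 3);
    private final long resolutionNanos = 1_000_000; // sample once per millisecond

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                long start = System.nanoTime();
                Thread.sleep(resolutionNanos / 1_000_000L);
                // Anything beyond the requested sleep is a platform-induced hiccup.
                long hiccupNanos = System.nanoTime() - start - resolutionNanos;
                histogram.recordValue(Math.max(hiccupNanos, 0));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}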

-- Gil.

On Monday, August 27, 2018 at 12:49:15 PM UTC-7, Mark E. Dawson, Jr. wrote:
>
> Does there exist a port for, or a similar tool to, jHiccup for .NET?
>



Re: LOCK XADD wait-free or lock free.

2018-08-25 Thread Gil Tene


On Saturday, August 25, 2018 at 8:18:53 AM UTC-7, Peter Veentjer wrote:
>
> Hi Gil,
>
> thanks for your answer. 
>
> However it still feels a bit like 'it works because it does' (or my skull 
> is just too thick).
>
> So what prevents a cpu from perpetually being denied to execute the LOCK 
> XADD? So imagine there are 2 cpu's, both executing a LOCK XADD in a loop, 
> what prevents that one of the cpu's isn't perpetually being starved from 
> its resource. 
>

What prevents it is the fact that if this was possible, you've just 
described a trivial way for you to crash any server and any laptop on earth 
from user-mode code. One of the jobs of the hardware designers is to make 
it impossible for you to do that, and they've usually done that job to an 
ok enough level on the systems we run on (assuming we haven't read anything 
in the news about "how to bring all of AWS down from a laptop").
 

> And even if it isn't perpetual, what guarantees that the CPU completes in 
> a bounded number of steps? 
>


 

>
>
> On Saturday, August 25, 2018 at 9:39:13 AM UTC+3, Gil Tene wrote:
>>
>> Wait-free, Lock-free, and Obstruction-free are well defined things. They 
>> are all non-blocking, such that suspended threads cannot indefinitely 
>> prevent all others from making progress, but they differ in the forward 
>> progress expectations of individual threads. The "Non-blocking 
>> algorithm" wikipedia entry 
>> <https://en.wikipedia.org/wiki/Non-blocking_algorithm#cite_note-wf-queue-16> 
>> does a fair job of explaining what each means. It is forward progress in a 
>> bounded number of steps, rather than "fairness", that is key. You can be 
>> extremely unfair (as in making forward progress at 1,000,000,000:1 ratios) 
>> and wait-free at the same time.
>>
>> There is a difference between the wait-free/lock-free classification of 
>> an operation/instruction and the classification of a mechanism/algorithm 
>> using that operation/instruction...
>>
>> Both the LOCK XADD operation and LOCK CAS operations are wait free. There 
>> is nothing any other processor can do to prevent your processor from making 
>> forward progress and eventually (in a bound number of steps) executing 
>> either instruction. In fact, [at the very least] all user-mode x86 CPU 
>> instructions are trivially wait free, otherwise very bad things could be 
>> done by user-mode code...
>>
>> The key difference between a LOCK XADD and a LOCK CAS is that the LOCK 
>> XADD will actually and unconditionally force a visible change to the shared 
>> memory state in a way that no other processor can prevent, while the shared 
>> memory effect of a LOCK CAS operation is a conditional, and will only occur 
>> if the compare shows that the expected value and the memory value are the 
>> same. Simple (but by no means all) uses of CAS often perform retry loops in 
>> lock-free, but not wait-free, mechanisms. 
>>
>> It is easy to see how one can use LOCK XADD operations to build wait-free 
>> mechanisms. E.g. implementing a wait-free atomic counter using XADD is 
>> trivial. It is similarly useful in more complicated synchronization 
>> primitives, like e.g. my WriterReaderPhaser 
>> <http://stuff-gil-says.blogspot.com/2014/11/writerreaderphaser-story-about-new.html>,
>>  
>> which is wait free on architectures that support atomic increment 
>> operations, but (in my current implementation) is only lock free on 
>> architectures that force us to resort to a CAS.. It is somewhat harder to 
>> construct wait-free mechanisms using LOCK CAS, but it is by no means not 
>> impossible, and usually just means some extra complexity and spending some 
>> more memory per concurrent thread or cpu. e.g. see "Wait-Free Queues 
>> With Multiple Enqueuers and Dequeuers" 
>> <http://www.cs.technion.ac.il/~erez/Papers/wfquque-ppopp.pdf>, where 
>> wait free queues are implemented using CAS constructs.
>>
>> On Friday, August 24, 2018 at 10:00:22 PM UTC-7, Peter Veentjer wrote:
>>>
>>> I'm polishing up my lock free knowledge and one question keeps on 
>>> bugging me.
>>>
>>> The question is about the LOCK XADD and why it is considered to be wait 
>>> free.
>>>
>>> AFAIK for wait freedom there needs to be some fairness. 
>>>
>>> So image a concurrent counter using a spin on a cas to increment a 
>>> counter, then at least 1 thread wins and makes progress. Therefor this 
>>> implementation is lock free. It isn't wait free because some thread could 
>>> be spinning with

Re: LOCK XADD wait-free or lock free.

2018-08-25 Thread Gil Tene


On Saturday, August 25, 2018 at 9:11:26 AM UTC-7, Martin Thompson wrote:
>
> To perform an update via the XADD instruction the cache line containing 
> the word must first be acquired. x86 uses the MESI cache coherence protocol 
> and to get the cacheline for update an RFO (Read/Request For Ownership) 
> message must be sent as a bus transaction. These requests are queued per 
> core and in the uncore and effectively avoid starvation. Events are 
> available, e.g. OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO. From the 
> perspective of instructions each thread takes a finite number of steps to 
> complete. The CAS equivalent would be lock-free, rather than wait-free, as 
> the number of steps per thread is not finite.
>

The CAS might never "succeed" in a capped number of steps (i.e. someone 
else could forever beat it to the memory location and the compare could 
fail in each and every attempt), but the CAS instruction will complete, one 
way or another, on all practical hardware implementations, in a capped 
number of steps. A fundamental design requirement in all hardware 
implementations I'm aware of is to prevent the starvation of any possible 
(at least user mode, and probably all) instructions, including CAS. This 
might be achieved in many different ways, often leveraging qualities of the 
cache protocols and the knowledge that the number of processors and other 
components (like memory controllers, coherence protocol coordination 
points, queue depths in various places, etc.) in the system is capped. "How 
exactly" doesn't matter. It's the job of whomever designs the hardware to 
make sure it is so.

Think about it this way: In any system where it is impossible for you to 
cause a hardware watchdog reset from user code, there's a "capped number of 
steps" for executing any individual user-mode instruction, which by 
definition makes them all (individually) wait-free. There are no 
non-wait-free user mode instructions in such systems. QED.
 

>
> On Saturday, 25 August 2018 06:00:22 UTC+1, Peter Veentjer wrote:
>>
>> I'm polishing up my lock free knowledge and one question keeps on bugging 
>> me.
>>
>> The question is about the LOCK XADD and why it is considered to be wait 
>> free.
>>  
>> AFAIK for wait freedom there needs to be some fairness. 
>>
>> So image a concurrent counter using a spin on a cas to increment a 
>> counter, then at least 1 thread wins and makes progress. Therefor this 
>> implementation is lock free. It isn't wait free because some thread could 
>> be spinning without bound. The problem here is that there is no fairness.
>>
>> Now imagine this counter would be implemented using a LOCK XADD and 
>> therefor there is no need for a loop. What is guaranteeing that every 
>> thread is going to make progress in a bound number of steps? E.g. could it 
>> happen that one thread is continuously denied exclusive access to the 
>> memory and therefor it won't complete in a bound number of steps? Or do the 
>> request get stored in some kind of list and the requests are processed in 
>> this order? This order would provide fairness and then it will be wait free.
>>
>> I have been looking at the Intel documentation of LOCK XADD; but it 
>> remains unclear.
>>
>



Re: LOCK XADD wait-free or lock free.

2018-08-25 Thread Gil Tene
Wait-free, Lock-free, and Obstruction-free are well defined things. They 
are all non-blocking, such that suspended threads cannot indefinitely 
prevent all others from making progress, but they differ in the forward 
progress expectations of individual threads. The wikipedia entry does a 
fair job of explaining what each means. It is forward progress in a bounded 
number of steps, rather than "fairness", that is key; fairness is not part 
of any of them. You can be extremely unfair (as in making forward progress 
at 1,000,000,000:1 ratios) and wait-free at the same time.

There is a difference between the wait-free/lock-free classification of an 
operation/instruction and the classification of a mechanism/algorithm using 
that operation/instruction...

Both the LOCK XADD operation and LOCK CAS operations are wait free. There 
is nothing any other processor can do to prevent your processor from making 
forward progress and eventually (in a bound number of steps) executing 
either instruction. In fact, [at the very least] all user-mode x86 CPU 
instructions are trivially wait free, otherwise very bad things could be 
done by user-mode code...

The key difference between a LOCK XADD and a LOCK CAS is that the LOCK XADD 
will actually and unconditionally force a visible change to the shared 
memory state in a way that no other processor can prevent, while the shared 
memory effect of a LOCK CAS operation is a conditional, and will only occur 
if the compare shows that the expected value and the memory value are the 
same. Simple (but by no means all) uses of CAS often perform retry loops in 
lock-free, but not wait-free, mechanisms. 

It is easy to see how one can use LOCK XADD operations to build wait-free 
mechanisms. E.g. implementing a wait-free atomic counter using XADD is 
trivial. It is similarly useful in more complicated synchronization 
primitives, like e.g. my WriterReaderPhaser, which is 
wait free on architectures that support atomic increment operations, but 
(in my current implementation) is only lock free on architectures that 
force us to resort to a CAS. It is somewhat harder to construct wait-free 
mechanisms using LOCK CAS, but it is by no means impossible, and 
usually just means spending some more memory per concurrent thread or cpu. 
e.g. see "Wait-Free Queues With Multiple Enqueuers and Dequeuers", where wait 
free queues are implemented using CAS constructs.
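
As a concrete sketch of that difference (a trivial counter, not a full 
mechanism): the getAndIncrement path JITs to a single LOCK XADD on x86 and is 
wait free, while the CAS retry loop is only lock free:

import java.util.concurrent.atomic.AtomicLong;

public class CounterProgressSketch {
    private final AtomicLong value = new AtomicLong();

    // Wait-free: a single LOCK XADD on x86; completes in a bounded number of steps
    // no matter what the other processors are doing.
    long incrementWaitFree() {
        return value.getAndIncrement();
    }

    // Lock-free, but not wait-free: some thread always wins, but an individual thread
    // can in principle keep losing the compareAndSet race and retrying.
    long incrementWithCasLoop() {
        long current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current + 1));
        return current + 1;
    }
}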

On Friday, August 24, 2018 at 10:00:22 PM UTC-7, Peter Veentjer wrote:
>
> I'm polishing up my lock free knowledge and one question keeps on bugging 
> me.
>
> The question is about the LOCK XADD and why it is considered to be wait 
> free.
>
> AFAIK for wait freedom there needs to be some fairness. 
>
> So image a concurrent counter using a spin on a cas to increment a 
> counter, then at least 1 thread wins and makes progress. Therefor this 
> implementation is lock free. It isn't wait free because some thread could 
> be spinning without bound. The problem here is that there is no fairness.
>
> Now imagine this counter would be implemented using a LOCK XADD and 
> therefor there is no need for a loop. What is guaranteeing that every 
> thread is going to make progress in a bound number of steps? E.g. could it 
> happen that one thread is continuously denied exclusive access to the 
> memory and therefor it won't complete in a bound number of steps? Or do the 
> request get stored in some kind of list and the requests are processed in 
> this order? This order would provide fairness and then it will be wait free.
>
> I have been looking at the Intel documentation of LOCK XADD; but it 
> remains unclear.
>



Re: FlatBuffers, ByteBuffers, and escape analysis

2018-08-08 Thread Gil Tene
Oh, and there is MethodHandles.byteBufferViewVarHandle 
<https://docs.oracle.com/javase/10/docs/api/java/lang/invoke/MethodHandles.html#byteBufferViewVarHandle(java.lang.Class,java.nio.ByteOrder)>
 
if you (for some reason) want to do the same but keep ByteBuffers around.

On Tuesday, August 7, 2018 at 9:41:01 PM UTC-7, Gil Tene wrote:
>
> *IF* you can use post-java-8 stuff, VarHandles may have a more systemic 
> and intentional/explicit answer for expressing what you are trying to do 
> here, without resorting to Unsafe. Specifically, using a 
> MethodHandles.byteArrayViewVarHandle 
> <https://docs.oracle.com/javase/10/docs/api/java/lang/invoke/MethodHandles.html#byteArrayViewVarHandle(java.lang.Class,java.nio.ByteOrder)>()
>  
> that you would get once (statically), you should be able to peek into your 
> many different byte[] instances and extract a field of a different 
> primitive type (int, long, etc.) at some arbitrary index, without having to 
> wrap it up in the super-short-lived ByteBuffer in your example, and hope 
> for Escape analysis to take care of it...
>
> Here is a code example that does the same wrapping you were looking to do, 
> using VarHandles:
>
> import java.lang.invoke.MethodHandles;
> import java.lang.invoke.VarHandle;
> import java.nio.ByteOrder;
>
>
> public class VarHandleExample {
>
>     static final byte[] bytes = {0x02, 0x00, (byte) 0xbe, (byte) 0xba, (byte) 0xfe, (byte) 0xca};
>
>     private static class FileDesc {
>         static final VarHandle VH_intArrayView =
>                 MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
>         static final VarHandle VH_shortArrayView =
>                 MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);
>         private final byte[] buf;
>         int bufPos;
>
>         FileDesc(byte[] buf, int headerPosition) {
>             bufPos = ((short) VH_shortArrayView.get(buf, headerPosition)) + headerPosition;
>             this.buf = buf;
>         }
>
>         public int getVal() {
>             return (int) VH_intArrayView.get(buf, bufPos);
>         }
>     }
>
>     public static void main(String[] args) {
>         FileDesc fd = new FileDesc(bytes, 0);
>         System.out.format("The int we get from fd.get() is: 0x%x\n", fd.getVal());
>     }
> }
>
> Running this results in the probably correct output of:
>
> The int we get from fd.get() is: *0xcafebabe*
>
> Which means that the byte offset reading in the backing byte[], using 
> little endian, and even at not-4-byte-offset-aligned locations, seems to 
> work.
>
> NOTE: I have NOT examined what it looks like in generated code, beyond 
> verifying that everything seems to get inlined, but as stated, the code 
> would not incur an allocation or need an intermediate object per buffer 
> instance.
>
> Now, since this only works in Java9+, you could code it that way for those 
> versions, and revert to the Unsafe equivalent for Java 8-. You could even 
> convert the code above to code that dynamically uses VarHandle (when 
> available) without requiring javac to know anything about them (using 
> reflection and MethodHandles), and uses Unsafe only if VarHandle is not 
> supported. Ugly PortableVarHandleExample that does that (and would run on 
> Java 7...10) *might* follow...
>
> On Tuesday, August 7, 2018 at 1:55:35 PM UTC-7, Todd Lipcon wrote:
>>
>> Hey folks,
>>
>> I'm working on reducing heap usage of a big server application that 
>> currently holds on to tens of millions of generated FlatBuffer instances in 
>> the old generation. Each such instance looks more or less like this:
>>
>> private static class FileDesc {
>>   private final ByteBuffer bb;
>>   int bbPos;
>>
>>   FileDesc(ByteBuffer bb) {
>> bbPos = bb.getShort(bb.position()) + bb.position();
>> this.bb = bb;
>>   }
>>
>>   public int getVal() {
>> return bb.getInt(bbPos);
>>   }
>> }
>>
>> (I've simplified the code, but the important bit is the ByteBuffer member 
>> and the fact that it provides nice accessors which read data from various 
>> parts of the buffer)
>>
>> Unfortunately, the heap usage of these buffers adds up quite a bit -- 
>> each ByteBuffer takes 56 bytes of heap, and each 'FileDesc' takes 32 bytes 
>> after padding. The underlying buffers themselves are typically on the order 
>> of 100 bytes, so it seems like almost 50% of the heap is being used by 
>> wrapper objects instead of the underlying data itself. Additionally, 2/3 of 
>> the object count are overhead, which I im

Re: FlatBuffers, ByteBuffers, and escape analysis

2018-08-07 Thread Gil Tene
*IF* you can use post-java-8 stuff, VarHandles may have a more systemic and 
intentional/explicit answer for expressing what you are trying to do here, 
without resorting to Unsafe. Specifically, using a 
MethodHandles.byteArrayViewVarHandle() 
that you would get once (statically), you should be able to peek into your 
many different byte[] instances and extract a field of a different 
primitive type (int, long, etc.) at some arbitrary index, without having to 
wrap it up in the super-short-lived ByteBuffer in your example, and hope 
for Escape analysis to take care of it...

Here is a code example that does the same wrapping you were looking to do, 
using VarHandles:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;


public class VarHandleExample {

    static final byte[] bytes = {0x02, 0x00, (byte) 0xbe, (byte) 0xba, (byte) 0xfe, (byte) 0xca};

    private static class FileDesc {
        static final VarHandle VH_intArrayView =
                MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
        static final VarHandle VH_shortArrayView =
                MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);
        private final byte[] buf;
        int bufPos;

        FileDesc(byte[] buf, int headerPosition) {
            bufPos = ((short) VH_shortArrayView.get(buf, headerPosition)) + headerPosition;
            this.buf = buf;
        }

        public int getVal() {
            return (int) VH_intArrayView.get(buf, bufPos);
        }
    }

    public static void main(String[] args) {
        FileDesc fd = new FileDesc(bytes, 0);
        System.out.format("The int we get from fd.get() is: 0x%x\n", fd.getVal());
    }
}

Running this results in the probably correct output of:

The int we get from fd.get() is: *0xcafebabe*

Which means that the byte offset reading in the backing byte[], using 
little endian, and even at not-4-byte-offset-aligned locations, seems to 
work.

NOTE: I have NOT examined what it looks like in generated code, beyond 
verifying that everything seems to get inlined, but as stated, the code 
would not incur an allocation or need an intermediate object per buffer 
instance.

Now, since this only works in Java9+, you could code it that way for those 
versions, and revert to the Unsafe equivalent for Java 8-. You could even 
convert the code above to code that dynamically uses VarHandles without 
requiring javac to know anything about them (using reflection and 
MethodHandles). Ugly PortableVarHandleExample that does that (and would run 
on Java 7...10) *might* follow...

On Tuesday, August 7, 2018 at 1:55:35 PM UTC-7, Todd Lipcon wrote:
>
> Hey folks,
>
> I'm working on reducing heap usage of a big server application that 
> currently holds on to tens of millions of generated FlatBuffer instances in 
> the old generation. Each such instance looks more or less like this:
>
> private static class FileDesc {
>   private final ByteBuffer bb;
>   int bbPos;
>
>   FileDesc(ByteBuffer bb) {
> bbPos = bb.getShort(bb.position()) + bb.position();
> this.bb = bb;
>   }
>
>   public int getVal() {
> return bb.getInt(bbPos);
>   }
> }
>
> (I've simplified the code, but the important bit is the ByteBuffer member 
> and the fact that it provides nice accessors which read data from various 
> parts of the buffer)
>
> Unfortunately, the heap usage of these buffers adds up quite a bit -- each 
> ByteBuffer takes 56 bytes of heap, and each 'FileDesc' takes 32 bytes after 
> padding. The underlying buffers themselves are typically on the order of 
> 100 bytes, so it seems like almost 50% of the heap is being used by wrapper 
> objects instead of the underlying data itself. Additionally, 2/3 of the 
> object count are overhead, which I imagine contributes to GC 
> scanning/marking time.
>
> In practice, all of the ByteBuffers used by this app are simply 
> ByteBuffer.wrap(byteArray). I was figuring that an easy improvement here 
> would be to simply store the byte[] and whenever we need to access the 
> contents of the FlatBuffer, use it as a flyweight:
>
>   new FileDesc(ByteBuffer.wrap(byteArray)).getVal();
>
> ... and let the magic of Escape Analysis eliminate those allocations. 
> Unfortunately, I've learned from this group that magic should be tested, so 
> I wrote a JMH benchmark: 
> https://gist.github.com/4b6ddf0febcc3620ccdf68e5f11c6c83 and found that 
> the ByteBuffer.wrap allocation is not eliminated.
>
> Has anyone faced this issue before? It seems like my only real option here 
> is to modify the flatbuffer code generator to generate byte[] members 
> instead of ByteBuffer members, so that the flyweight allocation would be 
> eliminated, but maybe I missed something more clever.
>
> -Todd
>


Re: synchronized and ReentrantLock are REALLY the same?

2018-05-21 Thread Gil Tene
That is indeed an interesting example, involving time-observation, where 
you do not want the time measurement to be delayed past a synchronization 
event.

While I'm not sure there is a "guaranteed to be correct" way to do this in 
the presence of a hyper-capable optimizer, there is a very practical way to 
do it in the real world: Surround the prior-to-the-block time-measurement 
with another potentially-blocking synchronization mechanism (a lock of some 
sort, like a synchronized block on a different object, or a different 
ReentrantLock). Now you have two locks on two variables. Under "normal 
conditions", an optimizer will not be allowed to change the in which the 
order with which the two lock operations are done since that might cause 
semantically visible effects, like deadlocks. And as long as the optimizer 
cannot reorder the two locks, you have the order you wanted for the timing 
operation.

The reason I believe that this is a practical solution but not a 
stands-up-to-hypothetically-infinitely-smart-optimizer one is that an 
optimizer CAN reorder the locks if it can prove that no semantically 
visible effects that violate the JMM or potentially cause deadlocks will 
occur. A good example of an allowed optimization would be ignoring the 
added lock operation if it can be proven to be thread-local (e.g. via 
escape analysis). You can (and should) probably attempt to prevent that to 
some degree by declaring the extra lock as a static variable. But that 
doesn't prohibit some super-smart code-and-thread analysis that would show 
that "under current conditions" only one thread ever actually operates on 
the lock, and de-optimizes all related code only once a second thread 
starts interacting with it. This may sound far-fetched, but it is a very 
possible future optimization for biased locks (if extensive profiling shows 
a hot lock that is constantly biased to a single thread, optimize all code 
that relates to that lock to ignore the lock/unlock operations, and make 
the de-biasing of that lock de-optimize all related code). You can (and 
should) try to combat that by trying to prevent biasing (e.g. make sure at 
least two threads lock/unlock the lock every once in a while, performing 
some state change to a shared variable under the lock), but while this 
can help prevent lasting bias, it doesn't prevent the optimization from 
happening (and then going away) temporarily within a time window. For the 
specific time-observation purpose you have here, this (two threads 
locking/unlocking e.g. each second) is probably good enough (as the 
possible optimization will not last more than e.g. 1 second, and your 
detector will still work), but I wouldn't rely on such a thing if stronger 
semantic correctness was at stake.

Note: the reason I'd do some state-changing operation on a static variable 
under the lock in the "bias-preventing" threads is to make sure that 
optimizers cannot somehow optimize away the lock block in those threads as 
effective no-ops for some reason.
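
A rough sketch of that recipe (names are mine, and this is the 
practical-but-not-bulletproof version described above): take the timestamp 
while holding a second, static lock, so that moving it into the monitored 
block would require reordering two lock acquisitions:

public class WatchdogSketch {
    // A second, static lock object. Taking the timestamp inside it means an optimizer
    // would have to reorder two lock operations to move the timestamp into the
    // monitored block, which it normally must not do (it could introduce deadlocks).
    private static final Object TIMING_LOCK = new Object();

    static volatile long enteredAtNanos; // polled by a separate watchdog thread

    static void monitoredCall(Runnable alienCall) {
        synchronized (TIMING_LOCK) {
            enteredAtNanos = System.nanoTime();
        }
        try {
            alienCall.run(); // e.g. the alien library's synchronized foo()
        } finally {
            enteredAtNanos = 0; // watchdog flags this thread if it stays non-zero too long
        }
    }
}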

On Monday, May 21, 2018 at 12:29:06 AM UTC-7, Francesco Nigro wrote:
>
> Preventing your appear-prior-to-lock-acquisition writes from "moving into 
>> the block" is subtly different from preventing their re-ordering with 
>> writes and reads that are within the block. Are you sure you want the 
>> former and not just the latter?
>
>
> Yes or, at least, it seems the only way I have found to implement a sort 
> of naive "deadlock"( or better, a "suspected" slowness) detector.
> Supposing to have a ordered executor API (ie and executor that ensure that 
> all the submitted tasks would be executed in order, but not necessarly by 
> one/the same thread) and 
> some alien client library with synchronized calls, in order to detect at 
> runtime that the synchronized calls are not deadlocked or simply unable to 
> leave the synchronized block due to 
> some suspected slowness I was thinking to implement a watchdog service 
> that monitor at interval each Thread used by the ordered executor, polling 
> the last elapsed time before a thread 
> was seen approaching to enter into a synchronized block.
> Translated into pseudo-code:
>
> you hava a 
>
> public synchronized foo(){
> }
>
> and some code:
>
> ...
> executor.submit(()->alien.foo());
> ...
>
> it should became:
> ...
> executor.submit(watchDog->{
> watchDog.beforeEnter();
> try {
> alien.foo();
> } finally {
> watchDog.exit(); 
> }
> });
> ...
> The API is not definitive at all, but the point is that 
> watchDog.beforeEnter(); should not be moved into the synchronized block 
> because that would not make it possible to compute the elapsed time in the
> synchronized block if some other thread is preventing ali

Re: synchronized and ReentrantLock are REALLY the same?

2018-05-18 Thread Gil Tene


On Friday, May 18, 2018 at 11:10:45 PM UTC-5, Gil Tene wrote:
>
>
>
> On Friday, May 18, 2018 at 2:30:08 AM UTC-5, Francesco Nigro wrote:
>>
>> Thanks Gil!
>>
>> I will fall back to my original (semi-practical) concern, using this 
>> renewed knowledge :)
>> Suppose that we want to perform write operations surrounding both a 
>> j.u.c. Lock and synchronized mutual exclusion block and we want:
>>
>>1. these writes operations to not being moved inside the block and 
>>maintain their relative positions from it
>>
>> Preventing your appear-prior-to-lock-acquisition writes from "moving into 
> the block" is subtly different from preventing their re-ordering with 
> writes and reads that are within the block. Are you sure you want the 
> former and not just the latter?
>

It is also worth noting that since any writes moving into the synchronized 
(or locked) block WILL appear atomically with any other reads and writes 
within the block to any other thread that synchronizes (or locks) on the 
same object (or lock), the only threads that may observe these 
reorder-into-the-block effects are ones that would also potentially observe 
the things *within* the block in non-atomic and non-synchronized ways.

So the purpose of preventing the reordering of things into the block looks 
suspicious, to begin with. I'm not saying there is no possible reason for 
it, just that it seems suspect. As in "uses non-synchronized accesses to 
things that are elsewhere intentionally done under a lock", which spells 
"many bugs can probably be found here" to me.

 

> It is easier to see how to prevent the latter. All you have to do is order 
> the earlier writes against those in-block writes and/or reads, ignoring the 
> lock. This is where, for example, a lazySet on the inside-the-block writes 
> will order them against the before-the-block writes, regardless of how the 
> monitor enter is dealt with, which can save you from using volatiles. If 
> there are reads within the block that you need to order against the 
> before-the-block writes, you'd need to use volatiles (e.g. a volatile store 
> for the last before-the-block write, AND a volatile load for the first 
> in-the-block read).
>
> If you actually want to prevent the former (and there are ways to observe 
> whether or not reordering "into" the block occur), you may need more 
> complicated things. But do you really need that? Someone may be able to 
> find some way to detect whether or not such reordering of the writes and 
> the lock-enter happens [I'm actually not sure whether such reordering, 
> without also reordering against writes and reads in the block, is 
> detectable]. And if that detection is possible, it also means that [by 
> definition] someone can build a concurrent algorithm that depends on the 
> reordering behavior not occurring. But it seems [to me] like a pretty 
> complicated and sensitive thing to build a dependency on.
>
>>
>>- 2.  the effects of the writes would be atomically readable from 
>>other threads
>>
>>  Which sort of atomicity are you referring to here? atomicity within a 
> single store (i.e. no word tearing of a long or a double), or atomicity 
> across multiple such stores?
>
> If it's the first, (no word tearing) a volatile write will ensure that (in 
> a fairly expensive way), but a lazyset will also ensure the same with a 
> much lower cost.
>
> If it's the second (atomicity across multiple such stores), you need 
> something like a synchronized block or a locked code region for the writes 
> that you need atomicity across to live in. Such a block can be coarsened 
> (e.g. joined with a subsequent block, or hoisted out of a loop), or 
> it may be optimized in various other ways (e.g. biased locking), but 
> whatever valid things happen to it, the atomicity across the writes within 
> it will remain when seen by other threads.
>
>> Given that we can't assume any semantic difference between the j.u.c Lock 
>> and intrinsic ones and there aren't clearly listed (but on specific 
>> implementations/the Cookbook) any effects on the surrounding code, how we 
>> can implement it correctly?
>> And...we can do it just with this knowledge?
>> The only solution I see is by using a volatile store on both (or at least 
>> on the first) writes operation, while ordered (aka lazySet) ones can't work 
>> as expected. 
>>
>> Cheers,
>> Franz
>>
>>
>> Il giorno mercoledì 16 maggio 2018 17:26:59 UTC+2, Gil Tene ha scritto:
>>>
>>> Note that Doug' Lea's JMM cookbook is written for implementors of JDKs 
>>> and related libraries and JITs, NOT fo

Re: synchronized and ReentrantLock are REALLY the same?

2018-05-18 Thread Gil Tene


On Friday, May 18, 2018 at 2:30:08 AM UTC-5, Francesco Nigro wrote:
>
> Thanks Gil!
>
> I will fall back to my original (semi-practical) concern, using this 
> renewed knowledge :)
> Suppose that we want to perform write operations surrounding both a j.u.c. 
> Lock and synchronized mutual exclusion block and we want:
>
>1. these writes operations to not being moved inside the block and 
>maintain their relative positions from it
>
> Preventing your appear-prior-to-lock-acquisition writes from "moving into 
the block" is subtly different from preventing their re-ordering with 
writes and reads that are within the block. Are you sure you want the 
former and not just the latter?

It is easier to see how to prevent the latter. All you have to do is order 
the earlier writes against those in-block writes and/or reads, ignoring the 
lock. This is where, for example, a lazySet on the inside-the-block writes 
will order them against the before-the-block writes, regardless of how the 
monitor enter is dealt with, which can save you from using volatiles. If 
there are reads within the block that you need to order against the 
before-the-block writes, you'd need to use volatiles (e.g. a volatile store 
for the last before-the-block write, AND a volatile load for the first 
in-the-block read).
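
A minimal sketch of that lazySet point (field names are mine): the ordered 
store inside the block keeps the before-the-block plain write visible to any 
thread that observes the in-block write, without paying for a volatile store:

import java.util.concurrent.atomic.AtomicLong;

public class ReleaseOrderingSketch {
    static int beforeBlockData;                       // plain write, done before the lock
    static final AtomicLong inBlockMarker = new AtomicLong();
    static final Object lock = new Object();

    static void writer() {
        beforeBlockData = 42;                         // before-the-block write
        synchronized (lock) {
            // Ordered ("release" style) store: it cannot be reordered ahead of the
            // plain write above, however the monitor-enter itself is optimized.
            inBlockMarker.lazySet(1);
        }
    }

    static void reader() {
        if (inBlockMarker.get() == 1) {               // volatile (acquire) read
            // At this point beforeBlockData == 42 is guaranteed to be visible.
            System.out.println(beforeBlockData);
        }
    }
}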

If you actually want to prevent the former (and there are ways to observe 
whether or not reordering "into" the block occur), you may need more 
complicated things. But do you really need that? Someone may be able to 
find some way to detect whether or not such reordering of the writes and 
the lock-enter happens [I'm actually not sure whether such reordering, 
without also reordering against writes and reads in the block, is 
detectable]. And if that detection is possible, it also means that [by 
definition] someone can build a concurrent algorithm that depends on the 
reordering behavior not occurring. But it seems [to me] like a pretty 
complicated and sensitive thing to build a dependency on.

>
>- 2.  the effects of the writes would be atomically readable from 
>other threads
>
>  Which sort of atomicity are you referring to here? atomicity within a 
single store (i.e. no word tearing of a long or a double), or atomicity 
across multiple such stores?

If it's the first, (no word tearing) a volatile write will ensure that (in 
a fairly expensive way), but a lazyset will also ensure the same with a 
much lower cost.

If it's the second (atomicity across multiple such stores), you need 
something like a synchronized block or a locked code region for the writes 
that you need atomicity across to live in. Such a block can be coarsened 
(e.g. joined with a subsequent block, or hoisted out of a loop), or 
it may be optimized in various other ways (e.g. biased locking), but 
whatever valid things happen to it, the atomicity across the writes within 
it will remain when seen by other threads.

Given that we can't assume any semantic difference between the j.u.c Lock 
> and intrinsic ones and there aren't clearly listed (but on specific 
> implementations/the Cookbook) any effects on the surrounding code, how we 
> can implement it correctly?
> And...we can do it just with this knowledge?
> The only solution I see is by using a volatile store on both (or at least 
> on the first) writes operation, while ordered (aka lazySet) ones can't work 
> as expected. 
>
> Cheers,
> Franz
>
>
> Il giorno mercoledì 16 maggio 2018 17:26:59 UTC+2, Gil Tene ha scritto:
>>
>> Note that Doug' Lea's JMM cookbook is written for implementors of JDKs 
>> and related libraries and JITs, NOT for users of those JDKs and libraries. 
>> It says so right in the title. It describes rules that would result in a 
>> *sufficient* implementation of the JMM but is not useful for deducing 
>> the *required or expected *behavior of all JMM implementations. Most JMM 
>> implementations go beyond the cookbook rules in at least some places and 
>> apply JMM-valid transformations that are not included in it and can be 
>> viewed as "shortcuts" that bypass some of the rules in the cookbook. There 
>> are many examples of this in practice. Lock coarsening and lock biasing 
>> optimizations are two good example sets.
>>
>> This means that you need to read the cookbook very carefully, and 
>> (specifically) that you should not interpret it as a promise of what the 
>> relationships between various operations are guaranteed to be. If you use 
>> the cookbook for the latter, your code will break.
>>
>>
>> Putting aside the current under-the-hood implementations of monitor 
>> enter/exit and of ReentrantLock (which may and will change), the 
>> requirements are clear:
>>
>> from e.g. 

Re: Analysis of perf sched for simple server.

2018-04-23 Thread Gil Tene


On Monday, April 23, 2018 at 12:33:17 PM UTC-7, John Hening wrote:
>
> Hello,
>
> 1. I have a simple one-threaded tcp server written in Java. I try to 
> measure its receive-and-process-response latency. More preceisely, the 
> client sends (by loopback) the 128-bytes message with a timestamp in the 
> header. The server receives a message, reads a content byte by byte and 
> compute a difference between `now` and the timestamp from the header. The 
> difference is more or less 6μs.
>

That "more or less" is probably a pretty a big range ;-).  As in 
it's probably 6usec +/- 10% or more. As in "that occasional 6msec 
glitch you'll be very lucky stay below is 1000x as big as your common case".
 

>
> Now, I am trying to make it faster.
>
> But, firstly I would like to examine a scheduling issue. So, I've 
> collected results with:
>
> perf sched record -p  -- sleep 10
>
>
>
> and then:
>
> perf sched timehist -V
>
>
>
> Excerpt from the result is presented below: (I've filtered it for my 
> server thread)
>
> 
>            time    cpu      task name [tid/pid]     wait time  sch delay   run time
>                                                        (msec)     (msec)     (msec)
> --------------- ------  --------------------------  ---------  ---------  ---------
>
>    21849.531842 [0002] s java[7340/7338]                 0.000      0.000     56.003
>    21849.587844 [0002] s java[7340/7338]                 0.000      0.000     56.001
>    21849.607838 [0002] s java[7340/7338]                 0.000      0.000     19.994
>    21849.615836 [0002] s java[7340/7338]                 0.000      0.000      7.998
>
> ...
>    21849.691834 [0002] s java[7340/7338]                 0.000      0.000      4.000
>    21849.703837 [0001] s java[7340/7338]                 0.000      0.000     38.330
>    21849.711838 [0005] s java[7340/7338]                 0.000      0.000      0.000
>    21849.719834 [0005] s java[7340/7338]                 0.000      0.000      7.996
>
>
>
> My question is:
> How is it possible that wait_time is always zero? After all, it is 
> impossible. My CPU is multicore but there is a lot of processes (threads) 
> that needs CPU time. How to interpret that?
> On the other hand, I am going to reduce CPU migration. Perhaps it will 
> help :-)
>
>
> 2. The second issue is:
>
> The excerpt from perf sched script
>
> java  7340 [002] 21848.012360:   sched:sched_wakeup: comm=java pid=7511 prio=120 target_cpu=005
> java  7340 [002] 21848.012375:   sched:sched_wakeup: comm=java pid=7511 prio=120 target_cpu=005
> java  7340 [002] 21848.012391:   sched:sched_wakeup: comm=java pid=7511 prio=120 target_cpu=005
> java  7340 [002] 21848.012406:   sched:sched_wakeup: comm=java pid=7511 prio=120 target_cpu=005
> ...
> swapper     0 [007] 21848.012554:   sched:sched_wakeup: comm=java pid=7377 prio=120 target_cpu=007
> swapper     0 [007] 21848.012554:   sched:sched_wakeup: comm=java pid=7377 prio=120 target_cpu=007
> ...
> java  7340 [002] 21848.012555:   sched:sched_wakeup: comm=java pid=7511 prio=120 target_cpu=005
> java  7377 [007] 21848.012582: sched:sched_stat_runtime: comm=java pid=7377 runtime=37420 [ns] vruntime=1628300433237 [ns]
> java  7340 [002] 21848.012585:   sched:sched_wakeup: comm=java pid=7511 prio=120 target_cpu=005
> java  7377 [007] 21848.012587:   sched:sched_switch: prev_comm=java prev_pid=7377 prev_prio=120 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120
>
> Why my server receives sched_wakup that looks like spurious wakeups? What 
> is a swapper?
>
> Please explain and help :-)
>
>



Re: Exclusive core for a process, is it reasonable?

2018-04-08 Thread Gil Tene
“Reasonable people adapt themselves to the world. Unreasonable people 
attempt to adapt the world to themselves. All progress, therefore, depends 
on unreasonable people.”― George Bernard Shaw 

To your question tho: there are plenty of tools available in Linux today to 
control how cores are used across processes. E.g. between numactl, cpusets, 
taskset, and isolcpus, you can shape the way the scheduler chooses which 
cores are used by which processes and threads pretty well.

On Sunday, April 8, 2018 at 5:51:52 AM UTC-7, John Hening wrote:
>
> Hello,
>
> I've read about thread affinity and I see that it is popular in 
> high-performance-libraries (for example 
> https://github.com/OpenHFT/Java-Thread-Affinity). OK, juggling a thread 
> between cores generally has an impact on performance, so it is reasonable to 
> bind a specific thread to a specific core. 
>
> *Intro*:
> The obvious ideal would be to make it possible for a process to be the owner 
> of a core [let's call it X] (in a multi-core CPU). I mean that the main 
> thread of the process would be the one and only thread executed on core X, so 
> there would be no problem with context switching and cache flushing [except 
> for system calls]. 
> I know that it requires a special implementation of the scheduler in the kernel, 
> so it requires a modification of the [Linux] kernel. I know that it is not so 
> easy, and so on.
>
> *Question*:
> But we know that we have systems that need high performance, so eliminating 
> context switching once and for all could be a solution. Why is there no such 
> solution? My suspicions are:
>
> * it is pointless, the bottleneck is elsewhere [however, it is still meaningful 
> to use thread affinity]
> * it is too hard, and there is too much risk of getting it wrong
> * there is no need
> * forking your own Linux kernel doesn't sound like a good idea.
>
>
>



Re: Avoiding expensive memory barriers

2018-03-21 Thread Gil Tene
Daniel, if what you originally meant is "You don't need memory barriers 
that incur any costs on x86 to implement an SPSC queue.", instead of the 
stated "You don't need memory barriers to implement an SPSC queue for 
x86.", you and Avi may be on the same page. I think.

Barriers are barriers no matter what level you express them in. Arguably, 
all source-code-stated barriers are "compiler barriers". And from an 
expression point of view, most barriers are "language memory model 
barriers" (this is true even in C or MASM when calling 
implementation-specific things that establish a required (or the more 
common I-sure-hope-the-compiler-listens-to-me-and-doesn't-mess-with-this) 
ordering). At the language/source-code level, some barriers might be 
explicit and some are implicit (e.g. a read from a volatile field in Java 
includes implicit barrier semantics, but a VarHandle.loadLoadFence() is an 
explicit barrier). The compiler (or macro assembler, etc.) will adhere to 
the language semantics barriers in its own choices on ordering and 
potential elimination of code (before any CPU instructions actually get 
emitted), and it may translate some of those semantic barriers to machine 
instructions that enforce them (stated or implicit). The x86 memory model 
includes all sorts of implicit and "free" barriers. e.g. x86 instructions 
generally imply load-load, load-store, and store-store ordering, but not 
store-load (only specific x86 instructions establish store-load 
ordering). 

On x86, compilers don't need to emit barrier instructions in order to 
maintain load-load, load-store, or store-store ordering, or to enforce 
acquire fences (loadLoad and loadStore combined) and release fences 
(loadStore and storeStore combined). But the compilers themselves still 
need to know what fences/barriers they need to enforce in their own 
code-jumbling. E.g. without a source-code-semantics barrier requiring 
store-store ordering, the compiler may freely reorder any two stores in your 
code, and emit the stores in that new order, which can easily wreck the 
correctness of otherwise very-fast-on-x86 SPSC queue code, even on x86.

None of this opines on whether or not SPSC can be done without a store-load 
barrier so that on x86 no barrier instructions would be needed. I don't 
know if it can.
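To make the source-level requirement concrete, here is a minimal (illustrative, not 
production-hardened) single-producer/single-consumer ring expressed with Java 9+ 
VarHandles. On x86 the release stores and acquire loads cost nothing extra in emitted 
barrier instructions, but they are exactly what tells the compiler it may not reorder 
the element store with the index publication:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    final class SpscQueue<E> {
        private static final VarHandle P_IDX, C_IDX;
        static {
            try {
                MethodHandles.Lookup l = MethodHandles.lookup();
                P_IDX = l.findVarHandle(SpscQueue.class, "producerIdx", long.class);
                C_IDX = l.findVarHandle(SpscQueue.class, "consumerIdx", long.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        private final Object[] buffer;
        private final int mask;
        private long producerIdx;   // written by the producer thread only
        private long consumerIdx;   // written by the consumer thread only

        SpscQueue(int capacityPowerOfTwo) {
            buffer = new Object[capacityPowerOfTwo];
            mask = capacityPowerOfTwo - 1;
        }

        boolean offer(E e) {                                  // producer thread only
            long p = producerIdx;                             // plain read of own index
            if (p - (long) C_IDX.getAcquire(this) == buffer.length) return false; // full
            buffer[(int) (p & mask)] = e;                     // plain store into the slot
            P_IDX.setRelease(this, p + 1);                    // release store publishes the slot
            return true;
        }

        @SuppressWarnings("unchecked")
        E poll() {                                            // consumer thread only
            long c = consumerIdx;                             // plain read of own index
            if (c == (long) P_IDX.getAcquire(this)) return null;   // empty
            E e = (E) buffer[(int) (c & mask)];               // visible after the acquire above
            buffer[(int) (c & mask)] = null;
            C_IDX.setRelease(this, c + 1);
            return e;
        }
    }

Remove the release/acquire pairing and the code may still "work" in many runs on x86, 
right up until the compiler decides to emit the index store before the element store.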

On Monday, March 19, 2018 at 7:41:41 AM UTC-7, Daniel Eloff wrote:
>
> We're getting a little confused on the terminology. That's a compiler 
> barrier, as it prevents the compiler from reordering certain instructions 
> beyond it (I don't think relaxed prevents any reordering, but release and 
> acquire do.) I know you understand this stuff given your background, I just 
> want to clarify the terminology for the sake of the discussion.
>
> The original post and article discuss real memory barriers like mfence. 
> These prevent the CPU from reordering loads and stores. Which should be 
> unnecessary for SPSC queues on x86 because it gives strong enough 
> guarantees about reordering, in this case, without that.
>
>
> On Mon, Mar 19, 2018, 1:19 AM Avi Kivity  
> wrote:
>
>> The release write is a memory barrier. It's not an SFENCE or another 
>> fancy instruction, but it is a memory barrier from the application writer's 
>> point of view.
>>
>>
>> The C++ code
>>
>>
>> x.store(5, std::memory_order_relaxed)
>>
>> has two effects on x86:
>>
>>   1. generate a write to x that is a single instruction (e.g. mov $5, x)
>>   2. prevent preceding writes from being reordered by the compiler (they 
>> are implicitly ordered by the processor on x86).
>>
>>
>>
>> On 03/18/2018 08:16 PM, Dan Eloff wrote:
>>
>> You don't need memory barriers to implement an SPSC queue for x86. You 
>> can do a relaxed store to the queue followed by a release write to 
>> producer_idx. As long as consumer begins with an acquire load from 
>> producer_idx it is guaranteed to see all stores to the queue memory before 
>> producer_idx, according to the happens before ordering. There are no memory 
>> barriers on x86 for acquire/release semantics.
>>
>> The release/acquire semantics have no meaning when used with different 
>> memory locations, but if used on producer_idx when synchronizing the 
>> consumer, and consumer_idx when synchronizing the producer, it should work.
>>
>>
>>
>> On Thu, Feb 15, 2018 at 8:29 AM, Avi Kivity > > wrote:
>>
>>> Ever see mfence (aka full memory barrier, or std::memory_order_seq_cst) 
>>> taking the top row in a profile? Here's the complicated story of how we 
>>> took it down:
>>>
>>>
>>> https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/
>>>

Re: Looking at reordering memory operations

2018-03-10 Thread Gil Tene


On Saturday, March 10, 2018 at 1:18:40 PM UTC-8, John Hening wrote:
>
> Gil, thanks for your response. It is very helpful. 
>
> In your specific example above, there is actually no ordering question, 
> because your writeTask() operation doesn't actually observe the state 
> changed by connection.configureBlocking(false)
>
>
> I agree that my question wasn't correct. There is not 'ordering'. I meant 
> visibility. 
>

Visibility and ordering are related. Questions about the visibility of 
state (like that of the field "blocking") apply only to things that 
interact with that state. And when things interact with some state, the 
ordering in which changes to that state becomes visible (with relation to 
changes to other state, like e.g. the enqueing of an operation via 
Executor.execute(), being visible) has (or doesn't have) certain guarantees.

E.g. in the example discussed, with the synchronized blocks in place on 
both the writer and the reader of the field "blocking", we are guaranteed 
that the change of "blocking = false" is visible to the thread that 
executes writeTask() (if writeTask actually uses the value of "blocking" 
obtained within the synchronized block) before the request to execute 
write() is visible to that same thread...
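As a stripped-down illustration of that "both ends" requirement (a hypothetical mirror 
of the JDK code shape, not the actual AbstractSelectableChannel source):

    class ChannelLike {
        private final Object regLock = new Object();
        private boolean blocking = true;

        void configureBlocking(boolean block) {
            synchronized (regLock) {     // writer end: release on monitor exit
                blocking = block;
            }
        }

        boolean isBlocking() {
            synchronized (regLock) {     // reader end: acquire on monitor enter
                return blocking;
            }
        }
    }

If isBlocking() read the field outside the synchronized block, the reader end of the 
chain would be missing, and the executor thread could legitimately observe a stale value.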
 

>  Without the use of synchronized in isBlocking(), the use of synchronized 
> in configureBlocking() wouldn't make a difference.
>
> Yes, semi-synchronized doesn't work. So, I conclude that without a 
> synchronization the result of `blocking = false` could be invisible for 
> writeTask, am I right?
>

It is not a matter of being invisible. It's a field in a shared object, so 
all operations on it are eventually visible (to things that access it). 
What you can certainly say here is that without using synchronized blocks 
*on both ends* (both the writer and the reader), and without them being replaced 
by some other ordering mechanism *on both ends*, your writeTask could 
observe a value of "blocking" that predates the modification of it in the 
thread that calls connection.configureBlocking().
 

> As your question about the possibility of "skipping" some write operations.
>
>
> By skipping I meant 'being invisible to observers'. For example, if one 
> thread t1 reads a non-volatile integer x, then it is possible that t1 always 
> sees the same value of x (even though there is another thread t2 that modifies 
> x).
>

That (t1 always seeing the same value of x when x is modified elsewhere) is 
possible, e.g. in a tight loop reading x and nothing else. But that will 
only happen if no other ordering constructs force the visibility of 
modifications to x. E.g. if thread t1 reads some volatile field y that thread 
t2 modifies after modifying x, then thread t1 will observe the modified 
value of x in reads that occur after observing the modified value of y. In 
such a case, it won't "always see the same value of x".
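A tiny sketch of that pattern (field and method names are made up for illustration):

    class Publication {
        int x;                 // plain field
        volatile int y;        // volatile field used as the publication point

        void writer() {        // runs on t2
            x = 42;            // plain write
            y = 1;             // volatile write: everything written above becomes visible
        }

        void reader() {        // runs on t1
            if (y == 1) {      // volatile read observes the write to y...
                int r = x;     // ...so this read is guaranteed to observe 42
            }
        }
    }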


>
>
> 1. It is interesting for me what about a such situation:
>
>while(true) {
> SocketChannel connection = serverSocketChannel.accept();
> connection.configureBlocking(false);
> Unsafe.storeFence();
> executor.execute(() -> writeTask(connection));
> }
> void writeTask(SocketChannel s){
> (***)
> any_static_global_field = s.isBlocking();
> }
>
> For my eye it should work but I have doubts. What does it mean storeFence? 
> Please flush it to the memory immediately! 
>

Unsafe.storeFence doesn't mean "flush...". It means "Ensures lack of 
reordering of stores before the fence with loads or stores after the 
fence." (that's literally what the Javadoc for it says).
 

> So, it will be visible before starting the executor thread. But, it seems 
> that, here, load fence is not necessary (***). Why? The blocking field must 
> be read from memory (there is no possibility that it is cached, because it 
> is read the first time by the executor thread). When it comes to CPU cache 
> it may be cached but cache is coherent = no problem). Moreover, there is no 
> need to ensure ordering here. So, loadFence is not necessary. Yes? 
>

No. At least not quite. For this specific sequence, you already have the 
ordering you want, but not for the reasons you think.

First, please put aside this notion that there is some memory, and some 
cache or store buffer, and some flushing going on. This ordering and 
visibility stuff has nothing to do with any of those potential 
implementation details. and trying to explain things in terms of those 
potential (and incomplete) implementation details mostly serves to confuse 
the issue. A tip I give people for thinking about this stuff is: Always 
think of the compiler as the culprit when it comes to reordering, and in 
that thinking, imagine the compiler being super-smart and super-mean. The 
compiler is allowed to create all sorts of evil, ingenious and 
pre-cognitive reorderings, cachings, and redundant or dead operation 
eliminations (including pre-caching of values it thinks you 

Re: Looking at reordering memory operations

2018-03-10 Thread Gil Tene


On Saturday, March 10, 2018 at 1:18:40 PM UTC-8, John Hening wrote:
>
> Gil, thanks for your response. It is very helpful. 
>
> In your specific example above, there is actually no ordering question, 
> because your writeTask() operations doesn't actually observe the state 
> changed by connection.configueBlocking(false)
>
>
> I agree that my question wasn't correct. There is not 'ordering'. I meant 
> visibility. 
>

No real difference between ordering and visibility...
 

>
>
>  Without the use of synchronized in isBlocking(), the use of synchronized 
> in configureBlocking() wouldn't make a difference.
>
> Yes, semi-synchronized doesn't work. So, I conclude that without a 
> synchronization the result of `blocking = false` could be invisible for 
> writeTask, am I right?
>
>
> As your question about the possibility of "skipping" some write operations.
>
>
> By skipping I meant 'being invisible for observers'. For example, if one 
> thread t1 read any not-volatile-integer x then it is possible that t1 see 
> always the same value of x (though there is another thread t2 that modifies 
> x). 
>
>
> 1. It is interesting for me what about a such situation:
>
>while(true) {
> SocketChannel connection = serverSocketChannel.accept();
> connection.configureBlocking(false);
> Unsafe.storeFence();
> executor.execute(() -> writeTask(connection));
> }
> void writeTask(SocketChannel s){
> (***)
> any_static_global_field = s.isBlocking();
> }
>
> For my eye it should work but I have doubts. What does it mean storeFence? 
> Please flush it to the memory immediately! So, it will be visible before 
> starting the executor thread. But, it seems that, here, load fence is not 
> necessary (***). Why? The blocking field must be read from memory (there is 
> no possibility that it is cached, because it is read the first time by the 
> executor thread). When it comes to CPU cache it may be cached but cache is 
> coherent = no problem). Moreover, there is no need to ensure ordering here. 
> So, loadFence is not necessary. Yes? 
>
> 2. 
> volatile int foo;
> ...
> foo = 1;
> foo = 2;
> foo = 3;
>
>
>
> It is very interesting. So, after JITed on x86 it can look like:
>
> mov , 1
> sfence
> mov , 2
> sfence
> mov , 3
> sfence
>
>
>
> Are you sure that CPU can execute that as:
> mov , 3
> sfence
>
>
> ?
>
> I know that: 
>
> mov , 1
> mov , 2
> mov , 3 
>
>
>
> An x86 CPU can legally optimize it. 
>
> On Friday, 9 March 2018 at 23:20:37 UTC+1, John Hening wrote:
>>
>>
>> executor = Executors.newFixedThreadPool(16);
>> while(true) {
>> SocketChannel connection = serverSocketChannel.accept();
>> connection.configureBlocking(false);
>> executor.execute(() -> writeTask(connection)); 
>> }
>> void writeTask(SocketChannel s){
>> s.isBlocking();
>> }
>>
>> public final SelectableChannel configureBlocking(boolean block) 
>> throws IOException
>> {
>> synchronized (regLock) {
>> ...
>> blocking = block;
>> }
>> return this;
>> }
>>
>>
>>
>> We see the following situation: the main thread is setting 
>> connection.configureBlocking(false)
>>
>> and another thread (launched by executor) is reading that. So, it looks 
>> like a datarace.
>>
>> My question is:
>>
>> 1. Here 
>> configureBlocking
>>
>> is synchronized so it behaves as memory barrier. It means that code is 
>> ok- even if reading/writing to 
>> blocking
>>
>> field is not synchronized- reading/writing boolean is atomic.
>>
>> 2. What if 
>> configureBlocking
>>
>> wouldn't be synchronized? What in a such situation? I think that it would 
>> be necessary to emit a memory barrier because it is theoretically possible 
>> that setting blocking field could be reordered. 
>>
>> Am I right?
>>
>



Re: Looking at reordering memory operations

2018-03-10 Thread Gil Tene
There are many ways in which desired order may be achieved, and you need to 
examine all code that interacts with the state elements to reason about how 
(and if) ordering is enforced. Removing the "synchronized" alone  (in 
some seemingly working code) is "always a bad idea". But that doesn't make 
it "required". The question is, what scheme would you replace it with...

In your specific example above, there is actually no ordering question, 
because your writeTask() operation doesn't actually observe the state 
changed by connection.configureBlocking(false). It comes close, but the 
fact that nothing is done with the return value of isBlocking() means that 
you have nothing to ask an ordering question about. [The entire call 
to isBlocking() is dead code, and will be legitimately eliminated by JIT 
compilers after inlining.] However, if you change the example slightly such 
that writeTask() propagated the value of isBlocking() somewhere (e.g. to 
some static volatile boolean), we'd have a question to deal with. 
So let's assume you did that...

In the specific case of the SocketChannel implementation and the example 
above, the modification of the SocketChannel-internal blocking state in 
connection.configureBlocking(false); is guaranteed to be visible to the 
potential observation of the same state by the writeTask() operation run by 
some executor thread because *both* configureBlocking() and isBlocking() use 
a synchronized block around the access to the "blocking" field (e.g. at 
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/nio/channels/spi/AbstractSelectableChannel.java#AbstractSelectableChannel.isBlocking%28%29).
 
Without the use of synchronized in isBlocking(), the use of synchronized in 
configureBlocking() wouldn't make a difference.

There are many other ways that this ordering guarantee could be achieved. 
E.g. (for this specific sequence of T1: blocking = false; enqueue 
writeTask() operation; and T2: start writeTask() operation; writeTask 
returns the value of blocking;) making the SocketChannel-internal blocking field 
volatile would also ensure the writeTask() operation above would return 
blocking = false. And many other means for ordering these are possible.

As to your question about the possibility of "skipping" some write operations: 
Write operations are never actually "skipped" (they will eventually 
happen). But in situations where a write is followed by a subsequent 
over-writing write to the same field, the code can legitimately act like it 
"ran fast enough that no-one was able to observe the intermediate state", 
and simply execute the last write. The CPU can do this. The Compiler can do 
this. And the thread running fast enough can do this. It is important to 
understand that this is true even when synchronization and other 
ordering operations exist. E.g. the following sequence:

synchronized(regLock) {
  blocking = false;
}
synchronized(regLock) {
  blocking = true;
}
synchronized(regLock) {
  blocking = false;
}

Can be legitimately executed as:

synchronized(regLock) {
  blocking = false;
}

And the sequence:

volatile int foo;
...
foo = 1;
foo = 2;
foo = 3;

can (and will) be legitimately executed as:
foo = 3; 

Ordering questions only come into play if you put other memory-interacting 
things in the middle, between those writes. Then questions about whether or 
not those other things can be re-ordered with the writes come up. Sometimes 
the rules prevent such re-ordering (forcing the actual intermediate writes 
to be executed), and sometimes the rules allow re-ordering (allowing e.g. 
writes in loops to be pushed to happen only once when the loop completes). 
In general, in Java, unless some of the "other thing" players are volatile, 
atomics, or synchronized blocks, any reordering is allowed as long as it 
does not change the eventual meaning of the computations in the sequence. 


On Saturday, March 10, 2018 at 8:13:53 AM UTC-8, John Hening wrote:
>
> ok, reordering is not a good idea to consider here. But, please note that 
> if configureBlocking wasn't synchronized then a statement:
>
> blocking = block
>
> could be "skipped" on compilation level because JMM doesn't guarantee you 
> that every access to the memory will be "committed" to the main memory. 
> synchronized method/ putting memory barrier would solve that problem. What 
> do you think?
>
>
> On Friday, 9 March 2018 at 23:20:37 UTC+1, John Hening wrote:
>>
>>
>> executor = Executors.newFixedThreadPool(16);
>> while(true) {
>> SocketChannel connection = serverSocketChannel.accept();
>> connection.configureBlocking(false);
>> executor.execute(() -> writeTask(connection)); 
>> }
>> void writeTask(SocketChannel s){
>> s.isBlocking();
>> }
>>
>> public final SelectableChannel configureBlocking(boolean block) 
>> throws IOException
>> {
>> synchronized (regLock) {
>> ...
>> blocking = 

Re: High run-queue debugging

2018-01-29 Thread Gil Tene
When you say "run-queue" below, I assume you are referring to the "r" 
column in vmstat output, right? That's not the run queue depth...

While the vmstat man pages (e.g. https://linux.die.net/man/8/vmstat) 
*misleadingly* describe the "r" column with "r: The number of processes 
waiting for run time.", vmstat (see source e.g. here 
https://gitlab.com/procps-ng/procps/blob/master/vmstat.c) actually picks 
the value for the "r" column up from the procs_running value in /proc/stat, 
with one very important tweak: it subtracts one from that number so as not 
to count the vmstat process itself. (see the actual scraping and -1 math 
here: https://gitlab.com/procps-ng/procps/blob/master/proc/sysinfo.c#L577). 

The actual meaning of procs_running (per 
https://www.kernel.org/doc/Documentation/filesystems/proc.txt) is "the 
total number of threads that are running or ready to run (i.e., the total 
number of runnable threads)." 

Annoying, but "that's how it's always been". Someone should really clean up 
that vmstat man page (e.g. 
https://gitlab.com/procps-ng/procps/blob/master/vmstat.8) (he said lazily, 
without volunteering).

So, back to your question, and assuming my logic above is correct:

- Given your description of having at least 6 hot-spinning threads, I'd 
expect the "r" column to never drop below 6 unless your spinning threads 
use some blocking logic (e.g. locking). So the fact that you see a min of 4 
is actually the only "strange" thing I see.

- Occasional spikes to much higher procs_running numbers are very normal. 
Each runnable thread counts, and there are all sorts of things on healthy 
Linux systems that spike that number up temporarily. The point in time 
during which vmstat sampled the procs_running number may just have been 
within the short period during which some multi-threaded process did 42 
milliseconds worth of work using 25 active threads, and then went idle 
again.

- For a system with 32 or 64 vcores (is it 16 x 2 hyperthreaded cores, or 
16 hyperthreaded cores?), you seem to have enough run queues to handle the 
peak number of runnable threads you have observed (31 + the 1 vmstat thread 
doing the reporting). However, I would note that the vmstat only samples 
this number once per reporting cycle (so e.g. once per second), which means 
that it is much more likely than not to miss spikes that are large in 
amplitude but very short in duration. Your actual max may actually be much 
higher than 31. 

- If you want to study your actual peak number of runnable processes, you 
want to watch procs_running in /proc/stat at a much finer time granularity. 
You can take a look at my example LoadMeter tool 
 for something that does that. It 
samples this number every millisecond or so, and reports interval 
histograms of the procs_running level in HdrHistogram logs (a "now common 
format" which can be plotted and analyzed with various tools, including 
e.g. https://github.com/HdrHistogram/HistogramLogAnalyzer). Loadmeter's 
entire purpose is to better track the behavior of the max procs_running at 
any given time, such that spikes longer than 1-2 milliseconds won't hide. I 
find it to be very useful when trying to triage and e.g. confirm or 
disprove the common "do I have temporary spikes of lots of very short 
running threads causing latency spikes even at low reported CPU% levels" 
question.
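For reference, the core of such a sampler is small. This is not LoadMeter itself, just 
a rough sketch of the idea using the HdrHistogram library, reading procs_running from 
/proc/stat the way vmstat does and subtracting one for the sampler itself:

    import org.HdrHistogram.Histogram;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class RunnableSampler {
        public static void main(String[] args) throws Exception {
            Histogram hist = new Histogram(3);                 // 3 significant digits, auto-resizing
            long end = System.currentTimeMillis() + 10_000;    // sample for ~10 seconds
            while (System.currentTimeMillis() < end) {
                for (String line : Files.readAllLines(Paths.get("/proc/stat"))) {
                    if (line.startsWith("procs_running")) {
                        long running = Long.parseLong(line.split("\\s+")[1]) - 1; // exclude ourselves
                        hist.recordValue(Math.max(running, 0));
                        break;
                    }
                }
                Thread.sleep(1);                               // ~1 ms sampling interval
            }
            hist.outputPercentileDistribution(System.out, 1.0);
        }
    }

The percentile output will show the short spikes that a once-per-second vmstat sample 
is very likely to miss.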

- In addition, keep in mind that without special treatment, "enough run 
queues for the peak number of running threads" works within the reaction 
time of cross-core load balancing in the scheduler, and that means that 
there can be many of milliseconds during which one core has lots of threads 
waiting in it's run queue while other cores are idle (and have not yet 
chosen to steal work from other run queues). Schedukler load balancing 
(across cores) is a long subject on it's own, and for a good contemporary 
set of gripes about cross-core load balacing, you can read Andrian Colyer's 
nice summary 

 
of The Linux Scheduler: a Decade of Wasted Cores 
 (and then read 
the full paper if you want).

There is much you can do to control which cores do what if you want to 
avoid temporary load spikes on a single core causing embarrassing hiccups 
in your low-latency spinners (as threads that wake up on the same core as 
the spinning thread steal cpu away from it before being load balanced to 
some other core). E.g. when you have 6 hot-spinning threads, a common practice is 
to use isolcpus (or some other keep-these-cores-away-from-others mechanism, 
like cpu sets) and to assign each of your spinners to a specific core (with 
e.g. taskset or some API calls), such that no other thread will compete 
with them**.

-- Gil.

** Helpful "may save you some serious frustration" hint for when you use 
isolcpus: Keep in mind to avoid the common 

Re: Call site (optimizations)

2018-01-27 Thread Gil Tene
Some partial answers inline below, which can probably be summarized with 
"it's complicated...".

On Friday, January 26, 2018 at 8:33:53 AM UTC-8, Francesco Nigro wrote:

HI guys,
>
> in the last period I'm having some fun playing with JItWatch (many thanks 
> to Chris Newland!!!) and trying to understand a lil' bit more about 
> specific JVM optimizations, but suddenly I've found out that I was missing 
> one of the most basic definition: call site.
>
> I've several questions around that:
>
>1. There is a formal definition of call site from the point of view of 
>the JIT?
>
>  I don't know about "formal". But a call site is generally any location in 
the bytecode of one method that explicitly causes a call to another method. 
These include:

classic bytecodes used for invocation:
- Virtual method invocation (invokevirtual and invokeinterface, both of 
which call a non-static method on an instance), which in Java tends to 
dynamically be the most common form.
- Static method invocations (invokestatic)
- Constructor/initializer invocation (invokespecial)
- Some other cool stuff (private instance method invocation with 
invokespecial, native calls, etc.)

In addition, you have these "more interesting" things that can be viewed 
(and treated by the JIT) as call sites:
- MethodHandle.invoke*()
- reflection based invocation (Method.invoke(), Constructor.newInstance())
- invokedynamic (a whole can of Pandora's worms goes here...)
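A small sketch showing most of these call-site flavors in source form (the bytecode 
names in the comments are what javac typically emits; the exact shapes can vary):

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    class CallSiteKinds {
        interface Animal { void speak(); }

        static class Dog implements Animal {
            public void speak() { System.out.println("woof"); }  // target of the invokeinterface below
        }

        static void helper() { }                                  // target of invokestatic

        void demo() throws Throwable {
            Dog d = new Dog();           // invokespecial (the constructor call)
            d.toString();                // invokevirtual (class-typed receiver)
            Animal a = d;
            a.speak();                   // invokeinterface
            helper();                    // invokestatic
            MethodHandle mh = MethodHandles.lookup()
                    .findStatic(CallSiteKinds.class, "helper", MethodType.methodType(void.class));
            mh.invokeExact();            // a MethodHandle.invoke* call site
        }
    }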


>1. I know that there are optimizations specific per call site, but 
>is there a list of them somewhere (other than the OpenJDK source code)? 
>
> The sort of optimizations that might happen at a call site can evolve over 
time, and JVMs and JITs can keep adding newer optimizations. Some of the 
current common call site optimizations include:

- simple inlining: the target method is known  (e.g. it is a static method, 
a constructor, or a 
we-know-there-is-only-one-implementor-of-this-instance-method-for-this-type-and-all-of-its-subtypes),
 
and can be unconditionally inlined at the call site.
- guarded inlining: the target method is assumed to be a specific method 
(which we go ahead and inline), but a check (e.g. the exact type of this 
animal is actually a dog) is required ahead of the inlined code because we 
can't prove the assumption is true.
- bi-morphic and tri-morphic variants of guarded inlining exist (where two 
or three different targets are inlined).
- Inline cache: A virtual invocation dispatch (which would need to follow 
the instance's class to locate a target method) is replaced with a guarded 
static invocation to a specific target on the assumption a specific 
("monomorphic") callee type. "bi-morphic" and "tri-morphic" variants of 
inline cache exist (where one of two or three static callees are called 
depending on a type check, rather than performing a full virtual dispatch)
...
 
But there are bigger and more subtle things that can be optimized at and 
around a call site, which may not be directly evident from the calling code 
itself. Even when a call site "stays", things like this can happen:

- Analysis of all possible callees shows that no writes to some locations 
are possible and that no order-enforcing operations will occur, allowing 
the elimination of re-reading of those locations after the call. [this can 
e.g. let us hoist reads out of loops containing calls].
- Analysis of all possible callees shows that no reads of some locations 
are possible and that no order-enforcing operations will occur, allowing 
the delay/elimination of writes to those locations [this can e.g. allow us 
to sink writes such that occur once after a loop with calls in it].
- Analysis of all possible callees shows that an object passed as an 
argument does not escape to the heap, allowing certain optimizations (e.g. 
eliminating locks on the object if it was allocated in the caller and never 
escaped since we now know it is thread-local)
- ... (many more to come)
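As an illustration of the first two bullets, here is a sketch of the kind of code shape 
involved (purely illustrative; what any particular JIT actually does in a given run is 
its own business):

    class LoopWithCalls {
        private int limit = 1_000;          // plain field read in the loop condition
        private int sink;                   // plain field written in the loop body

        private static int cheap(int v) {   // callee provably reads/writes no heap state
            return v + 1;
        }

        int run() {
            int acc = 0;
            // Because cheap() can be proven not to write 'limit' or 'sink' and has no
            // ordering effects, the JIT may read 'limit' once before the loop instead
            // of re-reading it each iteration, and may sink the 'sink' store so that
            // only one write happens after the loop completes.
            for (int i = 0; i < limit; i++) {
                acc = cheap(acc);
                sink = acc;
            }
            return acc;
        }
    }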


>1. 
>2. I know that compiled code from the JVM is available in a Code Cache 
>to allow different call-sites to use it: that means that the same compiled 
>method is used in all those call-sites (provided that's the best version 
> of 
>it)?
>
>
Not necessarily. 


- The same method may exist in the code cache in multiple places:
  - Each tier may retain compiles of the same method.
  - Methods compiled for OSR-entry (e.g. transition from a lower tier 
version of the same method in the middle of a long-running loop) are 
typically a separate code cache entry than ones compiled for entry via 
normal invocation.
  - Each location where a method is inlined is technically a separate copy 
of that method in the cache.
  - ...

- When an actual virtual invocation is made, that invocation will normally 
go to the currently installed version of the method in the code cache (if 
one exists). However, because the JVM works very hard to avoid actual 
virtual invocations (and tend to 

Re: Determine memory bandwidth machine

2018-01-16 Thread Gil Tene
Good point about the numa part. My dd example may well end up allocating 
memory for the file in /tmp on one socket or another, causing all the reads 
to hit that one socket's memory channels.

For a good spread, a C program will probably be best. Allocate memory from 
the local NUMA node, and run multiple threads reading that memory for speed 
clocking. then run one copy of this program bound to each socket (with 
numactl -N) and sum up the numbers (for an idealized max memory bandwidth 
thing that does no cross-socket access).
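Since this list leans Java, treat the following as a rough same-idea sketch in Java 
rather than C (run one copy per socket under e.g. numactl -N 0 -m 0; the thread count, 
array size, and pass count are arbitrary placeholders, and the first pass still pays 
page-fault/zeroing costs, so the number is only a ballpark):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ReadBandwidthProbe {
        public static void main(String[] args) throws Exception {
            final int threads = 10;                            // roughly one per physical core on the socket
            final int passes = 4;
            final long[] data = new long[64 * 1024 * 1024];    // 512 MB of longs
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            CountDownLatch done = new CountDownLatch(threads);

            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                pool.execute(() -> {
                    long sum = 0;
                    for (int p = 0; p < passes; p++) {
                        for (int i = 0; i < data.length; i++) {
                            sum += data[i];                    // sequential reads, prefetcher-friendly
                        }
                    }
                    if (sum == 42) System.out.print("");       // keep the loop from being eliminated as dead code
                    done.countDown();
                });
            }
            done.await();
            double seconds = (System.nanoTime() - start) / 1e9;
            double gigabytes = (double) threads * passes * data.length * Long.BYTES / 1e9;
            System.out.printf("~%.1f GB/s aggregate read bandwidth%n", gigabytes / seconds);
            pool.shutdown();
        }
    }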

On Tuesday, January 16, 2018 at 10:54:22 AM UTC-8, Chet L wrote:
>
> Agree with Gil (w.r.t DIMM slots, channels etc) but may be not on the 'dd' 
> script. Here's why. Some BIOS's have 'memory interleaving' option turned 
> OFF. And if the OS-wide memory policy is non-interleaving too then in that 
> case unless your application explicitly binds memory to a remote socket you 
> cannot interleave memory. Or you would need to use numa tools (to set mem 
> policy etc) while launching your application.
>
> Bandwidth or latency monitoring is also dependent on the workload you are 
> running. If the workload a) runs atomics b) is running on Node-0 only 
> (memory is local or striped): then the snoop-responses are going to take 
> longer (uncore frequency etc.) because socket-1 might be in power-saving 
> states. So ideally you would need a 'snoozer' thread on the remote 
> socket(Node-1) which would prevent the socket from entering one of the 'C' 
> (or whatever) states (or you can disable hardware power-saving modes - but 
> you may need to line up all the options because the kernel may have 
> power-saving options too).  If you use the industry standard tools like 
> 'stream' (as others mentioned) etc they will do all of this for you (via 
> dummy/snoozer threads and so on).
>
> If you want to write this all by yourself then you should know the numa 
> apis (http://man7.org/linux/man-pages/man3/numa.3.html), numa tools 
> (numactl, taskset). For latency measurements you should also disable 
> pre-fetching else it will give you super awesome numbers.
>
> Note: In future, if you use a device-driver that does the allocation for 
> your app then you should make sure the driver knows about numa allocation 
> too and aligns everything for you. I found that out the hard way back in 
> 2010 and realized that the linux-kernel had no guidance for pcie 
> drivers(back then ... its ok now). I fixed it locally at the time.
>
> Hope this helps.
>
> Chetan Loke
>
>
> On Monday, January 15, 2018 at 11:19:53 AM UTC-5, Kevin Bowling wrote:
>>
>> lmbench works well http://www.bitmover.com/lmbench/man_lmbench.html, 
>> and Larry seems happy to answer questions on building/using it. 
>>
>> Unless you've explicitly built an application to work with NUMA, or 
>> are able to run two copies of an application pinned to each domain, 
>> you really only will get about 1 package worth of BW, and latency is a 
>> bigger deal (which lmbench can also measure in cooperation with 
>> numactl) 
>>
>> Regards, 
>>
>> On Sun, Jan 14, 2018 at 11:44 AM, Peter Veentjer  
>> wrote: 
>> > I'm working on some very simple aggregations on huge chunks of offheap 
>> > memory (500GB+) for a hackaton. This is done using a very simple 
>> stride; 
>> > every iteration the address increases with 20 bytes. So the prefetcher 
>> > should not have any problems with it. 
>> > 
>> > According to my calculations I'm currently processing 35 GB/s. However 
>> I'm 
>> > not sure if I'm close to the maximum bandwidth of this machine. Specs: 
>> > 2133 MHz, 24x HP 32GiB 4Rx4 PC4-2133P 
>> > 2x Intel(R) Xeon(R) CPU E5-2687W v3, 3.10GHz, 10 cores per socket 
>> > 
>> > What is the best tool to determine the maximum bandwidth of a machine 
>> > running Linux (RHEL 7) 
>> > 
>>
>



Re: JVM detection of thread at safepoint

2017-12-05 Thread Gil Tene
:_ZN20SafepointSynchronize5blockEP10JavaThread: (7f78a21d3a70)
>   99ba70 SafepointSynchronize::block 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
>   6db9a0 JVM_Sleep 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
> 7f788d0174d4 Interpreter (/tmp/perf-17855.map)
> 7f788d007ffd Interpreter (/tmp/perf-17855.map)
> 7f788d0004e7 call_stub (/tmp/perf-17855.map)
>   6941e3 JavaCalls::call_helper 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
>   6b6e43 jni_invoke_static 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
>   6b9566 jni_CallStaticVoidMethod 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
> 3371 JavaMain 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/jli/libjli.so)
> 773a start_thread (/usr/lib64/libpthread-2.24.so)
>
>
> java 17998 [002] 2446326.506283: 
> probe_libjvm:_ZN20SafepointSynchronize5blockEP10JavaThread: (7f78a21d3a70)
>   99ba70 SafepointSynchronize::block 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
> 7f788d108c2a Ljava/lang/System;::arraycopy 
> (/tmp/perf-17855.map)
> 7f788d007ffd Interpreter (/tmp/perf-17855.map)
> 7f788d0004e7 call_stub (/tmp/perf-17855.map)
>   6941e3 JavaCalls::call_helper 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
>   691747 JavaCalls::call_virtual 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
>   691d27 JavaCalls::call_virtual 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
>   6d5864 thread_entry 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
>   acdfbb JavaThread::thread_main_inner 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
>   ace49f JavaThread::run 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
>   8f2f62 java_start 
> (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-1.b12.fc25.x86_64/jre/lib/amd64/server/libjvm.so)
> 773a start_thread (/usr/lib64/libpthread-2.24.so)
>
> Thanks
> Alex
>  
>
>>
>>
>> Cheers,
>>
>> Mark
>>
>> On Tue, Dec 5, 2017 at 11:11 AM Kirk Pepperdine <ki...@kodewerk.com 
>> > wrote:
>>
>>> Hi,
>>>
>>>
>>> On Dec 5, 2017, at 11:53 AM, Gil Tene <g...@azul.com > 
>>> wrote:
>>>
>>> Page faults in mapped file i/o and counted loops are certainly two 
>>> common causes of long TTSP. But there are many other paths that *could* 
>>> cause it as well in HotSpot. Without catching it and looking at the stack 
>>> trace, it's hard to know which ones to blame. Once you knock out one cause, 
>>> you'll see if there is another.
>>>
>>> In the specific stack trace you showed [assuming that trace was taken 
>>> during a long TTSP], mapped file i/o is the most likely culprit. Your trace 
>>> seems to be around making the page write-able for the first time and 
>>> updating the file time (which takes a lock), but even without needing the 
>>> lock, the fault itself could end up waiting for the i/o to complete (read 
>>> page from disk), and that (when Murphy pays you a visit) can end up 
>>> waiting behind 100s other i/o operations (e.g. when your i/o happens at the 
>>> same time the kernel decided to flush some dirty pages in the cache), 
>>> leading to TTSPs in the 100s of msec.
>>>
>>> As I'm sure you already know, one simple way to get around mapped file 
>>> related TTSP is to not used mapped files. Explicit random i/o calls are 
>>> always done while at a safepoint, so they can't cause high TTSPs.
>>>
>>>
>>> I guess another way to avoid the long TTSP is to not safe point.. ;-)
>>>
>>> Most of the long TTSP’s IME are either due to outside scheduling 
>>> interference or loops that are compiled to the point where they turn into 
>>> counted loops. Finding this using traditional tooling is impossible because 
>>> y

Re: JVM detection of thread at safepoint

2017-12-05 Thread Gil Tene


Sent from my iPad

On Dec 5, 2017, at 1:26 PM, Mark Price 
<m...@aitusoftware.com<mailto:m...@aitusoftware.com>> wrote:


That (each process having it's own copy) is surprising to me. Unless the 
mapping is such that private copies are required, I'd expect the processes to 
share the page cache entries.

I can't recreate this effect locally using FileChannel.map(); the library in 
use in the application uses a slightly more exotic route to get to mmap, so it 
could be a bug there; will investigate. I could also have been imagining it.



Is your pre-toucher thread a Java thread doing its pre-touching using mapped 
i/o in the same process? If so, then the pre-toucher thread itself will be a 
high TTSP causer. The trick is to do the pre-touch in a thread that is already 
at a safepoint (e.g. do your pre-touch using mapped i/o from within a JNI call, 
use another process, or do the retouch with non-mapped i/o).

Yes, just a java thread in the same process; I hadn't considered that it would 
also cause long TTSP, but of course it's just as likely (or more likely) to be 
scheduled off due to a page fault. I could try using pwrite via 
FileChannel.write() to do the pre-touching, but I think it needs to perform a 
CAS (i.e. don't overwrite data that is already present), so a JNI method would 
be the only way to go. Unless just doing a 
FileChannel.position(writeLimit).read(buffer) would do the job? Presumably that 
is enough to load the page into the cache and performing a write is unnecessary.

This (non mapped reading at the write limit) will work to eliminate the actual 
page I/O impact on TTSP, but the time update path with the lock that you show 
in your initial stack trace will probably still hit you. I’d go either with a 
JNI CAS, or a forked-off mapped Java pretoucher as a separate process (tell it 
what you want touched via its stdin). Not sure which one is uglier. The pure 
Java one is more portable (for Unix/Linux variants at least).
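For the non-mapped-reading option, a rough sketch of what such a pre-toucher could look 
like (the file name, page size, and look-ahead distance are placeholders, and this only 
pulls pages into the page cache; it does nothing about the file-time-update lock path):

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class PreToucher {
        private static final int PAGE = 4096;   // assumed page size

        // Reads one byte per page just ahead of the writer's limit using non-mapped,
        // positional I/O, so the page fault is taken inside a blocking read (where the
        // thread is effectively "in native") rather than on a mapped store in a Java thread.
        public static void touchAhead(FileChannel ch, long writeLimit, long bytesAhead) throws Exception {
            ByteBuffer one = ByteBuffer.allocate(1);
            for (long pos = writeLimit; pos < writeLimit + bytesAhead; pos += PAGE) {
                one.clear();
                if (ch.read(one, pos) < 0) break;   // past EOF, nothing more to touch
            }
        }

        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
                touchAhead(raf.getChannel(), Long.parseLong(args[1]), 64L * 1024 * 1024);
            }
        }
    }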





Cheers,

Mark

On Tuesday, 5 December 2017 10:53:17 UTC, Gil Tene wrote:
Page faults in mapped file i/o and counted loops are certainly two common 
causes of long TTSP. But there are many other paths that *could* cause it as 
well in HotSpot. Without catching it and looking at the stack trace, it's hard 
to know which ones to blame. Once you knock out one cause, you'll see if there 
is another.

In the specific stack trace you showed [assuming that trace was taken during a 
long TTSP], mapped file i/o is the most likely culprit. Your trace seems to be 
around making the page write-able for the first time and updating the file time 
(which takes a lock), but even without needing the lock, the fault itself could 
end up waiting for the i/o to complete (read page from disk), and that (when 
Murphy pays you a visit) can end up waiting behind 100s other i/o operations 
(e.g. when your i/o happens at the same time the kernel decided to flush some 
dirty pages in the cache), leading to TTSPs in the 100s of msec.

As I'm sure you already know, one simple way to get around mapped file related 
TTSP is to not use mapped files. Explicit random i/o calls are always done 
while at a safepoint, so they can't cause high TTSPs.

On Tuesday, December 5, 2017 at 10:30:57 AM UTC+1, Mark Price wrote:
Hi Aleksey,
thanks for the response. The I/O is definitely one problem, but I was trying to 
figure out whether it was contributing to the long TTSP times, or whether I 
might have some code that was misbehaving (e.g. NonCountedLoops).

Your response aligns with my guesswork, so hopefully I just have the one 
problem to solve ;)



Cheers,

Mark

On Tuesday, 5 December 2017 09:24:33 UTC, Aleksey Shipilev wrote:
On 12/05/2017 09:26 AM, Mark Price wrote:
> I'm investigating some long time-to-safepoint pauses in oracle/openjdk. The 
> application in question
> is also suffering from some fairly nasty I/O problems where latency-sensitive 
> threads are being
> descheduled in uninterruptible sleep state due to needing a file-system lock.
>
> My question: can the JVM detect that a thread is in signal/interrupt-handler 
> code and thus treat it
> as though it is at a safepoint (as I believe happens when a thread is in 
> native code via a JNI call)?
>
> For instance, given the stack trace below, will the JVM need to wait for the 
> thread to be scheduled
> back on to CPU in order to come to a safepoint, or will it be treated as 
> "in-native"?
>
> 7fff81714cd9 __schedule ([kernel.kallsyms])
> 7fff817151e5 schedule ([kernel.kallsyms])
> 7fff81717a4b rwsem_down_write_failed ([kernel.kallsyms])
> 7fff813556e7 call_rwsem_down_write_failed ([kernel.kallsyms])
> 7fff817172ad down_write ([kernel.kallsyms])
> 7fffa0403dcf xfs_ilock ([kernel.kallsyms])
> 7fffa04018fe xfs_vn_update_time ([kernel.kallsyms])
> 7fff8122cc5d file_update_time ([kernel.kallsym

Re: JVM detection of thread at safepoint

2017-12-05 Thread Gil Tene


On Tuesday, December 5, 2017 at 12:31:52 PM UTC+1, Mark Price wrote:
>
> Hi Gil,
> thanks for the response.
>
> I'm fairly sure that interaction with the page-cache is one of the 
> problems. When this is occurring, the free-mem is already hovering around 
> vm.min_free_kbytes, and the mapped files are a significant fraction of 
> system memory. From what I can see, each process that maps a file will get 
> its own copy in the page-cache (kernel shared pages doesn't seem to apply 
> to the page-cache, and is otherwise disabled on the test machine), so we 
> probably have approaching total system memory in use by cached pages.
>

That (each process having it's own copy) is surprising to me. Unless the 
mapping is such that private copies are required, I'd expect the processes 
to share the page cache entries.
 

>
> I had thought that pages that were last written-to before the 
> vm.dirty_expire_centisecs threshold would be written out to disk by the 
> flusher threads, but I read on lkml that the access times are maintained on 
> a per-inode basis, rather than per-page. If this is the case, then the 
> system in question is making it very difficult for the page-cache to work 
> efficiently.
>
> The system makes use of a "pre-toucher" thread to try to handle 
> page-faults ahead of the thread that is trying to write application data to 
> the mapped pages. However, it seems that it is not always successful, so I 
> need to spend a bit of time trying to figure out why that is not working. 
> It's possible that there is just too much memory pressure, and the OS is 
> swapping out pages that have be loaded by the pre-toucher before the 
> application gets to them.
>

Is your pre-toucher thread a Java thread doing its pre-touching using 
mapped i/o in the same process? If so, then the pre-toucher thread itself 
will be a high TTSP causer. The trick is to do the pre-touch in a thread 
that is already at a safepoint (e.g. do your pre-touch using mapped i/o 
from within a JNI call, use another process, or do the retouch with 
non-mapped i/o).
 

>
>
> Cheers,
>
> Mark
>
> On Tuesday, 5 December 2017 10:53:17 UTC, Gil Tene wrote:
>>
>> Page faults in mapped file i/o and counted loops are certainly two common 
>> causes of long TTSP. But there are many other paths that *could* cause it 
>> as well in HotSpot. Without catching it and looking at the stack trace, 
>> it's hard to know which ones to blame. Once you knock out one cause, you'll 
>> see if there is another.
>>
>> In the specific stack trace you showed [assuming that trace was taken 
>> during a long TTSP], mapped file i/o is the most likely culprit. Your trace 
>> seems to be around making the page write-able for the first time and 
>> updating the file time (which takes a lock), but even without needing the 
>> lock, the fault itself could end up waiting for the i/o to complete (read 
>> page from disk), and that (when Murphy pays you a visit) can end up waiting 
>> behind 100s other i/o operations (e.g. when your i/o happens at the same 
>> time the kernel decided to flush some dirty pages in the cache), leading to 
>> TTSPs in the 100s of msec.
>>
>> As I'm sure you already know, one simple way to get around mapped file 
>> related TTSP is to not used mapped files. Explicit random i/o calls are 
>> always done while at a safepoint, so they can't cause high TTSPs.
>>
>> On Tuesday, December 5, 2017 at 10:30:57 AM UTC+1, Mark Price wrote:
>>>
>>> Hi Aleksey,
>>> thanks for the response. The I/O is definitely one problem, but I was 
>>> trying to figure out whether it was contributing to the long TTSP times, or 
>>> whether I might have some code that was misbehaving (e.g. NonCountedLoops).
>>>
>>> Your response aligns with my guesswork, so hopefully I just have the one 
>>> problem to solve ;)
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Mark
>>>
>>> On Tuesday, 5 December 2017 09:24:33 UTC, Aleksey Shipilev wrote:
>>>>
>>>> On 12/05/2017 09:26 AM, Mark Price wrote: 
>>>> > I'm investigating some long time-to-safepoint pauses in 
>>>> oracle/openjdk. The application in question 
>>>> > is also suffering from some fairly nasty I/O problems where 
>>>> latency-sensitive threads are being 
>>>> > descheduled in uninterruptible sleep state due to needing a 
>>>> file-system lock. 
>>>> > 
>>>> > My question: can the JVM detect that a thread is in 
>>>> signal/interrupt-handler code and thus treat it 
>>>> > as t

Re: JVM detection of thread at safepoint

2017-12-05 Thread Gil Tene
You can use it to track down your problem. You can download a Zing trial 
and play with it. It won't tell you anything about the various HotSpot 
specific TTSP paths (since most of those don't exist in Zing), but since 
you suspect mapped i/o based TTSPs, Zing should run into those just as 
much, and you'll get good TTSP coverage and clear-blame stack traces to 
play with. 

On Tuesday, December 5, 2017 at 12:33:54 PM UTC+1, Mark Price wrote:
>
>
>> In Zing, we have a built-in TTSP profiler for exactly this reason. 
>>
>
> I remember it fondly :)
>
>
>  
>
>



Re: JVM detection of thread at safepoint

2017-12-05 Thread Gil Tene
One of the challenges in using GC logs to find out if you have TTSP issues 
is that GC events are extremely infrequent, so not having seen a high TTSP 
in the log doesn't mean you won't hit one soon... It's literally like 
deducing the likelihood of hitting something bad from 1000 data points 
collected with a 1-in-a-million random sampler. Coverage is very low.

In Zing, we have a built-in TTSP profiler for exactly this reason. 
Coverage. We initially used it internally, to detect and clean up all the 
JVM code paths that were problematic, and now we use it with customers, 
mostly to prove and point fingers at the system-related causes of TTSP when 
they occur. The way the TTSP profiler works is basically by using a tick 
based profiler that records the tick time and arms a thread-local "go to a 
no-op safepoint" whose only purpose is to report the amount of time elapsed 
from the original tick to that thread getting to a safepoint. The thread 
leaves the safepoint immediately, so no pauses happen, and we keep track of 
the call stacks for e.g. the top 50 TTSP times seen since the last profile 
reset. With this technique, we can reliably completely cover and report on 
any experienced code path with a TTSP larger than e.g. 5 msec, and we get 
fairly good coverage even for the lower levels (just not 100% coverage, 
unless you are willing to take e.g. a TTSP-profiling tick 2000 times per 
second on each running thread).

I expect that with HotSpot finally looking at adding thread-specific 
safe-pointing and checkpoints, similar tooling will be needed there.

On Tuesday, December 5, 2017 at 12:11:44 PM UTC+1, Kirk Pepperdine wrote:
>
> Hi,
>
>
> On Dec 5, 2017, at 11:53 AM, Gil Tene <g...@azul.com > wrote:
>
> Page faults in mapped file i/o and counted loops are certainly two common 
> causes of long TTSP. But there are many other paths that *could* cause it 
> as well in HotSpot. Without catching it and looking at the stack trace, 
> it's hard to know which ones to blame. Once you knock out one cause, you'll 
> see if there is another.
>
> In the specific stack trace you showed [assuming that trace was taken 
> during a long TTSP], mapped file i/o is the most likely culprit. Your trace 
> seems to be around making the page write-able for the first time and 
> updating the file time (which takes a lock), but even without needing the 
> lock, the fault itself could end up waiting for the i/o to complete (read 
> page from disk), and that (when Murh=phy pays you a visit) can end up 
> waiting behind 100s other i/o operations (e.g. when your i/o happens at the 
> same time the kernel decided to flush some dirty pages in the cache), 
> leading to TTSPs in the 100s of msec.
>
> As I'm sure you already know, one simple way to get around mapped file 
> related TTSP is to not used mapped files. Explicit random i/o calls are 
> always done while at a safepoint, so they can't cause high TTSPs.
>
>
> I guess another way to avoid the long TTSP is to not safe point.. ;-)
>
> Most of the long TTSP’s IME are either due to outside scheduling 
> interference or loops that are compiled to the point where they turn into 
> counted loops. Finding this using traditional tooling is impossible because 
> you need the safe point which has most likely been compiled out of the code 
> to get a sample to see why you’re waiting for a safe point. Using an async 
> profiler (like honest) leaves you wanting for some correlation between 
> calls for safe point and the samples you’re looking at. So, I don’t really 
> have a good answer as to how to find these things. What I have done in the 
> past is used the GC log to get an idea of the toxicity level of the 
> environment you’re running in and then use that information to decide if I 
> need to look for internal or external causes. If the causes are internal 
> than it’s most likely a counted loop. Using JITWatch in this cause can you 
> get a candidate list of counted loops to consider. After that I’ve binary 
> searched through that list. If the environment is toxic.. then you need to 
> clean up the environment. If you are brave, you add some instrumentation to 
> the JVM to sample for the last thread to reach the safe point and use that 
> in your prod environment.
>
> Kind regards,
> Kirk
>
>
> On Tuesday, December 5, 2017 at 10:30:57 AM UTC+1, Mark Price wrote:
>>
>> Hi Aleksey,
>> thanks for the response. The I/O is definitely one problem, but I was 
>> trying to figure out whether it was contributing to the long TTSP times, or 
>> whether I might have some code that was misbehaving (e.g. NonCountedLoops).
>>
>> Your response aligns with my guesswork, so hopefully I just have the one 
>> problem to solve ;)
>>
>>
>>
>> Cheers,
>>
>>

Re: JVM detection of thread at safepoint

2017-12-05 Thread Gil Tene
Page faults in mapped file i/o and counted loops are certainly two common 
causes of long TTSP. But there are many other paths that *could* cause it 
as well in HotSpot. Without catching it and looking at the stack trace, 
it's hard to know which ones to blame. Once you knock out one cause, you'll 
see if there is another.

In the specific stack trace you showed [assuming that trace was taken 
during a long TTSP], mapped file i/o is the most likely culprit. Your trace 
seems to be around making the page write-able for the first time and 
updating the file time (which takes a lock), but even without needing the 
lock, the fault itself could end up waiting for the i/o to complete (read 
page from disk), and that (when Murh=phy pays you a visit) can end up 
waiting behind 100s other i/o operations (e.g. when your i/o happens at the 
same time the kernel decided to flush some dirty pages in the cache), 
leading to TTSPs in the 100s of msec.

As I'm sure you already know, one simple way to get around mapped file 
related TTSP is to not use mapped files. Explicit random i/o calls are 
always done while at a safepoint, so they can't cause high TTSPs.

On Tuesday, December 5, 2017 at 10:30:57 AM UTC+1, Mark Price wrote:
>
> Hi Aleksey,
> thanks for the response. The I/O is definitely one problem, but I was 
> trying to figure out whether it was contributing to the long TTSP times, or 
> whether I might have some code that was misbehaving (e.g. NonCountedLoops).
>
> Your response aligns with my guesswork, so hopefully I just have the one 
> problem to solve ;)
>
>
>
> Cheers,
>
> Mark
>
> On Tuesday, 5 December 2017 09:24:33 UTC, Aleksey Shipilev wrote:
>>
>> On 12/05/2017 09:26 AM, Mark Price wrote: 
>> > I'm investigating some long time-to-safepoint pauses in oracle/openjdk. 
>> The application in question 
>> > is also suffering from some fairly nasty I/O problems where 
>> latency-sensitive threads are being 
>> > descheduled in uninterruptible sleep state due to needing a file-system 
>> lock. 
>> > 
>> > My question: can the JVM detect that a thread is in 
>> signal/interrupt-handler code and thus treat it 
>> > as though it is at a safepoint (as I believe happens when a thread is 
>> in native code via a JNI call)? 
>> > 
>> > For instance, given the stack trace below, will the JVM need to wait 
>> for the thread to be scheduled 
>> > back on to CPU in order to come to a safepoint, or will it be treated 
>> as "in-native"? 
>> > 
>> > 7fff81714cd9 __schedule ([kernel.kallsyms]) 
>> > 7fff817151e5 schedule ([kernel.kallsyms]) 
>> > 7fff81717a4b rwsem_down_write_failed ([kernel.kallsyms]) 
>> > 7fff813556e7 call_rwsem_down_write_failed ([kernel.kallsyms]) 
>> > 7fff817172ad down_write ([kernel.kallsyms]) 
>> > 7fffa0403dcf xfs_ilock ([kernel.kallsyms]) 
>> > 7fffa04018fe xfs_vn_update_time ([kernel.kallsyms]) 
>> > 7fff8122cc5d file_update_time ([kernel.kallsyms]) 
>> > 7fffa03f7183 xfs_filemap_page_mkwrite ([kernel.kallsyms]) 
>> > 7fff811ba935 do_page_mkwrite ([kernel.kallsyms]) 
>> > 7fff811bda74 handle_pte_fault ([kernel.kallsyms]) 
>> > 7fff811c041b handle_mm_fault ([kernel.kallsyms]) 
>> > 7fff8106adbe __do_page_fault ([kernel.kallsyms]) 
>> > 7fff8106b0c0 do_page_fault ([kernel.kallsyms]) 
>> > 7fff8171af48 page_fault ([kernel.kallsyms]) 
>> >  java stack trace ends here  
>>
>> I am pretty sure out-of-band page fault in Java thread does not yield a 
>> safepoint. At least because 
>> safepoint polls happen at given location in the generated code, because 
>> we need the pointer map as 
>> the part of the machine state, and that is generated by Hotspot (only) 
>> around the safepoint polls. 
>> Page faulting on random read/write insns does not have that luxury. Even 
>> if JVM had intercepted that 
>> fault, there is not enough metadata to work on. 
>>
>> The stacktrace above seems to say you have page faulted and this incurred 
>> disk I/O? This is 
>> swapping, I think, and all performance bets are off at that point. 
>>
>> Thanks, 
>> -Aleksey 
>>
>>



Re: sun.misc.Unsafe.getAndAddInt implementation, question.

2017-10-16 Thread Gil Tene
The bytecode doesn't matter. It's not the javac compiler that will be doing 
the optimizations you should be worried about. It's the JIT compilers in 
the JVM. The javac-generated bytecode is only executed by the interpreter. 
The bytecode is eventually transformed to machine code by the JIT compiler, 
during which it will undergo aggressive optimization.

The "CPU" you should worry about and model in your mind is not x86, SPARC, 
or ARM. It's the JVM's execution engine and the JIT-generated machine code 
that do most of the actual execution, and that "CPU" will reorder the code 
more aggressively than any HW CPU ever would. The JIT's optimizing 
transformations include arbitrary and massive re-ordering, reshaping, 
folding-together, and completely eliminating big parts of your apparent 
bytecode instructions. And the JIT will do all those as long as it can 
prove that the transformations are allowed.
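
As a minimal illustration of the kind of transformation that is allowed 
(this is the classic non-volatile flag example, not the getAndAddInt code 
itself):

class Spinner {
  boolean stop; // not volatile

  void spin() {
    // Because stop is not volatile and the loop body never writes it, the
    // JIT may read it once and hoist the read out of the loop, turning this
    // into an infinite loop even after another thread sets stop = true.
    while (!stop) {
      // ... do some work ...
    }
  }
}

Declaring the field as volatile forbids that hoisting.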

On Monday, October 16, 2017 at 3:30:13 PM UTC+1, John Hening wrote:
>
> Thanks Gil Tene! 
>
> You are obviously right. The read is not volatile so the compiler is allowed 
> to reorder it. Moreover, the read is not volatile, so the compiler assumes 
> that no one changes the source of getInt(). So, it can hoist out curr.
>
>
> The last question (sorry for being inquisitive)  is:
>
> Let's assume that the compiler generated bytecode equivalent to 
>
> public final int getAndAddInt(Object ptr, long offset, int value) {
>  int curr;
>  do {
> curr = this.getInt(ptr, offset); (1)
>  } while(!this.compareAndSwapInt(ptr, offset, curr, curr + value)); (2)
>  return curr;
> }
>
>
>
> so it wasn't optimized. 
>
> Now, it seems to work correctly. But note that the CPU can do some 
> reordering. We are lucky because the CPU cannot reorder here: there is a 
> data dependency: (1) -> (2). So, on every sensible CPU (one that respects 
> data dependencies) it works, yes?
>
> W dniu poniedziałek, 16 października 2017 12:13:07 UTC+2 użytkownik Gil 
> Tene napisał:
>>
>> Ok. So the question below (ignoring other optimizations in the JVM that 
>> are specific to this method) is "If I were doing this myself in some other 
>> method, would this logic be valid if Unsafe.getIntVolatile() could be 
>> replaced with Unsafe.getInt()?"
>>
>> The answer IMO is "no".
>>
>> The issue here is that unlike e.g. AtomicInteger.compareAndSet(), which 
>> is explicitly specified to include the behavior of a volatile read on the 
>> field involved, Unsafe.compareAndSwapInt() does not make any claims about 
>> exhibiting volatile read semantics. As a result, if you replace 
>> Unsafe.getIntVolatile() with Unsafe.getInt(), the resulting code:
>>
>> public final int getAndAddInt(Object ptr, long offset, int value) {
>>  int curr;
>>  do {
>> curr = this.getInt(ptr, offset); (1)
>>  } while(!this.compareAndSwapInt(ptr, offset, curr, curr + value)); (2)
>>  return curr;
>> }
>>
>> Can be validly transformed by the optimizer to:
>>
>> public final int getAndAddInt(Object ptr, long offset, int value) {
>>  int curr = this.getInt(ptr, offset); (1)
>>  do {  
>>  } while(!this.compareAndSwapInt(ptr, offset, curr, curr + value)); (2)
>>  return curr;
>> }
>>
>> Because:
>>
>> (a) The optimizer can prove that if the compareAndSwapInt ever actually 
>> wrote to the field, the method would return and curr wouldn't be read again.
>> (b) Since the read of curr is not volatile, and the read in 
>> Unsafe.compareAndSwapInt() is not required to act like a volatile read, all 
>> the reads of curr can be reordered with all the reads in the 
>> compareAndSwapInt() calls, which means that they can be folded together and 
>> hoisted out of the loop.
>>
>> If this valid optimization happened, the resulting code would get stuck 
>> in an infinite loop if another thread modified the field between the read 
>> of curr and the compareAndSwapInt call, and that is obviously not the 
>> intended behavior of getAndAddInt()...
>>
>> On Sunday, October 15, 2017 at 2:12:30 AM UTC-7, John Hening wrote:
>>>
>>> Gil Tene, thank you very much. 
>>>
>>> Ok, so does it mean that Unsafe.getIntVolatile() could be replaced 
>>> with Unsafe.getInt()?
>>>
>>> W dniu niedziela, 15 października 2017 01:34:34 UTC+2 użytkownik Gil 
>>> Tene napisał:
>>>>
>>>> A simple answer would be that the field is treated by the method as a 
>>>> volatile, and the code is simply staying consistent with that notion. Is 
>>>> an 
>>>> optimization possible here? Possibly

Re: sun.misc.Unsafe.getAndAddInt implementation, question.

2017-10-16 Thread Gil Tene
Ok. So the question below (ignoring other optimizations in the JVM that are 
specific to this method) is "If I were doing this myself in some other 
method, would this logic be valid if Unsafe.getIntVolatile() could be 
replaced with Unsafe.getInt()?"

The answer IMO is "no".

The issue here is that unlike e.g. AtomicInteger.compareAndSet(), which is 
explicitly specified to include the behavior of a volatile read on the 
field involved, Unsafe.compareAndSwapInt() does not make any claims about 
exhibiting volatile read semantics. As a result, if you replace 
Unsafe.getIntVolatile() with Unsafe.getInt(), the resulting code:

public final int getAndAddInt(Object ptr, long offset, int value) {
 int curr;
 do {
curr = this.getInt(ptr, offset); (1)
 } while(!this.compareAndSwapInt(ptr, offset, curr, curr + value)); (2)
 return curr;
}

Can be validly transformed by the optimizer to:

public final int getAndAddInt(Object ptr, long offset, int value) {
 int curr = this.getInt(ptr, offset); (1)
 do {  
 } while(!this.compareAndSwapInt(ptr, offset, curr, curr + value)); (2)
 return curr;
}

Because:

(a) The optimizer can prove that if the compareAndSwapInt ever actually 
wrote to the field, the method would return and curr wouldn't be read again.
(b) Since the read of curr is not volatile, and the read in 
Unsafe.compareAndSwapInt() is not required to act like a volatile read, all 
the reads of curr can be reordered with all the reads in the 
compareAndSwapInt() calls, which means that they can be folded together and 
hoisted out of the loop.

If this valid optimization happened, the resulting code would get stuck in 
an infinite loop if another thread modified the field between the read of 
curr and the compareAndSwapInt call, and that is obviously not the intended 
behavior of getAndAddInt()...

On Sunday, October 15, 2017 at 2:12:30 AM UTC-7, John Hening wrote:
>
> Gil Tene, thank you very much. 
>
> Ok, so does it mean that Unsafe.getIntVolatile() could be replaced with 
> Unsafe.getInt()?
>
> W dniu niedziela, 15 października 2017 01:34:34 UTC+2 użytkownik Gil Tene 
> napisał:
>>
>> A simple answer would be that the field is treated by the method as a 
>> volatile, and the code is simply staying consistent with that notion. Is an 
>> optimization possible here? Possibly. Probably. But does it matter? No. The 
>> source code involved is not performance critical, and is not worth 
>> optimizing. The interpreter may be running this logic, but no hot path 
>> would be executing the actual logic in this code... 
>>
>> Why?  Because the java code you see there is NOT what the hot code would 
>> be doing on (most) JVMs. Specifically, optimizing JITs can and will 
>> identify and intrinsify the method, replacing its body with code that does 
>> whatever they want it to do. They don't have to perform any of the actual 
>> logic in the method, as long as they make sure the method performs 
>> its intended (contracted) function, and that contracted functionality is 
>> to perform a getAndAddInt on a field, treating it logically as a volatile.
>>
>> For example, on x86 there is support for atomic add via the XADD 
>> instruction. Using XADD for this method's functionality has multiple 
>> advantages over doing the as-coded CAS loop. And most optimizing JITs will 
>> [transparently] use an XADD in place of a CAS in this case and get rid of 
>> the loop altogether.
>>
>> On Saturday, October 14, 2017 at 6:58:17 AM UTC-7, John Hening wrote:
>>>
>>> Hello
>>>
>>>
>>> it is an implementation from sun.misc.Unsafe.
>>>
>>>
>>>
>>> public final int getAndAddInt(Object ptr, long offset, int value) {
>>>  int curr;
>>>  do {
>>> curr = this.getIntVolatile(ptr, offset); (1)
>>>  } while(!this.compareAndSwapInt(ptr, offset, curr, curr + value)); (2)
>>>
>>>  return curr;}
>>>
>>>
>>>
>>>
>>>
>>> Why is Unsafe.getIntVolatile() called here instead of Unsafe.getInt()?
>>>
>>>
>>> I am basically familiar with memory models, memory barriers etc., but, 
>>> perhaps I don't see something important. 
>>>
>>>
>>> *getIntVolatile* means here: ensure the order of execution: (1) -> (2)
>>>
>>>
>>> It looks something like:
>>>
>>>
>>> curr = read(); 
>>> acquire();
>>> CAS operation
>>>
>>>
>>>
>>> Obviously, acquire() depends on CPU, for example on x86 it is empty, on ARM 
>>> it is a memory barrier, etc. 
>>>
>>>
>>> My question/misunderstanding:
>>>
>>>
>>> To my eye the order is ensured by the data dependency between the read of *(ptr + 
>>> offset)* and the *CAS* operation on it. So, I don't see a reason to worry about 
>>> memory (re)ordering. 
>>>
>>>
>>>
>>>



Re: sun.misc.Unsafe.getAndAddInt implementation, question.

2017-10-14 Thread Gil Tene
A simple answer would be that the field is treated by the method as a 
volatile, and the code is simply staying consistent with that notion. Is an 
optimization possible here? Possibly. Probably. But does it matter? 
No. The source code involved is not performance critical, and is not worth 
optimizing. The interpreter may be running this logic, but no hot path 
would be executing the actual logic in this code... 

Why?  Because the java code you see there is NOT what the hot code would be 
doing on (most) JVMs. Specifically, optimizing JITs can and will identify 
and intrinsify the method, replacing its body with code that does whatever 
they want it to do. They don't have to perform any of the actual logic 
in the method, as long as they make sure the method performs its 
intended (contracted) function, and that contracted functionality is to 
perform a getAndAddInt on a field, treating it logically as a volatile.

For example, on x86 there is support for atomic add via the XADD 
instruction. Using XADD for this method's functionality has multiple 
advantages over doing the as-coded CAS loop. And most optimizing JITs will 
[transparently] use an XADD in place of a CAS in this case and get rid of 
the loop altogether.
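
For comparison, a minimal sketch (illustrative) of getting the same 
contracted functionality through the public API; AtomicInteger.getAndAdd() 
is likewise intrinsified by optimizing JITs and typically compiles down to a 
single lock xadd on x86, rather than a CAS retry loop:

import java.util.concurrent.atomic.AtomicInteger;

class Counter {
  private final AtomicInteger value = new AtomicInteger();

  // Returns the previous value, like Unsafe.getAndAddInt() does.
  int getAndAdd(int delta) {
    return value.getAndAdd(delta);
  }
}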

On Saturday, October 14, 2017 at 6:58:17 AM UTC-7, John Hening wrote:
>
> Hello
>
>
> it is an implementation from sun.misc.Unsafe.
>
>
>
> public final int getAndAddInt(Object ptr, long offset, int value) {
>  int curr;
>  do {
> curr = this.getIntVolatile(ptr, offset); (1)
>  } while(!this.compareAndSwapInt(ptr, offset, curr, curr + value)); (2)
>
>  return curr;}
>
>
>
>
>
> Why is Unsafe.getIntVolatile() called here instead of Unsafe.getInt()?
>
>
> I am basically familiar with memory models, memory barriers etc., but, 
> perhaps I don't see something important. 
>
>
> *getIntVolatile* means here: ensure the order of execution: (1) -> (2)
>
>
> It looks something like:
>
>
> curr = read(); 
> acquire();
> CAS operation
>
>
>
> Obviously, acquire() depends on CPU, for example on x86 it is empty, on ARM 
> it is a memory barrier, etc. 
>
>
> My question/misunderstanding:
>
>
> To my eye the order is ensured by the data dependency between the read of *(ptr + 
> offset)* and the *CAS* operation on it. So, I don't see a reason to worry about 
> memory (re)ordering. 
>
>
>
>



Re: Befuddling deadlock (lock held but not in stack?)

2017-10-12 Thread Gil Tene
The "machine" we run on is not just the hardware. It's also the BIOS, the 
hypervisor, the kernel, the container system, the system libraries, and the 
various runtimes. Stuff that is "interesting" and "strange" about how the 
machine seems to behave is very appropriate to this group, IMO. And 
mysterious deadlock behaviors and related issues at the machine level (e.g. an 
apparent deadlock where no lock owner appears to exist, as is the case 
discussed here) are certainly "interesting" machine behaviors. Java or 
not, libposix or not, Linux or not, C#/C++/Rust or not. The fact that much 
of concurrency work happens to be done in Java does make Java and JVMs a 
more common context in which these issues are discussed, but the same can 
be said about Linux, even tho this is not a Linux support group.

E.g. the discussion we had a while back about Linux's futex wakeup bug 
(where certain kernel versions failed to wake up futexes, creating (among 
other things) apparent deadlocks in non-deadlocking code) was presumably 
appropriate for this group. I see Todd's query as no different. It is not a 
"my program has a deadlock" question. It is an observed "deadlock that 
isn't a deadlock" question. It may be a bug in tooling/reporting (e.g. 
tooling might be deducing the deadlock based on non-atomic sampling of 
stack state, as some have suggested here), or it may be a bug in the lock 
mechanisms, e.g. a failed wakeup at the JVM level. Either way, it would be 
just as interesting and relevant here as a failed wakeup or wrong lock 
state instrumentation at the Linux kernel level, or at the .NET CLR level, 
or at the golang runtime level, etc...
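
As an aside on the tooling angle, here is a minimal sketch (illustrative, 
not the code anyone in this thread is running) of asking the JVM itself for 
its view of a deadlock via java.lang.management's ThreadMXBean. Note that, 
as suggested above, the per-thread stacks and lock-ownership details are 
not necessarily one atomic snapshot:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class DeadlockProbe {
  static void dumpDeadlocks() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    long[] ids = mx.findDeadlockedThreads(); // null if none detected
    if (ids == null) {
      System.out.println("no deadlocked threads detected");
      return;
    }
    // true, true: include locked monitors and ownable synchronizers
    for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
      System.out.println(info);
    }
  }
}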

On Wednesday, October 11, 2017 at 6:06:16 AM UTC-7, Jarkko Miettinen wrote:
>
> keskiviikko 11. lokakuuta 2017 11.27.17 UTC+3 Avi Kivity kirjoitti:
>>
>> If this is not off topic, what is the topic of this group?
>>
>>
>> Is it a Java support group, or a "coding in a way that exploits the way 
>> the hardware works" group?
>>
>
>
> I have to agree here with Avi. Better to have a group for "coding in a way 
> that exploits the way the hardware works" and another group for Java 
> support. Otherwise there will be a lot of discussion of no real relation to 
> the topic except that people exploiting mechanical sympathy might have run 
> into such problems.
>
> (I would also be interested in a Java support group for the level of 
> problems that have been posted in this group before.)
>
>
>>
>> On 10/11/2017 10:29 AM, Kirk Pepperdine wrote:
>>
>> Not at all off topic… first, thread dumps lie like a rug… and here is 
>> why… 
>>
>> for each thread {
>> safe point
>> create stack trace for that thread
>> release threads from safe point
>> }
>>
>> And while rugs may attempt to cover the debris that you’ve swept under 
>> them, that debris leaves a clearly visible lump that suggests that you have 
>> a congestion problem on locks in both sun.security.provider.Sun and 
>> java.lang.Class…. What could possibly go wrong?
>>
>>
>> Kind regards,
>> Kirk
>>
>> On Oct 11, 2017, at 3:05 AM, Todd Lipcon  wrote:
>>
>> Hey folks, 
>>
>> Apologies for the slightly off-topic post, since this isn't performance 
>> related, but I hope I'll be excused since this might be interesting to the 
>> group members.
>>
>> We're recently facing an issue where a JVM is deadlocking in some SSL 
>> code. The resulting jstack report is bizarre -- in the deadlock analysis 
>> section it indicates that one of the locks is held by some thread, but in 
>> that thread's stack, it doesn't show the lock anywhere. Was curious if 
>> anyone had any ideas on how a lock might be "held but not held".
>>
>> jstack output is as follows (with other threads and irrelevant bottom 
>> stack frames removed):
>>
>> Found one Java-level deadlock:
>> =
>> "Thread-38190":
>>   waiting to lock monitor 0x267f2628 (object 0x802ba7f8, 
>> a sun.security.provider.Sun),
>>   which is held by "New I/O worker #1810850"
>> "New I/O worker #1810850":
>>   waiting to lock monitor 0x7482f5f8 (object 0x80ac88f0, 
>> a java.lang.Class),
>>   which is held by "New I/O worker #1810853"
>> "New I/O worker #1810853":
>>   waiting to lock monitor 0x267f2628 (object 0x802ba7f8, 
>> a sun.security.provider.Sun),
>>   which is held by "New I/O worker #1810850"
>>
>> Java stack information for the threads listed above:
>> ===
>> "Thread-38190":
>> at java.security.Provider.getService(Provider.java:1035)
>> - waiting to lock <0x802ba7f8> (a 
>> sun.security.provider.Sun)
>> at sun.security.jca.ProviderList.getService(ProviderList.java:332)
>> at sun.security.jca.GetInstance.getInstance(GetInstance.java:157)
>> at javax.net.ssl.SSLContext.getInstance(SSLContext.java:156)
>> at 
>> 

Re: ConcurrentHashMap size() method could read stale value?

2017-09-19 Thread Gil Tene
My dive into the specifics (of e.g. the volatile load) probably confused 
the issue. Nothing on either side actually establishes a happens-before 
relationship between e.g. seg.count write in put() in Thread1 and a 
seg.count read in size() in Thread2... It's up to an [optional] something 
else to establish that, if at all. There are basically three possibilities:

1. Writes in the put() in Thread1 happen-before Reads in size() in Thread2 
(via some other happens-before-establishing-thing on each side, one 
occurring after the return from put() on Thread1, and the other occurring 
before the call to size() in Thread2).

2. The Reads in size() in Thread2 happen-before the Writes in put() in 
Thread1 (via some other happens-before-establishing-thing on each side, one 
occurring after the return from size() in Thread2, and the other occurring 
before the call to put() in Thread1).

3. There is no happens-before-establishing-thing (in either direction) 
between the put() call or return in Thread1 and the size() call or return 
in Thread2.

If it's 1, the happens-before-establishing-things placed after the 
[Thread1] return from put() and before the [Thread2] call to size() 
establish that e.g. the write to seg.count in put() happens-before the 
read of seg.count in size(). [and will therefore return the size after the 
put]

If it's 2, the happens-before-establishing-things placed after the 
[Thread2] return from size() and before the [Thread1] call to put() 
establish that e.g. the read of seg.count in size() happens-before the 
write to seg.count in put() [and will therefore return the size before the 
put]

If it's 3, it's irrelevant, as no happens-before-establishing stuff means 
that size() is experiencing concurrent modification, and in the presence of 
concurrent modification, size() is allowed to return a stale notion of size 
and does not need an established happens-before relationship to the put()'s 
writes.
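
A minimal sketch of case 1 (an illustration, not from the original 
question): a volatile flag acting as the "other happens-before-establishing 
thing" that orders the put() in Thread1 before the size() in Thread2:

import java.util.concurrent.ConcurrentHashMap;

class SizeAfterPut {
  static final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
  static volatile boolean published;

  static void thread1() {
    map.put("k", "v");
    published = true; // volatile write, after the put()
  }

  static void thread2() {
    if (published) { // volatile read, before the size()
      // The put() above happens-before this call, so (absent further
      // concurrent modification) the returned size reflects it.
      System.out.println(map.size());
    }
  }
}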

On Monday, September 18, 2017 at 7:59:10 PM UTC-7, yang liu wrote:
>
> Thanks for the detailed reply. 
> Yes, JDK code is special.  From an ordering  perspective, the segmentAt() 
> does a volatile read and ensures visibility. 
> However, from a happens-before perspective, I cannot find a happens-before 
> relation between the seg.count read action and the seg.count write action. 
>
> thread 1                              |  thread 2
> ---------------------------------------------------------------------
> 1. segment reference volatile write   |  3. segment reference volatile read
> 2. seg.count write                    |  4. seg.count read
>
> It's obvious that action 1 has a happen-before relation with action 3.
> But there's no happen-before relation between action 2 and action 4.
> Am I wrong? Or I just should not consider it from the happen-before 
> perspective because the JDK code is special?
>
> On Monday, September 18, 2017 at 11:31:52 PM UTC+8, Gil Tene wrote:
>>
>> In the presence of concurrent modification, size() can be stale by 
>> definition, since the modification could occur between size() establishing 
>> a notion of what size is and your next operation that might assume the size 
>> is somehow representative of the current state of the table.
>>
>> When no concurrent modification exists between your call into size() and 
>> whatever the next thing you do that depends on that size is, the size will 
>> be up-to-date. The thing that achieves that in the OpenJDK 7 variant 
>> <http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/concurrent/ConcurrentHashMap.java#ConcurrentHashMap.size%28%29>
>>  
>> is the fact that segmentAt() does a volatile read, which is sufficient (at 
>> least in the specific OpenJDK 7 JVM implementation) to establish a LoadLoad 
>> ahead of the following reads of seg.count and seg.modCount to reflect 
>> external modifications that (by other ordering rules) occurred prior to 
>> your size() call. The unlock() at e.g. the bottom of seg.put() creates 
>> enough ordering on the modifying side after the changes to seg.count and 
>> seg.modCount.
>>
>> A critical thing to keep in mind when reading JDK code like this is that 
>> the you should not necessarily assume that your code can do the same thing 
>> safely from an ordering perspective. JDK code is special because it *knows* 
>> what JVMs it ships with, and can validly make assumptions about the JVMs 
>> behavior. Since j.u.c.ConcurrentHashMap in OpenJDK 7 is part of the JDK, it 
>> can make certain assumptions about the OpenJDK 7 JVM's handling of ordering 
>> that may be stronger than the JMM guarantees. E.g. it could assume that the 
>>

Re: ConcurrentHashMap size() method could read stale value?

2017-09-18 Thread Gil Tene
In the presence of concurrent modification, size() can be stale by 
definition, since the modification could occur between size() establishing 
a notion of what size is and your next operation that might assume the size 
is somehow representative of the current state of the table.

When no concurrent modification exists between your call into size() and 
whatever the next thing you do that depends on that size is, the size will 
be up-to-date. The thing that achieves that in the OpenJDK 7 variant 
<http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/concurrent/ConcurrentHashMap.java#ConcurrentHashMap.size%28%29> 
is the fact that segmentAt() does a volatile read, which is sufficient (at 
least in the specific OpenJDK 7 JVM implementation) to establish a 
LoadLoad ahead of the following reads of seg.count and seg.modCount to 
reflect external modifications that (by other ordering rules) occurred 
prior to your size() call. The unlock() at e.g. the bottom of seg.put() 
creates enough ordering on the modifying side.

A critical thing to keep in mind when reading JDK code like this is that 
you should not necessarily assume that your code doing the same thing 
would be safe from an ordering perspective. JDK code is special, because it 
*knows* what JVMs it ships with. Since j.u.c.ConcurrentHashMap in OpenJDK 7 
is part of the JDK, it can make certain assumptions about the JVM's 
handling of ordering that may be stronger than the JMM guarantees. E.g. it 
could assume that the *sufficient* (but not fully required by the JMM) 
ordering rules in http://gee.cs.oswego.edu/dl/jmm/cookbook.html are 
actually implemented, based on knowledge of the specific JVM that ships 
with the JDK code. E.g. a JVM that satisfies the rules in that cookbook's 
matrix will meet the ordering requirements between volatile loads and stores, 
regular loads and stores, and monitor enters and exits. But that doesn't 
actually mean that all JVMs (and future JDKs you may run on) will actually 
follow these rules. They may find some way to meet the JMM requirements 
without following these rules to the letter. E.g. they may apply the 
ordering between specific sets of fields (rather than globally, across all 
fields as stated in this matrix), and still meet the JMM without enforcing 
a LoadLoad between any volatile load and any field load.


On Sunday, September 17, 2017 at 11:54:45 PM UTC-7, yang liu wrote:
>
> "...useful only when a map is not undergoing concurrent updates in other 
> threads..." 
> The size() method sums the segment field "modCount" twice and compares the 
> result to ensure no concurrent updates in other threads.
> If there are concurrent updates, the size() method resorts to locking the 
> segments. So the size() method tries to get the most up-to-date 
> result, even if the result may already be stale by the time it returns. 
> In Java 1.6 the field "count" is volatile and the "modCount" 
> field read has a happens-before relation with the "count" write, so the sum of 
> "count" can give a mostly up-to-date result. But that's not the case for Java 1.7.
> In Java 1.7 the non-volatile fields "modCount" and "count" may fail 
> to give a mostly up-to-date value.
>
> On Monday, September 18, 2017 at 1:46:34 PM UTC+8, Nikolay Tsankov wrote:
>>
>> * Bear in mind that the results of aggregate status methods including
>> * {@code size}, {@code isEmpty}, and {@code containsValue} are typically
>> * useful only when a map is not undergoing concurrent updates in other 
>> threads.
>> * Otherwise the results of these methods reflect transient states
>> * that may be adequate for monitoring or estimation purposes, but not
>> * for program control.
>>
>>
>> On Mon, Sep 18, 2017 at 5:18 AM, yang liu  wrote:
>>
>>> Recently I studied the source code of ConcurrentHashMap. 
>>>
>>> In java 1.7, the segment field "count" got no volatile modifiers which 
>>> is different from java 1.6. 
>>>
>>> Is possible the size() method could read stale value through race read?
>>>
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "mechanical-sympathy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to mechanical-sympathy+unsubscr...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>



Re: Confusion regarding 'mark-sweep-compact' naming

2017-08-14 Thread Gil Tene


On Sunday, August 13, 2017 at 11:45:09 PM UTC-7, Kirk Pepperdine wrote:
>
> Hi, 
>
>
> > On Aug 14, 2017, at 7:47 AM, Peter Veentjer  > wrote: 
> > 
> > I have been improving my gc knowledge and there is some confusion on my 
> side regarding the 'mark-sweep-compact' algorithm I frequently see 
> mentioned in posts, articles and some not too formal books on the topic. 
> E.g. 
> > 
> > AFAIK mark-sweep-compact is not correct. It is either mark-sweep or 
> mark-compact where as part of the compacting process the dead objects are 
> removed. When I check more formal literature like "The Garbage Collection 
> Handbook: the Art of Automatic Memory Management", then there is no 
> mentioning of mark-sweep-compact. 
> > 
> > So I guess that when 'mark-sweep-compact' is mentioned, they are 
> actually referring to 'mark-compact'. Is this assumption correct? 
>
> The terminology around this is very loose so the very short answer is…. 
> sort of. So, a very terse but hopefully useful summary of collectors. 
>
> When we talk about mark-sweep we are generally referring to a class of 
> garbage collectors known as tracing collectors. The actual details of how 
> the mark and sweep for each of the different tracing collectors is 
> implemented depend upon a couple of factors. One of the most important 
> factors is how memory pool(s) (data structure in heap space) is organized. 
> I tend to label a collector as either “in-place” or “evacuation”. The most 
> trivial examples for both are; if you only have a single memory pool then 
> you will be performing an “in-place” collection as the data remains in the 
> that memory pool at the end of the collection. If you have two spaces you 
> can set this up as what is known has hemispheric collections. In this case 
> you mark the data in the currently active memory pool and then “evacuate” 
> it to the inactive pool. When complete active becomes inactive (thus empty) 
> and inactive becomes active. Wash, rinse, repeat…. 
>
> In-place collections will need a free-list and need to somehow deal with 
> fragmentation (of the free-list). Thus they compact. When and how often is 
> an implementation detail. Evacuating collectors use a top of heap pointer 
> (and other more complex tricks) that act as a free list. Since they 
> evacuate to another pool, the data will naturally be compacted. Thus no 
> compaction will be necessary. As you can imagine, the time/space cost 
> models for these two general approaches are quite different. 
>
> In OpenJDK we have both generational and regional collectors. The 
> generational collector divides memory into 4 memory pools named Eden, 
> Survivor 0, Survivor 1, and Tenured space. The first three are grouped into 
> young and Tenured into old space. In young we can use mark-evacuation where 
> as in old we are forced to use mark-sweep-compact. I’m going to claim (I’m 
> sure Gil will want to chime in here) that G1, C4, Balance and Shenandoah 
> are regional collectors in that they divide heap into a large number of 
> memory pools (regions).


Agreed. These are all regional Mark-Compact evacuating collectors. But G1, 
Balanced and Shenandoah are oldgen-only regional collectors (or single 
generation in the current Shenandoah case), while C4 uses both a regional 
Mark-Compact newgen collector and a regional Mark-Compact oldgen collector. 
As such, it gets all the nice-for-your-electron-budget benefits of 
generational collection along with all the nice benefits of regional 
collection.
 

> G1 aims for 2048 memory pools. In all 4 cases, the collectors are very 
> very different variants of mark-evacuation each with their own quirks. The 
> advantage of using regions is that you only need an evacuating collector as 
> you don’t end up with a terminal pool (tenured) where you are forced to use 
> an in-place collector. If you are careful in how you select regions to be 
> recovered, you can realize some nice wins over generational collectors. 
> However, these collectors tend to be far more complex and harder on your 
> electron budget than generational collectors are.


The harder-on-electron-budget thing is not algorithmically inherent to 
evacuating regional collectors. It usually has to do with other algorithmic 
choices that may be made in collectors that happen to be regional. E.g. 
some region-based incremental STW compactors (G1, Balanced) generally try 
to compact a small increment of the heap and fixup a hopefully-limited set 
of regions that refer to the compacted set, all in a single STW pause. 
These algorithms tend to rely on cross-region remembered-set tracking which 
involves some interesting electron budget implications, but more 
importantly, because the same regions will be involved in multiple fixup 
passes, the resulting N^2 complexity potential can also cause the electron 
budget to significantly inflate. Evacuating regional collector algorithms 
that do not rely on cross-region remembered sets, and 

Re: Confusion regarding 'mark-sweep-compact' naming

2017-08-14 Thread Gil Tene
I've taken to using Mark-Sweep-Compact, Mark-Sweep (no Compact), and 
Mark-Compact (no sweep) to describe the variants, because you will find all 
of them in the wild these days.

The confusion starts with the fact that some algorithms that people often 
refer to as "Mark-Sweep" are actually Mark-Sweep-Compact, while other are 
purely Mark-
Sweep (with no compaction). Mark-Sweep with no compaction usually means 
that free memory is recycled in place, with no object movement. A good 
example of this in practice is the CMS collector in the HotSpot JVM, which 
tracks unused memory in the oldgen in some form of free lists, but does not 
move any objects in the oldgen to compact it. Mark-Sweep *with* compaction 
(Mark-Sweep-Compact) obviously does compact. It usually implies the use of 
the knowledge established during a sweep (e.g. a list of the empty ranges) 
to compact the space "in place", It also usually involves a complete fixup 
of all references in all live objects. E.g. a simplistic description of a 
common Mark-Sweep-Compact technique would involves: (A) mark the live 
objects in the heap, (B) sweep the heap to find the empty ranges, (C) move 
all live objects to one side of the heap (filling in the empty spaces), (D) 
linearly scan all live objects and fix up any references in them to point 
to the correct target location. A good example of this in practice is the 
ParallelGC and SerialGC collectors in the HotSpot JVM.
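
To make the (A)-(D) steps concrete, here is a minimal, purely illustrative 
sketch (a toy model, not any real collector's code) that operates on a 
"heap" of objects ordered by index; the sweep here directly assigns 
forwarding addresses instead of building an explicit free list:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ToyMarkSweepCompact {
  static class Obj {
    boolean marked;
    int forwardedTo = -1;                         // index after compaction
    final List<Integer> refs = new ArrayList<>(); // indices of referenced objects
  }

  // heap slots holding null represent free/dead space
  static void collect(List<Obj> heap, List<Integer> roots) {
    // (A) mark: trace the live object graph from the roots
    ArrayDeque<Integer> stack = new ArrayDeque<>(roots);
    while (!stack.isEmpty()) {
      Obj o = heap.get(stack.pop());
      if (o == null || o.marked) continue;
      o.marked = true;
      stack.addAll(o.refs);
    }
    // (B) sweep: walk the heap linearly and assign each live object the
    // slot it will occupy once everything is slid to one side
    int free = 0;
    for (Obj o : heap) {
      if (o != null && o.marked) o.forwardedTo = free++;
    }
    // (C) compact: move live objects to one side of the heap
    List<Obj> old = new ArrayList<>(heap); // old layout, kept for the fixup pass
    Collections.fill(heap, null);
    for (Obj o : old) {
      if (o != null && o.marked) heap.set(o.forwardedTo, o);
    }
    // (D) fixup: linearly scan the live objects and rewrite their
    // references (and the roots) to the new locations
    for (int i = 0; i < free; i++) {
      Obj o = heap.get(i);
      o.refs.replaceAll(r -> old.get(r).forwardedTo);
      o.marked = false; // reset the mark for the next cycle
    }
    roots.replaceAll(r -> old.get(r).forwardedTo);
  }
}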

Similarly, more confusion is caused by the fact that some algorithms that 
people often refer to as "Mark-Compact" are actually "Mark-Sweep-Compact", 
while others are purely Mark-Compact (no sweep). Again, a good example of the 
first in common use would be ParallelGC in the HotSpot JVM (and the 
Wikipedia entry https://en.wikipedia.org/wiki/Mark-compact_algorithm). Good 
examples for the latter in common use can be found in various regional 
evacuating collectors (like C4, G1, etc.), where there is no sweep per-se 
(live objects are evacuated to outside of the originally marked heap, and 
there is some mechanism for finding/tracking them to do so, but that 
mechanism is not the traditional sweep used by Mark-Sweep style of 
collectors). Fixup behavior in such algorithms can also vary quite a bit 
(e.g. C4 normally uses the next cycle's Mark to perform fixups for the 
previous Compact's relocated objects, and as such does not require a 
scanning fixup pass).

Another key difference often pointed to is that with in-place compaction 
(Mark-Sweep-Compact), no additional empty memory is required to guarantee 
full compaction, while regional evacuating collectors need to evacuate into 
some empty memory, so their ability to compact may be limited by the amount 
of empty memory available outside of the currently occupied heap regions. 
However, several well-documented regional collector algorithms (e.g. 
Pauseless, C4, Compressor) 
perform hand-over-hand compaction to guarantee forward progress in 
evacuation, such that a single empty region is sufficient to achieve full 
compaction in a single pass.

On Sunday, August 13, 2017 at 10:47:01 PM UTC-7, Peter Veentjer wrote:
>
> I have been improving my gc knowledge and there is some confusion on my 
> side regarding the 'mark-sweep-compact' algorithm I frequently see 
> mentioned in posts, articles and some not too formal books on the topic. 
> E.g.
>
>
> https://plumbr.eu/handbook/garbage-collection-algorithms/removing-unused-objects/compact
> https://www.infoq.com/news/2015/12/java-garbage-collection-minibook
>
> AFAIK mark-sweep-compact is not correct. It is either mark-sweep or 
> mark-compact where as part of the compacting process the dead objects are 
> removed. When I check more formal literature like "The Garbage Collection 
> Handbook: the Art of Automatic Memory Management", then there is no 
> mentioning of mark-sweep-compact.
>
> So I guess that when 'mark-sweep-compact' is mentioned, they are actually 
> referring to 'mark-compact'. Is this assumption correct?
>
>



Re: failing to understand the issues with transparent huge paging

2017-08-13 Thread Gil Tene
It looks like newer kernels (4.6 and above) support a "defer" option for 
THP behavior (4.11 and above also support "defer+madvise"). This behavior 
is set via /sys/kernel/mm/transparent_hugepage/defrag (separate from 
/sys/kernel/mm/transparent_hugepage/enabled). With the defer options on, 
allocations avoid [synchronously] attempting to compact memory, and should 
avoid the huge outliers discussed earlier.

See Andrea Arcangeli's FOSDEM slides from Feb 2017 for some details. 

cat /sys/kernel/mm/transparent_hugepage/defrag will show you the available 
options on your system. Looking up the kernel function defrag_show() is 
probably the easiest way to check for this on a given kernel's sources:
Starting with 4.11, it shows "[always] defer defer+madvise madvise never" 
(http://elixir.free-electrons.com/linux/v4.11/source/mm/huge_memory.c#L219)
Starting with 4.6 , it shows "[always] defer madvise never" 
(http://elixir.free-electrons.com/linux/v4.6/source/mm/huge_memory.c#L380)
Prior versions (4.5 and earlier) did not have a defer option... And 
RHEL/CentOS don't either (not up to RHEL 7.3 anyway).

Personally, I'll need to kick this around a bit to see if it really 
works before starting to recommend using it (in place of "never") in 
cases where people care about avoiding huge outliers. Unfortunately, the 
latest RHEL/CentOS versions don't yet have the "defer" option, and only 
the latest Ubuntu LTS (16.04.02) does. So many of the production 
environments run by people I get to interact with don't yet have the 
feature. It may take a while to get real production experience with 
"defer". But assuming it delivers on the promised behavior, it does present 
a compelling argument for upgrading to the latest releases when they start 
supporting the feature. E.g. for Ubuntu users, moving to Ubuntu LTS 
16.04.02 (or some later release), switching THP enabled to "always" and THP 
defrag to "defer" may provide a very real performance boost.









Re: failing to understand the issues with transparent huge paging

2017-08-13 Thread Gil Tene
>> accept the occasional ~0.5+ sec freezes, turn it off. 
>
>
> I just wanted to show for people who blindly follow advice on the Internet 
> (and there are many such suggestions) that there's an impact. It can be 
> noticeable and depends on setup and load.
>
>
And please keep doing that. Concrete results postings are useful. And the 
speed benefit you show in your application is quite compelling. My 
motivation is quite similar, but my focus here was on highlighting the "if 
you want to avoid terrible outliers" thing in answering the original 
questions at the top of this thread. I see way too many "recommendations on 
the internet" based purely on speed, which ignore the outliers and other 
degenerate thrashing behaviors that may occur (infrequently, but far too 
often for some)...
 

>
>
> On Sunday, August 13, 2017 at 10:10:01 AM UTC+3, Gil Tene wrote:
>>
>>
>>
>> On Saturday, August 12, 2017 at 3:01:31 AM UTC-7, Alexandr Nikitin wrote:
>>>
>>> I played with Transparent Hugepages some time ago and I want to share 
>>> some numbers based on real world high-load applications.
>>> We have a JVM application: high-load tcp server based on netty. No clear 
>>> bottleneck, CPU, memory and network are equally highly loaded. The amount 
>>> of work depends on request content.
>>> The following numbers are based on normal server load ~40% of maximum 
>>> number of requests one server can handle.
>>>
>>> *When THP is off:*
>>> End-to-end application latency in microseconds:
>>> "p50" : 718.891,
>>> "p95" : 4110.26,
>>> "p99" : 7503.938,
>>> "p999" : 15564.827,
>>>
>>> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
>>> ...
>>> ... 25,164,369  iTLB-load-misses
>>> ... 81,154,170  dTLB-load-misses
>>> ...
>>>
>>> *When THP is always on:*
>>> End-to-end application latency in microseconds:
>>> "p50" : 601.196,
>>> "p95" : 3260.494,
>>> "p99" : 7104.526,
>>> "p999" : 11872.642,
>>>
>>> perf stat -e dTLB-load-misses,iTLB-load-misses -p PID -I 1000
>>> ...
>>> ...21,400,513  dTLB-load-misses
>>> ...  4,633,644  iTLB-load-misses
>>> ...
>>>
>>> As you can see THP performance impact is measurable and too significant 
>>> to ignore. 4.1 ms vs 3.2 ms 99%% and 100M vs 25M TLB misses.
>>> I also used SytemTap to measure few kernel functions like 
>>> collapse_huge_page, clear_huge_page, split_huge_page. There were no 
>>> significant spikes using THP.
>>> AFAIR that was 3.10 kernel which is 4 years old now. I can repeat 
>>> experiments with the newer kernels if there's interest. (I don't know what 
>>> was changed there though)
>>>
>>
>> Unfortunately, just because you didn't run into a huge spike during your 
>> test doesn't mean it won't hit you in the future... The stack trace example 
>> I posted earlier represents the path that will be taken if an on-demand 
>> allocation page fault on a THP-allocated region happens when no free 2MB 
>> page is available in the system. Inducing that behavior is not that hard, 
>> e.g. just do a bunch of high volume journaling or logging, and you'll 
>> probably trigger it eventually. And when it does take that path, that will 
>> be your thread de-fragging the entire system's physical memory, one 2MB 
>> page at a time.
>>
>> And when that happens, you're probably not talking 10-20msec. More like 
>> several hundreds of msec (growing with the system physical memory size, the 
>> specific stack trace is taken from a RHEL issue that reported >22 seconds). 
>> If that occasional outlier is something you are fine with, then turning THP 
>> on for the speed benefits you may be seeing makes sense. But if you can't 
>> accept the occasional ~0.5+ sec freezes, turn it off. 
>>
>>
>>>
>>> On Monday, August 7, 2017 at 6:42:21 PM UTC+3, Peter Veentjer wrote:
>>>>
>>>> Hi Everyone,
>>>>
>>>> I'm failing to understand the problem with transparent huge pages.
>>>>
>>>> I 'understand' how normal pages work. A page is typically 4kb in a 
>>>> virtual address space; each process has its own. 
>>>>
>>>> I understand how the TLB fits in; a cache providing a mapping of 
>>>> virtual to real addresses to speed up address conversion.
>>>>
>>>>

Re: Measuring JVM memory for containers

2017-08-04 Thread Gil Tene
or_count changes during 
> startup
>
> [2]:
> [areese@refusesbruises ]$ java -XX:MaxDirectMemorySize=1m -Xms256m 
>  -Xmx256m -XX:NativeMemoryTracking=summary  -XX:+UnlockDiagnosticVMOptions 
> -XX:+PrintNMTStatistics -cp . test
>
> Native Memory Tracking:
>
> Total: reserved=1600200KB, committed=301232KB
> - Java Heap (reserved=262144KB, committed=262144KB)
> (mmap: reserved=262144KB, committed=262144KB) 
>  
> - Class (reserved=1059947KB, committed=8043KB)
> (classes #391)
> (malloc=3179KB #129) 
> (mmap: reserved=1056768KB, committed=4864KB) 
>  
> -Thread (reserved=10323KB, committed=10323KB)
> (thread #10)
> (stack: reserved=10280KB, committed=10280KB)
> (malloc=32KB #54) 
> (arena=12KB #20)
>  
> -  Code (reserved=249631KB, committed=2567KB)
> (malloc=31KB #296) 
> (mmap: reserved=249600KB, committed=2536KB) 
>  
> -GC (reserved=13049KB, committed=13049KB)
> (malloc=3465KB #111) 
> (mmap: reserved=9584KB, committed=9584KB) 
>  
> -  Compiler (reserved=132KB, committed=132KB)
> (malloc=1KB #21) 
> (arena=131KB #3)
>  
> -  Internal (reserved=3277KB, committed=3277KB)
> (malloc=3245KB #1278) 
> (mmap: reserved=32KB, committed=32KB) 
>  
> -Symbol (reserved=1356KB, committed=1356KB)
> (malloc=900KB #64) 
> (arena=456KB #1)
>  
> -    Native Memory Tracking (reserved=34KB, committed=34KB)
> (malloc=3KB #32) 
> (tracking overhead=32KB)
>  
> -   Arena Chunk (reserved=305KB, committed=305KB)
> (malloc=305KB) 
>  
> [areese@refusesbruise ]$ 
>
>
> On Friday, August 4, 2017 at 8:58:43 AM UTC-7, Gil Tene wrote:
>>
>> Yes, you can do a lot of estimation, but measuring the actual memory usage 
>> against the actual limit you expect to be enforced is the main thing you 
>> have to do... You'll want to make sure the things you measure have actually 
>> been exercised well before you take the measurement, and you *always* want 
>> to pad that measured memory use since you may easily miss some use that can 
>> happen in the future.
>>
>> In doing memory use measurement, it's important to understand that 
>> reported memory use usually only accounts for physical memory that was 
>> actually touched by the program, because Linux does on-demand physical page 
>> allocation at page modification time, NOT at mapping or allocation time. 
>> And on demand physical page allocation only occurs when the contents of the 
>> page has been modified for the first time. Even tho it seems like most of 
>> the memory is "allocated" when you start, actual emory use will grow over 
>> time as you exercise behaviors that actually make use of the memory you 
>> allocated. The parts of the heap, code cache, metaspace, etc. that were 
>> allocated but not yet used or exercised by your program will NOT show up 
>> in memory.usage_in_bytes. The same goes for any off heap memory your 
>> process may be using (e.g. using DirectByteBuffers): allocation does not 
>> show up as usage until the first (modifying) touch.
>>
>> Here are some things that can help in making your observed memory use 
>> reflect eventual memory use:
>>
>> Making sure the heap is "real" is simple:
>> - Set -Xms to be equal to -Xmx (to avoid starting lower and expanding 
>> later)
>> - Use -XX:+AlwaysPreTouch to make sure all heap pages were actually 
>> touched (which would force physical memory to actually be allocated, which 
>> will make them show up in the used balance),
>> With these two settings, you will ensure that all heap pages (for all of 
>> Xmx) will actually be physically allocated.
>>
>> For the non-heap parts of memory it gets more complicated. And HotSpot 
>> has TONS of that sort of memory (eventual RSS that goes far above Xmx), so 
>> ignoring it or going for some "10-20% slop" rule will usually come back to 
>> bite you hours into a run. Significant memory HotSpot manages outside of 
>> the Xmx-sized hea

Re: Measuring JVM memory for containers

2017-08-04 Thread Gil Tene
Yes, you can do a lot of estimation, but measuring the actual memory usage 
against the actual limit you expect to be enforced is the main thing you 
have to do... You'll want to make sure the things you measure have actually 
been exercised well before you take the measurement, and you *always* want 
to pad that measured memory use since you may easily miss some use that can 
happen in the future.

In doing memory use measurement, it's important to understand that reported 
memory use usually only accounts for physical memory that was actually 
touched by the program, because Linux does on-demand physical page 
allocation at page modification time, NOT at mapping or allocation time. 
And on-demand physical page allocation only occurs when the contents of the 
page have been modified for the first time. Even tho it seems like most of 
the memory is "allocated" when you start, actual memory use will grow over 
time as you exercise behaviors that actually make use of the memory you 
allocated. The parts of the heap, code cache, metaspace, etc. that were 
allocated but not yet used or exercised by your program will NOT show up 
in memory.usage_in_bytes. The same goes for any off heap memory your 
process may be using (e.g. using DirectByteBuffers): allocation does not 
show up as usage until the first (modifying) touch.

Here are some things that can help in making your observed memory use reflect 
eventual memory use:

Making sure the heap is "real" is simple:
- Set -Xms to be equal to -Xmx (to avoid starting lower and expanding later)
- Use -XX:+AlwaysPreTouch to make sure all heap pages were actually touched 
(which would force physical memory to actually be allocated, which will 
make them show up in the used balance),
With these two settings, you will ensure that all heap pages (for all of 
Xmx) will actually be physically allocated.
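
For example, a hypothetical invocation (pick a size that actually fits the 
limit your container or cgroup will enforce):

java -Xms4g -Xmx4g -XX:+AlwaysPreTouch -jar yourapp.jar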

For the non-heap parts of memory it gets more complicated. And HotSpot has 
TONS of that sort of memory (eventual RSS that goes far above Xmx), so 
ignoring it or going for some "10-20% slop" rule will usually come back to 
bite you hours into a run. Significant memory HotSpot manages outside of 
the Xmx-sized heap includes various GC things that only get populated as a 
result of actually being exercised (card tables, G1 remembered set support 
structures), Metaspace that only gets touched when actually filled with 
classes, Code cache memory that only gets touched when actually filled with 
JIT'ed code, etc. As a result, you really want to make sure your application 
gets some "exercise" before you measure the process memory footprint. 

- Run load for a while (make sure your code has gone through its paces, 
JITs have done their job, etc.). My rule of thumb for "a while" means at 
least 10 minutes of actual load and at least 100K operations (of whatever 
operations you actually do). Don't settle for or try to draw conclusions 
from silly 10-20 seconds or 1000 op micro-tests.

- Make sure to have experienced several oldgen collections with whatever 
collector you are using. This can be more challenging than you think, 
because most GC tuning for HotSpot is focused on delaying this stuff as 
much as possible, making some of the cannot-be-avoided behaviors only occur 
hours into normal runs under load. To combat this, you can use tools that 
can be added as agents and intentionally exercise the collector with your 
actual application running. E.g. my HeapFragger 
(https://github.com/giltene/HeapFragger) has this exact purpose: to aid 
application testers in inducing inevitable-but-rare garbage collection 
events without having to wait days for them to happen. It can be capped to 
use a tiny fraction of the heap, and a configurable amount of CPU (set by 
controlling its allocation rate), and can generally put the collector 
through its paces in a matter of minutes, including forcing oldgen 
compaction in CMS, and making G1 deal with remembered sets. For G1 at 
least, this exercise can make a HUGE difference in the observed process 
memory footprint, folks have seen remembered set footprint grow to as big 
as 50% of Xmx after being exercised (and I think it can go beyond that in 
theory).

And remember that even after you've convinced yourself that all has been 
exercised, you should pad the observed result by some safety-factor slop.
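
As a minimal sketch (assuming the cgroup v1 files mentioned in the reply 
quoted below; paths differ under cgroup v2), comparing the exercised 
footprint against the enforced limit can be as simple as:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class CgroupMemoryCheck {
  static long readLong(String path) throws IOException {
    return Long.parseLong(Files.readAllLines(Paths.get(path)).get(0).trim());
  }

  public static void main(String[] args) throws IOException {
    long usage = readLong("/sys/fs/cgroup/memory/memory.usage_in_bytes");
    long limit = readLong("/sys/fs/cgroup/memory/memory.limit_in_bytes");
    System.out.printf("usage=%d limit=%d ratio=%.2f%n",
        usage, limit, (double) usage / limit);
  }
}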

On Friday, August 4, 2017 at 12:18:47 AM UTC-7, Sebastian Łaskawiec wrote:
>
> Thanks a lot for all the hints! They helped me a lot.
>
> I think I'm moving forward. The key thing was to calculate the amount of 
> occupied memory seen by CGroups. It can be easily done using:
>
>- /sys/fs/cgroup/memory/memory.usage_in_bytes
>- /sys/fs/cgroup/memory/memory.limit_in_bytes
>
> Calculated ratio along with Native Memory Tracking [1] helped me to find a 
> good balance. I also found a shortcut which makes setting initial 
> parameters much easier: -XX:MaxRAM [2] (and set it based on CGroups limit). 
> The downside is that with MaxRAM 

Re: JVM random performance

2017-08-01 Thread Gil Tene
Add -XX:+PrintGCTimeStamps. Also, run with time so we can see the total run 
time...
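
For example, a hypothetical invocation (the -Xloggc path and jar name are 
placeholders; the flags are the JDK 8-era HotSpot logging flags mentioned in 
this thread):

time java -Xloggc:gc.log -XX:+PrintGCTimeStamps \
  -XX:+PrintGCApplicationStoppedTime -jar benchmark.jar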

On Tuesday, August 1, 2017 at 12:32:37 PM UTC-7, Roger Alsing wrote:
>
> Does this tell anyone anything?
> https://gist.github.com/rogeralsing/1e814f80321378ee132fa34aae77ef6d
> https://gist.github.com/rogeralsing/85ce3feb409eb7710f713b184129cc0b
>
> This is beyond my understanding of the JVM.
>
> ps. no multi socket or numa.
>
> Regards
> Roger
>
>
> Den tisdag 1 augusti 2017 kl. 20:22:23 UTC+2 skrev Georges Gomes:
>>
>> Are you benchmarking on a multi-socket/NUMA server?
>>
>> On Tue, Aug 1, 2017, 1:48 PM Wojciech Kudla  wrote:
>>
>>> It definitely makes sense to have a look at gc activity, but I would 
>>> suggest looking at safepoints from a broader perspective. Just use 
>>>  -XX:+PrintGCApplicationStoppedTime to see what's going on. If it's 
>>> safepoints, you could get more details with safepoint statistics. 
>>> Also, benchmark runs in java may appear undeterministic simply because 
>>> compilation happens in background threads by default and some runs may 
>>> exhibit a different runtime profile since the compilation threads receive 
>>> their time slice in different moments throughout the benchmark. 
>>> Are the results also jittery when run entirely in interpreted mode? It 
>>> may be worth to experiment with various compilation settings (ie. disable 
>>> tiered compilation, employ different warmup strategies, play around with 
>>> compiler control). 
>>> Are you employing any sort of affinitizing threads to cpus? 
>>> Are you running on a multi-socket setup? 
>>>
>>> On Tue, 1 Aug 2017, 19:27 Roger Alsing,  wrote:
>>>
 Some context: I'm building an actor framework, similar to Akka but 
 polyglot/cross-platform..
 For each platform we have the same benchmarks, where one of them is an 
 in process ping-pong benchmark.

 On .NET and Go, we can spin up pairs of ping-pong actors equal to the 
 number of cores in the CPU and no matter if we spin up more pairs, the 
 total throughput remains roughly the same.
 But on the JVM, if we do this, I can see how we max out at 100% CPU, 
 as expected, but if I instead spin up a lot more pairs, e.g. 20 * 
 core_count, the total throughput triples.

 I suspect this is due to the system running in a more steady state kind 
 of fashion in the latter case, mailboxes are never completely drained and 
 actors don't have to switch between processing and idle.
 Would this be fair to assume?
 This is the reason why I believe this is a question for this specific 
 forum.

 Now to the real question: roughly 60-40 of the time when the benchmark is 
 started, it runs at 250 mil msg/sec steadily, and the other times it runs 
 at 350 mil msg/sec.
 The reason why I find this strange is that it is stable over time. if I 
 don't stop the benchmark, it will continue at the same pace.

 If anyone is bored and like to try it out, the repo is here:
 https://github.com/AsynkronIT/protoactor-kotlin
 and the actual benchmark here: 
 https://github.com/AsynkronIT/protoactor-kotlin/blob/master/examples/src/main/kotlin/actor/proto/examples/inprocessbenchmark/InProcessBenchmark.kt

 This is also consistent with or without various vm arguments.

 I'm very interested to hear if anyone has any theories what could cause 
 this behavior.

 One factor that seems to be involved is GC, but not in the obvious way, 
 rather reversed.
 In the beginning, when the framework allocated more memory, it more 
 often ran at the high speed.
 And the fewer allocations I've managed to do w/o touching the hot path, 
 the more the benchmark has started to toggle between these two numbers.

 Thoughts?



Re: Why does the G1 GC shrink number of regions?

2017-06-05 Thread Gil Tene


On Monday, June 5, 2017 at 4:58:40 PM UTC+3, Alexandr Nikitin wrote:
>
> Gil, Kirk, Thank you for the great and thorough explanation! It is an 
> awesome read! Almost everything fell into place.
>
> Indeed, Gil is right about the application pattern change. It's a 
> high-load API. There's a background thread that changes an internal state. 
> The state is ~3.5G and requires some similar amount of memory to build 
> itself. It happens in one background thread with lower priority. The change 
> doesn't happen often, once per hour or so.
>
> The G1 is new for me and I couldn't intuitively understand why it reacts 
> to the allocation pattern change like that. There's the main allocation/ 
> collection pattern that lasts for hours. And it needs just one background 
> low priority thread (which allocates less memory than worker threads) to 
> change it :) That region number change leads to more copying and promotions 
> and longer pauses as a consequence (it changes max age threshold to 1 and 
> happens more often)
>
> And the actual bottom line is very simple: You need a concurrent newgen to 
>> handle this workload, with these phase changes, without any of those 
>> "hiccups". We can discuss that in other posts if you want ;-).
>>
>
> Do you mean Azul Zing by that?
>

Yup. 

Zing's newgen would happily copy and promote those 3.5GB concurrently, and 
wouldn't break a sweat doing it. It will take a couple of seconds, but not 
stop-the-world seconds. Since all the collection work is done in the 
background and is concurrent, you won't feel it. And since Zing doesn't 
need to worry much about a "bigger pause" happening if it lets newgen grow, it 
doesn't have to hurry up and collect too early, so collections would still be 
~110GB of allocation apart given your heap size and apparent live set. At 
your allocation rate, that looks like newgens can stay several hundreds of 
seconds apart, keeping them very efficient. Newgen will end up copying 
those 3.5GB once (to promote), or maybe twice (if the collection happened 
less than 2 seconds after the new 3.5GB of state was allocated). But not 
several times. Zing doesn't need to keep things in newgen for multiple 
cycles to age them. We use time instead.

In fact, you could probably run this workload (with this behavior) on Zing 
with about half or 2/3 of the heap you are using for HotSpot, and still 
not break a sweat.

We'll probably also end up running an oldgen every 10 minutes or so as 
well. Just for fun. Not because of any pressure, but simply because the 
collector gets bored after a while of not doing oldgens when other stuff 
gets to have fun. Since oldgen is concurrent too (including compaction), 
it's ok to do it lazily to make sure we don't delay some strange work to 
later. We like to make sure that unless you are idle, you've seen the full 
behavior of the system if you've been running for an hour or so. Our rule 
is: if a newgen happens, and no oldgen has run in the past ten minutes, do 
one.
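
For illustration only, here is that rule restated as a small sketch. The names 
are made up and this is not actual Zing code, just the stated policy in code form:

  public class LazyOldgenTrigger {
    private static final long TEN_MINUTES_MS = 10 * 60 * 1000L;
    private long lastOldgenStartMillis = Long.MIN_VALUE / 2; // "a long time ago"

    // Called when a newgen collection completes; lazily kicks off an oldgen
    // cycle if none has started in the past ten minutes.
    void onNewgenComplete(long nowMillis, Runnable startConcurrentOldgen) {
      if (nowMillis - lastOldgenStartMillis >= TEN_MINUTES_MS) {
        lastOldgenStartMillis = nowMillis;
        startConcurrentOldgen.run();
      }
    }
  }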



Re: Why would SocketChannel be slower when sending a single msg instead of 1k msgs after proper warmup?

2017-04-13 Thread Gil Tene
If I read this right, you are running this on localhost (according to the SO 
code). If that's the case, there is no actual network, and no actual TCP 
stack... UDP or TCP won't make a difference then, and neither will any TCP 
tweaking. I think this rules out the network, the switch, the NICs, and 
most of the OS's network stack. 

Now you're looking at the JVM, the OS scheduling, power management, cache 
behavior, etc.

Some more things to play with to rule out or find some insight:

- Rule out de-optimization (you may be de-optimizing when the if 
(totalMessagesSent == WARMUP) triggers). Do this by examining the 
-XX:+PrintCompilation output

- Rule out scheduling and cpu migration effects: use isolcpus and pin your 
processes to specific cores

- How do you know that you actually disabled all power management? I'd 
monitor the C-state and P-state to see what they actually are over time. 
  Cool anecdote: We once had a case where something in the system was 
mysteriously elevating the C-state away from 0 after we set it to 0. We never 
did find out what it was. The case was "resolved" with a cron job that set 
the C-state to 0 every minute (yuck, I know).

- Sweep different interval times in your tests to find out how long the 
interval needs to be before you see the perf drop. The value at which this 
effect starts may be an interesting hint
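
For that last point, here is a rough sketch of what such a sweep could look like. 
The roundTripNanos() body is just a placeholder (the actual SocketChannel write + 
blocking read from the SO code would go there), and the pause values and sample 
counts are made up:

  import java.util.concurrent.TimeUnit;

  public class PauseSweep {

    // Placeholder: time one ping-pong (write + blocking read) here.
    static long roundTripNanos() {
      long t0 = System.nanoTime();
      // ... perform one write + read against the echo server ...
      return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws InterruptedException {
      long[] pausesMillis = {1, 10, 100, 1_000, 10_000, 30_000};
      for (long pause : pausesMillis) {
        long worst = 0;
        for (int i = 0; i < 20; i++) {          // a handful of samples per pause length
          TimeUnit.MILLISECONDS.sleep(pause);
          worst = Math.max(worst, roundTripNanos());
        }
        System.out.printf("pause=%dms worstRoundTrip=%dus%n", pause, worst / 1_000);
      }
    }
  }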


On Wednesday, April 12, 2017 at 12:56:22 PM UTC-7, J Crawford wrote:
>
> The SO question has the source codes of a simple server and client that 
> demonstrate and isolate the problem. Basically I'm timing the latency of a 
> ping-pong (client-server-client) message. I start by sending one message 
> every 1 millisecond. I wait for 200k messages to be sent so that the 
> HotSpot has a chance to optimize the code. Then I change my pause time from 
> 1 millisecond to 30 seconds. To my surprise, my write and read operations 
> become considerably slower.
>
> I don't think it is a JIT/HotSpot problem. I was able to pinpoint the 
> slower method to the native JNI calls to write (write0) and read. Even if I 
> change the pause from 1 millisecond to 1 second, problem persists.
>
> I was able to observe that on MacOS and Linux.
>
> Does anyone here have a clue of what can be happening?
>
> Note that I'm disabling Nagle's Algorithm with setTcpNoDelay(true).
>
> SO question with code and output: 
> http://stackoverflow.com/questions/43377600/socketchannel-why-if-i-write-msgs-quickly-the-latency-of-each-message-is-low-b
>
> Thanks!
>
> -JC
>



Re: Why a Java method invocation is slower when you call it somewhere else in your code?

2017-04-09 Thread Gil Tene


On Saturday, April 8, 2017 at 9:40:46 AM UTC-7, Kirk Pepperdine wrote:
>
>
> >>> 
> >>> - Your mySleep won't actually do what you think it does. The entire 
> method can be optimized away to nothing after inlining at the call site by 
> the JIT once the calls to it actually warm up enough, since it has no side 
> effects and nothing is done with its return code. 
> >> 
> >> Well, this won’t happen in OpenJDK because of the return value. 
> > 
> > The return value "saves" you only as long as the method doesn't get 
> inlined. After it is inlined, the fact that the return value isn't used 
> allows the JIT to kill the entire code… 
>
> You’d think but not in my experience. 
>

Stock OpenJDK currently inlines and completely eliminates:

  public static int wasteSomeTime(int t) {
int x = 0;
for(int i = 0; i < t * 1; i++) {
  x += (t ^ x) % 93;
}
 return x;
  }

When called like this:

  wasteSomeTime(sleepArg);


So return values demonstrably don't prevent the optimization...


The optimization will not happen without inlining the method at the call site. 


I built a small set of jmh benchmarks to demonstrate this. They result in this:


Benchmark                                             (benchLoopCount)  (sleepArg)   Mode  Cnt           Score            Error  Units
MethodInliningExampleBench.noRetValIntLoop                          10           1  thrpt    5  2830940580.903 ±   52900090.474  ops/s
MethodInliningExampleBench.noRetValIntLoopNoInlining                10           1  thrpt    5        5500.356 ±        245.758  ops/s
MethodInliningExampleBench.retValIntLoop                            10           1  thrpt    5  2877030926.237 ±  134788500.109  ops/s
MethodInliningExampleBench.retValIntLoopNoInlining                  10           1  thrpt    5           0.219 ±          0.007  ops/s



Which demonstrates that when inlining is **prevented** at the caller there 
is a real difference between having a return value and not (the loop in the 
method gets optimized away only if there is no return value), but that when 
inlining is not prevented at the caller and the return value is not used, 
both cases get optimized away the same way. 

And since it is "hard" to reliably disallow inlining (without e.g. using 
Aleksey's cool @CompilerControl(CompilerControl.Mode.DONT_INLINE) 
annotations in jmh), inlining can bite you and wreck your assumptions at 
any time...
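
For illustration, a benchmark pair along these lines would show the effect. This 
is a hedged sketch, not the linked benchmarks; the class name, field names, and 
the loop constant are made up:

  import org.openjdk.jmh.annotations.Benchmark;
  import org.openjdk.jmh.annotations.CompilerControl;
  import org.openjdk.jmh.annotations.Scope;
  import org.openjdk.jmh.annotations.State;

  @State(Scope.Thread)
  public class InliningSketch {

    int sleepArg = 100;   // stand-in parameter

    // May be inlined at the call site; with the return value unused there,
    // the whole loop is a dead-code candidate.
    public static int wasteSomeTime(int t) {
      int x = 0;
      for (int i = 0; i < t * 10000; i++) {
        x += (t ^ x) % 93;
      }
      return x;
    }

    // Inlining explicitly disallowed, so the call (and its loop) must survive.
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public static int wasteSomeTimeNoInline(int t) {
      int x = 0;
      for (int i = 0; i < t * 10000; i++) {
        x += (t ^ x) % 93;
      }
      return x;
    }

    @Benchmark
    public void inliningAllowed() {
      wasteSomeTime(sleepArg);         // return value deliberately ignored
    }

    @Benchmark
    public void inliningDisallowed() {
      wasteSomeTimeNoInline(sleepArg); // return value deliberately ignored
    }
  }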

Interestingly, as you can see from the same jmh tests above, while stock 
OpenJDK will optimize away the above code, it *currently* won't optimize 
away this code:

  public static long mySleepL1(long t) {
long x = 0;
for(int i = 0; i < t * 1; i++) {
  x += (t ^ x) % 93;
}
return x;
  }

Which differs only in using longs instead of ints.

The results for the longs tests are:

Benchmark                                              (benchLoopCount)  (sleepArg)   Mode  Cnt           Score            Error  Units
MethodInliningExampleBench.noRetValLongLoop                          10           1  thrpt    5  2924098828.778 ±  234409260.906  ops/s
MethodInliningExampleBench.noRetValLongLoopNoInlining                10           1  thrpt    5           0.243 ±          0.013  ops/s
MethodInliningExampleBench.retValLongLoop                            10           1  thrpt    5           0.254 ±          0.014  ops/s
MethodInliningExampleBench.retValLongLoopNoInlining                  10           1  thrpt    5           0.246 ±          0.012  ops/s



So using longs seems to defeat some of the *current* OpenJDK 
optimizations. But how much would you want to bet on that staying the same 
in the next release? 

Similarly, *current* stock OpenJDK won't recognize that System.nanoTime() 
and System.currentTimeMillis() have no side effects, so the original 
example method:
 
public static long mySleep(long t) {
long x = 0;
for(int i = 0; i < t * 1; i++) {
x += System.currentTimeMillis() / System.nanoTime();
}
return x;
}

Will not optimize away at the call site on *current* OpenJDK builds.  But 
this can change at any moment as new optimizations and metadata about 
intrinsics are added in coming versions or with better optimizing JITs.

In all these cases, dead code *might* be removed. And whether or not it 
does can depend on the length of the run, the data you use, the call site, 
the phase of the moon, or the version of the JDK or JIT that happens to 
run your code. Any form of comparison (between call sites, versions, etc.) 
with such dead code involved is flakey, and will often lead to "surprising" 
conclusions. Sometimes those surprising conclusions happen right away. 
Sometimes they happen a year later, when you test again using your 
previously established, tried-and-tested, based-on-experience tests that no 

Re: Why a Java method invocation is slower when you call it somewhere else in your code?

2017-04-08 Thread Gil Tene


Sent from my iPad

> On Apr 8, 2017, at 9:37 AM, Kirk Pepperdine <k...@kodewerk.com> wrote:
> 
> 
>> On Apr 8, 2017, at 6:14 PM, Gil Tene <g...@azul.com> wrote:
>> 
>> There is a lot in this code to "learn" from.
> 
> Agreed….
> 
>> You can/should obviously use something like jmh to save you much of the 
>> trouble of making these mistakes. But if you want to learn about the various 
>> mistakes that people commonly make when trying to measure stuff, this code 
>> is a good starting point. It is a classic "how not to micro-benchmark" 
>> example and a good teaching/learning opportunity. So if we change the title 
>> from "Why a Java method invocation is slower..." to "Why does measuring like 
>> this give me the wrong impression that...", and putting aside style 
>> comments, here are some comments and highlights of likely reasons of why you 
>> are seeing a bigger number for the time elapsed in your single invocation 
>> from a second call site compared to the 1,000,000 invocations from the 
>> earlier call site:
>> 
>> - The first call site does not warm up the method. It warms up the call 
>> site. "Warmup" is not a proper term fir what you are doing there... The 
>> method can be (and likely was) inlined into the call site and optimized 
>> there.
>> 
>> - When the second call site is invoked once, it may not have been inlined 
>> there yet, and may be calling interpreted or lower tiered optimization 
>> versions of the method. If exercised enough, the second call site would 
>> likely inline as well.
>> 
>> Separately interesting:
>> 
>> - Your mySleep won't actually do what you think it does. The entire method 
>> can be optimized away to nothing after inlining at the call site by the JIT 
>> once the calls to it actually warm up enough, since it has no side effects 
>> and nothing is done with its return code.
> 
> Well, this won’t happen in OpenJDK because of the return value.

The return value "saves" you only as long as the method doesn't get inlined. 
After it is inlined, the fact that the return value isn't used allows the JIT 
to kill the entire code...

> 
>> 
>> - Your while(true) loop in the code and the output don't seem to match. 
>> Where is the rest of the output? Or is this output from an earlier code version 
>> without the while(true) loop?
>> 
>> - You are accumulating more and more contents into an array list. At some 
>> point the arraylist will be resized.
> 
> It’s pre-sized…
> 
> I’m going to play with this before commenting on the rest.
> 
> — Kirk
> 


Re: JVM crashed with memmap and unsafe operation for IPC

2017-02-16 Thread Gil Tene


On Thursday, February 16, 2017 at 7:14:23 AM UTC-8, Yunpeng Li wrote:
>
> Thanks a lot. It works now. NPE always find a way to give some surprise :P 
> even with GC help.
>

It's hard to call this sort of crash "even with GC help". It's more like 
"when you insist on bypassing GC safety".

You'll never (or should never) see a JVM crash with regular Java code. 
Seeing that would be an actual JVM bug.

Unsafe is NOT regular Java code. And it's hard to label the implications of 
using it more clearly than it already is. When you insist on running with 
scissors and on bypassing the safety that the GC based system provides, a 
JVM crash can't be blamed on anyone other than the user of Unsafe. If you 
want to use Unsafe, it becomes *your* responsibility to take all impacts 
into account, including interactions with your own code AND any of the 
runtime components you may or may not be aware of. These include the GC, 
the JIT compilers (including interesting interactions with on-the-fly 
optimization and de-optimization), locking mechanisms, hashcode tracking 
mechanisms, class unloading mechanisms, finalization mechanisms, the 
potential for escape analysis not materializing objects in memory where and 
when you think, and many many other things that already exist or may pop up 
in the future. "Unsafe" means "you are responsible for understanding what 
this would do in a system full of (literally) moving targets".
 

>
> Thanks a lot
> Yunpeng Li
>
> On Feb 16, 2017 1:04 PM, "Gil Tene" <g...@azul.com > wrote:
>
>> To prevent the unmapping, you can simply prevent 
>> the sun.nio.ch.DirectBuffer from going out of scope. Keep a static 
>> final sun.nio.ch.DirectBuffer around and assign it from the result 
>> of raf.getChannel().map(MapMode.READ_WRITE, 0, raf.length()), and only 
>> then get the address from it.
>>
>> That would work, technically. However, I feel obligated to point out the 
>> danger of assuming that the result 
>> of raf.getChannel().map(MapMode.READ_WRITE, 0, raf.length()) actually 
>> implements sun.nio.ch.DirectBuffer. It is only known to be an instance 
>> of java.nio.MappedByteBuffer, and the fact that an implementation-specific 
>> version of it returns an internally implemented subclass of 
>> MappedByteBuffer that also implements sun.nio.ch.DirectBuffer is a happy 
>> (actually sad, if you think about it) accident. One that can change at any 
>> moment, and with any version of the JDK. In fact, I'd be surprised if this 
>> code would work on JDK9...
>>
>> On Wednesday, February 15, 2017 at 8:09:51 PM UTC-8, Yunpeng Li wrote:
>>>
>>> Thanks for point out the root cause, could you share what I'd the 
>>> correct way to do it to make a pair of what you should and should not do 
>>> samples :P
>>>
>>> Thanks a lot
>>> Yunpeng Li
>>>
>>> On Feb 15, 2017 11:35 PM, "Gil Tene" <g...@azul.com> wrote:
>>>
>>>> This should crash. Every time. Reliably. It would be a JVM bug if it 
>>>> didn't.
>>>>
>>>> What your programs are doing is the logical equivalent to:
>>>>
>>>> T1:
>>>>   long *address = mmap(/* the file */);
>>>>   while (true) {
>>>> long value = *address;
>>>> count++;
>>>>   }
>>>>
>>>> T2:
>>>>   while (count < trigger) {
>>>> sleep(1);
>>>>   }
>>>>   munmap(address...);
>>>>
>>>>
>>>> Your pub and sub programs basically establish the unsafe addresses of 
>>>> temporarily mapped file regions that are dead immediately after being 
>>>> created. The programs will run long enough for the allocations caused by 
>>>> the printouts in your loops to trigger a couple of GCs, at which point the 
>>>> mapped file region gets cleaned up and properly unmapped. And the next 
>>>> access to those unmapped addresses crashes.
>>>>
>>>> On Wednesday, February 15, 2017 at 4:43:58 AM UTC-8, Yunpeng Li wrote:
>>>>>
>>>>> Hi there,
>>>>>  I'm trying to copycat using memmap and unsafe to build an IPC 
>>>>> ping pong test. Attached the source codes and crash reports, 
>>>>> unfortunately, 
>>>>> my code crashes JVM each time when ping pong reaches 73100(+- 10), Run 
>>>>> PublisherTest first then SubscriberTest in separated process. Sometimes 
>>>>> one 
>>>>> process crashes sometime both crash at the sametime.
>>>>>  

Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW].

2017-02-15 Thread Gil Tene
Don't know if this is the same bug. RHEL 7 kernel included fixes for this 
since some time in 2015.

While one of my first courses of action when I see a suspicious FUTEX_WAIT 
hang situation is still to check kernel versions to rule this out (since 
this bug has wasted us a bunch of time in the past), keep in mind that not 
all things stuck in FUTEX_WAIT are futex_wait kernel bugs. The most likely 
explanations are usually actual application logic bugs involving actual 
deadlock or starvation.

Does attaching and detaching from the process with gdb move it forward? 
[the original bug was missing the wakeup, and an attach/detach would "kick" 
the futex out of its slumber once]

On Wednesday, February 15, 2017 at 6:33:45 AM UTC-8, Will Foster wrote:
>
>
>
> On Tuesday, February 14, 2017 at 4:01:52 PM UTC, Allen Reese wrote:
>>
>> This bug report seems to have a way to reproduce it:
>> https://bugs.centos.org/view.php?id=8371
>>
>> Hope that helps.
>>
>> --Allen Reese
>>
>>
>
> I also see this on latest CentOS7.3 with Logstash, I've disabled huge 
> pages via 
> transparent_hugepage=never
>
> in grub.
>
> Here's what I get from strace against logstash (never fully comes up to 
> listen on TCP/5044)
>
> [root@host-01 ~]# strace -p 1292
> Process 1292 attached
> futex(0x7f80eff8a9d0, FUTEX_WAIT, 1312, NULL
>
>
> I am hitting this issue on Logstash 5.2.1-1 while trying to upgrade my 
> Ansible 
> playbooks <https://github.com/sadsfae/ansible-elk/issues/16> to the 
> latest ES versions.
>
>  
>
>>
>> --
>> *From:* Longchao Dong <donglo...@gmail.com>
>> *To:* mechanical-sympathy <mechanica...@googlegroups.com> 
>> *Sent:* Monday, February 13, 2017 1:55 AM
>> *Subject:* Re: Linux futex_wait() bug... [Yes. You read that right. 
>> UPDATE to LATEST PATCHES NOW].
>>
>> How to reproduce this issue ? Is it possible to show us the method ? I am 
>> also working on one strange pthread_cond_wait issue, but not sure if that 
>> one is related with this issue.
>>
>> On Wednesday, May 20, 2015 at 8:16:12 AM UTC+8, manis...@gmail.com wrote:
>>
>> I bumped on this error couple of months back when using CentOS 6.6 with 
>> 32 cores Dell server. After many days of debugging, I realized it to be a 
>> CentOS 6.6 bug and moved back to 6.5 and since then no such issues have 
>> been seen.
>> I am able to reproduce this issue in 15 minutes of heavy load on my multi 
>> threaded c  code.
>>
>> On Wednesday, May 13, 2015 at 3:37:32 PM UTC-7, Gil Tene wrote:
>>
>> We had this one bite us hard and scare the %$^! out of us, so I figured 
>> I'd share the fear...
>>
>> The linux futex_wait call has been broken for about a year (in upstream 
>> since 3.14, around Jan 2014), and has just recently been fixed (in upstream 
>> 3.18, around October 2014). More importantly this breakage seems to have 
>> been back ported into major distros (e.g. into RHEL 6.6 and its cousins, 
>> released in October 2014), and the fix for it has only recently been back 
>> ported (e.g. RHEL 6.6.z and cousins have the fix).
>>
>> The impact of this kernel bug is very simple: user processes can deadlock 
>> and hang in seemingly impossible situations. A futex wait call (and 
>> anything using a futex wait) can stay blocked forever, even though it had 
>> been properly woken up by someone. Thread.park() in Java may stay parked. 
>> Etc. If you are lucky you may also find soft lockup messages in your dmesg 
>> logs. If you are not that lucky (like us, for example), you'll spend a 
>> couple of months of someone's time trying to find the fault in your code, 
>> when there is nothing there to find. 
>>
>> This behavior seems to regularly appear in the wild on Haswell servers 
>> (all the machines where we have had customers hit it in the field and in 
>> labs have been Haswells), and since Haswell servers are basically what you get 
>> if you buy a new machine now, or run on the cool new amazon EC2/GCE/Azure 
>> stuff, you are bound to experience some interesting behavior. I don't know 
>> of anyone that will see this as a good thing for production systems. Except 
>> for maybe Netflix (maybe we should call this the linux fumonkey).
>>
>> The commit for the *fix* is here: 
>> https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0
>>
>> The commit explanation says that it fixes https://github.com/torvalds/linux/commit/b0c29f79ec

Re: JVM crashed with memmap and unsafe operation for IPC

2017-02-15 Thread Gil Tene
This should crash. Every time. Reliably. It would be a JVM bug if it didn't.

What your programs are doing is the logical equivalent to:

T1:
  long *address = mmap(/* the file */);
  while (true) {
long value = *address;
count++;
  }

T2:
  while (count < trigger) {
sleep(1);
  }
  munmap(address...);


Your pub and sub programs basically establish the unsafe addresses of 
temporarily mapped file regions that are dead immediately after being 
created. The programs will run long enough for the allocations caused by 
the printouts in your loops to trigger a couple of GCs, at which point the 
mapped file region gets cleaned up and properly unmapped. And the next 
access to those unmapped addresses crashes.
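
A minimal sketch of the safer alternative (file name and sizes are made up; the 
point is just that a strong, long-lived reference to the MappedByteBuffer keeps 
the mapping from being cleaned up, and going through the buffer's own accessors 
avoids raw addresses entirely):

  import java.io.IOException;
  import java.io.RandomAccessFile;
  import java.nio.MappedByteBuffer;
  import java.nio.channels.FileChannel;

  public class PinnedMapping {
    // Strong reference held for the life of the class: the mapping cannot be
    // GC'd and unmapped behind the back of code that is still using it.
    static final MappedByteBuffer MAPPED = map();

    private static MappedByteBuffer map() {
      try (RandomAccessFile raf = new RandomAccessFile("/tmp/ipc-region", "rw")) {
        raf.setLength(4096);
        return raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, raf.length());
      } catch (IOException e) {
        throw new ExceptionInInitializerError(e);
      }
    }

    static long readCounter() {
      return MAPPED.getLong(0);   // access through the buffer, not a raw address
    }
  }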

On Wednesday, February 15, 2017 at 4:43:58 AM UTC-8, Yunpeng Li wrote:
>
> Hi there,
>  I'm trying to copycat using memmap and unsafe to build an IPC ping 
> pong test. Attached the source codes and crash reports, unfortunately, my 
> code crashes JVM each time when ping pong reaches 73100(+- 10), Run 
> PublisherTest first, then SubscriberTest, in separate processes. Sometimes one 
> process crashes, sometimes both crash at the same time.
>  Can someone help to check what's the reason it crashes at the very 
> place. I ran the test case inside eclipse and on Mint Linux 16(yeah old 
> version) and jvm 8.
> And one more question on how the JVM works with the OS on memmap swapping: in 
> my case, the locks are more than 1 page away from the data. I follow the 
> reserve-modify-publish pattern from disruptor, first updating the data, then 
> publishing the lock with Unsafe putOrdered. How does the JVM make sure the lock 
> is visible after the data from another process, given that writing out dirty 
> pages is "random" by the OS?
>
> Thanks in advance.
> Yunpeng Li
>



Re: What happened to Managed Runtime Initiative?

2017-02-06 Thread Gil Tene


On Monday, February 6, 2017 at 12:45:52 PM UTC-8, Nikolay Tsankov wrote:
>
> ...
> Slightly off-topic (and the real reason I am writing this) but I found 
> this incredible piece of poetry hidden in a comment here, which 
> reads:
>
>
>> /*
>> * To batch, or not to batch, that is the question.
>> * Whether 'tis nobler in the mind to suffer
>> * The slings and arrows of long pause times,
>> * Or to take arms against a lousy virtual memory interface,
>> * And by opposing end them? To run: to scale.
>> */
>
>
> Who was this brave poet? :)
>

Michael Wolf. A true poet whose code would shine even in a heap full of 
garbage.
 



Re: Futex flood

2017-01-22 Thread Gil Tene
What OS distro and kernel version are you running? I ask because any time 
someone mentions "interesting" behavior with futex_wait, my knee-jerk 
reaction is to first rule out the issue described in this previous posting 
before spending too much time investigating other things. Ruling it out is 
quick and easy (make sure your kernel is not in the version ranges affected 
by the bug described), so well worth doing as a first step.

Assuming your issue is NOT the one linked to above, you have some analysis 
and digging to do. I'd say that when having 200 actually active JVMs run on 
80 cpus, seeing significant time spent in futex_wait is not necessarily a 
surprising situation. It [heavily] depends on what the applications 
are actually doing, and what their activity patterns look like. You'd need 
to study that and start drilling down for explanations from there. 
Speculating about the cause (and potential solutions) based on the info 
below would be premature.

On Sunday, January 22, 2017 at 8:42:54 PM UTC-8, Дмитрий Пеньков wrote:
>
> Hello everyone. I have following situation: multiple HotSpot JVMs with 
> multiple GCs' (ParallelGC) threads on multiple CPUs. I've already seen 
> recommendations to have min 2 cores per JVM, but in my case it's 80 CPU's 
> and about 200 JVMs. After working for about an hour, I have >95% of time 
> for futex syscall futex_wait. I tried to fix the number of parallel GC 
> threads in 4, but it didn't make effect. Mb GC changing can help? How to 
> solve this situation?
>



Re: Operation Reordering

2017-01-16 Thread Gil Tene
The compiler's reordering generally DOES NOT depend on the hardware. 
Optimizations that result in reordering generally occur well before 
instruction selection, and will happen in the same way for different 
hardware architectures. E.g. on X86, PowerPC, and ARM, HotSpot, gcc, and 
clang will all frequently reorder two stores, two loads, and any pair of 
loads and stores as long as there is nothing to explicitly prevent doing 
so. So forget about any sort of "hardware model" when thinking about the 
order you can expect.

The simple rule is "assume nothing". If a reordering is not specifically 
prohibited, assume it will happen. If you assume otherwise, you are likely 
to be unpleasantly surprised.

As for "stupid" reorderings, stupidity is in the eye of the beholder. 
"Surprising" sequence-breaking reordering that you may not see an immediate 
or obvious reason for may be beneficial in many ways.

E.g. take the following simple loop:

int[] a, b, c;
...
for (int i = 0; i < a.length; i++) {
a[i] = b[i] + c[i];
}

You can certainly expect (due to causality) loads from b[i] to occur before 
stores to a[i]. But is it reasonable to expect loads of b[i+1] to happen 
AFTER stores to a[i]? After all, that's the order of operations in the 
program, right? Would the JVM be "stupid" to reorder things such that some 
loads of b[i+1] occur before some stores to a[i]?

A simple optimization which most compilers will hopefully do to the above 
loop is to use vector operations (SSE, AVX, etc.) on processors capable of 
them, coupled with loop unrolling. E.g. in practice, the bulk of the loop 
will be executing on 8 slots at a time on modern AVX2 x86 cpus, and 
multiple such 8 slot operations could be in flight at the same time (due 
both to the compiler unrolling the loop and the processor aggressively 
doing OOOE even without the compiler unrolling stuff). The loads from one 
such operation are absolutely allowed to [and even likely to] occur before 
stores that occur previously in the instruction stream (yes, even on x86, 
but also because the compiler may jumble them any way it wants in the 
unrolling). There is nothing "stupid" about that, and we should all hope 
that both the compiler and the hardware will feel free to jumble that order 
to get the best speed for this loop... 

Even without vectorizing or loop unrolling by the compiler, the CPU is free 
to reorder things in many ways. E.g. if one operation misses in the cache 
and the next (in logical "i" sequence order) hits in the cache, there is no 
reason for order to be maintained between the earlier store and the 
subsequent load. Now imagine that in a processor that can juggle 72 in 
flight loads and 48 in flight stores at the same time (e.g. a Haswell 
core), and you will quickly realize that any expectation of order or 
sequence of memory access not explicitly required should be left at the 
door.
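
When a particular order between two plain writes actually matters to some outside 
observer (e.g. another thread or process reading a shared buffer), that order has 
to be requested explicitly rather than assumed. A hedged sketch of what 
"specifically prohibiting" a store-store reordering looks like (names and offsets 
are made up; Java 9+ VarHandle fences are assumed):

  import java.lang.invoke.VarHandle;
  import java.nio.ByteBuffer;

  public final class PublishSketch {
    private static final int SEQ_OFFSET  = 0;
    private static final int DATA_OFFSET = 8;
    private final ByteBuffer buf;   // assumed to be a shared/mapped buffer

    PublishSketch(ByteBuffer buf) { this.buf = buf; }

    void publish(long seq, long payload) {
      buf.putLong(DATA_OFFSET, payload);  // plain store: freely reorderable on its own
      VarHandle.storeStoreFence();        // explicitly prohibit store-store reordering here
      buf.putLong(SEQ_OFFSET, seq);       // the "publish" store now follows the payload store
    }
  }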


On Monday, January 16, 2017 at 4:49:39 PM UTC-5, Francesco Nigro wrote:
>
> This is indeed what I was expecting... While other archs (PowerPC, tons 
> of ARMs, and the legendary DEC Alpha) are allowed to be pretty creative in 
> matters of reordering... And that's the core of my question: how much can a 
> developer rely on the fact that a compiler (or the underlying HW) 
> will respect the memory access order that he has put into the code without 
> using any fences? Is the answer really "it depends on the 
> compiler/architecture"? Or do common high-level patterns exist that are 
> respected by "most" compilers/architectures?
>
> On Mon, 16 Jan 2017 at 22:14, Vitaly Davidovich wrote:
>
>> Depends on which hardware.  For instance, x86/64 is very specific about 
>> what memory operations can be reordered (for cacheable operations), and two 
>> stores aren't reordered.  The only reordering is stores followed by loads, 
>> where the load can appear to reorder with the preceding store.
>>
>> On Mon, Jan 16, 2017 at 4:02 PM Dave Cheney > > wrote:
>>
>>> Doesn't hardware already reorder memory writes along 64 byte boundaries? 
>>> They're called cache lines. 
>>>
>>>
>>> Dave
>>>
>>>
>>>
>>> On Tue, 17 Jan 2017, 05:35 Tavian Barnes >> > wrote:
>>>
 On Monday, 16 January 2017 12:38:01 UTC-5, Francesco Nigro wrote:
>
> I'm missing something for sure, because if it were true, any 
> (single-threaded) "protocol" that relies on the order of writes/loads 
> against (non-mapped) ByteBuffers to be fast (i.e. sequential writes rock :P) 
> risks not seeing the order respected unless using patterns that force the 
> compiler to block the re-ordering of such instructions (sci-fi 
> hypothesis).
>

 I don't think you're missing anything.  The JVM would be stupid to 
 reorder your sequential writes into random writes, but it's perfectly 
 within its right to do so for a single-threaded program according to the 
 JMM, as long as it respects data dependencies 

Re: Systematic Process to Reduce Linux OS Jitter

2016-12-26 Thread Gil Tene
One of the biggest reasons folks tend to stay away from the consumer CPUs 
in this space (like the i7-6950X you mentioned below) is the lack of ECC 
memory support. I really wish Intel provided ECC support in those chips, 
but they don't. And ECC is usually a must when driving hardware performance 
to the edge, especially in FinServ. The nightmare scenarios that happen 
when you aggressively choose your parts and push their performance to the 
edge (and even if you don't) with no ECC are very real. The soft-error 
correcting capabilities (ECC is usually SECDED) are crucial for keeping 
actually-wrong computation results from occurring on a regular basis from 
simple things like cosmic ray effects on your DRAM, and with the many-GB 
capacities we have in those servers, you won't be going long without a 
cosmic-ray-driven bit-flip in DRAM.

To move from hand waving to actual numbers for the notion that ECC is 
critical (and to hopefully scare the s* out of people running business 
stuff with no soft-error correcting hardware), this 2009 Google Study paper 
makes for a good read. It covers field data collected between 2006 and 
2008. Fast forward to section 3.1 if you are looking for some per-machine 
summary numbers. The simple takeaway summary is this: Even with ECC 
support, you have a ~1% chance of your machine experiencing an 
Uncorrectable Error (UE) once per year. But the chance of a machine 
encountering a Correctable Error (CE) at least once per year is somewhere 
in the 12-50% range, and the machines that do (which can be as many as 
half) will see those errors hundreds of times per year (so once every day 
or two).

One-liner summary: without hardware ECC support, random bits are probably 
flipping in your system memory, undetected, on a daily basis.

I believe that the current ECC-capable chips that would come close to the 
i7-6950X you mentioned below are the E5-1680 v4 (for 1-socket setups, peaks 
at 4.0GHz) and either the E5-2687W v4 or the E5-2697A v4 (peaking at 3.5 
and 3.6GHz respectively, but you'd probably need to carefully avoid using 
one core on the 2697 to get there). The E3 series (e.g. E3-1280 v5) usually 
gets the latest cores first, but their core counts tend to be tiny 
(4 physical cores compared to 8-12 in the others listed above).

On Monday, December 26, 2016 at 3:05:31 AM UTC-8, Lex Barringer wrote:
>
> I realize this post is a little late to the party but it's good for people 
> looking at tweaking their hardware and software for high frequency binary 
> options trading, including crypto-currency on the various exchanges. 
>
> As a note to all people seeking to create ultra low latency systems, not 
> just network components / accessories. The clock rate (clock speed) of the 
> CPU, its multipliers vs. the memory multipliers, voltages, clock speeds of 
> the modules, as well as the CAS timing (and other associated memory 
> timing parameters) can have a huge impact on how fast your overall system 
> performance is, let alone its actual reaction speed. One of the most 
> important areas is the ratio of the multipliers in the system itself. 
>
> While many operations are handled by the NIC in hardware and dedicated 
> FIFO buffers of said devices, it still is necessary to have a tuned system 
> hardware wise to give you the best performance, with the lowest overhead.
>
> Depending on if you use Intel Xeon or the newer 68 or 69 series Intel i7 
> Extreme series, you may get better trading performance using a consumer 
> grade (non Xeon) processor. I recommend using the Intel i7 6950X, it's 
> comparable in latency, it can handle
> memory speeds well in excess of 2400 MHz (2.4 GHz). The key here is to 
> find memory and motherboards that can handle DDR4
> 3200 MHz memory at CAS 14. You can get faster memory with the same CAS, if 
> your motherboard supports it, then do so at this time.
>
> I've used the following configuration:
>
> Asus E-99 WS workstation board
> Intel i7-6950X CPU
> 64 GiB of CAS 14 @ 3200 MHz RAM
> 1 Intel P3608 4 TB drive (it's the faster, bigger brother to the consumer 
> Intel 750 NVMe SSD in a PCIe slot)
> 1 NewWave Design & Verification V5022 Quad Channel 10 Gigabit Ethernet 
> FPGA Server Card
>
> The NIC on the motherboard is used for PXE Netboot and once the computer 
> is booted, it 

Re: Guard page on Linux/GCC?

2016-12-23 Thread Gil Tene
This is a common technique, and (at least) JVMs tend to use it. I.e. 
placing a PROT_NONE page at the end of the stack and lifting protection in 
the SEGV handler. You need to make sure the signal handler uses an 
alternate stack [by specifying it with sigaction(2) and sigaltstack(2)], or 
this will [obviously] not work.
 
There are all sorts of variations to how this stuff is done. Including 
"stack banging" (touching the stack e.g. 8KB ahead in calls) techniques 
that can sometimes avoid an alternate stack, and variations on how one 
reserves stack virtual space to make sure the stack can be grown (note that 
the protection scheme is only "safe" as long as stack frames do not exceed 
the protected space size). There are also variations that do stack-chaining 
(where the stack is not contiguous in memory and extends somewhere else 
when needed). But the simplest seems to be the protect+extend-in-SEGV.

On Friday, December 23, 2016 at 6:28:10 AM UTC-8, Alex Snaps wrote:
>
> Hey all, 
> Not quite sure this is the best place for this, but I thought some here 
> may have ideas/experience with this:
>
> I'm trying to "gracefully" deal with stack overflows, by having a signal 
> handler (with its own small stack) test against the siginfo_t's si_code 
> on SIGSEGV. On OS X/clang I get a SEGV_ACCERR, as the main thread gets a 
> (i.e. one) guard page below the stack.
> To my surprise though, I couldn't find a way to achieve the same on Linux 
> (and gcc 5.4). I can't think of any other approach, and I don't want to 
> "artificially" spawn another thread with such a guard page in place to 
> address the issue. 
> All my experiments with trying to mmap (PROT_NONE, MAP_FIXED) such a page 
> myself didn't succeed either.
>
> Has anyone ever dealt with such a case?
> Thanks!
> Alex
> -- 
> Alex Snaps
> Twitter: @alexsnaps
>



Re: Does percentile metrics follow the rules of summations?

2016-12-21 Thread Gil Tene
The right way to deal with percentiles (especially when it comes to 
latency) is to assume nothing more than what it says on the label.

The right way to read "99%'ile latency of a" is "1 in a 100 occurrences 
of 'a' took longer than this. And we have no idea how long". That is the 
only information captured by that metric. It can be used to roughly deduce 
"what is the likelihood that a will take longer than that?". But deducing 
other stuff from it usually simply doesn't work.

Specifically things for which projections don't work include:
(A) the likelihoods of higher or lower percentiles of the same metric a
(B) the likelihood of similar values in neighboring metrics (b, c, or d)
(C) the likelihood of a certain percentile of a composite operation (a + b + 
c + d in your example) including the same percentile of a

The reasons for A usually have to do with the sad fact that latency 
distributions are usually strongly multi-modal, and tend to not exhibit any 
form of normal distribution. A given percentile means what it means and 
nothing more, and projecting from one percentile measurement to another 
(unmeasured but extrapolated) is usually a silly act of wishful thinking. 
No amount of wishing that the "shape" of latency distribution was roughly 
known (and hopefully something close to a normal bell curve) will make it 
so. Not even close.

The reasons for B should be obvious.

The reasons for C usually have to do with the fact that the things that 
shape latency distributions in multiple related metrics (e.g. a, b, c, d) 
often exhibit correlation or anti-correlation.

A common cause for high correlations in higher percentiles is that things 
being measured may be commonly impacted by infrastructure or system 
resource artifacts that dominate the causes for their higher latencies. 
E.g. if a, b, and c are running on the same system and that system 
experiences some sort of momentery "glitch" (e.g. a periodic internal book 
keeping operation), their higher percentiles may be highly correlated. 
Similarly when momentary concentrations and spikes in arrival rates cause 
higher latencies due to queue buildups, and similarly when the cause of the 
longer latency is the complexity or size of the specific operation.

Anti-correlation is often seen when the occurrence of a higher latency in 
one component makes the likelihood of a higher latency in another component 
in the same sequence less likely than it normally would be. The causes for 
anti-correlation can vary widely, but one common example I see is when the 
things performing a, b, c, d utilize some cached state services, and high 
latencies are dominated by "misses" in those caches. In systems that work 
and behave like that, it is common to see one of the steps effectively 
"constructively prefetch" state for the others, making the likelihood off a 
high-opercentile-causing "miss" in the cache on "a" be much higher than a 
similar miss in b, c, or d. This "constructive pre-fetching" effect occurs 
naturally with all sorts of caches, from memcache to disk and networked 
storage system caches to OS file caches to CPU caches.  
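
As a toy illustration of why the summation doesn't hold (hedged: made-up 
numbers and independent rare stalls, nothing from the actual system being 
discussed), consider two steps that each stall ~0.7% of the time:

  import java.util.Arrays;
  import java.util.Random;

  public class PercentileSumSketch {
    static double p99(double[] v) {
      double[] s = v.clone();
      Arrays.sort(s);
      return s[(int) Math.floor(s.length * 0.99)];
    }

    public static void main(String[] args) {
      Random r = new Random(42);
      int n = 100_000;
      double[] a = new double[n], b = new double[n], sum = new double[n];
      for (int i = 0; i < n; i++) {
        // Each step independently stalls ~0.7% of the time (fast: 1ms, stalled: 50ms).
        a[i] = (r.nextDouble() < 0.007) ? 50.0 : 1.0;
        b[i] = (r.nextDouble() < 0.007) ? 50.0 : 1.0;
        sum[i] = a[i] + b[i];
      }
      // p99(a) and p99(b) are each ~1ms, so their sum is ~2ms,
      // while p99(a + b) lands around ~51ms: the summation rule does not hold.
      System.out.printf("p99(a) + p99(b) = %.1f%n", p99(a) + p99(b));
      System.out.printf("p99(a + b)      = %.1f%n", p99(sum));
    }
  }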

On Wednesday, December 21, 2016 at 2:53:45 AM UTC-8, Gaurav Abbi wrote:
>
> Hi,
> We are collecting certain metrics using (*Graphite + Grafana*) and use them 
> as a tool to monitor system health and performance. 
>
> For one of the latency metric, we get the total time as well as the 
> latencies for all the sub-components it is composed of.
>
> We display 99th percentile for all the values. However, if we sum up the 
> 99th percentiles for latencies of sub-components, they do not equate to the 
> 99th percentile of the total time.
>
> Essentially it comes down to whether percentiles follow summation rules, 
> i.e.
>
> if 
> *a + b + c + d = s*
>
> then,
> *p99(a) + p99(b) + p99(c) + p99(d) = p99(s) ?*
>
> Will this hold?
>



Re: Does percentile metrics follow the rules of summations?

2016-12-21 Thread Gil Tene
Yup. ***IF***. And in the real world they never are. Not even close.

On Wednesday, December 21, 2016 at 3:12:12 AM UTC-8, Avi Kivity wrote:
>
> Right; if the distributions are completely random, then 
>
>   p99.99(a then b) = p99(a) + p99(b) 
>
>
>
> On 12/21/2016 01:09 PM, Greg Young wrote: 
> > no because the 99th percentiles do not necessarily happen at the same 
> time. 
> > 
> > On Wed, Dec 21, 2016 at 10:53 AM, Gaurav Abbi  > wrote: 
> >> Hi, 
> >> We are collecting certain metrics using (Graphite + Grafana) use them 
> as a 
> >> tool to monitor system health and performance. 
> >> 
> >> For one of the latency metric, we get the total time as well as the 
> >> latencies for all the sub-components it is composed of. 
> >> 
> >> We display 99th percentile for all the values. However, if we sum up 
> the 
> >> 99th percentiles for latencies of sub-components, they do not equate to 
> the 
> >> 99th percentile of the total time. 
> >> 
> >> Essentially it comes down if the percentiles can follow summation 
> rules. 
> >> i.e. 
> >> 
> >> if 
> >> a + b + c + d = s 
> >> 
> >> then, 
> >> p99(a) + p99(b) + p99(c) + p99(d) = p99(s) ? 
> >> 
> >> Will this hold? 
> >> 


Re: private final static optimization

2016-12-19 Thread Gil Tene
As noted before, static finals can be treated as true constants by the JIT 
(after class initialization). And all JITs seem to do a good job with that. 
Code guarded by a static final boolean is effectively free (when the 
boolean is off), except for it's potential impact on inlining decisions in 
some JITs (which are sometimes driven by byte code counts on the target 
methods).

The Zing JVM now does what we call "Truly Final" optimization in our latest 
releases with the Falcon JIT compiler. These optimizations apply to any 
instance field declared final (private or not doesn't matter), and they 
allow the JIT to do some important optimizations that were not previously 
possible (e.g. hoist final field lookups out of loops, and along with them 
eliminate dependent operations from the loops, like range checks on final 
arrays. See cool discussion of this "problem" here 
. 
Instance fields that have their finality overridden (e.g. via reflection or 
via Unsafe) will have this optimization dynamically disabled before the 
override occurs.

On Monday, December 19, 2016 at 12:58:07 AM UTC-8, Richard Warburton wrote:
>
> Hi,
>
> You're right that serialization is called out, but I always understood 
>> that as Java serialization, and not some other type of serialization.  
>> You're right that in practice it's hard to know for the JIT and the 
>> practical implication is everyone is allowed to do it, but again, on paper 
>> I thought it refers only to builtin serialization.  This is a good example 
>> of a feature that's rarely (relatively) used but is an optimization fence 
>> for all code.
>>
>> As for deopt tracking, I believe the current plan is to track this at the 
>> class level, not a field - I think this makes it more palatable and doesn't 
>> require using modules.  But we'll see how this pans out.
>>
>
> I'll be intrigued to see what happens on this front, specifically whether 
> the de-optimisation path traps Unsafe usage. BigInteger is a good example 
> of a JDK class that doesn't use reflection to set a final field, but uses 
> Unsafe, albeit only in serialization.
>
> regards,
>
>   Richard Warburton
>
>   http://insightfullogic.com
>   @RichardWarburto 
>



Re: Minimum realistic GC time for G1 collector on 10GB

2016-12-13 Thread Gil Tene


On Monday, December 12, 2016 at 9:30:21 AM UTC-5, Ivan Kelly wrote:
>
> Thanks all for the suggestions. I'll certainly check the safepoint 
> stuff. I suspect that a lot of EPA isn't happening where it could also 
> due to stuff being pushed through disruptor. Even if it is happening, 
> or I can make it happen, and also clear up all the low hanging fruit, 
> if I can't get below 100ms pauses on the 10G heap, then it would all 
> be for nothing. 
>
> Censum is interesting, I'll take a look. I have flight recordings 
> right now. Is it possible to find the time to safepoint in that? 
>
> > All that said, the copy cost and reference processing of G1 for a heap 
> as 
> > busy / sized as yours is unlikely to get under 10ms, as Chris mentioned 
> Gil 
> > will now be summoned ;-). 
> Yes, really what I'm looking for is someone to tell me "no, you're 
> nuts, not possible" so I can justify going down the rearchitecture 
> route. Unfortunately Zing isn't an option, since it would double the 
> cost of our product. 
>

Did you actually check on the cost of OEM'ing Zing with your application? 
Or are you just assuming it will be expensive?

You might be surprised. Somehow Zing seems to be getting a bad rap, with 
people assuming it must be expensive. Maybe because the value seems "too 
high to be sold cheaply". Don't confuse high value with high price. Yes, 
Zing allows some very profitable high margin businesses to make even more 
money (think trading systems). But it is even more widely used in very low 
margin businesses (think online consumer retailers) with a reputation for 
penny-pinching.

Please note that I'm not making any claims that you can't find some other 
way to get to your target behaviors. I'm a great believer in the potential 
of duct tape in the hands of talented engineers 

.
 

>
> As I said in the original email though, the low latency part of the 
> application can probably fit in 100MB or less. The application takes 
> netlink notification, does some processing and caches the result for 
> invalidation later. The netlink notification and processing is small 
> but needs to be fast. The cache can pause, as long as entries for it 
> can be queued. It's a prime candidate for being moved to another 
> process. 
>
> Which brings me to another question? What are good java shared mem IPC 
> queues? Something like cronicle-queue, but without the persistence. 
> I'd prefer to not role my own. The road to hell has enough paving 
> stones. 
>
> -Ivan 
>



Re: Minimum realistic GC time for G1 collector on 10GB

2016-12-13 Thread Gil Tene
Wow. I forgot I wrote that tool. Glad to see it helped you in your analysis 
Ivan.

And Kirk, [most] of my GC-related tools are not designed to break OpenJDK's 
GCs. The GC-related tools I build are usually designed to exhibit normal GC 
behaviors more quickly in order to allow practical test-based observations 
(trying hard to not exacerbate the GC stall lengths, and only make them 
happen more frequently). This is true even for stressor tools like 
HeapFragger.

MinorGC is specifically designed to demonstrate OpenJDK's collectors BEST 
possible full lifecycle pause behaviors, not their "worst" or "broken" 
modes (which would have much worse pause times). Looking at the discussion 
that prompted the tool's writing 
(https://groups.google.com/d/msg/mechanical-sympathy/frVwfX8g6Gw/PWqDLoXTxdYJ), 
there is no mention of Zing or a comparison to its behavior. The original 
reason I wrote and posted MinorGC was to address questions and 
misconceptions about newgen pause times that were often projecting from 
misleading short runs. Specifically, people would commonly run short (e.g. 
<1hr) tests and observe that newgen pause time were short (when oldgen was 
still at low occupancy), falsely believing that those results reflect the 
long term or typical newgen pause time behavior for their application. 
MinorGC lets you fast forward through [what would otherwise take] days of 
testing, to see the full lifecycle range of newgen pause times, which [for 
many collectors] is cyclical in nature. When used as a java agent it lets 
you see what your actual application pause times are going to look like (it 
makes them happen more rapidly, but does not make them bigger) without 
having to wait days for the results. I think the README description is 
honest, and the tool is clearly useful.

That the tool can also be used to quickly compare actual GC pause times 
between OpenJDK and Zing is just a nice bonus... ;-) It was not what 
it was written for.

On Tuesday, December 13, 2016 at 4:15:53 AM UTC-5, Ivan Kelly wrote:
>
> > Do know that Mr. G tools are designed to break OpenJDK’s garbage 
> collectors 
> > :-)  to demonstrate how Zing is more resilient. To be fair, Zing is more 
> > resilient but… 
> I've been playing with MinorGC for a few hours, and it's lead me to 
> the conclusion that I just need to make my heap smaller if I want sub 
> 10ms pauses. Which is fine. Exactly the kind of guidance I was looking 
> for :) 
>
> -Ivan 
>



Re: Continuous performance monitoring with Java FlightRecorder (JFR)?

2016-12-02 Thread Gil Tene
I agree with the need/wish for a common way to get such information from 
JVMs. A common & standard way for JVMs and OpenJDK to provide event tracing 
as well as low-runtime-cost JVM instrumentation details (for information not 
currently covered by JVMTI and not cheaply [enough] gleaned via BCI) would 
be very useful but is not (yet?) part of the platform. JFR is very capable 
but is custom to Oracle. Zing has similar capabilities (and in some cases 
even more detailed information, such as viewing 
down-to-the-generated-machine-instruction hotness and stack traces) but 
those capabilities are also custom (to Zing). And IBM's J9 has its own 
very detailed instrumentation capabilities. 

When looking at overlaps with APM and profiling tools, BCI, JVMTI, and 
other standard and semi-standard instrumentation levels do provide some 
overlapping capabilities for most things, and often at very affordable and 
practical runtime costs [that's why these production-time API tools are so 
popular]. But there are certainly some JVM-based instrumentation 
capabilities that current JVMs (HotSpot, Zing, J9) can fundamentally do 
"better" than the spec'ed standard things that profilers and APMs have 
access to, and/or in much cheaper ways, leaving their information in the 
custom jvm tooling arena for now. Specific examples of this include things 
like (A) tracking and examining wait times on monitor and j.u.c lock 
instances (for which the JVM has first hand knowledge that is better and 
more useful than the information that can be gleaned by external tools in a 
clean/cheap-enough way); (B) the ability to use tick-based stack tracing 
outside of safepoints to profile code behavior. This tick-based stack 
tracing [outside of safepoints] is important not only because it is "cheap 
enough" to provide practical profiles with near-zero runtime overhead, but 
because it is accurate enough to make that information useful (as opposed 
to tick-based at-safepoint or at-BCI-instruimented-point instrumentations, 
which will often skew profiles dramatically). And (C) the ability to track 
and report on very useful heap content stats (e.g. by-type heap occupancy 
and by-type occupancy velocities) that the JVM can fundamentally measure 
cheaply [as a nearly-free part of GC scanning] but is not available to 
common tools via defined interfaces or log formats [forcing tools to 
re-instrument the heap to extract this data if they want it, often at a 
prohibitively high runtime cost]. Knowledge of generated code behavior, 
including information about compilations and deoptimizations, as well as 
the ability to express stack traces in terms of generated code locations 
(which makes stack traces both much cheaper and much more accurate in the 
non-heizengerg-ing sense) is also an area for which JVMTI capabilities 
could be greatly extended.
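
For contrast with (A), here is roughly what the standard-interface route 
looks like from the outside today: a minimal sketch (the class name and the 
polling interval are arbitrary) using ThreadMXBean contention monitoring, 
which gives per-thread cumulative blocked/waited times rather than 
per-lock-instance detail:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch of the standard-API view of lock waiting that external tools can get:
    // per-thread cumulative blocked/waited totals, polled periodically. Note that
    // this is thread-granular, not per-monitor or per-j.u.c-lock-instance.
    public class LockWaitSnapshot {
        public static void main(String[] args) throws InterruptedException {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            if (threads.isThreadContentionMonitoringSupported()) {
                threads.setThreadContentionMonitoringEnabled(true); // otherwise times report as -1
            }
            while (true) {
                Thread.sleep(10_000);
                for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
                    if (info == null) continue;   // thread may have exited since the id snapshot
                    System.out.printf("%-32s blocked=%dms (x%d) waited=%dms (x%d)%n",
                            info.getThreadName(),
                            info.getBlockedTime(), info.getBlockedCount(),
                            info.getWaitedTime(), info.getWaitedCount());
                }
            }
        }
    }

It works, and it's cheap enough to poll, but it tells you which threads 
waited, not which lock instances they waited on or for how long per 
instance, which is exactly the sort of gap the JVM could fill first-hand.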

But even without those we-could-do-better capabilities, monitoring JVM 
behaviors in production seems to be doing pretty well. It can always be 
better, of course, but the state of these production-monitoring tools for 
Java is generally well ahead of what is available in almost all other 
languages and/or runtimes.

On Friday, December 2, 2016 at 2:08:37 AM UTC-8, zeo...@gmail.com wrote:
>
> Hi Gil,
>
> Thanks for the heads up and price references! I was certainly wary of the 
> license aspect even though the project I am planning is for open source 
> development.
>
> Would there be anything of similar capability in openjdk? Looking at the 
> openjdk src repo, it seems that there have been some more JEP 167 (
> http://openjdk.java.net/jeps/167) oriented changes introduced recently 
> into jdk9. The event tracing logic in the jdk8 tracing code seems to 
> already cover the core feature set of native (as opposed to BCI) JFR: 
> stacktrace samples, monitor waits, alloc/gc events, compiler 
> events. Judging by the small volume of changes between 2013 and 2015, I am 
> guessing the tracing feature is not used much in openjdk7/8 and might be of 
> uncertain reliability however (e.g. see this: 
> https://bugs.openjdk.java.net/browse/JDK-8145788). Maybe I should look 
> more into using those openjdk tracing capabilities instead of JFR for jdk9. 
> The runtime configurability and resource management (like JFR's 
> buffer/chunk/checkpoint) isn't quite there yet and I might need to write 
> some hacks to enable output destination that is not stdout/stderr.
>
> As for commercial APMs out there, no doubt they will have lots of custom 
> BCI to cover app server use cases. I wonder how well (low overhead and high 
> accuracy) they do on the jvm native instrumentation side (stack sample, 
> alloc events, monitor wait). Same goes for profilers like Yourkit/JProfiler.
>
> Zee
>
> On Thursday, December 1, 2016 at 4:06:59 PM UTC-8, Gil Tene wrote:
>>
>> Virtually all the benefits of mo

Re: Continuous performance monitoring with Java FlightRecorder (JFR)?

2016-12-01 Thread Gil Tene
Virtually all the benefits of monitoring come in production environments 
(by definition, I think), and that's probably why you don't see this 
scenario (as) commonly used with JFR.

Basically, using JFR for production [currently, at least] requires a 
commercial Java SE Advanced license. How/if this is enforced technically is 
irrelevant; the click-through license that allows you to use it for free 
is specifically restricted to non-production use. This is spelled out in 
the Oracle Binary Code License Agreement for the Java SE Platform Products 
and JavaFX 
(http://www.oracle.com/technetwork/java/javase/terms/license/index.html), 
under SUPPLEMENTAL LICENSE TERMS... A. COMMERCIAL FEATURES. and B. SOFTWARE 
INTERNAL USE FOR DEVELOPMENT LICENSE GRANT. And since JFR is clearly marked 
as a "Commercial Feature" (you literally have to use the 
-XX:+UnlockCommercialFeatures -XX:+FlightRecorder flags to use it), it's 
impossible to claim ignorance of this fact. See e.g. 
https://www.infoq.com/news/2013/10/misson-control-flight-recorder, 
http://www.adam-bien.com/roller/abien/entry/java_mission_control_development_pricing,
 
and 
https://docs.oracle.com/javacomponents/jmc-5-4/jfr-runtime-guide/run.htm#JFRUH164
 
for some discussion and mentions around it. 

So while JFR can and may do some cool (and even semi-unique) things for 
production monitoring, you'd have to clear the commercial pricing terms 
first, and those seem pretty steep, as in a list price of $5000 per 2 x86 
cores according to the Oracle price list 
(http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf), 
which would equate to e.g. $40K per instance for EC2 m3.2xlarge instances, 
and $80K-160K per server for modern 2 socket servers (those with "only" 
16-32 cores). While I'm sure the actual production pricing could end up 
much lower once purchasing departments finish hand-wrestling with Oracle's 
sales folks, it would probably still be way more than other commercial 
monitoring and JVM-knowledgeable APM solutions that are much more 
feature-rich and focused would cost (e.g. Dynatrace, AppDynamics, NewRelic, 
etc.), all of which list at a tiny fraction of the Oracle Java SE Advanced 
list price levels (and are massively used in production systems).

On Thursday, December 1, 2016 at 12:18:18 PM UTC-8, zeo...@gmail.com wrote:
>
> Hi all,
>
> Does anyone know a good way to do continuous performance monitoring using 
> JFR (JDK8)? I am interested in using this on some apache data pipeline 
> projects (Spark, Flink etc). I have used JFR for perf profiling with fixed 
> duration before. Continuous monitoring would be quite different.
>
> The ideal scenario is to set up JFR to write to UDP  destinations 
> with configurable update frequencies. Obviously that is not supported by 
> JFR as it stands today. So I tried setting up continuous JFR with 
> maxage=30s and running JFR.dump every 30s; to my surprise, the time range 
> covered by the dumped jfr files does NOT correspond to the maxage parameter 
> I gave. Instead the time ranges (FlightRecordingLoader.loadFile(new 
> File("xyz.jfr")).timeRange) from successive JFR.dump can be overlapping 
> and much bigger than maxage.
>
> So a couple of questions for those experienced users of JFR:
>
> -- What exactly is the semantics of maxage?
> I imagined that maxage has 2 effects: discarding events older than maxage 
> and aggregating certain metrics (like stacktrace sample counts) over the 
> time interval. It appears my understanding was way off.
>
> -- How does the event pool/buffer under consideration for next JFR.dump 
> get reset?
> I was hoping every JFR.dump would reset the pool and allow the next 
> JFR.dump to output non-overlapping time range. I was also wrong here.
>
> -- Is there any way to do continuous perf monitoring with JFR with a 
> configured aggregation and output interval?
> One thing I did notice is that JFR would periodically (default seems 60s) 
> flush to chunk files and then rotate chunk files according to maxchunksize 
> param. I could use that mechanism to inotify-watch the repository dir and 
> just read and parse the chunk files. However there are a few things missing 
> if I wanted to go down this route: there is no way to set "maxchunkage" 
> (I would like to be able to set one as low as 10s), I will need to write a 
> custom chunk file parser, and I am not sure if chunk files have all the 
> symbols to resolve the typeids.
>
> Thanks!
>
>



Re: JVM heap size?

2016-11-28 Thread Gil Tene
A couple of notes:

1. There are many JVM implementations (Oracle HotSpot, OpenJDK, Zing, J9, 
...) and many collectors to choose from (G1, CMS, C4, ParallelGC, Balanced, 
GenCon, ...) and the answers may vary between them. The choice of JVM and 
collector is often driven by the application need, not the other way around.

2. The implications of breaking an application into lots of small JVMs can 
vary dramatically. For some systems (e.g. completely stateless execution 
with no benefit from in-process retained or cached information and no 
accumulation of in-process data) spreading work across JVMs can be 
architecturally easy. For some systems (e.g. data and execution can both be 
effectively sharded along dimensions that allow scalable distributed 
execution with no [significant] growth in state or work), good architecture 
can allow arbitrary lateral scaling across processes (e.g. Cassandra, 
Solr/Elastic, Kafka can be good examples of this design concept on JVMs). 
For some systems (e.g. systems where execution efficiency and speed depend 
on locally accumulated in-process state, such as caches, in-memory DBs, 
etc.) scaling across processes can be done well as long as each process can 
effectively hold all or much of the required state in-process (leading to 
minimal practical process sizes, and potentially lots of replicated memory 
for scaling). And for some systems, scaling across an arbitrary number of 
processes is not practical [and a limited number is used, primarily for 
redundancy purposes]. Which of these groupings your system falls in will 
strongly affect the "right" answer in your case.

As a general statement, the size at which you *cap* your heap size on JVMs 
that use collectors that increase their pause times with heap size 
(HotSpot's ParallelGC, CMS, and G1 all fall into that category, but e.g. 
Zing's C4 does not) will be primarily driven by your response time 
requirements. The answers (to your question) will depend on what your 
application is doing, on what its expected/required response time behavior 
is, and on what collectors (and JVMs) you are able to apply. 

E.g. if your application is doing batch gene sequencing analysis, you are 
concerned purely with sequencing throughput measured over a day, and 
freezing for 20 minutes several times over the period of the run is fine as 
long as it keeps going after, then you can happily run 2TB heaps on both 
CMS and G1 with HotSpot. On the other hand, if you want to keep your actual 
99.9%'ile response times during the peak 10 minutes of your day below 
2msec, you may find it hard to do with *any* HotSpot collector, with any 
heap size larger than about 50MB. If you live in the wide range of response 
time behavior needs between those two "extremes", and e.g. you want to keep 
your 99.9%'ile at peak below a much more lax number (e.g. 200msec), you may 
find some heap sizes that would work for that with HotSpot and CMS or G1, 
while larger or smaller ones won't. With C4, you can generally keep your 
percentiles happy anywhere in the 1GB...2TB range (since JVM pauses do not 
grow or shrink with heap size), and I can't speak much from experience for 
the IBM collectors.

We certainly see people commonly using multi-GB JVMs to run JVM-based 
infrastructure pieces (Cassandra, Kafka, Solr, Elastic, Cache engines, 
servlet containers, etc.). When inherent multi-100s-of-msec pauses are 
acceptable and 99.9%'iles of 200msec+ are ok, we see people using 2-8GB 
heap sizes quite often and living with the consequences on HotSpot. And 
when people want to avoid those sorts of glitches, stalls or 
circuit-breaker-triggers, we'll see them either opting for Zing (which will 
both eliminate the glitches and take the cap off of the heap size they can 
use if they wish), or spending a few man years rewriting parts of their 
application to better live around HotSpot collector limitations and reduce 
the occurrences or magnitude of glitches.
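
As a crude way to check which of those buckets you actually live in, a 
simple "stall observer" thread (in the spirit of a hiccup meter) can be 
dropped into a load test: sleep for a short interval and record how much 
longer the sleep actually took. The sketch below is hypothetical -- the 
class name, the 1 msec interval, and the 10 second reporting window are 
arbitrary -- but sustained overshoots at peak load are what will end up in 
your 99.9%'ile, no matter which collector or heap cap produced them:

    // Hypothetical sketch of a minimal stall observer: a thread that sleeps ~1ms
    // and records how much longer than 1ms the sleep actually took. Large
    // overshoots correspond to process-wide stalls (GC pauses included).
    public class StallObserverSketch implements Runnable {
        private static final long INTERVAL_NS = 1_000_000L;   // 1 msec

        @Override
        public void run() {
            long worstMs = 0;
            long nextReport = System.currentTimeMillis() + 10_000;
            while (true) {
                long start = System.nanoTime();
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    return;
                }
                long overshootMs = Math.max(0, (System.nanoTime() - start - INTERVAL_NS) / 1_000_000L);
                if (overshootMs > worstMs) worstMs = overshootMs;
                if (System.currentTimeMillis() >= nextReport) {
                    System.out.println("worst observed stall in last 10s: " + worstMs + " ms");
                    worstMs = 0;
                    nextReport += 10_000;
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread observer = new Thread(new StallObserverSketch(), "stall-observer");
            observer.setDaemon(true);
            observer.start();
            // ... start your actual workload here; for the demo, just stay alive:
            Thread.currentThread().join();
        }
    }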

On Sunday, November 27, 2016 at 5:04:47 AM UTC-8, yahim stnsc wrote:
>
> Hi all,
>
> Does anyone have any dimensioning rules for the maximum heap size that 
> can be managed by CMS/G1 with a certain FullGC time?
>
> My application works with relatively small objects.
>
> I keep hearing of people running multiple JVMs on the same computer with 2 GB 
> of heap or 8GB of heap but cannot find any clear data.
>
> Regards
>



Re: JVM heap size?

2016-11-28 Thread Gil Tene
Avi, I disagree; that would be a very narrow part of what we seem to 
discuss here. Discussing how to run systems on JVMs is no more out-of-scope 
for this list than discussing how to run systems on Linux. And discussing 
the proclivities of systems that work on JVMs and how to deal with GC there 
is no more out of scope than discussing how to deal with Linux settings for 
NUMA control.

In my view, this list is not dedicated to "writing software that is coded in 
a way that is efficient for the hardware to execute" (although that topic is 
well within what the list seems to deal with). If you were to use that 
sentence structure, I'd say I'd change it to Building (as opposed to just 
writing) Systems (as opposed to just software) In A Way (as opposed to just 
coding) that Executes Well (as opposed to just efficiently) on Machines (as 
opposed to just the hardware). So "Building systems in a way that executes 
well on machines". Or "executing systems on machines, well."

But even then, that won't capture the interesting "this curious thing seems 
to happen..." sort of discussions that we seem to have here, which have 
little to do with executing stuff well, and more to do with understanding 
how the machine works.

So more simply: Mechanical Sympathy.

And "How do I run certain applications well on JVMs" is well within that.

On Sunday, November 27, 2016 at 5:40:13 AM UTC-8, Avi Kivity wrote:
>
> Suggest you try asking on a Java help forum, rather than this list, which 
> is devoted to writing software that is coded in a way that is efficient for 
> the hardware to execute.
>
> On 11/27/2016 03:04 PM, yahim stnsc wrote:
>
> Hi all, 
>
> Does anyone have any dimensioning rules  for the maximum heap size that 
> can be managed by CMS/G1 wiith a certain FullGC time?
>
> My application works with relatively small objects.
>
> I keep hearing people of running multiple JVMs on same computer with 2 GB 
> of heap or 8GB of heap but cannot find any clear data.
>
> Regards
