Re: purpose of an LFENCE

2019-10-10 Thread Francesco Nigro
The RDTSC measurements that surround the instructions being measured need to be
serialized together with them, regardless of whether those instructions are temporal
or not. They don't need to wait for the store buffer to be drained, so no expensive
MFENCE is needed just to ensure, locally, that they are not reordered with the
measured ones.
That's what I've understood, at least; hope it helps.

On Thu, Oct 10, 2019, 18:18 Peter Veentjer  wrote:

>
>
> On Wednesday, October 9, 2019 at 5:12:53 AM UTC+3, Vitaly Davidovich wrote:
>>
>> FWIW, I’ve only seen lfence used precisely in the 2 cases mentioned in
>> this thread:
>> 1) use of non-temporal loads (ie weak ordering, normal x86 guarantees go
>> out the window)
>> 2) controlling execution of non-serializing instructions like rdtsc
>>
>> I’d be curious myself to hear of other cases.
>>
>
> Same here. That is the reason I'm asking the question: what is the purpose
> of the LFENCE (apart from #1)?
>
> I checked the Intel manual, but I could not make much sense of under which
> conditions #2 would be needed.
>
>
>
>>
>> On Fri, Oct 4, 2019 at 10:10 AM Peter Veentjer 
>> wrote:
>>
>>> I have been checking out the new fence APIs in Java
>>> (Unsafe/VarHandle).
>>>
>>> I understand how the higher-level APIs are translated to the logical
>>> fences, e.g. release fence -> LoadStore+StoreStore. There are some great
>>> posts, including
>>>
>>> https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>>> A great explanation of how a release fence needs to be combined with a
>>> StoreLoad to preserve sequential consistency.
>>>
>>> Also this post is great on the topic:
>>> https://preshing.com/20120913/acquire-and-release-semantics/
>>>
>>> When I zoom into the hardware, things are a bit blurrier.
>>>
>>> X86 provides the following guarantees:
>>> Loads won't be reordered with older loads   [LoadLoad]
>>> Stores won't be reordered with older stores (TSO) [StoreStore]
>>> Stores won't be reordered with older loads [LoadStore]
>>>
>>> One fundamental fence is the MFENCE because it will provide StoreLoad
>>> semantics. And on X86 the Unsafe.fullFence can be compiled to a MFENCE (in
>>> practice it uses the lock addl ...  but that isn't relevant for this
>>> discussion). This will prevent stores to be reordered with older stores and
>>> will make sure the memory is visible to other CPU's (by waiting for the
>>> store buffer to be drained).
>>>
>> I think you meant “prevent stores to be reordered with *later loads*”.
>>
>
> You are completely right. I should have checked my message more carefully.
>
>
>
>> In fact, awaiting store buffer drain is how it prevents the later load
>> from reordering with an earlier store - the load can’t retire (maybe not
>> even issue) while the store is sitting in the buffer (which would cause the
>> load-before-store reordering to be observed).
>>
>>>
>>> The SFENCE was a bit more obscure to me because X86 provides TSO; so what
>>> is the point of adding a [StoreStore] fence if the platform provides it out
>>> of the box (i.e. prevents stores from being reordered with older stores)?
>>> Apparently there are certain instructions, like those of SSE, that are weakly
>>> ordered, and these need the SFENCE. OK, I can live with that.
>>>
>>> But the LFENCE I can't place. Initially I thought it would provide a
>>> similar fix to the SFENCE, i.e. prevent load-load reordering for weakly
>>> ordered instructions like those of SSE. But apparently the LFENCE is a very
>>> different beast.
>>>
>>> Could someone shed some light on the purpose of the LFENCE?
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "mechanical-sympathy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to mechanical-sympathy+unsubscr...@googlegroups.com.
>>> To view this discussion on the web, visit
>>> https://groups.google.com/d/msgid/mechanical-sympathy/52527501-bffd-4a82-96fa-3fa618bec111%40googlegroups.com
>>> 
>>> .
>>>
>> --
>> Sent from my phone
>>
> --
> You received this message because you are subscribed to a topic in the
> Google Groups "mechanical-sympathy" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/mechanical-sympathy/BWYEfPKJeGQ/unsubscribe
> .
> To unsubscribe from this group and all its topics, send an email to
> mechanical-sympathy+unsubscr...@googlegroups.com.
> To view this discussion on the web, visit
> https://groups.google.com/d/msgid/mechanical-sympathy/b0441369-9342-4c6a-bd72-4c1537f16d0e%40googlegroups.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails f

Re: purpose of an LFENCE

2019-10-04 Thread Francesco Nigro
One of the rare cases where it makes sense to read stack-overflow (just joking 
:P): see 
https://stackoverflow.com/questions/37452772/x86-64-usage-of-lfence?rq=1

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
To view this discussion on the web, visit 
https://groups.google.com/d/msgid/mechanical-sympathy/18e51c28-8456-443d-a624-4aef7b8f79ae%40googlegroups.com.


Re: Volatile semantic for failed/noop atomic operations

2019-09-25 Thread Francesco Nigro
> I've never heard of failed CAS being cheaper
I'm referring to this old (but good) article: 
https://blogs.oracle.com/dave/biased-locking-in-hotspot

On Saturday, September 14, 2019 at 21:41:56 UTC+2, Vitaly Davidovich wrote:
>
> On x86, I’ve never heard of failed CAS being cheaper.  In theory, cache 
> snooping can inform the core whether its xchg would succeed without going 
> through the RFO dance.  But, to perform the actual xchg it would need 
> ownership regardless (if not already owned/exclusive).
>
> Sharing ordinary mutable memory isn’t scalable, forget locks :) Just doing 
> plain loads/stores of shared memory will kill scalability (and perf) due to 
> cache coherence traffic.
>
> ARM uses LL/SC for CAS, and is much more susceptible to failures (eg even 
> if the value is expected, if the line has been refreshed after the LL part, 
> it fails).  But, I’m not overly familiar with ARM internals/mechanics.
>
> On Sat, Sep 14, 2019 at 3:20 PM Francesco Nigro  > wrote:
>
>> Although not mentioned (neither on Doug's cookbook) I have always 
>> supposed was more likely the c++ default for both weak/strong CAS ie 
>> seq_cst memory ordering.
>> To make this question more mechanical sympathy focused: on hardware level 
>> what happen? I would be curious to know both x86/arm versions for this, 
>> what kind of memory reordering are guaranteed...
>>
>> Most people says that a failed CAS cost much less then one that succeed, 
>> but just like a load or more? 
>> Probably the success will cause invalidation of all the caches that 
>> reference the cache line changed, but the failed case should lock (on x86) 
>> it and AFAIK locks (hw or SW) are not well known to be scalable and 
>> performant :)
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "mechanical-sympathy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to mechanical-sympathy+unsubscr...@googlegroups.com .
>> To view this discussion on the web, visit 
>> https://groups.google.com/d/msgid/mechanical-sympathy/caebab48-f88f-477d-bc8e-95702e45bc76%40googlegroups.com
>> .
>>
> -- 
> Sent from my phone
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
To view this discussion on the web, visit 
https://groups.google.com/d/msgid/mechanical-sympathy/54efeb7e-04a2-48eb-a4f8-8e97556ccc03%40googlegroups.com.


Re: Volatile semantic for failed/noop atomic operations

2019-09-14 Thread Francesco Nigro
Although it's not mentioned (not even in Doug's cookbook), I have always assumed it is
most likely the C++ default for both weak/strong CAS, i.e. seq_cst memory ordering.
To make this question more mechanical-sympathy focused: at the hardware level, what
happens? I would be curious to know both the x86 and ARM versions of this, i.e. what
kinds of memory reordering are guaranteed...

Most people say that a failed CAS costs much less than one that succeeds, but is it
just like a load, or more?
Probably a success will cause invalidation of all the caches that reference
the changed cache line, but the failed case should still lock (on x86) it, and AFAIK
locks (hw or sw) are not well known to be scalable and performant :)
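
For concreteness, a minimal JMH sketch of the comparison I have in mind (all class and
method names are mine, not taken from any existing benchmark). It is single-threaded,
so it only shows the raw instruction cost of the two cases, not the coherence traffic
under contention:

import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class CasCostBench {

    private final AtomicLong value = new AtomicLong();

    @Benchmark
    public boolean successfulCas() {
        // expected == current value, so the CAS always succeeds
        long v = value.get();
        return value.compareAndSet(v, v + 1);
    }

    @Benchmark
    public boolean failedCas() {
        // the value is never set to -1, so the CAS always fails;
        // on x86 it is still a lock cmpxchg, i.e. a full-fence atomic RMW
        return value.compareAndSet(-1L, 0L);
    }
}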

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
To view this discussion on the web, visit 
https://groups.google.com/d/msgid/mechanical-sympathy/caebab48-f88f-477d-bc8e-95702e45bc76%40googlegroups.com.


Re: What linux distribution is better tooled for learning about resource utilization and perf tuning

2019-07-23 Thread Francesco Nigro
The problem with RHEL and recent-ish CentOS is that they use versions of glibc 
that implement many important features directly in ASM (pthread cond wait, 
spin lock/unlock, etc.), while Fedora does not. With perf's DWARF call-graph 
walker those frames won't (cannot) appear and won't be accounted correctly, so 
they are not that friendly with such tools.

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
To view this discussion on the web, visit 
https://groups.google.com/d/msgid/mechanical-sympathy/3534cace-58d1-4bb3-a44b-5757f6d1b5f4%40googlegroups.com.


Re: Performance is not composable

2019-03-26 Thread Francesco Nigro
> The call to consumeCPU may not have any effect on the loop performance if 
n is low enough

This is clear to me, but with a "high enough" n (to be quantified, depending 
on the operation to be amortized and on the HW/HW utilisation) it should 
produce a linear-ish time cost directly proportional to n: I suppose 
that the point of the author of the article is that it should be used to 
produce a deterministic load, in order to let the cost of the operation 
being amortized fade away, starting from an n to be experimentally 
evaluated.

What I'm not getting is why directly comparing the two amortized benchmark 
results (the ones I sent in the first email) is OK, while it is not OK to use 
a third benchmark that just calls Blackhole::consumeCPU.
I suppose that both approaches risk comparing apples to oranges; the latter is 
just even more wrong because it introduces a third reference...
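
To make the comparison concrete, here is a self-contained sketch of the three
benchmarks being discussed (class, method and parameter names are mine; rawA()/rawB()
are just stand-ins for the real operations):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
public class AmortizedCompareBench {

    @Param({"10", "100", "1000"})
    long tokens;

    int x = 42;

    int rawA() { return x * 31; }              // stand-in for operation A
    int rawB() { return Integer.reverse(x); }  // stand-in for operation B

    @Benchmark
    public int a() {
        Blackhole.consumeCPU(tokens);
        return rawA();
    }

    @Benchmark
    public int b() {
        Blackhole.consumeCPU(tokens);
        return rawB();
    }

    // The baseline: the temptation is to subtract its score from a()/b(),
    // but since the CPU can overlap consumeCPU(tokens) with rawA()/rawB(),
    // time(a) is not time(amortization) + time(rawA), so the subtraction
    // does not recover the cost of rawA()/rawB() in general.
    @Benchmark
    public void amortization() {
        Blackhole.consumeCPU(tokens);
    }
}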

On Monday, March 25, 2019 at 12:54:01 UTC+1, Avi Kivity wrote:
>
> consumeCPU() doesn't consume the CPU, but just some parts of it - a few 
> integer ports. It leaves alone the instruction cache, the instruction 
> decoder, and most of the execution ports.
>
>
> Given that the loop has 10 instructions, the function runs about 100 
> instructions, and so it can run in parallel with previous instructions from 
> the caller.
>
>
> Consider this code
>
>
>int a[1'000'000'000];
>
>fill a with random values in the range 0..999'999'999
>
>loop {
>
>idx = a[idx]
>
>consumeCPU(n)
>
>}
>
>
> The call to consumeCPU may not have any effect on the loop performance if 
> n is low enough. The processor can be memory-bound and execute consumeCPU() 
> while waiting for previous reads to complete.
>
>
> On 25/03/2019 10.44, Francesco Nigro wrote:
>
> Thanks Avi, 
>
> The way CPU can parallelize works make totally sense to me, but probably 
> I've missed some context about the operation used to amortize A (or B):
>
> http://hg.openjdk.java.net/code-tools/jmh/file/5984e353dca7/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#l443
>
> The point I'm not getting is why I can compare directly the results of a() 
> and b() (with amortization), but I cannot use amortization() to compare a() 
> and b(): to me they both look incorrect with different degrees. 
>
> If BlackHole::consumeCPU is "trusty" (ie has nearly fixed cost alone or if 
> combined with other operation) it means that it can always be used for 
> comparision, but if it is not, 
> the risk is that althought it can amortize with success rawA() (or 
> rawB()), it will interact with them and with low values of tokens
> the cost of the composed operations couldn't be compared in a way to say 
> anything about rawA() vs rawB() cost.
>
> I'm sure that there must be something that I'm missing here.
> FYI that's the asm printed by JMH related to tokens = 10 and I can see 
> that it doesn't get inlined (probably too unpredictable?):
>
> [Hottest Region 
> 1]..
> C2, level 4, org.openjdk.jmh.infra.Blackhole::consumeCPU, version 501 (88 
> bytes) 
>
> 
> Decoding compiled method 0x7f15e9223bd0:
> Code:
> [Entry Point]
> [Verified Entry Point]
> [Constants]
>   # {method} {0x7f15fcde2298} 'consumeCPU' 
> '(J)V' in 'org/openjdk/jmh/infra/Blackhole'
>   # parm0:rsi:rsi   = long
>   #   [sp+0x30]  (sp of caller)
>   0x7f15e9223d20: mov%eax,-0x14000(%rsp)
>   2.31%   0x7f15e9223d27: push   %rbp
>   0x7f15e9223d28: sub$0x20,%rsp
>  ;*synchronization entry
> ; - 
> org.openjdk.jmh.infra.Blackhole::consumeCPU@-1 (line 456)
>   2.01%   0x7f15e9223d2c: movabs $0x76fe0bb60,%r10  ;   {oop(a 
> 'java/lang/Class' = 'org/openjdk/jmh/infra/Blackhole')}
>   0.34%   0x7f15e9223d36: mov0x68(%r10),%r10;*getstatic 
> consumedCPU
> ; - 
> org.openjdk.jmh.infra.Blackhole::consumeCPU@0 (line 456)
>   0x7f15e9223d3a: test   %rsi,%rsi
>  ╭0x7f15e9223d3d: jle0x7f15e9223d75  ;*ifle
>  │  ; - 
> org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
>  │0x7f15e9223d3f: movabs $0x5deece66d,%r11
>   1.5

Re: Performance is not composable

2019-03-25 Thread Francesco Nigro
bp
  0x7f15e9223d9a: jl 0x7f15e9223da4
  0x7f15e9223d9c: setne  %bpl
  0x7f15e9223da0: movzbl %bpl,%ebp  ;*lcmp

 88.00%  

Looking at the code, it seems that all the comments on 
Blackhole::consumeCPU are correct (related to cmp and test) and it is 
indeed a very straightforward translation from code to ASM (no loop 
unrolling or weird tricks by the JVM, just an mfence dropped somewhere, 
maybe).
Just safepoint polls on each loop iteration and at the end of the method call.

On Sunday, March 24, 2019 at 10:12:56 UTC+1, Avi Kivity wrote:
>
> Suppose you have micro-operations A and B that take t(A) and t(B) to run. 
> Running repeat(n, A+B) can take n*(t(a) + t(b)), or n*max(t(a), t(b)), or 
> n*(t(a) + t(b) + huge_delta), or something else.
>
>
> Sometimes the CPU can completely parallelize A and B so running them in 
> parallel takes no extra time compared to just one. Sometimes running them 
> in sequence causes one of the caches to overflow and efficiency decreases 
> dramatically. And sometimes running both can undo some quirk and you end up 
> with them taking less time.
>
>
> Summary: CPUs are complicated.
>
>
> On 24/03/2019 10.11, Francesco Nigro wrote:
>
> Hi folks, 
>
> while reading the awesome 
> https://shipilev.net/blog/2014/nanotrusting-nanotime/ I have some 
> questions on the "Building Performance Models" part.
> Specifically, when you want to compare 2 operations (ie A,B) and you want 
> to emulate the same behaviour of a real application you need to amortize 
> the cost of such operations: 
> in JMH this is achieved with BlackHole.consumeCPU(int tokens), but any 
> microbenchmark tool (if not on the JVM) could/should provide something 
> similar.
>
> Said that, now the code to measure is not just A or B but is composed by 2 
> operations: amortization; A() (or B);
> For JMH that means:
>
> @Benchmark
> int a() {
> BlackHole.consumeCPU(tokens);
> //suppose that rawA() return an int and rawA() is the original call re 
> A
> return rawA();
> }
>
> in JMH is up to the tool to avoid Dead Code Elimination when you return a 
> value from a benchmarked method.
>
> The point of the article seems to be that given that "performance is not 
> composable" if you want to compare A and B cost (with amortization) you 
> cannot create a third benchmark:
>
> @Benchmark
> int amortization() {
> BlackHole.consumeCPU(tokens);
> }
>
> And use the benchmark results (eg throughput of calls) to be subtracted 
> from the results of a() (or b()) to compare A and B costs.
> I don't understand the meaning of "performance is not composable" and I 
> would appreciate your opinion on that, given that many people
> of this list have experience with benchmarking.
>
> Thanks,
> Franz
> -- 
> You received this message because you are subscribed to the Google Groups 
> "mechanical-sympathy" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to mechanical-sympathy+unsubscr...@googlegroups.com .
> For more options, visit https://groups.google.com/d/optout.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Performance is not composable

2019-03-24 Thread Francesco Nigro
Hi folks,

while reading the awesome 
https://shipilev.net/blog/2014/nanotrusting-nanotime/ I have some questions 
on the "Building Performance Models" part.
Specifically, when you want to compare two operations (i.e. A, B) and you want 
to emulate the same behaviour as a real application, you need to amortize 
the cost of such operations: 
in JMH this is achieved with Blackhole.consumeCPU(long tokens), but any 
microbenchmark tool (if not on the JVM) could/should provide something 
similar.

That said, the code to measure is not just A or B, but is composed of two 
operations: amortization + A() (or B()).
For JMH that means:

@Benchmark
int a() {
    Blackhole.consumeCPU(tokens);
    // suppose that rawA() returns an int and is the original call for A
    return rawA();
}

In JMH it is up to the tool to avoid Dead Code Elimination when you return a 
value from a benchmarked method.

The point of the article seems to be that, given that "performance is not 
composable", if you want to compare the cost of A and B (with amortization) you 
cannot create a third benchmark:

@Benchmark
void amortization() {
    Blackhole.consumeCPU(tokens);
}

And use its benchmark results (e.g. throughput of calls), subtracting them 
from the results of a() (or b()), to compare the costs of A and B.
I don't understand the meaning of "performance is not composable" and I 
would appreciate your opinion on that, given that many people
on this list have experience with benchmarking.

Thanks,
Franz

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: RSS and CPU selection

2019-03-15 Thread Francesco Nigro
Hi Peter!

It is a nice intuition!
Have you tried the example 
at https://www.kernel.org/doc/Documentation/IRQ-affinity.txt
to check whether, on your machine, the IRQs are always being handled by a specific 
CPU?
And remember to disable/stop irqbalance, to keep the OS from automatically 
rebalancing IRQs across CPUs in the background!
I'm not that familiar with this stuff, but on Monday I will see my colleagues 
working on the kernel at the office and I will ask them about it.

Cheers,
Franz

On Friday, March 15, 2019 at 15:40:07 UTC+1, Peter Veentjer wrote:
>
> I have a question about RSS and how a CPU is selected to process the 
> interrupt.
>
> So with RSS you have multiple rx-queue and each rx-queue has an IRQ 
> associated to it. Each CPU can pick up the IRQ as clearly explained here:
>
> https://www.kernel.org/doc/Documentation/networking/scaling.txt
>
> It is possible to use IRQ/CPU affinity to prevent just any CPU from
> picking up the IRQ.
>
> My question is about the default behavior when no explicit affinity has 
> been set.
>
> Is there any advantage in letting the same rx-queue be processed by 
> different CPUs? If a random CPU picks up the same IRQ, you could end up 
> with a packet being processed on a different CPU for every packet being 
> received. This will cause a lot of cache coherence traffic which will 
> certainly not improve performance.
>
> When I look at /proc/interrupts of my local system, it seems the same CPU 
> is processing a specific rx-queue. So it looks that the OS is already 
> applying some form of affinity to map an IRQ to a CPU.
>
> Can someone shed a light on this?
>
> Thanks.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Huge, unexpected performance overhead (of static methods?)

2019-02-07 Thread Francesco Nigro
I suspect that the build phase of an app using Graal AOT could suffer the
same limitation, but it is correct that static initializers *shouldn't* be
used for heavyweight computations, given that they are guaranteed to be run
serially...
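
To make the recommendation in the quoted reply below concrete, a minimal sketch
(class names are mine) of the two shapes:

// Problematic shape: the heavy static call runs while Heavy itself is still
// being initialized, so it cannot be fully optimized.
class Heavy {
    static final long TABLE_CHECKSUM;
    static {
        TABLE_CHECKSUM = computeChecksum(); // static call on the initializing class
    }
    static long computeChecksum() {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++) {
            sum += i * 31L;
        }
        return sum;
    }
}

// Recommended shape: the heavy logic lives in a separate class, so while it
// runs the only class still being initialized is Light, not the callee.
final class ChecksumSupport {
    static long computeChecksum() {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++) {
            sum += i * 31L;
        }
        return sum;
    }
}

class Light {
    static final long TABLE_CHECKSUM = ChecksumSupport.computeChecksum();
}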

On Fri, Feb 8, 2019, 02:36 Andrei Pangin  wrote:

> Looks like you run long computation inside the static initializer
> , right?
> The problem is that HotSpot cannot fully optimize static calls when the
> holder class is not completely initialized. Moving computation to a
> separate class is the right way to deal with such cases.
>
> The situation got even worse in recent JDK updates. There was a zero day
> bug https://bugs.openjdk.java.net/browse/JDK-8215634 that could result in
> invocation of a static method of uninitialized class in violation of JVMS.
> The bug was fixed in JDK 8u201 and JDK 11.0.2. However, the fix is terrible
> from performance perspective: now the invocation of a static method of
> uninitialized class goes through the slow path, in particular, through
> resolve_static_call() JVM runtime function.
>
> So, the recommendation is simple: move heavy logic out of a class being
> initialized.
>
> - Andrei
>
> On Thursday, February 7, 2019 at 1:54:29 UTC+3, Shevek wrote:
>>
>> Hmm... "Justify/explain your beliefs." Fair, I'm game:
>>
>> I think I'm sure that my methods are both JIT'd because perf-java-flames
>> shows them both in green, and I think that's derived from some prefix on
>> the symbol name where the JVM (via perf) says it's JIT'd. If it weren't
>> JIT'd, I'd see it listed in red as "Interpreter".
>>
>> I waited for the system to warm up before grabbing the flame-chart, and
>> on one run, I was too early, so I actually saw one symbol shift from
>> interpreted to JIT. I have not used jitwatch, but the signs I am looking
>> at seem reliable. The interpreter JIT'd the inner method first, then the
>> outer one, as expected based on call frequency.
>>
>> I've never seen this resolve_static_call method show up before in any of
>> my profiling of Java code.
>>
>>  From that PoV, it looks as if the interpreter is doing something odd
>> with method call between the two methods, despite that it should just be
>> a static linkage. Like it JIT'd both methods, but didn't JIT the call,
>> or it's using some interpreter-path code in the call? So is there a
>> linkage limit based on #methods in the class, or something? Because
>> making the methods non-static ALSO moved them into a new class which has
>> only 2 methods... ok, new experiment...
>>
>> For what it's worth, I seem to be really struggling with JDK 1.8.1b191
>> performance on the new Xeon hardware for other reasons, too. I'm seeing
>> perf saying pthread_cond_wait -> native_write_msr is taking 50% of
>> runtime, and not even sure where to start with that except limit the
>> life of any JVM to 6 hours and restart it. I kind of want to blame a
>> Kernel / PMU change but it only affects the JVM.
>>
>> Caveat: I don't do JVM internals, I'm mostly a JLS-layer muggle.
>>
>> S.
>>
>> On 2/4/19 10:53 PM, Todd Lipcon wrote:
>> > On Mon, Feb 4, 2019 at 9:13 PM Shevek > > > wrote:
>> >
>> > This isn't a JIT issue. According to perf-java-flames, all my code
>> DID
>> > get jitted. The overhead is entirely calls to this mystery
>> > resolve_static_call function, so it looks like a static method
>> lookup
>> > issue in the JVM. The shape of the stack profile makes it look as
>> if
>> > something is recursive, too.
>> >
>> >
>> > Are you sure? From the code it certainly looks like
>> > 'resolve_static_call' is part of the interpreter code path.
>> >
>> > -Todd
>> >
>> >
>> > On 2/4/19 8:01 PM, Todd Lipcon wrote:
>> >  > Tried looking at LogCompilation output with jitwatch? It's been
>> > helpful
>> >  > for me in the past to understand why something wouldn't get
>> jitted.
>> >  >
>> >  > Todd
>> >  >
>> >  > On Mon, Feb 4, 2019, 7:54 PM Shevek > > 
>> >  > > wrote:
>> >  >
>> >  > Update: I now think this is slow (but not AS slow) on the
>> > Core i7-5600U
>> >  > so this may be a regression from _181 to _191, and not
>> entirely
>> >  > CPU-dependent?
>> >  >
>> >  > Wrapping the two static methods in an otherwise-pointless
>> > class, and
>> >  > calling them as instance methods made the code much faster.
>> >  >
>> >  > Is it relevant that the class in question is 522419 Kb in
>> > size and
>> >  > contains 1696 (mostly instance) methods? No individual
>> method
>> > in it is
>> >  > larger than 8K, so they all JIT.
>> >  >
>> >  > The outer readVarintTable method is called about 100K-500K
>> > times, so
>> >  > there's plenty of chance to replace it.
>> >  >
>> >  > No synchronization is used.
>> >  >
>>

Re: Happens before between putting and getting from and to ConcurrentHashMap

2018-11-19 Thread Francesco Nigro
Good point, you're totally right: I have checked the CHM Javadoc and it
doesn't say anything about "totally" ordered operations.
"By contract" we can't rely on such ordering: I would say that only if the
map consumer is per-key can we assume that what I wrote in the previous
answer is correct.
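
Just to spell out the per-key guarantee with a minimal sketch (names are mine):

import java.util.concurrent.ConcurrentHashMap;

public class ChmHappensBefore {

    static final class Holder { int value; }

    static final ConcurrentHashMap<String, Holder> MAP = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        Thread writer = new Thread(() -> {
            Holder w = new Holder();
            w.value = 42;      // plain write
            MAP.put("k", w);   // happens-before any get("k") that returns w
        });
        Thread reader = new Thread(() -> {
            Holder u;
            while ((u = MAP.get("k")) == null) {
                Thread.onSpinWait();
            }
            System.out.println(u.value); // must print 42
        });
        writer.start();
        reader.start();
    }
}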

On Mon, Nov 19, 2018 at 16:57 Roman Leventov <leventov...@gmail.com> wrote:

> On the other hand, CHM didn't guarantee volatile store semantics of put(),
> it only guaranteed HB between get() and put() with the same key. However,
> it's the same for java.util.concurrent's Queue implementations, such as
> ConcurrentLinkedQueue: Javadoc doesn't say that queue.offer() and poll()
> are synchronization actions.
>
> On Mon, 19 Nov 2018 at 16:11, Francesco Nigro  wrote:
>
>> It would matter if you add some logic around it:
>>
>> //put of a value in the map with just a lazySet/putOrdered etc etc
>>
>> If (anyConsumerIsSleeping()) {
>>wakeUpConsumers();
>> }
>>
>> If you won't have a full barrier the compile *could* reorder
>> anyConsumerIsSleeping() ahead of the map.put, making possible to miss some
>> consumer that is gone asleep :)
>> The consumer example is more related to a queue case, but the idea is the
>> same: you can't rely on any sequential consistent order on that operation
>> (the volatile store), because without a volatile store that order doesn't
>> exist.
>>
>> On Monday, November 19, 2018 at 14:04:15 UTC+1, Roman Leventov wrote:
>>>
>>> Does anybody understand what could go wrong if that CHM.Node.value
>>> volatile write is relaxed to storeFence + normal write, and no fence at all
>>> within the CHM.Node constructor (Hotspot JVM only)?
>>>
>>> On Mon, 19 Nov 2018, 10:28 Jean-Philippe BEMPEL >> wrote:
>>>
>>>> Thanks Vladimir for your thoroughly explanation, I need to re read the
>>>> Aleksey's JMM pragmatics 10 times more I guess 🙄
>>>>
>>>> On Sun, Nov 18, 2018 at 7:35 PM Vladimir Sitnikov <
>>>> sitnikov...@gmail.com> wrote:
>>>>
>>>>> Jean-Philippe>is a write to value but no read of this va lue inside
>>>>> the same thread, so the write is free to be reordered
>>>>>
>>>>> It ("reordering") does not really matter.
>>>>>
>>>>> For instance,
>>>>>
>>>>> 17.4.5. Happens-before Order> If the reordering produces results
>>>>> consistent with a legal execution, it is not illegal.
>>>>>
>>>>> What matters is the set of possible "writes" that given "read" is
>>>>> allowed to observe.
>>>>>
>>>>>
>>>>> In this case, simple transitivity is enough to establish hb.
>>>>> As Gil highlights, "negations" are a bit hard to deal with, and
>>>>> Mr.Alexey converts the negations to a positive clauses:
>>>>> https://shipilev.net/blog/2014/jmm-pragmatics/#_happens_before
>>>>>
>>>>> Shipilёv> Therefore, in the absence of races, we can only see the
>>>>> latest write in HB.
>>>>>
>>>>> Note: we (as programmers) do not really care HOW the runtime and/or
>>>>> CPU would make that possible. We have guarantees from JVM that "in the
>>>>> absence of races, we can only see the latest write in HB".
>>>>> CPU can reorder things and/or execute multiple instructions in
>>>>> parallel. I don't really need to know the way it is implemented in order 
>>>>> to
>>>>> prove that "CHM is fine to share objects across threads".
>>>>>
>>>>> Just in case: there are two writes for w.value field.
>>>>> "write1" is "the write of default value" which "synchronizes-with the
>>>>> first action in every thread" (see 17.4.4.) + "If an action x
>>>>> synchronizes-with a following action y, then we also have hb(x, y)." (see
>>>>> 17.4.5)
>>>>> "write2" is "w.value=42"
>>>>>
>>>>> "value=0" (write1) happens-before "w.value=42" (write2) by definition
>>>>> (17.4.4+17.4.5)
>>>>> w.value=42 happens-before map.put (program order implies
>>>>> happens-before)
>>>>> read of u.value happens-before map.put (CH

Re: Happens before between putting and getting from and to ConcurrentHashMap

2018-11-19 Thread Francesco Nigro
It would matter if you add some logic around it:

// put of a value in the map with just a lazySet/putOrdered etc.

if (anyConsumerIsSleeping()) {
    wakeUpConsumers();
}

If you don't have a full barrier, the compiler *could* reorder 
anyConsumerIsSleeping() ahead of the map.put, making it possible to miss a 
consumer that has gone to sleep :)
The consumer example is more related to a queue case, but the idea is the 
same: you can't rely on any sequentially consistent order for that operation 
(the volatile store), because without a volatile store that order doesn't 
exist.
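
A minimal sketch (field and method names are mine) of the same concern, side by side:

import java.util.concurrent.atomic.AtomicReference;

// With an ordered/lazySet publication there is no StoreLoad barrier, so the
// read of the "consumer sleeping" flag may effectively be performed before
// the store is visible, and a consumer that went to sleep in between can be
// missed. With a volatile store, the trailing StoreLoad (a lock-prefixed
// instruction on x86) keeps the flag read after the publication.
final class WakeupPublisher {

    private final AtomicReference<Object> slot = new AtomicReference<>();
    private volatile boolean consumerSleeping;

    void publishOrdered(Object value) {
        slot.lazySet(value);          // release store only
        if (consumerSleeping) {       // may be reordered ahead of the store
            wakeUpConsumers();
        }
    }

    void publishVolatile(Object value) {
        slot.set(value);              // volatile store: StoreLoad barrier follows on x86
        if (consumerSleeping) {       // cannot float above the store
            wakeUpConsumers();
        }
    }

    private void wakeUpConsumers() {
        // e.g. LockSupport.unpark(consumerThread) in a real implementation
    }
}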

On Monday, November 19, 2018 at 14:04:15 UTC+1, Roman Leventov wrote:
>
> Does anybody understand what could go wrong if that CHM.Node.value 
> volatile write is relaxed to storeFence + normal write, and no fence at all 
> within the CHM.Node constructor (Hotspot JVM only)?
>
> On Mon, 19 Nov 2018, 10:28 Jean-Philippe BEMPEL   wrote:
>
>> Thanks Vladimir for your thoroughly explanation, I need to re read the 
>> Aleksey's JMM pragmatics 10 times more I guess 🙄
>>
>> On Sun, Nov 18, 2018 at 7:35 PM Vladimir Sitnikov > > wrote:
>>
>>> Jean-Philippe>is a write to value but no read of this va lue inside the 
>>> same thread, so the write is free to be reordered
>>>
>>> It ("reordering") does not really matter.
>>>
>>> For instance,
>>>
>>> 17.4.5. Happens-before Order> If the reordering produces results 
>>> consistent with a legal execution, it is not illegal.
>>>
>>> What matters is the set of possible "writes" that given "read" is 
>>> allowed to observe.
>>>
>>>
>>> In this case, simple transitivity is enough to establish hb.
>>> As Gil highlights, "negations" are a bit hard to deal with, and 
>>> Mr.Alexey converts the negations to a positive clauses: 
>>> https://shipilev.net/blog/2014/jmm-pragmatics/#_happens_before
>>>
>>> Shipilёv> Therefore, in the absence of races, we can only see the latest 
>>> write in HB.
>>>
>>> Note: we (as programmers) do not really care HOW the runtime and/or CPU 
>>> would make that possible. We have guarantees from JVM that "in the absence 
>>> of races, we can only see the latest write in HB".
>>> CPU can reorder things and/or execute multiple instructions in parallel. 
>>> I don't really need to know the way it is implemented in order to prove 
>>> that "CHM is fine to share objects across threads".
>>>
>>> Just in case: there are two writes for w.value field.
>>> "write1" is "the write of default value" which "synchronizes-with the 
>>> first action in every thread" (see 17.4.4.) + "If an action x 
>>> synchronizes-with a following action y, then we also have hb(x, y)." (see 
>>> 17.4.5)
>>> "write2" is "w.value=42"
>>>
>>> "value=0" (write1) happens-before "w.value=42" (write2) by definition 
>>> (17.4.4+17.4.5)
>>> w.value=42 happens-before map.put (program order implies happens-before)
>>> read of u.value happens-before map.put (CHM guarantees that)
>>>
>>> In other words, "w.value=42" is the latest write in hb order for u.value 
>>> read, so u.value must observe 42.
>>> JRE must ensure that the only possible outcome for the program in 
>>> question is 42.
>>>
>>> Vladimir
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "mechanical-sympathy" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to mechanical-sympathy+unsubscr...@googlegroups.com 
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "mechanical-sympathy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to mechanical-sympathy+unsubscr...@googlegroups.com .
>> For more options, visit https://groups.google.com/d/optout.
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Concurrent retrieval of statistics

2018-10-16 Thread Francesco Nigro
Or you could use something similar to this too: if you are slow to collect, 
you lose data (and you know how many samples), but you won't slow down 
the producer too much... https://github.com/real-logic/agrona/pull/119
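
A minimal sketch (my own names, much simpler than the broadcast buffer in that PR)
of the per-thread idea described in the quoted reply below: each producer owns one
slot and updates it with a cheap ordered store, the housekeeping thread reads all
slots; a slow collector just observes fewer intermediate values and never blocks
the producers:

import java.util.concurrent.atomic.AtomicLongArray;

final class PerThreadCounters {

    private final AtomicLongArray counters;

    PerThreadCounters(int producerCount) {
        // padding against false sharing omitted for brevity
        this.counters = new AtomicLongArray(producerCount);
    }

    // called only by producer 'id' (single writer per slot): an ordered
    // store (lazySet) is enough to publish the new value
    void increment(int id) {
        counters.lazySet(id, counters.get(id) + 1);
    }

    // called by the housekeeping/collector thread
    long total() {
        long sum = 0;
        for (int i = 0; i < counters.length(); i++) {
            sum += counters.get(i); // volatile read
        }
        return sum;
    }
}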

On Tuesday, October 16, 2018 at 10:47:31 UTC+2, Wojciech Kudla wrote:
>
> I can only speak from my own experience. For data generated by latency 
> critical threads you will probably want to have a simple SPSC buffer per 
> thread meaning no competition between producers so fewer cycles lost on 
> cache coherence. The consumer could just iterate over all known buffers and 
> drain them in some housekeeping thread (preferably on the same numa node). 
> Nitsan Wakart did some inspiring work on queues and there's plenty to 
> learn from it, however for simple, well structured data you would probably 
> be better off with something bespoke. 
> I typically use an off-heap circular byte buffer with producer and 
> consumer high watermarks cached for additional speed. 
> I guess choosing the right data structure and the means of generally 
> moving the data will depend on many factors but most of the time jdk 
> components are not suitable for that. 
>
>
>
> On Tue, 16 Oct 2018, 09:33 Mohan Radhakrishnan,  > wrote:
>
>> Hi,
>> There is streaming data everywhere like trading data , JVM 
>> logs.etc. Retrieval of statistics of this data need fast data structures. 
>> Where can I find the literature on such fast data structures to store and 
>> retrieve timestamps and data in O(!) time ?  Should this always be 
>> low-level Java concurrent utilities ?
>>
>> Thanks,
>> Mohan
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "mechanical-sympathy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to mechanical-sympathy+unsubscr...@googlegroups.com .
>> For more options, visit https://groups.google.com/d/optout.
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Load generators and Coordination Omission compensations

2018-10-04 Thread Francesco Nigro
Hi guys,

I have put a question (probably too basic for many folks here!) on 
HdrHistogram issues page: 
Load generator CO free and HdrHistogram 


I'm assuming most of the folks here are familiar with the "Coordinated 
Omission" term, but if not, I think there are many articles around that 
explain it much better than me, e.g. 
http://psy-lob-saw.blogspot.com/2015/03/fixing-ycsb-coordinated-omission.html 
by the perf gangster Nitsan W.

The question is related to the tool (HdrHistogram) but can be considered a 
wider one: "How many times, and where, do I have to correct for CO in order to 
correctly measure the responsiveness under load of a system?"
Please share your thoughts...

Thanks,
Franz

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: synchronized and ReentrantLock are REALLY the same?

2018-05-21 Thread Francesco Nigro

>
> Preventing your appear-prior-to-lock-acquisition writes from "moving into 
> the block" is subtly different from preventing their re-ordering with 
> writes and reads that are within the block. Are you sure you want the 
> former and not just the latter?


Yes, or at least it seems to be the only way I have found to implement a sort of 
naive "deadlock" (or better, "suspected slowness") detector.
Suppose we have an ordered executor API (i.e. an executor that ensures that 
all submitted tasks are executed in order, but not necessarily by 
one/the same thread) and 
some alien client library with synchronized calls. In order to detect at 
runtime that the synchronized calls are not deadlocked, or simply unable to 
leave the synchronized block due to 
some suspected slowness, I was thinking of implementing a watchdog service that 
monitors, at an interval, each Thread used by the ordered executor, polling the 
time elapsed since the thread 
was last seen approaching a synchronized block.
Translated into pseudo-code:

you have a

public synchronized void foo() {
}

and some code:

...
executor.submit(() -> alien.foo());
...

it should become:
...
executor.submit(watchDog -> {
    watchDog.beforeEnter();
    try {
        alien.foo();
    } finally {
        watchDog.exit();
    }
});
...
The API is not definitive at all, but the point is that 
watchDog.beforeEnter() should not be moved into the synchronized block, 
because that would make it impossible to compute the time elapsed inside the
synchronized block if some other thread is preventing alien.foo() from entering 
it.
Probably there are better/smarter ways to implement it, but that's the 
reason behind my question :)
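
A minimal sketch of such a watchdog (all names are mine; just the naive idea, not a
definitive API). The volatile store in beforeEnter() is the write that, per the
discussion above, must not be allowed to move into the synchronized block:

import java.util.concurrent.TimeUnit;

final class WatchDog {

    private volatile long beforeEnterNanos = Long.MAX_VALUE;

    void beforeEnter() {
        // volatile store, published before trying to enter the synchronized block
        beforeEnterNanos = System.nanoTime();
    }

    void exit() {
        beforeEnterNanos = Long.MAX_VALUE;
    }

    // polled periodically by the monitoring thread
    boolean suspectedStuck(long timeout, TimeUnit unit) {
        long start = beforeEnterNanos; // volatile load
        return start != Long.MAX_VALUE
                && System.nanoTime() - start > unit.toNanos(timeout);
    }
}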

On Saturday, May 19, 2018 at 06:21:19 UTC+2, Gil Tene wrote:
>
>
>
> On Friday, May 18, 2018 at 11:10:45 PM UTC-5, Gil Tene wrote:
>>
>>
>>
>> On Friday, May 18, 2018 at 2:30:08 AM UTC-5, Francesco Nigro wrote:
>>>
>>> Thanks Gil!
>>>
>>> I will failback to my original (semi-practical) concern, using this 
>>> renewed knowledge :)
>>> Suppose that we want to perform write operations surrounding both a 
>>> j.u.c. Lock and synchronized mutual exclusion block and we want:
>>>
>>>1. these writes operations to not being moved inside the block and 
>>>maintain their relative positions from it
>>>
>>> Preventing your appear-prior-to-lock-acquisition writes from "moving 
>> into the block" is subtly different from preventing their re-ordering with 
>> writes and reads that are within the block. Are you sure you want the 
>> former and not just the latter?
>>
>
> It is also worth noting that since any writes moving into the synchronized 
> (or locked) block WILL appear atomically with any other reads and writes 
> within the block to any other thread that synchronizes (or locks) on the 
> same object (or lock), the only threads that may observe this 
> reorder-into-the-block effects are ones that would also potentially observe 
> the things *within* the block in non-atomic and non-synchronized ways.
>
> So the purpose of preventing the reordering of things into the block looks 
> suspicious, to begin with. I'm not saying there is no possible reason for 
> it, just that it seems suspect. As in "uses non-synchronized accesses to 
> things that are elsewhere intentionally done under a lock", Which spells 
> "many bugs can probably found here" to me.
>
>  
>
>> It is easier to see how to prevent the latter. All you have to do is 
>> order the earlier writes against those in-block writes and/or reads, 
>> ignoring the lock. This is where, for example, a lazySet on the 
>> inside-the-block writes will order them against the before-the-block 
>> writes, regardless of how the monitor enter is dealt with, which can save 
>> you from using volatiles. If there are reads within the block that you need 
>> to order against the before-the-block writes, you'd need to use volatiles 
>> (e.g. a volatile store for the last before-the-block write, AND a volatile 
>> load for the first in-the-block read).
>>
>> If you actually want to prevent the former (and there are ways to observe 
>> whether or not reordering "into" the block occur), you may need more 
>> complicated things. But do you really need that? Someone may be able to 
>> find some way to detect whether or not such reordering of the writes and 
>> the lock-enter happens [I'm actually not sure whether such reordering, 
>> without also reordering against writes and reads in the block, is 
>> detectable]. An if that detection is possible, it also means

Re: synchronized and ReentrantLock are REALLY the same?

2018-05-18 Thread Francesco Nigro
Thanks Gil!

I will fall back to my original (semi-practical) concern, using this renewed 
knowledge :)
Suppose that we want to perform write operations surrounding both a j.u.c. 
Lock and a synchronized mutual exclusion block, and we want:

   1. these write operations not to be moved inside the block, maintaining 
   their relative positions with respect to it
   2. the effects of the writes to be atomically readable from other 
   threads

Given that we can't assume any semantic difference between the j.u.c. Lock 
and intrinsic ones, and that no effects on the surrounding code are clearly 
listed anywhere (except in specific implementations/the Cookbook), how can we 
implement this correctly?
And... can we do it with just this knowledge?
The only solution I see is to use a volatile store on both (or at least 
on the first) write operations, while ordered (aka lazySet) ones can't work 
as expected. 

Cheers,
Franz


On Wednesday, May 16, 2018 at 17:26:59 UTC+2, Gil Tene wrote:
>
> Note that Doug' Lea's JMM cookbook is written for implementors of JDKs and 
> related libraries and JITs, NOT for users of those JDKs and libraries. It 
> says so right in the title. It describes rules that would result in a 
> *sufficient* implementation of the JMM but is not useful for deducing the 
> *required 
> or expected *behavior of all JMM implementations. Most JMM 
> implementations go beyond the cookbook rules in at least some places and 
> apply JMM-valid transformations that are not included in it and can be 
> viewed as "shortcuts" that bypass some of the rules in the cookbook. There 
> are many examples of this in practice. Lock coarsening and lock biasing 
> optimizations are two good example sets.
>
> This means that you need to read the cookbook very carefully, and 
> (specifically) that you should not interpret it as a promise of what the 
> relationships between various operations are guaranteed to be. If you use 
> the cookbook for the latter, your code will break.
>
>
> Putting aside the current under-the-hood implementations of monitor 
> enter/exit and of ReentrantLock (which may and will change), the 
> requirements are clear:
>
> from e.g. 
> https://docs.oracle.com/javase/10/docs/api/java/util/concurrent/locks/ReentrantLock.html
> :
>
> "A reentrant mutual exclusion Lock 
> 
>  with 
> the same basic behavior and semantics as the implicit monitor lock accessed 
> using synchronized methods and statements, but with extended 
> capabilities."
>
>
> from e.g. 
> https://docs.oracle.com/javase/10/docs/api/java/util/concurrent/locks/Lock.html
> :
>
> "Memory Synchronization
>
> All Lock implementations *must* enforce the same memory synchronization 
> semantics as provided by the built-in monitor lock, as described in Chapter 
> 17 of The Java™ Language Specification 
> :
>
>
>- A successful lock operation has the same memory synchronization 
>   effects as a successful *Lock* action.
>   - A successful unlock operation has the same memory synchronization 
>   effects as a successful *Unlock* action.
>
> Unsuccessful locking and unlocking operations, and reentrant 
> locking/unlocking operations, do not require any memory synchronization 
> effects."
>
>
> So based on the spec, I'd say that you cannot make any assumptions about 
> semantic differences between ReentrantLock and synchronized blocks (even if 
> you find current implementation differences).
>
> On Wednesday, May 16, 
>>
>> Hi guys!
>>
>> probably this one should be more a concurrency-interest question, but I'm 
>> sure that's will fit with most of the people around here as well :)
>> I was looking to how ReentrantLock and synchronized are different from a 
>> semantic point of view and I've found (there are no experimental proofs on 
>> my side TBH) something interesting on the 
>> http://gee.cs.oswego.edu/dl/jmm/cookbook.html. 
>> It seems to me that:
>>
>> normal store;
>> monitorEnter;
>> [mutual exclusion zone]
>> monitorExit;
>> 
>>
>> is rather different from a:
>>
>> normal store;
>> (loop of...)
>> volatile load
>> volatile store (=== CAS)
>> ---
>> [mutual exclusion zone]
>> volatile store
>>
>> With the former to represent a synchronized block and the latter a spin 
>> lock acquisition and release, both with a normal store on top.
>> From anyone coming from other worlds than the JVM i suppose that the 
>> volatile load could be translated as a load acquire, while the volatile 
>> store as a sequential consistent store.
>>
>> About the monitorEnter/Exit that's more difficult to find something that 
>> fit other known memory models (C++) and that's the reason of my request :) 
>> From the point of view of the compiler guarantees seems that a normal 
>> store (ie any store release as well) preceding monitorEnter could be moved 
>> inside the mutual exclusion zone (past t

synchronized and ReentrantLock are REALLY the same?

2018-05-16 Thread Francesco Nigro
Hi guys!

probably this one should be more of a concurrency-interest question, but I'm 
sure it will fit most of the people around here as well :)
I was looking at how ReentrantLock and synchronized differ from a 
semantic point of view, and I've found (there are no experimental proofs on 
my side, TBH) something interesting in 
http://gee.cs.oswego.edu/dl/jmm/cookbook.html. 
It seems to me that:

normal store;
monitorEnter;
[mutual exclusion zone]
monitorExit;


is rather different from a:

normal store;
(loop of...)
volatile load
volatile store (=== CAS)
---
[mutual exclusion zone]
volatile store

The former represents a synchronized block and the latter a spin 
lock acquisition and release, both with a normal store on top.
For anyone coming from worlds other than the JVM, I suppose that the 
volatile load could be translated as a load-acquire, and the volatile 
store as a sequentially consistent store.
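
For reference, a minimal Java spin lock matching the second sequence (my own names;
a plain test-and-test-and-set on an AtomicBoolean):

import java.util.concurrent.atomic.AtomicBoolean;

final class SpinLock {

    private final AtomicBoolean locked = new AtomicBoolean();

    void lock() {
        do {
            while (locked.get()) {      // volatile load: spin while held
                Thread.onSpinWait();
            }
        } while (!locked.compareAndSet(false, true)); // CAS (volatile load + store)
    }

    void unlock() {
        locked.set(false); // volatile store
    }
}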

About monitorEnter/Exit, it's more difficult to find something that 
fits other known memory models (C++), and that's the reason for my request :) 
From the point of view of the compiler guarantees, it seems that a normal store 
(i.e. any store-release as well) preceding monitorEnter could be moved inside 
the mutual exclusion zone (past the monitorEnter), because the compiler 
isn't enforcing 
any memory barrier between them, while with the current implementation of a 
j.u.c. lock that can't happen.
That's the biggest difference I could spot between them, but I'm struggling to 
find anything (besides looking at the JVM source code) that observes/triggers 
such compiler reordering.
What do you think about this? Am I just worried about something that actually 
isn't implemented/isn't happening on any known JVM implementation?

Thanks,
Franz

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Intra-process queue between Java and C++

2018-04-04 Thread Francesco Nigro

>
> The broadcast buffer is a very unique type of structure.


So true!
It relies on a nice mechanism similar to StampedLock and is quite 
interesting indeed: I have implemented for Agrona a version of it 
<https://github.com/real-logic/agrona/pull/119> based on FastFlow-like 
logic instead, to allow "precise" loss counters, but with a finite max 
length.
Probably with actor-like services that provide a finite set of responses 
with reasonable max lengths, it could be a good match compared to the 
original :P
Obviously the same usage patterns highlighted by Todd will apply 
independently of the chosen implementation: the semantics behind a message 
loss enable a lot of interesting logic to be implemented on top, while 
simplifying most of the deadlock issues.

On Tuesday, April 3, 2018 at 17:13:31 UTC+2, Todd L. Montgomery wrote:
>
> Good points. The Agrona (and Aeron C versions) do have the proper basic 
> logic to avoid blocked queues from failure during appends/offers. It's not 
> simple, but the hard part is done for you. Just need to hook the logic into 
> a duty cycle.
>
> The broadcast buffer is a very unique type of structure. It allows for the 
> case of a slow receiver to fall behind and lose messages rather than block 
> the transmitter. And it lets the receiver know that this occurred. That 
> means that some of the interesting deadlocks can be avoided. Recovery of 
> lost messages (or just resync of state) can be done a number of ways. 
> Including just a system graceful reset.
>
> On Tue, Apr 3, 2018 at 7:36 AM, Francesco Nigro  > wrote:
>
>> There are a couple of things to be considered:
>>
>>1. size request and response buffers to avoid deadlocks (there is an 
>>interesting topic on this list re it too) -> could be used a broadcast 
>>buffer too to avoid it, but it change the semantic by adding a new 
>>"failure" state
>>2. consider liveness of processes and blocking operations while 
>>submitting/reading from the buffers -> Agrona ring buffers has the proper 
>>API to perform such tests/ops but need to implement the logic
>>
>>
>> On Friday, March 30, 2018 at 10:55:23 UTC+2, Roman Leventov wrote:
>>>
>>> I think about the possibility of building an asynchronous application 
>>> with back pressure where some upstream operators are in Java and some 
>>> downstream ones are in C++. For this purpose, some queues would be needed 
>>> to pass the data between Java and C++ layers. It seems that porting 
>>> JCTools's bounded array queues to off-heap should be doable, but I couldn't 
>>> find existing prototypes or discussions of such thing so maybe I overlook 
>>> some inherent complications with this idea.
>>>
>>> Did anybody think about something like this or has implemented in 
>>> proprietary systems?
>>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "mechanical-sympathy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to mechanical-sympathy+unsubscr...@googlegroups.com .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Intra-process queue between Java and C++

2018-04-03 Thread Francesco Nigro
There are a couple of things to be considered:

   1. size the request and response buffers to avoid deadlocks (there is an 
   interesting topic on this list about it too) -> a broadcast buffer could 
   also be used to avoid it, but it changes the semantics by adding a new 
   "failure" state
   2. consider liveness of processes and blocking operations while 
   submitting/reading from the buffers -> Agrona's ring buffers have the proper 
   API to perform such tests/ops, but you need to implement the logic


On Friday, March 30, 2018 at 10:55:23 UTC+2, Roman Leventov wrote:
>
> I think about the possibility of building an asynchronous application with 
> back pressure where some upstream operators are in Java and some downstream 
> ones are in C++. For this purpose, some queues would be needed to pass the 
> data between Java and C++ layers. It seems that porting JCTools's bounded 
> array queues to off-heap should be doable, but I couldn't find existing 
> prototypes or discussions of such thing so maybe I overlook some inherent 
> complications with this idea.
>
> Did anybody think about something like this or has implemented in 
> proprietary systems?
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to mechanical-sympathy+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Avoiding expensive memory barriers

2018-03-19 Thread Francesco Nigro
Hi guys!

Re 

>  2. prevent preceding writes from being reordered by the compiler (they 
are implicitly ordered by the processor on x86).

I have had quite an interesting chat with Jeff Preshing 
(http://preshing.com/) about relaxed guarantees, and he pointed me to this one 
to help with it:
https://www.decadent.org.uk/pipermail/cpp-threads/2008-December/001946.html

And quoting him:

>    1. As I wrote above, "each atomic var has its own modification order"...
>    once a value in that modification order is seen, older ones cannot be seen
>    again.
>    2. Note this guarantee has nothing to do with acquire/release... it's a
>    separate guarantee: each atomic var is consistent with its own
>    modification order.
>
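
For what it's worth, the release/acquire SPSC publication discussed further down in
this thread could be sketched in Java like this (my own names; capacity/wrap checks
omitted for brevity, and the slot count must be a power of two):

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

final class SpscBuffer {

    private static final VarHandle PRODUCER_IDX;
    static {
        try {
            PRODUCER_IDX = MethodHandles.lookup()
                    .findVarHandle(SpscBuffer.class, "producerIdx", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private final Object[] slots = new Object[1024];
    private long producerIdx;  // written by the producer, read by the consumer
    private long consumerIdx;  // owned by the consumer

    // producer thread
    void offer(Object e) {
        long idx = producerIdx;                       // plain read: single writer
        slots[(int) (idx & (slots.length - 1))] = e;  // plain store to the slot
        PRODUCER_IDX.setRelease(this, idx + 1);       // release: the slot store cannot pass it
    }

    // consumer thread
    Object poll() {
        long available = (long) PRODUCER_IDX.getAcquire(this); // acquire: see the slot stores
        if (consumerIdx == available) {
            return null;
        }
        Object e = slots[(int) (consumerIdx & (slots.length - 1))];
        consumerIdx++;
        return e;
    }
}
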
> Il giorno lunedì 19 marzo 2018 09:19:08 UTC+1, Avi Kivity ha scritto:
>
> The release write is a memory barrier. It's not an SFENCE or another fancy 
> instruction, but it is a memory barrier from the application writer's point 
> of view.
>
>
> The C++ code
>
>
> x.store(5, std::memory_order_relaxed)
>
> has two effects on x86:
>
>   1. generate a write to x that is a single instruction (e.g. mov $5, x)
>   2. prevent preceding writes from being reordered by the compiler (they 
> are implicitly ordered by the processor on x86).
>
>
>
> On 03/18/2018 08:16 PM, Dan Eloff wrote:
>
> You don't need memory barriers to implement an SPSC queue for x86. You can 
> do a relaxed store to the queue followed by a release write to 
> producer_idx. As long as consumer begins with an acquire load from 
> producer_idx it is guaranteed to see all stores to the queue memory before 
> producer_idx, according to the happens before ordering. There are no memory 
> barriers on x86 for acquire/release semantics.
>
> The release/acquire semantics have no meaning when used with different 
> memory locations, but if used on producer_idx when synchronizing the 
> consumer, and consumer_idx when synchronizing the producer, it should work.
>
>
>
> On Thu, Feb 15, 2018 at 8:29 AM, Avi Kivity  > wrote:
>
>> Ever see mfence (aka full memory barrier, or std::memory_order_seq_cst) 
>> taking the top row in a profile? Here's the complicated story of how we 
>> took it down:
>>
>>
>> https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/
>>
>>
>
>
>
>



Re: Call site (optimizations)

2018-02-05 Thread Francesco Nigro
Thanks Rémi!

Yes, I was expecting something like that: indeed, inlining being a 
path-specific optimization, with much, much luck and a deep enough call 
stack (but not too deep, considering how the JVM limits inlining versus 
call-stack depth) you could end up with very specific/unique optimized 
execution paths.
I don't know at which point CHA is triggered/performed, so I can't say 
whether in such cases it would be performed differently between those 
execution paths.
As you have noticed, there are probably planet alignments that are more 
likely to happen :)

@Gil
> I don't think that per-thread code ever makes much sense

You're right: maybe a JVM running only thread-pinned actors could benefit 
from it, or maybe not: as Rémi has noticed, a similar effect could be 
achieved in different ways.

Franz
 
 If you call the method from a different place, that would follow the slow 
path of performing a virtual method lookup followed by a jump to that code. 
So to see the benefits of inlining from a different path, that path would 
also have to be "warmed up".
On Wednesday, 31 January 2018 at 05:42:12 UTC-5, Rémi Forax wrote:
>
>
>
> ------
>
> *From: *"Francesco Nigro"
> *To: *"mechanical-sympathy"
> *Sent: *Tuesday, 30 January 2018 15:13:03
> *Subject: *Re: Call site (optimizations)
>
> Gil, really, thank you for the super detailed answer!
>
> I'm digesting it and I've found an interesting talk with some of these 
> concepts referenced/mentioned: https://youtu.be/oH4_unx8eJQ?t=4503
>
> Douglas Hawkins: "...the compilation is shared by all threads...all 
> threads. One thread behaves badly, all threads suffer."
> After your explanation I would say: "...almost all threads..."
> But considering that it's about deopt, it doesn't make sense to be that 
> precise!
>
> My initial fairytale idea of what a JVM should be able to optimize was 
> "probably" too imaginative, e.g. 2 threads T1, T2 calling the same static 
> method:
>
> static void foo(boolean cond) {
>     if (cond) {
>         uncommonCase();
>     }
> }
>
> T1 with cond = true while T2 with cond = false.
> My "fairytale JVM" were able to optimize each specific call of foo() for 
> the 2 threads with specific compiled (inlined if small enough) versions 
> that benefit each one of the real code paths executed (ie uncommonCase() 
> inlined/compiled just for T1).
> Obviously the fairytale optimizations weren't only specific per-thread but 
> per call-site too :P
>
>
> You can hand-write your own split-loop optimization
>
> static void foo(boolean flag) {
>   if (flag) {
>     uncommonCase();
>   }
> }
>
> and create two callsites in Runnable.run() like this
>
> Runnable r = () -> {
>   if(cond) {
> for(;;) {
>   foo(true);
> }
>   } else {
> for(;;) {
>   foo(false);
> }
>   }
> };
>
> this only works if everything from runnable.run() to foo() is inlined, so 
> never in real life :) 
>
>
> After your answer I'm starting to believe that Santa Claus doesn't exist 
> and it's not his fault :)
>
> Thanks,
> Franz
>
>
> Rémi
>
>
>
> On Saturday, 27 January 2018 at 18:45:28 UTC+1, Gil Tene wrote:
>>
>> Some partial answers inline below, which can probably be summarized with 
>> "it's complicated...".
>>
>> On Friday, January 26, 2018 at 8:33:53 AM UTC-8, Francesco Nigro wrote:
>>
>> Hi guys,
>>> lately I've been having some fun playing with JITWatch (many 
>>> thanks to Chris Newland!!!) and trying to understand a little bit more 
>>> about specific JVM optimizations, but suddenly I've found out that I was 
>>> missing one of the most basic definitions: call site.
>>>
>>> I have several questions around that:
>>>
>>>1. Is there a formal definition of call site from the point of view 
>>>of the JIT?
>>>
>>>  I don't know about "formal". But a call site is generally any location 
>> in the bytecode of one method that explicitly causes a call to another 
>> method. These include:
>>
>> classic bytecodes used for invocation:
>> - Virtual method invocation (invokevirtual and invokeinterface, both of 
>> which call a non-static method on an instance), which in Java tends to 
>> dynamically be the most common form.
>> - Static method invocations (invokestatic)
>> - Constructor/initializer invocation (invokespecial)
>> - Some other cool stuff (private instance method invocation with 
>> invokespecial, native calls, etc.)
>>
>> In a

Re: Call site (optimizations)

2018-01-30 Thread Francesco Nigro
Gil, really, thank you for the super detailed answer!

I'm digesting it and I've found an interesting talk with some of these 
concepts referenced/mentioned: https://youtu.be/oH4_unx8eJQ?t=4503

Douglas Hawkins: "...the compilation is shared by all threads...all threads. 
One thread behaves badly, all threads suffer."
After your explanation I would say: "...almost all threads..."
But considering that it's about deopt, it doesn't make sense to be that precise!

My initial fairytale idea of what a JVM should be able to optimize was 
"probably" too imaginative, e.g. 2 threads T1, T2 calling the same static 
method:

static void foo(boolean cond) {
    if (cond) {
        uncommonCase();
    }
}

T1 with cond = true while T2 with cond = false.
My "fairytale JVM" were able to optimize each specific call of foo() for 
the 2 threads with specific compiled (inlined if small enough) versions 
that benefit each one of the real code paths executed (ie uncommonCase() 
inlined/compiled just for T1).
Obviously the fairytale optimizations weren't only specific per-thread but 
per call-site too :P

After your answer I'm starting to believe that Santa Claus doesn't exist 
and it's not his fault :)

Thanks,
Franz
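
For reference, the guarded inlining / monomorphic inline cache that Gil 
describes below, written out by hand as Java (names are purely illustrative, 
and of course the real transformation happens on the JIT's IR, not on source 
code):

interface Animal { void makeSound(); }

final class Dog implements Animal {
    public void makeSound() { System.out.println("woof"); }
}

class GuardedInlineSketch {
    // the call site as written: a virtual invocation (invokeinterface)
    static void speak(Animal a) {
        a.makeSound();
    }

    // roughly what the JIT emits once profiling says the receiver is always a Dog
    static void speakCompiled(Animal a) {
        if (a.getClass() == Dog.class) {     // guard on the assumed ("monomorphic") receiver type
            System.out.println("woof");      // inlined body of Dog.makeSound()
        } else {
            a.makeSound();                   // slow path: full virtual dispatch (or deoptimize)
        }
    }
}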


On Saturday, 27 January 2018 at 18:45:28 UTC+1, Gil Tene wrote:
>
> Some partial answers inline below, which can probably be summarized with 
> "it's complicated...".
>
> On Friday, January 26, 2018 at 8:33:53 AM UTC-8, Francesco Nigro wrote:
>
> Hi guys,
>>
>> Lately I've been having some fun playing with JITWatch (many thanks 
>> to Chris Newland!!!) and trying to understand a little bit more about 
>> specific JVM optimizations, but suddenly I've found out that I was missing 
>> one of the most basic definitions: call site.
>>
>> I have several questions around that:
>>
>>1. Is there a formal definition of call site from the point of view 
>>of the JIT?
>>
>>  I don't know about "formal". But a call site is generally any location 
> in the bytecode of one method that explicitly causes a call to another 
> method. These include:
>
> classic bytecodes used for invocation:
> - Virtual method invocation (invokevirtual and invokeinterface, both of 
> which call a non-static method on an instance), which in Java tends to 
> dynamically be the most common form.
> - Static method invocations (invokestatic)
> - Constructor/initializer invocation (invokespecial)
> - Some other cool stuff (private instance method invocation with 
> invokespecial, native calls, etc.)
>
> In addition, you have these "more interesting" things that can be viewed 
> (and treated by the JIT) as call sites:
> - MethodHandle.invoke*()
> - reflection based invocation (Method.invoke(), Constructor.newInstance())
> - invokedynamic (a whole can of Pandora's worms goes here...)
>
>
>>1. I know that there are optimizations specific per call site, but 
>>there is a list of them somewhere (that is not the OpenJDK source code)? 
>>
>> The sort of optimizations that might happen at a call site can evolve 
> over time, and JVMs and JIT can keep adding newer optimizations: Some of 
> the current common call site optimizations include:
>
> - simple inlining: the target method is known  (e.g. it is a static 
> method, a constructor, or a 
> we-know-there-is-only-one-implementor-of-this-instance-method-for-this-type-and-all-of-its-subtypes),
>  
> and can be unconditionally inlined at the call site.
> - guarded inlining: the target method is assumed to be a specific method 
> (which we go ahead and inline), but a check (e.g. the exact type of this 
> animal is actually a dog) is required ahead of the inlined code because we 
> can't prove the assumption is true.
> - bi-morphic and tri-morphic variants of guarded inlining exist (where two 
> or three different targets are inlined).
> - Inline cache: A virtual invocation dispatch (which would need to follow 
> the instance's class to locate a target method) is replaced with a guarded 
> static invocation to a specific target on the assumption a specific 
> ("monomorphic") callee type. "bi-morphic" and "tri-morphic" variants of 
> inline cache exist (where one of two or three static callees are called 
> depending on a type check, rather than performing a full virtual dispatch)
> ...
>  
> But there are bigger and more subtle things that can be optimized at and 
> around a call site, which may not be directly evident from the calling code 
> itself. Even when a call site "stays", things like this can happen:
>
> - Analysis of all possible callees shows that no writes to some locations 

Call site (optimizations)

2018-01-27 Thread Francesco Nigro
My cell phone decided to cut off the third question:
 3. I know that compiled code from the JVM is available in the Code Cache to 
allow different call sites to use it: does that mean that the same compiled 
method is used at all those call sites (provided it's the best version of it)?



Call site (optimizations)

2018-01-26 Thread Francesco Nigro
Hi guys,

Lately I've been having some fun playing with JITWatch (many thanks 
to Chris Newland!!!) and trying to understand a little bit more about 
specific JVM optimizations, but suddenly I've found out that I was missing 
one of the most basic definitions: call site.

I have several questions around that:

   1. Is there a formal definition of call site from the point of view of 
   the JIT?
   2. I know that there are optimizations specific to each call site, but 
   is there a list of them somewhere (that is not the OpenJDK source code)? 
   3. I know that compiled code from the JVM is available in a Code Cache 
   to allow different call-sites to use it: that means that the same compiled 
   method is us

I hope I have not asked questions that are too naive :)

thanks
Franz



Mutex and Spectre/Meltdown

2018-01-15 Thread Francesco Nigro
Hi guys!

Have any of you already measured (or do you simply know) whether OS mutexes 
are somehow affected by Spectre/Meltdown?

Cheers,
Franz



Re: Detailed post on using SEDA-like algorithms to improve IPC

2017-07-20 Thread Francesco Nigro
Thanks! It is a very interesting post!

On Monday, 17 July 2017 at 16:44:13 UTC+2, Avi Kivity wrote:
>
> Some time ago I posted about how we ~ doubled our IPC; here is a 
> detailed blog post explaining the problem and solution: 
>
>
>
> http://www.scylladb.com/2017/07/06/scyllas-approach-improve-performance-cpu-bound-workloads/
>  
>
>



Re: Disk-based logger - write pretouch

2017-07-11 Thread Francesco Nigro


On Monday, 10 July 2017 at 20:41:23 UTC+2, Roman Leventov wrote:
>
> On 10 July 2017 at 13:21, Martin Thompson > 
> wrote:
>
>> Are you running recent MacOS with APFS which supports sparse files? 
>>
>
> If this is of any interest, my OS version is
>
> Darwin MacBook-Pro.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 
> 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64
>
> (i. e. very recent)
>
> When considering an algorithm for pre-touch ahead it is often useful to 
>> consider the rate to determine how far ahead you should be pre-touching. 
>>
>
> Logically the higher the rate, the longer the pre-touch chunk should be, but 
> when I did pretouch of the whole 32 MB buffers (instead of 1 MB chunks) 
> pauses were even worse.
>  
>
>> When pre-touching then it can be better with a positional write of a 
>> byte, i.e. use 
>> https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html#write(java.nio.ByteBuffer,%20long)
>>  
>> which should map to pwrite() on Linux. This will be a safepoint whereas the 
>> mapped buffer write will not be.
>>
>
> Thanks, will try
>

Most results depend on:
- whether the MMU/TLB is shared between the cores (if not, maybe using a 
different thread/core could make things worse) 
- as Mr T. and others pointed out... write-back and I/O (and file system) 
effects need to be repeatable 
- maybe Peter L. has already done something similar 
here: 
https://github.com/OpenHFT/Chronicle-Queue/blob/80fad7711bc5b5ebd0a78a44f7459d65139b8e32/src/main/java/net/openhft/chronicle/queue/impl/single/PretoucherState.java
- the order of pretouch could change the results (descending or not) due to 
the LRU eviction policies of the page cache
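
For reference, a minimal sketch of the pre-touch-ahead idea being discussed 
(file name, mapping size, page size and look-ahead distance are illustrative 
assumptions; this is the write-through-the-mapping variant, the alternative 
being a positional FileChannel.write per page as Martin suggests, and in a 
real logger the pre-touching would typically run on its own thread):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public final class PretouchSketch {
    private static final int PAGE_SIZE = 4096;
    private static final int AHEAD = 1 << 20;      // keep ~1 MB of pages faulted in

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("log.dat", "rw");
             FileChannel channel = file.getChannel()) {
            final long length = 1L << 25;          // 32 MB mapping
            final MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, length);

            long touched = 0;
            long writePosition = 0;
            while (writePosition < length - Long.BYTES) {
                // pre-touch one byte per page, up to AHEAD bytes beyond the writer
                final long target = Math.min(writePosition + AHEAD, length);
                for (; touched < target; touched += PAGE_SIZE) {
                    buffer.put((int) touched, (byte) 0);            // faults the page in
                }
                buffer.putLong((int) writePosition, writePosition); // the "real" write
                writePosition += Long.BYTES;
            }
        }
    }
}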



Write release post

2017-06-17 Thread Francesco Nigro
Hi folks!

I've found an interesting article by Jeff Preshing about the (re)ordering 
guarantees of a write release with respect to subsequent ops: 
http://preshing.com/20170612/can-reordering-of-release-acquire-operations-introduce-deadlock/
Indirectly it could affect our beloved JMM too :) 
Wdyt?
Do you agree with Jeff?

Regards,
Francesco



Re: Aeron zero'ing buffer?

2017-05-30 Thread Francesco Nigro
The coolest thing IMHO about the Aeron LogBuffer is how it trades 
performance on concurrent writes for a measurable loss of precision on 
backpressure: it is really a very smart and effective idea!

On Tuesday, 30 May 2017 at 13:06:32 UTC+2, Martin Thompson wrote:
>
> The claim is on a long with a check before increment.
>
> On Tuesday, 30 May 2017 11:25:50 UTC+1, Peter Veentjer wrote:
>>
>> Thanks everyone.
>>
>> Another related question. When an atomic integer is used so that writers 
>> can claim their segment in the buffer, what prevents wrapping of this atomic 
>> integer and falsely assuming you have allocated a section in the buffer?
>>
>> So imagine the buffer is full, and the atomic integer > buffer.length. 
>> Any thread that wants to write keeps increasing this atomic integer far, 
>> far beyond the maximum capacity of the buffer. Which in itself is fine 
>> because one can detect what kind of write failure one had:
>> - the first over-commit
>> - subsequent over-commits
>>
>> But what if the value wraps? In theory one could end up thinking one has 
>> claimed a segment of memory still needed for reading purposes.
>>
>> One simple way to reduce the problem is to use an atomic long instead of 
>> an atomic integer.
>>
>
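
For reference, a minimal sketch of why a 64-bit counter side-steps the 
wrapping concern (illustrative only, not Aeron's actual code): even at 
10 GB/s a long tail advances for roughly 29 years before overflowing, so the 
check only has to detect over-commit, never wrap-around.

import java.util.concurrent.atomic.AtomicLong;

final class ClaimSketch {
    private final AtomicLong tail = new AtomicLong();   // never reset; cannot realistically wrap
    private final int capacity;

    ClaimSketch(int capacity) { this.capacity = capacity; }

    /**
     * Claims length bytes; returns the claimed offset, or -1 when the claim would
     * overrun the consumer (consumerPosition is the consumer's progress counter).
     */
    long tryClaim(int length, long consumerPosition) {
        final long offset = tail.getAndAdd(length);     // publish the claim
        if (offset + length - consumerPosition > capacity) {
            // over-commit: the first failed writer (and every subsequent one) can tell
            // it failed, and because tail is a long the failed increments never wrap it
            return -1;
        }
        return offset;
    }
}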



Re: Aeron zero'ing buffer?

2017-05-30 Thread Francesco Nigro
The Intel manual is indeed pretty terse in explaining it, but anyway a 
"simple" check with the proper tools could be a good experiment to 
validate it: https://joemario.github.io/blog/2016/09/01/c2c-blog/

I asked an Intel engineer about it some time ago (on 14th Feb) and 
he answered me this:

Hi Francesco, 

 

About your questions on prefetchers:
>
>- Prefetchers normally kick in only after multiple cache lines in a 
>specific pattern have been accessed. So I wouldn't worry too much about a 
>single cache line.
>
>
>- Prefetchers tend to only read lines, so they by themselves cannot cause 
>additional classic false sharing (but may cause additional aborts on TSX).
>
>
>- The same is true for speculative execution.  You have more to fight 
>than just prefetching; speculative execution tends to pull in lots of data 
>early.   You can assume the CPU runs 150+ instructions ahead 
>speculatively, if not more.
>
>
>- There shouldn't be an automatic "get the next line" as much as there 
>are pattern recognizers, and if there's a sequential pattern, the next 
>lines will be prefetched. It's not unconditional.
>
> You can always test by enabling/disabling the prefetchers:
>   wrmsr -a 0x1a4 0xf// to disable
>   wrmsr -a 0x1a4 0x0   // to enable
> See  
> 
> https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
>  for 
> more info.
> The wrmsr tool is available at: https://01.org/msr-tools/overview



On Monday, 29 May 2017 at 18:06:02 UTC+2, Benedict Elliott Smith wrote:
>
> It's approximately where you'd expect, in the Intel 64 and IA32 
> Architecture Optimization Reference Manual, under "Data Prefetching" on 
> page 2-29, and referred to as the "Spatial prefetcher"
>
> It is pretty easy to miss, given it's only afforded a single sentence.
>
> It's possible to disable it on a per-core basis:
>
>
> https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
>
>
> On Mon, 29 May 2017 at 16:54, Martin Thompson  > wrote:
>
>> Switching topics slightly, prefetch extending the effective cache line 
>>> size was causing us some consternation, since we were never able to find 
>>> where it was documented. Do you have a reference to it? When did it start 
>>> happening?
>>>
>>>
>>> It seems like it invalidates all software that was carefully written to 
>>> honor 64 byte cache lines.
>>>
>>>
>>> IIRC Pentium 4 had 128 byte "sectors", but it was never fully explained 
>>> what these were, and the word died with the P4.
>>>
>>
>> I've seen adjacent cacheline prefetching on Intel processors since the 
>> NetBurst days (well over a decade). Until Sandy Bridge it was generally 
>> recommended to disable them because memory bandwidth often became an issue. 
>> These days it works on the L2 cache sitting along side a prefetcher that 
>> looks for patterns of cache line accesses. L1 has different prefetchers. It 
>> does have quite a noticeable effect on false sharing but not as much as 
>> when within the same 64 byte cache line.
>>
>>
>



Re: Transaction Logs

2017-02-10 Thread Francesco Nigro
Pretty late considering the age of this post, and a little Java-ish, but worth 
knowing:

https://groups.google.com/forum/m/#!topic/pmem/2LJFFpoc8gA

And

https://github.com/pmem/pcj

Cheers,
Francesco



Re: Operation Reordering

2017-01-16 Thread Francesco Nigro
Thanks,

"Assume nothing" is a pretty scientific approach,I like it :)
But this (absence of) assumption lead me to think about another couple of 
things: what about all the encoders/decoders or any program that rely on data 
access patterns to pretend to be and remain "fast"?
Writing mechanichal sympathetic code means being smart enough to trick the 
compiler in new and more creative ways or check what a compiler doesn't handle 
so well,trying to manage it by yourself?
The last one is a very strong statement just for the sake of discussion..But 
I'm curious to know what the guys of this group think about it :)



Re: Operation Reordering

2017-01-16 Thread Francesco Nigro
This is indeed what I was expecting... while other archs (PowerPC, tons of
ARMs and the legendary DEC Alpha) are allowed to be pretty creative in
matters of reordering... And that's the core of my question: how much can a
developer rely on the compiler (or the underlying HW) respecting the memory
accesses he has put into the code, without using any fences? Is the answer
really "it depends on the compiler/architecture", or do common high-level
patterns exist that are respected by "most" compilers/architectures?
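
For reference, the one reordering Vitaly mentions below (a store followed by 
a load of a different location) is the classic store-buffering litmus test; a 
hand-rolled sketch is shown here, though jcstress is the proper tool for 
actually running this kind of experiment:

public final class StoreBufferingLitmus {
    static int x, y;

    public static void main(String[] args) throws InterruptedException {
        for (long run = 0; run < 1_000_000; run++) {
            x = 0;
            y = 0;
            final int[] r = new int[2];
            Thread t1 = new Thread(() -> { x = 1; r[0] = y; });   // store x, then load y
            Thread t2 = new Thread(() -> { y = 1; r[1] = x; });   // store y, then load x
            t1.start(); t2.start();
            t1.join(); t2.join();
            if (r[0] == 0 && r[1] == 0) {   // impossible under sequential consistency
                System.out.println("store->load reordering observed at run " + run);
                return;
            }
        }
        System.out.println("not observed (this naive harness rarely overlaps the threads)");
    }
}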

On Mon, 16 Jan 2017 at 22:14, Vitaly Davidovich wrote:

> Depends on which hardware.  For instance, x86/64 is very specific about
> what memory operations can be reordered (for cacheable operations), and two
> stores aren't reordered.  The only reordering is stores followed by loads,
> where the load can appear to reorder with the preceding store.
>
> On Mon, Jan 16, 2017 at 4:02 PM Dave Cheney  wrote:
>
> Doesn't hardware already reorder memory writes along 64 byte boundaries?
> They're called cache lines.
>
>
> Dave
>
>
>
> On Tue, 17 Jan 2017, 05:35 Tavian Barnes  wrote:
>
> On Monday, 16 January 2017 12:38:01 UTC-5, Francesco Nigro wrote:
>
> I'm missing something for sure, because if it was true, any
> (single-threaded) "protocol" that rely on the order of writes/loads against
> (not mapped) ByteBuffers to be fast (ie: sequential writes rocks :P) risks
> to not see the order respected if not using patterns that force the
> compiler to block the re-ordering of such instructions (Sci-Fi hypothesis).
>
>
> I don't think you're missing anything.  The JVM would be stupid to reorder
> your sequential writes into random writes, but it's perfectly within its
> right to do so for a single-threaded program according to the JMM, as long
> as it respects data dependencies (AFAIK).  Of course, that would be a huge
> quality of implementation issue, but that's an entirely separate class from
> correctness issues.
>
>
> with great regards,
> Francesco
>
>
> --
> Sent from my phone
>



Re: Is Parallel Programming Hard, And, If So, What Can You Do About It?

2017-01-16 Thread Francesco Nigro
Thanks!! Cool!

On Friday, 13 January 2017 at 12:23:15 UTC+1, Ivan Valeriani wrote:
>
> Hi all,
> just wanted to post this book here as it seems rather complete and well 
> written.
>
> https://arxiv.org/pdf/1701.00854.pdf
>



Operation Reordering

2017-01-16 Thread Francesco Nigro
Hi guys!

maybe what I'm asking could sound like a very naive question, but recently 
I've found it very problematic to understand whether a "rule of thumb" exists 
for finding out how the JVM (or replace it with anything SW/HW-wise that could 
mess with the order of the statements of your program) manages reordering of 
operations in the case of single-threaded execution. 

As a general rule I've always assumed that the order *would appear* to be the 
program order, but this implies that writing over a (regular direct 
or array-based) ByteBuffer in a particular order is different from doing 
it against a MappedByteBuffer.
That rule has led me, for example, to use explicit fences (for single-threaded 
programs too) as a means to put "lines in the sand" and be sure 
that the compiler will access data in the order I'm expecting it to.
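
For reference, a minimal sketch of that "line in the sand" with the Java 9 
fence API (illustrative; it only constrains how the compiler may reorder the 
accesses, it doesn't tell the JIT anything about which access pattern is 
fast, and on x86 no instruction is emitted for it):

import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;

final class LineInTheSand {
    static void writeRecord(ByteBuffer buffer, long header, long body) {
        buffer.putLong(0, header);     // group 1
        VarHandle.releaseFence();      // group 1 stores may not be moved below this point
        buffer.putLong(8, body);       // group 2
    }
}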

This perception was reinforced after reading (if I've understood it) the 
article from Cliff Click about the "sea of nodes", in which *every* operation 
on memory-mapped I/O produces a new I/O state, and that prevents any 
reordering between the 2 kinds of operations (LOAD and STORE), while the 
others are treated respecting only data dependencies (in theory, there is no 
limit on reordering).
I'm missing something for sure, because if it were true, any 
(single-threaded) "protocol" that relies on the order of writes/loads against 
(non-mapped) ByteBuffers to be fast (i.e.: sequential writes rock :P) risks 
not seeing that order respected, unless it uses patterns that force the 
compiler to block the re-ordering of such instructions (Sci-Fi hypothesis).
A. Shipilev (on Twitter) has pointed me to look at gcm.cpp to better 
understand what reorderings are allowed to happen... but I'm not 
confident that I've understood it properly.
I'm not even sure that building an empirical experiment and reading the 
produced ASM could prove anything useful, other than what could happen in a 
very specific case (and under a very specific strong memory model, the one of 
my laptop's x86 CPU).
What do you think?

with great regards,
Francesco



Epoch Based Reclamation

2017-01-09 Thread Francesco Nigro
Hi guys!

For whoever is interested in lock/wait-free data structures that require some 
form of reclamation/garbage collection (other than an out-of-the-box 
collector provided by a JVM), this thesis is a good 
starting point to learn about epoch-based reclamation algorithms.
Just as a recap, quiescent-state-based reclamation is 
another technique, pretty well suited to be used in 
actor-based/event-loop-based architectures.
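
For reference, a minimal sketch of the quiescent-state idea for a fixed set 
of event-loop threads (purely illustrative; the references above cover the 
real algorithms and their trade-offs):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLongArray;

final class QsbrSketch {
    private record Retired(long[] epochSnapshot, Runnable freeAction) {}

    private final AtomicLongArray quiescentCounts;   // one counter per registered thread
    private final ConcurrentLinkedQueue<Retired> retired = new ConcurrentLinkedQueue<>();

    QsbrSketch(int threads) { this.quiescentCounts = new AtomicLongArray(threads); }

    /** Called by thread id at its quiescent points, e.g. at the top of its event loop. */
    void onQuiescentState(int id) {
        quiescentCounts.incrementAndGet(id);
    }

    /** Retire an object: remember which counters must advance before it can be freed. */
    void retire(Runnable freeAction) {
        final long[] snapshot = new long[quiescentCounts.length()];
        for (int i = 0; i < snapshot.length; i++) {
            snapshot[i] = quiescentCounts.get(i);
        }
        retired.add(new Retired(snapshot, freeAction));
    }

    /** Free everything that every thread has quiesced past since it was retired. */
    void reclaim() {
        final List<Retired> stillPending = new ArrayList<>();
        Retired r;
        while ((r = retired.poll()) != null) {
            if (allThreadsPassed(r.epochSnapshot())) {
                r.freeAction().run();
            } else {
                stillPending.add(r);
            }
        }
        retired.addAll(stillPending);
    }

    private boolean allThreadsPassed(long[] snapshot) {
        for (int i = 0; i < snapshot.length; i++) {
            if (quiescentCounts.get(i) <= snapshot[i]) {
                return false;    // thread i has not gone quiescent since the retire
            }
        }
        return true;
    }
}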



Re: Minimum realistic GC time for G1 collector on 10GB

2016-12-12 Thread Francesco Nigro
Hi Ivan!
 

> What are good java shared mem IPC 
> queues?
>

Agrona  (many-to-one, one-to-one and 
broadcast, variable-sized) and JCTools  
(many-to-one, one-to-one, fixed-sized) have off-heap implementations well 
suited to be used for IPC, but consider that, depending on how you've already 
implemented your system, you'll need to handle "new" failure cases, such as 
dead publishers/receivers etc. 
IMHO, designing for failure is worth it anyway...

About the GC pauses, consider the hints in this article 
and an old answer 
from Mr. G. (Mr T. is Martin T.!) about generational collectors...
AFAIK, and I'm not a GC expert at all, a "hidden" factor that plays a role 
in GC pauses is the card marking phase; citing an article on G1 
:

> If you allocate more objects during concurrent marking than you end up 
> collecting, you will eventually exhaust your heap. During the concurrent 
> marking cycle, you will see young collections continue as it is not a 
> stop-the-world event.

 
 Hence, considering what could slow down the marking time (or its 
cleaning), it's worth checking (as others have suggested):

   - allocation rate
   - the most frequently collected data structures during minor 
   GCs (e.g.: linked lists?)
   - false sharing issues during card marking (I'm not sure about this one)

Anyway I've found this tool 
 from Mr. G 
that could help...
Regarding any measurement tool you're considering using, please read this post 
 to 
choose it properly...

Regards,
Francesco
 



On Monday, 12 December 2016 at 15:30:21 UTC+1, Ivan Kelly wrote:
>
> Thanks all for the suggestions. I'll certainly check the safepoint 
> stuff. I suspect that a lot of EPA isn't happening where it could, also 
> due to stuff being pushed through disruptor. Even if it is happening, 
> or I can make it happen, and also clear up all the low hanging fruit, 
> if I can't get below 100ms pauses on the 10G heap, then it would all 
> be for nothing. 
>
> Censum is interesting, I'll take a look. I have flight recordings 
> right now. Is it possible to find the time to safepoint in that? 
>
> > All that said, the copy cost and reference processing of G1 for a heap 
> as 
> > busy / sized as yours is unlikely to get under 10ms, as Chris mentioned 
> Gil 
> > will now be summoned ;-). 
> Yes, really what I'm looking for is someone to tell me "no, you're 
> nuts, not possible" so I can justify going down the rearchitecture 
> route. Unfortunately Zing isn't an option, since it would double the 
> cost of our product. 
>
> As I said in the original email though, the low latency part of the 
> application can probably fit in 100MB or less. The application takes 
> netlink notification, does some processing and caches the result for 
> invalidation later. The netlink notification and processing is small 
> but needs to be fast. The cache can pause, as long as entries for it 
> can be queued. It's a prime candidate for being moved to another 
> process. 
>
> Which brings me to another question? What are good java shared mem IPC 
> queues? Something like Chronicle Queue, but without the persistence. 
> I'd prefer not to roll my own. The road to hell has enough paving 
> stones. 
>
> -Ivan 
>



Re: What do you guys think of Oracle management of Java?

2016-11-13 Thread Francesco Nigro
Hi Kirk! 

Can you please add me to that group? :)



The Hardware/Software Interface

2016-09-27 Thread Francesco Nigro
Hi guys,

I've collected all the material of the "Hardware/Software Interface" course 
from my Android device before the Coursera app tried to remove it:

https://drive.google.com/open?id=0B4kyNOYZku8xZDVQRlRfS1dsQTA

There is no online material AFAIK because Coursera courses follow the 
policy that if the material's author (MIT) removes it then Coursera must 
remove it too.
Enjoy it if you've missed the course :)


P.S. It is based on the book http://csapp.cs.cmu.edu/



Re: Implementation of unbounded lock-free queue in C++

2016-09-04 Thread Francesco Nigro
FastFlow has a very good implementation of an unbounded queue, and doing perf 
investigations is a very good idea for educational purposes too.
Anyway, it seems that overriding allocators (to regain the lost cache locality) 
and node caching are good ways to improve performance, but measuring is better 
than guessing, always...



Re: QSBR

2016-07-28 Thread Francesco Nigro
I need to learn C/C++ to add that fun :(
It is hard to learn it without using it in everyday work life, I 
suppose...

On Wednesday, 27 July 2016 at 18:31:05 UTC+2, Todd L. Montgomery wrote:
>
> These techniques are fun to experiment with also. Especially in C/C++.
>
> On Wed, Jul 27, 2016 at 8:55 AM, Francesco Nigro  > wrote:
>
>> Thanks Todd,
>>
>> I think all these techniques could find the right place in a Java dev 
>> concurrency toolset, considering that on-heap instances aren't the only 
>> living beings of an application.
>> And it's pretty funny too:
>> "Schrödinger has numerous short-lived animals such as mice, resulting in 
>> high update rates. In addition, there is a surprising level of interest in 
>> the health of Schrödinger's cat"
>>
>>
>
>



Re: QSBR

2016-07-27 Thread Francesco Nigro
Thanks Todd,

I think all these techniques could find the right place in a Java dev 
concurrency toolset, considering that on-heap instances aren't the only living 
beings of an application.
And it's pretty funny too:
"Schrödinger has numerous short-lived animals such as mice, resulting in high 
update rates. In addition, there is a surprising level of interest in the 
health of Schrödinger's cat"



QSBR

2016-07-27 Thread Francesco Nigro
I really like the blog of this guy:

http://preshing.com/20160726/using-quiescent-states-to-reclaim-memory/

