Re: Interrupts and hyperthreading.

2022-07-03 Thread Wojciech Kudla
e that an engineer has configured his/her > system for low noise/jitter/latency, this serialization at kernel mode > switch time among sibling threads should not occur, correct? > > On Sun, Jul 3, 2022, 1:57 AM Wojciech Kudla > wrote: > >> @Peter >> >> > Doe

Re: Interrupts and hyperthreading.

2022-07-03 Thread Wojciech Kudla
Thanks Avi, that's interesting. I was under the impression that the Intel doc was referring to x86 specifically. I'm a bit confused about the discrepancy between how that document describes the mitigation and your comment. Could you point to a resource or kernel source file that sheds more lig

Re: Interrupts and hyperthreading.

2022-07-02 Thread Wojciech Kudla
@Peter > Does the CPU only need to serialize the transition, or does it need to serialize the interrupt/systemcall while it is in ring 0? Sadly yes, the kernel does need to temporarily "idle" the sibling thread while the other is in ring 0. This is described as transitioning from state 6a/6b to s

Re: Interrupts and hyperthreading.

2022-07-01 Thread Wojciech Kudla
> If you are using hyper-threading and there is an interrupt or a system call on one logical core, then the hyper-sibling will stall as well because both need access to the kernel (mode switch). This is kind of correct. That is, for cases where the kernel implements microarchitectural data sampl

MMU gang wars: the TLB drive-by shootdown

2020-05-15 Thread Wojciech Kudla
Hi, A quick shameless plug. Just posted an article on TLB shootdowns and their impact on latency-sensitive applications. It's a result of a fragile balance between the subject's complexity and my own knowledge gaps in the matter. Hope it's useful

Re: Where has PrintGCApplicationStoppedTime gone?

2020-05-14 Thread Wojciech Kudla
i > > Try -Xlog:safepoint > > Regards > > On Thursday, May 14, 2020, Wojciech Kudla > wrote: > >> Hi, >> >> While evaluating different post-java 8 JVMs I noticed >> that PrintGCApplicationStoppedTime is an unsupported option now. >> According to Chris Newl

Where has PrintGCApplicationStoppedTime gone?

2020-05-14 Thread Wojciech Kudla
Hi, While evaluating different post-java 8 JVMs I noticed that PrintGCApplicationStoppedTime is an unsupported option now. According to Chris Newland's awesome list of JVM options it has been removed in Java 9 never to return. Was tr

Re: does call site polymorphism factor in method overrides?

2019-12-30 Thread Wojciech Kudla
Hi Brian, I think I can safely assume your question was dictated by (perfectly valid) concerns about method dispatch cost in extremely latency-sensitive sections of code. After all, we used to work together on the same problem space in the same institution only weeks ago. Vitaly provided a valu

Re: Re: JMeter and HdrHistogram Integration

2019-12-21 Thread Wojciech Kudla
> However, the messaging platform deals with transferring pub/sub messages that can vary from 1KB to 1MB in size – that is decidedly **not** in the microsecond realm, even if it were on an Infiniband network communicating via ibverbs I'm afraid I have to disagree with this statement. Contemporary

Evolution of Linux kernel performance

2019-11-06 Thread Wojciech Kudla
There's folks here who would probably appreciate this paper. https://dl.acm.org/authorize.cfm?key=N695040 Most topics are well known within the low latency crowd but it's very digestible and informative. At least I learned a thing or two.

Re: JVMs and the new silicon

2019-11-05 Thread Wojciech Kudla
Although I generally agree with the sentiment of Java's poor track record when it comes to early support for stuff like SIMD or GPGPU, GraalVM looks very promising. Firstly, there are a lot of relatively recent developments happening in the area of optimisations, e.g.: https://github.com/oracle/graal/

Re: Supermicro SYS-1029UX-LL1-S16 system clock running too fast

2019-09-03 Thread Wojciech Kudla
I have a few issues with the post you're referring to. It seems to completely ignore the existence of constant_tsc and invariant_tsc. The problems it talks about do not exist on modern day platforms; it's just an outdated post. http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/ http

Re: Supermicro SYS-1029UX-LL1-S16 system clock running too fast

2019-09-03 Thread Wojciech Kudla
There's a good reason why tsc is the default clock source in Linux. It's more precise and cheaper to probe. Given that most modern platforms offer nonstop_tsc, I'd give it another shot and run with: hpet=disable clocksource=tsc in your kernel stanza. Adding tsc=reliable disables clock sync hence th
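Before changing the kernel command line it can help to check what the machine currently reports. The following is a minimal sketch, assuming a Linux host with the usual sysfs clocksource directory and an x86-style `flags` line in /proc/cpuinfo; the function names are made up for illustration and are not part of any tool mentioned in the thread.

```python
from pathlib import Path

# Usual sysfs location of the clocksource interface on Linux (assumed present).
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")
TSC_FLAGS = ("constant_tsc", "nonstop_tsc")


def tsc_flags(cpuinfo_text):
    """Return which TSC-related flags appear in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return [flag for flag in TSC_FLAGS if flag in present]
    return []


def read_clocksources(base=CLOCKSOURCE_DIR):
    """Return (current, available) clock sources, or (None, []) if unreadable."""
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksources()
    print("current clocksource:", current)
    print("available:", ", ".join(available))
    try:
        print("tsc flags:", tsc_flags(Path("/proc/cpuinfo").read_text()))
    except OSError:
        print("tsc flags: /proc/cpuinfo not readable")
```

If `tsc` is missing from the available list, or the kernel has fallen back to another source, dmesg is the place to look for the reason.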

Re: Probing the CPU for metrics / info

2019-03-22 Thread Wojciech Kudla
Hi Peter, This sounds very interesting. Could you please expand on the case of extra overhead/risk of crashing the host by reading smaps? On Fri, 22 Mar 2019, 05:38 Peter Booth, wrote: > A couple of comments: > > 1. Brendan Gregg's homepage, and his last book are worth reading > http://www.br

Re: RSS and CPU selection

2019-03-15 Thread Wojciech Kudla
me IRQ will be processed by a random > CPU' > > > > On Friday, March 15, 2019 at 5:08:53 PM UTC+2, Wojciech Kudla wrote: >> >> The rx-queue to CPU affinity you're referring to should remain fairly >> static if the packets are getting processed fast enough. &

Re: RSS and CPU selection

2019-03-15 Thread Wojciech Kudla
The rx-queue to CPU affinity you're referring to should remain fairly static if the packets are getting processed fast enough. If you are interested in controlling this behavior you can manipulate receive flow hash indirection tables (ethtool -x), irq affinities (/proc/irq/$irq/smp_affinit
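IRQ affinity files such as /proc/irq/<irq>/smp_affinity hold a hexadecimal CPU bitmask, printed by the kernel in comma-separated 32-bit groups. A small sketch for converting between that mask and a list of CPU indices follows; the helper names are hypothetical and the code only does the arithmetic, it does not touch /proc.

```python
def mask_to_cpus(mask):
    """Convert an smp_affinity-style hex mask (e.g. '00000000,0000000f') to a CPU list."""
    value = int(mask.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def cpus_to_mask(cpus):
    """Convert CPU indices to a hex mask in comma-separated 32-bit groups."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    digits = f"{value:x}"
    # Pad to a multiple of 8 hex digits, then split into 32-bit groups.
    width = max(8, -(-len(digits) // 8) * 8)
    digits = digits.zfill(width)
    return ",".join(digits[i:i + 8] for i in range(0, width, 8))


if __name__ == "__main__":
    print(mask_to_cpus("0000000f"))
    print(cpus_to_mask([2, 3]))
```

Bit 0 is CPU 0, so the rightmost hex digit covers CPUs 0 to 3.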

Re: Probing the CPU for metrics / info

2018-11-29 Thread Wojciech Kudla
If you're interested in capturing performance metrics per CPU (with zero overhead) I suggest looking at /proc/interrupts and /proc/softirqs. For more detailed data you'd need to use the PMU to capture on- and off-core events but that's either 1) non-zero overhead or 2) inaccurate for more than a
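As a rough illustration of the /proc/softirqs approach, the sketch below parses two snapshots and reports the per-softirq delta for one CPU. It assumes the usual layout (a header row of CPU names followed by `NAME: count count ...` rows); the function names are invented for this example, and /proc/interrupts is not handled because its rows carry trailing descriptions.

```python
import time
from pathlib import Path


def parse_softirqs(text):
    """Parse /proc/softirqs-style text into (cpu_names, {softirq: [count per CPU]})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return [], {}
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        name, _, rest = ln.partition(":")
        table[name.strip()] = [int(v) for v in rest.split()][:len(cpus)]
    return cpus, table


def cpu_delta(before, after, cpu_index):
    """Per-softirq count difference between two snapshots for one CPU column."""
    return {name: after[name][cpu_index] - before[name][cpu_index]
            for name in after if name in before}


if __name__ == "__main__":
    path = Path("/proc/softirqs")
    if path.exists():
        _, first = parse_softirqs(path.read_text())
        time.sleep(1)
        _, second = parse_softirqs(path.read_text())
        print(cpu_delta(first, second, 0))  # activity on the first listed CPU
```

On a well-isolated core the deltas should stay at or near zero between samples.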

Re: Concurrent retrieval of statistics

2018-10-16 Thread Wojciech Kudla
I can only speak from my own experience. For data generated by latency critical threads you will probably want to have a simple SPSC buffer per thread meaning no competition between producers so fewer cycles lost on cache coherence. The consumer could just iterate over all known buffers and drain t
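The shape of that design can be sketched as follows. This is only an illustration of the structure (one buffer per producer thread, one consumer draining all of them), not a low-latency implementation: it leans on CPython's `collections.deque`, whose `append` and `popleft` are thread-safe, where a real system would more likely use a preallocated lock-free ring buffer. The `StatsHub` class and its method names are hypothetical.

```python
import threading
from collections import deque


class StatsHub:
    """Registry of per-thread buffers, each with one producer and one consumer."""

    def __init__(self):
        self._buffers = {}
        self._lock = threading.Lock()  # taken on registration and drain, never on append

    def register(self, name):
        """Create and return the buffer owned by a single producer thread."""
        buf = deque()
        with self._lock:
            self._buffers[name] = buf
        return buf

    def drain(self):
        """Consumer side: empty every known buffer and return (name, sample) pairs."""
        with self._lock:
            buffers = list(self._buffers.items())
        samples = []
        for name, buf in buffers:
            while True:
                try:
                    samples.append((name, buf.popleft()))
                except IndexError:  # this buffer is empty for now
                    break
        return samples


if __name__ == "__main__":
    hub = StatsHub()

    def worker(name):
        buf = hub.register(name)
        for latency in range(5):
            buf.append(latency)  # hot path: no lock shared with other producers

    threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(hub.drain()))
```

Producers never touch each other's buffers, so the only shared state on the hot path is between one producer and the consumer.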

Re: Throughput test of OpenHFT networking

2018-05-12 Thread Wojciech Kudla
It's probably not the response that you were hoping to see but I'd avoid testing for performance using the loopback interface. There are whole parts of the network stack omitted by the Linux kernel in such scenarios. Not to mention that OpenHFT may employ socket mechanics different from what's availab

Re: Supermicro SYS-1029UX-LL1-S16 system clock running too fast

2018-05-11 Thread Wojciech Kudla
This is interesting. I know this is not what you need but for the sake of experiment, what happens if you switch the clock source from tsc to hpet? What are all the clock sources available on that system? Can you please post the relevant entries from dmesg? On Fri, 11 May 2018, 10:02 Himanshu Sha

Re: Exclusive core for a process, is it reasonable?

2018-04-09 Thread Wojciech Kudla
re's more to gain from shaving off latency on network >> paths than there is from affinitizing work to cores/dies. But that's >> digressing from the OP. >> > > @Wojciech Kudla, > > That's digressing but it is very interesting. Can you refer me somewhere > w

Re: Exclusive core for a process, is it reasonable?

2018-04-09 Thread Wojciech Kudla
Some of the stuff I had a chance to work on managed to handle market data in single digit micros and trading in low tens. That's Java/C++. With modern day hardware it would be extremely hard (and costly) to push it much further. I can easily imagine how going for ASIC and staying under 1 microsecon

Re: Manual memory management for concurrent code

2018-01-30 Thread Wojciech Kudla
There's very interesting progress in that space happening lately. Some of that is being applied to the Linux kernel as a new RCU implementation. Looks very promising. It's based on fast consensus using bounded staleness. Have a look here: https://lwn.net/Articles/745116/ And the paper: http://ipads.

Re: High run-queue debugging

2018-01-29 Thread Wojciech Kudla
I'd just run stap or ftrace to capture sched_switch events to see what's causing the scheduling pressure. Are there any anomalies in voluntary and involuntary context switching during the run queue length spikes? Also, depending on your load, running several spinning FIX engine threads may cause res

Re: allocation memory in Java

2017-11-20 Thread Wojciech Kudla
@John Yes, additionally in some scenarios you will also need to have relevant shared objects compiled with -fno-omit-frame-pointer gcc option. 2017-11-20 22:38 GMT+00:00 John Hening : > provided that the frame pointers are preserved > > @Wojciech, do you mean? > -XX:+PreserveFramePointer > > -- >

Re: allocation memory in Java

2017-11-20 Thread Wojciech Kudla
I don't think I had any issues with symbol resolution and stack unwinding with the default fastdebug build (provided that the frame pointers are preserved). Can anyone shed some light on what the benefits of --with-native-debug-symbols=internal are? On Mon, 20 Nov 2017, 21:56 John Hening, wrote:

Re: JVM random performance

2017-08-01 Thread Wojciech Kudla
It definitely makes sense to have a look at gc activity, but I would suggest looking at safepoints from a broader perspective. Just use -XX:+PrintGCApplicationStoppedTime to see what's going on. If it's safepoints, you could get more details with safepoint statistics. Also, benchmark runs in java

Re: Isolating low latency application on CPU-0?

2017-05-26 Thread Wojciech Kudla
Yes, that's why blacklisting workqueues from critical cpus should be on the jitter elimination check list. They can be affinitized just like irqs On Fri, 26 May 2017, 11:10 Sarunas Vancevicius, wrote: > > > On Monday, 22 May 2017 16:03:53 UTC+3, Wojciech Kudla wrote: >> >

Re: Isolating low latency application on CPU-0?

2017-05-24 Thread Wojciech Kudla
> Since Sandy Bridge at least, each CPU has its own PCIe interface. Presumably, if you're doing user-space kernel bypass IO you want your workload on the same CPU that your IO devices are connected to. I think you meant the whole socket here. Yes, this is one of the reasons why many shops move awa

Re: Isolating low latency application on CPU-0?

2017-05-22 Thread Wojciech Kudla
nce a little bit. And I am afraid we can do nothing to get > rid of them. > > Himanshu Sharma > > > On Mon, May 22, 2017 at 2:08 PM, Wojciech Kudla > wrote: > >> There's a number of kernel tasks that are implicitly bound to cpu0. For >> an example of one have

Re: Isolating low latency application on CPU-0?

2017-05-22 Thread Wojciech Kudla
There's a number of kernel tasks that are implicitly bound to cpu0. For an example, have a look at RCU offloading and its restrictions. On Mon, 22 May 2017, 08:59 Himanshu Sharma, wrote: > Hi Michael > > Did you find a satisfactory reason for not isolating cpu 0, maybe some low > level OS

Re: Why would SocketChannel be slower when sending a single msg instead of 1k msgs after proper warmup?

2017-04-13 Thread Wojciech Kudla
I'd also monitor /proc/interrupts and /proc/softirqs for your target cpu On Thu, 13 Apr 2017, 08:36 Gil Tene, wrote: > If I read this right. You are running this on localhost (according to SO > code). If that's the case, there is no actual network, and no actual TCP > stack... UDP or TCP won't m

Re: Linux futex_wait() bug... [Yes. You read that right. UPDATE to LATEST PATCHES NOW].

2017-02-15 Thread Wojciech Kudla
Just trying to eliminate the obvious. You should be stracing JVM threads by referring to their tids rather than the parent process pid. That guy will pretty much always show as blocked on a futex. On Wed, 15 Feb 2017, 15:45 Gil Tene, wrote: > Don't know if this is the same bug. RHEL 7 kernel include
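On Linux the thread ids of a process are listed under /proc/<pid>/task, so finding the tids to attach to can be sketched like this. The helper names and the `proc_root` parameter are illustrative assumptions; the parameter exists only so the functions can be pointed at a different directory.

```python
import os
import sys


def thread_ids(pid, proc_root="/proc"):
    """Return the sorted thread ids (tids) listed under <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(name) for name in os.listdir(task_dir) if name.isdigit())


def thread_name(pid, tid, proc_root="/proc"):
    """Return the thread's comm name, or '?' if it cannot be read."""
    path = os.path.join(proc_root, str(pid), "task", str(tid), "comm")
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return "?"


if __name__ == "__main__":
    # Use the pid given on the command line, else inspect this process.
    has_pid = len(sys.argv) > 1 and sys.argv[1].isdigit()
    pid = int(sys.argv[1]) if has_pid else os.getpid()
    try:
        for tid in thread_ids(pid):
            print(tid, thread_name(pid, tid))
    except OSError as exc:
        print("cannot list threads:", exc)
```

Each printed tid can then be traced on its own instead of the parent pid.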

Re: Java Memory Allocation FlameGraph

2016-12-13 Thread Wojciech Kudla
I hope to get corrected if I'm wrong here, but you can't trace allocation on the JVM heap like this (by looking at page faults) because it has nothing to do with malloc()/calloc() (essentially sbrk() or mmap()). It's just bump-the-pointer which doesn't necessarily mean it comes with a page fault. Th

Re: Minimum realistic GC time for G1 collector on 10GB

2016-12-12 Thread Wojciech Kudla
It might also make sense to check the time to safepoint, paging (esp. during the collection) and also whether there are any NUMA-related stalls involved. On Mon, 12 Dec 2016, 12:58 Chris Newland, wrote: > Hi Ivan, > > Without commenting on whether a 10ms pause time is achievable with HotSpot > I'd say

Re: detecting "broken" TCP connections

2016-11-29 Thread Wojciech Kudla
Any chance that socket connection is handled by some sort of kernel bypass? All bets with blocking IO are off when running with onload/offload drivers. On Tue, 29 Nov 2016, 09:29 Alen Vrečko, wrote: > Got a situation where thread hanged on socket read (old school socket > bio code). One side was

Re: Calling into c from python, go, java and others

2016-09-25 Thread Wojciech Kudla
That's an interesting topic, especially with Java as it lacks some valuable features most of us will need at some point. Whenever I needed to benchmark JNI call overhead, depending mostly on the kernel and the cpu, I got times from single digit to low tens of nanoseconds. However, my experience is

Re: Fwd: AIX/Linux Admin (Production Support)

2016-09-21 Thread Wojciech Kudla
Hi, please remove me from your mailing list On Sep 21, 2016 10:20 PM, "rani rautela" wrote: > > > Hi > > Hope you are doing fine. . !!! > > > *JOB DESCRIPTION* > > *ROLE : *AIX/Linux Admin (Production Support) > > *LOCATION : *Scottsdale, AZ / Woonsocket, RI > > *CLIENT: *TCS > > *Responsibiliti