I got a confirmation from the architect of the GC in the CLR team.
Threads that currently happen to be inside ThreadNative::SpinWait cannot be 
preempted. So to summarize: if a managed thread has managed to take the 
SpinLock and exactly at that moment the GC tries to suspend all threads, 
every thread that is currently inside ThreadNative::SpinWait cannot be 
suspended and will have to burn through its entire budget of 100000 spin 
iterations. During that time we'll see a CPU spike and a process hang (at 
least for the managed threads).
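
For anyone following along, here is roughly the shape of the pattern being 
discussed - a simplified sketch, not the actual Grpc.Core source (the class 
name SpinLockedRegistry and the object-typed values are mine): a SpinLock 
guarding a plain Dictionary, the way CompletionRegistry.Register/Extract does.

using System;
using System.Collections.Generic;
using System.Threading;

class SpinLockedRegistry
{
    // Not readonly: SpinLock is a mutable struct.
    private SpinLock spinLock = new SpinLock(false);
    private readonly Dictionary<IntPtr, object> dict = new Dictionary<IntPtr, object>();

    public void Register(IntPtr key, object callback)
    {
        bool lockTaken = false;
        try
        {
            // If another thread holds the lock, Enter() spins via
            // Thread.SpinWait (clr!ThreadNative::SpinWait). Per the analysis
            // above, a thread inside that native spin stays in cooperative
            // mode and cannot be suspended for GC until the spin completes.
            spinLock.Enter(ref lockTaken);
            dict.Add(key, callback);
        }
        finally
        {
            if (lockTaken) spinLock.Exit();
        }
    }

    public object Extract(IntPtr key)
    {
        bool lockTaken = false;
        try
        {
            spinLock.Enter(ref lockTaken);
            object callback = dict[key];   // the dictionary lookup done under the lock
            dict.Remove(key);
            return callback;
        }
        finally
        {
            if (lockTaken) spinLock.Exit();
        }
    }
}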

Thanks,
Tomer.

On Thursday, February 10, 2022 at 09:02:34 UTC+2 Tomer Schwartz wrote:

> I got a chance to continue investigating this issue. I ran the tests 
> several more times and broke in with a debugger whenever I saw the process 
> get stuck (the CPU spikes at the same time).
> In every case I checked I saw the same pattern: one thread holding the 
> SpinLock, with a call-stack similar to the one I sent above, and 
> several/dozens of threads waiting for the SpinLock.
> The call-stack above actually shows that the Dictionary lookup was 
> interrupted by the GC. More precisely, at that point the GC hasn't started 
> yet: the GC requests suspension by calling SuspendEE, the thread is 
> suspended (while holding the lock), but SuspendEE hasn't returned because 
> it's still waiting for the other managed threads to get suspended. Only 
> when all of the managed threads are suspended will the GC start executing.
> So which threads are refusing to be suspended? Exactly the threads that 
> are currently inside ThreadNative::SpinWait - threads that end up with the 
> following call-stack:
> 00007ffe2765a015 clr!ThreadNative::SpinWait+0x75
> 00007ffe26947c55 mscorlib_ni!System.Threading.SpinLock.ContinueTryEnter(Int32, Boolean ByRef)+0x175
> 00007ffdc7def8f5 Grpc_Core!Grpc.Core.Internal.CompletionRegistry.Register(IntPtr, Grpc.Core.Internal.IOpCompletionCallback)+0x95
> 00007ffdc7def2b1 Grpc_Core!Grpc.Core.Internal.CompletionRegistry.RegisterBatchCompletion(Grpc.Core.Internal.BatchCompletionDelegate, System.Object)+0x41
> 00007ffdc7defb98 Grpc_Core!Grpc.Core.Internal.CallSafeHandle.StartReceiveInitialMetadata(Grpc.Core.Internal.IReceivedResponseHeadersCallback)+0x48
>
> These threads are currently running native code and aren't in preemptive 
> mode, so the GC can't suspend them.
> When running !threads, for instance, we can see that the only threads 
> marked as Cooperative rather than Preemptive are the ones inside SpinWait.
> Here is a partial result of !threads:
>
> [image: CooperativeThreads.png]
>
> So those threads continue to spin, hogging the CPU, and until they 
> complete, all managed threads are blocked and the process hangs.
> I contacted the architect of the GC in the CLR team and I'm waiting to 
> hear whether this is a known issue when using SpinLock. Either way, it 
> looks like this is one of the reasons SpinLock is dangerous and should be 
> used only when we are sure that the threads aren't going to wait more than 
> X cycles (by that standard, the dictionary lookup here counts as a long 
> operation).
> One more thing: this issue manifests much more with Workstation GC, since 
> in that configuration there is a much higher number of garbage 
> collections. So statistically there is a higher chance that a GC will be 
> triggered right while some thread is in the middle of this dictionary 
> lookup.
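>
> As a side note, the clr!WKS::GCHeap frames in the call-stack further down 
> are themselves a sign of Workstation GC (Server GC shows up as SVR::). A 
> quick way to confirm the GC flavor from code is GCSettings (the GcModeCheck 
> wrapper below is just a throwaway snippet); on .NET Framework, Server GC is 
> enabled with <gcServer enabled="true"/> in app.config:
>
> using System;
> using System.Runtime;
>
> class GcModeCheck
> {
>     static void Main()
>     {
>         // True when the process runs the Server GC (on .NET Framework this
>         // is configured with <gcServer enabled="true"/> in app.config).
>         Console.WriteLine("Server GC: " + GCSettings.IsServerGC);
>         Console.WriteLine("Latency mode: " + GCSettings.LatencyMode);
>     }
> }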
>
> Thanks,
> Tomer. 
>
> On Monday, February 7, 2022 at 22:52:59 UTC+2 Tomer Schwartz wrote:
>
>> Thanks Jan,
>>
>> As I said, this is a load test we're running, so it does make sense that 
>> we'd see contention on such a component, given that we have a high rate 
>> of send/receive operations.
>> The problem is that even with a low rate of transactions, we are still 
>> seeing CPU spikes due to this contention.
>> I agree that Dictionary Add/Remove operations should be fast enough, and 
>> therefore the time spent inside the SpinLock is expected to be very 
>> short. But still, the profiler is pretty accurate and consistent about 
>> this: it repeatedly shows that most of the CPU is spent exclusively in 
>> clr!ThreadNative::SpinWait.
>> There are only rare cases where a SpinLock will perform better than 
>> taking a lock. There are several warnings on MSDN 
>> <https://docs.microsoft.com/en-us/dotnet/api/system.threading.spinlock?view=net-6.0> 
>> about using this struct.
>> In general, it should be used in scenarios where waiting for the lock is 
>> going to take a very short time - in other words, in cases where spinning 
>> is going to be cheaper than a context switch because there's a good 
>> chance the thread will get the lock very soon.
>>
>> I just ran the load test again with 50 concurrent users, and I see an 
>> 11-second window where 98% of the CPU frames were spent in this SpinWait. 
>> PerfView estimates the overall CPU during that time at 34%, which means 
>> roughly 33% of the CPU went just to spinning, while blocking other 
>> operations at the same time.
>>
>> [image: Grpc.JPG]
>>
>> In the following commit the SpinLock was introduced in place of the lock 
>> that used to be there: 
>> https://github.com/grpc/grpc/commit/eec8b84c71bc81a7073142506f369a9b8a3a63a6
>> Maybe that lock was seen as a bottleneck in the past and was therefore 
>> replaced with a SpinLock?
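>>
>> For comparison, going back to a plain lock would look roughly like the 
>> sketch below (illustrative only, not the code from that commit; the 
>> LockBasedRegistry name and object-typed values are placeholders):
>>
>> using System;
>> using System.Collections.Generic;
>>
>> // Illustrative only: a lock-based registry shape (callback type reduced
>> // to object; the real registry stores IOpCompletionCallback values).
>> class LockBasedRegistry
>> {
>>     private readonly object syncRoot = new object();
>>     private readonly Dictionary<IntPtr, object> dict = new Dictionary<IntPtr, object>();
>>
>>     public void Register(IntPtr key, object callback)
>>     {
>>         // Monitor blocks the waiter after a brief spin instead of burning
>>         // a fixed spin budget in ThreadNative::SpinWait.
>>         lock (syncRoot) { dict.Add(key, callback); }
>>     }
>>
>>     public object Extract(IntPtr key)
>>     {
>>         lock (syncRoot)
>>         {
>>             object callback = dict[key];
>>             dict.Remove(key);
>>             return callback;
>>         }
>>     }
>> }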
>>
>> I just attached a debugger at a random moment when I noticed a pause. 
>> Maybe this additional information is helpful:
>> There are 24 threads blocked in CompletionRegistry.Register(), spending 
>> their time in SpinLock.ContinueTryEnter (spinning / sleeping).
>> Another thread is busy in CompletionRegistry.Extract, looking up an item 
>> in the Dictionary. The number of items in the dictionary at that moment 
>> is 150.
>> I repeated the experiment a few times, and in all cases I see the same 
>> thing: several threads trying to add an item to the dictionary, all 
>> waiting for one thread that is currently removing an item from it.
>> In all those cases, the Dictionary's count is 150.
>> Given that every time there are several or even dozens of threads 
>> spinning, yielding and eventually sleeping, it seems that - at least 
>> during these spikes - using a SpinLock here is not a good fit.
>>
>> Additional info from one of the runs:
>> environment.DebugStats.PendingBatchCompletions = 104.
>> GrpcThreadPool.queuedContinuationCounter = 17
>>
>> Maybe the problem lies in RunHandlerLoop? Maybe it's calling 
>> CompletionRegistry.Extract too many times in a short period?
>> Here's a partial snippet of that method:
>> [image: RunHandlerLoop.JPG]
>>
>> Here's the call-stack of the thread that's calling 
>> CompletionRegistry.Extract (it is always captured in this state, as if 
>> the Dictionary.FindEntry call takes a non-negligible amount of time):
>> Child-SP          RetAddr           Call Site
>> 000000f0`003fdc28 00007ffe`3aca85c3 ntdll!NtWaitForSingleObject+0x14
>> 000000f0`003fdc30 00007ffe`2717a247 KERNELBASE!WaitForSingleObjectEx+0x93
>> 000000f0`003fdcd0 00007ffe`2717a1ff clr!CLREventWaitHelper2+0x3c
>> 000000f0`003fdd10 00007ffe`2717a180 clr!CLREventWaitHelper+0x1f
>> 000000f0`003fdd70 00007ffe`272e0b8e clr!CLREventBase::WaitEx+0x7c
>> 000000f0`003fde00 00007ffe`2717b6a1 clr!WKS::GCHeap::WaitUntilGCComplete+0x2b
>> 000000f0`003fde30 00007ffe`272daa2c clr!Thread::RareDisablePreemptiveGC+0x180
>> 000000f0`003fdec0 00007ffe`271310c6 clr!Thread::RedirectedHandledJITCase+0x1bf
>> 000000f0`003fdf70 00007ffe`25f8115f clr!RedirectedHandledJITCaseForGCThreadControl_Stub+0x26
>> 000000f0`003fdfb0 00007ffe`26aff0d0 mscorlib_ni!System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.__Canon, mscorlib]].FindEntry(System.__Canon)+0xffffffff`fffe851f
>> 000000f0`003fe000 00007ffd`c7de6923 mscorlib_ni!System.Collections.Generic.Dictionary`2[[System.__Canon, mscorlib],[System.__Canon, mscorlib]].get_Item(System.__Canon)+0xb69110
>> 000000f0`003fe030 00007ffd`c7b5e952 Grpc_Core!Grpc.Core.Internal.CompletionRegistry.Extract(IntPtr)+0x93
>> 000000f0`003fe090 00007ffe`25f95ab3 Grpc_Core!Grpc.Core.Internal.GrpcThreadPool.RunHandlerLoop(Grpc.Core.Internal.CompletionQueueSafeHandle, Grpc.Core.Profiling.IProfiler)+0x72
>> 000000f0`003fe0f0 00007ffe`25f95944 mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x163
>> 000000f0`003fe1c0 00007ffe`25f95912 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x14
>> 000000f0`003fe1f0 00007ffe`25f8f0c2 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x52
>> 000000f0`003fe240 00007ffe`271368d3 mscorlib_ni!System.Threading.ThreadHelper.ThreadStart()+0x52
>> 000000f0`003fe280 00007ffe`27136768 clr!CallDescrWorkerInternal+0x83
>> 000000f0`003fe2c0 00007ffe`27136e5d clr!CallDescrWorkerWithHandler+0x4e
>> 000000f0`003fe300 00007ffe`27212cd9 clr!MethodDescCallSite::CallTargetWorker+0xf8
>> 000000f0`003fe400 00007ffe`27137621 clr!ThreadNative::KickOffThread_Worker+0x109
>> 000000f0`003fe660 00007ffe`2713759c clr!ManagedThreadBase_DispatchInner+0x39
>> 000000f0`003fe6a0 00007ffe`271374cd clr!ManagedThreadBase_DispatchMiddle+0x6c
>> 000000f0`003fe7a0 00007ffe`2713765b clr!ManagedThreadBase_DispatchOuter+0x75
>> 000000f0`003fe830 00007ffe`27212bbb clr!ManagedThreadBase_FullTransitionWithAD+0x2f
>> 000000f0`003fe890 00007ffe`271742bf clr!ThreadNative::KickOffThread+0xdb
>> 000000f0`003fe960 00007ffe`3b847974 clr!Thread::intermediateThreadProc+0x86
>> 000000f0`003ff9a0 00007ffe`3e59a2f1 KERNEL32!BaseThreadInitThunk+0x14
>> 000000f0`003ff9d0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
>>
>> Thanks, and sorry for the long email :)
>> Tomer.
>> On Monday, February 7, 2022 at 13:00:49 UTC+2 Jan Tattermusch wrote:
>>
>>> Hi, thanks for the data.
>>>
>>> I suspect the increase in latencies is caused by things other than 
>>> contention in the completion registry.
>>> Basically the completion registry is a dictionary of scheduled 
>>> "operation completions" (where a completion is one of the primitive 
>>> operations defined by the C-core native layer - e.g. start a call, send 
>>> a message, receive a message etc.). Access to this dictionary is 
>>> protected by a SpinLock (as you've pointed out).
>>> The thing is, it seems weird that your system would be able to generate 
>>> so much traffic that it would lead to contention on the completion 
>>> registry dictionary. Updating the dictionary is hardly any work compared 
>>> to the overhead of e.g. sending or receiving a message, so I'd expect 
>>> the throughput of the completion registry to be much higher than the 
>>> actual throughput of your system.
>>> I suspect that since the time spent in the SpinLock is extremely short 
>>> AND most of the work actually happens in the native C-core layer, the 
>>> profiler might be giving incorrect data?
>>>
>>> The spin lock could of course be replaced by another synchronization 
>>> primitive (e.g. a concurrent dictionary or a normal lock), but in my 
>>> experience I've never seen the completion registry be the main reason 
>>> for Grpc.Core being slow. The change is doable, but it would require 
>>> careful benchmarking first.
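>>>
>>> For what it's worth, the ConcurrentDictionary variant would look roughly 
>>> like the sketch below (illustrative only and untested against Grpc.Core; 
>>> the ConcurrentRegistry name and object-typed values are placeholders), 
>>> and as said above it would need careful benchmarking:
>>>
>>> using System;
>>> using System.Collections.Concurrent;
>>>
>>> // Sketch: ConcurrentDictionary does fine-grained locking internally,
>>> // so Register/Extract no longer contend on a single SpinLock.
>>> class ConcurrentRegistry
>>> {
>>>     private readonly ConcurrentDictionary<IntPtr, object> dict =
>>>         new ConcurrentDictionary<IntPtr, object>();
>>>
>>>     public void Register(IntPtr key, object callback)
>>>     {
>>>         if (!dict.TryAdd(key, callback))
>>>             throw new InvalidOperationException("Key already registered.");
>>>     }
>>>
>>>     public object Extract(IntPtr key)
>>>     {
>>>         object callback;
>>>         if (!dict.TryRemove(key, out callback))
>>>             throw new InvalidOperationException("Key not found.");
>>>         return callback;
>>>     }
>>> }
>>>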
>>> Since Grpc.Core is in maintenance mode (see 
>>> https://grpc.io/blog/grpc-csharp-future/), this is likely not worth the 
>>> work.
>>>
>>> On Monday, January 17, 2022 at 8:32:14 AM UTC+1 tome...@gmail.com wrote:
>>>
>>>> Hi,
>>>>
>>>> We are running a client-server load test with 500 concurrent users, 
>>>> and we are seeing high CPU usage in a SpinLock taken by 
>>>> Grpc.Core.Internal.CompletionRegistry.Register.
>>>> On average, this SpinLock shows up in 15% of the CPU frames under a 
>>>> profiler. But during the few seconds where we see a significant 
>>>> increase in latencies, the profiler shows this SpinLock at an average 
>>>> of ~80% of the CPU frames, and sometimes even 95%:
>>>> [image: GrpcCore_SpinLock5.JPG]
>>>>
>>>> Is there any way to avoid this penalty?
>>>> I see that this SpinLock protects a Dictionary of 
>>>> IOpCompletionCallbacks. Are those callbacks affected by anything we 
>>>> control in our code?
>>>>
>>>
