> On 28. Mar 2024, at 15:00, Nuno Teixeira <edua...@freebsd.org> wrote:
> 
> Hello all!
> 
> Running rack @b7b78c1c169 "Optimize HPTS..." very happy on my laptop (amd64)!
> 
> Thanks all!
Thanks for the feedback!

Best regards
Michael
> 
> Drew Gallatin <galla...@freebsd.org> escreveu (quinta, 21/03/2024 à(s) 12:58):
> The entire point is to *NOT* go through the overhead of scheduling something 
> asynchronously, but to take advantage of the fact that a user/kernel 
> transition is going to trash the cache anyway.
> 
> In the common case of a system which has less than the threshold  number of 
> connections , we access the tcp_hpts_softclock function pointer, make one 
> function call, and access hpts_that_need_softclock, and then return.  So 
> that's 2 variables and a function call.
> 
> I think it would be preferable to avoid that call, and to move the 
> declaration of tcp_hpts_softclock and hpts_that_need_softclock so that they 
> are in the same cacheline.  Then we'd be hitting just a single line in the 
> common case.  (I've made comments on the review to that effect).
> 
> Also, I wonder if the threshold could get higher by default, so that hpts is 
> never called in this context unless we're to the point where we're scheduling 
> thousands of runs of the hpts thread (and taking all those clock interrupts).
> 
> Drew
> 
> On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Belousov wrote:
>> On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote:
>>> Ok I have created
>>> 
>>> https://reviews.freebsd.org/D44420
>>> 
>>> 
>>> To address the issue. I also attach a short version of the patch that Nuno
>>> can try and validate
>>> 
>>> it works. Drew you may want to try this and validate the optimization does
>>> kick in since I can
>>> 
>>> only now test that it does not on my local box :)
>> The patch still causes access to all cpu's cachelines on each userret.
>> It would be much better to inc/check the threshold and only schedule the
>> call when exceeded.  Then the call can occur in some dedicated context,
>> like per-CPU thread, instead of userret.
>> 
>>> 
>>> 
>>> R
>>> 
>>> 
>>> 
>>> On 3/18/24 3:42 PM, Drew Gallatin wrote:
>>>> No.  The goal is to run on every return to userspace for every thread.
>>>> 
>>>> Drew
>>>> 
>>>> On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belousov wrote:
>>>>> On Mon, Mar 18, 2024 at 03:13:11PM -0400, Drew Gallatin wrote:
>>>>>> I got the idea from
>>>>>> https://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf
>>>>>> The gist is that the TCP pacing stuff needs to run frequently, and
>>>>>> rather than run it out of a clock interrupt, its more efficient to run
>>>>>> it out of a system call context at just the point where we return to
>>>>>> userspace and the cache is trashed anyway. The current implementation
>>>>>> is fine for our workload, but probably not idea for a generic system.
>>>>>> Especially one where something is banging on system calls.
>>>>>> 
>>>>>> Ast's could be the right tool for this, but I'm super unfamiliar with
>>>>>> them, and I can't find any docs on them.
>>>>>> 
>>>>>> Would ast_register(0, ASTR_UNCOND, 0, func) be roughly equivalent to
>>>>>> what's happening here?
>>>>> This call would need some AST number added, and then it registers the
>>>>> ast to run on next return to userspace, for the current thread.
>>>>> 
>>>>> Is it enough?
>>>>>> 
>>>>>> Drew
>>>>> 
>>>>>> 
>>>>>> On Mon, Mar 18, 2024, at 2:33 PM, Konstantin Belousov wrote:
>>>>>>> On Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike Karels wrote:
>>>>>>>> On 18 Mar 2024, at 7:04, tue...@freebsd.org wrote:
>>>>>>>> 
>>>>>>>>>> On 18. Mar 2024, at 12:42, Nuno Teixeira
>>>>> <edua...@freebsd.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hello all!
>>>>>>>>>> 
>>>>>>>>>> It works just fine!
>>>>>>>>>> System performance is OK.
>>>>>>>>>> Using patch on main-n268841-b0aaf8beb126(-dirty).
>>>>>>>>>> 
>>>>>>>>>> ---
>>>>>>>>>> net.inet.tcp.functions_available:
>>>>>>>>>> Stack                           D
>>>>> Alias                            PCB count
>>>>>>>>>> freebsd freebsd                          0
>>>>>>>>>> rack                            *
>>>>> rack                             38
>>>>>>>>>> ---
>>>>>>>>>> 
>>>>>>>>>> It would be so nice that we can have a sysctl tunnable for
>>>>> this patch
>>>>>>>>>> so we could do more tests without recompiling kernel.
>>>>>>>>> Thanks for testing!
>>>>>>>>> 
>>>>>>>>> @gallatin: can you come up with a patch that is acceptable
>>>>> for Netflix
>>>>>>>>> and allows to mitigate the performance regression.
>>>>>>>> 
>>>>>>>> Ideally, tcphpts could enable this automatically when it
>>>>> starts to be
>>>>>>>> used (enough?), but a sysctl could select auto/on/off.
>>>>>>> There is already a well-known mechanism to request execution of the
>>>>>>> specific function on return to userspace, namely AST.  The difference
>>>>>>> with the current hack is that the execution is requested for one
>>>>> callback
>>>>>>> in the context of the specific thread.
>>>>>>> 
>>>>>>> Still, it might be worth a try to use it; what is the reason to
>>>>> hit a thread
>>>>>>> that does not do networking, with TCP processing?
>>>>>>> 
>>>>>>>> 
>>>>>>>> Mike
>>>>>>>> 
>>>>>>>>> Best regards
>>>>>>>>> Michael
>>>>>>>>>> 
>>>>>>>>>> Thanks all!
>>>>>>>>>> Really happy here :)
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> 
>>>>>>>>>> Nuno Teixeira <edua...@freebsd.org> escreveu (domingo,
>>>>> 17/03/2024 à(s) 20:26):
>>>>>>>>>>> 
>>>>>>>>>>> Hello,
>>>>>>>>>>> 
>>>>>>>>>>>> I don't have the full context, but it seems like the
>>>>> complaint is a performance regression in bonnie++ and perhaps other
>>>>> things when tcp_hpts is loaded, even when it is not used.  Is that
>>>>> correct?
>>>>>>>>>>>> 
>>>>>>>>>>>> If so, I suspect its because we drive the
>>>>> tcp_hpts_softclock() routine from userret(), in order to avoid tons
>>>>> of timer interrupts and context switches.  To test this theory,  you
>>>>> could apply a patch like:
>>>>>>>>>>> 
>>>>>>>>>>> It's affecting overall system performance, bonnie was just
>>>>> a way to
>>>>>>>>>>> get some numbers to compare.
>>>>>>>>>>> 
>>>>>>>>>>> Tomorrow I will test patch.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks!
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Nuno Teixeira
>>>>>>>>>>> FreeBSD Committer (ports)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Nuno Teixeira
>>>>>>>>>> FreeBSD Committer (ports)
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>> 
>>> diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts.c
>>> index 8c4d2d41a3eb..eadbee19f69c 100644
>>> --- a/sys/netinet/tcp_hpts.c
>>> +++ b/sys/netinet/tcp_hpts.c
>>> @@ -216,6 +216,7 @@ struct tcp_hpts_entry {
>>> void *ie_cookie;
>>> uint16_t p_num; /* The hpts number one per cpu */
>>> uint16_t p_cpu; /* The hpts CPU */
>>> + uint8_t hit_callout_thresh;
>>> /* There is extra space in here */
>>> /* Cache line 0x100 */
>>> struct callout co __aligned(CACHE_LINE_SIZE);
>>> @@ -269,6 +270,11 @@ static struct hpts_domain_info {
>>> int cpu[MAXCPU];
>>> } hpts_domains[MAXMEMDOM];
>>> 
>>> +counter_u64_t hpts_that_need_softclock;
>>> +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needsoftclock, 
>>> CTLFLAG_RD,
>>> +    &hpts_that_need_softclock,
>>> +    "Number of hpts threads that need softclock");
>>> +
>>> counter_u64_t hpts_hopelessly_behind;
>>> 
>>> SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, hopeless, CTLFLAG_RD,
>>> @@ -334,7 +340,7 @@ SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, precision, 
>>> CTLFLAG_RW,
>>>     &tcp_hpts_precision, 120,
>>>     "Value for PRE() precision of callout");
>>> SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, cnt_thresh, CTLFLAG_RW,
>>> -    &conn_cnt_thresh, 0,
>>> +    &conn_cnt_thresh, DEFAULT_CONNECTION_THESHOLD,
>>>     "How many connections (below) make us use the callout based mechanism");
>>> SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, logging, CTLFLAG_RW,
>>>     &hpts_does_tp_logging, 0,
>>> @@ -1548,6 +1554,9 @@ __tcp_run_hpts(void)
>>> struct tcp_hpts_entry *hpts;
>>> int ticks_ran;
>>> 
>>> + if (counter_u64_fetch(hpts_that_need_softclock) == 0)
>>> + return;
>>> +
>>> hpts = tcp_choose_hpts_to_run();
>>> 
>>> if (hpts->p_hpts_active) {
>>> @@ -1683,6 +1692,13 @@ tcp_hpts_thread(void *ctx)
>>> ticks_ran = tcp_hptsi(hpts, 1);
>>> tv.tv_sec = 0;
>>> tv.tv_usec = hpts->p_hpts_sleep_time * HPTS_TICKS_PER_SLOT;
>>> + if ((hpts->p_on_queue_cnt > conn_cnt_thresh) && (hpts->hit_callout_thresh 
>>> == 0)) {
>>> + hpts->hit_callout_thresh = 1;
>>> + counter_u64_add(hpts_that_need_softclock, 1);
>>> + } else if ((hpts->p_on_queue_cnt <= conn_cnt_thresh) && 
>>> (hpts->hit_callout_thresh == 1)) {
>>> + hpts->hit_callout_thresh = 0;
>>> + counter_u64_add(hpts_that_need_softclock, -1);
>>> + }
>>> if (hpts->p_on_queue_cnt >= conn_cnt_thresh) {
>>> if(hpts->p_direct_wake == 0) {
>>> /*
>>> @@ -1818,6 +1834,7 @@ tcp_hpts_mod_load(void)
>>> cpu_top = NULL;
>>> #endif
>>> tcp_pace.rp_num_hptss = ncpus;
>>> + hpts_that_need_softclock = counter_u64_alloc(M_WAITOK);
>>> hpts_hopelessly_behind = counter_u64_alloc(M_WAITOK);
>>> hpts_loops = counter_u64_alloc(M_WAITOK);
>>> back_tosleep = counter_u64_alloc(M_WAITOK);
>>> @@ -2042,6 +2059,7 @@ tcp_hpts_mod_unload(void)
>>> free(tcp_pace.grps, M_TCPHPTS);
>>> #endif
>>> 
>>> + counter_u64_free(hpts_that_need_softclock);
>>> counter_u64_free(hpts_hopelessly_behind);
>>> counter_u64_free(hpts_loops);
>>> counter_u64_free(back_tosleep);
>> 
>> 
> 
> 
> 
> -- 
> Nuno Teixeira
> FreeBSD Committer (ports)


Reply via email to