On Fri, Oct 6, 2017 at 4:24 PM, Ryota Ozaki <ozak...@netbsd.org> wrote: > On Fri, Oct 6, 2017 at 1:14 PM, Ryota Ozaki <ozak...@netbsd.org> wrote: >> On Fri, Oct 6, 2017 at 11:58 AM, Taylor R Campbell >> <campbell+netbsd-tech-k...@mumble.net> wrote: >>>> Date: Fri, 6 Oct 2017 11:26:40 +0900 >>>> From: Ryota Ozaki <ozak...@netbsd.org> >>>> >>>> On Mon, Oct 2, 2017 at 11:13 PM, Taylor R Campbell >>>> <campbell+netbsd-tech-k...@mumble.net> wrote: >>>> > Quick summary of the problem: >>>> > >>>> > Possible solutions. I'm leaning toward (6), to open-code the linked >>>> > list operations for this special purpose, with compile-tested patch >>>> > attached. This changes the text of psref.h, but shouldn't change the >>>> > ABI. Comments? >>>> >>>> How about using SLIST instead of open-coding? The instructions of them >>>> are very similar, but the SLIST version is slightly simpler. >>> >>> I avoided that because n psref_release operations takes expected and >>> worst-case O(n^2) time and there's no constant bound on the latency of >>> a single psref_release operation. But maybe n is always small enough >>> that it doesn't matter -- perhaps enough that the concrete cost of >>> maintaining a doubly-linked list is higher. >> >> I also suppose that a target being released is at the head of the list >> because targets of psref are typically manipulated in last-in-first-out >> manner. But yes psref_release of SLIST version can be O(n) in worst >> cases theoretically. >> >> Well, when I think this kind of problems I tend to think of our usages, >> i.e., network processing as a router. In such usages, psref is used in >> softint and the last-in-first-out manner would keep in many cases. But >> in other usages such as uses in threads the assumption is perhaps not >> really. >> >>> >>> (My desire to avoid thinking about bounds on n is also what motivated >>> me to use a linked list instead of an array in the first place.) >>> >>> Note that your patch changes the ABI of struct psref! >> >> Let's bump the kernel version! >> >>> >>> I wonder whether the open-coded version would do better if it >>> unconditionally loaded the percpu: >>> >>> pcpu = percpu_getref(class->prc_percpu); >>> KASSERTMSG(psref->psref_prevp == NULL || *psref->psref_prevp == >>> psref, >>> "psref %p prevp %p points at %p", >>> psref, psref->psref_prevp, *psref->psref_prevp); >>> KASSERTMSG(psref->psref_prevp != NULL || pcpu->pcpu_first == psref, >>> "psref %p marked as first but psref_cpu %p on %d first is %p", >>> psref, pcpu, cpu_index(curcpu()), pcpu->pcpu_first); >>> *(psref->psref_prevp ? psref->psref_prevp : &pcpu->pcpu_first) = >>> psref->psref_next; >>> percpu_putref(class->prc_percpu); >>> >>> With DIAGNOSTIC disabled, I get a conditional move instruction instead >>> of branches this way: >>> >>> 4f9: e8 00 00 00 00 callq 4fe <psref_release+0x93> >>> 4fa: R_X86_64_PC32 >>> percpu_getref+0xfffffffffffffffc >>> 4fe: 48 8b 53 08 mov 0x8(%rbx),%rdx >>> 502: 48 85 d2 test %rdx,%rdx >>> 505: 48 0f 44 d0 cmove %rax,%rdx >>> 509: 48 8b 03 mov (%rbx),%rax >>> 50c: 48 89 02 mov %rax,(%rdx) >>> 50f: 49 8b 7c 24 20 mov 0x20(%r12),%rdi >>> 514: e8 00 00 00 00 callq 519 <psref_release+0xae> >>> 515: R_X86_64_PC32 >>> percpu_putref+0xfffffffffffffffc >>> >>> Also, my original patch was missing a percpu_putref. I hope you put >>> it back before you ran your test! >> >> I'll test with the patch in the other mail later. > > The above results were actually measured by knakahara@ and > this time I've measured by myself (the measurement hardware > is the same and the kernel configs of his and mine should > be almost the same though).
I notice a difference between his and mine: check-out date of the source code. Mine is checked out today while his is probably some days ago. > > ...so the results show a slightly different trend: > - original: 137.9 138.0 (144.7) Mbps > - open-code: 135.2 134.6 (138.5) Mbps > - SLIST: 140.7 141.4 (140.1) Mbps ...though still these results are puzzling. ozaki-r > > I think knakahara@ (or someone else) needs to re-evaluate :-/ > > ozaki-r