On 2021-08-05 16:16, John Garry wrote:
On 05/08/2021 15:41, Robin Murphy wrote:
I suppose they could be combined into a smaller sub-struct and loaded in a single operation, but it looks messy, and probably without much gain.

Indeed I wouldn't say that saving memory is the primary concern here, and any more convoluted code is hardly going to help performance. Plus it still wouldn't help the other cases where we're just copying the size into a fake queue to do some prod arithmetic - I hadn't fully clocked what was going on there when I skimmed through things earlier.

Disregarding the bogus layout change, though, do you reckon the rest of my idea makes sense?

I tried a similar change to avoid zero-initializing the padding in arm_smmu_cmdq_write_entries() and __arm_smmu_cmdq_poll_set_valid_map(), but the disassembly was the same, so the compiler must have got smart there.

Yeah, in my build __arm_smmu_cmdq_poll_set_valid_map() only uses 32 bytes of stack, so clearly it's managed to see through the macro magic once queue_inc_prod_n() is inlined and elide the whole struct. arm_smmu_cmdq_write_entries() is inlined already, but logically must be the same deal since it's a similarly inlined queue_inc_prod_n().

However, that may all change if different compiler flags or a different compiler lead to different inlining decisions, so I'd argue that if this can matter anywhere then it's worth treating consistently everywhere.

But for the original change in this patch, it did make a difference. It's nice to remove what was a memcpy:

     1770: a9077eff stp xzr, xzr, [x23, #112]
}, head = llq;
     1774: 94000000 bl 0 <memcpy>

And performance was very fractionally better.

Heh, mine was this beauty:

        struct arm_smmu_ll_queue llq = {
    17d4:       a9017f7f        stp     xzr, xzr, [x27, #16]
    17d8:       a9027f7f        stp     xzr, xzr, [x27, #32]
    17dc:       a9037f7f        stp     xzr, xzr, [x27, #48]
    17e0:       a9047f7f        stp     xzr, xzr, [x27, #64]
        }, head = llq;
    17e4:       b900c340        str     w0, [x26, #192]
{
    17e8:       290d0be1        stp     w1, w2, [sp, #104]
        }, head = llq;
    17ec:       a9440f62        ldp     x2, x3, [x27, #64]
    17f0:       a9007f5f        stp     xzr, xzr, [x26]
    17f4:       a9017f5f        stp     xzr, xzr, [x26, #16]
    17f8:       a9027f5f        stp     xzr, xzr, [x26, #32]
    17fc:       a9037f5f        stp     xzr, xzr, [x26, #48]
    1800:       a9040f42        stp     x2, x3, [x26, #64]
        struct arm_smmu_ll_queue llq = {
    1804:       a9057f7f        stp     xzr, xzr, [x27, #80]
        }, head = llq;
    1808:       a9057f5f        stp     xzr, xzr, [x26, #80]
        struct arm_smmu_ll_queue llq = {
    180c:       a9067f7f        stp     xzr, xzr, [x27, #96]
        }, head = llq;
    1810:       a9067f5f        stp     xzr, xzr, [x26, #96]
        struct arm_smmu_ll_queue llq = {
    1814:       a9077f7f        stp     xzr, xzr, [x27, #112]
        }, head = llq;
    1818:       a9077f5f        stp     xzr, xzr, [x26, #112]
        struct arm_smmu_ll_queue llq = {
    181c:       a9087f5f        stp     xzr, xzr, [x26, #128]

As for pre-evaluating "nents", I'm not sure how much that can help, and I am not too optimistic. I can try some testing when I get a chance, though I would also need to check the disassembly.

It'll just turn MOV, LDR, LSL sequences into plain LDRs - a small saving but with no real downside, and a third of it is in the place where doing less work matters most:

add/remove: 0/0 grow/shrink: 0/8 up/down: 0/-100 (-100)
Function                                     old     new   delta
arm_smmu_priq_thread                         532     528      -4
arm_smmu_evtq_thread                         368     364      -4
arm_smmu_device_probe                       4564    4556      -8
__arm_smmu_cmdq_poll_set_valid_map.isra      316     308      -8
arm_smmu_init_one_queue.isra                 320     308     -12
queue_remove_raw                             192     176     -16
arm_smmu_gerror_handler                      752     736     -16
arm_smmu_cmdq_issue_cmdlist                 1812    1780     -32
Total: Before=23776, After=23676, chg -0.42%


Robin.