amdgpu: add concurrent baco reset support for XGMI

Andrey Grodzovsky Wed, 11 Dec 2019 06:05:44 -0800

Great! I will update the patches to also use the barrier in the PSP mode 1 reset case and resend the patches for formal review.


Andrey


On 12/11/19 7:18 AM, Ma, Le wrote:

[AMD Official Use Only - Internal Distribution Only]
I tried your new patches to run BACO for about 10 loops and the result looks positive; I did not observe the enter/exit BACO message failure again.
The time interval between BACO entries or exits in my environment was mostly under 10 us: max 36 us, min 2 us. I think it’s safe enough according to the sample data we collected on both sides.
It also no longer looks necessary to use system_highpri_wq, because we require all the nodes to enter or exit at the same time and do not mind how long the interval between enter and exit is. system_unbound_wq satisfies our requirement here since it wakes different CPUs up to work at the same time.
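
For illustration only, here is a rough userspace analogy of the requirement described above, written with POSIX threads rather than kernel workqueues. It is a sketch, not the code from the patches, which are not shown in this thread. One worker thread stands in for each XGMI node, and a barrier keeps any worker from "exiting" until all of them have "entered". The names simulate_hive_reset, node_reset_worker, node_times and MAX_NODES are hypothetical and exist only for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>
#include <time.h>

#define MAX_NODES 8 /* hypothetical cap on hive size for this sketch */

struct node_times {
    uint64_t enter_ns; /* taken right after the "enter" barrier */
    uint64_t exit_ns;  /* taken right after the "exit" barrier */
};

/* File-scope state keeps the sketch short; it is not reentrant. */
static pthread_barrier_t hive_barrier;
static pthread_mutex_t start_gate = PTHREAD_MUTEX_INITIALIZER;
static int start_ok;

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static void *node_reset_worker(void *arg)
{
    struct node_times *t = arg;
    int go;

    /* Wait until the launcher knows whether every worker was created. */
    pthread_mutex_lock(&start_gate);
    go = start_ok;
    pthread_mutex_unlock(&start_gate);
    if (!go)
        return NULL;

    pthread_barrier_wait(&hive_barrier); /* everyone is ready to enter */
    t->enter_ns = now_ns();              /* stand-in for "enter BACO" */
    pthread_barrier_wait(&hive_barrier); /* nobody exits until all entered */
    t->exit_ns = now_ns();               /* stand-in for "exit BACO" */
    return NULL;
}

/*
 * Runs one simulated reset across `nodes` workers. Returns 0 if every
 * enter timestamp precedes every exit timestamp, -1 otherwise. If
 * enter_skew_ns is non-NULL it receives max(enter) - min(enter).
 */
int simulate_hive_reset(int nodes, uint64_t *enter_skew_ns)
{
    pthread_t tid[MAX_NODES];
    struct node_times times[MAX_NODES] = { { 0, 0 } };
    uint64_t first_enter, last_enter, first_exit;
    int created = 0, i;

    if (nodes < 1 || nodes > MAX_NODES)
        return -1;
    if (pthread_barrier_init(&hive_barrier, NULL, (unsigned int)nodes) != 0)
        return -1;

    /* Hold the gate so no worker reaches the barrier during creation. */
    pthread_mutex_lock(&start_gate);
    while (created < nodes &&
           pthread_create(&tid[created], NULL, node_reset_worker,
                          &times[created]) == 0)
        created++;
    start_ok = (created == nodes);
    pthread_mutex_unlock(&start_gate);

    for (i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&hive_barrier);
    if (created != nodes)
        return -1;

    first_enter = last_enter = times[0].enter_ns;
    first_exit = times[0].exit_ns;
    for (i = 1; i < nodes; i++) {
        if (times[i].enter_ns < first_enter)
            first_enter = times[i].enter_ns;
        if (times[i].enter_ns > last_enter)
            last_enter = times[i].enter_ns;
        if (times[i].exit_ns < first_exit)
            first_exit = times[i].exit_ns;
    }
    if (enter_skew_ns)
        *enter_skew_ns = last_enter - first_enter;
    return last_enter <= first_exit ? 0 : -1;
}
```

Calling simulate_hive_reset(2, &skew) models a 2P hive and reports the spread between the two "enter" timestamps in nanoseconds. The measured spread depends on the host scheduler and says nothing about real BACO timing.
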
Regards,

Ma Le

*From:* Grodzovsky, Andrey <andrey.grodzov...@amd.com>
*Sent:* Wednesday, December 11, 2019 3:56 AM
*To:* Ma, Le <le...@amd.com>; amd-gfx@lists.freedesktop.org; Zhou1, Tao <tao.zh...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>; Li, Dennis <dennis...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
*Cc:* Chen, Guchun <guchun.c...@amd.com>
*Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset support for XGMI
I switched the workqueue we were using for xgmi_reset_work from system_highpri_wq to system_unbound_wq - the difference is that workers servicing the queue in system_unbound_wq are not bound to a specific CPU, so the reset jobs for each XGMI node get scheduled to different CPUs, while system_highpri_wq is a bound workqueue. I traced it as below 10 consecutive times and didn't see errors any more. Also, the time diff between BACO entries or exits was never more than around 2 us.
Please give this updated patchset a try.
kworker/u16:2-57 [004] ...1 243.276312: trace_code: func:vega20_baco_set_state, line 91 <----- - Before BACO enter
<...>-60 [007] ...1 243.276312: trace_code: func:vega20_baco_set_state, line 91 <----- - Before BACO enter
kworker/u16:2-57 [004] ...1 243.276384: trace_code: func:vega20_baco_set_state, line 105 <----- - After BACO enter done
<...>-60 [007] ...1 243.276392: trace_code: func:vega20_baco_set_state, line 105 <----- - After BACO enter done
kworker/u16:3-60 [007] ...1 243.276397: trace_code: func:vega20_baco_set_state, line 108 <----- - Before BACO exit
kworker/u16:2-57 [004] ...1 243.276399: trace_code: func:vega20_baco_set_state, line 108 <----- - Before BACO exit
kworker/u16:3-60 [007] ...1 243.288067: trace_code: func:vega20_baco_set_state, line 114 <----- - After BACO exit done
kworker/u16:2-57 [004] ...1 243.295624: trace_code: func:vega20_baco_set_state, line 114 <----- - After BACO exit done
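
As a small worked example of how the per-step time difference can be read off the trace above, the following sketch parses two timestamps and returns their absolute difference in microseconds. The helper names parse_ts_us and skew_us are hypothetical, and the parser assumes the timestamp always has six fractional digits, as it does in this trace.

```c
#include <stdio.h>

/*
 * Parse a "seconds.microseconds" timestamp into microseconds.
 * Assumes exactly six fractional digits. Returns -1 on a parse error.
 */
long long parse_ts_us(const char *s)
{
    long long sec, usec;

    if (sscanf(s, "%lld.%6lld", &sec, &usec) != 2)
        return -1;
    return sec * 1000000LL + usec;
}

/* Absolute difference between two timestamps in us, or -1 on error. */
long long skew_us(const char *a, const char *b)
{
    long long ta = parse_ts_us(a), tb = parse_ts_us(b);

    if (ta < 0 || tb < 0)
        return -1;
    return ta > tb ? ta - tb : tb - ta;
}
```

Applied to the trace, skew_us("243.276312", "243.276312") gives 0 us for the two "Before BACO enter" lines and skew_us("243.276397", "243.276399") gives 2 us for the two "Before BACO exit" lines. The "After BACO enter done" lines differ by 8 us and the "After BACO exit done" lines by 7557 us.
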
Andrey

On 12/9/19 9:45 PM, Ma, Le wrote:

    [AMD Official Use Only - Internal Distribution Only]

    I’m fine with your solution if the synchronization time interval
    satisfies BACO requirements and the loop test passes on an XGMI system.

    Regards,

    Ma Le

    *From:* Grodzovsky, Andrey <andrey.grodzov...@amd.com>
    *Sent:* Monday, December 9, 2019 11:52 PM
    *To:* Ma, Le <le...@amd.com>; amd-gfx@lists.freedesktop.org;
    Zhou1, Tao <tao.zh...@amd.com>; Deucher, Alexander
    <alexander.deuc...@amd.com>; Li, Dennis <dennis...@amd.com>;
    Zhang, Hawking <hawking.zh...@amd.com>
    *Cc:* Chen, Guchun <guchun.c...@amd.com>
    *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco reset
    support for XGMI

    Thanks a lot Ma for trying - I think I have to have my own system
    to debug this, so I will keep trying to enable XGMI - I still think
    this is the right and generic solution for multi-node reset
    synchronization, and in fact the barrier should also be used for
    synchronizing PSP mode 1 XGMI reset.

    Andrey

    On 12/9/19 6:34 AM, Ma, Le wrote:

        [AMD Official Use Only - Internal Distribution Only]

        Hi Andrey,

        I tried your patches on my 2P XGMI platform. BACO works most
        of the time, but randomly hits the following error:

        [ 1701.542298] amdgpu: [powerplay] Failed to send message 0x25, response 0x0

        This error usually means a sync issue exists in the XGMI BACO
        case. Feel free to debug your patches on my XGMI platform.

        Regards,

        Ma Le

        *From:* Grodzovsky, Andrey <andrey.grodzov...@amd.com>
        *Sent:* Saturday, December 7, 2019 5:51 AM
        *To:* Ma, Le <le...@amd.com>; amd-gfx@lists.freedesktop.org;
        Zhou1, Tao <tao.zh...@amd.com>; Deucher, Alexander
        <alexander.deuc...@amd.com>; Li, Dennis <dennis...@amd.com>;
        Zhang, Hawking <hawking.zh...@amd.com>
        *Cc:* Chen, Guchun <guchun.c...@amd.com>
        *Subject:* Re: [PATCH 07/10] drm/amdgpu: add concurrent baco
        reset support for XGMI

        Hey Ma, attached is a solution - it's only compile-tested, as I
        still can't make my XGMI setup work (with the bridge connected,
        only one device is visible to the system while the other is not).
        Please try it on your system if you have a chance.

        Andrey

        On 12/4/19 10:14 PM, Ma, Le wrote:

            AFAIK it's enough for even a single node in the hive to
            fail to enter the BACO state on time to fail the
            entire hive reset procedure, no?

            [Le]: Yeah, agreed. I’ve been thinking that making all
            nodes enter BACO simultaneously reduces the risk of a
            node failing to enter/exit BACO. For example, in an XGMI
            hive with 8 nodes, the total time for 8 nodes to
            enter/exit BACO on 8 CPUs is less than the time for 8
            nodes to enter BACO serially and exit BACO serially on
            one CPU with yield capability. This interval is usually
            strict for the BACO feature itself. Anyway, we need more
            loop testing later with whichever method we choose.

            Anyway - I see our discussion blocks your entire patch
            set - I think you can go ahead and commit it your way (I
            think you got an RB from Hawking), and I will then look
            and see if I can implement my method; if it works, I will
            just revert your patch.

            [Le]: OK, fine.

            Andrey

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
