Long radeon stalls on recent kernels

Andy Lutomirski Mon, 8 Dec 2014 16:24:57 -0800

On Wed, Nov 26, 2014 at 7:38 AM, Andy Lutomirski <luto at amacapital.net> wrote:
> On Tue, Nov 25, 2014 at 10:42 PM, Michel DÃ¤nzer <michel at daenzer.net> 
> wrote:
>> On 20.11.2014 09:58, Andy Lutomirski wrote:
>>>
>>> On Wed, Nov 19, 2014 at 4:07 PM, Andy Lutomirski <luto at amacapital.net>
>>> wrote:
>>>>
>>>> On Tue, Nov 18, 2014 at 11:19 PM, Michel DÃ¤nzer <michel at daenzer.net>
>>>> wrote:
>>>>>
>>>>> On 19.11.2014 09:21, Andy Lutomirski wrote:
>>>>>>
>>>>>>
>>>>>> On Mon, Nov 17, 2014 at 1:51 AM, Michel DÃ¤nzer <michel at daenzer.net>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 15.11.2014 07:21, Andy Lutomirski wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything
>>>>>>>> graphics intensive seems to cause my system to become unusable for
>>>>>>>> tens of seconds.  Pointing Firefox at Google Maps is a big offender
>>>>>>>> --
>>>>>>>> it can take several minutes for me to move my mouse far enough to
>>>>>>>> close the tab and get my computer back.
>>>>>>>>
>>>>>>>> On bootup, I get this warning:
>>>>>>>> [drm:btc_dpm_set_power_state] *ERROR*
>>>>>>>> rv770_restrict_performance_levels_before_switch failed
>>>>>>>>
>>>>>>>> Setting radeon.dpm=0 seems to work around this problem at the cost of
>>>>>>>> giving my rather slow graphics.
>>>>>>>>
>>>>>>>> Are there known issues here?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Can you bisect the kernel, or at least isolate which kernel version
>>>>>>> first
>>>>>>> introduced the problem?
>>>>>>
>>>>>>
>>>>>>
>>>>>> With whatever userspace I'm running, I'm seeing it 3.13, 3.14, 3.15,
>>>>>> 3.16, and 3.18-rc4+.  I haven't tried other versions.
>>>>>>
>>>>>> With radeon.dpm=0, I can still trigger short stalls (around one
>>>>>> second), but I seem unable to trigger long stalls easily.  (I say
>>>>>> easily because, just as I was typing this email, my system stalled for
>>>>>> about a minute.)
>>>>>
>>>>>
>>>>>
>>>>> I can only think of two things offhand that could cause such extremely
>>>>> long
>>>>> stalls: Swap thrashing or IRQ storms.
>>>>>
>>>>> With a setup where you can easily trigger long stalls, can you try
>>>>> getting a
>>>>> CPU profile for a stall with sysprof or perf?
>>>>>
>>>>>
>>>>
>>>> Got one with perf:
>>>>
>>>>    16.82%             Xorg  libc-2.18.so                        [.]
>>>> __memcpy_sse2_unaligned
>>>>     9.20%          swapper  [kernel.kallsyms]                   [k]
>>>> intel_idle
>>>>     1.00%             Xorg  [kernel.kallsyms]                   [k]
>>>> evergreen_irq_set
>>>>     0.83%          firefox  libxul.so                           [.]
>>>> 0x0000000001d93281
>>>>     0.69%          firefox  libxul.so                           [.]
>>>> 0x0000000001d932ad
>>>>     0.62%          firefox  [kernel.kallsyms]                   [k]
>>>> copy_user_generic_string
>>>>     0.55%          swapper  [kernel.kallsyms]                   [k]
>>>> evergreen_irq_ack
>>>>     0.54%          firefox  libpthread-2.18.so                  [.]
>>>> pthread_mutex_lock
>>>>     0.52%          firefox  libpthread-2.18.so                  [.]
>>>> pthread_mutex_unlock
>>>>     0.45%             Xorg  [kernel.kallsyms]                   [k]
>>>> drm_mm_insert_node_in_range_generic
>>>>     0.41%             Xorg  [kernel.kallsyms]                   [k]
>>>> lock_release
>>>>     0.40%             Xorg  [kernel.kallsyms]                   [k]
>>>> lock_acquire
>>>>     0.35%          firefox  firefox                             [.]
>>>> 0x000000000001245d
>>>>     0.33%             Xorg  [kernel.kallsyms]                   [k]
>>>> __module_address
>>>>     0.31%          firefox  [kernel.kallsyms]                   [k]
>>>> clear_page_c
>>>>     0.29%             Xorg  [kernel.kallsyms]                   [k]
>>>> copy_user_generic_string
>>>>     0.28%          firefox  firefox                             [.]
>>>> 0x0000000000013159
>>>>
>>>> and:
>>>>
>>>> Samples: 11K of event 'irq:irq_handler_entry', Event count (approx.):
>>>> 11802
>>>>    87.43%          swapper  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>     7.52%          firefox  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>     1.84%      irq/36-ahci  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>     1.14%             Xorg  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>     0.75%      kworker/5:0  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>     0.32%      gnome-shell  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>     0.25%     kworker/5:1H  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>     0.25%  Media D~ode #10  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>     0.19%  ImageDe~er #330  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>     0.07%       pulseaudio  [kernel.kallsyms]  [k]
>>>> handle_irq_event_percpu
>>>>
>>>> The cycles were with -e cycles:pp, so I think that iret would have
>>>> shown up if there were enough IRQs to cause the problem.
>>>>
>>>> I'll build a kernel with latencytop.
>>>>
>>>
>>> I just caught call_rwsem_down_write_failed for 5379 ms in khugepaged
>>> (holy crap) and radeon_fence_default_wait for 489.2ms in Xorg.
>>>
>>> Turning off THP gets rid of the khugepaged thing.  The 489.2ms is
>>> radeon_fence_default_wait is amazingly reproducible -- I've seen that
>>> exact number three times now.
>>
>>
>> Sounds like the long stalls were THP, but the shorter ones might be radeon?
>>
>> Can you get some call graphs for the profile or from latencytop? Make sure
>> at least the kernel is built with frame pointers (CONFIG_FRAME_POINTER=y),
>> preferably also userspace (-fno-omit-frame-pointer).
>
> Will try next week.  I'm out of town.


The relevant line from latencytop seems to be:

154 20441402 489139 radeon_fence_default_wait [radeon]
fence_wait_timeout ttm_bo_wait [ttm] ttm_bo_move_accel_cleanup [ttm]
radeon_move_blit.isra.12 [radeon] radeon_bo_move [radeon]
ttm_bo_handle_move_mem [ttm] ttm_bo_evict [ttm] ttm_mem_evict_first
[ttm] ttm_bo_mem_space [ttm] ttm_bo_validate [ttm]
radeon_bo_fault_reserve_notify [radeon]

--Andy

>
> --Andy
>
>>
>>
>>
>> --
>> Earthling Michel DÃ¤nzer               |               http://www.amd.com
>> Libre software enthusiast             |             Mesa and X developer
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC



-- 
Andy Lutomirski
AMA Capital Management, LLC

Long radeon stalls on recent kernels

Reply via email to