Re: [PATCH] drm/amdgpu: always reset asic when going into suspend

2020-02-06 Thread Daniel Drake
On Thu, Jan 16, 2020 at 11:15 PM Alex Deucher  wrote:
> It's just papering over the problem.  It would be better from a power
> perspective for the driver to just not suspend and keep running like
> normal.  When the driver is not suspended runtime things like clock
> and power gating are active which keep the GPU power at a minimum.

Until we have a better solution, are there any strategies we could
apply here to avoid the suspend as you say?
e.g. DMI quirk these products to disable suspend? Or disable suspend
on all s2idle setups?

This would certainly be better than the current situation of the
machine becoming unusable on resume.

> I talked to our sbios team and they seem to think our S0ix
> implementation works pretty differently from Intel's.  I'm not really
> an expert on this area however.  We have a new team ramping up on this
> for Linux however.

Thanks for following up on this internally! Can I lend a product
sample to the new team so that they have direct access?

Daniel
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: always reset asic when going into suspend

2020-01-14 Thread Daniel Drake
On Thu, Dec 19, 2019 at 10:08 PM Alex Deucher  wrote:
> I think there may be some AMD specific handling needed in
> drivers/acpi/sleep.c.  My understanding from reading the modern
> standby documents from MS is that each vendor needs to provide a
> platform specific PEP driver.  I'm not sure how much of that current
> code is Intel specific or not.

I don't think there is anything Intel-specific in drivers/acpi/sleep.c.

Reading more about PEP, I see that Linux supports PEP devices with
ACPI ID INT33A1 or PNP0D80. Indeed the Intel platforms we work with
have INT33A1 devices in their ACPI tables.

This product has a \_SB.PEP ACPI device with _HID AMD0004 and _CID
PNP0D80. Full acpidump:
https://gist.github.com/dsd/ff3dfc0f63cdd9eba4a0fbd9e776e8be (see
ssdt7)
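As a rough illustration of the ID matching described above, the sketch below checks a device's _HID and _CID values against the two PEP IDs named in this message. The helper name and data layout are invented for illustration; the kernel does this through its ACPI device-ID tables, not through a function like this.

```python
# PEP IDs that this message says Linux recognises.
SUPPORTED_PEP_IDS = {"INT33A1", "PNP0D80"}


def matches_pep(hid, cids=()):
    """Return True if the _HID or any _CID is a recognised PEP ID.

    Hypothetical helper, not a kernel API.
    """
    ids = {hid.upper(), *(cid.upper() for cid in cids)}
    return bool(ids & SUPPORTED_PEP_IDS)


if __name__ == "__main__":
    # The device on this product: _HID AMD0004, _CID PNP0D80.
    print(matches_pep("AMD0004", ["PNP0D80"]))
    # An Intel-style device identified by _HID alone.
    print(matches_pep("INT33A1"))
```

Under this model the AMD0004 device matches only through its PNP0D80 compatible ID, which is consistent with the observation that the generic code picks it up.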

This PEP device responds to a _DSM with UUID argument
"e3f32452-febc-43ce-9039-932122d37721", which is not the one
documented at 
https://uefi.org/sites/default/files/resources/Intel_ACPI_Low_Power_S0_Idle.pdf
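A small sketch of the UUID comparison, for reference. The AMD value is the one quoted above; the Intel value is an assumption written from memory of the Low Power S0 Idle document and should be checked against the linked PDF before relying on it.

```python
import uuid

# _DSM UUID reported by this product's PEP device (from this message).
AMD_PEP_DSM_UUID = uuid.UUID("e3f32452-febc-43ce-9039-932122d37721")

# ASSUMPTION: UUID from Intel's Low Power S0 Idle document, quoted from
# memory. Verify against the PDF linked above.
INTEL_LPS0_DSM_UUID = uuid.UUID("c4eb40a0-6cd2-11e2-bcfd-0800200c9a66")


def same_dsm_interface(a, b):
    """Compare two _DSM UUIDs, ignoring case and formatting differences."""
    return uuid.UUID(str(a)) == uuid.UUID(str(b))


if __name__ == "__main__":
    print(same_dsm_interface(AMD_PEP_DSM_UUID, INTEL_LPS0_DSM_UUID))
```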

Nevertheless, there is some data about the GPU:
Package (0x04)
{
One,
"\\_SB.PCI0.GP17.VGA",
Zero,
0x03
},

However since this data is identical to many other devices that
suspend and resume just fine, I wonder if it is really important.

The one supported method does offer two calls which may mirror the
Display Off/On Notifications in the above spec:
Case (0x02)
{
\_SB.PCI0.SBRG.EC0.CSEE (0xB7)
Return (Zero)
}
Case (0x03)
{
\_SB.PCI0.SBRG.EC0.CSEE (0xB8)
Notify (\_SB.PCI0.SBRG.EC0.LID, 0x80) // Status Change
Return (Zero)
}

but I tried executing this code after suspending amdgpu, and the
problem still stands: amdgpu cannot wake up correctly.

There's nothing else really interesting in the PEP device as far as I can see.

PEP things aside, I am still quite suspicious about the fact that
calling amdgpu_device_suspend() then amdgpu_device_resume() on
multiple products (not just this one) fails. It seems that this code
flow is relying on the BIOS doing something in the S3 suspend/resume
path in order to make the device resumable by amdgpu_device_resume(),
which is why we have only encountered this issue for the first time on
our first AMD platform that does not support S3 suspend.
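To tell whether a given machine offers S3 at all, the contents of /sys/power/mem_sleep can be parsed: the kernel lists the available suspend variants there and puts the selected one in square brackets. The following is a minimal sketch; the function names are made up for illustration.

```python
def parse_mem_sleep(text):
    """Parse /sys/power/mem_sleep content such as 's2idle [deep]'.

    Returns (available_modes, selected_mode).
    """
    modes, selected = [], None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        modes.append(token)
    return modes, selected


def is_s2idle_only(text):
    """True if s2idle is the only suspend variant the platform offers."""
    modes, _ = parse_mem_sleep(text)
    return modes == ["s2idle"]


if __name__ == "__main__":
    # Example strings; on a real system read /sys/power/mem_sleep instead.
    print(parse_mem_sleep("s2idle [deep]\n"))
    print(is_s2idle_only("[s2idle]\n"))
```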

With that in mind, and lacking any better info, wouldn't it make sense
for amdgpu_device_resume() to always call reset? Maybe it's not
necessary in the S3 case, but it shouldn't harm anything. Or perhaps
it could check if the device is alive and reset it if it's not?
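The "check if the device is alive and reset it if it's not" idea can be sketched as follows. This is only a control-flow illustration: the three callables and the FakeDevice class are hypothetical stand-ins, not amdgpu functions.

```python
def resume_with_fallback_reset(is_alive, reset, resume):
    """Reset the device before resuming only if it does not look alive.

    is_alive, reset and resume are hypothetical callables standing in
    for driver operations. Returns True if a reset was performed.
    """
    did_reset = False
    if not is_alive():
        reset()
        did_reset = True
    resume()
    return did_reset


class FakeDevice:
    """Toy device used only to exercise the control flow above."""

    def __init__(self, alive):
        self.alive = alive
        self.log = []

    def is_alive(self):
        return self.alive

    def reset(self):
        self.log.append("reset")
        self.alive = True

    def resume(self):
        self.log.append("resume")


if __name__ == "__main__":
    dev = FakeDevice(alive=False)
    resume_with_fallback_reset(dev.is_alive, dev.reset, dev.resume)
    print(dev.log)
```

The "always reset" variant would simply drop the is_alive() check, at the cost of an extra reset in the S3 case.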

Alternatively do you have any other contacts within AMD that could
help us figure out the underlying question of how to correctly suspend
and resume these devices? Happy to ship an affected product sample
your way.

Thanks
Daniel


Re: [PATCH] drm/amdgpu: always reset asic when going into suspend

2019-12-16 Thread Daniel Drake
Hi Alex,

On Mon, Nov 25, 2019 at 1:17 PM Daniel Drake  wrote:
> Unfortunately not. The original issue still exists (dead gfx after
> resume from s2idle) and also when I trigger execution of the suspend
> or runtime suspend routines the power usage increases around 1.5W as
> before.
>
> Have you confirmed that amdgpu s2idle is working on platforms you have in
> hand?

Any further ideas here? Or any workarounds that you would consider?

This platform has been rather tricky but all of the other problems are
now solved:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f897e60a12f0b9146357780d317879bce2a877dc
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21b8adbd475dba19ac2086d3306327b4a297418
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=406857f773b082bc88edfd24967facf4ed07ac85
https://patchwork.kernel.org/patch/11263477/

amdgpu is the only breakage left before Linux can be shipped on this
family of products.

Thanks
Daniel


Re: [PATCH] drm/amdgpu: always reset asic when going into suspend

2019-11-24 Thread Daniel Drake
On Fri, Nov 22, 2019 at 11:32 PM Alex Deucher  wrote:
> Do these patches help?
> https://patchwork.freedesktop.org/patch/341775/
> https://patchwork.freedesktop.org/patch/341968/

Unfortunately not. The original issue still exists (dead gfx after
resume from s2idle) and also when I trigger execution of the suspend
or runtime suspend routines the power usage increases around 1.5W as
before.

Have you confirmed that amdgpu s2idle is working on platforms you have in hand?

Thanks
Daniel

Re: [PATCH] drm/amdgpu: always reset asic when going into suspend

2019-10-16 Thread Daniel Drake
On Wed, Oct 16, 2019 at 2:43 AM Alex Deucher  wrote:
> Is s2idle actually powering down the GPU?

My understanding is that s2idle (at a high level) just calls all
devices' suspend routines and then puts the CPU into its deepest
idle state.
So if there is something special to be done to power off the GPU, I
believe that amdgpu is responsible for making arrangements for that to
happen.
In this case the amdgpu code already does:

pci_disable_device(dev->pdev);
pci_set_power_state(dev->pdev, PCI_D3hot);

And the PCI layer will call through to any appropriate ACPI methods
related to that low power state.

> Do you see a difference in power usage?  I think you are just working around
> the fact that the GPU never actually gets powered down.

I ran a series of experiments.

Base setup: no UI running, ran "setterm -powersave 1; setterm -blank
1" and waited 1 minute for screen to turn off.
Base power usage in this state is 4.7W as reported by BAT0/power_now

1. Run amdgpu_device_suspend(ddev, true, true); before my change
--> Power usage increases to 6.1W

2. Run amdgpu_device_suspend(ddev, true, true); with my change applied
--> Power usage increases to 6.0W

3. Put amdgpu device in runtime suspend
--> Power usage increases to 6.2W

4. Try unmodified suspend path but d3cold instead of d3hot
--> Power usage increases to 6.1W

So, all of the suspend schemes actually increase the power usage by
roughly the same amount, reset or not, with and without my patch :/
Any ideas?
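For anyone repeating these measurements, here is a minimal sketch of how the readings can be taken and compared. It assumes the battery exposes power_now in microwatts (the usual power_supply class convention) at the default path shown; both the path and the helper names are assumptions for illustration.

```python
def microwatts_to_watts(raw):
    """Convert a raw power_now reading (microwatts) to watts."""
    return int(raw) / 1_000_000


def read_power_watts(path="/sys/class/power_supply/BAT0/power_now"):
    """Read the current battery power draw in watts.

    The default path is an assumption; battery names vary by machine.
    """
    with open(path) as f:
        return microwatts_to_watts(f.read())


def power_deltas(base_watts, readings):
    """Return the increase over the base power for each experiment."""
    return {name: round(watts - base_watts, 1)
            for name, watts in readings.items()}


if __name__ == "__main__":
    # Figures reported in this message.
    readings = {
        "suspend, before change": 6.1,
        "suspend, with change": 6.0,
        "runtime suspend": 6.2,
        "d3cold": 6.1,
    }
    for name, delta in power_deltas(4.7, readings).items():
        print(f"{name}: +{delta} W")
```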

Thanks,
Daniel

[PATCH] drm/amdgpu: always reset asic when going into suspend

2019-10-15 Thread Daniel Drake
On Asus UX434DA (Ryzen7 3700U), upon resume from s2idle, the screen
turns on again and shows the pre-suspend image, but the display remains
frozen from that point onwards.

The kernel logs show errors:

 [drm] psp command failed and response status is (0x7)
 [drm] Fence fallback timer expired on ring sdma0
 [drm] Fence fallback timer expired on ring gfx
 amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests] *ERROR* IB test failed on gfx (-22).
 [drm:process_one_work] *ERROR* ib ring test failed (-22).
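When collecting logs from several machines, the failing ring and error code can be pulled out of lines like the ones above with a short script. This is a sketch only; the regular expression is written against the message format shown here and may need adjusting for other kernel versions.

```python
import errno
import re

# Matches e.g. "IB test failed on gfx (-22)", tolerating a line wrap
# between the ring name and the error code.
IB_FAIL_RE = re.compile(r"IB test failed on (\S+)\s*\((-?\d+)\)")


def find_ib_failures(log_text):
    """Return (ring, code, errno_name) for each IB test failure found."""
    failures = []
    for ring, code in IB_FAIL_RE.findall(log_text):
        code = int(code)
        name = errno.errorcode.get(abs(code), "unknown")
        failures.append((ring, code, name))
    return failures


if __name__ == "__main__":
    sample = ("amdgpu :03:00.0: [drm:amdgpu_ib_ring_tests] *ERROR* "
              "IB test failed on gfx (-22).")
    print(find_ib_failures(sample))
```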

This can also be reproduced with pm_test:
 # echo devices > /sys/power/pm_test
 # echo freeze > /sys/power/state

The same reproducer causes the same problem on Asus X512DK (Ryzen5 3500U)
even though that model is normally able to suspend and resume OK via S3.
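The reproducer can be scripted so that pm_test is always restored afterwards. The sketch below assumes the standard /sys/power/pm_test and /sys/power/state files (pm_test requires a kernel built with PM debugging support) and must be run as root; the function name and the power_dir parameter are illustrative only.

```python
from pathlib import Path


def run_pm_test_freeze(power_dir="/sys/power", level="devices"):
    """Run one suspend-to-idle test cycle at the given pm_test level.

    power_dir is a parameter only so the function can be exercised
    against a scratch directory; on a real system leave the default.
    """
    power = Path(power_dir)
    pm_test = power / "pm_test"
    pm_test.write_text(level)
    try:
        # Blocks until the test suspend/resume cycle has completed.
        (power / "state").write_text("freeze")
    finally:
        # Restore normal suspend behaviour.
        pm_test.write_text("none")


# Usage on an affected machine (as root): run_pm_test_freeze()
```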

Experimenting, I observed that this error condition can be invoked on
any amdgpu product by executing in succession:

  amdgpu_device_suspend(drm_dev, true, true);
  amdgpu_device_resume(drm_dev, true, true);

i.e. it appears that the resume routine is unable to get the device out
of suspended state, except for the S3 suspend case where it presumably has
a bit of extra help from the firmware or hardware.

However, I also observed that the runtime suspend/resume routines work
OK when tested like this, which led me to the key difference in these
two cases: the ASIC reset, which only happens in the runtime suspend path.

Since it takes less than 1ms, we should do the ASIC reset in all
suspend paths, fixing resume from s2idle on these products.

Link: https://bugs.freedesktop.org/show_bug.cgi?id=111811
Signed-off-by: Daniel Drake 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5a1939dbd4e3..7f4870e974fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3082,15 +3082,16 @@ int amdgpu_device_suspend(struct drm_device *dev, bool suspend, bool fbcon)
 */
amdgpu_bo_evict_vram(adev);
 
+   amdgpu_asic_reset(adev);
+   r = amdgpu_asic_reset(adev);
+   if (r)
+   DRM_ERROR("amdgpu asic reset failed\n");
+
pci_save_state(dev->pdev);
if (suspend) {
/* Shut down the device */
pci_disable_device(dev->pdev);
pci_set_power_state(dev->pdev, PCI_D3hot);
-   } else {
-   r = amdgpu_asic_reset(adev);
-   if (r)
-   DRM_ERROR("amdgpu asic reset failed\n");
}
 
return 0;
-- 
2.20.1


Re: amdgpu hangs on boot or shutdown on AMD Raven Ridge CPU (Engineer Sample)

2018-05-08 Thread Daniel Drake
Hi Alex,

On Thu, Apr 19, 2018 at 4:13 PM, Alex Deucher  wrote:
>>>> https://bugs.freedesktop.org/show_bug.cgi?id=105684
>>>
>>> No progress made on that bug report so far.
>>> What can we do to help this advance?
>>
>> Ping, any news here? How can we help advance on this bug?
>
> Can you try one of these branches?
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.18-wip
> do they work any better?

It's been over 3 months since we reported this bug by email, over 6
weeks since we reported it on bugzilla, and still there has been no
meaningful diagnostics help from AMD. This follows a similar pattern
to what we have seen with other issues prior to this one.

What can we do so that this bug gets some attention from your team?

Secondarily https://bugs.freedesktop.org/show_bug.cgi?id=106228 is
another bug that needs attention. We have a growing number of consumer
platforms affected by this. When booted, the amdgpu screen brightness
value is incorrectly read back as 0, which systemd will then store on
shutdown. On next boot, it restores the very low brightness level.
This can reproduce out of the box on Fedora, Ubuntu, etc.

Thanks,
Daniel


Re: amdgpu hangs on boot or shutdown on AMD Raven Ridge CPU (Engineer Sample)

2018-04-03 Thread Daniel Drake
On Thu, Mar 22, 2018 at 3:09 AM, Daniel Drake <dr...@endlessm.com> wrote:
> On Tue, Feb 20, 2018 at 10:18 PM, Alex Deucher <alexdeuc...@gmail.com> wrote:
>>> It seems that we are not alone seeing amdgpu-induced stability
>>> problems on multiple Raven Ridge platforms.
>>> https://www.phoronix.com/scan.php?page=news_item&px=AMD-Raven-Ridge-Mobo-Linux
>>>
>>> AMD, what can we do to help?
>>
>> Please file bugs:
>> https://bugs.freedesktop.org
>
> Sorry for the delayed response. We're still seeing serious instability
> here even on the latest kernel. Filed
> https://bugs.freedesktop.org/show_bug.cgi?id=105684

No progress made on that bug report so far.
What can we do to help this advance?

Thanks,
Daniel


Re: amdgpu hangs on boot or shutdown on AMD Raven Ridge CPU (Engineer Sample)

2018-03-22 Thread Daniel Drake
Hi Alex,

On Tue, Feb 20, 2018 at 10:18 PM, Alex Deucher  wrote:
>> It seems that we are not alone seeing amdgpu-induced stability
>> problems on multiple Raven Ridge platforms.
>> https://www.phoronix.com/scan.php?page=news_item&px=AMD-Raven-Ridge-Mobo-Linux
>>
>> AMD, what can we do to help?
>
> Please file bugs:
> https://bugs.freedesktop.org

Sorry for the delayed response. We're still seeing serious instability
here even on the latest kernel. Filed
https://bugs.freedesktop.org/show_bug.cgi?id=105684

Thanks,
Daniel


Re: amdgpu hangs on boot or shutdown on AMD Raven Ridge CPU (Engineer Sample)

2018-02-19 Thread Daniel Drake
Hi,

> >>> We are working with new laptops that have the AMD Raven Ridge
> >>> chipset with this `/proc/cpuinfo`
> >>> https://gist.github.com/mschiu77/b06dba574e89b9a30cf4c450eaec49bc
> >>>
> >>> With the latest kernel 4.15, there're lots of different
> >>> panics/oops during boot so no chance to get into X. It also happens
> >>> during shutdown. Then I tried to build kernel from
> >>> git://people.freedesktop.org/~agd5f/linux on branch
> >>> amd-staging-drm-next with head on commit "drm: Fix trailing semicolon"
> >>> and update the linux-firmware. Things seem to get better, only 1 oops
> >>> observed. Here's the oops
> >>> https://gist.github.com/mschiu77/1a68f27272b24775b2040acdb474cdd3.

It seems that we are not alone seeing amdgpu-induced stability
problems on multiple Raven Ridge platforms.
https://www.phoronix.com/scan.php?page=news_item&px=AMD-Raven-Ridge-Mobo-Linux

AMD, what can we do to help?

Thanks!
Daniel


[PATCH 4.15] drm/amd/display: call set csc_default if enable adjustment is false

2017-12-29 Thread Daniel Drake
From: Yue Hin Lau <yuehin@amd.com>

Signed-off-by: Yue Hin Lau <yuehin@amd.com>
Reviewed-by: Eric Bernstein <eric.bernst...@amd.com>
Acked-by: Harry Wentland <harry.wentl...@amd.com>
Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
[dr...@endlessm.com: backport to 4.15]
Signed-off-by: Daniel Drake <dr...@endlessm.com>
---
 drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp.h  | 2 +-
 drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp_cm.c   | 6 ++----
 drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c | 2 ++
 drivers/gpu/drm/amd/display/dc/inc/hw/dpp.h   | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

Testing Acer Aspire TC-380 engineering sample (Raven Ridge), the display
comes up with an excessively green tint. This patch (from amd-staging-drm-next)
solves the issue. Can it be included in Linux 4.15?

diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp.h b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp.h
index a9782b1aba47..34daf895f848 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp.h
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp.h
@@ -1360,7 +1360,7 @@ void dpp1_cm_set_output_csc_adjustment(
 
 void dpp1_cm_set_output_csc_default(
struct dpp *dpp_base,
-   const struct default_adjustment *default_adjust);
+   enum dc_color_space colorspace);
 
 void dpp1_cm_set_gamut_remap(
struct dpp *dpp,
diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp_cm.c b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp_cm.c
index 40627c244bf5..ed1216b53465 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp_cm.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_dpp_cm.c
@@ -225,14 +225,13 @@ void dpp1_cm_set_gamut_remap(
 
 void dpp1_cm_set_output_csc_default(
struct dpp *dpp_base,
-   const struct default_adjustment *default_adjust)
+   enum dc_color_space colorspace)
 {
 
struct dcn10_dpp *dpp = TO_DCN10_DPP(dpp_base);
uint32_t ocsc_mode = 0;
 
-   if (default_adjust != NULL) {
-   switch (default_adjust->out_color_space) {
+   switch (colorspace) {
case COLOR_SPACE_SRGB:
case COLOR_SPACE_2020_RGB_FULLRANGE:
ocsc_mode = 0;
@@ -253,7 +252,6 @@ void dpp1_cm_set_output_csc_default(
case COLOR_SPACE_UNKNOWN:
default:
break;
-   }
}
 
REG_SET(CM_OCSC_CONTROL, 0, CM_OCSC_MODE, ocsc_mode);
diff --git a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
index 961ad5c3b454..05dc01e54531 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
+++ b/drivers/gpu/drm/amd/display/dc/dcn10/dcn10_hw_sequencer.c
@@ -2097,6 +2097,8 @@ static void program_csc_matrix(struct pipe_ctx *pipe_ctx,
tbl_entry.color_space = color_space;
//tbl_entry.regval = matrix;
pipe_ctx->plane_res.dpp->funcs->opp_set_csc_adjustment(pipe_ctx->plane_res.dpp, &tbl_entry);
+   } else {
+   pipe_ctx->plane_res.dpp->funcs->opp_set_csc_default(pipe_ctx->plane_res.dpp, colorspace);
}
 }
 static bool is_lower_pipe_tree_visible(struct pipe_ctx *pipe_ctx)
diff --git a/drivers/gpu/drm/amd/display/dc/inc/hw/dpp.h b/drivers/gpu/drm/amd/display/dc/inc/hw/dpp.h
index 83a68460edcd..9420dfb94d39 100644
--- a/drivers/gpu/drm/amd/display/dc/inc/hw/dpp.h
+++ b/drivers/gpu/drm/amd/display/dc/inc/hw/dpp.h
@@ -64,7 +64,7 @@ struct dpp_funcs {
 
void (*opp_set_csc_default)(
struct dpp *dpp,
-   const struct default_adjustment *default_adjust);
+   enum dc_color_space colorspace);
 
void (*opp_set_csc_adjustment)(
struct dpp *dpp,
-- 
2.14.1



Re: [PATCH] iommu/amd: flush IOTLB for specific domains only (v3)

2017-09-12 Thread Daniel Drake
Hi,

On Tue, May 30, 2017 at 3:38 PM, Nath, Arindam  wrote:
>>-----Original Message-----
>>From: Joerg Roedel [mailto:j...@8bytes.org]
>>Sent: Monday, May 29, 2017 8:09 PM
>>To: Nath, Arindam ; Lendacky, Thomas
>>
>>Cc: io...@lists.linux-foundation.org; amd-gfx@lists.freedesktop.org;
>>Deucher, Alexander ; Bridgman, John
>>; dr...@endlessm.com; Suthikulpanit, Suravee
>>; li...@endlessm.com; Craig Stein
>>; mic...@daenzer.net; Kuehling, Felix
>>; sta...@vger.kernel.org
>>Subject: Re: [PATCH] iommu/amd: flush IOTLB for specific domains only (v3)
>>
>>Hi Arindam,
>>
>>I met Tom Lendacky in Nuremberg last week and he told me he is
>>working on the same area of the code that this patch is for. His reason
>>for touching this code was to solve some locking problems. Maybe you two
>>can work together on a joint approach to improve this?
>
> Sure Joerg, I will work with Tom.

What was the end result here? I see that the code has been reworked in
4.13 so your original patch no longer applies. Is the reworked version
also expected to solve the original issue?

Thanks
Daniel


amdgpu display corruption and hang on AMD A10-9620P

2017-05-09 Thread Daniel Drake
Hi,

We are working with new laptops that have the AMD Bristol Ridge
chipset with this SoC:

AMD A10-9620P RADEON R5, 10 COMPUTE CORES 4C+6G

I think this is the Bristol Ridge chipset.

During boot, the display becomes unusable at the point where the
amdgpu driver loads. You can see at least two horizontal lines of
garbage at this point. We have reproduced on 4.8, 4.10 and linus
master (early 4.12).

Photo: http://pasteboard.co/qrC9mh4p.jpg

Getting logs is tricky because the system appears to freeze at that point.

Is this a known issue? Anything we can do to help diagnosis?

Thanks
Daniel


Re: [PATCH] iommu/amd: flush IOTLB for specific domains only

2017-05-08 Thread Daniel Drake
On Wed, Apr 5, 2017 at 9:01 AM, Nath, Arindam <arindam.n...@amd.com> wrote:
>
> >-----Original Message-----
> >From: Daniel Drake [mailto:dr...@endlessm.com]
> >Sent: Thursday, March 30, 2017 7:15 PM
> >To: Nath, Arindam
> >Cc: j...@8bytes.org; Deucher, Alexander; Bridgman, John; amd-
> >g...@lists.freedesktop.org; io...@lists.linux-foundation.org; Suthikulpanit,
> >Suravee; Linux Upstreaming Team
> >Subject: Re: [PATCH] iommu/amd: flush IOTLB for specific domains only
> >
> >On Thu, Mar 30, 2017 at 12:23 AM, Nath, Arindam <arindam.n...@amd.com>
> >wrote:
> >> Daniel, did you get chance to test this patch?
> >
> >Not yet. Should we test it alone or alongside "PCI: Blacklist AMD
> >Stoney GPU devices for ATS"?
>
> Daniel, any luck with this patch?

Sorry for the delay. The patch appears to be working fine.

Daniel


Re: [PATCH] iommu/amd: flush IOTLB for specific domains only

2017-03-27 Thread Daniel Drake
Hi Arindam,

You CC'd me on this - does this mean that it is a fix for the issue
described in the thread "amd-iommu: can't boot with amdgpu, AMD-Vi:
Completion-Wait loop timed out" ?

Thanks
Daniel


On Mon, Mar 27, 2017 at 12:17 AM,   wrote:
> From: Arindam Nath 
>
> The idea behind flush queues is to defer the IOTLB flushing
> for domains for which the mappings are no longer valid. We
> add such domains in queue_add(), and when the queue size
> reaches FLUSH_QUEUE_SIZE, we perform __queue_flush().
>
> Since we have already taken lock before __queue_flush()
> is called, we need to make sure the IOTLB flushing is
> performed as quickly as possible.
>
> In the current implementation, we perform IOTLB flushing
> for all domains irrespective of which ones were actually
> added in the flush queue initially. This can be quite
> expensive especially for domains for which unmapping is
> not required at this point of time.
>
> This patch makes use of domain information in
> 'struct flush_queue_entry' to make sure we only flush
> IOTLBs for domains who need it, skipping others.
>
> Signed-off-by: Arindam Nath 
> ---
>  drivers/iommu/amd_iommu.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 98940d1..6a9a048 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2227,15 +2227,16 @@ static struct iommu_group *amd_iommu_device_group(struct device *dev)
>
>  static void __queue_flush(struct flush_queue *queue)
>  {
> -   struct protection_domain *domain;
> -   unsigned long flags;
> int idx;
>
> -   /* First flush TLB of all known domains */
> -   spin_lock_irqsave(&amd_iommu_pd_lock, flags);
> -   list_for_each_entry(domain, &amd_iommu_pd_list, list)
> -   domain_flush_tlb(domain);
> -   spin_unlock_irqrestore(&amd_iommu_pd_lock, flags);
> +   /* First flush TLB of all domains which were added to flush queue */
> +   for (idx = 0; idx < queue->next; ++idx) {
> +   struct flush_queue_entry *entry;
> +
> +   entry = queue->entries + idx;
> +
> +   domain_flush_tlb(&entry->dma_dom->domain);
> +   }
>
> /* Wait until flushes have completed */
> domain_flush_complete(NULL);
> --
> 1.9.1
>
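The difference between the old and new behaviour in the quoted patch can be modelled in a few lines of user-space code. This is a toy model only: domains are plain strings, queue entries are dictionaries, and the function names are invented. It shows which domains get an IOTLB flush in each scheme, nothing more.

```python
def flush_all_domains(all_domains, queue_entries):
    """Old behaviour: flush every known domain, whatever is queued."""
    return list(all_domains)


def flush_queued_domains(all_domains, queue_entries):
    """Patched behaviour: flush only the domain of each queued entry."""
    return [entry["domain"] for entry in queue_entries]


if __name__ == "__main__":
    # Hypothetical domain names; only "gpu" has pending unmaps.
    domains = ["gpu", "nic", "nvme", "usb"]
    queue = [{"domain": "gpu"}, {"domain": "gpu"}]
    print(len(flush_all_domains(domains, queue)))
    print(len(flush_queued_domains(domains, queue)))
```

As in the patch, the per-entry loop does not deduplicate, so a domain queued twice is flushed twice; it still avoids touching domains that have nothing queued.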


Re: amd-iommu: can't boot with amdgpu, AMD-Vi: Completion-Wait loop timed out

2017-03-27 Thread Daniel Drake
Hi Joerg,

Thanks for looking into this. We confirm that this workaround avoids
the iommu log spam and that amdgpu appears to be working fine with it.

Daniel


On Wed, Mar 22, 2017 at 5:22 AM, j...@8bytes.org  wrote:
> On Tue, Mar 21, 2017 at 04:30:55PM +, Deucher, Alexander wrote:
>> > I am preparing a debug-patch that disables ATS for these GPUs so someone
>> > with such a chip can test it.
>>
>> Thanks Joerg.
>
> Here is a debug patch, using the hard hammer of disabling the use of ATS
> completely in the AMD IOMMU driver. If it fixes the issue I am going to
> write a more upstreamable version.
>
> But for now, please test if this fixes the issue.
>
> Thanks,
>
> Joerg
>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 98940d1..f019aa6 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -467,7 +467,7 @@ static int iommu_init_device(struct device *dev)
> struct amd_iommu *iommu;
>
> iommu = amd_iommu_rlookup_table[dev_data->devid];
> -   dev_data->iommu_v2 = iommu->is_iommu_v2;
> +   dev_data->iommu_v2 = false;
> }
>
> dev->archdata.iommu = dev_data;
> diff --git a/drivers/iommu/amd_iommu_init.c b/drivers/iommu/amd_iommu_init.c
> index 6130278..41d0e64 100644
> --- a/drivers/iommu/amd_iommu_init.c
> +++ b/drivers/iommu/amd_iommu_init.c
> @@ -171,7 +171,7 @@ int amd_iommus_present;
>
>  /* IOMMUs have a non-present cache? */
>  bool amd_iommu_np_cache __read_mostly;
> -bool amd_iommu_iotlb_sup __read_mostly = true;
> +bool amd_iommu_iotlb_sup __read_mostly = false;
>
>  u32 amd_iommu_max_pasid __read_mostly = ~0;
>


Re: amd-iommu: can't boot with amdgpu, AMD-Vi: Completion-Wait loop timed out

2017-03-17 Thread Daniel Drake
Hi,

On Mon, Mar 13, 2017 at 2:01 PM, Deucher, Alexander
 wrote:
> > We are unable to boot Acer Aspire E5-553G (AMD FX-9800P RADEON R7) nor
> > Acer Aspire E5-523 with standard configurations because during boot
> > the screen is flooded with the following error message over and over:
> >
> >   AMD-Vi: Completion-Wait loop timed out
>
> We ran into similar issues and bisected it to commit 
> b1516a14657acf81a587e9a6e733a881625eee53.  I'm not too familiar with the 
> IOMMU hardware to know if this is an iommu or display driver issue yet.

We can confirm that reverting this commit solves the issue.

Given that that commit is an optimization, but it has introduced a
regression on multiple platforms, and has been like this for 8 months,
it would be common practice to now revert this patch upstream until
the regression is fixed. Could you please send a new patch to do this?

Also, we would be happy to test any real solutions to this issue while
we still have the affected units in hand.

Thanks
Daniel