Re: [Nouveau] [PATCH v5 07/14] drm/dp_mst: Don't forget to update port->input in drm_dp_mst_handle_conn_stat()
On Mon, Oct 21, 2019 at 10:36:02PM -0400, Lyude Paul wrote: > This probably hasn't caused any problems up until now since it's > probably nearly impossible to encounter this in the wild, however if we > were to receive a connection status notification from the MST hub after > resume while we're in the middle of reprobing the link addresses for a > topology then there's a much larger chance that a port could have > changed from being an output port to input port (or vice versa). If we > forget to update this bit of information, we'll potentially ignore a > valid PDT change on a downstream port because we think it's an input > port. > > So, make sure we read the input_port field in connection status > notifications in drm_dp_mst_handle_conn_stat() to prevent this from > happening once we've implemented suspend/resume reprobing. > > Cc: Juston Li > Cc: Imre Deak > Cc: Ville Syrjälä > Cc: Harry Wentland > Cc: Daniel Vetter > Signed-off-by: Lyude Paul > Reviewed-by: Sean Paul > --- > drivers/gpu/drm/drm_dp_mst_topology.c | 52 +++ > 1 file changed, 38 insertions(+), 14 deletions(-) > > diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c > b/drivers/gpu/drm/drm_dp_mst_topology.c > index 7bf4db91ff90..c8e218b902ae 100644 > --- a/drivers/gpu/drm/drm_dp_mst_topology.c > +++ b/drivers/gpu/drm/drm_dp_mst_topology.c > @@ -2079,18 +2079,40 @@ drm_dp_mst_handle_conn_stat(struct drm_dp_mst_branch > *mstb, > { > struct drm_dp_mst_topology_mgr *mgr = mstb->mgr; > struct drm_dp_mst_port *port; > - int old_ddps; > - bool dowork = false; > + int old_ddps, ret; > + u8 new_pdt; > + bool dowork = false, create_connector = false; > > port = drm_dp_get_port(mstb, conn_stat->port_number); > if (!port) > return; > > - /* Locking is only needed if the port's exposed to userspace */ > - if (port->connector) > + if (port->connector) { > + if (!port->input && conn_stat->input_port) { > + /* > + * We can't remove a connector from an already exposed > + * port, so just throw the port out and make sure we > + * reprobe the link address of it's parent MSTB > + */ > + drm_dp_mst_topology_unlink_port(mgr, port); > + mstb->link_address_sent = false; > + dowork = true; > + goto out; > + } > + > + /* > + * Locking is only needed if the port's exposed to userspace > + */ optional nit: this will still fit on one line > drm_modeset_lock(>base.lock, NULL); > + } else if (port->input && !conn_stat->input_port) { > + create_connector = true; > + /* Reprobe link address so we get num_sdp_streams */ > + mstb->link_address_sent = false; > + dowork = true; > + } > > old_ddps = port->ddps; > + port->input = conn_stat->input_port; > port->mcs = conn_stat->message_capability_status; > port->ldps = conn_stat->legacy_device_plug_status; > port->ddps = conn_stat->displayport_device_plug_status; > @@ -2103,21 +2125,23 @@ drm_dp_mst_handle_conn_stat(struct drm_dp_mst_branch > *mstb, > } > } > > - if (!port->input) { > - int ret = drm_dp_port_set_pdt(port, > - conn_stat->peer_device_type); > - if (ret == 1) { > - dowork = true; > - } else if (ret < 0) { > - DRM_ERROR("Failed to change PDT for port %p: %d\n", > - port, ret); > - dowork = false; > - } > + new_pdt = port->input ? DP_PEER_DEVICE_NONE : > conn_stat->peer_device_type; > + > + ret = drm_dp_port_set_pdt(port, new_pdt); > + if (ret == 1) { > + dowork = true; > + } else if (ret < 0) { > + DRM_ERROR("Failed to change PDT for port %p: %d\n", > + port, ret); > + dowork = false; > } > > if (port->connector) > drm_modeset_unlock(>base.lock); > + else if (create_connector) > + drm_dp_mst_port_add_connector(mstb, port); > > +out: > drm_dp_mst_topology_put_port(port); > if (dowork) > queue_work(system_long_wq, >mgr->work); > -- > 2.21.0 > -- Sean Paul, Software Engineer, Google / Chromium OS ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] [Bug 112070] H/W Acceleration sufficiently buggy on Debian to hard lock machine GeForce 7600 GO
https://bugs.freedesktop.org/show_bug.cgi?id=112070 --- Comment #23 from Ilia Mirkin --- Looks like a crash in glmark2 itself. Probably some unexpected condition... -- You are receiving this mail because: You are the assignee for the bug.___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] [Bug 112070] H/W Acceleration sufficiently buggy on Debian to hard lock machine GeForce 7600 GO
https://bugs.freedesktop.org/show_bug.cgi?id=112070 --- Comment #22 from Doobz --- Created attachment 145795 --> https://bugs.freedesktop.org/attachment.cgi?id=145795=edit Backtrace glmark2 segfault nv4b mesa1836 Let me know if I messed up. And thanks for taking the time to look into this. -- You are receiving this mail because: You are the assignee for the bug.___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] [Bug 112070] H/W Acceleration sufficiently buggy on Debian to hard lock machine GeForce 7600 GO
https://bugs.freedesktop.org/show_bug.cgi?id=112070 --- Comment #21 from Ilia Mirkin --- b000 range is the media range. Just remove dri/libvdpau_nouveau.so -- I'm sure that's not helping. And any similar va-api thing. -- You are receiving this mail because: You are the assignee for the bug.___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v5 05/14] drm/dp_mst: Add probe_lock
On Mon, Oct 21, 2019 at 10:36:00PM -0400, Lyude Paul wrote: > Currently, MST lacks locking in a lot of places that really should have > some sort of locking. Hotplugging and link address code paths are some > of the offenders here, as there is actually nothing preventing us from > running a link address probe while at the same time handling a > connection status update request - something that's likely always been > possible but never seen in the wild because hotplugging has been broken > for ages now (with the exception of amdgpu, for reasons I don't think > are worth digging into very far). > > Note: I'm going to start using the term "in-memory topology layout" here > to refer to drm_dp_mst_port->mstb and drm_dp_mst_branch->ports. > > Locking in these places is a little tougher then it looks though. > Generally we protect anything having to do with the in-memory topology > layout under >lock. But this becomes nearly impossible to do from > the context of link address probes due to the fact that >lock is > usually grabbed under random various modesetting locks, meaning that > there's no way we can just invert the >lock order and keep it > locked throughout the whole process of updating the topology. > > Luckily there are only two workers which can modify the in-memory > topology layout: drm_dp_mst_up_req_work() and > drm_dp_mst_link_probe_work(), meaning as long as we prevent these two > workers from traveling the topology layout in parallel with the intent > of updating it we don't need to worry about grabbing >lock in these > workers for reads. We only need to grab >lock in these workers for > writes, so that readers outside these two workers are still protected > from the topology layout changing beneath them. > > So, add the new >probe_lock and use it in both > drm_dp_mst_link_probe_work() and drm_dp_mst_up_req_work(). Additionally, > add some more detailed explanations for how this locking is intended to > work to drm_dp_mst_port->mstb and drm_dp_mst_branch->ports. This seems like a good solution to me, thanks for working this through! Any chance we could add a WARN_ON(!mutex_is_locked(>probe_lock)); somewhere centrally called by all paths modifying the in-memory topology layout? drm_dp_add_port perhaps? That way we have a fallback in case something else starts mucking with the topology. Other than that, Reviewed-by: Sean Paul > > Signed-off-by: Lyude Paul > Cc: Juston Li > Cc: Imre Deak > Cc: Ville Syrjälä > Cc: Harry Wentland > Cc: Daniel Vetter > Cc: Sean Paul > --- > drivers/gpu/drm/drm_dp_mst_topology.c | 28 ++- > include/drm/drm_dp_mst_helper.h | 32 +++ > 2 files changed, 46 insertions(+), 14 deletions(-) > > diff --git a/drivers/gpu/drm/drm_dp_mst_topology.c > b/drivers/gpu/drm/drm_dp_mst_topology.c > index 08c316a727df..11d842f0bff5 100644 > --- a/drivers/gpu/drm/drm_dp_mst_topology.c > +++ b/drivers/gpu/drm/drm_dp_mst_topology.c > @@ -2147,37 +2147,40 @@ static void drm_dp_check_and_send_link_address(struct > drm_dp_mst_topology_mgr *m > struct drm_dp_mst_branch *mstb) > { > struct drm_dp_mst_port *port; > - struct drm_dp_mst_branch *mstb_child; > + > if (!mstb->link_address_sent) > drm_dp_send_link_address(mgr, mstb); > > list_for_each_entry(port, >ports, next) { > - if (port->input) > - continue; > + struct drm_dp_mst_branch *mstb_child = NULL; > > - if (!port->ddps) > + if (port->input || !port->ddps) > continue; > > if (!port->available_pbn) > drm_dp_send_enum_path_resources(mgr, mstb, port); > > - if (port->mstb) { > + if (port->mstb) > mstb_child = drm_dp_mst_topology_get_mstb_validated( > mgr, port->mstb); > - if (mstb_child) { > - drm_dp_check_and_send_link_address(mgr, > mstb_child); > - drm_dp_mst_topology_put_mstb(mstb_child); > - } > + > + if (mstb_child) { > + drm_dp_check_and_send_link_address(mgr, mstb_child); > + drm_dp_mst_topology_put_mstb(mstb_child); > } > } > } > > static void drm_dp_mst_link_probe_work(struct work_struct *work) > { > - struct drm_dp_mst_topology_mgr *mgr = container_of(work, struct > drm_dp_mst_topology_mgr, work); > + struct drm_dp_mst_topology_mgr *mgr = > + container_of(work, struct drm_dp_mst_topology_mgr, work); > + struct drm_device *dev = mgr->dev; > struct drm_dp_mst_branch *mstb; > int ret; > > + mutex_lock(>probe_lock); > + > mutex_lock(>lock); > mstb = mgr->mst_primary; > if (mstb) { > @@ -2190,6 +2193,7 @@ static void drm_dp_mst_link_probe_work(struct >
[Nouveau] [Bug 112070] H/W Acceleration sufficiently buggy on Debian to hard lock machine GeForce 7600 GO
https://bugs.freedesktop.org/show_bug.cgi?id=112070 --- Comment #20 from Doobz --- Created attachment 145794 --> https://bugs.freedesktop.org/attachment.cgi?id=145794=edit dmesg output from NV63 GF7100 desktop -- You are receiving this mail because: You are the assignee for the bug.___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
[Nouveau] [Bug 112070] H/W Acceleration sufficiently buggy on Debian to hard lock machine GeForce 7600 GO
https://bugs.freedesktop.org/show_bug.cgi?id=112070 --- Comment #19 from Doobz --- Not managed to do the backtrace yet, but thought I'd attach dmesg from 7100-based desktop. Had it sitting cycling gl screensavers for a while. Lots of these:- [ 45.639455] nouveau :00:10.0: bus: MMIO write of 00540001 FAULT at 00b000 And some of these:- [ 825.282398] perf: interrupt took too long (5006 > 4996), lowering kernel.perf_event_max_sample_rate to 39750 In other news, same machine hard locked as soon as I tried to play an local video file. -- You are receiving this mail because: You are the assignee for the bug.___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
On Tue, Oct 22, 2019 at 2:45 PM Mika Westerberg wrote: > > On Tue, Oct 22, 2019 at 11:16:14AM +0200, Karol Herbst wrote: > > I think there is something I totally forgot about: > > > > When there was never a driver bound to the GPU, and if runtime power > > management gets enabled on that device, runtime suspend/resume works > > as expected (I am not 100% sure on if that always works, but I will > > recheck that). > > AFAIK, if there is no driver bound to the PCI device it is left to D0 > regardless of the runtime PM state which could explain why it works in > that case (it is never put into D3hot). > > I looked at the acpidump you sent and there is one thing that may > explain the differences between Windows and Linux. Not sure if you were > aware of this already, though. The power resource PGOF() method has > this: > >If (((OSYS <= 0x07D9) || ((OSYS == 0x07DF) && (_REV == 0x05 { > ... >} > I think this is the fallback to some older method of runtime suspending the device, and I think it will end up touching different registers on the bridge controller which do not show the broken behaviour. You'll find references to following variables which all cause a link to be powered down: Q0L2 (newest), P0L2, P0LD (oldest, I think). Maybe I remember incorrectly and have to read the code again... okay, the fallback path uses P0LD indeed. That's actually the only register of those being documented by Intel afaik. > If I read it right, the later condition tries to detect Linux which > fails nowadays but if you have acpi_rev_override in the command line (or > the machine is listed in acpi_rev_dmi_table) this check passes and does > some magic which is not clear to me. There is similar in PGON() side > which is used to turn the device back on. > > You can check what actually happens when _ON()/_OFF() is called by > passing something like below to the kernel command line: > > acpi.trace_debug_layer=0x80 acpi.trace_debug_level=0x10 > acpi.trace_method_name=\_SB.PCI0.PEG0.PG00._ON acpi.trace_state=method > > (See also Documentation/firmware-guide/acpi/method-tracing.rst). > > Trace goes to system dmesg. This sounds to be very helpful, I'll give it a try. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau
Re: [Nouveau] [PATCH v3] pci: prevent putting nvidia GPUs into lower device states on certain intel bridges
I think there is something I totally forgot about: When there was never a driver bound to the GPU, and if runtime power management gets enabled on that device, runtime suspend/resume works as expected (I am not 100% sure on if that always works, but I will recheck that). In the past I know that at some point I "bisected" the nouveau driver to figure out what actually breaks it and found out that some script executed with the help of an on-chip engine (signed script, signed firmware, both vbios provided) makes it break. Debugging the script lead me to the PCIe link speed changes done inside the script breaking it. But as "reverting" the speed change didn't make it work reliably again, I think I need to get back on that and check if it's something else. I will try to convert the script into C or python code to make it more accessible to debug and hopefully I'll find something I overlooked the last time. On Mon, Oct 21, 2019 at 6:40 PM Karol Herbst wrote: > > On Mon, Oct 21, 2019 at 5:46 PM Mika Westerberg > wrote: > > > > On Mon, Oct 21, 2019 at 04:49:09PM +0200, Karol Herbst wrote: > > > On Mon, Oct 21, 2019 at 4:09 PM Mika Westerberg > > > wrote: > > > > > > > > On Mon, Oct 21, 2019 at 03:54:09PM +0200, Karol Herbst wrote: > > > > > > I really would like to provide you more information about such > > > > > > workaround but I'm not aware of any ;-) I have not seen any issues > > > > > > like > > > > > > this when D3cold is properly implemented in the platform. That's > > > > > > why > > > > > > I'm bit skeptical that this has anything to do with specific Intel > > > > > > PCIe > > > > > > ports. More likely it is some power sequence in the _ON/_OFF() > > > > > > methods > > > > > > that is run differently on Windows. > > > > > > > > > > yeah.. maybe. I really don't know what's the actual root cause. I just > > > > > know that with this workaround it works perfectly fine on my and some > > > > > other systems it was tested on. Do you know who would be best to > > > > > approach to get proper documentation about those methods and what are > > > > > the actual prerequisites of those methods? > > > > > > > > Those should be documented in the ACPI spec. Chapter 7 should explain > > > > power resources and the device power methods in detail. > > > > > > either I looked up the wrong spec or the documentation isn't really > > > saying much there. > > > > Well it explains those methods, _PSx, _PRx and _ON()/_OFF(). In case of > > PCIe device you also want to check PCIe spec. PCIe 5.0 section 5.8 "PCI > > Function Power State Transitions" has a picture about the supported > > power state transitions and there we can find that function must be in > > D3hot before it can be transitioned into D3cold so if the _OFF() for > > example blindly assumes that the device is in D0 when it is called, it > > is a bug in the BIOS. > > > > BTW, where can I find acpidump of such system? > > I am sure it's uploaded somewhere already. But it's not an issue of > just one system. It's essentially hitting every single laptop with a > skylake or kaby lake CPU + Nvidia GPU. I didn't see any system where > it's actually working right now (and we are pestering nvidia about > this issue for over a year already with no solution) > > I've attached an acpidump from my system. ___ Nouveau mailing list Nouveau@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/nouveau