Re: [RFC PATCH 00/10] Device Memory TCP
On 7/10/23 15:32, Mina Almasry wrote:
> * TL;DR:
>
> Device memory TCP (devmem TCP) is a proposal for transferring data to and/or from device memory efficiently, without bouncing the data to a host memory buffer.

(I'm writing this as someone who might plausibly use this mechanism, but I don't think I'm very likely to end up working on the kernel side, unless I somehow feel extremely inspired to implement it for i40e.)

I looked at these patches and the GVE tree, and I'm trying to wrap my head around the data path. As I understand it, for RX:

1. The GVE driver notices that the queue is programmed to use devmem, and it programs the NIC to copy packet payloads to the devmem that has been programmed.
2. The NIC receives the packet and copies the header to kernel memory and the payload to dma-buf memory.
3. The kernel tells userspace where in the dma-buf the data is.
4. Userspace does something with the data.
5. Userspace does DONTNEED to recycle the memory and make it available for new received packets.

Did I get this right?

This seems a bit awkward if there's any chance that packets not intended for the target device end up in the rxq.

I'm wondering if a more capable if somewhat higher latency model could work where the NIC stores received packets in its own device memory. Then userspace (or the kernel or a driver or whatever) could initiate a separate DMA from the NIC to the final target *after* reading the headers. Can the hardware support this?

Another way of putting this is: steering received data to a specific device based on the *receive queue* forces the logic selecting a destination device to be the same as the logic selecting the queue. RX steering logic is pretty limited on most hardware (as far as I know -- certainly I've never had much luck doing anything especially intelligent with RX flow steering, and I've tried on a couple of different brands of supposedly fancy NICs). But Linux has very nice capabilities to direct packets, in software, to where they are supposed to go, and it would be nice if all that logic could just work, scalably, with device memory. If Linux could examine headers *before* the payload gets DMAed to wherever it goes, I think this could plausibly work quite nicely. One could even have an easy-to-use interface in which one directs a *socket* to a PCIe device. I expect, although I've never looked at the datasheets, that the kernel could even efficiently make rx decisions based on data in device memory on upcoming CXL NICs where device memory could participate in the host cache hierarchy.

My real ulterior motive is that I think it would be great to use an ability like this for DPDK-like uses. Wouldn't it be nifty if I could open a normal TCP socket, then, after it's open, ask the kernel to kindly DMA the results directly to my application memory (via udmabuf, perhaps)? Or have a whole VLAN or macvlan get directed to a userspace queue, etc?

It also seems a bit odd to me that the binding from rxq to dma-buf is established by programming the dma-buf. This makes the security model (and the mental model) awkward -- this binding is a setting on the *queue*, not the dma-buf, and in a containerized or privilege-separated system, a process could have enough privilege to make a dma-buf somewhere but not have any privileges on the NIC. (And may not even have the NIC present in its network namespace!)

--Andy
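
For readers following the RX steps enumerated above, here is a minimal userspace sketch of the slot lifecycle in steps 2 through 5. It is only a toy model and not the devmem TCP interface: `struct devmem_pool`, `pool_receive()`, `pool_dontneed()` and `NUM_SLOTS` are all hypothetical names invented for illustration, and the "NIC" is just a loop looking for a free slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 4 /* hypothetical: payload slots carved out of the dma-buf */

struct devmem_pool {
	bool in_use[NUM_SLOTS]; /* true while userspace still holds the slot */
};

/*
 * Steps 2-3: the "NIC" places a payload into a free slot and the index is
 * returned as the token userspace would be told about. Returns -1 when
 * every slot is still held, i.e. the payload has nowhere to land.
 */
static int pool_receive(struct devmem_pool *pool)
{
	for (int i = 0; i < NUM_SLOTS; i++) {
		if (!pool->in_use[i]) {
			pool->in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/*
 * Step 5: userspace is done with the data and recycles the slot.
 * Returns false for a token that is out of range or not currently held.
 */
static bool pool_dontneed(struct devmem_pool *pool, int token)
{
	if (token < 0 || token >= NUM_SLOTS || !pool->in_use[token])
		return false;
	pool->in_use[token] = false;
	return true;
}
```

With four slots, a fifth `pool_receive()` fails until some token is handed back through `pool_dontneed()`, which is the reason step 5 matters for keeping the queue fed.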
Re: [RFC PATCH 06/10] net: add SO_DEVMEM_DONTNEED setsockopt to release RX pages
On 7/10/23 15:32, Mina Almasry wrote:
> Add an interface for the user to notify the kernel that it is done reading the NET_RX dmabuf pages returned as cmsg. The kernel will drop the reference on the NET_RX pages to make them available for re-use.
>
> Signed-off-by: Mina Almasry
> ---

> +		for (i = 0; i < num_tokens; i++) {
> +			for (j = 0; j < tokens[i].token_count; j++) {
> +				struct page *pg = xa_erase(&sk->sk_pagepool,
> +							   tokens[i].token_start + j);
> +
> +				if (pg)
> +					put_page(pg);
> +				else
> +					/* -EINTR here notifies the userspace
> +					 * that not all tokens passed to it have
> +					 * been freed.
> +					 */
> +					ret = -EINTR;

Unless I'm missing something, this type of error reporting is unrecoverable -- userspace doesn't know how many tokens have been freed.

I think you should either make it explicitly unrecoverable (somehow shut down dmabuf handling entirely) or tell userspace how many tokens were successfully freed.

--Andy
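
One way to picture the second option suggested above (telling userspace how many tokens were freed) is the userspace sketch below. It is an assumption-laden model rather than the patch's code: a `bool` array stands in for the per-socket xarray, and `struct token_pool`, `struct token_range`, `pool_mark_live()` and `release_tokens()` are hypothetical names. The point is only that the caller gets back both a freed count and a missing count instead of a bare error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 64 /* hypothetical fixed token space for the sketch */

struct token_range {
	unsigned int start;
	unsigned int count;
};

struct token_pool {
	bool live[POOL_SIZE]; /* stands in for the per-socket xarray */
};

/* Mark tokens [start, start + count) as handed out, clamped to the pool. */
static void pool_mark_live(struct token_pool *pool, unsigned int start,
			   unsigned int count)
{
	for (unsigned int tok = start; tok < POOL_SIZE && tok - start < count; tok++)
		pool->live[tok] = true;
}

/*
 * Release every token named by the ranges. Returns the number of tokens
 * actually released and stores in *missing the number that were not live
 * (already released, never handed out, or out of range).
 */
static size_t release_tokens(struct token_pool *pool,
			     const struct token_range *ranges, size_t nranges,
			     size_t *missing)
{
	size_t freed = 0;

	assert(pool && ranges && missing);
	*missing = 0;
	for (size_t i = 0; i < nranges; i++) {
		for (unsigned int j = 0; j < ranges[i].count; j++) {
			/* widen before adding so start + j cannot wrap */
			unsigned long long tok =
				(unsigned long long)ranges[i].start + j;

			if (tok < POOL_SIZE && pool->live[tok]) {
				pool->live[tok] = false;
				freed++;
			} else {
				(*missing)++;
			}
		}
	}
	return freed;
}
```

With both counts available, a caller can tell a partial release from a complete one and decide whether to retry or give up.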
Re: [PATCH 0/2] Nuke PAGE_KERNEL_IO
On 10/21/21 11:15, Lucas De Marchi wrote:
> Last user of PAGE_KERNEL_IO is the i915 driver. While removing it from there as we seek to bring the driver to other architectures, Daniel suggested that we could finish the cleanup and remove it altogether, through the tip tree. So here I'm sending both commits needed for that.
>
> Lucas De Marchi (2):
>   drm/i915/gem: stop using PAGE_KERNEL_IO
>   x86/mm: nuke PAGE_KERNEL_IO
>
>  arch/x86/include/asm/fixmap.h             | 2 +-
>  arch/x86/include/asm/pgtable_types.h      | 7 ---
>  arch/x86/mm/ioremap.c                     | 2 +-
>  arch/x86/xen/setup.c                      | 2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_pages.c | 4 ++--
>  include/asm-generic/fixmap.h              | 2 +-
>  6 files changed, 6 insertions(+), 13 deletions(-)

Acked-by: Andy Lutomirski
Re: [PATCH v2 3/4] drm/ttm, drm/vmwgfx: Correctly support support AMD memory encryption
> On Sep 3, 2019, at 3:15 PM, Thomas Hellström (VMware) wrote:
>
>> On 9/4/19 12:08 AM, Thomas Hellström (VMware) wrote:
>>> On 9/3/19 11:46 PM, Andy Lutomirski wrote:
>>> On Tue, Sep 3, 2019 at 2:05 PM Thomas Hellström (VMware) wrote:
>>>> On 9/3/19 10:51 PM, Dave Hansen wrote:
>>>>>> On 9/3/19 1:36 PM, Thomas Hellström (VMware) wrote:
>>>>>> So the question here should really be, can we determine already at mmap time whether backing memory will be unencrypted and adjust the *real* vma->vm_page_prot under the mmap_sem?
>>>>>>
>>>>>> Possibly, but that requires populating the buffer with memory at mmap time rather than at first fault time.
>>>>> I'm not connecting the dots.
>>>>>
>>>>> vma->vm_page_prot is used to create a VMA's PTEs regardless of if they are created at mmap() or fault time. If we establish a good vma->vm_page_prot, can't we just use it forever for demand faults?
>>>> With SEV I think that we could possibly establish the encryption flags at vma creation time. But thinking of it, it would actually break with SME where buffer content can be moved between encrypted system memory and unencrypted graphics card PCI memory behind user-space's back. That would imply killing all user-space encrypted PTEs and at fault time set up new ones pointing to unencrypted PCI memory..
>>>>
>>>>> Or, are you concerned that if an attempt is made to demand-fault page that's incompatible with vma->vm_page_prot that we have to SEGV?
>>>>>
>>>>>> And it still requires knowledge whether the device DMA is always unencrypted (or if SEV is active).
>>>>> I may be getting mixed up on MKTME (the Intel memory encryption) and SEV. Is SEV supported on all memory types? Page cache, hugetlbfs, anonymous? Or just anonymous?
>>>> SEV AFAIK encrypts *all* memory except DMA memory. To do that it uses a SWIOTLB backed by unencrypted memory, and it also flips coherent DMA memory to unencrypted (which is a very slow operation and patch 4 deals with caching such memory).
>>>>
>>> I'm still lost. You have some fancy VMA where the backing pages change behind the application's back. This isn't particularly novel -- plain old anonymous memory and plain old mapped files do this too. Can't you call the insert_pfn APIs and call it a day? What's so special that you need all this magic? ISTM you should be able to allocate memory that's addressable by the device (dma_alloc_coherent() or whatever) and then map it into user memory just like you'd map any other page.
>>>
>>> I feel like I'm missing something here.
>>
>> Yes, so in this case we use dma_alloc_coherent().
>>
>> With SEV, that gives us unencrypted pages. (Pages whose linear kernel map is marked unencrypted). With SME that (typically) gives us encrypted pages. In both these cases, vm_get_page_prot() returns an encrypted page protection, which lands in vma->vm_page_prot.
>>
>> In the SEV case, we therefore need to modify the page protection to unencrypted. Hence we need to know whether we're running under SEV and therefore need to modify the protection. If not, the user-space PTE would incorrectly have the encryption flag set.
>>

I’m still confused. You got unencrypted pages with an unencrypted PFN. Why do you need to fiddle? You have a PFN, and you’re inserting it with vmf_insert_pfn(). This should just work, no? There doesn’t seem to be any real funny business in dma_mmap_attrs() or dma_common_mmap().

But, reading this, I have more questions:

Can’t you get rid of cvma by using vmf_insert_pfn_prot()?

Would it make sense to add a vmf_insert_dma_page() to directly do exactly what you’re trying to do?

And a broader question just because I’m still confused: why isn’t the encryption bit in the PFN?
The whole SEV/SME system seems like it’s trying a bit too hard to be fully invisible to the kernel.
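
To make the closing question concrete, here is a toy model of the two places an encryption bit could live. Everything in it is an assumption for illustration: the bit position (47) is arbitrary, `make_pte()`, `pte_enc_in_prot()` and `pfn_with_enc()` are hypothetical helpers, and real page table entries carry more state than this. It only shows that a bit folded into the PFN and a bit carried in the protection value produce the same entry, so the difference is which caller has to know about it.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define ENC_BIT    (UINT64_C(1) << 47) /* assumption: arbitrary position for the sketch */

/* Toy PTE: frame number shifted into place, OR'd with protection bits. */
static uint64_t make_pte(uint64_t pfn, uint64_t prot)
{
	return (pfn << PAGE_SHIFT) | prot;
}

/*
 * Model A: the encryption bit is part of the protection value, so whoever
 * builds the PTE must know whether this particular page is encrypted.
 */
static uint64_t pte_enc_in_prot(uint64_t pfn, uint64_t prot, int encrypted)
{
	return make_pte(pfn, encrypted ? (prot | ENC_BIT) : prot);
}

/*
 * Model B: the bit travels with the frame number, so the code inserting
 * the PFN does not need to know about encryption at all.
 */
static uint64_t pfn_with_enc(uint64_t pfn, int encrypted)
{
	return encrypted ? (pfn | (ENC_BIT >> PAGE_SHIFT)) : pfn;
}
```

In model B, a plain `make_pte(pfn_with_enc(pfn, enc), prot)` yields the same value that model A needs an extra argument to produce.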
Re: [PATCH v2 3/4] drm/ttm, drm/vmwgfx: Correctly support support AMD memory encryption
On Tue, Sep 3, 2019 at 2:05 PM Thomas Hellström (VMware) wrote:
>
> On 9/3/19 10:51 PM, Dave Hansen wrote:
> > On 9/3/19 1:36 PM, Thomas Hellström (VMware) wrote:
> >> So the question here should really be, can we determine already at mmap time whether backing memory will be unencrypted and adjust the *real* vma->vm_page_prot under the mmap_sem?
> >>
> >> Possibly, but that requires populating the buffer with memory at mmap time rather than at first fault time.
> > I'm not connecting the dots.
> >
> > vma->vm_page_prot is used to create a VMA's PTEs regardless of if they are created at mmap() or fault time. If we establish a good vma->vm_page_prot, can't we just use it forever for demand faults?
>
> With SEV I think that we could possibly establish the encryption flags at vma creation time. But thinking of it, it would actually break with SME where buffer content can be moved between encrypted system memory and unencrypted graphics card PCI memory behind user-space's back. That would imply killing all user-space encrypted PTEs and at fault time set up new ones pointing to unencrypted PCI memory..
>
> >
> > Or, are you concerned that if an attempt is made to demand-fault page that's incompatible with vma->vm_page_prot that we have to SEGV?
> >
> >> And it still requires knowledge whether the device DMA is always unencrypted (or if SEV is active).
> > I may be getting mixed up on MKTME (the Intel memory encryption) and SEV. Is SEV supported on all memory types? Page cache, hugetlbfs, anonymous? Or just anonymous?
>
> SEV AFAIK encrypts *all* memory except DMA memory. To do that it uses a SWIOTLB backed by unencrypted memory, and it also flips coherent DMA memory to unencrypted (which is a very slow operation and patch 4 deals with caching such memory).
>

I'm still lost. You have some fancy VMA where the backing pages change behind the application's back. This isn't particularly novel -- plain old anonymous memory and plain old mapped files do this too. Can't you call the insert_pfn APIs and call it a day? What's so special that you need all this magic? ISTM you should be able to allocate memory that's addressable by the device (dma_alloc_coherent() or whatever) and then map it into user memory just like you'd map any other page.

I feel like I'm missing something here.
Re: [PATCH v2 1/4] x86/mm: Export force_dma_unencrypted
On Tue, Sep 3, 2019 at 1:46 PM Thomas Hellström (VMware) wrote:
>
> On 9/3/19 6:22 PM, Christoph Hellwig wrote:
> > On Tue, Sep 03, 2019 at 04:32:45PM +0200, Thomas Hellström (VMware) wrote:
> >> Is this a layer violation concern, that is, would you be ok with a similar helper for TTM, or is it that you want to force the graphics drivers into adhering strictly to the DMA api, even when it from an engineering perspective makes no sense?
> > From looking at DRM I strongly believe that making DRM use the DMA mapping properly makes a lot of sense from the engineering perspective, and this series is a good argument for that position.
>
> What I mean with "from an engineering perspective" is that drivers would end up with a non-trivial amount of code supporting purely academic cases: Setups where software rendering would be faster than gpu accelerated, and setups on platforms where the driver would never run anyway because the device would never be supported on that platform...
>
> > If DRM was using the DMA properly we would not need this series to start with, all the SEV handling is hidden behind the DMA API. While we had occasional bugs in that support fixing it meant that it covered all drivers properly using that API.
>
> That is not really true. The dma API can't handle faulting of coherent pages which is what this series is really all about supporting also with SEV active. To handle the case where we move graphics buffers or send them to swap space while user-space has them mapped.
>
> To do that and still be fully dma-api compliant we would ideally need, for example, an exported dma_pgprot(). (dma_pgprot() by the way is still suffering from one of the bugs that you mention above).
>
> Still, I need a way forward and my questions weren't really answered by this.
>
>

I read this patch, I read force_dma_unencrypted(), I read the changelog again, and I haven't the faintest clue what TTM could possibly be doing with force_dma_unencrypted().

You're saying that TTM needs to transparently change mappings to relocate objects in memory between system memory and device memory. Great, I don't see the problem. Is the issue that you need to allocate system memory that is addressable by the GPU and that, if the GPU has insufficient PA bits, you need unencrypted memory? If so, this sounds like an excellent use for the DMA API. Rather than kludging knowledge of force_dma_unencrypted() directly into the driver, can't you at least add, if needed, a new helper specifically to allocate memory that can be addressed by the device? Like dma_alloc_coherent()? Or, if for some reason, dma_alloc_coherent() doesn't do what you need or your driver isn't ready to use it, then explain *why* and introduce a new function to solve your problem?

Keep in mind that, depending on just how MKTME ends up being supported in Linux, it's entirely possible that it will be *backwards* from what you expect -- high address bits will be needed to ask for *unencrypted* memory.

--Andy
Re: [Intel-gfx] [PATCH] drm/i915: Improve PSR activation timing
On Fri, Feb 9, 2018 at 7:39 AM, Rodrigo Vivi wrote:
> Rodrigo Vivi writes:
>
>> "Pandiyan, Dhinakaran" writes:
>>
>>> On Thu, 2018-02-08 at 14:48 -0800, Rodrigo Vivi wrote:
>>>> Hi Andy,
>>>>
>>>> thanks for getting involved with PSR and sorry for not replying sooner.
>>>>
>>>> I first saw this patch on that bugzilla entry but only now I stop to really think why I have written the code that way.
>>>>
>>>> So some clarity below.
>>>>
>>>> On Mon, Feb 05, 2018 at 10:07:09PM +, Andy Lutomirski wrote:
>>>> > The current PSR code has two call sites that each schedule delayed work to activate PSR. As far as I can tell, each call site intends to keep PSR inactive for the given amount of time and then allow it to be activated.
>>>> >
>>>> > The call sites are:
>>>> >
>>>> > - intel_psr_enable(), which explicitly states in a comment that it's trying to keep PSR off a short time after the display is initialized as a workaround.
>>>>
>>>> First of all I really want to kill this call here and remove the FIXME. It was an ugly hack that I added to solve a corner case that was leaving me with blank screens when activating so sooner.
>>>>
>>>> >
>>>> > - intel_psr_flush(). There isn't an explicit explanation, but the intent is presumably to keep PSR off until the display has been idle for 100ms.
>>>>
>>>> The reason for 100 is kind of ugly-nonsense-empirical value I concluded from VLV/CHV experience. On platforms with HW tracking HW waits few identical frames until really activating PSR. VLV/CHV activation is immediate. But HW is also different and there it seemed that hw needed a few more time before starting the transitions. Furthermore I didn't want to add that so quickly because I didn't want to take the risk of killing battery with software tracking when doing transitions so quickly using software tracking.
>>>>
>>>> >
>>>> > The current code doesn't actually accomplish either of these goals. Rather than keeping PSR inactive for the given amount of time, it will schedule PSR for activation after the given time, with the earliest target time in such a request winning.
>>>>
>>>> Putting that way I was asking myself how that hack had ever fixed my issue. Because the way you explained here seems obvious that it wouldn't ever fix my bug or any other.
>>>>
>>>> So I applied your patch and it made even more sense (without considering the fact I want to kill the first call anyways).
>>>>
>>>> So I came back, removed your patch and tried to understand how did it ever worked.
>>>>
>>>> So, the thing is that intel_psr_flush will never be really executed if intel_psr_enable wasn't executed. That is guaranteed by:
>>>>
>>>> mutex_lock(&dev_priv->psr.lock);
>>>> if (!dev_priv->psr.enabled) {
>>>>
>>>> So, intel_psr_enable will be for sure the first one to schedule the work delayed to the ugly higher delay.
>>>>
>>>> >
>>>> > In other words, if intel_psr_enable() is immediately followed by intel_psr_flush(), then PSR will be activated after 100ms even if intel_psr_enable() wanted a longer delay. And, if the screen is being constantly updated so that intel_psr_flush() is called once per frame at 60Hz, PSR will still be activated once every 100ms.
>>>>
>>>> During this time you are right, many calls of intel_psr_exit coming from flush functions can be called... But none of them will schedule the work with 100 delay.
>>>>
>>>> they will skip on
>>>> if (!work_busy(&dev_priv->psr.work.work))
>>>
>>> Wouldn't work_busy() return false until the work is actually queued which is 100ms after calling schedule_delayed_work()?
>>
>> That's not my understanding
Re: [Intel-gfx] [PATCH] drm/i915: Improve PSR activation timing
> On Feb 8, 2018, at 4:39 PM, Pandiyan, Dhinakaran wrote:
>
>
>> On Thu, 2018-02-08 at 14:48 -0800, Rodrigo Vivi wrote:
>> Hi Andy,
>>
>> thanks for getting involved with PSR and sorry for not replying sooner.
>>
>> I first saw this patch on that bugzilla entry but only now I stop to really think why I have written the code that way.
>>
>> So some clarity below.
>>
>>> On Mon, Feb 05, 2018 at 10:07:09PM +, Andy Lutomirski wrote:
>>> The current PSR code has two call sites that each schedule delayed work to activate PSR. As far as I can tell, each call site intends to keep PSR inactive for the given amount of time and then allow it to be activated.
>>>
>>> The call sites are:
>>>
>>> - intel_psr_enable(), which explicitly states in a comment that it's trying to keep PSR off a short time after the display is initialized as a workaround.
>>
>> First of all I really want to kill this call here and remove the FIXME. It was an ugly hack that I added to solve a corner case that was leaving me with blank screens when activating so sooner.
>>
>>>
>>> - intel_psr_flush(). There isn't an explicit explanation, but the intent is presumably to keep PSR off until the display has been idle for 100ms.
>>
>> The reason for 100 is kind of ugly-nonsense-empirical value I concluded from VLV/CHV experience. On platforms with HW tracking HW waits few identical frames until really activating PSR. VLV/CHV activation is immediate. But HW is also different and there it seemed that hw needed a few more time before starting the transitions. Furthermore I didn't want to add that so quickly because I didn't want to take the risk of killing battery with software tracking when doing transitions so quickly using software tracking.
>>
>>>
>>> The current code doesn't actually accomplish either of these goals. Rather than keeping PSR inactive for the given amount of time, it will schedule PSR for activation after the given time, with the earliest target time in such a request winning.
>>
>> Putting that way I was asking myself how that hack had ever fixed my issue. Because the way you explained here seems obvious that it wouldn't ever fix my bug or any other.
>>
>> So I applied your patch and it made even more sense (without considering the fact I want to kill the first call anyways).
>>
>> So I came back, removed your patch and tried to understand how did it ever worked.
>>
>> So, the thing is that intel_psr_flush will never be really executed if intel_psr_enable wasn't executed. That is guaranteed by:
>>
>> mutex_lock(&dev_priv->psr.lock);
>> if (!dev_priv->psr.enabled) {
>>
>> So, intel_psr_enable will be for sure the first one to schedule the work delayed to the ugly higher delay.
>>
>>>
>>> In other words, if intel_psr_enable() is immediately followed by intel_psr_flush(), then PSR will be activated after 100ms even if intel_psr_enable() wanted a longer delay. And, if the screen is being constantly updated so that intel_psr_flush() is called once per frame at 60Hz, PSR will still be activated once every 100ms.
>>
>> During this time you are right, many calls of intel_psr_exit coming from flush functions can be called... But none of them will schedule the work with 100 delay.
>>
>> they will skip on
>> if (!work_busy(&dev_priv->psr.work.work))

As below, the first call will. Then, 100ms later, the work will fire. Then the next flush will schedule it again, etc.

>
> Wouldn't work_busy() return false until the work is actually queued which is 100ms after calling schedule_delayed_work()?
>
> For e.g, flushes at 0, 16, 32...96 will have work_busy() returning false until 100ms.
>
> The first psr_work will end up getting scheduled at 100ms, which I believe is not what we want.

Indeed. I stuck some printks in and this seems to be what happens.

>
>
> However, I think
>
>	if (dev_priv->psr.busy_frontbuffer_bits)
>		goto unlock;
>
>	intel_psr_activate(intel_dp);
>
> in psr_work might prevent activate being called at 100ms if an invalidate happened to be called before that.
>

On my system, invalidate is never called. Even if it were called, that check would
Re: [Intel-gfx] i915 PSR test results and cursor lag
> On Feb 5, 2018, at 2:50 PM, Rodrigo Vivi wrote:
>
>> On Sat, Feb 03, 2018 at 05:33:08PM +, Andy Lutomirski wrote:
>>> On Fri, Feb 2, 2018 at 7:18 PM, Andy Lutomirski wrote:
>>>> On Fri, Feb 2, 2018 at 1:24 AM, Andy Lutomirski wrote:
>>>>> On Thu, Feb 1, 2018 at 9:20 PM, Chris Wilson wrote:
>>>>> Quoting Andy Lutomirski (2018-02-01 21:04:30)
>>>>>> I got this after a recent suspend/resume:
>>>>>>
>>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Lid closed.
>>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scan all dirs
>>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scanning /sys/bus
>>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scanning /sys/class
>>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Failed to open configuration file '/etc/systemd/sleep.conf': No such file or directory
>>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Suspending...
>>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal sender=n/a destination=n/a object=/org/freedesktop/login1 interface=org.freedesktop.login1.Manager member=PrepareForSleep cookie=570 reply
>>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Got message type=method_call sender=:1.46 destination=:1.1 object=/org/freedesktop/login1/session/_32 interface=org.freedesktop.login1.Session member=ReleaseDevice
>>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal sender=n/a destination=:1.46 object=/org/freedesktop/login1/session/_32 interface=org.freedesktop.login1.Session member=PauseDevice cookie
>>>>>> Feb 01 09:44:34 laptop gnome-shell[2630]: Failed to apply DRM plane transform 0: Permission denied
>>>>>> Feb 01 09:44:34 laptop gnome-shell[2630]: drmModeSetCursor2 failed with (Permission denied), drawing cursor with OpenGL from now on
>>>>>>
>>>>>> But I don't see the word "cursor" in my system logs before the first suspend. What am I looking for? This is Fedora 27 running a Gnome Wayland session, but it hasn't been reinstalled in some time, so it's possible that there are some weird settings sitting around. But I did check and I have no weird i915 parameters.
>>>>>
>>>>> You are using gnome-shell as the display server. From that it appears to have started off with a HW cursor and switched to a SW cursor after suspend. Did you notice a change in behaviour? After rebooting or just restarting gnome-shell?
>>>>
>>>> I think it's less consistently bad after a reboot before suspending.
>>>>
>>>>>
>>>>>> Also, are these things potentially related:
>>>>>>
>>>>>> [ 3067.702527] [drm:intel_pipe_update_start [i915]] *ERROR* Potential atomic update failure on pipe A
>>>>>
>>>>> They are just "missed the immediate vblank for the screen update" messages. Should not be related to PSR, but may cause jitter by delaying the odd screen update.
>>>>
>>>> I just got this one, and the timestamp is at least reasonably close to a giant latency spike:
>>>>
>>>> [ 288.799654] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=31 end=32) time 15 us, min 1073, max 1079, scanline start 1087, end 1088
>>>>
>>>>>
>>>>>> As I'm typing this, I've seen a couple instances of what seems like a full *second* of cursor latency, but I've only gotten the potential atomic update failure once.
>>>>>>
>>>>>> And is there any straightforward tracing to do to distinguish between PSR exit latency and other potential sources of latency?
>>>>>
>>>>> It lo
[PATCH] drm/i915: Improve PSR activation timing
The current PSR code has two call sites that each schedule delayed work to activate PSR. As far as I can tell, each call site intends to keep PSR inactive for the given amount of time and then allow it to be activated.

The call sites are:

 - intel_psr_enable(), which explicitly states in a comment that it's trying to keep PSR off a short time after the display is initialized as a workaround.

 - intel_psr_flush(). There isn't an explicit explanation, but the intent is presumably to keep PSR off until the display has been idle for 100ms.

The current code doesn't actually accomplish either of these goals. Rather than keeping PSR inactive for the given amount of time, it will schedule PSR for activation after the given time, with the earliest target time in such a request winning.

In other words, if intel_psr_enable() is immediately followed by intel_psr_flush(), then PSR will be activated after 100ms even if intel_psr_enable() wanted a longer delay. And, if the screen is being constantly updated so that intel_psr_flush() is called once per frame at 60Hz, PSR will still be activated once every 100ms.

Rewrite the code so that it does what was intended. This adds a new function intel_psr_schedule(), which will enable PSR after the requested time but no sooner.

Signed-off-by: Andy Lutomirski
---
 drivers/gpu/drm/i915/i915_debugfs.c | 9 +++--
 drivers/gpu/drm/i915/i915_drv.h     | 4 ++-
 drivers/gpu/drm/i915/intel_psr.c    | 69 -
 3 files changed, 71 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index c65e381b85f3..b67db93f905d 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -2663,8 +2663,13 @@ static int i915_edp_psr_status(struct seq_file *m, void *data)
 	seq_printf(m, "Active: %s\n", yesno(dev_priv->psr.active));
 	seq_printf(m, "Busy frontbuffer bits: 0x%03x\n",
 		   dev_priv->psr.busy_frontbuffer_bits);
-	seq_printf(m, "Re-enable work scheduled: %s\n",
-		   yesno(work_busy(&dev_priv->psr.work.work)));
+
+	if (timer_pending(&dev_priv->psr.activate_timer))
+		seq_printf(m, "Activate scheduled: yes, in %ldms\n",
+			   (long)(dev_priv->psr.earliest_activate - jiffies) *
+			   1000 / HZ);
+	else
+		seq_printf(m, "Re-enable scheduled: no\n");
 
 	if (HAS_DDI(dev_priv)) {
 		if (dev_priv->psr.psr2_support)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 46eb729b367d..c0fb7d65cda6 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1192,7 +1192,9 @@ struct i915_psr {
 	bool source_ok;
 	struct intel_dp *enabled;
 	bool active;
-	struct delayed_work work;
+	struct timer_list activate_timer;
+	struct work_struct activate_work;
+	unsigned long earliest_activate;
 	unsigned busy_frontbuffer_bits;
 	bool psr2_support;
 	bool aux_frame_sync;
diff --git a/drivers/gpu/drm/i915/intel_psr.c b/drivers/gpu/drm/i915/intel_psr.c
index 55ea5eb3b7df..333d90d4e5af 100644
--- a/drivers/gpu/drm/i915/intel_psr.c
+++ b/drivers/gpu/drm/i915/intel_psr.c
@@ -461,6 +461,30 @@ static void intel_psr_activate(struct intel_dp *intel_dp)
 	dev_priv->psr.active = true;
 }
 
+static void intel_psr_schedule(struct drm_i915_private *dev_priv,
+			       unsigned long min_wait_ms)
+{
+	unsigned long next;
+
+	lockdep_assert_held(&dev_priv->psr.lock);
+
+	/*
+	 * We update next_enable *and* call mod_timer() because it's
+	 * possible that intel_psr_work() has already been called and is
+	 * waiting for psr.lock. If that's the case, we don't want it
+	 * to immediately enable PSR.
+	 *
+	 * We also need to make sure that PSR is never activated earlier
+	 * than requested to avoid breaking intel_psr_enable()'s workaround
+	 * for pre-gen9 hardware.
+	 */
+	next = jiffies + msecs_to_jiffies(min_wait_ms);
+	if (time_after(next, dev_priv->psr.earliest_activate)) {
+		dev_priv->psr.earliest_activate = next;
+		mod_timer(&dev_priv->psr.activate_timer, next);
+	}
+}
+
 static void hsw_psr_enable_source(struct intel_dp *intel_dp,
 				  const struct intel_crtc_state *crtc_state)
 {
@@ -544,8 +568,7 @@ void intel_psr_enable(struct intel_dp *intel_dp,
 	 * - On HSW/BDW we get a recoverable frozen screen until
 	 *   next exit-activate sequence.
 	 */
-	schedule_delayed_work(&dev_priv->psr.work,
-
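
The scheduling difference this patch describes can be seen in a small userspace simulation. This is a simplified model and not the driver code: time is counted in milliseconds instead of jiffies, the "skip if work is already pending" behaviour is reduced to a single flag, and `struct sim`, `request_activation()` and `activations_while_busy()` are names made up for the sketch. Only the `time_after()` comparison mirrors the kernel's wraparound-safe idiom.

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is later than b", same idea as the kernel macro. */
#define time_after(a, b) ((long)((b) - (a)) < 0)

enum policy {
	FIRST_REQUEST_WINS,  /* models "do nothing if work is already pending" */
	LATEST_REQUEST_WINS, /* models the intel_psr_schedule() approach */
};

struct sim {
	bool pending;
	unsigned long target;            /* ms at which activation fires */
	unsigned long earliest_activate; /* ms, used by LATEST_REQUEST_WINS */
};

static void request_activation(struct sim *s, enum policy p,
			       unsigned long now, unsigned long delay_ms)
{
	unsigned long next = now + delay_ms;

	if (p == FIRST_REQUEST_WINS) {
		if (!s->pending) {
			s->pending = true;
			s->target = next;
		}
	} else if (time_after(next, s->earliest_activate)) {
		/* push the deadline out, never pull it in */
		s->earliest_activate = next;
		s->pending = true;
		s->target = next;
	}
}

/*
 * Flush once every frame_ms while now <= busy_until_ms, then go idle until
 * end_ms. Returns how many activations fired while the screen was still
 * being updated; an idle-only policy should report zero.
 */
static int activations_while_busy(enum policy p, unsigned long frame_ms,
				  unsigned long busy_until_ms,
				  unsigned long end_ms)
{
	struct sim s = { false, 0, 0 };
	int count = 0;

	assert(frame_ms > 0);
	for (unsigned long now = 0; now < end_ms; now++) {
		if (s.pending && !time_after(s.target, now)) {
			s.pending = false;
			if (now <= busy_until_ms)
				count++;
		}
		if (now <= busy_until_ms && now % frame_ms == 0)
			request_activation(&s, p, now, 100);
	}
	return count;
}
```

Calling `activations_while_busy()` with a 16 ms frame time and 500 ms of continuous flushing shows the first policy activating repeatedly during the busy period, while the second one waits until flushing stops.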
Re: [Intel-gfx] i915 PSR test results and cursor lag
On Mon, Feb 5, 2018 at 9:17 PM, Pandiyan, Dhinakaran wrote:
>
> On Mon, 2018-02-05 at 20:35 +0000, Andy Lutomirski wrote:
>> On Mon, Feb 5, 2018 at 6:53 PM, Pandiyan, Dhinakaran
>> wrote:
>> >
>> >
>> >
>> > On Sun, 2018-02-04 at 21:50 +0000, Andy Lutomirski wrote:
>> >> On Sat, Feb 3, 2018 at 5:08 PM, Andy Lutomirski wrote:
>> >> > On Sat, Feb 3, 2018 at 5:20 AM, Pandiyan, Dhinakaran
>> >> > wrote:
>> >> >>
>> >> >> On Fri, 2018-02-02 at 19:18 +0000, Andy Lutomirski wrote:
>> >> >>> I updated to 4.15, and the situation is much worse. With
>> >> >>> enable_psr=1, the system survives for several seconds and then the
>> >> >>> screen stops updating entirely. If I boot with i915.enable_psr=1, I
>> >> >>> get to the Fedora login screen and then the system dies. If I set
>> >> >>> enable_psr=1 using sysfs, it does a bit after the next resume. It
>> >> >>> seems like it also sometimes hangs even worse a bit after the screen
>> >> >>> stops updating, but it's hard to tell.
>> >> >>
>> >> >> The login screen freeze sounds like what I have. Does this system have
>> >> >> DMC firmware? If yes, can you try this series
>> >> >> https://patchwork.freedesktop.org/series/37598/. You'll only need
>> >> >> patches 1,8,9 and 10.
>> >> >
>> >> > That fixes the hang. Feel free to add:
>> >> >
>> >> > Tested-by: Andy Lutomirski
>> >> >
>> >> > to the i915 parts. Also, any chance of getting it into the 4.15 stable
>> >> > kernels?
>> >>
>> >> Correction: I'm still getting a second or two of complete screen
>> >> freezing every now and then. The kernel says:
>> > Thanks a lot for testing. How do you trigger this freeze? Moving the
>> > cursor? Did you apply these patches on top of drm-tip or was it
>> > mainline?
>> >
>> > I also have another patch here that addresses screen freezes in console
>> > mode with PSR - https://patchwork.freedesktop.org/patch/201144/ in case
>> > that is what you are interested in.
>> >>
>> >> [69400.016524] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic
>> >> update failure on pipe A (start=19 end=20) time 198 us, min 1073, max
>> >> 1079, scanline start 1068, end 1082
>> >>
>> >> So something might still be a bit buggy.
>> >
>> > This series fixes only the long freezes due to frame counter resets, I
>> > am sure there are still other issues with PSR.
>> >
>> > BTW does your patch on top of these patches help with the cursor lag?
>>
>> Maybe, but I'm not 100% sure. I'm not currently seeing the lag with
>> or without the patch. I also think my distro fixed the cursor in the
>> mean time so that it uses the HW cursor even after suspend/resume.
>>
>> A couple of questions, though:
>>
>> 1. Does moving the HW cursor cause the hardware to automatically turn off
>> PSR?
>>
> That is correct.
>
>> 2. When something enables vblank interrupts (using drm_*_vblank_get(),
>> for example), are vblank interrupts generated even if PSR is on?
>
> Enabling vblank interrupts deactivates PSR (except on Braswell afaik)
>
>> And
>> is the scanline, as returned by intel_get_crtc_scanline(), updated?
>
> I don't think so, I have not really checked but there are no frames
> generated, so the timing related registers will not get updated. This is
> the case with the frame counter register.
>

I bet that's the cause of some of the glitches I'm seeing.
I instrumented intel_pipe_update_start() like this: diff --git a/drivers/gpu/drm/i915/intel_sprite.c b/drivers/gpu/drm/i915/intel_sprite.c index 4a8a5d918a83..6ce0a35187fb 100644 --- a/drivers/gpu/drm/i915/intel_sprite.c +++ b/drivers/gpu/drm/i915/intel_sprite.c @@ -97,6 +97,7 @@ void intel_pipe_update_start(const struct intel_crtc_state *new_crtc_state) bool need_vlv_dsi_wa = (IS_VALLEYVIEW(dev_priv) || IS_CHERRYVIEW(dev_priv)) && intel_crtc_has_type(new_crtc_state, INTEL_OUTPUT_DSI); DEFINE_WAIT(wait); +int first_scanline = -1; vblank_start = adjusted_mode->crtc_vblank_start; if (adjusted_mode->flags & DRM_MODE_FLAG_INTERLACE) @@ -131,9 +132,12 @@ void intel_pipe_update_start(const struct intel_crtc_state *new_crtc_state) if (scanline < min || scanline >
Re: [Intel-gfx] i915 PSR test results and cursor lag
On Mon, Feb 5, 2018 at 6:53 PM, Pandiyan, Dhinakaran wrote:
>
>
>
> On Sun, 2018-02-04 at 21:50 +0000, Andy Lutomirski wrote:
>> On Sat, Feb 3, 2018 at 5:08 PM, Andy Lutomirski wrote:
>> > On Sat, Feb 3, 2018 at 5:20 AM, Pandiyan, Dhinakaran
>> > wrote:
>> >>
>> >> On Fri, 2018-02-02 at 19:18 +0000, Andy Lutomirski wrote:
>> >>> I updated to 4.15, and the situation is much worse. With
>> >>> enable_psr=1, the system survives for several seconds and then the
>> >>> screen stops updating entirely. If I boot with i915.enable_psr=1, I
>> >>> get to the Fedora login screen and then the system dies. If I set
>> >>> enable_psr=1 using sysfs, it does a bit after the next resume. It
>> >>> seems like it also sometimes hangs even worse a bit after the screen
>> >>> stops updating, but it's hard to tell.
>> >>
>> >> The login screen freeze sounds like what I have. Does this system have
>> >> DMC firmware? If yes, can you try this series
>> >> https://patchwork.freedesktop.org/series/37598/. You'll only need
>> >> patches 1,8,9 and 10.
>> >
>> > That fixes the hang. Feel free to add:
>> >
>> > Tested-by: Andy Lutomirski
>> >
>> > to the i915 parts. Also, any chance of getting it into the 4.15 stable
>> > kernels?
>>
>> Correction: I'm still getting a second or two of complete screen
>> freezing every now and then. The kernel says:
> Thanks a lot for testing. How do you trigger this freeze? Moving the
> cursor? Did you apply these patches on top of drm-tip or was it
> mainline?
>
> I also have another patch here that addresses screen freezes in console
> mode with PSR - https://patchwork.freedesktop.org/patch/201144/ in case
> that is what you are interested in.
>>
>> [69400.016524] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic
>> update failure on pipe A (start=19 end=20) time 198 us, min 1073, max
>> 1079, scanline start 1068, end 1082
>>
>> So something might still be a bit buggy.
>
> This series fixes only the long freezes due to frame counter resets, I
> am sure there are still other issues with PSR.
>
> BTW does your patch on top of these patches help with the cursor lag?

Maybe, but I'm not 100% sure. I'm not currently seeing the lag with
or without the patch. I also think my distro fixed the cursor in the
mean time so that it uses the HW cursor even after suspend/resume.

A couple of questions, though:

1. Does moving the HW cursor cause the hardware to automatically turn off PSR?

2. When something enables vblank interrupts (using drm_*_vblank_get(),
for example), are vblank interrupts generated even if PSR is on? And
is the scanline, as returned by intel_get_crtc_scanline(), updated?
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Intel-gfx] i915 PSR test results and cursor lag
On Sat, Feb 3, 2018 at 5:08 PM, Andy Lutomirski wrote:
> On Sat, Feb 3, 2018 at 5:20 AM, Pandiyan, Dhinakaran
> wrote:
>>
>> On Fri, 2018-02-02 at 19:18 +0000, Andy Lutomirski wrote:
>>> I updated to 4.15, and the situation is much worse. With
>>> enable_psr=1, the system survives for several seconds and then the
>>> screen stops updating entirely. If I boot with i915.enable_psr=1, I
>>> get to the Fedora login screen and then the system dies. If I set
>>> enable_psr=1 using sysfs, it does a bit after the next resume. It
>>> seems like it also sometimes hangs even worse a bit after the screen
>>> stops updating, but it's hard to tell.
>>
>> The login screen freeze sounds like what I have. Does this system have
>> DMC firmware? If yes, can you try this series
>> https://patchwork.freedesktop.org/series/37598/. You'll only need
>> patches 1,8,9 and 10.
>
> That fixes the hang. Feel free to add:
>
> Tested-by: Andy Lutomirski
>
> to the i915 parts. Also, any chance of getting it into the 4.15 stable
> kernels?

Correction: I'm still getting a second or two of complete screen
freezing every now and then. The kernel says:

[69400.016524] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic
update failure on pipe A (start=19 end=20) time 198 us, min 1073, max
1079, scanline start 1068, end 1082

So something might still be a bit buggy.
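In case it helps anyone collect these reports in bulk, here is a small sketch that pulls the numbers out of an "Atomic update failure" line once it has been joined back onto one line. The struct and function names are hypothetical, and reading start=/end= as a frame counter sampled before and after the update is my assumption from the message text, not something this thread confirms.

```c
#include <stdio.h>
#include <string.h>

/* Fields as they appear in the "Atomic update failure" message. */
struct atomic_update_failure {
	char pipe;
	unsigned int start_frame, end_frame;
	unsigned int time_us;
	int min, max;
	int scanline_start, scanline_end;
};

/* Returns 0 on success, -1 if the line does not carry the message. */
static int parse_atomic_update_failure(const char *line,
				       struct atomic_update_failure *out)
{
	const char *p = strstr(line, "Atomic update failure on pipe ");
	int n;

	if (!p)
		return -1;

	n = sscanf(p,
		   "Atomic update failure on pipe %c (start=%u end=%u) "
		   "time %u us, min %d, max %d, scanline start %d, end %d",
		   &out->pipe, &out->start_frame, &out->end_frame,
		   &out->time_us, &out->min, &out->max,
		   &out->scanline_start, &out->scanline_end);
	return n == 8 ? 0 : -1;
}

/* Nonzero if the start and end counters differ. */
static int counters_differ(const struct atomic_update_failure *f)
{
	return f->start_frame != f->end_frame;
}
```

The match is case-sensitive on purpose, so the separate "Potential atomic update failure" message is not picked up by this parser.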
Re: [Intel-gfx] i915 PSR test results and cursor lag
On Fri, Feb 2, 2018 at 7:18 PM, Andy Lutomirski wrote: > On Fri, Feb 2, 2018 at 1:24 AM, Andy Lutomirski wrote: >> On Thu, Feb 1, 2018 at 9:20 PM, Chris Wilson >> wrote: >>> Quoting Andy Lutomirski (2018-02-01 21:04:30) >>>> I got this after a recent suspend/resume: >>>> >>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Lid closed. >>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scan all >>>> dirs >>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: >>>> scanning /sys/bus >>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: >>>> scanning /sys/class >>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Failed to open >>>> configuration file '/etc/systemd/sleep.conf': No such file or >>>> directory >>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Suspending... >>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal >>>> sender=n/a destination=n/a object=/org/freedesktop/login1 >>>> interface=org.freedesktop.login1.Manager member=PrepareForSleep >>>> cookie=570 reply >>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Got message >>>> type=method_call sender=:1.46 destination=:1.1 >>>> object=/org/freedesktop/login1/session/_32 >>>> interface=org.freedesktop.login1.Session member=ReleaseDevice >>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal >>>> sender=n/a destination=:1.46 >>>> object=/org/freedesktop/login1/session/_32 >>>> interface=org.freedesktop.login1.Session member=PauseDevice cookie >>>> Feb 01 09:44:34 laptop gnome-shell[2630]: Failed to apply DRM plane >>>> transform 0: Permission denied >>>> Feb 01 09:44:34 laptop gnome-shell[2630]: drmModeSetCursor2 failed >>>> with (Permission denied), drawing cursor with OpenGL from now on >>>> >>>> But I don't see the word "cursor" in my system logs before the first >>>> suspend. What am I looking for? 
This is Fedora 27 running a Gnome >>>> Wayland session, but it hasn't been reinstalled in some time, so it's >>>> possible that there are some weird settings sitting around. But I did >>>> check and I have no weird i915 parameters. >>> >>> You are using gnome-shell as the display server. From that it appears to >>> have started off with a HW cursor and switched to a SW cursor after >>> suspend. Did you notice a change in behaviour? After rebooting or just >>> restarting gnome-shell? >> >> I think it's less consistently bad after a reboot before suspending. >> >>> >>>> Also, are these things potentially related: >>>> >>>> [ 3067.702527] [drm:intel_pipe_update_start [i915]] *ERROR* Potential >>>> atomic update failure on pipe A >>> >>> They are just "missed the immediate vblank for the screen update" >>> messages. Should not be related to PSR, but may cause jitter by delaying >>> the odd screen update. >> >> I just got this one, and the timestamp is at least reasonably close to >> a giant latency spike: >> >> [ 288.799654] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic >> update failure on pipe A (start=31 end=32) time 15 us, min 1073, max >> 1079, scanline start 1087, end 1088 >> >>> >>>> As I'm typing this, I've seen a couple instances of what seems like a >>>> full *second* of cursor latency, but I've only gotten the potential >>>> atomic update failure once. >>>> >>>> And is there any straightforward tracing to do to distinguish between >>>> PSR exit latency and other potential sources of latency? >>> >>> It looks plausible that we could at least report how long it takes the >>> registers to reflect the change in state (but we don't). The best source >>> of information atm is /sys/kernel/debug/dri/0/i915_edp_psr_status. >> >> Hmm. 
>> >> I went and looked at the code, and I noticed what could be bugs or >> could (more likely) be my confusion since I don't know this code at >> all: >> >> intel_single_frame_update() does something inscrutable to me, but I >> imagine it does something that causes the next page flip to get >> noticed by the panel even with PSR on. But how does the code that >> calls it know that anything happened? (Looking at the commit history, &
Re: [Intel-gfx] i915 PSR test results and cursor lag
On Sat, Feb 3, 2018 at 5:20 AM, Pandiyan, Dhinakaran wrote:
>
> On Fri, 2018-02-02 at 19:18 +0000, Andy Lutomirski wrote:
>> I updated to 4.15, and the situation is much worse. With
>> enable_psr=1, the system survives for several seconds and then the
>> screen stops updating entirely. If I boot with i915.enable_psr=1, I
>> get to the Fedora login screen and then the system dies. If I set
>> enable_psr=1 using sysfs, it does a bit after the next resume. It
>> seems like it also sometimes hangs even worse a bit after the screen
>> stops updating, but it's hard to tell.
>
> The login screen freeze sounds like what I have. Does this system have
> DMC firmware? If yes, can you try this series
> https://patchwork.freedesktop.org/series/37598/. You'll only need
> patches 1,8,9 and 10.

That fixes the hang. Feel free to add:

Tested-by: Andy Lutomirski

to the i915 parts. Also, any chance of getting it into the 4.15 stable
kernels?

--Andy
Re: [Intel-gfx] i915 PSR test results and cursor lag
On Fri, Feb 2, 2018 at 1:24 AM, Andy Lutomirski wrote: > On Thu, Feb 1, 2018 at 9:20 PM, Chris Wilson wrote: >> Quoting Andy Lutomirski (2018-02-01 21:04:30) >>> I got this after a recent suspend/resume: >>> >>> Feb 01 09:44:34 laptop systemd-logind[2412]: Lid closed. >>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scan all >>> dirs >>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: >>> scanning /sys/bus >>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: >>> scanning /sys/class >>> Feb 01 09:44:34 laptop systemd-logind[2412]: Failed to open >>> configuration file '/etc/systemd/sleep.conf': No such file or >>> directory >>> Feb 01 09:44:34 laptop systemd-logind[2412]: Suspending... >>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal >>> sender=n/a destination=n/a object=/org/freedesktop/login1 >>> interface=org.freedesktop.login1.Manager member=PrepareForSleep >>> cookie=570 reply >>> Feb 01 09:44:34 laptop systemd-logind[2412]: Got message >>> type=method_call sender=:1.46 destination=:1.1 >>> object=/org/freedesktop/login1/session/_32 >>> interface=org.freedesktop.login1.Session member=ReleaseDevice >>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal >>> sender=n/a destination=:1.46 >>> object=/org/freedesktop/login1/session/_32 >>> interface=org.freedesktop.login1.Session member=PauseDevice cookie >>> Feb 01 09:44:34 laptop gnome-shell[2630]: Failed to apply DRM plane >>> transform 0: Permission denied >>> Feb 01 09:44:34 laptop gnome-shell[2630]: drmModeSetCursor2 failed >>> with (Permission denied), drawing cursor with OpenGL from now on >>> >>> But I don't see the word "cursor" in my system logs before the first >>> suspend. What am I looking for? This is Fedora 27 running a Gnome >>> Wayland session, but it hasn't been reinstalled in some time, so it's >>> possible that there are some weird settings sitting around. 
But I did >>> check and I have no weird i915 parameters. >> >> You are using gnome-shell as the display server. From that it appears to >> have started off with a HW cursor and switched to a SW cursor after >> suspend. Did you notice a change in behaviour? After rebooting or just >> restarting gnome-shell? > > I think it's less consistently bad after a reboot before suspending. > >> >>> Also, are these things potentially related: >>> >>> [ 3067.702527] [drm:intel_pipe_update_start [i915]] *ERROR* Potential >>> atomic update failure on pipe A >> >> They are just "missed the immediate vblank for the screen update" >> messages. Should not be related to PSR, but may cause jitter by delaying >> the odd screen update. > > I just got this one, and the timestamp is at least reasonably close to > a giant latency spike: > > [ 288.799654] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic > update failure on pipe A (start=31 end=32) time 15 us, min 1073, max > 1079, scanline start 1087, end 1088 > >> >>> As I'm typing this, I've seen a couple instances of what seems like a >>> full *second* of cursor latency, but I've only gotten the potential >>> atomic update failure once. >>> >>> And is there any straightforward tracing to do to distinguish between >>> PSR exit latency and other potential sources of latency? >> >> It looks plausible that we could at least report how long it takes the >> registers to reflect the change in state (but we don't). The best source >> of information atm is /sys/kernel/debug/dri/0/i915_edp_psr_status. > > Hmm. > > I went and looked at the code, and I noticed what could be bugs or > could (more likely) be my confusion since I don't know this code at > all: > > intel_single_frame_update() does something inscrutable to me, but I > imagine it does something that causes the next page flip to get > noticed by the panel even with PSR on. But how does the code that > calls it know that anything happened? 
(Looking at the commit history, > maybe this is something special that's only needed on some platforms > but doesn't replace the normal PSR exit sequence.) > > Perhaps more interestingly, intel_psr_flush() does this: > > /* By definition flush = invalidate + flush */ > if (frontbuffer_bits) > intel_psr_exit(dev_priv); > > if (!dev_priv->psr.active && !dev_priv->psr.
Re: [Intel-gfx] i915 PSR test results and cursor lag
On Thu, Feb 1, 2018 at 9:20 PM, Chris Wilson wrote: > Quoting Andy Lutomirski (2018-02-01 21:04:30) >> I got this after a recent suspend/resume: >> >> Feb 01 09:44:34 laptop systemd-logind[2412]: Lid closed. >> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scan all dirs >> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: >> scanning /sys/bus >> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: >> scanning /sys/class >> Feb 01 09:44:34 laptop systemd-logind[2412]: Failed to open >> configuration file '/etc/systemd/sleep.conf': No such file or >> directory >> Feb 01 09:44:34 laptop systemd-logind[2412]: Suspending... >> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal >> sender=n/a destination=n/a object=/org/freedesktop/login1 >> interface=org.freedesktop.login1.Manager member=PrepareForSleep >> cookie=570 reply >> Feb 01 09:44:34 laptop systemd-logind[2412]: Got message >> type=method_call sender=:1.46 destination=:1.1 >> object=/org/freedesktop/login1/session/_32 >> interface=org.freedesktop.login1.Session member=ReleaseDevice >> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal >> sender=n/a destination=:1.46 >> object=/org/freedesktop/login1/session/_32 >> interface=org.freedesktop.login1.Session member=PauseDevice cookie >> Feb 01 09:44:34 laptop gnome-shell[2630]: Failed to apply DRM plane >> transform 0: Permission denied >> Feb 01 09:44:34 laptop gnome-shell[2630]: drmModeSetCursor2 failed >> with (Permission denied), drawing cursor with OpenGL from now on >> >> But I don't see the word "cursor" in my system logs before the first >> suspend. What am I looking for? This is Fedora 27 running a Gnome >> Wayland session, but it hasn't been reinstalled in some time, so it's >> possible that there are some weird settings sitting around. But I did >> check and I have no weird i915 parameters. > > You are using gnome-shell as the display server. 
From that it appears to > have started off with a HW cursor and switched to a SW cursor after > suspend. Did you notice a change in behaviour? After rebooting or just > restarting gnome-shell? I think it's less consistently bad after a reboot before suspending. > >> Also, are these things potentially related: >> >> [ 3067.702527] [drm:intel_pipe_update_start [i915]] *ERROR* Potential >> atomic update failure on pipe A > > They are just "missed the immediate vblank for the screen update" > messages. Should not be related to PSR, but may cause jitter by delaying > the odd screen update. I just got this one, and the timestamp is at least reasonably close to a giant latency spike: [ 288.799654] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=31 end=32) time 15 us, min 1073, max 1079, scanline start 1087, end 1088 > >> As I'm typing this, I've seen a couple instances of what seems like a >> full *second* of cursor latency, but I've only gotten the potential >> atomic update failure once. >> >> And is there any straightforward tracing to do to distinguish between >> PSR exit latency and other potential sources of latency? > > It looks plausible that we could at least report how long it takes the > registers to reflect the change in state (but we don't). The best source > of information atm is /sys/kernel/debug/dri/0/i915_edp_psr_status. Hmm. I went and looked at the code, and I noticed what could be bugs or could (more likely) be my confusion since I don't know this code at all: intel_single_frame_update() does something inscrutable to me, but I imagine it does something that causes the next page flip to get noticed by the panel even with PSR on. But how does the code that calls it know that anything happened? (Looking at the commit history, maybe this is something special that's only needed on some platforms but doesn't replace the normal PSR exit sequence.) 
Perhaps more interestingly, intel_psr_flush() does this: /* By definition flush = invalidate + flush */ if (frontbuffer_bits) intel_psr_exit(dev_priv); if (!dev_priv->psr.active && !dev_priv->psr.busy_frontbuffer_bits) if (!work_busy(&dev_priv->psr.work.work)) schedule_delayed_work(&dev_priv->psr.work, msecs_to_jiffies(100)); I'm guessing that the idea is that we're turning off PSR because we want the panel to update and we expect that, in 100ms, the update will have hit the panel and we'll have been idle long enough for it to make sense to re-enter PSR. IOW, the code
Re: [Intel-gfx] i915 PSR test results and cursor lag
On Thu, Feb 1, 2018 at 9:53 AM, Chris Wilson wrote:
> Quoting Andy Lutomirski (2018-02-01 17:40:22)
>> *However*, I do see one unfortunate side effect of turning on PSR. It
>> seems that, when I move my cursor a little bit after a few seconds of
>> doing nothing, there seems to be a little bit of lag, as if either a
>> few frames are dropped at the beginning of the motion or maybe the
>> entire motion is delayed a bit. I don't notice a similar delay when
>> typing, so I'm wondering if maybe there's a minor driver bug in which
>> the driver doesn't kick the panel out of PSR quite as quickly when the
>> cursor is updated as it does when the framebuffer is updated.
>
> One thing that's important to know regarding the cursor is whether the
> display server is using a HW cursor or SW cursor. Could you please attach
> the log from the display server (or if you are using a stock
> distribution that's probably enough to work out what it is using)?
> -Chris

Looking at the logs, I see a few things. First, I have a few of these:

Feb 01 09:24:24 laptop kernel: [drm:intel_pipe_update_start [i915]] *ERROR* Potential atomic update failure on pipe A
Feb 01 09:24:48 laptop org.gnome.Shell.desktop[3261]: libinput error: event15 - libinput error: DLL0704:01 06CB:76AE Touchpad: libinput error: kernel bug: Touch jump detected and discarded.
Feb 01 09:24:48 laptop org.gnome.Shell.desktop[3261]: See https://wayland.freedesktop.org/libinput/doc/1.9.3/touchpad_jumping_cursor.html for details
Feb 01 09:24:50 laptop org.gnome.Shell.desktop[3261]: libinput error: event15 - libinput error: DLL0704:01 06CB:76AE Touchpad: libinput error: kernel bug: Touch jump detected and discarded.
Feb 01 09:24:50 laptop org.gnome.Shell.desktop[3261]: See https://wayland.freedesktop.org/libinput/doc/1.9.3/touchpad_jumping_cursor.html for details

(Hi, Peter!) So it's entirely possible that what I'm seeing is
actually an input issue that's exacerbated by PSR for some bizarre
reason.

I got this after a recent suspend/resume:

Feb 01 09:44:34 laptop systemd-logind[2412]: Lid closed.
Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scan all dirs
Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scanning /sys/bus
Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scanning /sys/class
Feb 01 09:44:34 laptop systemd-logind[2412]: Failed to open configuration file '/etc/systemd/sleep.conf': No such file or directory
Feb 01 09:44:34 laptop systemd-logind[2412]: Suspending...
Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal sender=n/a destination=n/a object=/org/freedesktop/login1 interface=org.freedesktop.login1.Manager member=PrepareForSleep cookie=570 reply
Feb 01 09:44:34 laptop systemd-logind[2412]: Got message type=method_call sender=:1.46 destination=:1.1 object=/org/freedesktop/login1/session/_32 interface=org.freedesktop.login1.Session member=ReleaseDevice
Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal sender=n/a destination=:1.46 object=/org/freedesktop/login1/session/_32 interface=org.freedesktop.login1.Session member=PauseDevice cookie
Feb 01 09:44:34 laptop gnome-shell[2630]: Failed to apply DRM plane transform 0: Permission denied
Feb 01 09:44:34 laptop gnome-shell[2630]: drmModeSetCursor2 failed with (Permission denied), drawing cursor with OpenGL from now on

But I don't see the word "cursor" in my system logs before the first
suspend. What am I looking for? This is Fedora 27 running a Gnome
Wayland session, but it hasn't been reinstalled in some time, so it's
possible that there are some weird settings sitting around. But I did
check and I have no weird i915 parameters.

Also, are these things potentially related:

[ 3067.702527] [drm:intel_pipe_update_start [i915]] *ERROR* Potential atomic update failure on pipe A

As I'm typing this, I've seen a couple instances of what seems like a
full *second* of cursor latency, but I've only gotten the potential
atomic update failure once.

And is there any straightforward tracing to do to distinguish between
PSR exit latency and other potential sources of latency?
Re: i915 PSR test results and cursor lag
On Thu, Feb 1, 2018 at 9:40 AM, Andy Lutomirski wrote:
> Hi-
>
> As requested in your blog post, I tested PSR. I see something like
> 2.69W with PSR off and 2.17W with PSR on. Screen blanking,
> suspend/resume, and the contents of the screen all seem okay. This is
> a Dell XPS 13 9350, i.e.:
>
> System Information
>         Manufacturer: Dell Inc.
>         Product Name: XPS 13 9350
>
> EDID is attached.
>
> *However*, I do see one unfortunate side effect of turning on PSR. It
> seems that, when I move my cursor a little bit after a few seconds of
> doing nothing, there seems to be a little bit of lag, as if either a
> few frames are dropped at the beginning of the motion or maybe the
> entire motion is delayed a bit. I don't notice a similar delay when
> typing, so I'm wondering if maybe there's a minor driver bug in which
> the driver doesn't kick the panel out of PSR quite as quickly when the
> cursor is updated as it does when the framebuffer is updated.
>

I'm also getting occasional messages like:

[ 2675.574486] [drm:intel_pipe_update_start [i915]] *ERROR* Potential atomic update failure on pipe A

with PSR on. But there is nowhere near one of these messages per tiny
lag incident.
i915 PSR test results and cursor lag
Hi-

As requested in your blog post, I tested PSR. I see something like
2.69W with PSR off and 2.17W with PSR on. Screen blanking,
suspend/resume, and the contents of the screen all seem okay. This is
a Dell XPS 13 9350, i.e.:

System Information
        Manufacturer: Dell Inc.
        Product Name: XPS 13 9350

EDID is attached.

*However*, I do see one unfortunate side effect of turning on PSR. It
seems that, when I move my cursor a little bit after a few seconds of
doing nothing, there is a little bit of lag, as if either a
few frames are dropped at the beginning of the motion or maybe the
entire motion is delayed a bit. I don't notice a similar delay when
typing, so I'm wondering if maybe there's a minor driver bug in which
the driver doesn't kick the panel out of PSR quite as quickly when the
cursor is updated as it does when the framebuffer is updated.

(A couple of lists are cc'd.)

BTW, switching PSR on and off using
/sys/module/i915/parameters/enable_psr seems to work fine, although it
seems like I may need to suspend/resume to get it to kick in. But, if
there's really going to be a blacklist or whitelist of panels in
userspace, shouldn't there be an option in sysfs in
/sys/class/drm/card0-eDP-1/ or similar?

--Andy
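For anyone repeating the measurement, a minimal sketch of checking the current setting from a test program. It assumes the parameter file holds a single decimal integer, which I have not verified for every kernel version, and read_int_param() is a hypothetical helper name. The only path taken from this message is /sys/module/i915/parameters/enable_psr.

```c
#include <stdio.h>

/*
 * Read one decimal integer from a sysfs-style file such as
 * /sys/module/i915/parameters/enable_psr.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_int_param(const char *path, int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Logging the value next to each power reading makes it harder to mix up which run had PSR on, given that a suspend/resume may be needed before a change takes effect.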
Skylake underruns on 4.8-rc4
My Dell XPS 13 9350 laptop just got a buffer underrun:

[drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun

I'm seeing this very occasionally, and they don't come in groups -- I
seem to get one underrun with a black flash and that's it. This is
with just the laptop screen -- nothing at all is plugged in to the
USB-C port.

4.8-rc4 has the latest round of fixes applied, so
i915/skl_dmc_ver1_26.bin loaded successfully and the SAGV fix is
there. I had the same problem on 4.8-rc3. 4.7 seemed okay.

I have:

00:02.0 VGA compatible controller: Intel Corporation HD Graphics 520 (rev 07)

--Andy
[Nouveau] Should I expect nouveau on 4.6 to work on a GM206?
On Sun, Jun 26, 2016 at 10:59 AM, Ilia Mirkin wrote: > On Sun, Jun 26, 2016 at 1:49 PM, Andy Lutomirski > wrote: >> On Sun, May 29, 2016 at 12:27 PM, Andy Lutomirski wrote: >>> On Sun, May 29, 2016 at 12:22 PM, Ilia Mirkin >>> wrote: >>>> On Sun, May 29, 2016 at 3:07 PM, Andy Lutomirski >>>> wrote: >>>>> On Sat, May 28, 2016 at 5:48 PM, Ilia Mirkin >>>>> wrote: >>>>>> Do you have mesa 11.2 or later? GM20x support was only added in mesa >>>>>> 11.2. >>>>>> >>>>> >>>>> I just upgraded to 11.2. I'm getting errors like this in the log: >>>>> >>>>> [ 5383.723240] nouveau :09:00.0: fifo: read fault at 011000 >>>>> engine 07 [PBDMA0] client 06 [HOST] reason 00 [PDE] on channel -1 >>>>> [007f9ed000 unknown] >>>>> [ 5398.722676] nouveau :09:00.0: systemd-logind[30778]: failed to >>>>> idle channel 2 [systemd-logind[30778]] >>>>> [ 5413.722853] nouveau :09:00.0: systemd-logind[30778]: failed to >>>>> idle channel 2 [systemd-logind[30778]] >>>>> >>>>> and the display output in general is unreliable enough that I'm having >>>>> trouble telling whether the performance is remotely reasonable. >>>> >>>> If you're having trouble telling, that means it's not :) The error you >>>> pasted is quite odd. Was there anything in the log before those >>>> messages? If there's no channel associated, that means that it's the >>>> background copying between vram and sysmem? Not sure. >>> >>> Don't get too excited yet. In the process of upgrading mesa, I >>> managed to boot 4.5 without noticing. I'll post back later today with >>> actual valid test results. >>> >> >> I replaced the monitor (turns out that my monitor had a known DP >> problem), and now the screen lights up reliably. I still get > > Great to hear! 
> >> occasional log lines like this: >> >> [Jun26 09:25] nouveau :09:00.0: fifo: FB_FLUSH_TIMEOUT >> [Jun26 09:30] nouveau :09:00.0: fifo: FB_FLUSH_TIMEOUT >> [Jun26 09:32] nouveau :09:00.0: fifo: CHSW_ERROR 0004 >> [ +0.000162] nouveau :09:00.0: fifo: CHSW_ERROR 0005 > > These don't sound good at all! > >> [Jun26 09:46] nouveau :09:00.0: disp: outp 04:0006:0f44: link >> training failed >> [ +0.107894] nouveau :09:00.0: disp: outp 04:0006:0f44: link >> training failed > > These are surprising if your monitor is working. Usually it means > "couldn't establish link with the monitor". Perhaps something forces > it to retry and it eventually succeeds. Given the timing, I'm guessing that it tries a couple of times and eventually works. Given that the monitor is a newer, "fixed" revision of a known-seriously-broken Dell monitor, it wouldn't shock me if what's actually happening is that the monitor uses buggy DP hardware and the "fixed" firmware A03 actually works by forcing several retries when link training fails (as opposed to what A00 - A02 did, which seemed to involve failing a few times and then crashing, sometimes hard enough that even the monitor power button stopped working). > >> >> but they aren't causing an obvious problem. >> >>>> >>>> Note that with maxwell we have yet to add EXA support to >>>> xf86-video-nouveau, so you're ending up with GLAMOR (and Ben and I >>>> disagree on whether EXA support should be added in the first place). >>>> There was also an issue that glamor was hitting with nouveau which >>>> appears to have dissipated, either due to a change in nouveau or a >>>> change in glamor. So you might consider upgrading to Xorg 1.18.3 (as >>>> glamor is part of X). 
>> >> I do have a serious performance issue, though: when I scroll in >> Firefox (default configuration), the whole system drops to ~1fps or >> less and, if I scroll enough (even putting the mouse over a simple >> page like start.fedoraproject.org and flicking the wheel up and down a >> few times), the entire desktop will become unusable for several >> seconds. I seem to have this problem under X and under Wayland. >> >> For better or for worse, forcing Firefox's layers acceleration on >> fixes the problem and scrolling is fast. >> >> I have no idea whether this is an X problem, a gnome-shell problem, a >> mesa problem, a kernel problem, or someth
[Nouveau] Should I expect nouveau on 4.6 to work on a GM206?
On Sun, May 29, 2016 at 12:27 PM, Andy Lutomirski wrote: > On Sun, May 29, 2016 at 12:22 PM, Ilia Mirkin wrote: >> On Sun, May 29, 2016 at 3:07 PM, Andy Lutomirski wrote: >>> On Sat, May 28, 2016 at 5:48 PM, Ilia Mirkin >>> wrote: >>>> Do you have mesa 11.2 or later? GM20x support was only added in mesa 11.2. >>>> >>> >>> I just upgraded to 11.2. I'm getting errors like this in the log: >>> >>> [ 5383.723240] nouveau :09:00.0: fifo: read fault at 011000 >>> engine 07 [PBDMA0] client 06 [HOST] reason 00 [PDE] on channel -1 >>> [007f9ed000 unknown] >>> [ 5398.722676] nouveau :09:00.0: systemd-logind[30778]: failed to >>> idle channel 2 [systemd-logind[30778]] >>> [ 5413.722853] nouveau :09:00.0: systemd-logind[30778]: failed to >>> idle channel 2 [systemd-logind[30778]] >>> >>> and the display output in general is unreliable enough that I'm having >>> trouble telling whether the performance is remotely reasonable. >> >> If you're having trouble telling, that means it's not :) The error you >> pasted is quite odd. Was there anything in the log before those >> messages? If there's no channel associated, that means that it's the >> background copying between vram and sysmem? Not sure. > > Don't get too excited yet. In the process of upgrading mesa, I > managed to boot 4.5 without noticing. I'll post back later today with > actual valid test results. > I replaced the monitor (turns out that my monitor had a known DP problem), and now the screen lights up reliably. I still get occasional log lines like this: [Jun26 09:25] nouveau :09:00.0: fifo: FB_FLUSH_TIMEOUT [Jun26 09:30] nouveau :09:00.0: fifo: FB_FLUSH_TIMEOUT [Jun26 09:32] nouveau :09:00.0: fifo: CHSW_ERROR 0004 [ +0.000162] nouveau :09:00.0: fifo: CHSW_ERROR 0005 [Jun26 09:46] nouveau :09:00.0: disp: outp 04:0006:0f44: link training failed [ +0.107894] nouveau :09:00.0: disp: outp 04:0006:0f44: link training failed but they aren't causing an obvious problem. 
>> >> Note that with maxwell we have yet to add EXA support to >> xf86-video-nouveau, so you're ending up with GLAMOR (and Ben and I >> disagree on whether EXA support should be added in the first place). >> There was also an issue that glamor was hitting with nouveau which >> appears to have dissipated, either due to a change in nouveau or a >> change in glamor. So you might consider upgrading to Xorg 1.18.3 (as >> glamor is part of X). I do have a serious performance issue, though: when I scroll in Firefox (default configuration), the whole system drops to ~1fps or less and, if I scroll enough (even putting the mouse over a simple page like start.fedoraproject.org and flicking the wheel up and down a few times), the entire desktop will become unusable for several seconds. I seem to have this problem under X and under Wayland. For better or for worse, forcing Firefox's layers acceleration on fixes the problem and scrolling is fast. I have no idea whether this is an X problem, a gnome-shell problem, a mesa problem, a kernel problem, or something else.
DP link training and performance issues with HDMI USB-C dongle and Skylake
I have a Dell XPS 13 9350 (Skylake) and a Dell DA200 adapter. The latter is a Thunderbolt device that includes an HDMI port and connects over USB Type C. I believe that it's internally using DP Alternate Mode. When I plug it in on 4.7-rc4, I get spew like this: [ 90.718106] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 91.077604] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 91.437059] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 91.796479] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 92.156101] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 92.515647] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 92.875184] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 93.234735] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 93.594294] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 93.953812] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 94.313390] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 94.673043] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 95.032890] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 95.393016] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 95.752879] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 96.113074] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 96.473068] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 96.833185] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 97.193233] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 97.553138] [drm:intel_dp_start_link_train 
[i915]] *ERROR* failed to train DP, aborting [ 97.913526] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 98.273525] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 98.634178] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 98.993859] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 99.354484] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 99.714669] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 100.077412] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 100.432684] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 100.792499] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 101.152378] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 101.512265] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 101.872466] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 102.232284] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 102.592251] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 103.111283] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 103.466511] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 103.826082] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 104.191906] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 104.547038] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 104.911264] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 105.270679] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 105.625774] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to 
train DP, aborting [ 105.986064] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 106.350045] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 106.705325] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 107.064897] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 107.431263] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 107.790793] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 108.146016] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 108.506093] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 108.865924] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [ 109.225629] [drm:intel_dp_start_link_train [i915]] *ERROR* failed to train DP, aborting [
[Nouveau] Should I expect nouveau on 4.6 to work on a GM206?
On Sun, May 29, 2016 at 12:22 PM, Ilia Mirkin wrote: > On Sun, May 29, 2016 at 3:07 PM, Andy Lutomirski wrote: >> On Sat, May 28, 2016 at 5:48 PM, Ilia Mirkin wrote: >>> Do you have mesa 11.2 or later? GM20x support was only added in mesa 11.2. >>> >> >> I just upgraded to 11.2. I'm getting errors like this in the log: >> >> [ 5383.723240] nouveau :09:00.0: fifo: read fault at 011000 >> engine 07 [PBDMA0] client 06 [HOST] reason 00 [PDE] on channel -1 >> [007f9ed000 unknown] >> [ 5398.722676] nouveau :09:00.0: systemd-logind[30778]: failed to >> idle channel 2 [systemd-logind[30778]] >> [ 5413.722853] nouveau :09:00.0: systemd-logind[30778]: failed to >> idle channel 2 [systemd-logind[30778]] >> >> and the display output in general is unreliable enough that I'm having >> trouble telling whether the performance is remotely reasonable. > > If you're having trouble telling, that means it's not :) The error you > pasted is quite odd. Was there anything in the log before those > messages? If there's no channel associated, that means that it's the > background copying between vram and sysmem? Not sure. Don't get too excited yet. In the process of upgrading mesa, I managed to boot 4.5 without noticing. I'll post back later today with actual valid test results. > > Note that with maxwell we have yet to add EXA support to > xf86-video-nouveau, so you're ending up with GLAMOR (and Ben and I > disagree on whether EXA support should be added in the first place). > There was also an issue that glamor was hitting with nouveau which > appears to have dissipated, either due to a change in nouveau or a > change in glamor. So you might consider upgrading to Xorg 1.18.3 (as > glamor is part of X). > > FWIW a few other people have been using GM20x without incident, but > this can all be very sensitive to your desktop/etc. Lots of things > like to use GL nowadays - I stick to a more classic desktop - no > compositor, simple window manager, etc. This is GNOME 3 on Fedora 24 Beta. --Andy
[Nouveau] Should I expect nouveau on 4.6 to work on a GM206?
On Sat, May 28, 2016 at 5:48 PM, Ilia Mirkin wrote: > Do you have mesa 11.2 or later? GM20x support was only added in mesa 11.2. > I just upgraded to 11.2. I'm getting errors like this in the log: [ 5383.723240] nouveau :09:00.0: fifo: read fault at 011000 engine 07 [PBDMA0] client 06 [HOST] reason 00 [PDE] on channel -1 [007f9ed000 unknown] [ 5398.722676] nouveau :09:00.0: systemd-logind[30778]: failed to idle channel 2 [systemd-logind[30778]] [ 5413.722853] nouveau :09:00.0: systemd-logind[30778]: failed to idle channel 2 [systemd-logind[30778]] and the display output in general is unreliable enough that I'm having trouble telling whether the performance is remotely reasonable. --Andy > Cheers, > > -ilia > > On Sat, May 28, 2016 at 4:51 PM, Andy Lutomirski wrote: >> I have the signed firmware (I think) and I'm running a fresh 4.6 >> kernel. I got an image to show up briefly, rendering the Fedora >> sign-in screen at something like one frame per ten seconds. But then >> I got all kinds of garbage, and I see: >> >> [ 719.300820] nouveau :09:00.0: disp: outp 04:0006:0f44: link >> training failed >> >> dmesg |grep nouveau says: >> >> [ 10.053162] fb: switching to nouveaufb from EFI VGA >> [ 10.053349] nouveau :09:00.0: NVIDIA GM206 (126010a1) >> [ 10.174033] nouveau :09:00.0: bios: version 84.06.0d.00.01 >> [ 10.174854] nouveau :09:00.0: disp: dcb 15 type 8 unknown >> [ 10.178375] nouveau :09:00.0: fb: 2048 MiB GDDR5 >> [ 10.202108] nouveau :09:00.0: DRM: VRAM: 2048 MiB >> [ 10.202109] nouveau :09:00.0: DRM: GART: 1048576 MiB >> [ 10.202113] nouveau :09:00.0: DRM: TMDS table version 2.0 >> [ 10.202114] nouveau :09:00.0: DRM: DCB version 4.1 >> [ 10.202116] nouveau :09:00.0: DRM: DCB outp 00: 01000f02 00020030 >> [ 10.202117] nouveau :09:00.0: DRM: DCB outp 01: 02000f00 >> [ 10.202118] nouveau :09:00.0: DRM: DCB outp 02: 02811f76 04400020 >> [ 10.202120] nouveau :09:00.0: DRM: DCB outp 03: 02011f72 00020020 >> [ 10.202121] nouveau :09:00.0: DRM: DCB outp 04: 04822f86 
04400010 >> [ 10.202122] nouveau :09:00.0: DRM: DCB outp 05: 04022f82 00020010 >> [ 10.202123] nouveau :09:00.0: DRM: DCB outp 06: 04833f96 04400020 >> [ 10.202124] nouveau :09:00.0: DRM: DCB outp 07: 04033f92 00020020 >> [ 10.202125] nouveau :09:00.0: DRM: DCB outp 08: 02044f62 00020010 >> [ 10.202126] nouveau :09:00.0: DRM: DCB outp 15: 01df5ff8 >> [ 10.202127] nouveau :09:00.0: DRM: DCB conn 00: 1030 >> [ 10.202128] nouveau :09:00.0: DRM: DCB conn 01: 00020146 >> [ 10.202129] nouveau :09:00.0: DRM: DCB conn 02: 01000246 >> [ 10.202130] nouveau :09:00.0: DRM: DCB conn 03: 02000346 >> [ 10.202131] nouveau :09:00.0: DRM: DCB conn 04: 00010461 >> [ 10.202132] nouveau :09:00.0: DRM: DCB conn 05: 0570 >> [ 10.202134] nouveau :09:00.0: DRM: Pointer to flat panel table invalid >> [ 10.214683] nouveau :09:00.0: DRM: unknown connector type 70 >> [ 10.214728] nouveau :09:00.0: DRM: failed to create encoder 1/8/0: -19 >> [ 10.214730] nouveau :09:00.0: DRM: Unknown-1 has no encoders, removing >> [ 10.369691] nouveau :09:00.0: DRM: MM: using COPY for buffer copies >> [ 10.478561] nouveau :09:00.0: priv: GPC0: 419df4 (1e40820e) >> [ 10.478578] nouveau :09:00.0: priv: GPC1: 419df4 (1e40820e) >> [ 10.607100] nouveau :09:00.0: DRM: allocated 3840x2160 fb: >> 0x6, bo 88044aad7400 >> [ 10.607276] fbcon: nouveaufb (fb0) is primary device >> [ 10.607576] nouveau :09:00.0: fb0: nouveaufb frame buffer device >> [ 10.617064] [drm] Initialized nouveau 1.3.1 20120801 for >> :09:00.0 on minor 0 >> [ 719.282184] nouveau :09:00.0: disp: outp 04:0006:0f44: link >> training failed >> [ 719.300820] nouveau :09:00.0: disp: outp 04:0006:0f44: link >> training failed >> >> >> >> Thanks, >> Andy >> ___ >> Nouveau mailing list >> Nouveau at lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/nouveau
Should I expect nouveau on 4.6 to work on a GM206?
I have the signed firmware (I think) and I'm running a fresh 4.6 kernel. I got an image to show up briefly, rendering the Fedora sign-in screen at something like one frame per ten seconds. But then I got all kinds of garbage, and I see: [ 719.300820] nouveau :09:00.0: disp: outp 04:0006:0f44: link training failed dmesg |grep nouveau says: [ 10.053162] fb: switching to nouveaufb from EFI VGA [ 10.053349] nouveau :09:00.0: NVIDIA GM206 (126010a1) [ 10.174033] nouveau :09:00.0: bios: version 84.06.0d.00.01 [ 10.174854] nouveau :09:00.0: disp: dcb 15 type 8 unknown [ 10.178375] nouveau :09:00.0: fb: 2048 MiB GDDR5 [ 10.202108] nouveau :09:00.0: DRM: VRAM: 2048 MiB [ 10.202109] nouveau :09:00.0: DRM: GART: 1048576 MiB [ 10.202113] nouveau :09:00.0: DRM: TMDS table version 2.0 [ 10.202114] nouveau :09:00.0: DRM: DCB version 4.1 [ 10.202116] nouveau :09:00.0: DRM: DCB outp 00: 01000f02 00020030 [ 10.202117] nouveau :09:00.0: DRM: DCB outp 01: 02000f00 [ 10.202118] nouveau :09:00.0: DRM: DCB outp 02: 02811f76 04400020 [ 10.202120] nouveau :09:00.0: DRM: DCB outp 03: 02011f72 00020020 [ 10.202121] nouveau :09:00.0: DRM: DCB outp 04: 04822f86 04400010 [ 10.202122] nouveau :09:00.0: DRM: DCB outp 05: 04022f82 00020010 [ 10.202123] nouveau :09:00.0: DRM: DCB outp 06: 04833f96 04400020 [ 10.202124] nouveau :09:00.0: DRM: DCB outp 07: 04033f92 00020020 [ 10.202125] nouveau :09:00.0: DRM: DCB outp 08: 02044f62 00020010 [ 10.202126] nouveau :09:00.0: DRM: DCB outp 15: 01df5ff8 [ 10.202127] nouveau :09:00.0: DRM: DCB conn 00: 1030 [ 10.202128] nouveau :09:00.0: DRM: DCB conn 01: 00020146 [ 10.202129] nouveau :09:00.0: DRM: DCB conn 02: 01000246 [ 10.202130] nouveau :09:00.0: DRM: DCB conn 03: 02000346 [ 10.202131] nouveau :09:00.0: DRM: DCB conn 04: 00010461 [ 10.202132] nouveau :09:00.0: DRM: DCB conn 05: 0570 [ 10.202134] nouveau :09:00.0: DRM: Pointer to flat panel table invalid [ 10.214683] nouveau :09:00.0: DRM: unknown connector type 70 [ 10.214728] nouveau :09:00.0: DRM: 
failed to create encoder 1/8/0: -19 [ 10.214730] nouveau :09:00.0: DRM: Unknown-1 has no encoders, removing [ 10.369691] nouveau :09:00.0: DRM: MM: using COPY for buffer copies [ 10.478561] nouveau :09:00.0: priv: GPC0: 419df4 (1e40820e) [ 10.478578] nouveau :09:00.0: priv: GPC1: 419df4 (1e40820e) [ 10.607100] nouveau :09:00.0: DRM: allocated 3840x2160 fb: 0x6, bo 88044aad7400 [ 10.607276] fbcon: nouveaufb (fb0) is primary device [ 10.607576] nouveau :09:00.0: fb0: nouveaufb frame buffer device [ 10.617064] [drm] Initialized nouveau 1.3.1 20120801 for :09:00.0 on minor 0 [ 719.282184] nouveau :09:00.0: disp: outp 04:0006:0f44: link training failed [ 719.300820] nouveau :09:00.0: disp: outp 04:0006:0f44: link training failed Thanks, Andy
i915 4.5 bugfix backport and release management issue?
On Tue, Mar 29, 2016 at 12:49 AM, Andy Lutomirski wrote: > On Tue, Mar 29, 2016 at 12:43 AM, Daniel Vetter > wrote: >> On Tue, Mar 29, 2016 at 4:39 AM, Andy Lutomirski >> wrote: >>> AFAICT something got rather screwed up in i915 land for 4.5. >>> >>> $ git log --oneline --grep='Pretend cursor is always on' v4.5 >>> drivers/gpu/drm/i915/ >>> e2e407dc093f drm/i915: Pretend cursor is always on for ILK-style WM >>> calculations (v2) >>> >>> $ git log --oneline --grep='Pretend cursor is always on' v4.6-rc1 >>> drivers/gpu/drm/i915/ >>> e2e407dc093f drm/i915: Pretend cursor is always on for ILK-style WM >>> calculations (v2) >>> b2435692dbb7 drm/i915: Pretend cursor is always on for ILK-style WM >>> calculations (v2) >>> >>> The two patches there are almost, but not quite, the same thing, which >>> makes me wonder how they both ended up in Linus' tree without an >>> obvious merge conflict. >>> >>> I have no idea what caused this. However, I think (on very little >>> inspection, but it's consistent with problems I have with 4.5 on my >>> laptop) that the first one is an *incorrect* fix for a regression in >>> 4.5 and the second is a correct fix for the same regression. 4.6-rc1 >>> seems okay. >>> >>> I reported the regression and everyone involved has known about it for >>> weeks. Nonetheless, 4.5 final is busted. >> >> Quoting from e2e407dc093f >> >> "(cherry picked from commit b2435692dbb709d4c8ff3b2f2815c9b8423b72bb)" >> >> i.e. this is intentionally twice in the history. We started to soak >> bugfixes in -next and then cherry pick them because we had too much >> fun with things blowing up, and also too much fun with really messy >> conflicts. It's not a botched patch in 4.5 or anything else nefarious >> at all. > > Bah, sorry, I read it wrong. They have the same final state but they > were on different bases. I somehow reversed this in my head and > thought they had the same initial state and different final states. > Also, sorry for the excessive diatribe. 
I plead sleepiness and mis-reading of code. --Andy
i915 4.5 bugfix backport and release management issue?
On Tue, Mar 29, 2016 at 12:43 AM, Daniel Vetter wrote: > On Tue, Mar 29, 2016 at 4:39 AM, Andy Lutomirski > wrote: >> AFAICT something got rather screwed up in i915 land for 4.5. >> >> $ git log --oneline --grep='Pretend cursor is always on' v4.5 >> drivers/gpu/drm/i915/ >> e2e407dc093f drm/i915: Pretend cursor is always on for ILK-style WM >> calculations (v2) >> >> $ git log --oneline --grep='Pretend cursor is always on' v4.6-rc1 >> drivers/gpu/drm/i915/ >> e2e407dc093f drm/i915: Pretend cursor is always on for ILK-style WM >> calculations (v2) >> b2435692dbb7 drm/i915: Pretend cursor is always on for ILK-style WM >> calculations (v2) >> >> The two patches there are almost, but not quite, the same thing, which >> makes me wonder how they both ended up in Linus' tree without an >> obvious merge conflict. >> >> I have no idea what caused this. However, I think (on very little >> inspection, but it's consistent with problems I have with 4.5 on my >> laptop) that the first one is an *incorrect* fix for a regression in >> 4.5 and the second is a correct fix for the same regression. 4.6-rc1 >> seems okay. >> >> I reported the regression and everyone involved has known about it for >> weeks. Nonetheless, 4.5 final is busted. > > Quoting from e2e407dc093f > > "(cherry picked from commit b2435692dbb709d4c8ff3b2f2815c9b8423b72bb)" > > i.e. this is intentionally twice in the history. We started to soak > bugfixes in -next and then cherry pick them because we had too much > fun with things blowing up, and also too much fun with really messy > conflicts. It's not a botched patch in 4.5 or anything else nefarious > at all. Bah, sorry, I read it wrong. They have the same final state but they were on different bases. I somehow reversed this in my head and thought they had the same initial state and different final states. > > - We've genuinely failed to cherry-pick a bugfix over. It happens, > despite our best efforts (which of course includes running stuff on > Linus' tree). 
Please do a reverse bisect so we know which precise > commit fell through the cracks. If I find some time, I'll try. I've already failed miserably at bisecting this thing once. --Andy
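The trailer Daniel quotes is what `git cherry-pick -x` appends to a commit message, and it is the quickest way to tell an intentional duplicate from a botched one. As a minimal sketch, assuming the commit message is already available as a string, the upstream commit ID can be pulled out like this (the function name `cherry_pick_sources` is hypothetical):

```python
import re
from typing import List

# `git cherry-pick -x` appends a line of this form to the commit message.
CHERRY_PICK_RE = re.compile(r"\(cherry picked from commit ([0-9a-f]{7,40})\)")


def cherry_pick_sources(commit_message: str) -> List[str]:
    """Return the upstream commit IDs named in cherry-pick trailers."""
    return CHERRY_PICK_RE.findall(commit_message)


if __name__ == "__main__":
    # Subject and trailer as quoted in this thread for e2e407dc093f.
    message = (
        "drm/i915: Pretend cursor is always on for ILK-style WM calculations (v2)\n"
        "\n"
        "(cherry picked from commit b2435692dbb709d4c8ff3b2f2815c9b8423b72bb)\n"
    )
    print(cherry_pick_sources(message))
```

The message text could come from something like `git log --format=%B -n 1 e2e407dc093f`; a non-empty result means the commit is a deliberate copy of the listed commit rather than an independent fix.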
i915 4.5 bugfix backport and release management issue?
Hi all-

AFAICT something got rather screwed up in i915 land for 4.5.

$ git log --oneline --grep='Pretend cursor is always on' v4.5 drivers/gpu/drm/i915/
e2e407dc093f drm/i915: Pretend cursor is always on for ILK-style WM calculations (v2)

$ git log --oneline --grep='Pretend cursor is always on' v4.6-rc1 drivers/gpu/drm/i915/
e2e407dc093f drm/i915: Pretend cursor is always on for ILK-style WM calculations (v2)
b2435692dbb7 drm/i915: Pretend cursor is always on for ILK-style WM calculations (v2)

The two patches there are almost, but not quite, the same thing, which makes me wonder how they both ended up in Linus' tree without an obvious merge conflict.

I have no idea what caused this. However, I think (on very little inspection, but it's consistent with problems I have with 4.5 on my laptop) that the first one is an *incorrect* fix for a regression in 4.5 and the second is a correct fix for the same regression. 4.6-rc1 seems okay.

I reported the regression and everyone involved has known about it for weeks. Nonetheless, 4.5 final is busted.

Can you please:

a) figure out what happened and send a backport of whatever needs to be backported to stable at vger.kernel.org.

b) do whatever needs to be done so this doesn't happen again

c) teach the i915 CI system to test Linus' tree as-is in addition to the development trees. Linus' tree and the versions of i915 in actual released versions of Linux are supposed to work.

I hate to nag, but this is at least the third time I've noticed weird release management issues in i915. I tripped on a regression a few releases ago that was known to the i915 team and fixed but the fix wasn't actually queued up for the current release. Before that, I tripped on a regression caused by an intentional behavior change that was folded in to a merge commit, making it essentially impossible to bisect and making it pointlessly hard to understand what was going on even once I found the offending code.

Thanks,
Andy
[Intel-gfx] Possible 4.5 i915 Skylake regression
On Wed, Feb 17, 2016 at 8:18 AM, Daniel Vetter wrote: > On Tue, Feb 16, 2016 at 09:26:35AM -0800, Andy Lutomirski wrote: >> On Tue, Feb 16, 2016 at 9:12 AM, Andy Lutomirski >> wrote: >> > On Tue, Feb 16, 2016 at 8:12 AM, Daniel Vetter wrote: >> >> On Mon, Feb 15, 2016 at 06:58:33AM -0800, Andy Lutomirski wrote: >> >>> On Sun, Feb 14, 2016 at 6:59 PM, Andy Lutomirski >> >>> wrote: >> >>> > Hi- >> >>> > >> >>> > On 4.5-rc3 on a Dell XPS 13 9350 (Skylake i915, no nvidia on this >> >>> > model), shortly after resume, I saw a single black flash on the >> >>> > screen. The log said: >> >>> > >> >>> > [Feb13 07:05] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* >> >>> > CPU pipe A FIFO underrun >> >>> > >> >>> > I haven't seen this on 4.4. >> >>> > >> >>> > I'd be happy to dig up debugging info, but I don't know what would be >> >>> > useful. I have no i915 module options set. >> >>> >> >>> It's flashing quite frequently now, although I seem to get the >> >>> underrun warning only once per resume. >> >> >> >> We shut up the warning irq source to avoid hijacking an entire cpu core >> >> ;-) >> >> >> >> There's a fix from Matt right after 4.5-rc4 in Linus' branch. I'm hoping >> >> that should help. >> > >> > Do you mean: >> > >> > commit e2e407dc093f530b771ee8bf8fe1be41e3cea8b3 >> > Author: Matt Roper >> > Date: Mon Feb 8 11:05:28 2016 -0800 >> > >> > drm/i915: Pretend cursor is always on for ILK-style WM calculations >> > (v2) >> > >> > If so, it didn't help. I'm currently doing a full rebuild just in >> > case I messed something up, though. >> > >> >> Definitely not fixed. It seems to be okay after a reboot until the >> first suspend/resume. >> >> This happened after resuming. Five cents says it's the root cause. > > That's interesting, but doesn't ring a bell unfortunately. Can you try to > attempt a bisect? > I'm giving up on my attempt to bisect for now. 
After a bunch of false starts to avoid this crap, I'm stuck at 651174a4a0ccaf41e14fadc4bc525d61ae7f7b18, which is based on 4.3-rc3 and doesn't merge cleanly up to 4.4. It's also annoying because it reproduces reasonably quickly but not instantaneously, and I can never reproduce it before a suspend/resume, so my bisection attempts are full of errors. --Andy > Thanks, Daniel > >> >> [ 160.361200] WARNING: CPU: 2 PID: 2512 at >> drivers/gpu/drm/i915/intel_uncore.c:599 >> hsw_unclaimed_reg_debug+0x69/0x90 [i915]() >> [ 160.361209] Unclaimed register detected before writing to register 0x20a8 >> [ 160.361213] Modules linked in: rfcomm fuse ccm cmac xt_CHECKSUM >> ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns >> nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 >> xt_conntrack ebtable_filter ebtable_nat ebtable_broute bridge stp llc >> ebtables ip6table_raw ip6table_mangle ip6table_security ip6table_nat >> nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter >> ip6_tables iptable_raw iptable_mangle iptable_security iptable_nat >> nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bnep >> arc4 iwlmvm mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek >> hid_multitouch snd_hda_codec_generic iwlwifi snd_hda_intel intel_rapl >> snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hwdep >> cfg80211 snd_hda_core kvm snd_seq uvcvideo snd_seq_device >> i2c_designware_platform >> [ 160.361385] i2c_designware_core btusb snd_pcm videobuf2_vmalloc >> wmi_mof vfat dell_wmi fat videobuf2_memops btrtl btbcm btintel >> bluetooth dell_laptop dell_smbios dcdbas videobuf2_v4l2 snd_timer >> videobuf2_core rtsx_pci_ms snd irqbypass videodev memstick >> ghash_clmulni_intel joydev mei_me efi_pstore mei i2c_i801 soundcore >> efivars pcspkr idma64 shpchp virt_dma media rfkill intel_lpss_pci >> processor_thermal_device intel_soc_dts_iosf wmi acpi_als kfifo_buf >> int3403_thermal tpm_tis industrialio pinctrl_sunrisepoint tpm >> intel_hid 
int3400_thermal pinctrl_intel intel_lpss_acpi sparse_keymap >> int340x_thermal_zone acpi_thermal_rel intel_lpss nfsd acpi_pad >> auth_rpcgss nfs_acl lockd binfmt_misc grace sunrpc dm_crypt i915 >> i2c_algo_bit drm_kms_helper syscopyare
[Intel-gfx] Possible 4.5 i915 Skylake regression
On Mon, Feb 22, 2016 at 7:13 PM, Andy Lutomirski wrote: > On Wed, Feb 17, 2016 at 5:36 PM, Andy Lutomirski > wrote: >> On Wed, Feb 17, 2016 at 8:18 AM, Daniel Vetter wrote: >>> On Tue, Feb 16, 2016 at 09:26:35AM -0800, Andy Lutomirski wrote: >>>> On Tue, Feb 16, 2016 at 9:12 AM, Andy Lutomirski >>>> wrote: >>>> > On Tue, Feb 16, 2016 at 8:12 AM, Daniel Vetter >>>> > wrote: >>>> >> On Mon, Feb 15, 2016 at 06:58:33AM -0800, Andy Lutomirski wrote: >>>> >>> On Sun, Feb 14, 2016 at 6:59 PM, Andy Lutomirski >>>> >>> wrote: >>>> >>> > Hi- >>>> >>> > >>>> >>> > On 4.5-rc3 on a Dell XPS 13 9350 (Skylake i915, no nvidia on this >>>> >>> > model), shortly after resume, I saw a single black flash on the >>>> >>> > screen. The log said: >>>> >>> > >>>> >>> > [Feb13 07:05] [drm:intel_cpu_fifo_underrun_irq_handler [i915]] >>>> >>> > *ERROR* >>>> >>> > CPU pipe A FIFO underrun >>>> >>> > >>>> >>> > I haven't seen this on 4.4. >>>> >>> > >>>> >>> > I'd be happy to dig up debugging info, but I don't know what would be >>>> >>> > useful. I have no i915 module options set. >>>> >>> >>>> >>> It's flashing quite frequently now, although I seem to get the >>>> >>> underrun warning only once per resume. >>>> >> >>>> >> We shut up the warning irq source to avoid hijacking an entire cpu core >>>> >> ;-) >>>> >> >>>> >> There's a fix from Matt right after 4.5-rc4 in Linus' branch. I'm hoping >>>> >> that should help. >>>> > >>>> > Do you mean: >>>> > >>>> > commit e2e407dc093f530b771ee8bf8fe1be41e3cea8b3 >>>> > Author: Matt Roper >>>> > Date: Mon Feb 8 11:05:28 2016 -0800 >>>> > >>>> > drm/i915: Pretend cursor is always on for ILK-style WM calculations >>>> > (v2) >>>> > >>>> > If so, it didn't help. I'm currently doing a full rebuild just in >>>> > case I messed something up, though. >>>> > >>>> >>>> Definitely not fixed. It seems to be okay after a reboot until the >>>> first suspend/resume. >>>> >>>> This happened after resuming. Five cents says it's the root cause. 
>>> >>> That's interesting, but doesn't ring a bell unfortunately. Can you try to >>> attempt a bisect? >> >> I probably can, but it's very slow. Is there a reasonably >> straightforward way to instrument the watermark computation to see >> what's going wrong? I'm reasonably confident that the bug is in the >> resume code or in something that only happens on resume, since I still >> haven't seen underruns after rebooting before suspending. >> > > With some instrumentation applied, I got this: > > [ 369.471064] skl_update_wm(crtc-0): computed update > [ 369.471072] skl_update_other_pipe_wm(crtc-0): no change > [ 369.471075] skl_write_wm_values... > [ 369.471078] CRTC crtc-0 pipe A > [ 369.471083] wm_linetime = 121 > [ 369.471086] plane_wm level 0 plane 0 = 2147500036 > [ 369.471090] plane_wm level 0 plane 1 = 0 > [ 369.471094] plane_wm level 0 cursor = 2147500036 > [ 369.471097] plane_wm level 1 plane 0 = 2147516439 > [ 369.471101] plane_wm level 1 plane 1 = 0 > [ 369.471104] plane_wm level 1 cursor = 2147516439 > [ 369.471108] plane_wm level 2 plane 0 = 2147516448 > [ 369.47] plane_wm level 2 plane 1 = 0 > [ 369.471115] plane_wm level 2 cursor = 0 > [ 369.471118] plane_wm level 3 plane 0 = 2147532837 > [ 369.471121] plane_wm level 3 plane 1 = 0 > [ 369.471125] plane_wm level 3 cursor = 0 > [ 369.471128] plane_wm level 4 plane 0 = 2147565639 > [ 369.471131] plane_wm level 4 plane 1 = 0 > [ 369.471135] plane_wm level 4 cursor = 0 > [ 369.471138] plane_wm level 5 plane 0 = 2147582038 > [ 369.471141] plane_wm level 5 plane 1 = 0 > [ 369.471145] plane_wm level 5 cursor = 0 > [ 369.471148] plane_wm level 6 plane 0 = 2147582044 > [ 369.471151] plane_wm level 6 plane 1 = 0 > [ 369.471155] plane_wm level 6 cursor = 0 > [ 369.471158] plane_wm level 7 plane 0 =
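The raw plane_wm values in the instrumentation output above are easier to compare once they are split into fields. What follows is only a sketch: it assumes the Skylake PLANE_WM layout is bit 31 = enable, bits 18:14 = lines, bits 9:0 = blocks, and the macro, struct, and function names are invented for illustration rather than taken from i915.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout of a Skylake plane watermark value.
 * These names are hypothetical, not the i915 macros. */
#define WM_ENABLE_BIT   (1u << 31)
#define WM_LINES_SHIFT  14
#define WM_LINES_MASK   0x1fu
#define WM_BLOCKS_MASK  0x3ffu

struct wm_fields {
    bool enabled;         /* watermark level is in use */
    unsigned int lines;   /* "lines" field */
    unsigned int blocks;  /* "blocks" field */
};

/* Split a raw value, as printed in the log, into its assumed fields. */
struct wm_fields decode_plane_wm(uint32_t raw)
{
    struct wm_fields f;

    f.enabled = (raw & WM_ENABLE_BIT) != 0;
    f.lines = (raw >> WM_LINES_SHIFT) & WM_LINES_MASK;
    f.blocks = raw & WM_BLOCKS_MASK;
    return f;
}
```

Under that assumption, the level 0 value 2147500036 would read as enabled with 1 line and 4 blocks, and the level 1 value 2147516439 as enabled with 2 lines and 23 blocks.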
[PATCH v1 00/12] PCI: Rework shadow ROM handling
On Fri, Mar 11, 2016 at 3:29 PM, Bjorn Helgaas wrote: > On Fri, Mar 11, 2016 at 01:16:09PM -0800, Andy Lutomirski wrote: >> On Tue, Mar 8, 2016 at 9:45 AM, Bjorn Helgaas wrote: >> > On Thu, Mar 03, 2016 at 10:53:50AM -0600, Bjorn Helgaas wrote: >> >> The purpose of this series is to: >> >> >> >> - Fix the "BAR 6: [??? 0x flags 0x2] has bogus alignment" >> >> messages reported by Linus [1], Andy [2], and others. >> >> >> >> - Move arch-specific shadow ROM location knowledge, e.g., >> >> 0xC-0xD, from PCI core to arch code. >> >> >> >> - Fix the ia64 and MIPS Loongson 3 oddity of keeping virtual >> >> addresses in shadow ROM struct resource (resources should always >> >> contain *physical* addresses). >> >> >> >> - Remove now-unused IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY >> >> flags. >> >> >> >> This series is based on v4.5-rc1, and it's available on my >> >> pci/resource git branch (along with a couple tiny unrelated patches) >> >> at [3]. >> >> >> >> Bjorn >> >> >> >> >> >> [1] >> >> http://lkml.kernel.org/r/CA+55aFyVMfTBB0oz_yx8+eQOEJnzGtCsYSj9QuhEpdZ9BHdq5A >> >> at mail.gmail.com >> >> [2] >> >> http://lkml.kernel.org/r/CALCETrV+RwNPzxyL8UVNsrAGu-6cCzD_Cc9PFJT2NCTJPLZZiw >> >> at mail.gmail.com >> >> [3] >> >> https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=pci/resource >> >> >> >> >> >> --- >> >> >> >> Bjorn Helgaas (12): >> >> PCI: Mark shadow copy of VGA ROM as IORESOURCE_PCI_FIXED >> >> PCI: Don't assign or reassign immutable resources >> >> PCI: Don't enable/disable ROM BAR if we're using a RAM shadow copy >> >> PCI: Set ROM shadow location in arch code, not in PCI core >> >> PCI: Clean up pci_map_rom() whitespace >> >> ia64/PCI: Use temporary struct resource * to avoid repetition >> >> ia64/PCI: Use ioremap() instead of open-coded equivalent >> >> ia64/PCI: Keep CPU physical (not virtual) addresses in shadow ROM >> >> resource >> >> MIPS: Loongson 3: Use temporary struct resource * to avoid >> >> repetition >> >> MIPS: 
Loongson 3: Keep CPU physical (not virtual) addresses in
>> >> shadow ROM resource
>> >> PCI: Remove unused IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY
>> >> PCI: Simplify sysfs ROM cleanup
>> >>
>> >>
>> >> arch/ia64/pci/fixup.c | 21 +++--
>> >> arch/ia64/sn/kernel/io_acpi_init.c | 22 ++
>> >> arch/ia64/sn/kernel/io_init.c | 51 --
>> >> arch/mips/pci/fixup-loongson3.c | 19 +---
>> >> arch/x86/pci/fixup.c | 21 +++--
>> >> drivers/pci/pci-sysfs.c | 13 +-
>> >> drivers/pci/remove.c | 1
>> >> drivers/pci/rom.c | 83 +++-
>> >> drivers/pci/setup-res.c | 6 +++
>> >> include/linux/ioport.h | 4 --
>> >> 10 files changed, 111 insertions(+), 130 deletions(-)
>> >
>> > I applied this series to pci/resource for v4.6.
>>
>> This gets rid of all the warnings for me until I try to read my i915
>> device's rom using sysfs. Then I get:
>>
>> i915 :00:02.0: Invalid PCI ROM header signature: expecting 0xaa55,
>> got 0x
>>
>> So I suspect that something is still subtly wrong -- I'd imagine that
>> this should either work or the initialization code should detect that
>> there is no usable ROM and not expose it.
>>
>> (To be clear, there's no regression here.)
>
> Hmmm. Thanks for testing this. As you say, I think this is the way
> it's always been, but it does seem non-intuitive.
>
> That "Invalid PCI ROM header signature" warning comes from
> pci_get_rom_size(). We don't call that at enumeration-time; we only
> call it later when somebody tries to read the ROM via sysfs:
>
> pci_bus_add_device
>   pci_fixup_device(pci_fixup_final)
>     pci_fixup_video # final fixup
>       res->flags = MEM | SHADOW | PCI_FIXED
>   pci_create_sys
[PATCH v1 00/12] PCI: Rework shadow ROM handling
On Tue, Mar 8, 2016 at 9:45 AM, Bjorn Helgaas wrote: > On Thu, Mar 03, 2016 at 10:53:50AM -0600, Bjorn Helgaas wrote: >> The purpose of this series is to: >> >> - Fix the "BAR 6: [??? 0x flags 0x2] has bogus alignment" >> messages reported by Linus [1], Andy [2], and others. >> >> - Move arch-specific shadow ROM location knowledge, e.g., >> 0xC-0xD, from PCI core to arch code. >> >> - Fix the ia64 and MIPS Loongson 3 oddity of keeping virtual >> addresses in shadow ROM struct resource (resources should always >> contain *physical* addresses). >> >> - Remove now-unused IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY >> flags. >> >> This series is based on v4.5-rc1, and it's available on my >> pci/resource git branch (along with a couple tiny unrelated patches) >> at [3]. >> >> Bjorn >> >> >> [1] >> http://lkml.kernel.org/r/CA+55aFyVMfTBB0oz_yx8+eQOEJnzGtCsYSj9QuhEpdZ9BHdq5A >> at mail.gmail.com >> [2] >> http://lkml.kernel.org/r/CALCETrV+RwNPzxyL8UVNsrAGu-6cCzD_Cc9PFJT2NCTJPLZZiw >> at mail.gmail.com >> [3] >> https://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=pci/resource >> >> >> --- >> >> Bjorn Helgaas (12): >> PCI: Mark shadow copy of VGA ROM as IORESOURCE_PCI_FIXED >> PCI: Don't assign or reassign immutable resources >> PCI: Don't enable/disable ROM BAR if we're using a RAM shadow copy >> PCI: Set ROM shadow location in arch code, not in PCI core >> PCI: Clean up pci_map_rom() whitespace >> ia64/PCI: Use temporary struct resource * to avoid repetition >> ia64/PCI: Use ioremap() instead of open-coded equivalent >> ia64/PCI: Keep CPU physical (not virtual) addresses in shadow ROM >> resource >> MIPS: Loongson 3: Use temporary struct resource * to avoid repetition >> MIPS: Loongson 3: Keep CPU physical (not virtual) addresses in shadow >> ROM resource >> PCI: Remove unused IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY >> PCI: Simplify sysfs ROM cleanup >> >> >> arch/ia64/pci/fixup.c | 21 +++-- >> arch/ia64/sn/kernel/io_acpi_init.c | 22 ++ >> 
arch/ia64/sn/kernel/io_init.c | 51 --
>> arch/mips/pci/fixup-loongson3.c | 19 +---
>> arch/x86/pci/fixup.c | 21 +++--
>> drivers/pci/pci-sysfs.c | 13 +-
>> drivers/pci/remove.c | 1
>> drivers/pci/rom.c | 83 +++-
>> drivers/pci/setup-res.c | 6 +++
>> include/linux/ioport.h | 4 --
>> 10 files changed, 111 insertions(+), 130 deletions(-)
>
> I applied this series to pci/resource for v4.6.

This gets rid of all the warnings for me until I try to read my i915 device's rom using sysfs. Then I get:

i915 :00:02.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0x

So I suspect that something is still subtly wrong -- I'd imagine that this should either work or the initialization code should detect that there is no usable ROM and not expose it.

(To be clear, there's no regression here.)
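For readers unfamiliar with the "expecting 0xaa55" message: a PCI expansion ROM image is expected to begin with the bytes 0x55 0xAA, which read as the little-endian 16-bit value 0xaa55. The sketch below shows that kind of check in isolation. The function name and the buffer-based interface are assumptions made for illustration; this is not the kernel's pci_get_rom_size().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Signature value named in the kernel message quoted above. */
#define ROM_SIGNATURE 0xaa55u

/* Hypothetical helper: returns true if the buffer starts with the
 * expansion ROM signature, read as a little-endian 16-bit value. */
bool rom_has_valid_signature(const uint8_t *image, size_t len)
{
    uint16_t sig;

    if (image == NULL || len < 2)
        return false;

    sig = (uint16_t)(image[0] | (image[1] << 8));
    return sig == ROM_SIGNATURE;
}
```

A check like this could, in principle, run before the rom attribute is exposed, which is roughly the "detect that there is no usable ROM and not expose it" option mentioned above.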
[Intel-gfx] i915 Skylake: "Invalid ROM contents"
On Sun, Jan 10, 2016 at 11:12 AM, Andy Lutomirski wrote: > On Sun, Jan 10, 2016 at 10:41 AM, Andy Lutomirski > wrote: >> On Wed, Nov 18, 2015 at 8:12 AM, Daniel Stone >> wrote: >>> Hi, >>> >>> On 18 November 2015 at 15:59, Andy Lutomirski >>> wrote: >>>> On Wed, Nov 18, 2015 at 2:59 AM, Ville Syrjälä >>>> wrote: >>>>> On Tue, Nov 17, 2015 at 11:43:25AM -0800, Andy Lutomirski wrote: >>>>>> Typing: >>>>>> >>>>>> # cat /sys/devices/pci:00/:00:02.0/rom >>>>>> >>>>>> Provokes: >>>>>> >>>>>> i915 :00:02.0: Invalid ROM contents >>>>> >>>>> Hmm. So there's no PCI option ROM there. I wonder what is there. I >>>>> get the same on my Braswell BTW. I tried to look through the UEFI >>>>> spec a bit, and it seems to say that even for non-legacy option ROMs >>>>> the 0x55aa signature should be there. >>>>> >>>>> But this being the GPU means we may be using the shadow ROM stuff, >>>>> which IIRC assumes that the shadow is at 0xc000. I'm not sure that >>>>> holds anymore with UEFI, and maybe we should be using some UEFI >>>>> trick instead to find out where it actually lives? >>>>> >>>>> BTW what does 'lspci -vv -s 00:02.0' say on your machine? >>>>> >>>> >>>> 00:02.0 VGA compatible controller: Intel Corporation Sky Lake >>>> Integrated Graphics (rev 07) (prog-if 00 [VGA controller]) >>>> DeviceName: Onboard IGD >>>> Subsystem: Dell Device 0704 >>>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- >>>> ParErr- Stepping- SERR- FastB2B- DisINTx+ >>>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >>>> SERR- >>> Latency: 0 >>>> Interrupt: pin A routed to IRQ 128 >>>> Region 0: Memory at db00 (64-bit, non-prefetchable) [size=16M] >>>> Region 2: Memory at 9000 (64-bit, prefetchable) [size=256M] >>>> Region 4: I/O ports at f000 [size=64] >>>> Expansion ROM at [disabled] >>> >>> UEFI has an option to enable option ROMs, which is disabled by >>> default; I wonder if having it disabled prevents all access to the >>> ROM. 
>>>
>>> Mind you, it doesn't seem to be fatal; I've not had any issues with
>>> the same machine that I can pin down to lack of ROM.
>>>
>>
>> FWIW, my logs also get spammed with:
>>
>> [ 127.101881] i915 :00:02.0: BAR 6: [??? 0x flags 0x2]
>> has bogus alignment
>>
>> I suspect that the PCI core is just failing to recognize that the ROM
>> is disabled.
>>
>
> A bit more info:
>
> I think I only get this error when suspending for the second time
> after boot. No clue why.
>
> I instrumented the code a bit. At the time of that error, res->flags
> == 0x2. It's probably not a coincidence that:
>
> #define IORESOURCE_ROM_SHADOW (1<<1) /* ROM is copy at C000:0 */
>
> Should pci_fixup_video check that the resource exists in the first
> place before setting flags on it?

*ping*

Hi, PCI people.

--Andy
[PATCH] x86: Add an explicit barrier() to clflushopt()
On Tue, Jan 12, 2016 at 6:06 PM, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson wrote:
>>
>> The double clflush() remains a mystery.
>
> Actually, I think it's explainable.
>
> It's wrong to do the clflush *after* the GPU has done the write, which
> seems to be what you are doing.
>
> Why?
>
> If the GPU really isn't cache coherent, what can happen is:
>
> - the CPU has the line cached
>
> - the GPU writes the data
>
> - you do the clflushopt to invalidate the cacheline
>
> - you expect to see the GPU data.
>
> Right?
>
> Wrong. The above is complete crap.
>
> Why?
>
> Very simple reason: the CPU may have had the cacheline dirty at some
> level in its caches, so when you did the clflushopt, it didn't just
> invalidate the CPU cacheline, it wrote it back to memory. And in the
> process over-wrote the data that the GPU had written.
>
> Now you can say "but the CPU never wrote to the cacheline, so it's not
> dirty in the CPU caches". That may or may not be true. The CPU may
> have written to it quite a long time ago.
>
> So if you are doing a GPU write, and you want to see the data that the
> GPU wrote, you had better do the clflushopt long *before* the GPU ever
> writes to memory.
>
> Your pattern of doing "flush and read" is simply fundamentally buggy.
> There are only two valid CPU flushing patterns:
>
> - write and flush (to make the writes visible to the GPU)
>
> - flush before starting GPU accesses, and then read
>
> At no point can "flush and read" be right.
>
> Now, I haven't actually seen your code, so I'm just going by your
> high-level description of where the CPU flush and CPU read were done,
> but it *sounds* like you did that invalid "flush and read" behavior.

Since barriers are on my mind: how strong a barrier is needed to prevent cache fills from being speculated across the barrier?

Concretely, if I do:

    clflush A
    clflush B
    load A
    load B

the architecture guarantees that (unless store forwarding happens) the value I see for B is at least as new as the value I see for A *with respect to other accesses within the coherency domain*. But the GPU isn't in the coherency domain at all.

Is:

    clflush A
    clflush B
    load A
    MFENCE
    load B

good enough? If it is, and if

    clflush A
    clflush B
    load A
    LOCK whatever
    load B

is *not*, then this might account for the performance difference.

In any event, it seems to me that what i915 is trying to do isn't really intended to be supported for WB memory. i915 really wants to force a read from main memory and to simultaneously prevent the CPU from writing back to main memory. Ick. I'd assume that:

    clflush A
    clflush B
    load A
    serializing instruction here
    load B

is good enough, as long as you make sure that the GPU does its writes after the clflushes make it all the way out to main memory (which might require a second serializing instruction in the case of clflushopt), but this still relies on the hardware prefetcher not prefetching B too early, which it's permitted to do even in the absence of any explicit access at all.

Presumably this is good enough on any implementation:

    clflush A
    clflush B
    load A
    clflush B
    load B

But that will be really, really slow. And you're still screwed if the hardware is permitted to arbitrarily change cache lines from S to M.

In other words, I'm not really convinced that x86 was ever intended to have well-defined behavior if something outside the coherency domain writes to a page of memory while that page is mapped WB. Of course, I'm also not sure how to reliably switch a page from WB to any other memory type short of remapping it and doing CLFLUSH after remapping.

SDM Volume 3 11.12.4 seems to agree with me.

Could the driver be changed to use WC or UC and to use MOVNTDQA on supported CPUs to get the performance back? It sounds like i915 is effectively doing PIO here, and reasonably modern CPUs have a nice set of fast PIO instructions.

--Andy
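The hazard described in the quoted text can be illustrated with a toy model of a single cache line. This is only a sketch under strong simplifying assumptions: one line, no speculation or prefetching (which the message above points out real hardware is allowed to do), and a writer outside the coherency domain that touches memory directly. All names are hypothetical.

```c
#include <stdbool.h>

/* Toy model of one cache line; hypothetical names, not kernel code. */
struct toy_line {
    bool cached;   /* CPU currently holds a copy */
    bool dirty;    /* CPU copy is newer than memory */
    int cpu_copy;  /* value in the CPU cache */
    int memory;    /* value in main memory */
};

void cpu_write(struct toy_line *l, int v)
{
    l->cached = true;
    l->dirty = true;
    l->cpu_copy = v;
}

/* Non-coherent device write: goes to memory, cache is not snooped. */
void gpu_write(struct toy_line *l, int v)
{
    l->memory = v;
}

/* Flush: write back if dirty, then invalidate. */
void toy_clflush(struct toy_line *l)
{
    if (l->cached && l->dirty)
        l->memory = l->cpu_copy;  /* this clobbers a device write */
    l->cached = false;
    l->dirty = false;
}

int cpu_read(struct toy_line *l)
{
    if (!l->cached) {             /* miss: fill from memory */
        l->cpu_copy = l->memory;
        l->cached = true;
    }
    return l->cpu_copy;
}

/* "Flush and read": the flush happens after the device write. */
int flush_after_gpu_write(void)
{
    struct toy_line l = {0};

    cpu_write(&l, 1);   /* stale dirty data from long ago */
    gpu_write(&l, 2);
    toy_clflush(&l);    /* writes the stale 1 over the device's 2 */
    return cpu_read(&l);
}

/* Flush before the device starts writing, then read. */
int flush_before_gpu_write(void)
{
    struct toy_line l = {0};

    cpu_write(&l, 1);
    toy_clflush(&l);
    gpu_write(&l, 2);
    return cpu_read(&l);
}
```

In this model flush_after_gpu_write() returns the stale CPU value and flush_before_gpu_write() returns the device's value. The model deliberately omits the case raised above, where a prefetch pulls the line back into the cache between the flush and the device write.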
[Intel-gfx] i915 Skylake: "Invalid ROM contents"
On Sun, Jan 10, 2016 at 10:41 AM, Andy Lutomirski wrote: > On Wed, Nov 18, 2015 at 8:12 AM, Daniel Stone wrote: >> Hi, >> >> On 18 November 2015 at 15:59, Andy Lutomirski wrote: >>> On Wed, Nov 18, 2015 at 2:59 AM, Ville Syrjälä >>> wrote: >>>> On Tue, Nov 17, 2015 at 11:43:25AM -0800, Andy Lutomirski wrote: >>>>> Typing: >>>>> >>>>> # cat /sys/devices/pci:00/:00:02.0/rom >>>>> >>>>> Provokes: >>>>> >>>>> i915 :00:02.0: Invalid ROM contents >>>> >>>> Hmm. So there's no PCI option ROM there. I wonder what is there. I >>>> get the same on my Braswell BTW. I tried to look through the UEFI >>>> spec a bit, and it seems to say that even for non-legacy option ROMs >>>> the 0x55aa signature should be there. >>>> >>>> But this being the GPU means we may be using the shadow ROM stuff, >>>> which IIRC assumes that the shadow is at 0xc000. I'm not sure that >>>> holds anymore with UEFI, and maybe we should be using some UEFI >>>> trick instead to find out where it actually lives? >>>> >>>> BTW what does 'lspci -vv -s 00:02.0' say on your machine? >>>> >>> >>> 00:02.0 VGA compatible controller: Intel Corporation Sky Lake >>> Integrated Graphics (rev 07) (prog-if 00 [VGA controller]) >>> DeviceName: Onboard IGD >>> Subsystem: Dell Device 0704 >>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- >>> ParErr- Stepping- SERR- FastB2B- DisINTx+ >>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >>> SERR- >> Latency: 0 >>> Interrupt: pin A routed to IRQ 128 >>> Region 0: Memory at db00 (64-bit, non-prefetchable) [size=16M] >>> Region 2: Memory at 9000 (64-bit, prefetchable) [size=256M] >>> Region 4: I/O ports at f000 [size=64] >>> Expansion ROM at [disabled] >> >> UEFI has an option to enable option ROMs, which is disabled by >> default; I wonder if having it disabled prevents all access to the >> ROM. >> >> Mind you, it doesn't seem to be fatal; I've not had any issues with >> the same machine that I can pin down to lack of ROM. 
>>
>
> FWIW, my logs also get spammed with:
>
> [ 127.101881] i915 :00:02.0: BAR 6: [??? 0x flags 0x2]
> has bogus alignment
>
> I suspect that the PCI core is just failing to recognize that the ROM
> is disabled.
>

A bit more info:

I think I only get this error when suspending for the second time after boot. No clue why.

I instrumented the code a bit. At the time of that error, res->flags == 0x2. It's probably not a coincidence that:

#define IORESOURCE_ROM_SHADOW (1<<1) /* ROM is copy at C000:0 */

Should pci_fixup_video check that the resource exists in the first place before setting flags on it?

--Andy
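To make the res->flags == 0x2 observation concrete: a value of exactly 0x2 has the shadow bit set but no resource type bits at all, which is consistent with the flag having been set on a resource that was never populated. The sketch below tests for that condition. The type mask 0x00001f00 is my recollection of IORESOURCE_TYPE_BITS and should be treated as an assumption, and the macro and function names are hypothetical.

```c
#include <stdbool.h>

#define ROM_SHADOW_FLAG (1ul << 1)    /* value from the #define quoted above */
#define TYPE_BITS_MASK  0x00001f00ul  /* assumed: IORESOURCE_TYPE_BITS */

/* Hypothetical check: true when the shadow flag is set on a resource
 * that has no type bits, i.e. one that does not really exist. */
bool is_shadow_without_type(unsigned long flags)
{
    return (flags & ROM_SHADOW_FLAG) != 0 &&
           (flags & TYPE_BITS_MASK) == 0;
}
```

With these assumptions, 0x2 is flagged while 0x202 (shadow plus a memory type bit) is not.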
[Intel-gfx] i915 Skylake: "Invalid ROM contents"
On Wed, Nov 18, 2015 at 8:12 AM, Daniel Stone wrote: > Hi, > > On 18 November 2015 at 15:59, Andy Lutomirski wrote: >> On Wed, Nov 18, 2015 at 2:59 AM, Ville Syrjälä >> wrote: >>> On Tue, Nov 17, 2015 at 11:43:25AM -0800, Andy Lutomirski wrote: >>>> Typing: >>>> >>>> # cat /sys/devices/pci:00/:00:02.0/rom >>>> >>>> Provokes: >>>> >>>> i915 :00:02.0: Invalid ROM contents >>> >>> Hmm. So there's no PCI option ROM there. I wonder what is there. I >>> get the same on my Braswell BTW. I tried to look through the UEFI >>> spec a bit, and it seems to say that even for non-legacy option ROMs >>> the 0x55aa signature should be there. >>> >>> But this being the GPU means we may be using the shadow ROM stuff, >>> which IIRC assumes that the shadow is at 0xc000. I'm not sure that >>> holds anymore with UEFI, and maybe we should be using some UEFI >>> trick instead to find out where it actually lives? >>> >>> BTW what does 'lspci -vv -s 00:02.0' say on your machine? >>> >> >> 00:02.0 VGA compatible controller: Intel Corporation Sky Lake >> Integrated Graphics (rev 07) (prog-if 00 [VGA controller]) >> DeviceName: Onboard IGD >> Subsystem: Dell Device 0704 >> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- >> ParErr- Stepping- SERR- FastB2B- DisINTx+ >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> SERR- > Latency: 0 >> Interrupt: pin A routed to IRQ 128 >> Region 0: Memory at db00 (64-bit, non-prefetchable) [size=16M] >> Region 2: Memory at 9000 (64-bit, prefetchable) [size=256M] >> Region 4: I/O ports at f000 [size=64] >> Expansion ROM at [disabled] > > UEFI has an option to enable option ROMs, which is disabled by > default; I wonder if having it disabled prevents all access to the > ROM. > > Mind you, it doesn't seem to be fatal; I've not had any issues with > the same machine that I can pin down to lack of ROM. > FWIW, my logs also get spammed with: [ 127.101881] i915 :00:02.0: BAR 6: [??? 
0x flags 0x2] has bogus alignment

I suspect that the PCI core is just failing to recognize that the ROM is disabled.

--Andy
[PATCH] x86: Add an explicit barrier() to clflushopt()
On Sat, Jan 9, 2016 at 12:01 AM, Chris Wilson wrote:
> On Thu, Jan 07, 2016 at 02:32:23PM -0800, H. Peter Anvin wrote:
>> On 01/07/16 14:29, H. Peter Anvin wrote:
>> >
>> > I would be very interested in knowing if replacing the final clflushopt
>> > with a clflush would resolve your problems (in which case the last mb()
>> > shouldn't be necessary either.)
>> >
>>
>> Never mind. CLFLUSH is not ordered with regard to CLFLUSHOPT to the
>> same cache line.
>>
>> Could you add a sync_cpu(); call to the end (can replace the final mb())
>> and see if that helps your case?
>
> s/sync_cpu()/sync_core()/
>
> No. I still see failures on Baytrail and Braswell (Pineview is not
> affected) with the final mb() replaced with sync_core(). I can reproduce
> failures on Pineview by tweaking the clflush_cache_range() parameters,
> so I am fairly confident that it is validating the current code.
>
> iirc sync_core() is cpuid, a heavy serialising instruction, an
> alternative to mfence. Is there anything else that I can infer about
> the nature of my bug from this result?

No clue, but I don't know much about the underlying architecture.

Can you try running clflush_cache_range() over one cacheline less and then manually doing clflushopt; mb on the last cache line, just to make sure that the helper is really doing the right thing? You could also try clflush instead of clflushopt to see if that makes a difference.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
[PATCH] x86: Add an explicit barrier() to clflushopt()
On Thu, Jan 7, 2016 at 2:16 AM, Chris Wilson wrote:
> On Mon, Oct 19, 2015 at 10:58:55AM +0100, Chris Wilson wrote:
>> During testing we observed that the last cacheline was not being flushed
>> from a
>>
>>     mb()
>>     for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
>>         clflushopt();
>>     mb()
>>
>> loop (where the initial addr and end were not cacheline aligned).
>>
>> Changing the loop from addr < end to addr <= end, or replacing the
>> clflushopt() with clflush() both fixed the testcase. Hinting that GCC
>> was miscompiling the assembly within the loop and specifically the
>> alternative within clflushopt() was confusing the loop optimizer.
>>
>> Adding a barrier() into clflushopt() is enough for GCC to dtrt, but
>> solving why GCC is not seeing the constraints from the alternative_io()
>> would be smarter...
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=92501
>> Testcase: gem_tiled_partial_pwrite_pread/read
>> Signed-off-by: Chris Wilson
>> Cc: Ross Zwisler
>> Cc: H. Peter Anvin
>> Cc: Imre Deak
>> Cc: Daniel Vetter
>> Cc: dri-devel at lists.freedesktop.org
>> ---
>> arch/x86/include/asm/special_insns.h | 5 +
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/special_insns.h
>> b/arch/x86/include/asm/special_insns.h
>> index 2270e41b32fd..0c7aedbf8930 100644
>> --- a/arch/x86/include/asm/special_insns.h
>> +++ b/arch/x86/include/asm/special_insns.h
>> @@ -199,6 +199,11 @@ static inline void clflushopt(volatile void *__p)
>>                 ".byte 0x66; clflush %P0",
>>                 X86_FEATURE_CLFLUSHOPT,
>>                 "+m" (*(volatile char __force *)__p));
>> +       /* GCC (4.9.1 and 5.2.1 at least) appears to be very confused when
>> +        * meeting this alternative() and demonstrably miscompiles loops
>> +        * iterating over clflushopts.
>> +        */
>> +       barrier();
>> }
>
> Or an alternative:
>
> +#define alternative_output(oldinstr, newinstr, feature, output)\
> +       asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) \
> +               : output : "i" (0) : "memory")
>
> I would really appreciate some knowledgeable folks taking a look at the
> asm for clflushopt() as it still affects today's kernel and gcc.
>
> Fwiw, I have confirmed that arch/x86/mm/pageattr.c clflush_cache_range()
> is similarly affected.

Unless I'm mis-reading the asm, clflush_cache_range() is compiled correctly for me. (I don't know what the %P is for in the asm, but that shouldn't matter.) The ALTERNATIVE shouldn't even be visible to the optimizer.

Can you attach a bad .s file and let us know what gcc version this is? (You can usually do 'make foo/bar/baz.s' to get a .s file.) I'd also be curious whether changing clflushopt to clwb works around the issue.

--Andy
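One way to sanity-check the source-level loop quoted above, independently of what the compiler or hardware does with it, is to model which cache-line addresses it visits for an unaligned range. The sketch below does that with plain integers. The function name and parameters are hypothetical, it assumes the line size is a power of two and the range is non-empty, and it says nothing about the generated assembly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the quoted loop:
 *   for (addr = addr & -clflush_size; addr < end; addr += clflush_size)
 * Returns true if the line containing the last byte of
 * [start, start + size) is among the addresses visited.
 * Assumes size > 0 and line is a power of two. */
bool loop_covers_last_byte(uintptr_t start, uintptr_t size, uintptr_t line)
{
    uintptr_t end = start + size;
    uintptr_t last_line = (end - 1) & -line;  /* line holding the last byte */
    bool covered = false;
    uintptr_t addr;

    for (addr = start & -line; addr < end; addr += line) {
        if (addr == last_line)
            covered = true;
    }
    return covered;
}
```

In this model the `addr < end` condition already reaches the final line for unaligned start and end values, which would be consistent with the loop being correct as written and the failure coming from somewhere else.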
i915 Skylake crash on 4.4-rc3
[53834.386369] traps: gnome-session-b[2308] general protection ip:7f10efc1fc2b sp:7ffdfde31880 error:0 in libc-2.22.so[7f10efba1000+1b7000] [53834.687584] [ cut here ] [53834.687607] WARNING: CPU: 0 PID: 23730 at drivers/gpu/drm/i915/i915_gem_context.c:144 i915_gem_context_free+0x196/0x1c0 [i915]() [53834.687609] WARN_ON(!list_empty(&ppgtt->base.active_list)) [53834.687610] Modules linked in: [53834.687612] wmi_mof dell_wmi wmi(E) rfcomm fuse ccm cmac xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_broute bridge stp llc ebtable_nat ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security arc4 bnep iwlmvm mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel iwlwifi snd_hda_codec intel_rapl x86_pkg_temp_thermal snd_hwdep coretemp snd_hda_core kvm_intel btusb snd_seq btrtl kvm uvcvideo btbcm cfg80211 btintel bluetooth snd_seq_device [53834.687656] videobuf2_vmalloc snd_pcm videobuf2_memops videobuf2_v4l2 hid_multitouch videobuf2_core sparse_keymap v4l2_common videodev i2c_designware_platform i2c_designware_core vfat snd_timer fat irqbypass dell_laptop snd dcdbas efi_pstore pcspkr joydev efivars media rtsx_pci_ms rfkill soundcore i2c_i801 memstick pinctrl_sunrisepoint pinctrl_intel intel_lpss_acpi int3400_thermal int3403_thermal acpi_thermal_rel acpi_pad mei_me tpm_tis mei tpm idma64 shpchp virt_dma acpi_als kfifo_buf processor_thermal_device nfsd industrialio intel_soc_dts_iosf iosf_mbi intel_lpss_pci int340x_thermal_zone intel_lpss auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm rtsx_pci_sdmmc mmc_core crct10dif_pclmul 
crc32_pclmul crc32c_intel [53834.687700] serio_raw rtsx_pci i2c_hid video [last unloaded: wmi] [53834.687706] CPU: 0 PID: 23730 Comm: kworker/u8:4 Tainted: G W E 4.4.0-rc3+ #19 [53834.687708] Hardware name: Dell Inc. XPS 13 9350/07TYC2, BIOS 1.0.4 10/19/2015 [53834.687726] Workqueue: i915 i915_gem_retire_work_handler [i915] [53834.687728] c62a1d15 88017ffb7c70 8142510c [53834.687731] 88017ffb7cb8 88017ffb7ca8 81092122 8802adfd1240 [53834.687734] 8802845dd800 88028468 88017ffb7d70 8802adfd12b8 [53834.687738] Call Trace: [53834.687743] [] dump_stack+0x4e/0x82 [53834.687747] [] warn_slowpath_common+0x82/0xc0 [53834.687750] [] warn_slowpath_fmt+0x5c/0x80 [53834.687764] [] i915_gem_context_free+0x196/0x1c0 [i915] [53834.68] [] i915_gem_request_free+0x9f/0xb0 [i915] [53834.687792] [] intel_execlists_retire_requests+0x138/0x190 [i915] [53834.687806] [] i915_gem_retire_requests+0xd1/0xe0 [i915] [53834.687827] [] i915_gem_retire_work_handler+0x58/0x70 [i915] [53834.687831] [] process_one_work+0x152/0x400 [53834.687834] [] worker_thread+0x4b/0x440 [53834.687837] [] ? process_one_work+0x400/0x400 [53834.687839] [] ? process_one_work+0x400/0x400 [53834.687842] [] kthread+0xd8/0xf0 [53834.687845] [] ? kthread_worker_fn+0x150/0x150 [53834.687849] [] ret_from_fork+0x3f/0x70 [53834.687851] [] ? kthread_worker_fn+0x150/0x150 [53834.690419] ---[ end trace 1802637761c0942d ]--- I think this happened when I started emacs. My user session crashed, but the system is still usable after logging back in. --Andy
[Intel-gfx] i915 Skylake: "Invalid ROM contents"
[adding linux-pci] On Wed, Nov 18, 2015 at 2:59 AM, Ville Syrjälä wrote: > On Tue, Nov 17, 2015 at 11:43:25AM -0800, Andy Lutomirski wrote: >> Typing: >> >> # cat /sys/devices/pci:00/:00:02.0/rom >> >> Provokes: >> >> i915 :00:02.0: Invalid ROM contents > > Hmm. So there's no PCI option ROM there. I wonder what is there. I > get the same on my Braswell BTW. I tried to look through the UEFI > spec a bit, and it seems to say that even for non-legacy option ROMs > the 0x55aa signature should be there. > > But this being the GPU means we may be using the shadow ROM stuff, > which IIRC assumes that the shadow is at 0xc000. I'm not sure that > holds anymore with UEFI, and maybe we should be using some UEFI > trick instead to find out where it actually lives? > > BTW what does 'lspci -vv -s 00:02.0' say on your machine? > 00:02.0 VGA compatible controller: Intel Corporation Sky Lake Integrated Graphics (rev 07) (prog-if 00 [VGA controller]) DeviceName: Onboard IGD Subsystem: Dell Device 0704 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- [disabled] Capabilities: [40] Vendor Specific Information: Len=0c Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00 DevCap:MaxPayload 128 bytes, PhantFunc 0 ExtTag- RBE+ DevCtl:Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta:CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit- Address: fee00018 Data: Capabilities: [d0] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 
PME- Capabilities: [100 v1] #1b Capabilities: [200 v1] Address Translation Service (ATS) ATSCap:Invalidate Queue Depth: 00 ATSCtl:Enable-, Smallest Translation Unit: 00 Capabilities: [300 v1] #13 Kernel driver in use: i915 Kernel modules: i915 --Andy >> >> This is on a Dell XPS 13 9350 (Skylake). This is 4.3.0 plus some >> wireless-next bits. >> >> --Andy >> >> -- >> Andy Lutomirski >> AMA Capital Management, LLC >> ___ >> Intel-gfx mailing list >> Intel-gfx at lists.freedesktop.org >> http://lists.freedesktop.org/mailman/listinfo/intel-gfx > > -- > Ville Syrjälä > Intel OTC -- Andy Lutomirski AMA Capital Management, LLC
i915 Skylake: "Invalid ROM contents"
Typing:

# cat /sys/devices/pci:00/:00:02.0/rom

Provokes:

i915 :00:02.0: Invalid ROM contents

This is on a Dell XPS 13 9350 (Skylake). This is 4.3.0 plus some wireless-next bits.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
X using radeon is refusing to start
I just started getting X failures that say: [ 739.208] (EE) RADEON(0): [drm] failed to set drm interface version. I'm not sure what triggered it. dmesg says: [ 740.156499] [drm:drm_stub_open] [ 740.156502] [drm:drm_open_helper] pid = 2170, minor = 0 [ 740.156541] [drm:drm_ioctl] pid=2170, dev=0xe200, auth=1, DRM_IOCTL_MODE_GETRESOURCES [ 740.156557] [drm:drm_ioctl] pid=2170, dev=0xe200, auth=1, DRM_IOCTL_MODE_GETRESOURCES [ 740.156575] [drm:drm_release] open_count = 2 [ 740.156577] [drm:drm_release] pid = 2170, device = 0xe200, open_count = 2 [ 740.158548] [drm:drm_ioctl] pid=2170, dev=0xe200, auth=1, DRM_IOCTL_SET_VERSION [ 740.158549] [drm:drm_ioctl] ret = -13 [ 740.159612] [drm:drm_framebuffer_reference] 8804457110a0: FB ID: 83 (3) -13 means -EACCES. The X logs say: [ 739.042] X.Org X Server 1.16.3 Release Date: 2014-12-20 [ 739.055] X Protocol Version 11, Revision 0 [ 739.059] Build Operating System: 3.17.8-300.bz1178975.fc21.x86_64 [ 739.063] Current Operating System: Linux amaluto.corp.amacapital.net 3.18.3-201.fc21.x86_64 #1 SMP Mon Jan 19 15:59:31 UTC 2015 x86_64 [ 739.063] Kernel command line: rd.md=0 rd.dm=0 rd.lvm.lv=vg_amaluto_2014/root KEYTABLE=us rd.luks.uuid=luks-1dd64d38-40c0-4e20-ad67-aa2590991023 SYSFONT=True ro root=/dev/mapper/vg_amaluto_2014-root LANG=en_US.UTF-8 rhgb quiet [ 739.078] Build Date: 31 January 2015 11:23:27PM [ 739.082] Build ID: xorg-x11-server 1.16.3-2.fc21 [ 739.087] Current version of pixman: 0.32.6 [ 739.096] Before reporting problems, check http://wiki.x.org to make sure that you have the latest version. [ 739.097] Markers: (--) probed, (**) from config file, (==) default setting, (++) from command line, (!!) notice, (II) informational, (WW) warning, (EE) error, (NI) not implemented, (??) unknown. 
[ 739.116] (==) Log file: "/var/log/Xorg.0.log", Time: Thu Feb 5 15:54:44 2015 [ 739.122] (==) Using config file: "/etc/X11/xorg.conf" [ 739.127] (==) Using config directory: "/etc/X11/xorg.conf.d" [ 739.131] (==) Using system config directory "/usr/share/X11/xorg.conf.d" [ 739.132] (==) No Layout section. Using the first Screen section. [ 739.132] (==) No screen section available. Using defaults. [ 739.132] (**) |-->Screen "Default Screen Section" (0) [ 739.132] (**) | |-->Monitor "" [ 739.133] (==) No monitor specified for screen "Default Screen Section". Using a default monitor configuration. [ 739.133] (==) Automatically adding devices [ 739.133] (==) Automatically enabling devices [ 739.133] (==) Automatically adding GPU devices [ 739.133] (==) FontPath set to: catalogue:/etc/X11/fontpath.d, built-ins [ 739.133] (**) ModulePath set to "/usr/local/lib/xorg/modules,/usr/lib64/xorg/modules" [ 739.133] (II) The server relies on udev to provide the list of input devices. If no devices become available, reconfigure udev or disable AutoAddDevices. 
[ 739.133] (II) Loader magic: 0x81de40 [ 739.133] (II) Module ABI versions: [ 739.133] X.Org ANSI C Emulation: 0.4 [ 739.133] X.Org Video Driver: 18.0 [ 739.133] X.Org XInput driver : 21.0 [ 739.133] X.Org Server Extension : 8.0 [ 739.135] (II) systemd-logind: took control of session /org/freedesktop/login1/session/_31 [ 739.136] (II) xfree86: Adding drm device (/dev/dri/card0) [ 739.136] (II) systemd-logind: got fd for /dev/dri/card0 226:0 fd 10 paused 0 [ 739.146] (--) PCI:*(0:9:0:0) 1002:683f:1787:2318 rev 0, Mem @ 0xe000/268435456, 0xf4a0/262144, I/O @ 0xc000/256, BIOS @ 0x/131072 [ 739.146] (II) LoadModule: "glx" [ 739.147] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so [ 739.150] (II) Module glx: vendor="X.Org Foundation" [ 739.150] compiled for 1.16.3, module version = 1.0.0 [ 739.150] ABI class: X.Org Server Extension, version 8.0 [ 739.150] (==) AIGLX enabled [ 739.150] (==) Matched ati as autoconfigured driver 0 [ 739.150] (==) Matched ati as autoconfigured driver 1 [ 739.150] (==) Matched modesetting as autoconfigured driver 2 [ 739.150] (==) Matched fbdev as autoconfigured driver 3 [ 739.150] (==) Matched vesa as autoconfigured driver 4 [ 739.150] (==) Assigned the driver to the xf86ConfigLayout [ 739.150] (II) LoadModule: "ati" [ 739.151] (II) Loading /usr/lib64/xorg/modules/drivers/ati_drv.so [ 739.151] (II) Module ati: vendor="X.Org Foundation" [ 739.151] compiled for 1.16.1, module version = 7.5.0 [ 739.152] Module class: X.Org Video Driver [ 739.152] ABI class: X.Org Video Driver, version 18.0 [ 739.152] (II) LoadModule: "radeon" [ 739.152] (II) Loading /usr/lib64/xorg/modules/drivers/radeon_drv.so [ 739.153] (II) Module radeon: vendor="X.Org Foundation" [ 739.153] compiled for 1.16.1, module version = 7.5.0 [ 739.153] Module class: X.Org Video Driver [ 739.153] ABI class: X.Org Video Driver, version 18.0 [ 739.153] (II) LoadModule: "modesetting" [ 739.154]
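The `ret = -13` that follows DRM_IOCTL_SET_VERSION in the dmesg excerpt can be checked against the standard errno table. A minimal sketch in Python; the helper name `decode_kernel_ret` is made up for illustration and is not part of any DRM tooling:

```python
import errno
import os


def decode_kernel_ret(ret):
    """Map a kernel-style negative return value to (symbolic name, message)."""
    code = -ret if ret < 0 else ret
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# The value logged by drm_ioctl in the excerpt above.
name, message = decode_kernel_ret(-13)
print(name, "-", message)
```

This only confirms the name of the error; it says nothing about why the ioctl was refused.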
Long radeon stalls on recent kernels
On Wed, Dec 10, 2014 at 8:24 PM, Michel Dänzer wrote: > On 11.12.2014 05:28, Andy Lutomirski wrote: >> On Wed, Dec 10, 2014 at 1:44 AM, Michel Dänzer >> wrote: >>> On 10.12.2014 06:39, Andy Lutomirski wrote: >>>> On Tue, Dec 9, 2014 at 8:06 AM, Andy Lutomirski >>>> wrote: >>>>> On Tue, Dec 9, 2014 at 1:18 AM, Michel Dänzer >>>>> wrote: >>>>>> On 09.12.2014 09:24, Andy Lutomirski wrote: >>>>>>> >>>>>>> The relevant line from latencytop seems to be: >>>>>>> >>>>>>> 154 20441402 489139 radeon_fence_default_wait [radeon] >>>>>>> fence_wait_timeout ttm_bo_wait [ttm] ttm_bo_move_accel_cleanup [ttm] >>>>>>> radeon_move_blit.isra.12 [radeon] radeon_bo_move [radeon] >>>>>>> ttm_bo_handle_move_mem [ttm] ttm_bo_evict [ttm] ttm_mem_evict_first >>>>>>> [ttm] ttm_bo_mem_space [ttm] ttm_bo_validate [ttm] >>>>>>> radeon_bo_fault_reserve_notify [radeon] >>>>>> >>>>>> Which process is this? >>>>> >>>>> Xorg >>>>> >>>>>> >>>>>> Looks like CPU access to a BO in VRAM, but the BO is located outside of >>>>>> the CPU visible area of VRAM, so it has to be moved into the CPU visible >>>>>> area first. > > [...] > >>>> But I'm still waiting for the day that buggy userspace *can't* cause >>>> kernel graphics stalls. >>> >>> Actually, this looks more like buggy userspace stalling itself. :) >> >> I thought the stall was the kernel evicting things from vram. Why >> does it need to wait for userspace for that? Is it that userspace is >> actively using whatever's being evicted? > > As I explained above, the stall happens because userspace does CPU > access to a BO which resides in the CPU-inaccessible part of VRAM. The > kernel has to move the BO into the CPU accessible part of VRAM before it > can let userspace proceed. Sure, but why does that take nearly 500ms? Even if the object in question is the entire framebuffer, that still seems extraordinarily slow. 
--Andy

>
> Current Mesa (10.4 or newer I think) sets a hint for BOs which will
> likely be accessed by the CPU, so recent kernels can prioritize putting
> those into the CPU accessible part of VRAM in the first place.
>
> Or, if you're using EXA, the problem could be in the xf86-video-ati EXA
> code.
>
>
> --
> Earthling Michel Dänzer | http://www.amd.com
> Libre software enthusiast | Mesa and X developer

--
Andy Lutomirski
AMA Capital Management, LLC
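To put "nearly 500ms" in perspective, here is a back-of-the-envelope sketch in Python. The framebuffer geometry (1920x1080 at 4 bytes per pixel) is an assumption, not something stated in the thread, and the PCIe figure is the theoretical gen 2 rate of roughly 500 MB/s per lane, not a measurement of this card:

```python
# Assumption (not from the thread): a 1920x1080 framebuffer, 4 bytes per pixel.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 1920, 1080, 4

# "nearly 500ms": the radeon_fence_default_wait latency discussed above.
STALL_SECONDS = 0.489

# PCIe gen 2 signals at 5 GT/s per lane with 8b/10b encoding,
# i.e. about 500 MB/s of payload per lane in each direction.
PCIE_GEN2_BYTES_PER_LANE = 500e6


def framebuffer_bytes(width, height, bytes_per_pixel):
    return width * height * bytes_per_pixel


def implied_bandwidth(nbytes, seconds):
    """Bytes per second needed to move nbytes in the given time."""
    return nbytes / seconds


def ideal_copy_seconds(nbytes, lanes):
    """Time to move nbytes over an ideal gen 2 link with the given lane count."""
    return nbytes / (PCIE_GEN2_BYTES_PER_LANE * lanes)


size = framebuffer_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL)
print(f"framebuffer size: {size / 1e6:.1f} MB")
print(f"rate implied by the stall: {implied_bandwidth(size, STALL_SECONDS) / 1e6:.1f} MB/s")
for lanes in (1, 16):
    print(f"ideal x{lanes} transfer: {ideal_copy_seconds(size, lanes) * 1e3:.1f} ms")
```

Under these assumptions, even a single ideal lane would move a whole framebuffer in well under the observed stall, so raw copy bandwidth alone does not obviously account for it.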
Long radeon stalls on recent kernels
On Wed, Dec 10, 2014 at 1:44 AM, Michel Dänzer wrote: > On 10.12.2014 06:39, Andy Lutomirski wrote: >> On Tue, Dec 9, 2014 at 8:06 AM, Andy Lutomirski >> wrote: >>> On Tue, Dec 9, 2014 at 1:18 AM, Michel Dänzer >>> wrote: >>>> On 09.12.2014 09:24, Andy Lutomirski wrote: >>>>> >>>>> The relevant line from latencytop seems to be: >>>>> >>>>> 154 20441402 489139 radeon_fence_default_wait [radeon] >>>>> fence_wait_timeout ttm_bo_wait [ttm] ttm_bo_move_accel_cleanup [ttm] >>>>> radeon_move_blit.isra.12 [radeon] radeon_bo_move [radeon] >>>>> ttm_bo_handle_move_mem [ttm] ttm_bo_evict [ttm] ttm_mem_evict_first >>>>> [ttm] ttm_bo_mem_space [ttm] ttm_bo_validate [ttm] >>>>> radeon_bo_fault_reserve_notify [radeon] >>>> >>>> Which process is this? >>> >>> Xorg >>> >>>> >>>> Looks like CPU access to a BO in VRAM, but the BO is located outside of >>>> the CPU visible area of VRAM, so it has to be moved into the CPU visible >>>> area first. >>>> >>>> Which version of Mesa are you using? >>>> >>> >>> mesa-dri-drivers-10.3.3-1.20141110.fc20.x86_64 >>> >>> I'm planning on upgrading to Fedora 21 fairly soon. >> >> Upgrading to mesa-dri-drivers-10.3.3-1.20141110.fc21.x86_64 seems to >> have helped enough that my usual test (open a couple of Firefox tabs >> with graphics in them) doesn't hang anymore. > > Hmm, since that looks like the exact same upstream version, maybe it was > actually upgrading something else that made the difference? > Maybe mutter? > >> This card still isn't *fast*. > > I'm afraid it wasn't exactly a high-end card even when it was new. What > kind of operations are slow? Things like scrolling in Google Maps. It's not *that* bad, but older Intel IGPs still seem considerably smoother. > > >> Is there some way I can check that I'm actually using all 16 PCIe lanes? >> In my tinkering w/ power management settings, I got some odd logs >> suggesting that only one lane was in use. > > You can try forcing off ASPM with radeon.aspm=0, other than that I'm not > sure. 
>
>
>> But I'm still waiting for the day that buggy userspace *can't* cause
>> kernel graphics stalls.
>
> Actually, this looks more like buggy userspace stalling itself. :)

I thought the stall was the kernel evicting things from vram. Why
does it need to wait for userspace for that? Is it that userspace is
actively using whatever's being evicted?

--Andy

>
>
> --
> Earthling Michel Dänzer | http://www.amd.com
> Libre software enthusiast | Mesa and X developer

--
Andy Lutomirski
AMA Capital Management, LLC
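On the question of checking how many PCIe lanes are in use: kernels that expose the `current_link_width` and `max_link_width` sysfs attributes (newer than the ones discussed in this thread) let you compare the negotiated width with the maximum. A hedged sketch in Python; the function names are made up, and the device directory is a parameter because the full PCI address is not given here:

```python
from pathlib import Path

LINK_ATTRS = (
    "current_link_width",
    "max_link_width",
    "current_link_speed",
    "max_link_speed",
)


def read_link_status(device_dir):
    """Read PCIe link attributes from a PCI device's sysfs directory.

    Attributes that the running kernel does not expose come back as None.
    """
    device_dir = Path(device_dir)
    status = {}
    for attr in LINK_ATTRS:
        path = device_dir / attr
        status[attr] = path.read_text().strip() if path.exists() else None
    return status


def is_link_degraded(status):
    """True if fewer lanes are negotiated than supported, None if unknown."""
    current, maximum = status["current_link_width"], status["max_link_width"]
    if current is None or maximum is None:
        return None
    return int(current) < int(maximum)
```

Pass the device's directory under /sys/bus/pci/devices/ to `read_link_status`. On kernels without these attributes, the `LnkCap` and `LnkSta` lines of `lspci -vv` carry the same information.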
Long radeon stalls on recent kernels
On Tue, Dec 9, 2014 at 8:06 AM, Andy Lutomirski wrote: > On Tue, Dec 9, 2014 at 1:18 AM, Michel Dänzer wrote: >> On 09.12.2014 09:24, Andy Lutomirski wrote: >>> >>> The relevant line from latencytop seems to be: >>> >>> 154 20441402 489139 radeon_fence_default_wait [radeon] >>> fence_wait_timeout ttm_bo_wait [ttm] ttm_bo_move_accel_cleanup [ttm] >>> radeon_move_blit.isra.12 [radeon] radeon_bo_move [radeon] >>> ttm_bo_handle_move_mem [ttm] ttm_bo_evict [ttm] ttm_mem_evict_first >>> [ttm] ttm_bo_mem_space [ttm] ttm_bo_validate [ttm] >>> radeon_bo_fault_reserve_notify [radeon] >> >> Which process is this? > > Xorg > >> >> Looks like CPU access to a BO in VRAM, but the BO is located outside of >> the CPU visible area of VRAM, so it has to be moved into the CPU visible >> area first. >> >> Which version of Mesa are you using? >> > > mesa-dri-drivers-10.3.3-1.20141110.fc20.x86_64 > > I'm planning on upgrading to Fedora 21 fairly soon. Upgrading to mesa-dri-drivers-10.3.3-1.20141110.fc21.x86_64 seems to have helped enough that my usual test (open a couple of Firefox tabs with graphics in them) doesn't hang anymore. This card still isn't *fast*. Is there some way I can check that I'm actually using all 16 PCIe lanes? In my tinkering w/ power management settings, I got some odd logs suggesting that only one lane was in use. Other than that, maybe everything works :) But I'm still waiting for the day that buggy userspace *can't* cause kernel graphics stalls. --Andy > > --Andy > >> >> -- >> Earthling Michel Dänzer | http://www.amd.com >> Libre software enthusiast | Mesa and X developer > > > > -- > Andy Lutomirski > AMA Capital Management, LLC -- Andy Lutomirski AMA Capital Management, LLC
Long radeon stalls on recent kernels
On Tue, Dec 9, 2014 at 1:18 AM, Michel Dänzer wrote:
> On 09.12.2014 09:24, Andy Lutomirski wrote:
>>
>> The relevant line from latencytop seems to be:
>>
>> 154 20441402 489139 radeon_fence_default_wait [radeon]
>> fence_wait_timeout ttm_bo_wait [ttm] ttm_bo_move_accel_cleanup [ttm]
>> radeon_move_blit.isra.12 [radeon] radeon_bo_move [radeon]
>> ttm_bo_handle_move_mem [ttm] ttm_bo_evict [ttm] ttm_mem_evict_first
>> [ttm] ttm_bo_mem_space [ttm] ttm_bo_validate [ttm]
>> radeon_bo_fault_reserve_notify [radeon]
>
> Which process is this?

Xorg

>
> Looks like CPU access to a BO in VRAM, but the BO is located outside of
> the CPU visible area of VRAM, so it has to be moved into the CPU visible
> area first.
>
> Which version of Mesa are you using?
>

mesa-dri-drivers-10.3.3-1.20141110.fc20.x86_64

I'm planning on upgrading to Fedora 21 fairly soon.

--Andy

>
> --
> Earthling Michel Dänzer | http://www.amd.com
> Libre software enthusiast | Mesa and X developer

--
Andy Lutomirski
AMA Capital Management, LLC
Long radeon stalls on recent kernels
On Wed, Nov 26, 2014 at 7:38 AM, Andy Lutomirski wrote: > On Tue, Nov 25, 2014 at 10:42 PM, Michel Dänzer > wrote: >> On 20.11.2014 09:58, Andy Lutomirski wrote: >>> >>> On Wed, Nov 19, 2014 at 4:07 PM, Andy Lutomirski >>> wrote: >>>> >>>> On Tue, Nov 18, 2014 at 11:19 PM, Michel Dänzer >>>> wrote: >>>>> >>>>> On 19.11.2014 09:21, Andy Lutomirski wrote: >>>>>> >>>>>> >>>>>> On Mon, Nov 17, 2014 at 1:51 AM, Michel Dänzer >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> On 15.11.2014 07:21, Andy Lutomirski wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything >>>>>>>> graphics intensive seems to cause my system to become unusable for >>>>>>>> tens of seconds. Pointing Firefox at Google Maps is a big offender >>>>>>>> -- >>>>>>>> it can take several minutes for me to move my mouse far enough to >>>>>>>> close the tab and get my computer back. >>>>>>>> >>>>>>>> On bootup, I get this warning: >>>>>>>> [drm:btc_dpm_set_power_state] *ERROR* >>>>>>>> rv770_restrict_performance_levels_before_switch failed >>>>>>>> >>>>>>>> Setting radeon.dpm=0 seems to work around this problem at the cost of >>>>>>>> giving my rather slow graphics. >>>>>>>> >>>>>>>> Are there known issues here? >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> Can you bisect the kernel, or at least isolate which kernel version >>>>>>> first >>>>>>> introduced the problem? >>>>>> >>>>>> >>>>>> >>>>>> With whatever userspace I'm running, I'm seeing it 3.13, 3.14, 3.15, >>>>>> 3.16, and 3.18-rc4+. I haven't tried other versions. >>>>>> >>>>>> With radeon.dpm=0, I can still trigger short stalls (around one >>>>>> second), but I seem unable to trigger long stalls easily. (I say >>>>>> easily because, just as I was typing this email, my system stalled for >>>>>> about a minute.) >>>>> >>>>> >>>>> >>>>> I can only think of two things offhand that could cause such extremely >>>>> long >>>>> stalls: Swap thrashing or IRQ storms. 
>>>>> >>>>> With a setup where you can easily trigger long stalls, can you try >>>>> getting a >>>>> CPU profile for a stall with sysprof or perf? >>>>> >>>>> >>>> >>>> Got one with perf: >>>> >>>>16.82% Xorg libc-2.18.so[.] >>>> __memcpy_sse2_unaligned >>>> 9.20% swapper [kernel.kallsyms] [k] >>>> intel_idle >>>> 1.00% Xorg [kernel.kallsyms] [k] >>>> evergreen_irq_set >>>> 0.83% firefox libxul.so [.] >>>> 0x01d93281 >>>> 0.69% firefox libxul.so [.] >>>> 0x01d932ad >>>> 0.62% firefox [kernel.kallsyms] [k] >>>> copy_user_generic_string >>>> 0.55% swapper [kernel.kallsyms] [k] >>>> evergreen_irq_ack >>>> 0.54% firefox libpthread-2.18.so [.] >>>> pthread_mutex_lock >>>> 0.52% firefox libpthread-2.18.so [.] >>>> pthread_mutex_unlock >>>> 0.45% Xorg [kernel.kallsyms] [k] >>>> drm_mm_insert_node_in_range_generic >>>> 0.41% Xorg [kernel.kallsyms] [k] >>>> lock_release >>>> 0.40% Xorg [kernel.kallsyms] [k] >>>> lock_acquire >>>> 0.35% firefox firefox [.] >>>> 0x0001245d >>>
Long radeon stalls on recent kernels
On Tue, Nov 25, 2014 at 10:42 PM, Michel Dänzer wrote: > On 20.11.2014 09:58, Andy Lutomirski wrote: >> >> On Wed, Nov 19, 2014 at 4:07 PM, Andy Lutomirski >> wrote: >>> >>> On Tue, Nov 18, 2014 at 11:19 PM, Michel Dänzer >>> wrote: >>>> >>>> On 19.11.2014 09:21, Andy Lutomirski wrote: >>>>> >>>>> >>>>> On Mon, Nov 17, 2014 at 1:51 AM, Michel Dänzer >>>>> wrote: >>>>>> >>>>>> >>>>>> On 15.11.2014 07:21, Andy Lutomirski wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything >>>>>>> graphics intensive seems to cause my system to become unusable for >>>>>>> tens of seconds. Pointing Firefox at Google Maps is a big offender >>>>>>> -- >>>>>>> it can take several minutes for me to move my mouse far enough to >>>>>>> close the tab and get my computer back. >>>>>>> >>>>>>> On bootup, I get this warning: >>>>>>> [drm:btc_dpm_set_power_state] *ERROR* >>>>>>> rv770_restrict_performance_levels_before_switch failed >>>>>>> >>>>>>> Setting radeon.dpm=0 seems to work around this problem at the cost of >>>>>>> giving my rather slow graphics. >>>>>>> >>>>>>> Are there known issues here? >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Can you bisect the kernel, or at least isolate which kernel version >>>>>> first >>>>>> introduced the problem? >>>>> >>>>> >>>>> >>>>> With whatever userspace I'm running, I'm seeing it 3.13, 3.14, 3.15, >>>>> 3.16, and 3.18-rc4+. I haven't tried other versions. >>>>> >>>>> With radeon.dpm=0, I can still trigger short stalls (around one >>>>> second), but I seem unable to trigger long stalls easily. (I say >>>>> easily because, just as I was typing this email, my system stalled for >>>>> about a minute.) >>>> >>>> >>>> >>>> I can only think of two things offhand that could cause such extremely >>>> long >>>> stalls: Swap thrashing or IRQ storms. >>>> >>>> With a setup where you can easily trigger long stalls, can you try >>>> getting a >>>> CPU profile for a stall with sysprof or perf? 
>>>> >>>> >>> >>> Got one with perf: >>> >>>16.82% Xorg libc-2.18.so[.] >>> __memcpy_sse2_unaligned >>> 9.20% swapper [kernel.kallsyms] [k] >>> intel_idle >>> 1.00% Xorg [kernel.kallsyms] [k] >>> evergreen_irq_set >>> 0.83% firefox libxul.so [.] >>> 0x01d93281 >>> 0.69% firefox libxul.so [.] >>> 0x01d932ad >>> 0.62% firefox [kernel.kallsyms] [k] >>> copy_user_generic_string >>> 0.55% swapper [kernel.kallsyms] [k] >>> evergreen_irq_ack >>> 0.54% firefox libpthread-2.18.so [.] >>> pthread_mutex_lock >>> 0.52% firefox libpthread-2.18.so [.] >>> pthread_mutex_unlock >>> 0.45% Xorg [kernel.kallsyms] [k] >>> drm_mm_insert_node_in_range_generic >>> 0.41% Xorg [kernel.kallsyms] [k] >>> lock_release >>> 0.40% Xorg [kernel.kallsyms] [k] >>> lock_acquire >>> 0.35% firefox firefox [.] >>> 0x0001245d >>> 0.33% Xorg [kernel.kallsyms] [k] >>> __module_address >>> 0.31% firefox [kernel.kallsyms] [k] >>> clear_page_c >>> 0.29% Xorg [kernel.kallsyms] [k] >>> copy_user_generic_string >>> 0.28% firefox firefox [.] >>> 0x00013
Long radeon stalls on recent kernels
On Wed, Nov 19, 2014 at 4:07 PM, Andy Lutomirski wrote: > On Tue, Nov 18, 2014 at 11:19 PM, Michel Dänzer > wrote: >> On 19.11.2014 09:21, Andy Lutomirski wrote: >>> >>> On Mon, Nov 17, 2014 at 1:51 AM, Michel Dänzer >>> wrote: >>>> >>>> On 15.11.2014 07:21, Andy Lutomirski wrote: >>>>> >>>>> >>>>> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything >>>>> graphics intensive seems to cause my system to become unusable for >>>>> tens of seconds. Pointing Firefox at Google Maps is a big offender -- >>>>> it can take several minutes for me to move my mouse far enough to >>>>> close the tab and get my computer back. >>>>> >>>>> On bootup, I get this warning: >>>>> [drm:btc_dpm_set_power_state] *ERROR* >>>>> rv770_restrict_performance_levels_before_switch failed >>>>> >>>>> Setting radeon.dpm=0 seems to work around this problem at the cost of >>>>> giving my rather slow graphics. >>>>> >>>>> Are there known issues here? >>>> >>>> >>>> >>>> Can you bisect the kernel, or at least isolate which kernel version first >>>> introduced the problem? >>> >>> >>> With whatever userspace I'm running, I'm seeing it 3.13, 3.14, 3.15, >>> 3.16, and 3.18-rc4+. I haven't tried other versions. >>> >>> With radeon.dpm=0, I can still trigger short stalls (around one >>> second), but I seem unable to trigger long stalls easily. (I say >>> easily because, just as I was typing this email, my system stalled for >>> about a minute.) >> >> >> I can only think of two things offhand that could cause such extremely long >> stalls: Swap thrashing or IRQ storms. >> >> With a setup where you can easily trigger long stalls, can you try getting a >> CPU profile for a stall with sysprof or perf? >> >> > > Got one with perf: > > 16.82% Xorg libc-2.18.so[.] > __memcpy_sse2_unaligned >9.20% swapper [kernel.kallsyms] [k] intel_idle >1.00% Xorg [kernel.kallsyms] [k] > evergreen_irq_set >0.83% firefox libxul.so [.] > 0x01d93281 >0.69% firefox libxul.so [.] 
> 0x01d932ad
>   0.62%  firefox  [kernel.kallsyms]   [k] copy_user_generic_string
>   0.55%  swapper  [kernel.kallsyms]   [k] evergreen_irq_ack
>   0.54%  firefox  libpthread-2.18.so  [.] pthread_mutex_lock
>   0.52%  firefox  libpthread-2.18.so  [.] pthread_mutex_unlock
>   0.45%  Xorg     [kernel.kallsyms]   [k] drm_mm_insert_node_in_range_generic
>   0.41%  Xorg     [kernel.kallsyms]   [k] lock_release
>   0.40%  Xorg     [kernel.kallsyms]   [k] lock_acquire
>   0.35%  firefox  firefox             [.] 0x0001245d
>   0.33%  Xorg     [kernel.kallsyms]   [k] __module_address
>   0.31%  firefox  [kernel.kallsyms]   [k] clear_page_c
>   0.29%  Xorg     [kernel.kallsyms]   [k] copy_user_generic_string
>   0.28%  firefox  firefox             [.] 0x00013159
>
> and:
>
> Samples: 11K of event 'irq:irq_handler_entry', Event count (approx.): 11802
>  87.43%  swapper          [kernel.kallsyms]  [k] handle_irq_event_percpu
>   7.52%  firefox          [kernel.kallsyms]  [k] handle_irq_event_percpu
>   1.84%  irq/36-ahci      [kernel.kallsyms]  [k] handle_irq_event_percpu
>   1.14%  Xorg             [kernel.kallsyms]  [k] handle_irq_event_percpu
>   0.75%  kworker/5:0      [kernel.kallsyms]  [k] handle_irq_event_percpu
>   0.32%  gnome-shell      [kernel.kallsyms]  [k] handle_irq_event_percpu
>   0.25%  kworker/5:1H     [kernel.kallsyms]  [k] handle_irq_event_percpu
>   0.25%  Media D~ode #10  [kernel.kallsyms]  [k] handle_irq_event_percpu
>   0.19%  ImageDe~er #330  [kernel.kallsyms]  [k] handle_irq_event_percpu
>   0.07%  pulseaudio       [kernel.kallsyms]  [k] handle_irq_event_percpu
>
> The cycles were with -e cycles:pp, so I think that iret would have
> shown up if there were enough IRQs to cause the problem.
>
> I'll build a kernel with latencytop.
>

I just caught call_rwsem_down_write_failed for 5379 ms in khugepaged
(holy crap) and radeon_fence_default_wait for 489.2ms in Xorg.

Turning off THP gets rid of the khugepaged thing.

The 489.2ms in radeon_fence_default_wait is amazingly reproducible --
I've seen that exact number three times now.

> --Andy

--
Andy Lutomirski
AMA Capital Management, LLC
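For anyone repeating the THP experiment, the current mode can be read from /sys/kernel/mm/transparent_hugepage/enabled, where the active choice is the bracketed word (for example `[always] madvise never`). A small Python sketch; the function name `active_thp_mode` is made up for illustration:

```python
import re
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs file's contents."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


# Only touches sysfs when the file exists (kernels built with THP support).
if THP_ENABLED.exists():
    print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
```

Writing `never` to the same file as root turns THP off until the next boot.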
Long radeon stalls on recent kernels
On Tue, Nov 18, 2014 at 11:19 PM, Michel Dänzer wrote: > On 19.11.2014 09:21, Andy Lutomirski wrote: >> >> On Mon, Nov 17, 2014 at 1:51 AM, Michel Dänzer >> wrote: >>> >>> On 15.11.2014 07:21, Andy Lutomirski wrote: >>>> >>>> >>>> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything >>>> graphics intensive seems to cause my system to become unusable for >>>> tens of seconds. Pointing Firefox at Google Maps is a big offender -- >>>> it can take several minutes for me to move my mouse far enough to >>>> close the tab and get my computer back. >>>> >>>> On bootup, I get this warning: >>>> [drm:btc_dpm_set_power_state] *ERROR* >>>> rv770_restrict_performance_levels_before_switch failed >>>> >>>> Setting radeon.dpm=0 seems to work around this problem at the cost of >>>> giving my rather slow graphics. >>>> >>>> Are there known issues here? >>> >>> >>> >>> Can you bisect the kernel, or at least isolate which kernel version first >>> introduced the problem? >> >> >> With whatever userspace I'm running, I'm seeing it 3.13, 3.14, 3.15, >> 3.16, and 3.18-rc4+. I haven't tried other versions. >> >> With radeon.dpm=0, I can still trigger short stalls (around one >> second), but I seem unable to trigger long stalls easily. (I say >> easily because, just as I was typing this email, my system stalled for >> about a minute.) > > > I can only think of two things offhand that could cause such extremely long > stalls: Swap thrashing or IRQ storms. > > With a setup where you can easily trigger long stalls, can you try getting a > CPU profile for a stall with sysprof or perf? > > Got one with perf: 16.82% Xorg libc-2.18.so[.] __memcpy_sse2_unaligned 9.20% swapper [kernel.kallsyms] [k] intel_idle 1.00% Xorg [kernel.kallsyms] [k] evergreen_irq_set 0.83% firefox libxul.so [.] 0x01d93281 0.69% firefox libxul.so [.] 0x01d932ad 0.62% firefox [kernel.kallsyms] [k] copy_user_generic_string 0.55% swapper [kernel.kallsyms] [k] evergreen_irq_ack 0.54% firefox libpthread-2.18.so [.] 
pthread_mutex_lock 0.52% firefox libpthread-2.18.so [.] pthread_mutex_unlock 0.45% Xorg [kernel.kallsyms] [k] drm_mm_insert_node_in_range_generic 0.41% Xorg [kernel.kallsyms] [k] lock_release 0.40% Xorg [kernel.kallsyms] [k] lock_acquire 0.35% firefox firefox [.] 0x0001245d 0.33% Xorg [kernel.kallsyms] [k] __module_address 0.31% firefox [kernel.kallsyms] [k] clear_page_c 0.29% Xorg [kernel.kallsyms] [k] copy_user_generic_string 0.28% firefox firefox [.] 0x00013159 and: Samples: 11K of event 'irq:irq_handler_entry', Event count (approx.): 11802 87.43% swapper [kernel.kallsyms] [k] handle_irq_event_percpu 7.52% firefox [kernel.kallsyms] [k] handle_irq_event_percpu 1.84% irq/36-ahci [kernel.kallsyms] [k] handle_irq_event_percpu 1.14% Xorg [kernel.kallsyms] [k] handle_irq_event_percpu 0.75% kworker/5:0 [kernel.kallsyms] [k] handle_irq_event_percpu 0.32% gnome-shell [kernel.kallsyms] [k] handle_irq_event_percpu 0.25% kworker/5:1H [kernel.kallsyms] [k] handle_irq_event_percpu 0.25% Media D~ode #10 [kernel.kallsyms] [k] handle_irq_event_percpu 0.19% ImageDe~er #330 [kernel.kallsyms] [k] handle_irq_event_percpu 0.07% pulseaudio [kernel.kallsyms] [k] handle_irq_event_percpu The cycles were with -e cycles:pp, so I think that iret would have shown up if there were enough IRQs to cause the problem. I'll build a kernel with latencytop. --Andy
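Another way to rule an IRQ storm in or out, independent of perf, is to snapshot /proc/interrupts before and after a stall and compare per-IRQ totals. A minimal sketch in Python, assuming the usual layout (a header row of CPU columns, then one row per interrupt); the function names are made up for illustration:

```python
def parse_interrupts(text):
    """Sum the per-CPU counters of each row of /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        total = 0
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break  # reached the controller/device description
            total += int(field)
        totals[label.strip()] = total
    return totals


def busiest(before, after, top=3):
    """Return the interrupts whose totals grew the most between snapshots."""
    deltas = {irq: count - before.get(irq, 0) for irq, count in after.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]


def snapshot(path="/proc/interrupts"):
    with open(path) as handle:
        return parse_interrupts(handle.read())
```

Take one `snapshot()` before triggering the stall and one after, then pass both to `busiest()`; a storm would show up as one interrupt with a delta far larger than the rest.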
Long radeon stalls on recent kernels
On Tue, Nov 18, 2014 at 4:34 PM, Andy Lutomirski wrote: > On Tue, Nov 18, 2014 at 4:21 PM, Andy Lutomirski > wrote: >> On Mon, Nov 17, 2014 at 1:51 AM, Michel Dänzer >> wrote: >>> On 15.11.2014 07:21, Andy Lutomirski wrote: >>>> >>>> I have a Caicos card, like this: >>>> >>>> [3.077260] [drm] radeon kernel modesetting enabled. >>>> [3.077338] checking generic (e000 60) vs hw (e000 >>>> 1000) >>>> [3.077339] fb: switching to radeondrmfb from EFI VGA >>>> [3.077377] Console: switching to colour dummy device 80x25 >>>> [3.078881] [drm] initializing kernel modesetting (CAICOS >>>> 0x1002:0x6779 0x174B:0xE164). >>>> [3.078903] [drm] register mmio base: 0xF4A2 >>>> [3.078904] [drm] register mmio size: 131072 >>>> [3.078982] ATOM BIOS: C26401 >>>> [3.079572] radeon :09:00.0: VRAM: 1024M 0x - >>>> 0x3FFF (1024M used) >>>> [3.079574] radeon :09:00.0: GTT: 1024M 0x4000 - >>>> 0x7FFF >>>> [3.079576] [drm] Detected VRAM RAM=1024M, BAR=256M >>>> [3.079577] [drm] RAM width 64bits DDR >>>> [3.079755] [TTM] Zone kernel: Available graphics memory: 8186568 kiB >>>> [3.079757] [TTM] Zone dma32: Available graphics memory: 2097152 kiB >>>> [3.079757] [TTM] Initializing pool allocator >>>> [3.079773] [TTM] Initializing DMA pool allocator >>>> [3.080011] [drm] radeon: 1024M of VRAM memory ready >>>> [3.080012] [drm] radeon: 1024M of GTT memory ready. >>>> [3.080049] [drm] Loading CAICOS Microcode >>>> [3.080330] [drm] Internal thermal controller without fan control >>>> [3.081425] [drm] radeon: power management initialized >>>> [3.081551] [drm] GART: num cpu pages 262144, num gpu pages 262144 >>>> [3.082589] [drm] enabling PCIE gen 2 link speeds, disable with >>>> radeon.pcie_gen2=0 >>>> [3.085030] [drm] PCIE GART of 1024M enabled (table at >>>> 0x00274000). 
>>>> [3.085221] radeon :09:00.0: WB enabled >>>> [3.085224] radeon :09:00.0: fence driver on ring 0 use gpu >>>> addr 0x4c00 and cpu addr 0x88043d914c00 >>>> [3.085225] radeon :09:00.0: fence driver on ring 3 use gpu >>>> addr 0x4c0c and cpu addr 0x88043d914c0c >>>> [3.097438] radeon :09:00.0: fence driver on ring 5 use gpu >>>> addr 0x00072118 and cpu addr 0xc900128b2118 >>>> [3.097441] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013). >>>> [3.097442] [drm] Driver supports precise vblank timestamp query. >>>> [3.097514] radeon :09:00.0: irq 56 for MSI/MSI-X >>>> [3.097544] radeon :09:00.0: radeon: using MSI. >>>> [3.097614] [drm] radeon: irq initialized. >>>> >>>> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything >>>> graphics intensive seems to cause my system to become unusable for >>>> tens of seconds. Pointing Firefox at Google Maps is a big offender -- >>>> it can take several minutes for me to move my mouse far enough to >>>> close the tab and get my computer back. >>>> >>>> On bootup, I get this warning: >>>> [drm:btc_dpm_set_power_state] *ERROR* >>>> rv770_restrict_performance_levels_before_switch failed >>>> >>>> Setting radeon.dpm=0 seems to work around this problem at the cost of >>>> giving my rather slow graphics. >>>> >>>> Are there known issues here? >>> >>> >>> Can you bisect the kernel, or at least isolate which kernel version first >>> introduced the problem? >> >> With whatever userspace I'm running, I'm seeing it 3.13, 3.14, 3.15, >> 3.16, and 3.18-rc4+. I haven't tried other versions. >> >> With radeon.dpm=0, I can still trigger short stalls (around one >> second), but I seem unable to trigger long stalls easily. (I say >> easily because, just as I was typing this email, my system stalled for >> about a minute.) > > I could be wrong here, but I think that radeon.dpm=0, > power_profile=default is okay, but radeon.dpm=0, power_profile=high is > bad. I'm wrong again. power_profile=default is also bad. Grr. --Andy
Long radeon stalls on recent kernels
On Tue, Nov 18, 2014 at 4:21 PM, Andy Lutomirski wrote: > On Mon, Nov 17, 2014 at 1:51 AM, Michel Dänzer wrote: >> On 15.11.2014 07:21, Andy Lutomirski wrote: >>> >>> I have a Caicos card, like this: >>> >>> [3.077260] [drm] radeon kernel modesetting enabled. >>> [3.077338] checking generic (e000 60) vs hw (e000 >>> 1000) >>> [3.077339] fb: switching to radeondrmfb from EFI VGA >>> [3.077377] Console: switching to colour dummy device 80x25 >>> [3.078881] [drm] initializing kernel modesetting (CAICOS >>> 0x1002:0x6779 0x174B:0xE164). >>> [3.078903] [drm] register mmio base: 0xF4A2 >>> [3.078904] [drm] register mmio size: 131072 >>> [3.078982] ATOM BIOS: C26401 >>> [3.079572] radeon :09:00.0: VRAM: 1024M 0x - >>> 0x3FFF (1024M used) >>> [3.079574] radeon :09:00.0: GTT: 1024M 0x4000 - >>> 0x7FFF >>> [3.079576] [drm] Detected VRAM RAM=1024M, BAR=256M >>> [3.079577] [drm] RAM width 64bits DDR >>> [3.079755] [TTM] Zone kernel: Available graphics memory: 8186568 kiB >>> [3.079757] [TTM] Zone dma32: Available graphics memory: 2097152 kiB >>> [3.079757] [TTM] Initializing pool allocator >>> [3.079773] [TTM] Initializing DMA pool allocator >>> [3.080011] [drm] radeon: 1024M of VRAM memory ready >>> [3.080012] [drm] radeon: 1024M of GTT memory ready. >>> [3.080049] [drm] Loading CAICOS Microcode >>> [3.080330] [drm] Internal thermal controller without fan control >>> [3.081425] [drm] radeon: power management initialized >>> [3.081551] [drm] GART: num cpu pages 262144, num gpu pages 262144 >>> [3.082589] [drm] enabling PCIE gen 2 link speeds, disable with >>> radeon.pcie_gen2=0 >>> [3.085030] [drm] PCIE GART of 1024M enabled (table at >>> 0x00274000). 
>>> [3.085221] radeon :09:00.0: WB enabled >>> [3.085224] radeon :09:00.0: fence driver on ring 0 use gpu >>> addr 0x4c00 and cpu addr 0x88043d914c00 >>> [3.085225] radeon :09:00.0: fence driver on ring 3 use gpu >>> addr 0x4c0c and cpu addr 0x88043d914c0c >>> [3.097438] radeon :09:00.0: fence driver on ring 5 use gpu >>> addr 0x00072118 and cpu addr 0xc900128b2118 >>> [3.097441] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013). >>> [3.097442] [drm] Driver supports precise vblank timestamp query. >>> [3.097514] radeon :09:00.0: irq 56 for MSI/MSI-X >>> [3.097544] radeon :09:00.0: radeon: using MSI. >>> [3.097614] [drm] radeon: irq initialized. >>> >>> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything >>> graphics intensive seems to cause my system to become unusable for >>> tens of seconds. Pointing Firefox at Google Maps is a big offender -- >>> it can take several minutes for me to move my mouse far enough to >>> close the tab and get my computer back. >>> >>> On bootup, I get this warning: >>> [drm:btc_dpm_set_power_state] *ERROR* >>> rv770_restrict_performance_levels_before_switch failed >>> >>> Setting radeon.dpm=0 seems to work around this problem at the cost of >>> giving my rather slow graphics. >>> >>> Are there known issues here? >> >> >> Can you bisect the kernel, or at least isolate which kernel version first >> introduced the problem? > > With whatever userspace I'm running, I'm seeing it 3.13, 3.14, 3.15, > 3.16, and 3.18-rc4+. I haven't tried other versions. > > With radeon.dpm=0, I can still trigger short stalls (around one > second), but I seem unable to trigger long stalls easily. (I say > easily because, just as I was typing this email, my system stalled for > about a minute.) I could be wrong here, but I think that radeon.dpm=0, power_profile=default is okay, but radeon.dpm=0, power_profile=high is bad. 
--Andy

>
> --Andy
>
>>
>>
>> --
>> Earthling Michel Dänzer | http://www.amd.com
>> Libre software enthusiast | Mesa and X developer
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

--
Andy Lutomirski
AMA Capital Management, LLC
Long radeon stalls on recent kernels
On Mon, Nov 17, 2014 at 1:51 AM, Michel Dänzer wrote: > On 15.11.2014 07:21, Andy Lutomirski wrote: >> >> I have a Caicos card, like this: >> >> [3.077260] [drm] radeon kernel modesetting enabled. >> [3.077338] checking generic (e000 60) vs hw (e000 >> 1000) >> [3.077339] fb: switching to radeondrmfb from EFI VGA >> [3.077377] Console: switching to colour dummy device 80x25 >> [3.078881] [drm] initializing kernel modesetting (CAICOS >> 0x1002:0x6779 0x174B:0xE164). >> [3.078903] [drm] register mmio base: 0xF4A2 >> [3.078904] [drm] register mmio size: 131072 >> [3.078982] ATOM BIOS: C26401 >> [3.079572] radeon :09:00.0: VRAM: 1024M 0x - >> 0x3FFF (1024M used) >> [3.079574] radeon :09:00.0: GTT: 1024M 0x4000 - >> 0x7FFF >> [3.079576] [drm] Detected VRAM RAM=1024M, BAR=256M >> [3.079577] [drm] RAM width 64bits DDR >> [3.079755] [TTM] Zone kernel: Available graphics memory: 8186568 kiB >> [3.079757] [TTM] Zone dma32: Available graphics memory: 2097152 kiB >> [3.079757] [TTM] Initializing pool allocator >> [3.079773] [TTM] Initializing DMA pool allocator >> [3.080011] [drm] radeon: 1024M of VRAM memory ready >> [3.080012] [drm] radeon: 1024M of GTT memory ready. >> [3.080049] [drm] Loading CAICOS Microcode >> [3.080330] [drm] Internal thermal controller without fan control >> [3.081425] [drm] radeon: power management initialized >> [3.081551] [drm] GART: num cpu pages 262144, num gpu pages 262144 >> [3.082589] [drm] enabling PCIE gen 2 link speeds, disable with >> radeon.pcie_gen2=0 >> [3.085030] [drm] PCIE GART of 1024M enabled (table at >> 0x00274000). 
>> [3.085221] radeon :09:00.0: WB enabled
>> [3.085224] radeon :09:00.0: fence driver on ring 0 use gpu
>> addr 0x4c00 and cpu addr 0x88043d914c00
>> [3.085225] radeon :09:00.0: fence driver on ring 3 use gpu
>> addr 0x4c0c and cpu addr 0x88043d914c0c
>> [3.097438] radeon :09:00.0: fence driver on ring 5 use gpu
>> addr 0x00072118 and cpu addr 0xc900128b2118
>> [3.097441] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
>> [3.097442] [drm] Driver supports precise vblank timestamp query.
>> [3.097514] radeon :09:00.0: irq 56 for MSI/MSI-X
>> [3.097544] radeon :09:00.0: radeon: using MSI.
>> [3.097614] [drm] radeon: irq initialized.
>>
>> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything
>> graphics intensive seems to cause my system to become unusable for
>> tens of seconds. Pointing Firefox at Google Maps is a big offender --
>> it can take several minutes for me to move my mouse far enough to
>> close the tab and get my computer back.
>>
>> On bootup, I get this warning:
>> [drm:btc_dpm_set_power_state] *ERROR*
>> rv770_restrict_performance_levels_before_switch failed
>>
>> Setting radeon.dpm=0 seems to work around this problem at the cost of
>> giving me rather slow graphics.
>>
>> Are there known issues here?
>
> Can you bisect the kernel, or at least isolate which kernel version first
> introduced the problem?

With whatever userspace I'm running, I'm seeing it on 3.13, 3.14, 3.15,
3.16, and 3.18-rc4+. I haven't tried other versions.

With radeon.dpm=0, I can still trigger short stalls (around one
second), but I seem unable to trigger long stalls easily. (I say
easily because, just as I was typing this email, my system stalled for
about a minute.)

--Andy

> --
> Earthling Michel Dänzer | http://www.amd.com
> Libre software enthusiast | Mesa and X developer

--
Andy Lutomirski
AMA Capital Management, LLC
Long radeon stalls on recent kernels
I have a Caicos card, like this: [3.077260] [drm] radeon kernel modesetting enabled. [3.077338] checking generic (e000 60) vs hw (e000 1000) [3.077339] fb: switching to radeondrmfb from EFI VGA [3.077377] Console: switching to colour dummy device 80x25 [3.078881] [drm] initializing kernel modesetting (CAICOS 0x1002:0x6779 0x174B:0xE164). [3.078903] [drm] register mmio base: 0xF4A2 [3.078904] [drm] register mmio size: 131072 [3.078982] ATOM BIOS: C26401 [3.079572] radeon :09:00.0: VRAM: 1024M 0x - 0x3FFF (1024M used) [3.079574] radeon :09:00.0: GTT: 1024M 0x4000 - 0x7FFF [3.079576] [drm] Detected VRAM RAM=1024M, BAR=256M [3.079577] [drm] RAM width 64bits DDR [3.079755] [TTM] Zone kernel: Available graphics memory: 8186568 kiB [3.079757] [TTM] Zone dma32: Available graphics memory: 2097152 kiB [3.079757] [TTM] Initializing pool allocator [3.079773] [TTM] Initializing DMA pool allocator [3.080011] [drm] radeon: 1024M of VRAM memory ready [3.080012] [drm] radeon: 1024M of GTT memory ready. [3.080049] [drm] Loading CAICOS Microcode [3.080330] [drm] Internal thermal controller without fan control [3.081425] [drm] radeon: power management initialized [3.081551] [drm] GART: num cpu pages 262144, num gpu pages 262144 [3.082589] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0 [3.085030] [drm] PCIE GART of 1024M enabled (table at 0x00274000). [3.085221] radeon :09:00.0: WB enabled [3.085224] radeon :09:00.0: fence driver on ring 0 use gpu addr 0x4c00 and cpu addr 0x88043d914c00 [3.085225] radeon :09:00.0: fence driver on ring 3 use gpu addr 0x4c0c and cpu addr 0x88043d914c0c [3.097438] radeon :09:00.0: fence driver on ring 5 use gpu addr 0x00072118 and cpu addr 0xc900128b2118 [3.097441] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013). [3.097442] [drm] Driver supports precise vblank timestamp query. [3.097514] radeon :09:00.0: irq 56 for MSI/MSI-X [3.097544] radeon :09:00.0: radeon: using MSI. [3.097614] [drm] radeon: irq initialized. 
On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything graphics-intensive seems to cause my system to become unusable for tens of seconds. Pointing Firefox at Google Maps is a big offender -- it can take several minutes for me to move my mouse far enough to close the tab and get my computer back.

On bootup, I get this warning:

[drm:btc_dpm_set_power_state] *ERROR* rv770_restrict_performance_levels_before_switch failed

Setting radeon.dpm=0 seems to work around this problem at the cost of giving me rather slow graphics.

Are there known issues here?

Thanks,
Andy
[PATCH 1/6] x86: Add support for the pcommit instruction
On Fri, Nov 14, 2014 at 1:07 PM, Ross Zwisler wrote: > On Wed, 2014-11-12 at 19:25 -0800, Andy Lutomirski wrote: >> On 11/11/2014 10:43 AM, Ross Zwisler wrote: >> > Add support for the new pcommit instruction. This instruction was >> > announced in the document "Intel Architecture Instruction Set Extensions >> > Programming Reference" with reference number 319433-022. >> > >> > https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf >> > >> > Signed-off-by: Ross Zwisler >> > Cc: H Peter Anvin >> > Cc: Ingo Molnar >> > Cc: Thomas Gleixner >> > Cc: David Airlie >> > Cc: dri-devel at lists.freedesktop.org >> > Cc: x86 at kernel.org >> > --- >> > arch/x86/include/asm/cpufeature.h| 1 + >> > arch/x86/include/asm/special_insns.h | 6 ++ >> > 2 files changed, 7 insertions(+) >> > >> > diff --git a/arch/x86/include/asm/cpufeature.h >> > b/arch/x86/include/asm/cpufeature.h >> > index 0bb1335..b3e6b89 100644 >> > --- a/arch/x86/include/asm/cpufeature.h >> > +++ b/arch/x86/include/asm/cpufeature.h >> > @@ -225,6 +225,7 @@ >> > #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */ >> > #define X86_FEATURE_ADX( 9*32+19) /* The ADCX and ADOX >> > instructions */ >> > #define X86_FEATURE_SMAP ( 9*32+20) /* Supervisor Mode Access >> > Prevention */ >> > +#define X86_FEATURE_PCOMMIT( 9*32+22) /* PCOMMIT instruction */ >> > #define X86_FEATURE_CLFLUSHOPT ( 9*32+23) /* CLFLUSHOPT instruction */ >> > #define X86_FEATURE_AVX512PF ( 9*32+26) /* AVX-512 Prefetch */ >> > #define X86_FEATURE_AVX512ER ( 9*32+27) /* AVX-512 Exponential and >> > Reciprocal */ >> > diff --git a/arch/x86/include/asm/special_insns.h >> > b/arch/x86/include/asm/special_insns.h >> > index e820c08..1709a2e 100644 >> > --- a/arch/x86/include/asm/special_insns.h >> > +++ b/arch/x86/include/asm/special_insns.h >> > @@ -199,6 +199,12 @@ static inline void clflushopt(volatile void *__p) >> >"+m" (*(volatile char __force *)__p)); >> > } >> > >> > +static inline void pcommit(void) >> > +{ >> 
> + alternative(ASM_NOP4, ".byte 0x66, 0x0f, 0xae, 0xf8", >> > + X86_FEATURE_PCOMMIT); >> > +} >> > + >> >> Should this patch add the feature bit and cpuinfo entry to go with it? >> >> --Andy > > I think this patch does everything we need? The text for cpuinfo is > auto-generated in arch/x86/kernel/cpu/capflags.c from the flags defined > in arch/x86/include/asm/cpufeature.h, I think. Here's what I get in > cpuinfo on my system with a faked-out CPUID saying that clwb and pcommit > are present: > > $ grep 'flags' /proc/cpuinfo > flags : fpu erms pcommit clflushopt clwb xsaveopt > > The X86_FEATURE_CLWB and X86_FEATURE_PCOMMIT flags are being set up > according to what's in CPUID, and the proper alternatives are being > triggered. I stuck some debug code in the alternatives code to see what > was being patched in the presence and absence of each of the flags. > > Is there something else I'm missing? No. I just missed the magical auto-generation part. --Andy > > Thanks, > - Ross > -- Andy Lutomirski AMA Capital Management, LLC
[PATCH 6/6] x86: Use clwb in drm_clflush_virt_range
On Nov 13, 2014 3:20 AM, "Borislav Petkov" wrote:
>
> On Wed, Nov 12, 2014 at 07:14:21PM -0800, Andy Lutomirski wrote:
> > On 11/11/2014 10:43 AM, Ross Zwisler wrote:
> > > If clwb is available on the system, use it in drm_clflush_virt_range.
> > > If clwb is not available, fall back to clflushopt if you can.
> > > If clflushopt is not supported, fall all the way back to clflush.
> >
> > I don't know exactly what drm_clflush_virt_range (and the other
> > functions you're modifying similarly) are for, but it seems plausible to
> > me that they're used before reads to make sure that non-coherent memory
> > sees updated data. If that's true, then this will break it.
>
> Why would it break it? The updated cachelines will be in memory and
> subsequent reads will be serviced from the cache instead of going to
> memory as it is not invalidated as it would be by CLFLUSH.
>
> /me is puzzled.

Suppose you map some device memory WB, and then the device updates it
non-coherently. If you want the CPU to see the update, you need clflush
or clflushopt. Some architectures might do this for
dma_sync_single_for_cpu with DMA_FROM_DEVICE.

I'm not sure that such a thing exists on x86.

--Andy
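The stale-read concern in the reply above can be illustrated with a toy model. This is not kernel code and makes no claim about any particular CPU: the `CacheLine` class and its method names are invented for illustration, and it assumes one write-back cache line in front of one word of memory, with `clwb` modeled as "write back, keep the line valid" and `clflush` as "write back and invalidate".

```python
class CacheLine:
    """Toy model: one write-back cache line in front of one word of memory."""

    def __init__(self, value):
        self.memory = value   # what the device / backing memory holds
        self.cached = None    # the CPU's cached copy; None means invalid
        self.dirty = False

    def cpu_read(self):
        if self.cached is None:       # cache miss: fill from memory
            self.cached = self.memory
        return self.cached

    def cpu_write(self, value):
        self.cached = value
        self.dirty = True

    def _write_back(self):
        if self.dirty:
            self.memory = self.cached
            self.dirty = False

    def clwb(self):
        self._write_back()            # modeled as: line stays valid

    def clflush(self):
        self._write_back()
        self.cached = None            # line is invalidated

    def device_write(self, value):
        self.memory = value           # non-coherent: the cache is not snooped


def read_after_device_update(flush_op):
    """CPU caches the line, flushes with flush_op, device updates, CPU reads."""
    line = CacheLine(value=1)
    line.cpu_read()
    getattr(line, flush_op)()
    line.device_write(2)
    return line.cpu_read()


def device_sees_cpu_write(flush_op):
    """CPU writes, flushes with flush_op; return what the device would see."""
    line = CacheLine(value=1)
    line.cpu_write(5)
    getattr(line, flush_op)()
    return line.memory


for op in ("clflush", "clwb"):
    print(op, read_after_device_update(op), device_sees_cpu_write(op))
```

In this model both operations make a CPU write visible to the device, but only the invalidating flush lets the CPU observe a later non-coherent device update. Whether any caller of drm_clflush_virt_range actually relies on the invalidation is the open question in the thread.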
[PATCH 1/6] x86: Add support for the pcommit instruction
On 11/11/2014 10:43 AM, Ross Zwisler wrote: > Add support for the new pcommit instruction. This instruction was > announced in the document "Intel Architecture Instruction Set Extensions > Programming Reference" with reference number 319433-022. > > https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf > > Signed-off-by: Ross Zwisler > Cc: H Peter Anvin > Cc: Ingo Molnar > Cc: Thomas Gleixner > Cc: David Airlie > Cc: dri-devel at lists.freedesktop.org > Cc: x86 at kernel.org > --- > arch/x86/include/asm/cpufeature.h| 1 + > arch/x86/include/asm/special_insns.h | 6 ++ > 2 files changed, 7 insertions(+) > > diff --git a/arch/x86/include/asm/cpufeature.h > b/arch/x86/include/asm/cpufeature.h > index 0bb1335..b3e6b89 100644 > --- a/arch/x86/include/asm/cpufeature.h > +++ b/arch/x86/include/asm/cpufeature.h > @@ -225,6 +225,7 @@ > #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */ > #define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX > instructions */ > #define X86_FEATURE_SMAP ( 9*32+20) /* Supervisor Mode Access Prevention > */ > +#define X86_FEATURE_PCOMMIT ( 9*32+22) /* PCOMMIT instruction */ > #define X86_FEATURE_CLFLUSHOPT ( 9*32+23) /* CLFLUSHOPT instruction */ > #define X86_FEATURE_AVX512PF ( 9*32+26) /* AVX-512 Prefetch */ > #define X86_FEATURE_AVX512ER ( 9*32+27) /* AVX-512 Exponential and > Reciprocal */ > diff --git a/arch/x86/include/asm/special_insns.h > b/arch/x86/include/asm/special_insns.h > index e820c08..1709a2e 100644 > --- a/arch/x86/include/asm/special_insns.h > +++ b/arch/x86/include/asm/special_insns.h > @@ -199,6 +199,12 @@ static inline void clflushopt(volatile void *__p) > "+m" (*(volatile char __force *)__p)); > } > > +static inline void pcommit(void) > +{ > + alternative(ASM_NOP4, ".byte 0x66, 0x0f, 0xae, 0xf8", > + X86_FEATURE_PCOMMIT); > +} > + Should this patch add the feature bit and cpuinfo entry to go with it? --Andy
[PATCH 6/6] x86: Use clwb in drm_clflush_virt_range
On 11/11/2014 10:43 AM, Ross Zwisler wrote:
> If clwb is available on the system, use it in drm_clflush_virt_range.
> If clwb is not available, fall back to clflushopt if you can.
> If clflushopt is not supported, fall all the way back to clflush.

I don't know exactly what drm_clflush_virt_range (and the other
functions you're modifying similarly) are for, but it seems plausible to
me that they're used before reads to make sure that non-coherent memory
sees updated data. If that's true, then this will break it.

But maybe all the users write to coherent memory and just need to
ensure that whatever's backing the memory knows about the write.

FWIW, it may make sense to rename this function to drm_clwb_virt_range
if you make this change.

--Andy

>
> Signed-off-by: Ross Zwisler
> Cc: H Peter Anvin
> Cc: Ingo Molnar
> Cc: Thomas Gleixner
> Cc: David Airlie
> Cc: dri-devel at lists.freedesktop.org
> Cc: x86 at kernel.org
> ---
>  drivers/gpu/drm/drm_cache.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
> index aad9d82..84e9a04 100644
> --- a/drivers/gpu/drm/drm_cache.c
> +++ b/drivers/gpu/drm/drm_cache.c
> @@ -138,8 +138,8 @@ drm_clflush_virt_range(void *addr, unsigned long length)
>  		void *end = addr + length;
>  		mb();
>  		for (; addr < end; addr += boot_cpu_data.x86_clflush_size)
> -			clflushopt(addr);
> -		clflushopt(end - 1);
> +			clwb(addr);
> +		clwb(end - 1);
>  		mb();
>  		return;
>  	}
>
3.14 radeon regression: radeon is broken (pci bug?)
On Tue, Sep 16, 2014 at 9:45 AM, Bjorn Helgaas wrote: > On Thu, Mar 27, 2014 at 11:30:37AM -0600, Bjorn Helgaas wrote: >> On Mon, Mar 24, 2014 at 4:04 PM, Bjorn Helgaas >> wrote: >> > On Sat, Mar 22, 2014 at 9:18 AM, Andy Lutomirski >> > wrote: >> >> On Fri, Mar 21, 2014 at 9:37 AM, Bjorn Helgaas >> >> wrote: >> >>> On Fri, Mar 21, 2014 at 9:49 AM, Andy Lutomirski > >>> amacapital.net> wrote: >> >>>> On Fri, Mar 21, 2014 at 7:41 AM, Alex Deucher > >>>> gmail.com> wrote: >> >>>>> On Thu, Mar 20, 2014 at 10:17 PM, Andy Lutomirski > >>>>> amacapital.net> wrote: >> >>>>>> My system works on a 3.13 Fedora kernel. It does not work on a >> >>>>>> more-or-less identically configured 3.14-rc7+ kernel. The symptom is >> >>>>>> that the Plymouth password prompt flashes and them the screen goes >> >>>>>> blank. Hitting escape brings back the text console, and all is well >> >>>>>> until X tries to start. Then I get a blank screen. killall -9 Xorg >> >>>>>> from ssh causes these errors to be logged: >> >>>>>> >> >>>>>> >> >>>>>> [ 226.239747] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >> >>>>>> more than 5secs aborting >> >>>>>> [ 226.239751] [drm:atom_execute_table_locked] *ERROR* atombios stuck >> >>>>>> executing CD34 (len 55, WS 0, PS 0) @ 0xCD57 >> >>>>>> [ 231.241492] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >> >>>>>> more than 5secs aborting >> >>>>>> [ 231.241496] [drm:atom_execute_table_locked] *ERROR* atombios stuck >> >>>>>> executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 >> >>>>>> [ 236.243111] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >> >>>>>> more than 5secs aborting >> >>>>>> [ 236.243115] [drm:atom_execute_table_locked] *ERROR* atombios stuck >> >>>>>> executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 >> >>>>>> [ 241.244625] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >> >>>>>> more than 5secs aborting >> >>>>>> [ 241.244628] [drm:atom_execute_table_locked] *ERROR* atombios stuck >> >>>>>> executing CD6C (len 62, WS 0, PS 
0) @ 0xCD88 >> >>>>>> >> >>>>>> >> >>>>>> lspci -vvvxxxnn on 3.14-rc7+ says: >> >>>>>> >> >>>>>> 09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. >> >>>>>> [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM] [1002:6779] >> >>>>>> (rev ff) (prog-if ff) >> >>>>>> !!! Unknown header type 7f >> >>>>>> Kernel driver in use: radeon >> >>>>>> 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >>>>>> 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >>>>>> 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >>>>>> 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >>>>>> >> >>>>>> 09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] >> >>>>>> Caicos HDMI Audio [Radeon HD 6400 Series] [1002:aa98] (rev ff) >> >>>>>> (prog-if ff) >> >>>>>> !!! Unknown header type 7f >> >>>>>> Kernel driver in use: snd_hda_intel >> >>>>>> 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >>>>>> 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >>>>>> 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >>>>>> 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >>>>>> >> >>>>>> (oops!) >> >>>>>> >> >>>>>> On 3.13, it says: >> >>>>>> >> >>>>>> 09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. >> >>>>>> [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM] [1002:6779] >> >>>>>> (prog-if 00 [VGA controller]) >> >>>>>> Subsystem: PC Partner Limited / Sapphire Te
[PATCH 0/6] File Sealing & memfd_create()
On Jun 17, 2014 2:48 AM, "Florian Weimer" wrote:
>
> On 04/10/2014 10:37 PM, Andy Lutomirski wrote:
>
>> It occurs to me that, before going nuts with these kinds of flags, it
>> may pay to just try to fix the /proc/self/fd issue for real -- we
>> could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is
>> read-only. That may be enough for the file sealing thing.
>
> Increasing privilege on O_PATH descriptors via access through
> /proc/self/fd is part of the userspace API. The same thing might be
> true for O_RDONLY descriptors, but it's a bit less likely that there
> are any users out there. In any case, I'm not sure it makes sense to
> plug the O_RDONLY hole while leaving the O_PATH hole open.

Do you mean O_PATH fds for the directory or O_PATH fds for the file
itself?

In any event, I'm much less concerned about passing O_PATH memfds
around than O_RDONLY memfds.

I have incomplete patches for this stuff. I need to fix them so they
work and get past Al Viro.

--Andy
[PATCH 2/6] shm: add sealing API
On Fri, Apr 11, 2014 at 2:42 PM, David Herrmann wrote: > Hi > > On Fri, Apr 11, 2014 at 11:36 PM, Andy Lutomirski > wrote: >> A quick grep of the kernel tree finds exactly zero code paths >> incrementing i_mmap_writable outside of mmap and fork. >> >> Or do you mean a different kind of write ref? What am I missing here? > > Sorry, I meant i_writecount. I bet this is missing from lots of places. For example, I can't find any write_access stuff in the rdma code. I suspect that the VM_DENYWRITE code is just generally racy. --Andy
[PATCH 2/6] shm: add sealing API
On 04/11/2014 02:31 PM, David Herrmann wrote: > Hi > > On Fri, Apr 11, 2014 at 3:43 PM, Tony Battersby > wrote: >> Exactly. For O_DIRECT, that would be the call to get_user_pages_fast() >> from dio_refill_pages() in fs/direct-io.c, which is ultimately called >> from blkdev_direct_IO(). > > If you drop mmap_sem after pinning a page without taking a write-ref, > you break i_mmap_writable / VM_DENYWRITE. In memfd I rely on > i_mmap_writable to work, same thing is done by exec() (and the old, > now disabled, MAP_DENYWRITE). > > I don't know whether I should care. I mean, everyone pinning pages and > writing to it without holding the mmap_sem has to take a write-ref for > each page or it breaks i_mmap_writable. So this seems to be a bug in > direct-IO, not in anyone relying on it, right? A quick grep of the kernel tree finds exactly zero code paths incrementing i_mmap_writable outside of mmap and fork. Or do you mean a different kind of write ref? What am I missing here? --Andy
[PATCH 2/6] shm: add sealing API
On 04/10/2014 05:22 PM, David Herrmann wrote: > Hi > > On Thu, Apr 10, 2014 at 11:33 PM, Tony Battersby > wrote: >> For O_DIRECT the kernel pins the submitted pages in memory for DMA by >> incrementing the page reference counts when the I/O is submitted, >> allowing the pages to be modified by DMA even if they are no longer >> mapped in the address space of the process. This is different from a >> regular read(), which uses the CPU to copy the data and will fail if the >> pages are not mapped. > > Can you please provide an example code-path? For instance, > file_read_actor() does not pin any pages but only keeps the user-space > address and resolves it once it has data to write. This may be an issue for anything in the kernel that calls get_user_pages and holds onto the result at any time that mmap_sem isn't held. I don't know exactly what does that, but RDMA comes to mind. So does (ugh!) vmsplice, although I suspect that vmsplice doesn't write. --Andy
[PATCH 0/6] File Sealing & memfd_create()
On Thu, Apr 10, 2014 at 4:16 PM, David Herrmann wrote:
> Hi
>
> On Fri, Apr 11, 2014 at 1:05 AM, Andy Lutomirski wrote:
>> /proc/pid/fd is a really weird corner case in which the mode of an
>> inode that doesn't have a name matters. I suspect that almost no one
>> will ever want to open one of these things out of /proc/self/fd, and
>> those who do should be made to think about it.
>
> I'm arguing in the context of memfd, and there's no security leak if
> people get access to the underlying inode (at least I'm not aware of
> any).

I'm not sure what you mean.

> As I said, context information is attached to the inode, not
> file context, so I'm fine if people want to open multiple file
> contexts via /proc. If someone wants to forbid open(), I want to hear
> _why_. I assume the memfd object has uid==uid-of-creator and
> mode==(777 & ~umask) (which usually results in X00, so no access for
> non-owners). I cannot see how /proc is a security issue here.

On further reflection, my argument for 000 is crap. As far as I can
see, the only time that the mode matters at all is when playing with
/proc/pid/fd, and the only way to get a non-O_RDWR memfd is using
/proc/pid/fd, so I'll argue for 0600 instead.

Argument why 0600 is better than 0600 & ~umask: either callers don't
care because the inode mode simply doesn't matter or they're using
/proc/pid/fd to *reduce* permissions, in which case they'd probably
like to avoid having to play with umask or call fchmod.

Argument why 0600 is better than 0777 & ~umask: people using
/proc/pid/fd are the only ones who care, in which case they probably
prefer for the permissions not to be increased by other users if they
give them a reduced-permission fd.

Anyway, this is all mostly unimportant. Some text in the man page is
probably sufficient, but I still think that 0600 is trivial to
implement and a little bit more friendly.

--Andy

>
> Thanks
> David

--
Andy Lutomirski
AMA Capital Management, LLC
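The three candidate modes discussed above differ only in how they combine with the caller's umask. The arithmetic can be sketched as follows; `memfd_mode` is a hypothetical helper written for this illustration and does not correspond to any kernel function.

```python
POLICIES = ("0777 & ~umask", "0600 & ~umask", "0600")


def memfd_mode(policy, umask):
    """Inode mode that each proposal from the thread would yield."""
    if policy == "0777 & ~umask":
        return 0o777 & ~umask
    if policy == "0600 & ~umask":
        return 0o600 & ~umask
    if policy == "0600":
        return 0o600          # fixed; the umask is ignored
    raise ValueError(policy)


for umask in (0o022, 0o077, 0o000):
    row = ", ".join(f"{p} -> {memfd_mode(p, umask):04o}" for p in POLICIES)
    print(f"umask {umask:04o}: {row}")
```

The umask 000 row is the case raised earlier in the thread, where something that was private becomes accessible to other users only because of the environment it was started in. A fixed 0600 does not vary with the umask.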
[PATCH 0/6] File Sealing & memfd_create()
On Thu, Apr 10, 2014 at 3:57 PM, David Herrmann wrote: > Hi > > On Thu, Apr 10, 2014 at 11:16 PM, Andy Lutomirski > wrote: >> Would it make sense for the initial mode on a memfd inode to be 000? >> Anyone who finds this to be problematic could use fchmod to fix it. > > memfd_create() should be subject to umask() just like anything else. > That should solve any possible race here, right? Yes, but how many people will actually think about umask when doing things that don't really look like creating files? /proc/pid/fd is a really weird corner case in which the mode of an inode that doesn't have a name matters. I suspect that almost no one will ever want to open one of these things out of /proc/self/fd, and those who do should be made to think about it. It also avoids odd screwups where things are secure until someone runs them with umask 000. --Andy
[PATCH 0/6] File Sealing & memfd_create()
On Thu, Apr 10, 2014 at 1:49 PM, David Herrmann wrote: > Hi > > On Thu, Apr 10, 2014 at 10:37 PM, Andy Lutomirski > wrote: >> It occurs to me that, before going nuts with these kinds of flags, it >> may pay to just try to fix the /proc/self/fd issue for real -- we >> could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is >> read-only. That may be enough for the file sealing thing. > > For the sealing API, none of this is needed. As long as the inode is > owned by the uid who creates the memfd, you can pass it around and > no-one besides root and you can open /proc/self/fd/$fd (assuming chmod > 700). If you share the fd with someone with the same uid as you, > you're screwed anyway. We don't protect users against themselves (I > mean, they can ptrace you, or kill()..). Therefore, I'm not really > convinced that we want this for memfd. At least no-one has provided a > _proper_ use-case for this so far. Hmm. Fair enough. Would it make sense for the initial mode on a memfd inode to be 000? Anyone who finds this to be problematic could use fchmod to fix it. I might even go so far as to suggest that the default uid on the inode should be 0 (i.e. global root), since there is the odd corner case of root setting euid != 0, creating a memfd, and setting euid back to 0. The latter might cause resource accounting issues, though. --Andy
[PATCH 0/6] File Sealing & memfd_create()
On Thu, Apr 10, 2014 at 1:32 PM, Theodore Ts'o wrote:
> On Thu, Apr 10, 2014 at 12:14:27PM -0700, Andy Lutomirski wrote:
>>
>> This is the second time in a week that someone has asked for a way to
>> have a struct file (or struct inode or whatever) that can't be reopened
>> through /proc/pid/fd. This should be quite easy to implement as a
>> separate feature.
>
> What I suggested on a different thread was to add the following new
> file descriptor flags, to join FD_CLOEXEC, which would be manipulated
> using the F_GETFD and F_SETFD fcntl commands:
>
> FD_NOPROCFS   disallow being able to open the inode via /proc/pid/fd
>
> FD_NOPASSFD   disallow being able to pass the fd via a unix domain socket
>
> FD_LOCKFLAGS  if this bit is set, disallow any further changes of
>               FD_CLOEXEC, FD_NOPROCFS, FD_NOPASSFD, and FD_LOCKFLAGS flags.
>
> Regardless of what else we might need to meet the use case for the
> proposed File Sealing API, I think this is a useful feature that could
> be used in many other contexts besides just the proposed
> memfd_create() use case.

It occurs to me that, before going nuts with these kinds of flags, it
may pay to just try to fix the /proc/self/fd issue for real -- we
could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is
read-only. That may be enough for the file sealing thing.

--Andy
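For readers who have not run into the /proc/self/fd issue mentioned above, here is a small userspace sketch of the behavior being discussed. It assumes a Linux system with /proc mounted and a writable temporary directory; the function name is made up for this example. It opens a file read-only and then asks for a read-write descriptor to the same inode through /proc/self/fd.

```python
import os
import tempfile


def reopen_read_only_fd_for_writing():
    """Return (direct_write_ok, reopen_rdwr_ok) for an O_RDONLY descriptor."""
    fd, path = tempfile.mkstemp()   # mkstemp creates the file with mode 0600
    os.close(fd)
    fd_ro = os.open(path, os.O_RDONLY)
    try:
        # Writing through the read-only descriptor itself is refused.
        try:
            os.write(fd_ro, b"x")
            direct_write_ok = True
        except OSError:
            direct_write_ok = False

        # Reopening via /proc/self/fd is checked against the inode's
        # permissions, not against the flags of the existing descriptor.
        try:
            fd_rw = os.open(f"/proc/self/fd/{fd_ro}", os.O_RDWR)
        except OSError:
            return direct_write_ok, False
        try:
            os.write(fd_rw, b"x")
        finally:
            os.close(fd_rw)
        return direct_write_ok, True
    finally:
        os.close(fd_ro)
        os.unlink(path)


print(reopen_read_only_fd_for_writing())
```

On kernels without the change proposed in the reply, the reopen succeeds whenever the inode's mode and ownership permit it, which is why the inode mode of a memfd matters at all in this thread.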
[PATCH 0/6] File Sealing & memfd_create()
On 04/08/2014 06:00 AM, Florian Weimer wrote: > On 03/19/2014 08:06 PM, David Herrmann wrote: > >> Unlike existing techniques that provide similar protection, sealing >> allows >> file-sharing without any trust-relationship. This is enforced by >> rejecting seal >> modifications if you don't own an exclusive reference to the given >> file. So if >> you own a file-descriptor, you can be sure that no-one besides you can >> modify >> the seals on the given file. This allows mapping shared files from >> untrusted >> parties without the fear of the file getting truncated or modified by an >> attacker. > > How do you keep these promises on network and FUSE file systems? Surely > there is still some trust involved for such descriptors? > > What happens if you create a loop device on a sealed descriptor? > > Why does memfd_create not create a file backed by a memory region in the > current process? Wouldn't this be a far more generic primitive? > Creating aliases of memory regions would be interesting for many things > (not just libffi bypassing SELinux-enforced NX restrictions :-). If you write a patch to prevent selinux from enforcing NX, I will ack that patch with all my might. I don't know how far it would get me, but I think that selinux has no business going anywhere near execmem. Adding a clone mode to mremap might be a better bet. But memfd solves that problem, too, albeit messily. --Andy
[PATCH 0/6] File Sealing & memfd_create()
On 04/10/2014 07:45 AM, Colin Walters wrote: > On Thu, Mar 20, 2014 at 11:32 AM, tytso at mit.edu wrote: >> >> Looking at your patches, and what files you are modifying, you are >> enforcing this in the low-level file system. > > I would love for this to be implemented in the filesystem level as > well. Something like the ext4 immutable bit, but with the ability to > still make hardlinks would be *very* useful for OSTree. And anyone else > that uses hardlinks as a data source. The vserver people do something > similiar: > http://linux-vserver.org/util-vserver:Vhashify > > At the moment I have a read-only bind mount over /usr, but what I really > want is to make the individual objects in the object store in > /ostree/repo/objects be immutable, so even if a user or app navigates > out to /sysroot they still can't mutate them (or the link targets in the > visible /usr). COW links can do this already, I think. Of course, you'll have to use a filesystem that supports them. --Andy
[PATCH 0/6] File Sealing & memfd_create()
On 03/20/2014 09:38 AM, tytso at mit.edu wrote:
> On Thu, Mar 20, 2014 at 04:48:30PM +0100, David Herrmann wrote:
>> On Thu, Mar 20, 2014 at 4:32 PM, wrote:
>>> Why not make sealing an attribute of the "struct file", and enforce it
>>> at the VFS layer? That way all file system objects would have access
>>> to sealing interface, and for memfd_shmem, you can't get another
>>> struct file pointing at the object, the security properties would be
>>> identical.
>>
>> Sealing as introduced here is an inode-attribute, not "struct file".
>> This is intentional. For instance, a gfx-client can get a read-only FD
>> via /proc/self/fd/ and pass it to the compositor so it can never
>> overwrite the contents (unless the compositor has write-access to the
>> inode itself, in which case it can just re-open it read-write).
>
> Hmm, good point. I had forgotten about the /proc/self/fd hole.
> Hmm... what if we have a SEAL_PROC which forces the permissions of
> /proc/self/fd to be 000?

This is the second time in a week that someone has asked for a way to
have a struct file (or struct inode or whatever) that can't be reopened
through /proc/pid/fd. This should be quite easy to implement as a
separate feature.

Actually, that feature would solve a major pet peeve of mine, I think:
I want something like memfd that allows me to keep the thing read-write
but that whoever I pass the fd to can't change. With this feature, I
could do:

    fd_rw = memfd_create (or O_TMPFILE or whatever)
    fd_ro = open(/proc/self/fd/fd_rw, O_RDONLY);
    fcntl(fd_ro, F_RESTRICT, F_RESTRICT_REOPEN);
    send fd_ro via SCM_RIGHTS.

To really make this work well, I also want to SEAL_SHRINK the inode so
that the receiver can verify that I'm not going to truncate the file
out from under it.

Bingo, fast and secure one-way IPC.

--Andy
[PATCH 3/6] shm: add memfd_create() syscall
On 04/02/2014 06:38 AM, Konstantin Khlebnikov wrote: > On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann > wrote: >> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor >> that you can pass to mmap(). It explicitly allows sealing and >> avoids any connection to user-visible mount-points. Thus, it's not >> subject to quotas on mounted file-systems, but can be used like >> malloc()'ed memory, but with a file-descriptor to it. >> >> memfd_create() does not create a front-FD, but instead returns the raw >> shmem file, so calls like ftruncate() can be used. Also calls like fstat() >> will return proper information and mark the file as regular file. Sealing >> is explicitly supported on memfds. >> >> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not >> subject to quotas and alike. > > Instead of adding new syscall we can extend existing openat() a little > bit more: > > openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666) Please don't. O_TMPFILE is a messy enough API, and the last thing we need to do is to extend it. If we want a fancy API for creating new inodes with no corresponding dentry, let's create one. Otherwise, let's just stick with a special-purpose API for these shm files. --Andy
Framebuffer corruption in QEMU or Linux's cirrus driver
On Tue, Apr 1, 2014 at 3:09 PM, Andy Lutomirski wrote: > Running: > > ./virtme-run --installed-kernel > > from this virtme commit: > > https://git.kernel.org/cgit/utils/kernel/virtme/virtme.git/commit/?id=2b409a086d15b7a878c7d5204b1f44a6564a341f > > results in a bunch of missing lines of text once bootup finishes. > Pressing enter a few times gradually fixes it. > > I don't know whether this is a qemu bug or a Linux bug. > > I'm seeing this on Fedora's 3.13.7 kernel and on a fairly recent > 3.14-rc kernel. For the latter, cirrus is built-in (not a module), > I'm running: > > virtme-run --kimg arch/x86/boot/bzImage > > and I see more profound corruption. I'm guessing this is a cirrus drm bug. bochs-drm (using virtme-run --installed-kernel --qemu-opts -vga std) does not appear to have the same issue. Neither does qxl. (qxl is painfully slow, though, and it doesn't seem to be using UC memory.) --Andy
3.14 radeon regression: radeon is broken (pci bug?)
On Fri, Mar 21, 2014 at 9:37 AM, Bjorn Helgaas wrote: > On Fri, Mar 21, 2014 at 9:49 AM, Andy Lutomirski > wrote: >> On Fri, Mar 21, 2014 at 7:41 AM, Alex Deucher >> wrote: >>> On Thu, Mar 20, 2014 at 10:17 PM, Andy Lutomirski >>> wrote: >>>> My system works on a 3.13 Fedora kernel. It does not work on a >>>> more-or-less identically configured 3.14-rc7+ kernel. The symptom is >>>> that the Plymouth password prompt flashes and them the screen goes >>>> blank. Hitting escape brings back the text console, and all is well >>>> until X tries to start. Then I get a blank screen. killall -9 Xorg >>>> from ssh causes these errors to be logged: >>>> >>>> >>>> [ 226.239747] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >>>> more than 5secs aborting >>>> [ 226.239751] [drm:atom_execute_table_locked] *ERROR* atombios stuck >>>> executing CD34 (len 55, WS 0, PS 0) @ 0xCD57 >>>> [ 231.241492] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >>>> more than 5secs aborting >>>> [ 231.241496] [drm:atom_execute_table_locked] *ERROR* atombios stuck >>>> executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 >>>> [ 236.243111] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >>>> more than 5secs aborting >>>> [ 236.243115] [drm:atom_execute_table_locked] *ERROR* atombios stuck >>>> executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 >>>> [ 241.244625] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >>>> more than 5secs aborting >>>> [ 241.244628] [drm:atom_execute_table_locked] *ERROR* atombios stuck >>>> executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 >>>> >>>> >>>> lspci -vvvxxxnn on 3.14-rc7+ says: >>>> >>>> 09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. >>>> [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM] [1002:6779] >>>> (rev ff) (prog-if ff) >>>> !!! 
Unknown header type 7f >>>> Kernel driver in use: radeon >>>> 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >>>> 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >>>> 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >>>> 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >>>> >>>> 09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] >>>> Caicos HDMI Audio [Radeon HD 6400 Series] [1002:aa98] (rev ff) >>>> (prog-if ff) >>>> !!! Unknown header type 7f >>>> Kernel driver in use: snd_hda_intel >>>> 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >>>> 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >>>> 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >>>> 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >>>> >>>> (oops!) >>>> >>>> On 3.13, it says: >>>> >>>> 09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. >>>> [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM] [1002:6779] >>>> (prog-if 00 [VGA controller]) >>>> Subsystem: PC Partner Limited / Sapphire Technology Radeon HD >>>> 6450 1 GB DDR3 [174b:e164] >>>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- >>>> ParErr- Stepping- SERR- FastB2B- DisINTx+ >>>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >>>> SERR- >>> Latency: 0, Cache Line Size: 64 bytes >>>> Interrupt: pin A routed to IRQ 92 >>>> Region 0: Memory at e000 (64-bit, prefetchable) [size=256M] >>>> Region 2: Memory at f4a2 (64-bit, non-prefetchable) [size=128K] >>>> Region 4: I/O ports at c000 [size=256] >>>> Expansion ROM at f4a0 [disabled] [size=128K] >>>> Capabilities: >>>> Kernel driver in use: radeon >>>> 00: 02 10 79 67 07 04 10 00 00 00 00 03 10 00 80 00 >>>> 10: 0c 00 00 e0 00 00 00 00 04 00 a2 f4 00 00 00 00 >>>> 20: 01 c0 00 00 00 00 00 00 00 00 00 00 4b 17 64 e1 >>>> 30: 00 00 a0 f4 50 00 00 00 00 00 00 00 0a 01 00 00 >>>> >>>> 09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. 
[AMD/ATI] >>>> Caicos HDMI Audio [Radeon HD 6400 Series] [1002:aa98] >>>> Subsystem: PC Partner Limited / Sapphire Technology Radeon HD >>>>
3.14 radeon regression: radeon is broken (pci bug?)
On Fri, Mar 21, 2014 at 7:41 AM, Alex Deucher wrote: > On Thu, Mar 20, 2014 at 10:17 PM, Andy Lutomirski > wrote: >> My system works on a 3.13 Fedora kernel. It does not work on a >> more-or-less identically configured 3.14-rc7+ kernel. The symptom is >> that the Plymouth password prompt flashes and them the screen goes >> blank. Hitting escape brings back the text console, and all is well >> until X tries to start. Then I get a blank screen. killall -9 Xorg >> from ssh causes these errors to be logged: >> >> >> [ 226.239747] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >> more than 5secs aborting >> [ 226.239751] [drm:atom_execute_table_locked] *ERROR* atombios stuck >> executing CD34 (len 55, WS 0, PS 0) @ 0xCD57 >> [ 231.241492] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >> more than 5secs aborting >> [ 231.241496] [drm:atom_execute_table_locked] *ERROR* atombios stuck >> executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 >> [ 236.243111] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >> more than 5secs aborting >> [ 236.243115] [drm:atom_execute_table_locked] *ERROR* atombios stuck >> executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 >> [ 241.244625] [drm:atom_op_jump] *ERROR* atombios stuck in loop for >> more than 5secs aborting >> [ 241.244628] [drm:atom_execute_table_locked] *ERROR* atombios stuck >> executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 >> >> >> lspci -vvvxxxnn on 3.14-rc7+ says: >> >> 09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. >> [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM] [1002:6779] >> (rev ff) (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: radeon >> 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >> 09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. 
[AMD/ATI] >> Caicos HDMI Audio [Radeon HD 6400 Series] [1002:aa98] (rev ff) >> (prog-if ff) >> !!! Unknown header type 7f >> Kernel driver in use: snd_hda_intel >> 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >> >> (oops!) >> >> On 3.13, it says: >> >> 09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. >> [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM] [1002:6779] >> (prog-if 00 [VGA controller]) >> Subsystem: PC Partner Limited / Sapphire Technology Radeon HD >> 6450 1 GB DDR3 [174b:e164] >> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- >> ParErr- Stepping- SERR- FastB2B- DisINTx+ >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> SERR- > Latency: 0, Cache Line Size: 64 bytes >> Interrupt: pin A routed to IRQ 92 >> Region 0: Memory at e000 (64-bit, prefetchable) [size=256M] >> Region 2: Memory at f4a2 (64-bit, non-prefetchable) [size=128K] >> Region 4: I/O ports at c000 [size=256] >> Expansion ROM at f4a0 [disabled] [size=128K] >> Capabilities: >> Kernel driver in use: radeon >> 00: 02 10 79 67 07 04 10 00 00 00 00 03 10 00 80 00 >> 10: 0c 00 00 e0 00 00 00 00 04 00 a2 f4 00 00 00 00 >> 20: 01 c0 00 00 00 00 00 00 00 00 00 00 4b 17 64 e1 >> 30: 00 00 a0 f4 50 00 00 00 00 00 00 00 0a 01 00 00 >> >> 09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. 
[AMD/ATI] >> Caicos HDMI Audio [Radeon HD 6400 Series] [1002:aa98] >> Subsystem: PC Partner Limited / Sapphire Technology Radeon HD >> 6450 1GB DDR3 [174b:aa98] >> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- >> ParErr- Stepping- SERR- FastB2B- DisINTx+ >> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- >> SERR- > Latency: 0, Cache Line Size: 64 bytes >> Interrupt: pin B routed to IRQ 96 >> Region 0: Memory at f4a4 (64-bit, non-prefetchable) [size=16K] >> Capabilities: >> Kernel driver in use: snd_hda_intel >> 00: 02 10 98 aa 06 04 10 00 00 00 03 04 10 00 80 00 >> 10: 04 00 a4 f4 00 00 00 00 00 00 00 00 00 00 00 00 >> 20: 00 00 00 00 00 00 00 00 00 00 00 00 4b 17 98 aa >> 30: 00 00 00 00 50 00 00 00 00 00 00 00 05 02 00 00 >> >> Logs attached. >> >> Unfortunately, I'll be away from this computer until Wednesday. > > Can you bisect? Not until Wednesday -- I don't have any way to test this remotely. I can do tests that don't involve rebooting, though. > > Alex -- Andy Lutomirski AMA Capital Management, LLC
3.14 radeon regression: radeon is broken (pci bug?)
My system works on a 3.13 Fedora kernel. It does not work on a more-or-less identically configured 3.14-rc7+ kernel. The symptom is that the Plymouth password prompt flashes and then the screen goes blank. Hitting escape brings back the text console, and all is well until X tries to start. Then I get a blank screen. killall -9 Xorg from ssh causes these errors to be logged: [ 226.239747] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting [ 226.239751] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing CD34 (len 55, WS 0, PS 0) @ 0xCD57 [ 231.241492] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting [ 231.241496] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 [ 236.243111] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting [ 236.243115] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 [ 241.244625] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting [ 241.244628] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing CD6C (len 62, WS 0, PS 0) @ 0xCD88 lspci -vvvxxxnn on 3.14-rc7+ says: 09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM] [1002:6779] (rev ff) (prog-if ff) !!! 09:00.0 continues below with all-ff config space. 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI Audio [Radeon HD 6400 Series] [1002:aa98] (rev ff) (prog-if ff) !!!
Unknown header type 7f Kernel driver in use: snd_hda_intel 00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 10: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 20: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff 30: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff (oops!) On 3.13, it says: 09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM] [1002:6779] (prog-if 00 [VGA controller]) Subsystem: PC Partner Limited / Sapphire Technology Radeon HD 6450 1 GB DDR3 [174b:e164] Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- Kernel driver in use: radeon 00: 02 10 79 67 07 04 10 00 00 00 00 03 10 00 80 00 10: 0c 00 00 e0 00 00 00 00 04 00 a2 f4 00 00 00 00 20: 01 c0 00 00 00 00 00 00 00 00 00 00 4b 17 64 e1 30: 00 00 a0 f4 50 00 00 00 00 00 00 00 0a 01 00 00 09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI Audio [Radeon HD 6400 Series] [1002:aa98] Subsystem: PC Partner Limited / Sapphire Technology Radeon HD 6450 1GB DDR3 [174b:aa98] Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- Kernel driver in use: snd_hda_intel 00: 02 10 98 aa 06 04 10 00 00 00 03 04 10 00 80 00 10: 04 00 a4 f4 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 4b 17 98 aa 30: 00 00 00 00 50 00 00 00 00 00 00 00 05 02 00 00 Logs attached. Unfortunately, I'll be away from this computer until Wednesday. -- next part -- 00:00.0 Host bridge [0600]: Intel Corporation Xeon E5/Core i7 DMI2 [8086:3c00] (rev 06) Subsystem: Micro-Star International Co., Ltd. 
Device [1462:7760] Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- 00: 86 80 00 3c 00 00 10 00 06 00 00 06 10 00 00 00 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 60 77 30: 00 00 00 00 90 00 00 00 00 00 00 00 00 01 00 00 00:01.0 PCI bridge [0604]: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a [8086:3c02] (rev 06) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: Kernel driver in use: pcieport 00: 86 80 02 3c 07 04 10 00 06 00 04 06 10 00 81 00 10: 00 00 00 00 00 00 00 00 00 01 01 00 f0 00 00 00 20: f0 ff 00 00 f1 ff 01 00 00 00 00 00 00 00 00 00 30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 10 00 00:02.0 PCI bridge [0604]: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a [8086:3c04] (rev 06) (prog-if 0
Re: [PATCH 17/25] drm: rip out drm_core_has_MTRR checks
On Fri, Aug 9, 2013 at 11:47 AM, Daniel Vetter wrote: > On Fri, Aug 9, 2013 at 8:39 PM, Andy Lutomirski wrote: >> On Fri, Aug 9, 2013 at 11:36 AM, Daniel Vetter >> wrote: >>> On Fri, Aug 9, 2013 at 8:12 PM, Andy Lutomirski wrote: >>>> On Thu, Aug 8, 2013 at 6:41 AM, Daniel Vetter >>>> wrote: >>>>> The new arch_phys_wc_add/del functions do the right thing both with >>>>> and without MTRR support in the kernel. So we can drop these >>>>> additional checks. >>>> >>>> If any of the new arch_phys_wc_add calls are reachable and if the >>>> driver calls arch_phys_wc_add itself, then the lack of refcounting on >>>> non-PAT systems may cause a problem. (I don't understand the drm >>>> stuff well enough to know whether that can actually happen.) >>> >>> This is only about compile-time options really. Somehow drm had the >>> idea to use these check functions instead of #ifdef plus dummy static >>> inline noop functions. David Herrmann just did the same patch for the >>> agp stuff. So refcounting is of no concern here. >> >> I feel like I'm missing something obvious here. On nouveau, prior to >> this patch, the drm maps code would not touch mtrrs. Now it will. >> Nouveau already calls arch_phys_wc_add, so if that maps code is >> reached on the same resource, then there could be refcounting issues. > > Oh that kind of confusion. The maps code here is for old userspace > drivers, I have some patches in the queue that will disable it > properly for kms drivers. So it should never happen that both the kms > driver and the maps code in the drm core set up a mtrr mapping. And if > it happens someone is doing something really nasty, and that hole will > soon be plugged. In that case, I'm convinced. 
In case you care: Acked-by: Andy Lutomirski --Andy > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- Andy Lutomirski AMA Capital Management, LLC ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 17/25] drm: rip out drm_core_has_MTRR checks
On Fri, Aug 9, 2013 at 11:36 AM, Daniel Vetter wrote: > On Fri, Aug 9, 2013 at 8:12 PM, Andy Lutomirski wrote: >> On Thu, Aug 8, 2013 at 6:41 AM, Daniel Vetter wrote: >>> The new arch_phys_wc_add/del functions do the right thing both with >>> and without MTRR support in the kernel. So we can drop these >>> additional checks. >> >> If any of the new arch_phys_wc_add calls are reachable and if the >> driver calls arch_phys_wc_add itself, then the lack of refcounting on >> non-PAT systems may cause a problem. (I don't understand the drm >> stuff well enough to know whether that can actually happen.) > > This is only about compile-time options really. Somehow drm had the > idea to use these check functions instead of #ifdef plus dummy static > inline noop functions. David Herrmann just did the same patch for the > agp stuff. So refcounting is of no concern here. I feel like I'm missing something obvious here. On nouveau, prior to this patch, the drm maps code would not touch mtrrs. Now it will. Nouveau already calls arch_phys_wc_add, so if that maps code is reached on the same resource, then there could be refcounting issues. --Andy > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch -- Andy Lutomirski AMA Capital Management, LLC
Re: [PATCH 17/25] drm: rip out drm_core_has_MTRR checks
On Thu, Aug 8, 2013 at 6:41 AM, Daniel Vetter wrote: > The new arch_phys_wc_add/del functions do the right thing both with > and without MTRR support in the kernel. So we can drop these > additional checks. If any of the new arch_phys_wc_add calls are reachable and if the driver calls arch_phys_wc_add itself, then the lack of refcounting on non-PAT systems may cause a problem. (I don't understand the drm stuff well enough to know whether that can actually happen.) --Andy
Re: Bug in warning message from MTRR rework in uvesafb
On Wed, Jul 10, 2013 at 10:07 AM, Torsten Kaiser wrote: > Commit 63e28a7a5ffce59b645ca9cbcc01e1e8be56bd75, "uvesafb: Clean up > MTRR code" contains the following change: > > @@ -1930,6 +1891,9 @@ static int uvesafb_setup(char *options) > } > } > > +if (mtrr != 3 && mtrr != 1) > +pr_warn("uvesafb: mtrr should be set to 0 or 3; %d is > unsupported", mtrr); > + > return 0; > } > #endif /* !MODULE */ > > Shouldn't this be && mtrr != 0? Indeed, and Sylvain Hitier (cc'd) sent a patch (off-list) that must have gotten lost somewhere. --Andy
Re: [PATCH 33/39] drm: rip out drm_core_has_MTRR checks
On Wed, Jul 10, 2013 at 8:59 AM, Daniel Vetter wrote: > On Wed, Jul 10, 2013 at 5:41 PM, David Herrmann wrote: >> On Wed, Jul 10, 2013 at 5:22 PM, Daniel Vetter >> wrote: >>> On Wed, Jul 10, 2013 at 3:51 PM, David Herrmann >>> wrote: > -#if __OS_HAS_MTRR > -static inline int drm_core_has_MTRR(struct drm_device *dev) > -{ > - return drm_core_check_feature(dev, DRIVER_USE_MTRR); > -} > -#else > -#define drm_core_has_MTRR(dev) (0) > -#endif > - That was the last user of DRIVER_USE_MTRR (apart from drivers setting it in .driver_features). Any reason to keep it around? >>> >>> Yeah, I guess we could rip things out. Which will also force me to >>> properly audit drivers for the eventual behaviour change this could >>> entail (in case there's an x86 driver which did not ask for an mtrr, >>> but iirc there isn't). >> >> david@david-mb ~/dev/kernel/linux $ for i in drivers/gpu/drm/* ; do if >> test -d "$i" ; then if ! grep -q USE_MTRR -r $i ; then echo $i ; fi ; >> fi ; done >> drivers/gpu/drm/exynos >> drivers/gpu/drm/gma500 >> drivers/gpu/drm/i2c >> drivers/gpu/drm/nouveau >> drivers/gpu/drm/omapdrm >> drivers/gpu/drm/qxl >> drivers/gpu/drm/rcar-du >> drivers/gpu/drm/shmobile >> drivers/gpu/drm/tilcdc >> drivers/gpu/drm/ttm >> drivers/gpu/drm/udl >> drivers/gpu/drm/vmwgfx >> david@david-mb ~/dev/kernel/linux $ >> >> So for x86 gma500,nouveau,qxl,udl,vmwgfx don't set DRIVER_USE_MTRR. >> But I cannot tell whether they break if we call arch_phys_wc_add/del, >> anyway. At least nouveau seemed to work here, but it doesn't use AGP >> or drm_bufs, I guess. > > Cool, thanks a lot for stitching together the list of drivers to look > at. So for real KMS drivers it's the driver's responsibility to add an > mtrr if it needs one. nouveau, radeon, mgag200, i915 and vmwgfx do that > already. Somehow the savage driver also ends up doing that, I have no > idea why. > > Note that gma500 as a pure KMS driver doesn't need MTRR setup since > the platforms that it supports all support PAT.
So no MTRRs needed to > get wc iomappings. > > The mtrr support in the drm core is all for legacy mappings of garts, > framebuffers and registers. All legacy drivers set the USE_MTRR flag, > so we're good there. > Are all of those codepaths really inaccessible in non-legacy drm drivers? I didn't try to fully unravel all the ioctls and such, but it seems like userspace could add bufs and map them. Since the mtrr code isn't very robust (reference counting? what reference counting?), I'm a little bit worried that potentially enabling it in more cases, which your patch does, could be harmful. The arch_phys_wc stuff puts a prettier interface on the mtrr code and turns it off when PAT is available, but the underlying code is still just as bad. --Andy
Re: [RFC 3/6] drm: add SimpleDRM driver
On 06/24/2013 03:27 PM, David Herrmann wrote: > + sdrm->fb_map = ioremap(sdrm->fb_base, sdrm->fb_size); This should probably be ioremap_wc. Otherwise it will be *really* slow if used in legacy mode and it may cause conflicts with the pgprot_writecombine mode for mmap. (Watching boot messages go by on fbcon on efifb was like using an old 2400 baud modem before I made the corresponding change to efifb.) --Andy
Re: MTRR use in drivers
On Sun, Jun 23, 2013 at 1:38 PM, H. Peter Anvin wrote: > On 06/23/2013 01:30 PM, Dave Airlie wrote: >>>>> Why do you care about performance when PAT is disabled? >> >> breaking old boxes just because, is just going to get reverted when I >> get the first regression report that you broke old boxes. >> > > Not "just because", but *if* the choice is between breaking old boxes > and breaking new boxes I'll take the latter. > >> Andy Lutomirski just submitted a bunch of patches to clean up the DRM >> usage of mtrrs, they are in drm-next, afaik we no longer add them on >> PAT systems. > > Fantastic news. No issue, then, and no need to break anything. > > The only problem I see with having ioremap_wc() installing an MTRR on > non-PAT, rather than pushing that into the drivers which is clearly not > the right thing, is that we will need a hook to uninstall it when the > mapping is destroyed. I have trouble believing that this will ever work well -- MTRRs have crazy alignment requirements and interactions with other MTRRs, and a few drivers have to jump through hoops to set up the right MTRRs. There aren't really enough to break down every mapping. My patches (in drm-next) add functions arch_phys_wc_add and arch_phys_wc_del that do nothing except on x86 with MTRRs on and PAT off, in which case they try to add a WC MTRR. That way the handful of drivers that need WC for performance on old hardware can try (and possibly fail, depending on the usual vagaries of MTRRs). With my patches applied, DRM and agpgart no longer touch MTRRs at all with PAT on. I didn't get around to excising MTRRs from the non-DRM video drivers or from the few odd cases like myri10ge. This stuff is painful to test. The only drivers I can really test are i915 and radeon. I have a myri10ge device, but it's on a production server.
I also have several mgag200 devices, but they're in a super-secret-locked-down datacenter a few thousand miles away, and trying to gauge framebuffer performance over Dell and/or HP's crappy remoting interface is a lost cause. I'm not sure that my oldest computer (locked in a basement in another state) is old enough to have an AGP port. --Andy
Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Thu, Jun 13, 2013 at 2:22 PM, Andy Lutomirski wrote: > On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse wrote: >> Andy can you test (without your patch) and see if it helps with your issue : >> http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch > > Testing now. I'll report back in a couple of days. > 3.9.4 plus this patch has been completely stable for several days now. Tested-by: Andy Lutomirski Can you send this to Linus and -stable? Thanks, Andy ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 0/3] fbdev no more!
On 06/16/2013 07:57 AM, Daniel Vetter wrote: > Hi all, > > So I've taken a look again at the locking mess in our fbdev support and cried. > Fixing up the console_lock mess around the fbdev notifier will be real work, > semanatically the fbdev layer does lots of stupid things (like the radeon > resume > issue I've just debugged) and the panic notifier is pretty much a lost cause. > > So I've decided to instead rip it all out. It seems to work \o/ I wonder how badly this breaks on EFI systems. Currently, efifb is an fbdev driver. When i915 calls register_framebuffer, the fbdev core removes efifb's framebuffer. (This is scary already -- what if i915 has reused that memory for something else beforehand?) But now, if i915 doesn't call register_framebuffer, the efifb "framebuffer" might stick around forever. Presumably, efifb ought to become a framebuffer-only drm driver and there should be a saner way to hand control from efifb (or vesa?) to a real driver. --Andy
Re: [PATCH] radeon: Fix a false positive lockup after 10s of inactivity
On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse wrote: > On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer wrote: >> On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote: >>> If the device is idle for over ten seconds, then the next attempt to do >>> anything can race with the lockup detector and cause a bogus lockup >>> to be detected. >>> >>> Oddly, the situation is well-described in the lockup detector's comments >>> and a fix is even described. This patch implements that fix (and corrects >>> some typos in the description). >>> >>> My system has been stable for about a week running this code. Without this, >>> my screen would go blank every now and then and, when it came back, >>> everything >>> would be remarkably slow (the latter is a separate bug). >>> >>> Signed-off-by: Andy Lutomirski >> >> [...] >> >>> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c >>> b/drivers/gpu/drm/radeon/radeon_ring.c >>> index 1ef5eaa..fb7b3ea 100644 >>> --- a/drivers/gpu/drm/radeon/radeon_ring.c >>> +++ b/drivers/gpu/drm/radeon/radeon_ring.c >>> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring >>> *ring) >>> * have CP rptr to a different value of jiffies wrap around which will >>> force >>> * initialization of the lockup tracking informations. >>> * >>> - * A possible false positivie is if we get call after while and >>> last_cp_rptr == >>> - * the current CP rptr, even if it's unlikely it might happen. To avoid >>> this >>> - * if the elapsed time since last call is bigger than 2 second than we >>> return >>> - * false and update the tracking information. Due to this the caller must >>> call >>> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be >>> reported >>> - * the fencing code should be cautious about that. >>> + * A possible false positive is if we get called after a while and >>> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might >>> + * happen. 
To avoid this if the elapsed time since the last call is bigger >>> + * than 2 second then we return false and update the tracking >>> + * information. Due to this the caller must call radeon_ring_test_lockup >>> + * more frequently than once every 2s when waiting. >> >> Is it guaranteed that radeon_ring_test_lockup will be called more often >> than every 2s when waiting? If not, this change might prevent a real >> lockup from being detected? > > Yes it will if you wait for a fence, because the fence timeout wait is > way smaller than 2sec so radeon_ring_is_lockup get call several time, > which call radeon_ring_force_activity and then > radeon_ring_test_lockup. > > This also means it very very very unlikely (see below for the likely > case) to have a wrap around that give last rptr same as current one. > > The likely case is when you have something like a long compute, then > nothing is lockup but you keep filling ring with > radeon_ring_force_activity but the cp is still stuck on the ib of the > compute stuff so rptr does not progress. > >> Either way, I wonder if there might not be a simpler solution to the >> problem, e.g. by updating last_activity when submitting commands to a >> previously empty ring. > > Maybe but i still don't think it should matter. > > Andy can you test (without your patch) and see if it helps with your issue : > http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch Testing now. I'll report back in a couple of days. I don't think that long computes have anything to do with it. The bogus lockups happen when I look away from my computer for a while and then click something. I think the graphics are usually completely idle when this happens. AFAIK I've never run an OpenCL or similar application on this system. --Andy
[PATCH] radeon: Fix a false positive lockup after 10s of inactivity
If the device is idle for over ten seconds, then the next attempt to do anything can race with the lockup detector and cause a bogus lockup to be detected. Oddly, the situation is well-described in the lockup detector's comments and a fix is even described. This patch implements that fix (and corrects some typos in the description). My system has been stable for about a week running this code. Without this, my screen would go blank every now and then and, when it came back, everything would be remarkably slow (the latter is a separate bug). Signed-off-by: Andy Lutomirski --- This may be -stable material. drivers/gpu/drm/radeon/radeon.h | 1 + drivers/gpu/drm/radeon/radeon_ring.c | 23 --- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 8263af3..9de5778 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -652,6 +652,7 @@ struct radeon_ring { unsignedring_free_dw; int count_dw; unsigned long last_activity; + unsigned long last_test_lockup; unsignedlast_rptr; uint64_tgpu_addr; uint32_talign_mask; diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c index 1ef5eaa..fb7b3ea 100644 --- a/drivers/gpu/drm/radeon/radeon_ring.c +++ b/drivers/gpu/drm/radeon/radeon_ring.c @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) * have CP rptr to a different value of jiffies wrap around which will force * initialization of the lockup tracking informations. * - * A possible false positivie is if we get call after while and last_cp_rptr == - * the current CP rptr, even if it's unlikely it might happen. To avoid this - * if the elapsed time since last call is bigger than 2 second than we return - * false and update the tracking information. Due to this the caller must call - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported - * the fencing code should be cautious about that. 
+ * A possible false positive is if we get called after a while and + * last_cp_rptr == the current CP rptr, even if it's unlikely it might + * happen. To avoid this if the elapsed time since the last call is bigger + * than 2 second then we return false and update the tracking + * information. Due to this the caller must call radeon_ring_test_lockup + * more frequently than once every 2s when waiting. * * Caller should write to the ring to force CP to do something so we don't get * false positive when CP is just gived nothing to do. @@ -560,10 +560,14 @@ void radeon_ring_lockup_update(struct radeon_ring *ring) **/ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *ring) { - unsigned long cjiffies, elapsed; + unsigned long cjiffies, elapsed, last_test; uint32_t rptr; cjiffies = jiffies; + + last_test = ring->last_test_lockup; + ring->last_test_lockup = cjiffies; + if (!time_after(cjiffies, ring->last_activity)) { /* likely a wrap around */ radeon_ring_lockup_update(ring); @@ -576,6 +580,11 @@ bool radeon_ring_test_lockup(struct radeon_device *rdev, struct radeon_ring *rin radeon_ring_lockup_update(ring); return false; } + if (cjiffies - last_test > 2 * HZ) { + /* Possible race -- see comment above */ + radeon_ring_lockup_update(ring); + return false; + } elapsed = jiffies_to_msecs(cjiffies - ring->last_activity); if (radeon_lockup_timeout && elapsed >= radeon_lockup_timeout) { dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed); -- 1.8.1.4
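For anyone who wants to poke at the logic outside the kernel, here is a rough userspace model of the check the patch adds. It is only a sketch under stated assumptions: struct ring_model, ring_model_update(), ring_model_test_lockup(), the HZ value of 1000 and the fixed 10000 ms timeout are all made up for illustration and are not the driver's code. Time is passed in explicitly instead of reading jiffies, and the jiffies wrap-around check from the real function is left out. The update helper assumes that radeon_ring_lockup_update() records the current rptr and the current time, which is how I read the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HZ 1000UL                 /* assumed tick rate for this model */
#define LOCKUP_TIMEOUT_MS 10000UL /* hypothetical stand-in for radeon_lockup_timeout */

/* Hypothetical, cut-down stand-in for the struct radeon_ring fields used here. */
struct ring_model {
	unsigned long last_activity;    /* tick at which the tracking was last reset */
	unsigned long last_test_lockup; /* tick of the previous lockup test */
	uint32_t last_rptr;             /* rptr recorded when the tracking was reset */
};

/* Reset the tracking state (assumed to mirror radeon_ring_lockup_update()). */
static void ring_model_update(struct ring_model *r, uint32_t rptr, unsigned long now)
{
	r->last_rptr = rptr;
	r->last_activity = now;
}

/* Report a lockup only if rptr stayed stuck while we were polling regularly. */
static bool ring_model_test_lockup(struct ring_model *r, uint32_t rptr, unsigned long now)
{
	unsigned long last_test = r->last_test_lockup;
	unsigned long elapsed_ms;

	r->last_test_lockup = now;

	if (rptr != r->last_rptr) {
		/* The CP made progress, so this is not a lockup. */
		ring_model_update(r, rptr, now);
		return false;
	}
	if (now - last_test > 2 * HZ) {
		/* Nobody tested for over 2 s: stale data, restart the clock. */
		ring_model_update(r, rptr, now);
		return false;
	}
	elapsed_ms = (now - r->last_activity) * 1000UL / HZ;
	return elapsed_ms >= LOCKUP_TIMEOUT_MS;
}

/* Idle for 15 s, then poll once per second with rptr stuck at the same value. */
static void ring_model_demo(void)
{
	struct ring_model r = { .last_activity = 0, .last_test_lockup = 0, .last_rptr = 5 };
	unsigned long t;

	/* First test after the idle period: the gap exceeds 2 s, so no lockup. */
	assert(!ring_model_test_lockup(&r, 5, 15 * HZ));
	/* Regular polling: nothing is reported until 10 s after the reset. */
	for (t = 16; t < 25; t++)
		assert(!ring_model_test_lockup(&r, 5, t * HZ));
	assert(ring_model_test_lockup(&r, 5, 25 * HZ));
}
```

Dropping the second early return from ring_model_test_lockup() makes the very first call in ring_model_demo() report a lockup, since 15 s have passed with an unchanged rptr. That is the false positive after idle described in the commit message. A real hang is still caught in this model as long as the caller keeps testing more often than once every 2 s.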