
Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-20 Thread Chen Li

On Fri, 18 Dec 2020 16:10:12 +0800,
Christian König wrote:
> 
> On 18.12.20 at 04:51, Chen Li wrote:
> > [SNIP]
>  If your ARM base board can't do that for some reason then you can't use the
>  hardware with that board.
> >>> Good to know, thanks! BTW, have you ever seen or heard of boards like mine
> >>> which cannot mmap device memory from userspace correctly?
> >> Unfortunately yes. We haven't been able to figure out what exactly goes 
> >> wrong in
> >> those cases.
> > Ok. one more question: only e8860 or all radeon cards have this issue?
> 
> This applies to all hardware with dedicated memory which needs to be mapped to
> userspace.
> 
> That includes all graphics hardware from AMD as well as NVidia and probably a
> whole bunch of other PCIe devices.

Can MMIO on these devices work fine in kernel space? The only difference I can
see is that user space should use an uncacheable mmap to map virtual memory to
device space (though I don't know how to request an uncacheable mmap), while the
kernel uses an uncached ioremap.

> 
> >>>   The graphics address remapping table (GART),[1] also known as the 
> >>> graphics aperture remapping table,[2] or graphics translation table 
> >>> (GTT),[3] is an I/O memory management unit (IOMMU) used by Accelerated 
> >>> Graphics Port (AGP) and PCI Express (PCIe) graphics cards.
> >> GART or GTT refers to the translation tables graphics hardware use to 
> >> access
> >> system memory.
> >> 
> >> Something like 15 years ago we used the IOMMU functionality from AGP to
> >> implement that. But modern hardware (PCIe) uses some specialized hardware 
> >> in the
> >> GPU for that.
> >> 
> >> Regards,
> >> Christian.
> >> 
> >> 
> >> 
> > Good to know, thanks! So the modern GART/GTT is like a TLB, and the IOMMU is
> > focused on translating addresses and does not manage the TLB.
> 
> You are getting closer in your understanding, but the TLB is the Translation
> Lookaside Buffer: basically a cache of recent VM translations which is present
> in all page table translations (GART, IOMMU, CPU etc...).
> 
> The key difference is where the page table translation happens on modern
> hardware:
> 1. For the GART/GTT it is inside the GPU to translate between GPU internal and
> bus addresses.
> 2. For IOMMU it is inside the root complex of the PCIe to translate between 
> bus
> addresses and physical addresses.
> 3. For CPU page tables it is inside the CPU core to translate between virtual
> addresses and physical addresses.
> 
> Regards,
> Christian.
> 
> 

Awesome explanation! Thanks a ton!
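
The three translation points listed above can be pictured as chained page-table lookups. What follows is only a toy model under loud assumptions: single-level tables, 4 KiB pages, and hypothetical names (`struct page_table`, `translate`, `gpu_to_physical`) that do not correspond to any driver or kernel API. It illustrates that a GPU access to system memory is translated first by the GART inside the GPU (GPU-internal address to bus address) and then by the IOMMU in the root complex (bus address to physical address).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                                  /* assumed 4 KiB pages */
#define PAGE_OFFSET_MASK ((UINT64_C(1) << PAGE_SHIFT) - 1)
#define UNMAPPED UINT64_MAX                             /* marker for "no translation" */

/* Hypothetical flat page table: index = source page number,
 * value = target page number. Real tables are multi-level. */
struct page_table {
    const uint64_t *target_pfn;
    size_t num_pages;
};

/* Translate one address through one table, keeping the page offset. */
static uint64_t translate(const struct page_table *pt, uint64_t addr)
{
    assert(pt != NULL && pt->target_pfn != NULL);

    uint64_t pfn = addr >> PAGE_SHIFT;
    if (pfn >= pt->num_pages)
        return UNMAPPED;
    return (pt->target_pfn[pfn] << PAGE_SHIFT) | (addr & PAGE_OFFSET_MASK);
}

/* GPU access to system memory:
 *   GART  (inside the GPU):      GPU-internal address -> bus address
 *   IOMMU (PCIe root complex):   bus address          -> physical address */
static uint64_t gpu_to_physical(const struct page_table *gart,
                                const struct page_table *iommu,
                                uint64_t gpu_addr)
{
    uint64_t bus_addr = translate(gart, gpu_addr);
    if (bus_addr == UNMAPPED)
        return UNMAPPED;
    return translate(iommu, bus_addr);
}
```

The CPU page tables would be a third, independent instance of the same lookup, performed inside the CPU core on virtual addresses. A TLB, in this picture, is a cache placed in front of any one of these lookups.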


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-18 Thread Robin Murphy


On 2020-12-18 14:33, Christian König wrote:

On 18.12.20 at 15:17, Robin Murphy wrote:

On 2020-12-17 14:02, Christian König wrote:

[SNIP]
Do you have some background why some ARM boards fail with that?

We had a couple of reports that memset/memcpy fail in userspace 
(usually system just spontaneously reboots or becomes unresponsive), 
but so far nobody could tell us why that happens?


Part of it is that Arm doesn't really have an ideal memory type for 
mapping RAM behind PCI (much like we also struggle with the vague 
expectations of what write-combine might mean beyond x86). Device 
memory can be relaxed to allow gathering, reordering and 
write-buffering, but is still a bit too restrictive in other ways - 
aligned, non-speculative, etc. - for something that's really just RAM 
and expected to be usable as such. Thus to map PCI memory as 
"write-combine" we use Normal non-cacheable, which means the CPU MMU 
is going to allow software to do all the things it might expect of 
RAM, but we're now at the mercy of the menagerie of interconnects and 
PCI implementations out there.


I see. As far as I know we already correctly map the RAM from the GPU as 
"write-combine".


Atomic operations, for example, *might* be resolved by the CPU 
coherency mechanism or in the interconnect, such that the PCI host 
bridge only sees regular loads and stores, but more often than not 
they'll just result in an atomic transaction going all the way to the 
host bridge. A super-duper-clever host bridge implementation might 
even support that, but the vast majority are likely to just reject it 
as invalid.


Support for atomics is actually specified by a PCIe extension. As far
as I know that extension is even necessary for full KFD support on AMD 
and full Cuda support for NVidia GPUs.




Similarly, unaligned accesses, cache line fills/evictions, and such 
will often work, since they're essentially just larger read/write 
bursts, but some host bridges can be picky and might reject access 
sizes they don't like (there's at least one where even 64-bit accesses 
don't work. On a 64-bit system...)


This is breaking our neck here. We need 64bit writes on 64bit systems to 
end up as one 64bit write at the hardware and not two 32bit writes or 
otherwise the doorbells won't work correctly.


Just to clarify, that particular case *is* considered catastrophically 
broken ;)


In general you can assume that on AArch64, any aligned 64-bit load or 
store is atomic (64-bit accesses on 32-bit Arm are less well-defined, 
but hopefully nobody cares by now).
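
The doorbell requirement discussed here (one 64-bit write must arrive as one 64-bit write) can be sketched in plain C. This is a userspace illustration only, with hypothetical names (`ring_doorbell64`, `is_naturally_aligned`); inside the kernel the I/O accessor for this would be writeq(). It assumes that the compiler emits a single store instruction for an aligned 64-bit volatile access, which is what compilers do in practice on AArch64 but is not something the C standard itself promises.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if ptr is aligned to access_size bytes.
 * access_size must be a power of two. */
static bool is_naturally_aligned(const volatile void *ptr, size_t access_size)
{
    return ((uintptr_t)ptr & (access_size - 1)) == 0;
}

/* Hypothetical doorbell ring: a single naturally aligned 64-bit store.
 * Splitting this into two 32-bit stores would let the device observe a
 * half-updated value, which is the failure mode described in the thread. */
static void ring_doorbell64(volatile uint64_t *doorbell, uint64_t value)
{
    assert(is_naturally_aligned(doorbell, sizeof(*doorbell)));
    *doorbell = value;
}
```

Even when the CPU issues one 64-bit store, whether the interconnect and host bridge pass it through unsplit is a property of the SoC, which is the point being made above.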


Larger writes are pretty much unproblematic, for P2P our bus interface 
even supports really large multi byte transfers.


If an invalid transaction does reach the host bridge, it's going to 
come back to the CPU as an external abort. If we're really lucky that 
could be taken synchronously, attributable to a specific instruction, 
and just oops/SIGBUS the relevant kernel/userspace thread. Often 
though, (particularly with big out-of-order CPUs) it's likely to be 
asynchronous and no longer attributable, and thus taken as an SError 
event, which in general roughly translates to "part of the SoC has 
fallen off". The only reasonable response we have to that is to panic 
the system.


Yeah, that sounds exactly like what we see on some of the ARM boards out 
there. At least we have an explanation for that behavior now.


Going to talk about this with our hardware engineers. We might be able 
to work around some of that stuff, but that is rather tricky to get 
working under those conditions.


Yeah, unfortunately there's no easy way to judge the quality of any 
given SoC's PCI implementation until you throw your required traffic at 
it and things either break or don't...


Cheers,
Robin.

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-18 Thread Robin Murphy

On 2020-12-18 06:14, Chen Li wrote:
[...]

No, not performance. See standards like OpenGL, Vulkan as well as VA-API and
VDPAU require that you can mmap() device memory and execute memset/memcpy on
the memory from userspace.

If your ARM base board can't do that for some reason then you can't use the
hardware with that board.

If the VRAM lives in a prefetchable PCI bar then on most sane Arm-based systems
I believe it should be able to mmap() to userspace with the Normal memory type,
where unaligned accesses and such are allowed, as opposed to the Device memory
type intended for MMIO mappings, which has more restrictions but stricter
ordering guarantees.

Hi, Robin. I cannot understand why it allows unaligned accesses. A prefetchable PCI BAR should also be MMIO, and accesses will end up in device memory, so why does this allow unaligned access?

Because even Device-GRE is a bit too restrictive to expose to userspace
that's likely to expect it to behave as regular memory, so, for better
or worse, we use MT_NORMAL_NC for pgprot_writecombine().

Regardless of what happens elsewhere though, if something is mapped *into the
kernel* with ioremap(), then it is fundamentally wrong per the kernel memory
model to reference that mapping directly without using I/O accessors. That is
not specific to any individual architecture, and Sparse should be screaming
about it already. I guess in this case the UVD code needs to pay more attention
to whether radeon_bo_kmap() ends up going via ttm_bo_ioremap() or not.

(I'm assuming the initial fault was memset() with 0 trying to perform "DC ZVA"
on a Device-type mapping from ioremap() - FYI a stacktrace on its own without
the rest of the error dump showing what actually triggered it isn't overly
useful)
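
To make the contrast with a plain memset() concrete, here is a userspace stand-in for what an I/O-safe fill does: only byte stores and naturally aligned 64-bit stores, never a cache-zeroing instruction such as "DC ZVA" and never an unaligned wide store. The name `fill_io_like` is hypothetical and this is not the kernel's memset_io(); it is a sketch of the access pattern only, and it runs on ordinary memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill len bytes at dst with value using only accesses that are safe on a
 * restrictive mapping: single bytes, plus naturally aligned 64-bit words.
 * volatile keeps the compiler from turning the loops back into memset(). */
static void fill_io_like(volatile void *dst, uint8_t value, size_t len)
{
    assert(dst != NULL || len == 0);

    volatile uint8_t *p = dst;
    /* Replicate the byte into all eight lanes of a 64-bit word. */
    uint64_t pattern = value * UINT64_C(0x0101010101010101);

    /* Leading bytes until the pointer is 8-byte aligned. */
    while (len > 0 && ((uintptr_t)p & 7u) != 0) {
        *p++ = value;
        len--;
    }
    /* Bulk: naturally aligned 64-bit stores. */
    while (len >= 8) {
        *(volatile uint64_t *)p = pattern;
        p += 8;
        len -= 8;
    }
    /* Trailing bytes. */
    while (len > 0) {
        *p++ = value;
        len--;
    }
}
```

In driver code the equivalent decision is whether the buffer came from an ioremap()-style mapping; if it did, the I/O accessors are the only correct way to touch it, as stated above.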

Robin.

Why might it be 'DC ZVA'? I'm not sure of the PC at the initial kernel fault in
memset, but I captured the userspace crash PC: an stp (128-bit) or a NEON str
(also 128-bit) to the render node (/dev/dri/renderD128).

As I said it was an assumption. I guessed at it being more likely to be
an MMU fault than an external abort, and given the size and the fact
that it's a variable initialisation guessed at it being slightly more
likely to hit the ZVA special-case rather than being unaligned. Looking
again, I guess starting at an odd-numbered 32-bit element might lead to
an unaligned store of XZR, but either way it doesn't really matter -
what it showed is it clearly *could* be an MMU fault because TTM seems
to be a bit careless with iomem pointers.

That said, if you're also getting external aborts from your host bridge
not liking 128-bit transactions, then as Christian says you're probably
going to have a bad time on this platform either way.

Robin.

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-18 Thread Christian König


On 18.12.20 at 15:17, Robin Murphy wrote:

On 2020-12-17 14:02, Christian König wrote:

[SNIP]
Do you have some background why some ARM boards fail with that?

We had a couple of reports that memset/memcpy fail in userspace 
(usually system just spontaneously reboots or becomes unresponsive), 
but so far nobody could tell us why that happens?


Part of it is that Arm doesn't really have an ideal memory type for 
mapping RAM behind PCI (much like we also struggle with the vague 
expectations of what write-combine might mean beyond x86). Device 
memory can be relaxed to allow gathering, reordering and 
write-buffering, but is still a bit too restrictive in other ways - 
aligned, non-speculative, etc. - for something that's really just RAM 
and expected to be usable as such. Thus to map PCI memory as 
"write-combine" we use Normal non-cacheable, which means the CPU MMU 
is going to allow software to do all the things it might expect of 
RAM, but we're now at the mercy of the menagerie of interconnects and 
PCI implementations out there.


I see. As far as I know we already correctly map the RAM from the GPU as 
"write-combine".


Atomic operations, for example, *might* be resolved by the CPU 
coherency mechanism or in the interconnect, such that the PCI host 
bridge only sees regular loads and stores, but more often than not 
they'll just result in an atomic transaction going all the way to the 
host bridge. A super-duper-clever host bridge implementation might 
even support that, but the vast majority are likely to just reject it 
as invalid.


Support for atomics is actually specified by a PCIe extension. As far
as I know that extension is even necessary for full KFD support on AMD 
and full Cuda support for NVidia GPUs.




Similarly, unaligned accesses, cache line fills/evictions, and such 
will often work, since they're essentially just larger read/write 
bursts, but some host bridges can be picky and might reject access 
sizes they don't like (there's at least one where even 64-bit accesses 
don't work. On a 64-bit system...)


This is breaking our neck here. We need 64bit writes on 64bit systems to 
end up as one 64bit write at the hardware and not two 32bit writes or 
otherwise the doorbells won't work correctly.


Larger writes are pretty much unproblematic, for P2P our bus interface 
even supports really large multi byte transfers.


If an invalid transaction does reach the host bridge, it's going to 
come back to the CPU as an external abort. If we're really lucky that 
could be taken synchronously, attributable to a specific instruction, 
and just oops/SIGBUS the relevant kernel/userspace thread. Often 
though, (particularly with big out-of-order CPUs) it's likely to be 
asynchronous and no longer attributable, and thus taken as an SError 
event, which in general roughly translates to "part of the SoC has 
fallen off". The only reasonable response we have to that is to panic 
the system.


Yeah, that sounds exactly like what we see on some of the ARM boards out 
there. At least we have an explanation for that behavior now.


Going to talk about this with our hardware engineers. We might be able 
to work around some of that stuff, but that is rather tricky to get 
working under those conditions.


Thanks,
Christian.




Robin.



Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-18 Thread Robin Murphy

On 2020-12-17 14:02, Christian König wrote:

On 17.12.20 at 14:45, Robin Murphy wrote:

On 2020-12-17 10:25, Christian König wrote:

On 17.12.20 at 02:07, Chen Li wrote:

On Wed, 16 Dec 2020 22:19:11 +0800,
Christian König wrote:

On 16.12.20 at 14:48, Chen Li wrote:

On Wed, 16 Dec 2020 15:59:37 +0800,
Christian König wrote:

[SNIP]
Hi, Christian. I'm not sure why this change is a hack here. I
cannot see the problem and will be grateful if you give more
explanations.
__memset is supposed to work on those addresses, otherwise you
can't use the e8860 on your arm64 system.
If __memset is supposed to work on those addresses, why is this
commit (https://github.com/torvalds/linux/commit/ba0b2275a6781b2f3919d931d63329b5548f6d5f)
needed? (I also notice drm/radeon didn't take this change, though.)
Just out of curiosity.

We generally accept those patches as cleanup in the kernel with the
hope that we can find a way to work around the userspace restrictions.

But when you also have this issue in userspace then there isn't much
we can do for you.

Replacing the direct write in the kernel with calls to writel() or
memset_io() will fix that temporarily, but you have a more general
problem here.
I cannot see what the more general problem is here :( Do you mean
performance?

No, not performance. See standards like OpenGL, Vulkan as well as
VA-API and VDPAU require that you can mmap() device memory and
execute memset/memcpy on the memory from userspace.

If your ARM base board can't do that for some reason then you can't use the
hardware with that board.

If the VRAM lives in a prefetchable PCI bar then on most sane
Arm-based systems I believe it should be able to mmap() to userspace
with the Normal memory type, where unaligned accesses and such are
allowed, as opposed to the Device memory type intended for MMIO
mappings, which has more restrictions but stricter ordering guarantees.

Do you have some background why some ARM boards fail with that?

We had a couple of reports that memset/memcpy fail in userspace (usually
system just spontaneously reboots or becomes unresponsive), but so far
nobody could tell us why that happens?

Part of it is that Arm doesn't really have an ideal memory type for
mapping RAM behind PCI (much like we also struggle with the vague
expectations of what write-combine might mean beyond x86). Device memory
can be relaxed to allow gathering, reordering and write-buffering, but
is still a bit too restrictive in other ways - aligned, non-speculative,
etc. - for something that's really just RAM and expected to be usable as
such. Thus to map PCI memory as "write-combine" we use Normal
non-cacheable, which means the CPU MMU is going to allow software to do
all the things it might expect of RAM, but we're now at the mercy of the
menagerie of interconnects and PCI implementations out there.

Atomic operations, for example, *might* be resolved by the CPU coherency
mechanism or in the interconnect, such that the PCI host bridge only
sees regular loads and stores, but more often than not they'll just
result in an atomic transaction going all the way to the host bridge. A
super-duper-clever host bridge implementation might even support that,
but the vast majority are likely to just reject it as invalid.

Similarly, unaligned accesses, cache line fills/evictions, and such will
often work, since they're essentially just larger read/write bursts, but
some host bridges can be picky and might reject access sizes they don't
like (there's at least one where even 64-bit accesses don't work. On a
64-bit system...)

If an invalid transaction does reach the host bridge, it's going to come
back to the CPU as an external abort. If we're really lucky that could
be taken synchronously, attributable to a specific instruction, and just
oops/SIGBUS the relevant kernel/userspace thread. Often though,
(particularly with big out-of-order CPUs) it's likely to be asynchronous
and no longer attributable, and thus taken as an SError event, which in
general roughly translates to "part of the SoC has fallen off". The only
reasonable response we have to that is to panic the system.

Robin.

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-18 Thread Christian König


On 18.12.20 at 09:52, Chen Li wrote:

On Fri, 18 Dec 2020 16:10:12 +0800,
Christian König wrote:

On 18.12.20 at 04:51, Chen Li wrote:

[SNIP]

If your ARM base board can't do that for some reason then you can't use the hardware
with that board.

Good to know, thanks! BTW, have you ever seen or heard of boards like mine which
cannot mmap device memory from userspace correctly?

Unfortunately yes. We haven't been able to figure out what exactly goes wrong in
those cases.

Ok. one more question: only e8860 or all radeon cards have this issue?

This applies to all hardware with dedicated memory which needs to be mapped to
userspace.

That includes all graphics hardware from AMD as well as NVidia and probably a
whole bunch of other PCIe devices.

Can mmio on these devices work fine in kernel space?


The kernel drivers know that this is MMIO and can use special 
instructions/functions like 
writel()/writeq()/memcpy_fromio()/memcpy_toio() etc...



The only difference I can see is that user space should use an uncacheable mmap
to map virtual memory to device space (though I don't know how to request an
uncacheable mmap), while the kernel uses an uncached ioremap.


The problem with mmap() of MMIO into the userspace is that this can 
easily crash the whole system.


When an application uses memset()/memcpy() on the mapped region and the
system spontaneously reboots, then that's a rather big hardware problem.
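
For readers unfamiliar with the pattern, this is roughly what "mmap() and then memset()" looks like from userspace. It is a sketch under assumptions: `clear_mapped_region` is a hypothetical helper, and `path` is a stand-in for whatever file backs the mapping. A real graphics client would instead map a buffer object through the DRM render node at an offset handed out by the driver, and that part is deliberately omitted here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the first len bytes of the file at path as shared, writable memory
 * and zero them with an ordinary memset(). Returns 0 on success, -1 if the
 * file cannot be opened, mapped or unmapped. */
static int clear_mapped_region(const char *path, size_t len)
{
    assert(path != NULL && len > 0);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (ptr == MAP_FAILED)
        return -1;

    /* The C library is free to use wide, unaligned or cache-zeroing
     * instructions here; userspace has no idea the target may be MMIO. */
    memset(ptr, 0, len);

    return munmap(ptr, len);
}
```

Nothing in this code is privileged or unusual, which is why a board that resets on it is considered a hardware problem rather than an application bug.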


Regards,
Christian.

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-18 Thread Chen Li

On Thu, 17 Dec 2020 22:16:59 +0800,
Christian König wrote:
> 
> On 17.12.20 at 14:37, Chen Li wrote:
> > On Thu, 17 Dec 2020 18:25:11 +0800,
> > Christian König wrote:
> >> On 17.12.20 at 02:07, Chen Li wrote:
> >>> On Wed, 16 Dec 2020 22:19:11 +0800,
> >>> Christian König wrote:
>  On 16.12.20 at 14:48, Chen Li wrote:
> > On Wed, 16 Dec 2020 15:59:37 +0800,
> > Christian König wrote:
> >> [SNIP]
> > > Hi, Christian. I'm not sure why this change is a hack here. I cannot
> > > see the problem and will be grateful if you give more explanations.
>  __memset is supposed to work on those addresses, otherwise you can't use 
>  the
>  e8860 on your arm64 system.
> >>> If __memset is supposed to work on those addresses, why is this
> >>> commit (https://github.com/torvalds/linux/commit/ba0b2275a6781b2f3919d931d63329b5548f6d5f)
> >>> needed? (I also notice drm/radeon didn't take this change, though.)
> >>> Just out of curiosity.
> >> We generally accept those patches as cleanup in the kernel with the hope 
> >> that we
> >> can find a way to work around the userspace restrictions.
> > What's the userspace restriction here? mmap device memory?
> 
> Yes, exactly that.
> 
> >> But when you also have this issue in userspace then there isn't much we 
> >> can do
> >> for you.
> >> 
>  Replacing the direct write in the kernel with calls to writel() or
>  memset_io() will fix that temporarily, but you have a more general problem
>  here.
> >>> I cannot see what the more general problem is here :( Do you mean
> >>> performance?
> >> No, not performance. See standards like OpenGL, Vulkan as well as VA-API 
> >> and
> >> VDPAU require that you can mmap() device memory and execute memset/memcpy 
> >> on the
> >> memory from userspace.
> >> 
> >> If your ARM base board can't do that for some reason then you can't use the
> >> hardware with that board.
> > Good to know, thanks! BTW, have you ever seen or heard of boards like mine
> > which cannot mmap device memory from userspace correctly?
> 
> Unfortunately yes. We haven't been able to figure out what exactly goes wrong 
> in
> those cases.

Ok. one more question: only e8860 or all radeon cards have this issue?
 
> >> For amdgpu I suggest that we allocate the UVD message in GTT instead 
> >> of VRAM
> >> since we don't have the hardware restriction for that on the new 
> >> generations.
> >> 
> > Thanks, I will try to dig deeper. But what does "hardware restriction"
> > mean here? I'm not familiar with the video driver stack and AMD GPUs,
> > sorry.
>  On older hardware (AGP days) the buffer had to be in VRAM (MMIO) memory, 
>  but on
>  modern system GTT (system memory) works as well.
> >>> IIUC, the e8860 can use amdgpu (I use radeon now) because its device ID 6822 is
> >>> in amdgpu's table. But I cannot tell whether the e8860 has an IOMMU, and I
> >>> cannot find an IOMMU in lspci, so the graphics translation table may not work
> >>> here?
> >> That is not related to IOMMU. IOMMU is a feature of the CPU/motherboard. 
> >> This is
> >> implemented using GTT, e.g. the VM page tables inside the GPU.
> >> 
> >> And yes it should work I will prepare a patch for it.
> > I think you mean mmu :)
> 
> No, I really meant IOMMU.
> 
> > Refer to wikipedia: 
> > https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_management_unit#:~:text=In%20computing%2C%20an%20input%E2%80%93output,bus%20to%20the%20main%20memory.
> > 
> >  In computing, an input–output memory management unit (IOMMU) is a 
> > memory management unit (MMU) that connects a direct-memory-access–capable 
> > (DMA-capable) I/O bus to the main memory. Like a traditional MMU, which 
> > translates CPU-visible virtual addresses to physical addresses, the IOMMU 
> > maps device-visible virtual addresses (also called device addresses or I/O 
> > addresses in this context) to physical addresses. Some units also provide 
> > memory protection from faulty or malicious devices.
> >  An example IOMMU is the graphics address remapping table (GART) used 
> > by AGP and PCI Express graphics cards on Intel Architecture and AMD 
> > computers.
> 
> Maybe somebody should clarify the

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-18 Thread Chen Li

On Thu, 17 Dec 2020 21:45:06 +0800,
Robin Murphy wrote:
> 
> On 2020-12-17 10:25, Christian König wrote:
> > On 17.12.20 at 02:07, Chen Li wrote:
> >> On Wed, 16 Dec 2020 22:19:11 +0800,
> >> Christian König wrote:
> >>> On 16.12.20 at 14:48, Chen Li wrote:
>  On Wed, 16 Dec 2020 15:59:37 +0800,
>  Christian König wrote:
> > [SNIP]
>  Hi, Christian. I'm not sure why this change is a hack here. I cannot see
>  the problem and will be grateful if you give more explanations.
> >>> __memset is supposed to work on those addresses, otherwise you can't use 
> >>> the
> >>> e8860 on your arm64 system.
> >> If __memset is supposed to work on those addresses, why is this
> >> commit (https://github.com/torvalds/linux/commit/ba0b2275a6781b2f3919d931d63329b5548f6d5f)
> >> needed? (I also notice drm/radeon didn't take this change, though.)
> >> Just out of curiosity.
> > 
> > We generally accept those patches as cleanup in the kernel with the hope 
> > that
> > we can find a way to work around the userspace restrictions.
> > 
> > But when you also have this issue in userspace then there isn't much we can 
> > do
> > for you.
> > 
> >>> Replacing the direct write in the kernel with calls to writel() or
> >>> memset_io() will fix that temporarily, but you have a more general problem
> >>> here.
> >> I cannot see what the more general problem is here :( Do you mean performance?
> > 
> > No, not performance. See standards like OpenGL, Vulkan as well as VA-API and
> > VDPAU require that you can mmap() device memory and execute memset/memcpy on
> > the memory from userspace.
> > 
> > If your ARM base board can't do that for some reason then you can't use the
> > hardware with that board.
> 
> If the VRAM lives in a prefetchable PCI bar then on most sane Arm-based 
> systems
> I believe it should be able to mmap() to userspace with the Normal memory 
> type,
> where unaligned accesses and such are allowed, as opposed to the Device memory
> type intended for MMIO mappings, which has more restrictions but stricter
> ordering guarantees.
 
Hi, Robin. I cannot understand why it allows unaligned accesses. A prefetchable PCI
BAR should also be MMIO, and accesses will end up in device memory, so why does
this allow unaligned access?
> Regardless of what happens elsewhere though, if something is mapped *into the
> kernel* with ioremap(), then it is fundamentally wrong per the kernel memory
> model to reference that mapping directly without using I/O accessors. That is
> not specific to any individual architecture, and Sparse should be screaming
> about it already. I guess in this case the UVD code needs to pay more 
> attention
> to whether radeon_bo_kmap() ends up going via ttm_bo_ioremap() or not.
> 
> (I'm assuming the initial fault was memset() with 0 trying to perform "DC ZVA"
> on a Device-type mapping from ioremap() - FYI a stacktrace on its own without
> the rest of the error dump showing what actually triggered it isn't overly
> useful)
> 
> Robin.
Why might it be 'DC ZVA'? I'm not sure of the PC at the initial kernel fault in
memset, but I captured the userspace crash PC: an stp (128-bit) or a NEON str
(also 128-bit) to the render node (/dev/dri/renderD128).
 
> > For amdgpu I suggest that we allocate the UVD message in GTT instead of
> > VRAM
> > since we don't have the hardware restriction for that on the new
> > generations.
> > 
>  Thanks, I will try to dig deeper. But what does "hardware restriction"
>  mean here? I'm not familiar with the video driver stack and AMD GPUs,
>  sorry.
> >>> On older hardware (AGP days) the buffer had to be in VRAM (MMIO) memory, 
> >>> but
> >>> on
> >>> modern system GTT (system memory) works as well.
> >> IIUC, the e8860 can use amdgpu (I use radeon now) because its device ID 6822 is
> >> in amdgpu's table. But I cannot tell whether the e8860 has an IOMMU, and I
> >> cannot find an IOMMU in lspci, so the graphics translation table may not work
> >> here?
> > 
> > That is not related to IOMMU. IOMMU is a feature of the CPU/motherboard. 
> > This
> > is implemented using GTT, e.g. the VM page tables inside the GPU.
> > 
> > And yes it should work I will prepare a patch for it.
> > 
> > BTW: How does userspace work on arm64 then? The driver stack usually 
> > only
> > works
> > if mmio can be mapped directly.
>  I also posted two userspace issues on mesa, and you may be interested in
>  them:
> 
>

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-18 Thread Chen Li

On Thu, 17 Dec 2020 18:25:11 +0800,
Christian König wrote:
> 
> On 17.12.20 at 02:07, Chen Li wrote:
> > On Wed, 16 Dec 2020 22:19:11 +0800,
> > Christian König wrote:
> >> On 16.12.20 at 14:48, Chen Li wrote:
> >>> On Wed, 16 Dec 2020 15:59:37 +0800,
> >>> Christian König wrote:
>  [SNIP]
> >>> Hi, Christian. I'm not sure why this change is a hack here. I cannot see
> >>> the problem and will be grateful if you give more explanations.
> >> __memset is supposed to work on those addresses, otherwise you can't use 
> >> the
> >> e8860 on your arm64 system.
> > If __memset is supposed to work on those addresses, why is this
> > commit (https://github.com/torvalds/linux/commit/ba0b2275a6781b2f3919d931d63329b5548f6d5f)
> > needed? (I also notice drm/radeon didn't take this change, though.)
> > Just out of curiosity.
> 
> We generally accept those patches as cleanup in the kernel with the hope that 
> we
> can find a way to work around the userspace restrictions.
What's the userspace restriction here? mmap device memory?
> 
> But when you also have this issue in userspace then there isn't much we can do
> for you.
> 
> >> Replacing the direct write in the kernel with calls to writel() or
> >> memset_io() will fix that temporarily, but you have a more general problem
> >> here.
> > I cannot see what the more general problem is here :( Do you mean performance?
> 
> No, not performance. See standards like OpenGL, Vulkan as well as VA-API and
> VDPAU require that you can mmap() device memory and execute memset/memcpy on 
> the
> memory from userspace.
> 
> If your ARM base board can't do that for some reason then you can't use the hardware
> with that board.

Good to know, thanks! BTW, have you ever seen or heard of boards like mine which
cannot mmap device memory from userspace correctly?
> 
>  For amdgpu I suggest that we allocate the UVD message in GTT instead of 
>  VRAM
>  since we don't have the hardware restriction for that on the new 
>  generations.
>  
> >>> Thanks, I will try to dig deeper. But what does "hardware restriction"
> >>> mean here? I'm not familiar with the video driver stack and AMD GPUs,
> >>> sorry.
> >> On older hardware (AGP days) the buffer had to be in VRAM (MMIO) memory, 
> >> but on
> >> modern system GTT (system memory) works as well.
> > IIUC, e8860 can use amdgpu(I use radeon now) beause its device id 6822 is 
> > in amdgpu's table. But I cannot tell whether e8860 has iommu, and I cannot 
> > find iommu from lspci, so graphics translation table may not work here?
> 
> That is not related to IOMMU. IOMMU is a feature of the CPU/motherboard. This
> is implemented using GTT, i.e. the VM page tables inside the GPU.
> 
> And yes, it should work. I will prepare a patch for it.

I think you mean mmu :) Refer to wikipedia: 
https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_management_unit#:~:text=In%20computing%2C%20an%20input%E2%80%93output,bus%20to%20the%20main%20memory.

In computing, an input–output memory management unit (IOMMU) is a memory 
management unit (MMU) that connects a direct-memory-access–capable 
(DMA-capable) I/O bus to the main memory. Like a traditional MMU, which 
translates CPU-visible virtual addresses to physical addresses, the IOMMU maps 
device-visible virtual addresses (also called device addresses or I/O addresses 
in this context) to physical addresses. Some units also provide memory 
protection from faulty or malicious devices.
An example IOMMU is the graphics address remapping table (GART) used by AGP 
and PCI Express graphics cards on Intel Architecture and AMD computers.

GART should be another abbreviation for 
GTT (https://en.wikipedia.org/wiki/Graphics_address_remapping_table):

The graphics address remapping table (GART),[1] also known as the graphics 
aperture remapping table,[2] or graphics translation table (GTT),[3] is an I/O 
memory management unit (IOMMU) used by Accelerated Graphics Port (AGP) and PCI 
Express (PCIe) graphics cards. 

> 
>  BTW: How does userspace work on arm64 then? The driver stack usually
>  only works if mmio can be mapped directly.
> >>> I also posted two userspace issues on mesa, and you may be interested in
> >>> them:
> >>>
> >>>

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-18 Thread Christian König


On 18.12.20 at 04:51, Chen Li wrote:

[SNIP]

If your ARM-based board can't do that for some reason, then you can't use the
hardware with that board.

Good to know, thanks! BTW, have you ever seen or heard of boards like mine which 
cannot mmap device memory correctly from userspace?

Unfortunately yes. We haven't been able to figure out what exactly goes wrong in
those cases.

OK, one more question: does only the e8860 have this issue, or all radeon cards?


This applies to all hardware with dedicated memory which needs to be 
mapped to userspace.


That includes all graphics hardware from AMD as well as NVidia and 
probably a whole bunch of other PCIe devices.



  The graphics address remapping table (GART),[1] also known as the 
graphics aperture remapping table,[2] or graphics translation table (GTT),[3] 
is an I/O memory management unit (IOMMU) used by Accelerated Graphics Port 
(AGP) and PCI Express (PCIe) graphics cards.

GART or GTT refers to the translation tables graphics hardware uses to access
system memory.

Something like 15 years ago we used the IOMMU functionality from AGP to
implement that. But modern hardware (PCIe) uses some specialized hardware in the
GPU for that.

Regards,
Christian.




Good to know, thanks! So modern GART/GTT is like a TLB, and the IOMMU is focused on 
translating addresses and does not manage the TLB.


You are getting closer in your understanding, but the TLB is the 
Translation lookaside buffer. Basically a cache of recent VM 
translations which is present in all page table translations (GART, 
IOMMU, CPU etc...).


The key difference is where the page table translation happens on modern 
hardware:
1. For the GART/GTT it is inside the GPU to translate between GPU 
internal and bus addresses.
2. For IOMMU it is inside the root complex of the PCIe to translate 
between bus addresses and physical addresses.
3. For CPU page tables it is inside the CPU core to translate between 
virtual addresses and physical addresses.
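
The GPU-side path through stages 1 and 2 above can be sketched as a toy model in C. This is only an illustration: the single-level table, the 4 KiB page size and every `toy_*` name are assumptions made for the example, not how a real GART or IOMMU is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT   12u                                  /* assumed 4 KiB pages */
#define TOY_PAGE_MASK    ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)
#define TOY_INVALID_ADDR UINT64_MAX

/* Hypothetical single-level table: frames[i] is the output page frame
 * number for input page i. Real hardware uses multi-level tables. */
struct toy_table {
    const uint64_t *frames;
    size_t num_pages;
};

/* One translation stage: swap the page number, keep the page offset. */
uint64_t toy_translate(const struct toy_table *t, uint64_t addr)
{
    assert(t != NULL && t->frames != NULL);
    uint64_t page = addr >> TOY_PAGE_SHIFT;
    if (page >= t->num_pages)
        return TOY_INVALID_ADDR;
    return (t->frames[page] << TOY_PAGE_SHIFT) | (addr & TOY_PAGE_MASK);
}

/* GPU-internal address -> bus address (GART, inside the GPU),
 * then bus address -> physical address (IOMMU, in the root complex). */
uint64_t toy_gpu_to_physical(const struct toy_table *gart,
                             const struct toy_table *iommu,
                             uint64_t gpu_addr)
{
    uint64_t bus = toy_translate(gart, gpu_addr);
    if (bus == TOY_INVALID_ADDR)
        return TOY_INVALID_ADDR;
    return toy_translate(iommu, bus);
}
```

The CPU page tables (stage 3) would be a third, independent table of the same shape that maps virtual addresses to physical addresses. It is not chained with the other two.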


Regards,
Christian.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-17 Thread Lucas Stach

On Thursday, 17.12.2020 at 15:02 +0100, Christian König wrote:
> On 17.12.20 at 14:45, Robin Murphy wrote:
> > On 2020-12-17 10:25, Christian König wrote:
> > > On 17.12.20 at 02:07, Chen Li wrote:
> > > > On Wed, 16 Dec 2020 22:19:11 +0800,
> > > > Christian König wrote:
> > > > > On 16.12.20 at 14:48, Chen Li wrote:
> > > > > > On Wed, 16 Dec 2020 15:59:37 +0800,
> > > > > > Christian König wrote:
> > > > > > > [SNIP]
> > > > > > Hi, Christian. I'm not sure why this change is a hack here. I
> > > > > > cannot see the problem and will be grateful if you can explain
> > > > > > more.
> > > > > __memset is supposed to work on those addresses, otherwise you
> > > > > can't use the e8860 on your arm64 system.
> > > > If __memset is supposed to work on those addresses, why is this commit
> > > > (https://github.com/torvalds/linux/commit/ba0b2275a6781b2f3919d931d63329b5548f6d5f)
> > > > needed? (I also noticed drm/radeon didn't take this change, though.)
> > > > Just out of curiosity.
> > > 
> > > We generally accept those patches as cleanup in the kernel with the 
> > > hope that we can find a way to work around the userspace restrictions.
> > > 
> > > But when you also have this issue in userspace then there isn't much 
> > > we can do for you.
> > > 
> > > > > Replacing the direct write in the kernel with calls to writel() or
> > > > > memset_io() will fix that temporarily, but you have a more general
> > > > > problem here.
> > > > I cannot see what the more general problem is here :( Do you mean
> > > > performance?
> > > 
> > > No, not performance. Standards like OpenGL, Vulkan, VA-API and VDPAU
> > > require that you can mmap() device memory and execute memset/memcpy
> > > on the memory from userspace.
> > > 
> > > If your ARM-based board can't do that for some reason, then you can't
> > > use the hardware with that board.
> > 
> > If the VRAM lives in a prefetchable PCI bar then on most sane 
> > Arm-based systems I believe it should be able to mmap() to userspace 
> > with the Normal memory type, where unaligned accesses and such are 
> > allowed, as opposed to the Device memory type intended for MMIO 
> > mappings, which has more restrictions but stricter ordering guarantees.
> 
> Do you have some background why some ARM boards fail with that?
> 
> We had a couple of reports that memset/memcpy fail in userspace (usually 
> system just spontaneously reboots or becomes unresponsive), but so far 
> nobody could tell us why that happens?

Optimized memset/memcpy uses unaligned access in some cases, where
handling unaligned start/end addresses would cause more instructions to
be used otherwise.

If the device memory isn't mapped at least writecombined (bufferable in
ARM speak) into userspace, those unaligned accesses are not allowed and
will cause traps on the hardware level. Normally this should just lead
to the process making the access getting killed with a SIGBUS, but
maybe some systems handle those traps wrong on a firmware level? If the
kernel makes such an unaligned access then the kernel will fault, which
normally means halting the kernel.
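
To illustrate the alignment point, here is a minimal userspace sketch of a fill routine that only ever issues naturally aligned stores, which is roughly the constraint a Device-type mapping imposes. `fill_aligned_only` is a hypothetical name used for this example. The kernel helper for this job is memset_io(), whose actual implementation differs per architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Fill n bytes at dst with the byte value c using only naturally
 * aligned stores: single bytes up to the next 8-byte boundary, then
 * aligned 64-bit stores, then single bytes for the tail.
 * An optimized memset() may instead use overlapping unaligned stores
 * for the head and tail, which is what faults on such mappings.
 */
void fill_aligned_only(volatile void *dst, int c, size_t n)
{
    assert(dst != NULL || n == 0);

    volatile uint8_t *p = dst;
    uint64_t pattern = UINT64_C(0x0101010101010101) * (uint8_t)c;

    /* Head: byte stores until p is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 7u) != 0) {
        *p++ = (uint8_t)c;
        n--;
    }

    /* Body: aligned 64-bit stores. */
    while (n >= 8) {
        *(volatile uint64_t *)p = pattern;
        p += 8;
        n -= 8;
    }

    /* Tail: remaining bytes, one at a time. */
    while (n > 0) {
        *p++ = (uint8_t)c;
        n--;
    }
}
```

The cost is a few extra stores for unaligned head and tail bytes. In exchange, every access is one the hardware accepts regardless of the mapping's memory type.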

Regards,
Lucas



Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-17 Thread Christian König

On 17.12.20 at 14:37, Chen Li wrote:

On Thu, 17 Dec 2020 18:25:11 +0800,
Christian König wrote:

On 17.12.20 at 02:07, Chen Li wrote:

On Wed, 16 Dec 2020 22:19:11 +0800,
Christian König wrote:

On 16.12.20 at 14:48, Chen Li wrote:

On Wed, 16 Dec 2020 15:59:37 +0800,
Christian König wrote:

[SNIP]

Hi, Christian. I'm not sure why this change is a hack here. I cannot see the
problem and will be grateful if you can explain more.

__memset is supposed to work on those addresses, otherwise you can't use the
e8860 on your arm64 system.

If __memset is supposed to work on those addresses, why is this commit
(https://github.com/torvalds/linux/commit/ba0b2275a6781b2f3919d931d63329b5548f6d5f)
needed? (I also noticed drm/radeon didn't take this change, though.) Just out of curiosity.

We generally accept those patches as cleanup in the kernel with the hope that we
can find a way to work around the userspace restrictions.

What's the userspace restriction here? mmap device memory?

Yes, exactly that.

But when you also have this issue in userspace then there isn't much we can do
for you.

Replacing the direct write in the kernel with calls to writel() or
memset_io() will fix that temporarily, but you have a more general problem here.

I cannot see what the more general problem is here :( Do you mean performance?

No, not performance. Standards like OpenGL, Vulkan, VA-API and VDPAU
require that you can mmap() device memory and execute memset/memcpy on the
memory from userspace.

If your ARM-based board can't do that for some reason, then you can't use the
hardware with that board.

Good to know, thanks! BTW, have you ever seen or heard of boards like mine which
cannot mmap device memory correctly from userspace?

Unfortunately yes. We haven't been able to figure out what exactly goes
wrong in those cases.

For amdgpu I suggest that we allocate the UVD message in GTT instead of VRAM
since we don't have the hardware restriction for that on the new generations.

Thanks, I will try to dig deeper. But what does the "hardware restriction"
mean here? I'm not familiar with the video driver stack and AMD GPUs, sorry.

On older hardware (AGP days) the buffer had to be in VRAM (MMIO) memory, but on
modern systems GTT (system memory) works as well.

IIUC, the e8860 can use amdgpu (I use radeon now) because its device ID 6822 is
in amdgpu's table. But I cannot tell whether the e8860 has an IOMMU, and I cannot
find an IOMMU in lspci, so the graphics translation table may not work here?

That is not related to IOMMU. IOMMU is a feature of the CPU/motherboard. This is
implemented using GTT, i.e. the VM page tables inside the GPU.

And yes, it should work. I will prepare a patch for it.

I think you mean mmu :)

No, I really meant IOMMU.

Refer to wikipedia:
https://en.wikipedia.org/wiki/Input%E2%80%93output_memory_management_unit#:~:text=In%20computing%2C%20an%20input%E2%80%93output,bus%20to%20the%20main%20memory.

In computing, an input–output memory management unit (IOMMU) is a memory
management unit (MMU) that connects a direct-memory-access–capable
(DMA-capable) I/O bus to the main memory. Like a traditional MMU, which
translates CPU-visible virtual addresses to physical addresses, the IOMMU maps
device-visible virtual addresses (also called device addresses or I/O addresses
in this context) to physical addresses. Some units also provide memory
protection from faulty or malicious devices.
An example IOMMU is the graphics address remapping table (GART) used by
AGP and PCI Express graphics cards on Intel Architecture and AMD computers.

Maybe somebody should clarify the wikipedia article a bit since this is
too general and misleading.

The key difference is that today IOMMU usually refers to the MMU block
in the PCIe root complex of the CPU.

GART should be another abbreviation for

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-17 Thread Christian König

On 17.12.20 at 14:45, Robin Murphy wrote:

On 2020-12-17 10:25, Christian König wrote:

On 17.12.20 at 02:07, Chen Li wrote:

On Wed, 16 Dec 2020 22:19:11 +0800,
Christian König wrote:

On 16.12.20 at 14:48, Chen Li wrote:

On Wed, 16 Dec 2020 15:59:37 +0800,
Christian König wrote:

We generally accept those patches as cleanup in the kernel with the
hope that we can find a way to work around the userspace restrictions.

But when you also have this issue in userspace then there isn't much
we can do for you.

No, not performance. See standards like OpenGL, Vulkan as well as
VA-API and VDPAU require that you can mmap() device memory and
execute memset/memcpy on the memory from userspace.

If your ARM-based board can't do that for some reason, then you can't use the
hardware with that board.

If the VRAM lives in a prefetchable PCI bar then on most sane
Arm-based systems I believe it should be able to mmap() to userspace
with the Normal memory type, where unaligned accesses and such are
allowed, as opposed to the Device memory type intended for MMIO
mappings, which has more restrictions but stricter ordering guarantees.

Do you have some background why some ARM boards fail with that?

We had a couple of reports that memset/memcpy fail in userspace (usually
system just spontaneously reboots or becomes unresponsive), but so far
nobody could tell us why that happens?

Regardless of what happens elsewhere though, if something is mapped
*into the kernel* with ioremap(), then it is fundamentally wrong per
the kernel memory model to reference that mapping directly without
using I/O accessors. That is not specific to any individual
architecture, and Sparse should be screaming about it already. I guess
in this case the UVD code needs to pay more attention to whether
radeon_bo_kmap() ends up going via ttm_bo_ioremap() or not.

Yes, exactly. That's why we already have memcpy_fromio()/memcpy_toio()
to upload the firmware and save the state on suspend/resume.

It's just that in this case here we also have IO memory because some 15+
year old AGP-based hardware doesn't work when you put it in system
memory :)

So pointing that out is correct and I'm going to clean that up now.
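
The rule Robin describes could look roughly like this in practice. The sketch is plain userspace C with hypothetical stand-ins (`toy_mapping`, `toy_memset_io`, `toy_clear`). In the kernel, the I/O branch would call memset_io() and the flag would come from how the buffer object was mapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a kernel buffer mapping. */
struct toy_mapping {
    void *vaddr;
    bool is_iomem;  /* true if the mapping came from an ioremap()-style path */
};

/* Stand-in for memset_io(): byte-wide volatile stores only, so no
 * unaligned or cache-maintenance tricks are ever applied to dst. */
void toy_memset_io(volatile void *dst, int c, size_t n)
{
    volatile uint8_t *p = dst;
    while (n-- > 0)
        *p++ = (uint8_t)c;
}

/* Zero the first n bytes of the mapping with the accessor that matches
 * its type. Returns true if the I/O path was taken. */
bool toy_clear(const struct toy_mapping *m, size_t n)
{
    assert(m != NULL && m->vaddr != NULL);

    if (m->is_iomem) {
        toy_memset_io(m->vaddr, 0, n);
        return true;
    }
    memset(m->vaddr, 0, n);  /* ordinary system memory: plain memset is fine */
    return false;
}
```

The branch matters because the same driver code can end up with either kind of mapping, depending on whether the buffer was placed in VRAM or in system memory.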

Regards,
Christian.

(I'm assuming the initial fault was memset() with 0 trying to perform
"DC ZVA" on a Device-type mapping from ioremap() - FYI a stacktrace on
its own without the rest of the error dump showing what actually
triggered it isn't overly useful)

Robin.


Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-17 Thread Robin Murphy

On 2020-12-17 10:25, Christian König wrote:

On 17.12.20 at 02:07, Chen Li wrote:

On Wed, 16 Dec 2020 22:19:11 +0800,
Christian König wrote:

On 16.12.20 at 14:48, Chen Li wrote:

On Wed, 16 Dec 2020 15:59:37 +0800,
Christian König wrote:

[SNIP]
Hi, Christian. I'm not sure why this change is a hack here. I cannot
see the problem and will be grateful if you can explain more.
__memset is supposed to work on those addresses, otherwise you can't
use the e8860 on your arm64 system.
If __memset is supposed to work on those addresses, why is this commit
(https://github.com/torvalds/linux/commit/ba0b2275a6781b2f3919d931d63329b5548f6d5f)
needed? (I also noticed drm/radeon didn't take this change, though.)
Just out of curiosity.

We generally accept those patches as cleanup in the kernel with the hope
that we can find a way to work around the userspace restrictions.

But when you also have this issue in userspace then there isn't much we
can do for you.

Replacing the direct write in the kernel with calls to writel() or
memset_io() will fix that temporarily, but you have a more general
problem here.

I cannot see what the more general problem is here :( Do you mean performance?

No, not performance. Standards like OpenGL, Vulkan, VA-API and VDPAU
require that you can mmap() device memory and execute memset/memcpy on
the memory from userspace.

If your ARM-based board can't do that for some reason, then you can't use
the hardware with that board.

If the VRAM lives in a prefetchable PCI bar then on most sane Arm-based
systems I believe it should be able to mmap() to userspace with the
Normal memory type, where unaligned accesses and such are allowed, as
opposed to the Device memory type intended for MMIO mappings, which has
more restrictions but stricter ordering guarantees.

Regardless of what happens elsewhere though, if something is mapped
*into the kernel* with ioremap(), then it is fundamentally wrong per the
kernel memory model to reference that mapping directly without using I/O
accessors. That is not specific to any individual architecture, and
Sparse should be screaming about it already. I guess in this case the
UVD code needs to pay more attention to whether radeon_bo_kmap() ends up
going via ttm_bo_ioremap() or not.

(I'm assuming the initial fault was memset() with 0 trying to perform
"DC ZVA" on a Device-type mapping from ioremap() - FYI a stacktrace on
its own without the rest of the error dump showing what actually
triggered it isn't overly useful)

Robin.

For amdgpu I suggest that we allocate the UVD message in GTT instead of
VRAM, since we don't have the hardware restriction for that on the new
generations.

Thanks, I will try to dig deeper. But what does the "hardware
restriction" mean here? I'm not familiar with the video driver stack
and AMD GPUs, sorry.
On older hardware (AGP days) the buffer had to be in VRAM (MMIO)
memory, but on modern systems GTT (system memory) works as well.
IIUC, the e8860 can use amdgpu (I use radeon now) because its device ID
6822 is in amdgpu's table. But I cannot tell whether the e8860 has an
IOMMU, and I cannot find an IOMMU in lspci, so the graphics translation
table may not work here?

That is not related to IOMMU. IOMMU is a feature of the CPU/motherboard.
This is implemented using GTT, i.e. the VM page tables inside the GPU.

And yes, it should work. I will prepare a patch for it.

BTW: How does userspace work on arm64 then? The driver stack
usually only works if mmio can be mapped directly.
I also posted two userspace issues on mesa, and you may be interested
in them:

https://gitlab.freedesktop.org/mesa/mesa/-/issues/3954

https://gitlab.freedesktop.org/mesa/mesa/-/issues/3951

I pasted some of the userspace virtual memory map there. (And the two
problems have bothered me for quite a long time.)

I don't really see a solution for those problems.

Re: [PATCH] drm/[amdgpu|radeon]: fix memset on io mem

2020-12-17 Thread Christian König

On 17.12.20 at 02:07, Chen Li wrote:

On Wed, 16 Dec 2020 22:19:11 +0800,
Christian König wrote:

On 16.12.20 at 14:48, Chen Li wrote:

On Wed, 16 Dec 2020 15:59:37 +0800,
Christian König wrote:

[SNIP]

Hi, Christian. I'm not sure why this change is a hack here. I cannot see the
problem and will be grateful if you can explain more.

__memset is supposed to work on those addresses, otherwise you can't use the
e8860 on your arm64 system.

If __memset is supposed to work on those addresses, why is this commit
(https://github.com/torvalds/linux/commit/ba0b2275a6781b2f3919d931d63329b5548f6d5f)
needed? (I also noticed drm/radeon didn't take this change, though.) Just out of curiosity.