Re: [Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-26 Thread Stéphane Marchesin
On Mon, May 26, 2014 at 7:42 PM, Alexandre Courbot  wrote:
> On Tue, May 27, 2014 at 10:07 AM, Stéphane Marchesin
>  wrote:
>> On Mon, May 26, 2014 at 5:02 PM, Alexandre Courbot  wrote:
>>> On Mon, May 26, 2014 at 6:21 PM, Lucas Stach  wrote:
 Am Montag, den 26.05.2014, 09:45 +0300 schrieb Terje Bergström:
> On 23.05.2014 17:40, Alex Courbot wrote:
> > On 05/23/2014 06:59 PM, Lucas Stach wrote:
> > So after checking with more knowledgeable people, it turns out this is
> > the expected behavior on ARM and BAR regions should be mapped uncached
> > on GK20A. All the more reasons to avoid using the BAR at all.
>
> This is actually specific to Tegra.
>
> >> You may want to make yourself aware of all the quirks required for
> >> sharing memory between the GPU and CPU on an ARM host. I think there 
> >> are
> >> far more involved than what you see now and writing an replacement for
> >> TTM will not be an easy task.
> >>
> >> Doing away with the concept of two memory areas will not get you to a
> >> single unified address space. You would have to deal with things like
> >> not being able to change the caching state of pages in the systems
> >> lowmem yourself. You will still have to deal with remapping pages that
> >> aren't currently visible to the CPU (ok this is not an issue on Jetson
> >> right now as it only has 2GB of RAM), because it's in systems highmem,
> >> or even in a different LPAE area.
> >>
> >> You really want to be sure you are aware of all the consequences of
> >> this, before considering this task.
> >
> > Yep, that's why I am seeking advice here. My first hope is that with a
> > few tweaks we will be able to keep using TTM and the current nouveau_bo
> > implementation. But unless I missed something this is not going to be 
> > easy.
> >
> > We can also use something like the patch I originally sent to make it
> > work, although not with good performance, on GK20A. Not very graceful,
> > but it will allow applications to run.
> >
> > In the long run though, we will want to achieve better performance, and
> > it seems like a BO implementation targeted at UMA devices would also be
> > beneficial to quite a few desktop GPUs. So as tricky as it may be I'm
> > interested in gathering thoughts and why not giving it a first try with
> > GK20A, even if it imposes some limitations like having buffers in lowmem
> > in a first time (we can probably live with this one for a short while,
> > and 64 bits will also be coming to the rescue :))
>
> I don't think lowmem or LPAE is any problem, if the memory manager is
> designed with that in mind. Vast majority of the buffers kernel
> allocates do not need to be touched in kernel space.
>
> Actually I can't think of any buffers that we allocate on behalf of user
> space that would need to be permanently mapped also to kernel. In case
> or relocs only push buffer needs to be temporarily mapped to kernel.
>
> Ultimately even relocs are not necessary if we expose GPU virtual
> addresses directly to user space. But that's another topic.
>
 Nouveau already exposes constant virtual addresses to userspace and
 skips the pushbuf patching when the presumed offset from userspace is
 the same as what the kernel thinks it should be.

 The problem with lowmem on ARM is that you can't unmap those pages from
 the kernel cached mapping. So if you alloc a page, give it to userspace
 and userspace decides to map the page WC you just produced a conflicting
 mapping, which may yield undefined results on ARMv7. You may think this
 is not a problem as you are not touching the kernel cached mapping, but
 in fact it is. The CPUs prefetcher can still access this mapping.
>>>
>>> Why would this memory be mapped into the kernel?
>>
>> On ARM the kernel keeps a linear mapping of lowmem using sections
>> (ARM's version of huge pages). This is always cached, and because the
>> sections are not 4k, it's a pain to remove parts of it. See
>> arch/arm/mm/mmu.c
>
> Ah, are we talking about the directly-mapped low memory region
> starting at PAGE_OFFSET? Ok, it makes sense now, thanks.
>
> But it seems to me that such different mappings can also happen in
> many other scenarios as well, don't they? How is the issue handled in
> these cases?

It depends. A lot of cache controllers actually implement a solution
for that in hardware. For example, I think Tegra2 is one of those
platforms. And then a lot of platforms just ignore the issue completely
because it has a very low probability of occurring.

Stéphane

Re: [Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-26 Thread Alexandre Courbot
On Tue, May 27, 2014 at 10:07 AM, Stéphane Marchesin
 wrote:
> On Mon, May 26, 2014 at 5:02 PM, Alexandre Courbot  wrote:
>> On Mon, May 26, 2014 at 6:21 PM, Lucas Stach  wrote:
>>> Am Montag, den 26.05.2014, 09:45 +0300 schrieb Terje Bergström:
 On 23.05.2014 17:40, Alex Courbot wrote:
 > On 05/23/2014 06:59 PM, Lucas Stach wrote:
 > So after checking with more knowledgeable people, it turns out this is
 > the expected behavior on ARM and BAR regions should be mapped uncached
 > on GK20A. All the more reasons to avoid using the BAR at all.

 This is actually specific to Tegra.

 >> You may want to make yourself aware of all the quirks required for
 >> sharing memory between the GPU and CPU on an ARM host. I think there are
 >> far more involved than what you see now and writing an replacement for
 >> TTM will not be an easy task.
 >>
 >> Doing away with the concept of two memory areas will not get you to a
 >> single unified address space. You would have to deal with things like
 >> not being able to change the caching state of pages in the systems
 >> lowmem yourself. You will still have to deal with remapping pages that
 >> aren't currently visible to the CPU (ok this is not an issue on Jetson
 >> right now as it only has 2GB of RAM), because it's in systems highmem,
 >> or even in a different LPAE area.
 >>
 >> You really want to be sure you are aware of all the consequences of
 >> this, before considering this task.
 >
 > Yep, that's why I am seeking advice here. My first hope is that with a
 > few tweaks we will be able to keep using TTM and the current nouveau_bo
 > implementation. But unless I missed something this is not going to be 
 > easy.
 >
 > We can also use something like the patch I originally sent to make it
 > work, although not with good performance, on GK20A. Not very graceful,
 > but it will allow applications to run.
 >
 > In the long run though, we will want to achieve better performance, and
 > it seems like a BO implementation targeted at UMA devices would also be
 > beneficial to quite a few desktop GPUs. So as tricky as it may be I'm
 > interested in gathering thoughts and why not giving it a first try with
 > GK20A, even if it imposes some limitations like having buffers in lowmem
 > in a first time (we can probably live with this one for a short while,
 > and 64 bits will also be coming to the rescue :))

 I don't think lowmem or LPAE is any problem, if the memory manager is
 designed with that in mind. Vast majority of the buffers kernel
 allocates do not need to be touched in kernel space.

 Actually I can't think of any buffers that we allocate on behalf of user
 space that would need to be permanently mapped also to kernel. In case
 or relocs only push buffer needs to be temporarily mapped to kernel.

 Ultimately even relocs are not necessary if we expose GPU virtual
 addresses directly to user space. But that's another topic.

>>> Nouveau already exposes constant virtual addresses to userspace and
>>> skips the pushbuf patching when the presumed offset from userspace is
>>> the same as what the kernel thinks it should be.
>>>
>>> The problem with lowmem on ARM is that you can't unmap those pages from
>>> the kernel cached mapping. So if you alloc a page, give it to userspace
>>> and userspace decides to map the page WC you just produced a conflicting
>>> mapping, which may yield undefined results on ARMv7. You may think this
>>> is not a problem as you are not touching the kernel cached mapping, but
>>> in fact it is. The CPUs prefetcher can still access this mapping.
>>
>> Why would this memory be mapped into the kernel?
>
> On ARM the kernel keeps a linear mapping of lowmem using sections
> (ARM's version of huge pages). This is always cached, and because the
> sections are not 4k, it's a pain to remove parts of it. See
> arch/arm/mm/mmu.c

Ah, are we talking about the directly-mapped low memory region
starting at PAGE_OFFSET? OK, it makes sense now, thanks.

But it seems to me that such conflicting mappings can also arise in
many other scenarios, can't they? How is the issue handled in those
cases?


Re: [Nouveau] [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-26 Thread Stéphane Marchesin
On Mon, May 26, 2014 at 5:02 PM, Alexandre Courbot  wrote:
> On Mon, May 26, 2014 at 6:21 PM, Lucas Stach  wrote:
>> Am Montag, den 26.05.2014, 09:45 +0300 schrieb Terje Bergström:
>>> On 23.05.2014 17:40, Alex Courbot wrote:
>>> > On 05/23/2014 06:59 PM, Lucas Stach wrote:
>>> > So after checking with more knowledgeable people, it turns out this is
>>> > the expected behavior on ARM and BAR regions should be mapped uncached
>>> > on GK20A. All the more reasons to avoid using the BAR at all.
>>>
>>> This is actually specific to Tegra.
>>>
>>> >> You may want to make yourself aware of all the quirks required for
>>> >> sharing memory between the GPU and CPU on an ARM host. I think there are
>>> >> far more involved than what you see now and writing an replacement for
>>> >> TTM will not be an easy task.
>>> >>
>>> >> Doing away with the concept of two memory areas will not get you to a
>>> >> single unified address space. You would have to deal with things like
>>> >> not being able to change the caching state of pages in the systems
>>> >> lowmem yourself. You will still have to deal with remapping pages that
>>> >> aren't currently visible to the CPU (ok this is not an issue on Jetson
>>> >> right now as it only has 2GB of RAM), because it's in systems highmem,
>>> >> or even in a different LPAE area.
>>> >>
>>> >> You really want to be sure you are aware of all the consequences of
>>> >> this, before considering this task.
>>> >
>>> > Yep, that's why I am seeking advice here. My first hope is that with a
>>> > few tweaks we will be able to keep using TTM and the current nouveau_bo
>>> > implementation. But unless I missed something this is not going to be 
>>> > easy.
>>> >
>>> > We can also use something like the patch I originally sent to make it
>>> > work, although not with good performance, on GK20A. Not very graceful,
>>> > but it will allow applications to run.
>>> >
>>> > In the long run though, we will want to achieve better performance, and
>>> > it seems like a BO implementation targeted at UMA devices would also be
>>> > beneficial to quite a few desktop GPUs. So as tricky as it may be I'm
>>> > interested in gathering thoughts and why not giving it a first try with
>>> > GK20A, even if it imposes some limitations like having buffers in lowmem
>>> > in a first time (we can probably live with this one for a short while,
>>> > and 64 bits will also be coming to the rescue :))
>>>
>>> I don't think lowmem or LPAE is any problem, if the memory manager is
>>> designed with that in mind. Vast majority of the buffers kernel
>>> allocates do not need to be touched in kernel space.
>>>
>>> Actually I can't think of any buffers that we allocate on behalf of user
>>> space that would need to be permanently mapped also to kernel. In case
>>> or relocs only push buffer needs to be temporarily mapped to kernel.
>>>
>>> Ultimately even relocs are not necessary if we expose GPU virtual
>>> addresses directly to user space. But that's another topic.
>>>
>> Nouveau already exposes constant virtual addresses to userspace and
>> skips the pushbuf patching when the presumed offset from userspace is
>> the same as what the kernel thinks it should be.
>>
>> The problem with lowmem on ARM is that you can't unmap those pages from
>> the kernel cached mapping. So if you alloc a page, give it to userspace
>> and userspace decides to map the page WC you just produced a conflicting
>> mapping, which may yield undefined results on ARMv7. You may think this
>> is not a problem as you are not touching the kernel cached mapping, but
>> in fact it is. The CPUs prefetcher can still access this mapping.
>
> Why would this memory be mapped into the kernel?

On ARM the kernel keeps a linear mapping of lowmem using sections
(ARM's version of huge pages). This is always cached, and because the
sections are not 4k, it's a pain to remove parts of it. See
arch/arm/mm/mmu.c
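
A minimal sketch of the distinction at play here (the helper below is
hypothetical, not driver code): lowmem pages always carry a cached alias in
that linear map, while highmem pages are only mapped on demand.

#include <linux/mm.h>
#include <linux/page-flags.h>

/* Hypothetical helper: does this page have a permanent cached alias in the
 * kernel's linear mapping? */
static bool page_has_cached_linear_alias(struct page *page)
{
	/* Lowmem pages are always reachable through the cached linear map;
	 * highmem pages are only mapped on demand via kmap(). */
	return !PageHighMem(page);
}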

That said, I don't think this issue exists on A15 (which is what those
GPUs are paired with), so it's a purely theoretical problem.

Stéphane


Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-26 Thread Alexandre Courbot
On Mon, May 26, 2014 at 6:21 PM, Lucas Stach  wrote:
> Am Montag, den 26.05.2014, 09:45 +0300 schrieb Terje Bergström:
>> On 23.05.2014 17:40, Alex Courbot wrote:
>> > On 05/23/2014 06:59 PM, Lucas Stach wrote:
>> > So after checking with more knowledgeable people, it turns out this is
>> > the expected behavior on ARM and BAR regions should be mapped uncached
>> > on GK20A. All the more reasons to avoid using the BAR at all.
>>
>> This is actually specific to Tegra.
>>
>> >> You may want to make yourself aware of all the quirks required for
>> >> sharing memory between the GPU and CPU on an ARM host. I think there are
>> >> far more involved than what you see now and writing an replacement for
>> >> TTM will not be an easy task.
>> >>
>> >> Doing away with the concept of two memory areas will not get you to a
>> >> single unified address space. You would have to deal with things like
>> >> not being able to change the caching state of pages in the systems
>> >> lowmem yourself. You will still have to deal with remapping pages that
>> >> aren't currently visible to the CPU (ok this is not an issue on Jetson
>> >> right now as it only has 2GB of RAM), because it's in systems highmem,
>> >> or even in a different LPAE area.
>> >>
>> >> You really want to be sure you are aware of all the consequences of
>> >> this, before considering this task.
>> >
>> > Yep, that's why I am seeking advice here. My first hope is that with a
>> > few tweaks we will be able to keep using TTM and the current nouveau_bo
>> > implementation. But unless I missed something this is not going to be easy.
>> >
>> > We can also use something like the patch I originally sent to make it
>> > work, although not with good performance, on GK20A. Not very graceful,
>> > but it will allow applications to run.
>> >
>> > In the long run though, we will want to achieve better performance, and
>> > it seems like a BO implementation targeted at UMA devices would also be
>> > beneficial to quite a few desktop GPUs. So as tricky as it may be I'm
>> > interested in gathering thoughts and why not giving it a first try with
>> > GK20A, even if it imposes some limitations like having buffers in lowmem
>> > in a first time (we can probably live with this one for a short while,
>> > and 64 bits will also be coming to the rescue :))
>>
>> I don't think lowmem or LPAE is any problem, if the memory manager is
>> designed with that in mind. Vast majority of the buffers kernel
>> allocates do not need to be touched in kernel space.
>>
>> Actually I can't think of any buffers that we allocate on behalf of user
>> space that would need to be permanently mapped also to kernel. In case
>> or relocs only push buffer needs to be temporarily mapped to kernel.
>>
>> Ultimately even relocs are not necessary if we expose GPU virtual
>> addresses directly to user space. But that's another topic.
>>
> Nouveau already exposes constant virtual addresses to userspace and
> skips the pushbuf patching when the presumed offset from userspace is
> the same as what the kernel thinks it should be.
>
> The problem with lowmem on ARM is that you can't unmap those pages from
> the kernel cached mapping. So if you alloc a page, give it to userspace
> and userspace decides to map the page WC you just produced a conflicting
> mapping, which may yield undefined results on ARMv7. You may think this
> is not a problem as you are not touching the kernel cached mapping, but
> in fact it is. The CPUs prefetcher can still access this mapping.

Why would this memory be mapped into the kernel? AFAICT Nouveau only
maps fences and (somehow) pushbuffers into the kernel. Other BOs are not
mapped unless I missed something. Or are you talking about VRAM
allocated by dma_alloc_*()? We prevent this from happening by using
the CMA allocator directly (which doesn't create a kernel mapping), which
has its own problems (we cannot compile Nouveau as a module and still use
these allocators). In the future we plan to use the IOMMU to present
sparse memory pages in a way the GPU likes.
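
A rough sketch of the "CMA allocator directly" approach mentioned above,
assuming the 3.15-era dma_alloc_from_contiguous() interface (treat the exact
signatures as approximate). Unlike dma_alloc_coherent(), it hands back pages
without creating an additional kernel mapping, but it is not exported to
modules, which is the limitation noted above:

#include <linux/device.h>
#include <linux/dma-contiguous.h>

/* Allocate physically contiguous pages straight from the CMA area; no new
 * kernel virtual mapping is created for them. */
static struct page *alloc_gpu_pages(struct device *dev, unsigned int npages)
{
	return dma_alloc_from_contiguous(dev, npages, 0 /* order alignment */);
}

static void free_gpu_pages(struct device *dev, struct page *pages,
			   unsigned int npages)
{
	dma_release_from_contiguous(dev, pages, npages);
}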


Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-26 Thread Alexandre Courbot
On Fri, May 23, 2014 at 6:24 PM, Lucas Stach  wrote:
>> The best way to solve this issue would be to not use the BAR at all
>> since the memory behind these objects can be directly accessed by the
>> CPU. As such it would better be mapped using ttm_bo_kmap_ttm()
>> instead. But right now this is clearly not how nouveau_bo.c is written
>> and it does not look like this can easily be done. :/
>
> Yeah, it sounds like we want this shortcut for stolen VRAM
> implementations.

I tried playing a bit with nouveau_bo, and the following hack allows a
simple Mesa program to run to completion... once (the second run leads to
a kernel panic):

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index f00ae18003f1..6317d30a8e1d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -538,7 +538,6 @@ nouveau_bo_init_mem_type(struct ttm_bo_device *bdev, uint32_t type,
 		man->available_caching = TTM_PL_MASK_CACHING;
 		man->default_caching = TTM_PL_FLAG_CACHED;
 		break;
-	case TTM_PL_VRAM:
 		if (nv_device(drm->device)->card_type >= NV_50) {
 			man->func = &nouveau_vram_manager;
 			man->io_reserve_fastpath = false;
@@ -556,6 +555,7 @@ nouveau_bo_init_mem_type(struct ttm_bo_device *bdev, uint32_t type,
 			man->default_caching = TTM_PL_FLAG_WC;
 #endif
 		break;
+	case TTM_PL_VRAM:
 	case TTM_PL_TT:
 		if (nv_device(drm->device)->card_type >= NV_50)
 			man->func = &nouveau_gart_manager;
@@ -1297,6 +1297,7 @@ nouveau_ttm_io_mem_reserve(struct ttm_bo_device *bdev, struct ttm_mem_reg *mem)
 		break;
 		/* fallthrough, tiled memory */
 	case TTM_PL_VRAM:
+		break;
 		mem->bus.offset = mem->start << PAGE_SHIFT;
 		mem->bus.base = nv_device_resource_start(nouveau_dev(dev), 1);
 		mem->bus.is_iomem = true;


Of course it won't go very far this way, but I wonder whether the
principle is not what we would want for UMA devices: not using the VRAM
manager at all, and relying strictly on TT placements for BOs. We will
need to add extra handling for things like tiled memory. Does that look
like the right direction?


Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-26 Thread Lucas Stach
Am Montag, den 26.05.2014, 09:45 +0300 schrieb Terje Bergström:
> On 23.05.2014 17:40, Alex Courbot wrote:
> > On 05/23/2014 06:59 PM, Lucas Stach wrote:
> > So after checking with more knowledgeable people, it turns out this is 
> > the expected behavior on ARM and BAR regions should be mapped uncached 
> > on GK20A. All the more reasons to avoid using the BAR at all.
> 
> This is actually specific to Tegra.
> 
> >> You may want to make yourself aware of all the quirks required for
> >> sharing memory between the GPU and CPU on an ARM host. I think there are
> >> far more involved than what you see now and writing an replacement for
> >> TTM will not be an easy task.
> >>
> >> Doing away with the concept of two memory areas will not get you to a
> >> single unified address space. You would have to deal with things like
> >> not being able to change the caching state of pages in the systems
> >> lowmem yourself. You will still have to deal with remapping pages that
> >> aren't currently visible to the CPU (ok this is not an issue on Jetson
> >> right now as it only has 2GB of RAM), because it's in systems highmem,
> >> or even in a different LPAE area.
> >>
> >> You really want to be sure you are aware of all the consequences of
> >> this, before considering this task.
> > 
> > Yep, that's why I am seeking advice here. My first hope is that with a 
> > few tweaks we will be able to keep using TTM and the current nouveau_bo 
> > implementation. But unless I missed something this is not going to be easy.
> > 
> > We can also use something like the patch I originally sent to make it 
> > work, although not with good performance, on GK20A. Not very graceful, 
> > but it will allow applications to run.
> > 
> > In the long run though, we will want to achieve better performance, and 
> > it seems like a BO implementation targeted at UMA devices would also be 
> > beneficial to quite a few desktop GPUs. So as tricky as it may be I'm 
> > interested in gathering thoughts and why not giving it a first try with 
> > GK20A, even if it imposes some limitations like having buffers in lowmem 
> > in a first time (we can probably live with this one for a short while, 
> > and 64 bits will also be coming to the rescue :))
> 
> I don't think lowmem or LPAE is any problem, if the memory manager is
> designed with that in mind. Vast majority of the buffers kernel
> allocates do not need to be touched in kernel space.
> 
> Actually I can't think of any buffers that we allocate on behalf of user
> space that would need to be permanently mapped also to kernel. In case
> or relocs only push buffer needs to be temporarily mapped to kernel.
> 
> Ultimately even relocs are not necessary if we expose GPU virtual
> addresses directly to user space. But that's another topic.
> 
Nouveau already exposes constant virtual addresses to userspace and
skips the pushbuf patching when the presumed offset from userspace is
the same as what the kernel thinks it should be.
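
For reference, the userspace side of that mechanism is the presumed offset
carried by each per-BO entry of the pushbuf ioctl, roughly as below
(simplified from include/uapi/drm/nouveau_drm.h; treat the field details as
an approximation):

struct drm_nouveau_gem_pushbuf_bo_presumed {
	uint32_t valid;   /* userspace's presumed placement can be trusted */
	uint32_t domain;  /* VRAM or GART */
	uint64_t offset;  /* GPU virtual address userspace assumed */
};

struct drm_nouveau_gem_pushbuf_bo {
	uint64_t user_priv;
	uint32_t handle;
	uint32_t read_domains;
	uint32_t write_domains;
	uint32_t valid_domains;
	struct drm_nouveau_gem_pushbuf_bo_presumed presumed;
};

When validation places the buffer at the presumed offset, the kernel skips
patching the pushbuf; relocs only get applied when the presumption turns out
to be stale.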

The problem with lowmem on ARM is that you can't unmap those pages from
the kernel cached mapping. So if you alloc a page, give it to userspace,
and userspace decides to map the page WC, you have just produced a
conflicting mapping, which may yield undefined results on ARMv7. You may
think this is not a problem as you are not touching the kernel cached
mapping, but in fact it is: the CPU's prefetcher can still access this
mapping.

Although it won't wander over a page boundary for automatic prefetching,
it may still do so when explicitly instructed to by code, which may happen,
for example, if your page happens to be near a kernel list whose traversal
code includes explicit prefetch instructions.
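
A minimal in-kernel sketch of the aliasing scenario described above (a
userspace WC mmap of the same page is equivalent in effect); illustrative
only, not driver code:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Create a second, write-combined mapping of a lowmem page.  The page stays
 * reachable through the cached linear mapping (page_address()), so the same
 * physical memory now has two mappings with different memory types, which is
 * the conflict that ARMv7 documents as yielding undefined results. */
static void *alias_page_wc(struct page **out_page)
{
	struct page *page = alloc_page(GFP_KERNEL);	/* lowmem page */
	void *wc;

	if (!page)
		return NULL;

	wc = vmap(&page, 1, VM_MAP, pgprot_writecombine(PAGE_KERNEL));
	if (!wc) {
		__free_page(page);
		return NULL;
	}

	*out_page = page;
	return wc;	/* vunmap() and __free_page() when done */
}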

Regards,
Lucas

-- 
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions   | http://www.pengutronix.de/  |



Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-25 Thread Terje Bergström
On 23.05.2014 17:40, Alex Courbot wrote:
> On 05/23/2014 06:59 PM, Lucas Stach wrote:
> So after checking with more knowledgeable people, it turns out this is 
> the expected behavior on ARM and BAR regions should be mapped uncached 
> on GK20A. All the more reasons to avoid using the BAR at all.

This is actually specific to Tegra.

>> You may want to make yourself aware of all the quirks required for
>> sharing memory between the GPU and CPU on an ARM host. I think there are
>> far more involved than what you see now and writing an replacement for
>> TTM will not be an easy task.
>>
>> Doing away with the concept of two memory areas will not get you to a
>> single unified address space. You would have to deal with things like
>> not being able to change the caching state of pages in the systems
>> lowmem yourself. You will still have to deal with remapping pages that
>> aren't currently visible to the CPU (ok this is not an issue on Jetson
>> right now as it only has 2GB of RAM), because it's in systems highmem,
>> or even in a different LPAE area.
>>
>> You really want to be sure you are aware of all the consequences of
>> this, before considering this task.
> 
> Yep, that's why I am seeking advice here. My first hope is that with a 
> few tweaks we will be able to keep using TTM and the current nouveau_bo 
> implementation. But unless I missed something this is not going to be easy.
> 
> We can also use something like the patch I originally sent to make it 
> work, although not with good performance, on GK20A. Not very graceful, 
> but it will allow applications to run.
> 
> In the long run though, we will want to achieve better performance, and 
> it seems like a BO implementation targeted at UMA devices would also be 
> beneficial to quite a few desktop GPUs. So as tricky as it may be I'm 
> interested in gathering thoughts and why not giving it a first try with 
> GK20A, even if it imposes some limitations like having buffers in lowmem 
> in a first time (we can probably live with this one for a short while, 
> and 64 bits will also be coming to the rescue :))

I don't think lowmem or LPAE is any problem if the memory manager is
designed with that in mind. The vast majority of the buffers the kernel
allocates do not need to be touched in kernel space.

Actually, I can't think of any buffers that we allocate on behalf of user
space that would need to be permanently mapped into the kernel as well. In
the case of relocs, only the push buffer needs to be temporarily mapped
into the kernel.

Ultimately even relocs are not necessary if we expose GPU virtual
addresses directly to user space. But that's another topic.

Terje


Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-23 Thread Alexandre Courbot

On 05/23/2014 06:59 PM, Lucas Stach wrote:

Am Freitag, den 23.05.2014, 18:43 +0900 schrieb Alexandre Courbot:

On 05/23/2014 06:24 PM, Lucas Stach wrote:

Am Freitag, den 23.05.2014, 16:10 +0900 schrieb Alexandre Courbot:

On Mon, May 19, 2014 at 7:16 PM, Lucas Stach  wrote:

Am Montag, den 19.05.2014, 19:06 +0900 schrieb Alexandre Courbot:

On 05/19/2014 06:57 PM, Lucas Stach wrote:

Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot:

This patch is not meant to be merged, but rather to try and understand
why this is needed and what a more suitable solution could be.

Allowing BOs to be write-cached results in the following happening when
trying to run any program on Tegra/GK20A:

Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
...
(nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
(nouveau_fence_update) from [] (nouveau_fence_done+0x1c/0x38)
(nouveau_fence_done) from [] (ttm_bo_wait+0xec/0x168)
(ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
(nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x1d8/0x4f4)
(drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
(nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3dc/0x5a0)
(do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
(SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30

The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO
mapped through the BAR.


Um wait, this memory is behind an already mapped bar? I think ioremap on
ARM defaults to uncached mappings, so if you want to access the memory
behind this bar as WC you need to map the BAR as a whole as WC by using
ioremap_wc.


Tried mapping the BAR using ioremap_wc(), but to no avail. On the other
hand, could it be that VRAM BOs end up creating a mapping over an
already-mapped region? I seem to remember that ARM might not like it...


Multiple mapping are generally allowed, as long as they have the same
caching state. It's conflicting mappings (uncached vs cached, or cached
vs wc), that are documented to yield undefined results.


Sorry about the confusion. The BAR is *not* mapped to the kernel yet
(it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is
performed in ttm_bo_ioremap() to make the part of the BAR where the
buffer is mapped visible. It seems that doing an ioremap_wc() on the
BAR area on Tegra is what leads to these errors. ioremap() or
ioremap_nocache() (which are in effect the same on ARM) do not cause
this issue.


It would be cool if you could ask HW, or the blob developers, if this is
a general issue. The external abort is clearly the GPUs AXI client
responding with an error to the read request, though I'm not clear where
a WC read differs from an uncached one.


Will check that.


So after checking with more knowledgeable people, it turns out this is 
the expected behavior on ARM and BAR regions should be mapped uncached 
on GK20A. All the more reasons to avoid using the BAR at all.







The best way to solve this issue would be to not use the BAR at all
since the memory behind these objects can be directly accessed by the
CPU. As such it would better be mapped using ttm_bo_kmap_ttm()
instead. But right now this is clearly not how nouveau_bo.c is written
and it does not look like this can easily be done. :/


Yeah, it sounds like we want this shortcut for stolen VRAM
implementations.


Actually, isn't it the case that we do not want to use TTM at all for
stolen VRAM (UMA) devices?

I am trying to wrap my head around this since a while already, and could
not think of a way to use the current TTM-based nouveau_bo optimally for
GK20A. Because we cannot do without the idea of VRAM and GART, we will
always have to "move" objects from one location to another, or deal with
constraints that do not make sense for UMA devices (like in the current
case, accessing VRAM objects through the BAR).

I am currently contemplating the idea of writing an alternative non-TTM
implementation of nouveau_bo for UMA devices, that would (hopefully) be
much simpler and would spare us a lot of stunts.

On the other hand, this sounds like a considerable work and I would like
to make sure that my lack of understanding of TTM is not driving me to
the wrong solution. Thoughts?


You may want to make yourself aware of all the quirks required for
sharing memory between the GPU and CPU on an ARM host. I think there are
far more involved than what you see now and writing an replacement for
TTM will not be an easy task.

Doing away with the concept of two memory areas will not get you to a
single unified address space. You would have to deal with things like
not being able to change the caching state of pages in the systems
lowmem yourself. You will still have to deal with remapping pages that
aren't currently visible to the CPU (ok this is not an issue on Jetson
right now as it only has 2GB of RAM), because it's in systems highmem,
or even in a different LPAE area.

You really want to be sure you are aware of all the consequences of
this, before considering this task.

Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-23 Thread Lucas Stach
Am Freitag, den 23.05.2014, 18:43 +0900 schrieb Alexandre Courbot:
> On 05/23/2014 06:24 PM, Lucas Stach wrote:
> > Am Freitag, den 23.05.2014, 16:10 +0900 schrieb Alexandre Courbot:
> >> On Mon, May 19, 2014 at 7:16 PM, Lucas Stach  
> >> wrote:
> >>> Am Montag, den 19.05.2014, 19:06 +0900 schrieb Alexandre Courbot:
>  On 05/19/2014 06:57 PM, Lucas Stach wrote:
> > Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot:
> >> This patch is not meant to be merged, but rather to try and understand
> >> why this is needed and what a more suitable solution could be.
> >>
> >> Allowing BOs to be write-cached results in the following happening when
> >> trying to run any program on Tegra/GK20A:
> >>
> >> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
> >> ...
> >> (nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
> >> (nouveau_fence_update) from [] (nouveau_fence_done+0x1c/0x38)
> >> (nouveau_fence_done) from [] (ttm_bo_wait+0xec/0x168)
> >> (ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
> >> (nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x1d8/0x4f4)
> >> (drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
> >> (nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3dc/0x5a0)
> >> (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
> >> (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30
> >>
> >> The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO
> >> mapped through the BAR.
> >>
> > Um wait, this memory is behind an already mapped bar? I think ioremap on
> > ARM defaults to uncached mappings, so if you want to access the memory
> > behind this bar as WC you need to map the BAR as a whole as WC by using
> > ioremap_wc.
> 
>  Tried mapping the BAR using ioremap_wc(), but to no avail. On the other
>  hand, could it be that VRAM BOs end up creating a mapping over an
>  already-mapped region? I seem to remember that ARM might not like it...
> >>>
> >>> Multiple mapping are generally allowed, as long as they have the same
> >>> caching state. It's conflicting mappings (uncached vs cached, or cached
> >>> vs wc), that are documented to yield undefined results.
> >>
> >> Sorry about the confusion. The BAR is *not* mapped to the kernel yet
> >> (it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is
> >> performed in ttm_bo_ioremap() to make the part of the BAR where the
> >> buffer is mapped visible. It seems that doing an ioremap_wc() on the
> >> BAR area on Tegra is what leads to these errors. ioremap() or
> >> ioremap_nocache() (which are in effect the same on ARM) do not cause
> >> this issue.
> >>
> > It would be cool if you could ask HW, or the blob developers, if this is
> > a general issue. The external abort is clearly the GPUs AXI client
> > responding with an error to the read request, though I'm not clear where
> > a WC read differs from an uncached one.
> 
> Will check that.
> 
> >
> >> The best way to solve this issue would be to not use the BAR at all
> >> since the memory behind these objects can be directly accessed by the
> >> CPU. As such it would better be mapped using ttm_bo_kmap_ttm()
> >> instead. But right now this is clearly not how nouveau_bo.c is written
> >> and it does not look like this can easily be done. :/
> >
> > Yeah, it sounds like we want this shortcut for stolen VRAM
> > implementations.
> 
> Actually, isn't it the case that we do not want to use TTM at all for 
> stolen VRAM (UMA) devices?
> 
> I am trying to wrap my head around this since a while already, and could 
> not think of a way to use the current TTM-based nouveau_bo optimally for 
> GK20A. Because we cannot do without the idea of VRAM and GART, we will 
> always have to "move" objects from one location to another, or deal with 
> constraints that do not make sense for UMA devices (like in the current 
> case, accessing VRAM objects through the BAR).
> 
> I am currently contemplating the idea of writing an alternative non-TTM 
> implementation of nouveau_bo for UMA devices, that would (hopefully) be 
> much simpler and would spare us a lot of stunts.
> 
> On the other hand, this sounds like a considerable work and I would like 
> to make sure that my lack of understanding of TTM is not driving me to 
> the wrong solution. Thoughts?
> 
You may want to make yourself aware of all the quirks required for
sharing memory between the GPU and CPU on an ARM host. I think there are
far more of them involved than what you see now, and writing a replacement
for TTM will not be an easy task.

Doing away with the concept of two memory areas will not get you to a
single unified address space. You would have to deal with things like
not being able to change the caching state of pages in the system's
lowmem yourself. You will still have to deal with remapping pages that
aren't currently visible to the CPU (OK, this is not an issue on Jetson
right now as it only has 2GB of RAM), because they are in the system's
highmem, or even in a different LPAE area.

You really want to be sure you are aware of all the consequences of
this, before considering this task.

Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-23 Thread Alexandre Courbot

On 05/23/2014 06:24 PM, Lucas Stach wrote:

Am Freitag, den 23.05.2014, 16:10 +0900 schrieb Alexandre Courbot:

On Mon, May 19, 2014 at 7:16 PM, Lucas Stach  wrote:

Am Montag, den 19.05.2014, 19:06 +0900 schrieb Alexandre Courbot:

On 05/19/2014 06:57 PM, Lucas Stach wrote:

Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot:

This patch is not meant to be merged, but rather to try and understand
why this is needed and what a more suitable solution could be.

Allowing BOs to be write-cached results in the following happening when
trying to run any program on Tegra/GK20A:

Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
...
(nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
(nouveau_fence_update) from [] (nouveau_fence_done+0x1c/0x38)
(nouveau_fence_done) from [] (ttm_bo_wait+0xec/0x168)
(ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
(nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x1d8/0x4f4)
(drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
(nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3dc/0x5a0)
(do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
(SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30

The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO
mapped through the BAR.


Um wait, this memory is behind an already mapped bar? I think ioremap on
ARM defaults to uncached mappings, so if you want to access the memory
behind this bar as WC you need to map the BAR as a whole as WC by using
ioremap_wc.


Tried mapping the BAR using ioremap_wc(), but to no avail. On the other
hand, could it be that VRAM BOs end up creating a mapping over an
already-mapped region? I seem to remember that ARM might not like it...


Multiple mapping are generally allowed, as long as they have the same
caching state. It's conflicting mappings (uncached vs cached, or cached
vs wc), that are documented to yield undefined results.


Sorry about the confusion. The BAR is *not* mapped to the kernel yet
(it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is
performed in ttm_bo_ioremap() to make the part of the BAR where the
buffer is mapped visible. It seems that doing an ioremap_wc() on the
BAR area on Tegra is what leads to these errors. ioremap() or
ioremap_nocache() (which are in effect the same on ARM) do not cause
this issue.


It would be cool if you could ask HW, or the blob developers, if this is
a general issue. The external abort is clearly the GPUs AXI client
responding with an error to the read request, though I'm not clear where
a WC read differs from an uncached one.


Will check that.




The best way to solve this issue would be to not use the BAR at all
since the memory behind these objects can be directly accessed by the
CPU. As such it would better be mapped using ttm_bo_kmap_ttm()
instead. But right now this is clearly not how nouveau_bo.c is written
and it does not look like this can easily be done. :/


Yeah, it sounds like we want this shortcut for stolen VRAM
implementations.


Actually, isn't it the case that we do not want to use TTM at all for 
stolen VRAM (UMA) devices?


I am trying to wrap my head around this since a while already, and could 
not think of a way to use the current TTM-based nouveau_bo optimally for 
GK20A. Because we cannot do without the idea of VRAM and GART, we will 
always have to "move" objects from one location to another, or deal with 
constraints that do not make sense for UMA devices (like in the current 
case, accessing VRAM objects through the BAR).


I am currently contemplating the idea of writing an alternative non-TTM 
implementation of nouveau_bo for UMA devices, that would (hopefully) be 
much simpler and would spare us a lot of stunts.


On the other hand, this sounds like a considerable work and I would like 
to make sure that my lack of understanding of TTM is not driving me to 
the wrong solution. Thoughts?


Thanks,
Alex.



Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-23 Thread Lucas Stach
Am Freitag, den 23.05.2014, 16:10 +0900 schrieb Alexandre Courbot:
> On Mon, May 19, 2014 at 7:16 PM, Lucas Stach  wrote:
> > Am Montag, den 19.05.2014, 19:06 +0900 schrieb Alexandre Courbot:
> >> On 05/19/2014 06:57 PM, Lucas Stach wrote:
> >> > Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot:
> >> >> This patch is not meant to be merged, but rather to try and understand
> >> >> why this is needed and what a more suitable solution could be.
> >> >>
> >> >> Allowing BOs to be write-cached results in the following happening when
> >> >> trying to run any program on Tegra/GK20A:
> >> >>
> >> >> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
> >> >> ...
> >> >> (nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
> >> >> (nouveau_fence_update) from [] (nouveau_fence_done+0x1c/0x38)
> >> >> (nouveau_fence_done) from [] (ttm_bo_wait+0xec/0x168)
> >> >> (ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
> >> >> (nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x1d8/0x4f4)
> >> >> (drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
> >> >> (nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3dc/0x5a0)
> >> >> (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
> >> >> (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30
> >> >>
> >> >> The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO
> >> >> mapped through the BAR.
> >> >>
> >> > Um wait, this memory is behind an already mapped bar? I think ioremap on
> >> > ARM defaults to uncached mappings, so if you want to access the memory
> >> > behind this bar as WC you need to map the BAR as a whole as WC by using
> >> > ioremap_wc.
> >>
> >> Tried mapping the BAR using ioremap_wc(), but to no avail. On the other
> >> hand, could it be that VRAM BOs end up creating a mapping over an
> >> already-mapped region? I seem to remember that ARM might not like it...
> >
> > Multiple mapping are generally allowed, as long as they have the same
> > caching state. It's conflicting mappings (uncached vs cached, or cached
> > vs wc), that are documented to yield undefined results.
> 
> Sorry about the confusion. The BAR is *not* mapped to the kernel yet
> (it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is
> performed in ttm_bo_ioremap() to make the part of the BAR where the
> buffer is mapped visible. It seems that doing an ioremap_wc() on the
> BAR area on Tegra is what leads to these errors. ioremap() or
> ioremap_nocache() (which are in effect the same on ARM) do not cause
> this issue.
> 
It would be cool if you could ask HW, or the blob developers, if this is
a general issue. The external abort is clearly the GPU's AXI client
responding with an error to the read request, though I'm not clear on how
a WC read differs from an uncached one.

> The best way to solve this issue would be to not use the BAR at all
> since the memory behind these objects can be directly accessed by the
> CPU. As such it would better be mapped using ttm_bo_kmap_ttm()
> instead. But right now this is clearly not how nouveau_bo.c is written
> and it does not look like this can easily be done. :/

Yeah, it sounds like we want this shortcut for stolen VRAM
implementations.

Regards,
Lucas

-- 
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions   | http://www.pengutronix.de/  |



Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-23 Thread Alexandre Courbot
On Mon, May 19, 2014 at 7:16 PM, Lucas Stach  wrote:
> Am Montag, den 19.05.2014, 19:06 +0900 schrieb Alexandre Courbot:
>> On 05/19/2014 06:57 PM, Lucas Stach wrote:
>> > Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot:
>> >> This patch is not meant to be merged, but rather to try and understand
>> >> why this is needed and what a more suitable solution could be.
>> >>
>> >> Allowing BOs to be write-cached results in the following happening when
>> >> trying to run any program on Tegra/GK20A:
>> >>
>> >> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
>> >> ...
>> >> (nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
>> >> (nouveau_fence_update) from [] (nouveau_fence_done+0x1c/0x38)
>> >> (nouveau_fence_done) from [] (ttm_bo_wait+0xec/0x168)
>> >> (ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
>> >> (nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x1d8/0x4f4)
>> >> (drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
>> >> (nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3dc/0x5a0)
>> >> (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
>> >> (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30
>> >>
>> >> The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO
>> >> mapped through the BAR.
>> >>
>> > Um wait, this memory is behind an already mapped bar? I think ioremap on
>> > ARM defaults to uncached mappings, so if you want to access the memory
>> > behind this bar as WC you need to map the BAR as a whole as WC by using
>> > ioremap_wc.
>>
>> Tried mapping the BAR using ioremap_wc(), but to no avail. On the other
>> hand, could it be that VRAM BOs end up creating a mapping over an
>> already-mapped region? I seem to remember that ARM might not like it...
>
> Multiple mapping are generally allowed, as long as they have the same
> caching state. It's conflicting mappings (uncached vs cached, or cached
> vs wc), that are documented to yield undefined results.

Sorry about the confusion. The BAR is *not* mapped to the kernel yet
(it is BAR1, there is no BAR3 on GK20A) and an ioremap_*() is
performed in ttm_bo_ioremap() to make the part of the BAR where the
buffer is mapped visible. It seems that doing an ioremap_wc() on the
BAR area on Tegra is what leads to these errors. ioremap() or
ioremap_nocache() (which are in effect the same on ARM) do not cause
this issue.

The best way to solve this issue would be to not use the BAR at all,
since the memory behind these objects can be directly accessed by the
CPU. As such, it would be better mapped using ttm_bo_kmap_ttm()
instead. But right now this is clearly not how nouveau_bo.c is written,
and it does not look like this can easily be done. :/
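
For reference, a rough sketch of what CPU access through TTM's
kernel-mapping path (rather than through the BAR) could look like; it uses
the generic ttm_bo_kmap() wrapper for illustration, and the helper itself is
hypothetical:

#include <linux/types.h>
#include <drm/ttm/ttm_bo_api.h>

/* Read a 32-bit word from a BO through a kernel mapping of its backing
 * pages instead of through an ioremapped BAR window. */
static int bo_read32_via_kmap(struct ttm_buffer_object *bo,
			      unsigned long offset, u32 *value)
{
	struct ttm_bo_kmap_obj map;
	bool is_iomem;
	int ret;

	ret = ttm_bo_kmap(bo, 0, bo->num_pages, &map);
	if (ret)
		return ret;

	/* For system-memory placements this is an ordinary cached pointer. */
	*value = *(u32 *)(ttm_kmap_obj_virtual(&map, &is_iomem) + offset);

	ttm_bo_kunmap(&map);
	return 0;
}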


Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-19 Thread Lucas Stach
Am Montag, den 19.05.2014, 19:06 +0900 schrieb Alexandre Courbot:
> On 05/19/2014 06:57 PM, Lucas Stach wrote:
> > Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot:
> >> This patch is not meant to be merged, but rather to try and understand
> >> why this is needed and what a more suitable solution could be.
> >>
> >> Allowing BOs to be write-cached results in the following happening when
> >> trying to run any program on Tegra/GK20A:
> >>
> >> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
> >> ...
> >> (nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
> >> (nouveau_fence_update) from [] (nouveau_fence_done+0x1c/0x38)
> >> (nouveau_fence_done) from [] (ttm_bo_wait+0xec/0x168)
> >> (ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
> >> (nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x1d8/0x4f4)
> >> (drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
> >> (nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3dc/0x5a0)
> >> (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
> >> (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30
> >>
> >> The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO
> >> mapped through the BAR.
> >>
> > Um wait, this memory is behind an already mapped bar? I think ioremap on
> > ARM defaults to uncached mappings, so if you want to access the memory
> > behind this bar as WC you need to map the BAR as a whole as WC by using
> > ioremap_wc.
> 
> Tried mapping the BAR using ioremap_wc(), but to no avail. On the other 
> hand, could it be that VRAM BOs end up creating a mapping over an 
> already-mapped region? I seem to remember that ARM might not like it...

Multiple mappings are generally allowed, as long as they have the same
caching state. It's conflicting mappings (uncached vs. cached, or cached
vs. WC) that are documented to yield undefined results.

Regards,
Lucas
-- 
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions   | http://www.pengutronix.de/  |



Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-19 Thread Alexandre Courbot

On 05/19/2014 06:57 PM, Lucas Stach wrote:

Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot:

This patch is not meant to be merged, but rather to try and understand
why this is needed and what a more suitable solution could be.

Allowing BOs to be write-cached results in the following happening when
trying to run any program on Tegra/GK20A:

Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
...
(nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
(nouveau_fence_update) from [] (nouveau_fence_done+0x1c/0x38)
(nouveau_fence_done) from [] (ttm_bo_wait+0xec/0x168)
(ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
(nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x1d8/0x4f4)
(drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
(nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3dc/0x5a0)
(do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
(SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30

The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO
mapped through the BAR.


Um wait, this memory is behind an already mapped bar? I think ioremap on
ARM defaults to uncached mappings, so if you want to access the memory
behind this bar as WC you need to map the BAR as a whole as WC by using
ioremap_wc.


I tried mapping the BAR using ioremap_wc(), but to no avail. On the other
hand, could it be that VRAM BOs end up creating a mapping over an
already-mapped region? I seem to remember that ARM might not like that...



Re: [RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-19 Thread Lucas Stach
Am Montag, den 19.05.2014, 18:46 +0900 schrieb Alexandre Courbot:
> This patch is not meant to be merged, but rather to try and understand
> why this is needed and what a more suitable solution could be.
> 
> Allowing BOs to be write-cached results in the following happening when
> trying to run any program on Tegra/GK20A:
> 
> Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
> ...
> (nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
> (nouveau_fence_update) from [] (nouveau_fence_done+0x1c/0x38)
> (nouveau_fence_done) from [] (ttm_bo_wait+0xec/0x168)
> (ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
> (nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x1d8/0x4f4)
> (drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
> (nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3dc/0x5a0)
> (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
> (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30
> 
> The offending nouveau_bo_rd32 is done over an IO-mapped BO, e.g. a BO
> mapped through the BAR.
> 
Um wait, this memory is behind an already-mapped BAR? I think ioremap() on
ARM defaults to uncached mappings, so if you want to access the memory
behind this BAR as WC, you need to map the BAR as a whole as WC by using
ioremap_wc().
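
For illustration, a minimal sketch of that suggestion, with placeholder
base/size arguments rather than actual nouveau code. (As the rest of the
thread establishes, a WC mapping of the BAR ends up faulting on GK20A, so on
Tegra the mapping has to stay uncached.)

#include <linux/io.h>

/* Map the whole BAR write-combined instead of relying on the Device/uncached
 * attributes that ioremap()/ioremap_nocache() give on ARM. */
static void __iomem *map_bar_wc(phys_addr_t bar_base, size_t bar_size)
{
	return ioremap_wc(bar_base, bar_size);	/* NULL on failure; iounmap() when done */
}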

Regards,
Lucas

> Any idea about the origin of this behavior? Does ARM forbid cached
> mappings over IO regions?
> 
> Signed-off-by: Alexandre Courbot 
> ---
>  drivers/gpu/drm/nouveau/nouveau_bo.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c 
> b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 8db54a217232..9cfb8e61f5c4 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -552,7 +552,11 @@ nouveau_bo_init_mem_type(struct ttm_bo_device *bdev, 
> uint32_t type,
>TTM_MEMTYPE_FLAG_MAPPABLE;
>   man->available_caching = TTM_PL_FLAG_UNCACHED |
>TTM_PL_FLAG_WC;
> +#if defined(__arm__)
> + man->default_caching = TTM_PL_FLAG_UNCACHED;
> +#else
>   man->default_caching = TTM_PL_FLAG_WC;
> +#endif
>   break;
>   case TTM_PL_TT:
>   if (nv_device(drm->device)->card_type >= NV_50)

-- 
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions   | http://www.pengutronix.de/  |



[RFC] drm/nouveau: disable caching for VRAM BOs on ARM

2014-05-19 Thread Alexandre Courbot
This patch is not meant to be merged, but rather to try and understand
why this is needed and what a more suitable solution could be.

Allowing BOs to be write-cached results in the following happening when
trying to run any program on Tegra/GK20A:

Unhandled fault: external abort on non-linefetch (0x1008) at 0xf0036010
...
(nouveau_bo_rd32) from [] (nouveau_fence_update+0x5c/0x80)
(nouveau_fence_update) from [] (nouveau_fence_done+0x1c/0x38)
(nouveau_fence_done) from [] (ttm_bo_wait+0xec/0x168)
(ttm_bo_wait) from [] (nouveau_gem_ioctl_cpu_prep+0x44/0x100)
(nouveau_gem_ioctl_cpu_prep) from [] (drm_ioctl+0x1d8/0x4f4)
(drm_ioctl) from [] (nouveau_drm_ioctl+0x54/0x80)
(nouveau_drm_ioctl) from [] (do_vfs_ioctl+0x3dc/0x5a0)
(do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
(SyS_ioctl) from [] (ret_fast_syscall+0x0/0x30

The offending nouveau_bo_rd32 is done over an IO-mapped BO, i.e. a BO
mapped through the BAR.

Any idea about the origin of this behavior? Does ARM forbid cached
mappings over IO regions?

Signed-off-by: Alexandre Courbot 
---
 drivers/gpu/drm/nouveau/nouveau_bo.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 8db54a217232..9cfb8e61f5c4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -552,7 +552,11 @@ nouveau_bo_init_mem_type(struct ttm_bo_device *bdev, uint32_t type,
 					 TTM_MEMTYPE_FLAG_MAPPABLE;
 		man->available_caching = TTM_PL_FLAG_UNCACHED |
 					 TTM_PL_FLAG_WC;
+#if defined(__arm__)
+		man->default_caching = TTM_PL_FLAG_UNCACHED;
+#else
 		man->default_caching = TTM_PL_FLAG_WC;
+#endif
 		break;
 	case TTM_PL_TT:
 		if (nv_device(drm->device)->card_type >= NV_50)
-- 
1.9.2
