Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-13 Thread Lucas Stach
Am Dienstag, den 12.08.2014, 22:17 -0400 schrieb Jerome Glisse:
[...]
> > I haven't tested the patch yet. For the original bug it won't help directly,
> > because the super-slow allocations which cause the desktop stall are
> > tt_cached allocations, so they go through the if (is_cached) code path which
> > isn't improved by Jerome's patch. is_cached always releases memory
> > immediately, so the tt_cached pool just bounces up and down between 4 and 7
> > pages. So this was an independent issue. The slow allocations i noticed were
> > mostly caused by exa allocating new gem bo's, i don't know which path is
> > taken by 3d graphics?
> > 
> > However, the fixed ttm path could indirectly solve the DMA_CMA stalls by
> > completely killing CMA for its intended purpose. Typical CMA sizes are
> > probably around < 100 MB (kernel default is 16 MB, Ubuntu config is 64 MB),
> > and the limit for the page pool seems to be more like 50% of all system RAM?
> > Iow. if the ttm dma pool is allowed to grow that big with recycled pages, it
> > probably will almost completely monopolize the whole CMA memory after a
> > short amount of time. ttm won't suffer stalls if it essentially doesn't
> > interact with CMA anymore after a warmup period, but actual clients which
> > really need CMA (ie., hardware without scatter-gather dma etc.) will be
> > starved of what they need as far as my limited understanding of the CMA
> > goes.
> 
> Yes, currently we allow the pool to be way too big; given that the pool was
> probably never really used, we most likely never had much of an issue. So I
> would hold off applying my patch until more proper limits are in place. My
> thinking was to go for something like 32/64M at most, and less than that if
> there is < 256M total RAM. I also think that we should lower the pool size
> on the first call to shrink and only increase it again after some timeout
> since the last call to shrink, so that when shrink is called we minimize our
> pool size at least for a time. Will put together a couple of patches for
> doing that.
> 
> > 
> > So fwiw probably the fix to ttm will increase the urgency for the CMA people
> > to come up with a fix/optimization for the allocator. Unless it doesn't
> > matter if most desktop systems have CMA disabled by default, and ttm is
> > mostly used by desktop graphics drivers (nouveau, radeon, vmgfx)? I only
> > stumbled over the problem because the Ubuntu 3.16 mainline testing kernels
> > are compiled with CMA on.
> > 
> 
> Enabling cma on x86 is proof of brain damage. That said, the dma allocator
> should not use the cma area for single page allocations.
> 
Harsh words.

Yes, allocating pages unconditionally from CMA if it is enabled is an
artifact of CMA's ARM heritage. While it seems completely backwards to
allocate single pages from CMA on x86, on ARM the CMA pool is the only
way to get lowmem pages on which you are allowed to change the caching
state.

So the obvious fix is to avoid CMA for order 0 allocations on x86. I can
cook a patch for this.
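
For illustration, a minimal sketch of the kind of order-0 skip described
above, loosely following the shape of the 3.16-era x86
dma_generic_alloc_coherent() path. The helper name and exact placement are
made up; this is not the actual patch:

/*
 * Sketch only: single-page (order-0) coherent allocations don't need
 * physically contiguous CMA memory, so skip dma_alloc_from_contiguous()
 * for them and go straight to the normal page allocator.
 */
static struct page *alloc_coherent_pages(struct device *dev, size_t size,
					 gfp_t flag)
{
	unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
	struct page *page = NULL;

	/* Only bother CMA for multi-page, blocking requests. */
	if (count > 1 && (flag & __GFP_WAIT))
		page = dma_alloc_from_contiguous(dev, count, get_order(size));

	/* Order-0 requests (and CMA failures) use the regular allocator. */
	if (!page)
		page = alloc_pages_node(dev_to_node(dev), flag,
					get_order(size));
	return page;
}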

Regards,
Lucas 
-- 
Pengutronix e.K. | Lucas Stach |
Industrial Linux Solutions   | http://www.pengutronix.de/  |

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Jerome Glisse
On Wed, Aug 13, 2014 at 04:04:15AM +0200, Mario Kleiner wrote:
> On 08/13/2014 03:50 AM, Michel Dänzer wrote:
> >On 12.08.2014 00:17, Jerome Glisse wrote:
> >>On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> >>>On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> >On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >>The other problem is that probably TTM does not reuse pages from the
> >>DMA pool. If i trace the __ttm_dma_alloc_page and __ttm_dma_free_page
> >>calls for those single page allocs/frees, then over a 20 second interval of
> >>tracing and switching tabs in firefox, scrolling things around etc. i
> >>find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> >>1648 frees.
> >This is because historically the pools have been designed to keep only
> >pages with nonstandard caching attributes since changing page caching
> >attributes have been very slow but the kernel page allocators have been
> >reasonably fast.
> >
> >/Thomas
> Ok. A bit more ftraceing showed my hang problem case goes through the
> "if (is_cached)" paths, so the pool doesn't recycle anything and i see
> it bouncing up and down by 4 pages all the time.
> 
> But for the non-cached case, which i don't hit with my problem, could
> one of you look at line 954...
> 
> > http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
> 
> 
> ... and tell me why that unconditional npages = count; assignment
> makes sense? It seems to essentially disable all recycling for the dma
> pool whenever the pool isn't filled up to/beyond its maximum with free
> pages? When the pool is filled up, lots of stuff is recycled, but when
> it is already somewhat below capacity, it gets "punished" by not
> getting refilled? I'd just like to understand the logic behind that line.
> 
> thanks,
> -mario
> >>>I'll happily forward that question to Konrad who wrote the code (or it
> >>>may even stem from the ordinary page pool code which IIRC has Dave
> >>>Airlie / Jerome Glisse as authors)
> >>This is effectively bogus code, i now wonder how it came to stay alive.
> >>Attached patch will fix that.
> >I haven't tested Mario's scenario specifically, but it survived piglit
> >and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
> >some BOs ended up in GTT instead with write-combined CPU mappings) on
> >radeonsi without any noticeable issues.
> >
> >Tested-by: Michel Dänzer 
> >
> >
> 
> I haven't tested the patch yet. For the original bug it won't help directly,
> because the super-slow allocations which cause the desktop stall are
> tt_cached allocations, so they go through the if (is_cached) code path which
> isn't improved by Jerome's patch. is_cached always releases memory
> immediately, so the tt_cached pool just bounces up and down between 4 and 7
> pages. So this was an independent issue. The slow allocations i noticed were
> mostly caused by exa allocating new gem bo's, i don't know which path is
> taken by 3d graphics?
> 
> However, the fixed ttm path could indirectly solve the DMA_CMA stalls by
> completely killing CMA for its intended purpose. Typical CMA sizes are
> probably around < 100 MB (kernel default is 16 MB, Ubuntu config is 64 MB),
> and the limit for the page pool seems to be more like 50% of all system RAM?
> Iow. if the ttm dma pool is allowed to grow that big with recycled pages, it
> probably will almost completely monopolize the whole CMA memory after a
> short amount of time. ttm won't suffer stalls if it essentially doesn't
> interact with CMA anymore after a warmup period, but actual clients which
> really need CMA (ie., hardware without scatter-gather dma etc.) will be
> starved of what they need as far as my limited understanding of the CMA
> goes.

Yes, currently we allow the pool to be way too big; given that the pool was
probably never really used, we most likely never had much of an issue. So I would
hold off applying my patch until more proper limits are in place. My thinking was
to go for something like 32/64M at most, and less than that if there is < 256M
total RAM. I also think that we should lower the pool size on the first call to
shrink and only increase it again after some timeout since the last call to
shrink, so that when shrink is called we minimize our pool size at least for a
time. Will put together a couple of patches for doing that.
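
As a rough illustration of the kind of cap described above (the constants,
the helper name and its placement are invented for the example, not taken
from an actual patch):

/*
 * Sketch only: cap the TTM DMA page pool at 64 MiB, or 32 MiB on machines
 * with less than 256 MiB of RAM. The shrink-timeout idea (staying small
 * for a while after the shrinker has run) is not shown here.
 */
static unsigned long ttm_dma_pool_max_pages(void)
{
	unsigned long ram_mb = totalram_pages >> (20 - PAGE_SHIFT);
	unsigned long cap_mb = (ram_mb < 256) ? 32 : 64;

	return (cap_mb << 20) >> PAGE_SHIFT;
}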

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Mario Kleiner

On 08/13/2014 03:50 AM, Michel Dänzer wrote:

On 12.08.2014 00:17, Jerome Glisse wrote:

On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:

On 08/10/2014 08:02 PM, Mario Kleiner wrote:

On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:

On 08/10/2014 05:11 AM, Mario Kleiner wrote:

The other problem is that probably TTM does not reuse pages from the
DMA pool. If i trace the __ttm_dma_alloc_page and __ttm_dma_free_page
calls for those single page allocs/frees, then over a 20 second interval of
tracing and switching tabs in firefox, scrolling things around etc. i
find about as many alloc's as i find free's, e.g., 1607 allocs vs.
1648 frees.

This is because historically the pools have been designed to keep only
pages with nonstandard caching attributes since changing page caching
attributes have been very slow but the kernel page allocators have been
reasonably fast.

/Thomas

Ok. A bit more ftraceing showed my hang problem case goes through the
"if (is_cached)" paths, so the pool doesn't recycle anything and i see
it bouncing up and down by 4 pages all the time.

But for the non-cached case, which i don't hit with my problem, could
one of you look at line 954...

http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954


... and tell me why that unconditional npages = count; assignment
makes sense? It seems to essentially disable all recycling for the dma
pool whenever the pool isn't filled up to/beyond its maximum with free
pages? When the pool is filled up, lots of stuff is recycled, but when
it is already somewhat below capacity, it gets "punished" by not
getting refilled? I'd just like to understand the logic behind that line.

thanks,
-mario

I'll happily forward that question to Konrad who wrote the code (or it
may even stem from the ordinary page pool code which IIRC has Dave
Airlie / Jerome Glisse as authors)

This is effectively bogus code, i now wonder how it came to stay alive.
Attached patch will fix that.

I haven't tested Mario's scenario specifically, but it survived piglit
and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
some BOs ended up in GTT instead with write-combined CPU mappings) on
radeonsi without any noticeable issues.

Tested-by: Michel Dänzer 




I haven't tested the patch yet. For the original bug it won't help 
directly, because the super-slow allocations which cause the desktop 
stall are tt_cached allocations, so they go through the if (is_cached) 
code path which isn't improved by Jerome's patch. is_cached always 
releases memory immediately, so the tt_cached pool just bounces up and 
down between 4 and 7 pages. So this was an independent issue. The slow 
allocations i noticed were mostly caused by exa allocating new gem bo's, 
i don't know which path is taken by 3d graphics?


However, the fixed ttm path could indirectly solve the DMA_CMA stalls by 
completely killing CMA for its intended purpose. Typical CMA sizes are 
probably around < 100 MB (kernel default is 16 MB, Ubuntu config is 64 
MB), and the limit for the page pool seems to be more like 50% of all 
system RAM? Iow. if the ttm dma pool is allowed to grow that big with 
recycled pages, it probably will almost completely monopolize the whole 
CMA memory after a short amount of time. ttm won't suffer stalls if it 
essentially doesn't interact with CMA anymore after a warmup period, but 
actual clients which really need CMA (ie., hardware without 
scatter-gather dma etc.) will be starved of what they need as far as my 
limited understanding of the CMA goes.


So fwiw probably the fix to ttm will increase the urgency for the CMA 
people to come up with a fix/optimization for the allocator. Unless it 
doesn't matter if most desktop systems have CMA disabled by default, and 
ttm is mostly used by desktop graphics drivers (nouveau, radeon, vmgfx)? 
I only stumbled over the problem because the Ubuntu 3.16 mainline 
testing kernels are compiled with CMA on.


-mario


Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Jerome Glisse
On Wed, Aug 13, 2014 at 10:50:25AM +0900, Michel Dänzer wrote:
> On 12.08.2014 00:17, Jerome Glisse wrote:
> > On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> >> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> >>> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
>  On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >
> > The other problem is that probably TTM does not reuse pages from the
> > DMA pool. If i trace the __ttm_dma_alloc_page and __ttm_dma_free_page
> > calls for those single page allocs/frees, then over a 20 second interval of
> > tracing and switching tabs in firefox, scrolling things around etc. i
> > find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> > 1648 frees.
>  This is because historically the pools have been designed to keep only
>  pages with nonstandard caching attributes since changing page caching
>  attributes have been very slow but the kernel page allocators have been
>  reasonably fast.
> 
>  /Thomas
> >>>
> >>> Ok. A bit more ftraceing showed my hang problem case goes through the
> >>> "if (is_cached)" paths, so the pool doesn't recycle anything and i see
> >>> it bouncing up and down by 4 pages all the time.
> >>>
> >>> But for the non-cached case, which i don't hit with my problem, could
> >>> one of you look at line 954...
> >>>
> >>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
> >>>
> >>>
> >>> ... and tell me why that unconditional npages = count; assignment
> >>> makes sense? It seems to essentially disable all recycling for the dma
> >>> pool whenever the pool isn't filled up to/beyond its maximum with free
> >>> pages? When the pool is filled up, lots of stuff is recycled, but when
> >>> it is already somewhat below capacity, it gets "punished" by not
> >>> getting refilled? I'd just like to understand the logic behind that line.
> >>>
> >>> thanks,
> >>> -mario
> >>
> >> I'll happily forward that question to Konrad who wrote the code (or it
> >> may even stem from the ordinary page pool code which IIRC has Dave
> >> Airlie / Jerome Glisse as authors)
> > 
> > This is effectively bogus code, i now wonder how it came to stay alive.
> > Attached patch will fix that.
> 
> I haven't tested Mario's scenario specifically, but it survived piglit
> and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
> some BOs ended up in GTT instead with write-combined CPU mappings) on
> radeonsi without any noticeable issues.
> 
> Tested-by: Michel Dänzer 
> 

My patch does not fix the cma bug; cma should not allocate single pages from
its reserved contiguous memory. But cma is a broken technology in the first
place, and it should not be enabled on x86; whoever did that is a moron.

So I would definitely encourage opening a bug against cma.

Nonetheless the ttm code was buggy too, and this patch will fix that, but it
will only alleviate or delay the symptoms reported by Mario.
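
For readers following along, a paraphrased sketch of the logic around line 954
of ttm_page_alloc_dma.c that Mario asked about, reconstructed from the
discussion above rather than quoted verbatim from the file:

	/* ttm_dma_unpopulate(), non-cached (pooled) case: */
	pool->npages_free += count;
	list_splice(&ttm_dma->pages_list, &pool->free_list);

	npages = count;		/* the unconditional assignment in question */
	if (pool->npages_free > _manager->options.max_size) {
		npages = pool->npages_free - _manager->options.max_size;
		/* free in batches of at least NUM_PAGES_TO_ALLOC to keep
		 * the number of set_memory_wb() calls down */
		if (npages < NUM_PAGES_TO_ALLOC)
			npages = NUM_PAGES_TO_ALLOC;
	}

	/* Later, npages pages are handed back to the system through
	 * ttm_dma_page_pool_free(). With the unconditional
	 * "npages = count;" the pool gives back as many pages as were just
	 * returned whenever it sits below max_size, so it effectively never
	 * recycles; dropping that assignment lets the pool keep pages until
	 * the max_size limit is actually exceeded.
	 */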

Cheers,
Jérôme

> 
> -- 
> Earthling Michel Dänzer|  http://www.amd.com
> Libre software enthusiast  |Mesa and X developer


Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Michel Dänzer
On 12.08.2014 00:17, Jerome Glisse wrote:
> On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
>> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
>>> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
 On 08/10/2014 05:11 AM, Mario Kleiner wrote:
>
> The other problem is that probably TTM does not reuse pages from the
> DMA pool. If i trace the __ttm_dma_alloc_page and __ttm_dma_free_page
> calls for those single page allocs/frees, then over a 20 second interval of
> tracing and switching tabs in firefox, scrolling things around etc. i
> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> 1648 frees.
 This is because historically the pools have been designed to keep only
 pages with nonstandard caching attributes since changing page caching
 attributes have been very slow but the kernel page allocators have been
 reasonably fast.

 /Thomas
>>>
>>> Ok. A bit more ftraceing showed my hang problem case goes through the
>>> "if (is_cached)" paths, so the pool doesn't recycle anything and i see
>>> it bouncing up and down by 4 pages all the time.
>>>
>>> But for the non-cached case, which i don't hit with my problem, could
>>> one of you look at line 954...
>>>
>>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
>>>
>>>
>>> ... and tell me why that unconditional npages = count; assignment
>>> makes sense? It seems to essentially disable all recycling for the dma
>>> pool whenever the pool isn't filled up to/beyond its maximum with free
>>> pages? When the pool is filled up, lots of stuff is recycled, but when
>>> it is already somewhat below capacity, it gets "punished" by not
>>> getting refilled? I'd just like to understand the logic behind that line.
>>>
>>> thanks,
>>> -mario
>>
>> I'll happily forward that question to Konrad who wrote the code (or it
>> may even stem from the ordinary page pool code which IIRC has Dave
>> Airlie / Jerome Glisse as authors)
> 
> This is effectively bogus code, i now wonder how it came to stay alive.
> Attached patch will fix that.

I haven't tested Mario's scenario specifically, but it survived piglit
and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
some BOs ended up in GTT instead with write-combined CPU mappings) on
radeonsi without any noticeable issues.

Tested-by: Michel Dänzer 


-- 
Earthling Michel Dänzer|  http://www.amd.com
Libre software enthusiast  |Mesa and X developer


Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Konrad Rzeszutek Wilk
On Tue, Aug 12, 2014 at 02:12:07PM +0200, Mario Kleiner wrote:
> On 08/11/2014 05:17 PM, Jerome Glisse wrote:
> >On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> >>On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> >>>On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >Resent this time without HTML formatting which lkml doesn't like.
> >Sorry.
> >
> >On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
> >>On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
> >>>On August 9, 2014 1:39:39 AM EDT, Thomas
> >>>Hellstrom  wrote:
> Hi.
> 
> >>>Hey Thomas!
> >>>
> IIRC I don't think the TTM DMA pool allocates coherent pages more
> than
> one page at a time, and _if that's true_ it's pretty unnecessary for
> the
> dma subsystem to route those allocations to CMA. Maybe Konrad could
> shed
> some light over this?
> >>>It should allocate in batches and keep them in the TTM DMA pool for
> >>>some time to be reused.
> >>>
> >>>The pages that it gets are in 4kb granularity though.
> >>Then I feel inclined to say this is a DMA subsystem bug. Single page
> >>allocations shouldn't get routed to CMA.
> >>
> >>/Thomas
> >Yes, seems you're both right. I read through the code a bit more and
> >indeed the TTM DMA pool allocates only one page during each
> >dma_alloc_coherent() call, so it doesn't need CMA memory. The current
> >allocators don't check for single page CMA allocations and therefore
> >try to get it from the CMA area anyway, instead of skipping to the
> >much cheaper fallback.
> >
> >>> So the callers of dma_alloc_from_contiguous() could need that little
> >>> optimization of skipping it if only one page is requested. For
> >>> dma_generic_alloc_coherent and intel_alloc_coherent this
> >>> seems easy to do. Looking at the arm arch variants, e.g.,
> >
> >>> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
> >
> >
> >and
> >
> >>> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
> >
> >
> >i'm not sure if it is that easily done, as there aren't any fallbacks
> >for such a case and the code looks to me as if that's at least
> >somewhat intentional.
> >
> >>> As far as TTM goes, one quick one-line fix to prevent it from using
> >>> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
> >>> above methods) would be to clear the __GFP_WAIT flag from the
> >>> passed gfp_t flags. That would trigger the well working fallback.
> >>> So, is __GFP_WAIT needed for those single page allocations that go
> >>> through __ttm_dma_alloc_page?
> >
> >
> >It would be nice to have such a simple, 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Mario Kleiner

On 08/11/2014 05:17 PM, Jerome Glisse wrote:

On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:

On 08/10/2014 08:02 PM, Mario Kleiner wrote:

On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:

On 08/10/2014 05:11 AM, Mario Kleiner wrote:

Resent this time without HTML formatting which lkml doesn't like.
Sorry.

On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:

On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:

On August 9, 2014 1:39:39 AM EDT, Thomas
Hellstrom  wrote:

Hi.


Hey Thomas!


IIRC I don't think the TTM DMA pool allocates coherent pages more
than
one page at a time, and _if that's true_ it's pretty unnecessary for
the
dma subsystem to route those allocations to CMA. Maybe Konrad could
shed
some light over this?

It should allocate in batches and keep them in the TTM DMA pool for
some time to be reused.

The pages that it gets are in 4kb granularity though.

Then I feel inclined to say this is a DMA subsystem bug. Single page
allocations shouldn't get routed to CMA.

/Thomas

Yes, seems you're both right. I read through the code a bit more and
indeed the TTM DMA pool allocates only one page during each
dma_alloc_coherent() call, so it doesn't need CMA memory. The current
allocators don't check for single page CMA allocations and therefore
try to get it from the CMA area anyway, instead of skipping to the
much cheaper fallback.

So the callers of dma_alloc_from_contiguous() could need that little
optimization of skipping it if only one page is requested. For
dma_generic_alloc_coherent and intel_alloc_coherent this
seems easy to do. Looking at the arm arch variants, e.g.,

http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194


and

http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44


i'm not sure if it is that easily done, as there aren't any fallbacks
for such a case and the code looks to me as if that's at least
somewhat intentional.

As far as TTM goes, one quick one-line fix to prevent it from using
the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
above methods) would be to clear the __GFP_WAIT flag from the
passed gfp_t flags. That would trigger the well working fallback.
So, is __GFP_WAIT needed for those single page allocations that go
through __ttm_dma_alloc_page?
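
Purely as an illustration of what that one-liner would amount to (the field
and function names below are assumptions, and the reply quoted further down
explains why this is probably not a good idea under memory pressure):

/*
 * Hypothetical sketch: strip __GFP_WAIT from the flags used for the pool's
 * single-page dma_alloc_coherent() calls, so the DMA layer takes its
 * non-CMA fallback instead of blocking on the contiguous allocator.
 */
static void *ttm_dma_alloc_page_nowait(struct dma_pool *pool,
				       dma_addr_t *dma_handle)
{
	gfp_t flags = pool->gfp_flags & ~__GFP_WAIT;

	return dma_alloc_coherent(pool->dev, PAGE_SIZE, dma_handle, flags);
}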


It would be nice to have such a simple, non-intrusive one-line patch
that we still could get into 3.17 and then backported to older stable
kernels to avoid the same desktop hangs there if CMA is enabled. It
would be also nice for actual users of CMA to not use up lots of CMA
space for gpu's which don't need it. I think DMA_CMA was introduced
around 3.12.


I don't think that's a good idea. Omitting __GFP_WAIT would cause
unnecessary memory allocation errors on systems under stress.
I think this should be filed as a DMA subsystem kernel bug / regression
and an appropriate solution should be worked out together with the DMA
subsystem maintainers and then 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Mario Kleiner

On 08/11/2014 05:17 PM, Jerome Glisse wrote:

On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:

On 08/10/2014 08:02 PM, Mario Kleiner wrote:

On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:

On 08/10/2014 05:11 AM, Mario Kleiner wrote:

Resent this time without HTML formatting which lkml doesn't like.
Sorry.

On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:

On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:

On August 9, 2014 1:39:39 AM EDT, Thomas
Hellstromthellst...@vmware.com  wrote:

Hi.


Hey Thomas!


IIRC I don't think the TTM DMA pool allocates coherent pages more
than
one page at a time, and _if that's true_ it's pretty unnecessary for
the
dma subsystem to route those allocations to CMA. Maybe Konrad could
shed
some light over this?

It should allocate in batches and keep them in the TTM DMA pool for
some time to be reused.

The pages that it gets are in 4kb granularity though.

Then I feel inclined to say this is a DMA subsystem bug. Single page
allocations shouldn't get routed to CMA.

/Thomas

Yes, seems you're both right. I read through the code a bit more and
indeed the TTM DMA pool allocates only one page during each
dma_alloc_coherent() call, so it doesn't need CMA memory. The current
allocators don't check for single page CMA allocations and therefore
try to get it from the CMA area anyway, instead of skipping to the
much cheaper fallback.

So the callers of dma_alloc_from_contiguous() could need that little
optimization of skipping it if only one page is requested. For

dma_generic_alloc_coherent
https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3Ddma_generic_alloc_coherentk=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=d1852625e2ab2ff07eb34a7f33fc1f55f7f13959912d5a6ce9316d23070ce939

andintel_alloc_coherent
https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3Dintel_alloc_coherentk=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=82d587e9b6aeced5cf9a7caefa91bf47fba809f3522b7379d22e45a2d5d35ebd
this
seems easy to do. Looking at the arm arch variants, e.g.,

https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c%23L1194k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=4c178257eab9b5d7ca650dedba76cf27abeb49ddc7aebb9433f52b6c8bb3bbac


and

https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c%23L44k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=5f62f4cbe8cee1f1dd4cbba656354efe6867bcdc664cf90e9719e2f42a85de08


i'm not sure if it is that easily done, as there aren't any fallbacks
for such a case and the code looks to me as if that's at least
somewhat intentional.

As far as TTM goes, one quick one-line fix to prevent it from using
the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
above methods) would be to clear the __GFP_WAIT
https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__GFP_WAITk=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=d56d076770d3416264be6c9ea2829ac0d6951203696fa3ad04144f13307577bc
flag from the
passed gfp_t flags. That would trigger the well working fallback.
So, is

__GFP_WAIT
https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__GFP_WAITk=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=d56d076770d3416264be6c9ea2829ac0d6951203696fa3ad04144f13307577bc
needed
for those single page allocations that go through__ttm_dma_alloc_page
https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0?


It would be nice to have such a simple, non-intrusive one-line patch
that we still could get into 3.17 and then backported to older stable
kernels to avoid the same desktop hangs there if CMA is enabled. It
would be also nice for actual users of CMA to not use up lots of CMA
space for gpu's which don't need it. I think DMA_CMA was introduced
around 3.12.


I don't think that's a good idea. Omitting __GFP_WAIT would cause
unnecessary memory allocation errors on systems under stress.
I think this should be filed as a DMA subsystem kernel bug / regression
and an appropriate solution should be worked out together with the 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Konrad Rzeszutek Wilk
On Tue, Aug 12, 2014 at 02:12:07PM +0200, Mario Kleiner wrote:
 On 08/11/2014 05:17 PM, Jerome Glisse wrote:
 On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
 On 08/10/2014 08:02 PM, Mario Kleiner wrote:
 On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
 On 08/10/2014 05:11 AM, Mario Kleiner wrote:
 Resent this time without HTML formatting which lkml doesn't like.
 Sorry.
 
 On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
 On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
 On August 9, 2014 1:39:39 AM EDT, Thomas
 Hellstromthellst...@vmware.com  wrote:
 Hi.
 
 Hey Thomas!
 
 IIRC I don't think the TTM DMA pool allocates coherent pages more
 than
 one page at a time, and _if that's true_ it's pretty unnecessary for
 the
 dma subsystem to route those allocations to CMA. Maybe Konrad could
 shed
 some light over this?
 It should allocate in batches and keep them in the TTM DMA pool for
 some time to be reused.
 
 The pages that it gets are in 4kb granularity though.
 Then I feel inclined to say this is a DMA subsystem bug. Single page
 allocations shouldn't get routed to CMA.
 
 /Thomas
 Yes, seems you're both right. I read through the code a bit more and
 indeed the TTM DMA pool allocates only one page during each
 dma_alloc_coherent() call, so it doesn't need CMA memory. The current
 allocators don't check for single page CMA allocations and therefore
 try to get it from the CMA area anyway, instead of skipping to the
 much cheaper fallback.
 
 So the callers of dma_alloc_from_contiguous() could need that little
 optimization of skipping it if only one page is requested. For
 
 dma_generic_alloc_coherent
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3Ddma_generic_alloc_coherentk=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=d1852625e2ab2ff07eb34a7f33fc1f55f7f13959912d5a6ce9316d23070ce939
 
 andintel_alloc_coherent
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3Dintel_alloc_coherentk=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=82d587e9b6aeced5cf9a7caefa91bf47fba809f3522b7379d22e45a2d5d35ebd
 this
 seems easy to do. Looking at the arm arch variants, e.g.,
 
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c%23L1194k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=4c178257eab9b5d7ca650dedba76cf27abeb49ddc7aebb9433f52b6c8bb3bbac
 
 
 and
 
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c%23L44k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=5f62f4cbe8cee1f1dd4cbba656354efe6867bcdc664cf90e9719e2f42a85de08
 
 
 i'm not sure if it is that easily done, as there aren't any fallbacks
 for such a case and the code looks to me as if that's at least
 somewhat intentional.
 
 As far as TTM goes, one quick one-line fix to prevent it from using
 the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
 above methods) would be to clear the __GFP_WAIT
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__GFP_WAITk=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=d56d076770d3416264be6c9ea2829ac0d6951203696fa3ad04144f13307577bc
 flag from the
 passed gfp_t flags. That would trigger the well working fallback.
 So, is
 
 __GFP_WAIT
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__GFP_WAITk=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=d56d076770d3416264be6c9ea2829ac0d6951203696fa3ad04144f13307577bc
 needed
 for those single page allocations that go through__ttm_dma_alloc_page
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0?
 
 
 It would be nice to have such a simple, non-intrusive one-line patch
 that we still could get into 3.17 and then backported to older stable
 kernels to avoid the same desktop hangs there if CMA is enabled. It
 would be also nice for actual users of CMA to not use up lots of CMA
 space for gpu's which don't need it. I think DMA_CMA was introduced
 around 3.12.
 
 I don't think that's a good idea. Omitting __GFP_WAIT would cause
 unnecessary memory allocation errors on systems under stress.
 I 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Michel Dänzer
On 12.08.2014 00:17, Jerome Glisse wrote:
 On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
 On 08/10/2014 08:02 PM, Mario Kleiner wrote:
 On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
 On 08/10/2014 05:11 AM, Mario Kleiner wrote:

 The other problem is that probably TTM does not reuse pages from the
 DMA pool. If i trace the __ttm_dma_alloc_page
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0
 and
 __ttm_dma_free_page
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0
 calls for
 those single page allocs/frees, then over a 20 second interval of
 tracing and switching tabs in firefox, scrolling things around etc. i
 find about as many alloc's as i find free's, e.g., 1607 allocs vs.
 1648 frees.
 This is because historically the pools have been designed to keep only
 pages with nonstandard caching attributes since changing page caching
 attributes have been very slow but the kernel page allocators have been
 reasonably fast.

 /Thomas

 Ok. A bit more ftraceing showed my hang problem case goes through the
 if (is_cached) paths, so the pool doesn't recycle anything and i see
 it bouncing up and down by 4 pages all the time.

 But for the non-cached case, which i don't hit with my problem, could
 one of you look at line 954...

 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c%23L954k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=e15c51805d429ee6d8960d6b88035e9811a1cdbfbf13168eec2fbb2214b99c60


 ... and tell me why that unconditional npages = count; assignment
 makes sense? It seems to essentially disable all recycling for the dma
 pool whenever the pool isn't filled up to/beyond its maximum with free
 pages? When the pool is filled up, lots of stuff is recycled, but when
 it is already somewhat below capacity, it gets punished by not
 getting refilled? I'd just like to understand the logic behind that line.

 thanks,
 -mario

 I'll happily forward that question to Konrad who wrote the code (or it
 may even stem from the ordinary page pool code which IIRC has Dave
 Airlie / Jerome Glisse as authors)
 
 This is effectively bogus code, i now wonder how it came to stay alive.
 Attached patch will fix that.

I haven't tested Mario's scenario specifically, but it survived piglit
and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
some BOs ended up in GTT instead with write-combined CPU mappings) on
radeonsi without any noticeable issues.

Tested-by: Michel Dänzer michel.daen...@amd.com


-- 
Earthling Michel Dänzer|  http://www.amd.com
Libre software enthusiast  |Mesa and X developer
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Jerome Glisse
On Wed, Aug 13, 2014 at 10:50:25AM +0900, Michel Dänzer wrote:
 On 12.08.2014 00:17, Jerome Glisse wrote:
  On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
  On 08/10/2014 08:02 PM, Mario Kleiner wrote:
  On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
  On 08/10/2014 05:11 AM, Mario Kleiner wrote:
 
  The other problem is that probably TTM does not reuse pages from the
  DMA pool. If i trace the __ttm_dma_alloc_page
  https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0
  and
  __ttm_dma_free_page
  https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0
  calls for
  those single page allocs/frees, then over a 20 second interval of
  tracing and switching tabs in firefox, scrolling things around etc. i
  find about as many alloc's as i find free's, e.g., 1607 allocs vs.
  1648 frees.
  This is because historically the pools have been designed to keep only
  pages with nonstandard caching attributes since changing page caching
  attributes have been very slow but the kernel page allocators have been
  reasonably fast.
 
  /Thomas
 
  Ok. A bit more ftraceing showed my hang problem case goes through the
  if (is_cached) paths, so the pool doesn't recycle anything and i see
  it bouncing up and down by 4 pages all the time.
 
  But for the non-cached case, which i don't hit with my problem, could
  one of you look at line 954...
 
  https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c%23L954k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=e15c51805d429ee6d8960d6b88035e9811a1cdbfbf13168eec2fbb2214b99c60
 
 
  ... and tell me why that unconditional npages = count; assignment
  makes sense? It seems to essentially disable all recycling for the dma
  pool whenever the pool isn't filled up to/beyond its maximum with free
  pages? When the pool is filled up, lots of stuff is recycled, but when
  it is already somewhat below capacity, it gets punished by not
  getting refilled? I'd just like to understand the logic behind that line.
 
  thanks,
  -mario
 
  I'll happily forward that question to Konrad who wrote the code (or it
  may even stem from the ordinary page pool code which IIRC has Dave
  Airlie / Jerome Glisse as authors)
  
  This is effectively bogus code, i now wonder how it came to stay alive.
  Attached patch will fix that.
 
 I haven't tested Mario's scenario specifically, but it survived piglit
 and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
 some BOs ended up in GTT instead with write-combined CPU mappings) on
 radeonsi without any noticeable issues.
 
 Tested-by: Michel Dänzer michel.daen...@amd.com
 

My patch does not fix the cma bug, cma should not allocate single page into
it reserved contiguous memory. But cma is a broken technology in the first
place and it should not be enabled on x86 who ever did that is a moron.

So i would definitly encourage opening a bug against cma.

None the less ttm code was buggy too and this patch will fix that but will
only allieviate or delay the symptoms reported by Mario.

Cheers,
Jérôme

 
 -- 
 Earthling Michel Dänzer|  http://www.amd.com
 Libre software enthusiast  |Mesa and X developer
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Mario Kleiner

On 08/13/2014 03:50 AM, Michel Dänzer wrote:

On 12.08.2014 00:17, Jerome Glisse wrote:

On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:

On 08/10/2014 08:02 PM, Mario Kleiner wrote:

On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:

On 08/10/2014 05:11 AM, Mario Kleiner wrote:

The other problem is that probably TTM does not reuse pages from the
DMA pool. If i trace the __ttm_dma_alloc_page
https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0
and
__ttm_dma_free_page
https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0
calls for
those single page allocs/frees, then over a 20 second interval of
tracing and switching tabs in firefox, scrolling things around etc. i
find about as many alloc's as i find free's, e.g., 1607 allocs vs.
1648 frees.

This is because historically the pools have been designed to keep only
pages with nonstandard caching attributes since changing page caching
attributes have been very slow but the kernel page allocators have been
reasonably fast.

/Thomas

Ok. A bit more ftraceing showed my hang problem case goes through the
if (is_cached) paths, so the pool doesn't recycle anything and i see
it bouncing up and down by 4 pages all the time.

But for the non-cached case, which i don't hit with my problem, could
one of you look at line 954...

https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c%23L954k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=e15c51805d429ee6d8960d6b88035e9811a1cdbfbf13168eec2fbb2214b99c60


... and tell me why that unconditional npages = count; assignment
makes sense? It seems to essentially disable all recycling for the dma
pool whenever the pool isn't filled up to/beyond its maximum with free
pages? When the pool is filled up, lots of stuff is recycled, but when
it is already somewhat below capacity, it gets punished by not
getting refilled? I'd just like to understand the logic behind that line.

thanks,
-mario

I'll happily forward that question to Konrad who wrote the code (or it
may even stem from the ordinary page pool code which IIRC has Dave
Airlie / Jerome Glisse as authors)

This is effectively bogus code, i now wonder how it came to stay alive.
Attached patch will fix that.

I haven't tested Mario's scenario specifically, but it survived piglit
and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
some BOs ended up in GTT instead with write-combined CPU mappings) on
radeonsi without any noticeable issues.

Tested-by: Michel Dänzer michel.daen...@amd.com




I haven't tested the patch yet. For the original bug it won't help 
directly, because the super-slow allocations which cause the desktop 
stall are tt_cached allocations, so they go through the if (is_cached) 
code path which isn't improved by Jerome's patch. is_cached always 
releases memory immediately, so the tt_cached pool just bounces up and 
down between 4 and 7 pages. So this was an independent issue. The slow 
allocations i noticed were mostly caused by exa allocating new gem bo's, 
i don't know which path is taken by 3d graphics?


However, the fixed ttm path could indirectly solve the DMA_CMA stalls by 
completely killing CMA for its intended purpose. Typical CMA sizes are 
probably around  100 MB (kernel default is 16 MB, Ubuntu config is 64 
MB), and the limit for the page pool seems to be more like 50% of all 
system RAM? Iow. if the ttm dma pool is allowed to grow that big with 
recycled pages, it probably will almost completely monopolize the whole 
CMA memory after a short amount of time. ttm won't suffer stalls if it 
essentially doesn't interact with CMA anymore after a warmup period, but 
actual clients which really need CMA (ie., hardware without 
scatter-gather dma etc.) will be starved of what they need as far as my 
limited understanding of the CMA goes.


So fwiw probably the fix to ttm will increase the urgency for the CMA 
people to come up with a fix/optimization for the allocator. Unless it 
doesn't matter if most desktop systems have CMA disabled by default, and 
ttm is mostly used by desktop graphics drivers (nouveau, radeon, vmgfx)? 
I only stumbled over the problem because the Ubuntu 3.16 mainline 
testing kernels are compiled with CMA on.


-mario

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-12 Thread Jerome Glisse
On Wed, Aug 13, 2014 at 04:04:15AM +0200, Mario Kleiner wrote:
 On 08/13/2014 03:50 AM, Michel Dänzer wrote:
 On 12.08.2014 00:17, Jerome Glisse wrote:
 On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
 On 08/10/2014 08:02 PM, Mario Kleiner wrote:
 On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
 On 08/10/2014 05:11 AM, Mario Kleiner wrote:
 The other problem is that probably TTM does not reuse pages from the
 DMA pool. If i trace the __ttm_dma_alloc_page
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0
 and
 __ttm_dma_free_page
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_pagek=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0
 calls for
 those single page allocs/frees, then over a 20 second interval of
 tracing and switching tabs in firefox, scrolling things around etc. i
 find about as many alloc's as i find free's, e.g., 1607 allocs vs.
 1648 frees.
 This is because historically the pools have been designed to keep only
 pages with nonstandard caching attributes since changing page caching
 attributes have been very slow but the kernel page allocators have been
 reasonably fast.
 
 /Thomas
 Ok. A bit more ftraceing showed my hang problem case goes through the
 if (is_cached) paths, so the pool doesn't recycle anything and i see
 it bouncing up and down by 4 pages all the time.
 
 But for the non-cached case, which i don't hit with my problem, could
 one of you look at line 954...
 
 https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c%23L954k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0Ar=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0Am=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0As=e15c51805d429ee6d8960d6b88035e9811a1cdbfbf13168eec2fbb2214b99c60
 
 
 ... and tell me why that unconditional npages = count; assignment
 makes sense? It seems to essentially disable all recycling for the dma
 pool whenever the pool isn't filled up to/beyond its maximum with free
 pages? When the pool is filled up, lots of stuff is recycled, but when
 it is already somewhat below capacity, it gets punished by not
 getting refilled? I'd just like to understand the logic behind that line.
 
 thanks,
 -mario
 I'll happily forward that question to Konrad who wrote the code (or it
 may even stem from the ordinary page pool code which IIRC has Dave
 Airlie / Jerome Glisse as authors)
 This is effectively bogus code, i now wonder how it came to stay alive.
 Attached patch will fix that.
 I haven't tested Mario's scenario specifically, but it survived piglit
 and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
 some BOs ended up in GTT instead with write-combined CPU mappings) on
 radeonsi without any noticeable issues.
 
 Tested-by: Michel Dänzer michel.daen...@amd.com
 
 
 
 I haven't tested the patch yet. For the original bug it won't help directly,
 because the super-slow allocations which cause the desktop stall are
 tt_cached allocations, so they go through the if (is_cached) code path which
 isn't improved by Jerome's patch. is_cached always releases memory
 immediately, so the tt_cached pool just bounces up and down between 4 and 7
 pages. So this was an independent issue. The slow allocations i noticed were
 mostly caused by exa allocating new gem bo's, i don't know which path is
 taken by 3d graphics?
 
 However, the fixed ttm path could indirectly solve the DMA_CMA stalls by
 completely killing CMA for its intended purpose. Typical CMA sizes are
 probably around  100 MB (kernel default is 16 MB, Ubuntu config is 64 MB),
 and the limit for the page pool seems to be more like 50% of all system RAM?
 Iow. if the ttm dma pool is allowed to grow that big with recycled pages, it
 probably will almost completely monopolize the whole CMA memory after a
 short amount of time. ttm won't suffer stalls if it essentially doesn't
 interact with CMA anymore after a warmup period, but actual clients which
 really need CMA (ie., hardware without scatter-gather dma etc.) will be
 starved of what they need as far as my limited understanding of the CMA
 goes.

Yes currently we allow the pool to be way too big, given that pool was probably
never really use we most likely never had much of an issue. So i would hold on
applying my patch until more proper limit are in place. My thinking was to go
for something like 32/64M at most and less then that if < 256M total ram. I also
think that we should lower the pool size on first call to shrink and only 
increase
it again after 
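
A cap along the lines Jerome sketches above could be computed roughly like this (illustrative C only: the 64M and < 256M figures come from his mail, while the helper name and the fraction used for small-memory systems are made up for the example):

/* Illustrative sketch of a pool-size cap as discussed above. */
unsigned long ttm_dma_pool_cap_bytes(unsigned long totalram_bytes)
{
        unsigned long cap = 64UL << 20;         /* at most 64M of pooled pages */

        if (totalram_bytes < (256UL << 20))     /* < 256M of total RAM */
                cap = totalram_bytes / 8;       /* keep the pool much smaller */
        return cap;
}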

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-11 Thread Jerome Glisse
On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> > On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> >> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >>> Resent this time without HTML formatting which lkml doesn't like.
> >>> Sorry.
> >>>
> >>> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
>  On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
> > On August 9, 2014 1:39:39 AM EDT, Thomas
> > Hellstrom  wrote:
> >> Hi.
> >>
> > Hey Thomas!
> >
> >> IIRC I don't think the TTM DMA pool allocates coherent pages more
> >> than
> >> one page at a time, and _if that's true_ it's pretty unnecessary for
> >> the
> >> dma subsystem to route those allocations to CMA. Maybe Konrad could
> >> shed
> >> some light over this?
> > It should allocate in batches and keep them in the TTM DMA pool for
> > some time to be reused.
> >
> > The pages that it gets are in 4kb granularity though.
>  Then I feel inclined to say this is a DMA subsystem bug. Single page
>  allocations shouldn't get routed to CMA.
> 
>  /Thomas
> >>> Yes, seems you're both right. I read through the code a bit more and
> >>> indeed the TTM DMA pool allocates only one page during each
> >>> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
> >>> allocators don't check for single page CMA allocations and therefore
> >>> try to get it from the CMA area anyway, instead of skipping to the
> >>> much cheaper fallback.
> >>>
> >>> So the callers of dma_alloc_from_contiguous() could need that little
> >>> optimization of skipping it if only one page is requested. For
> >>>
> >>> dma_generic_alloc_coherent and intel_alloc_coherent this
> >>> seems easy to do. Looking at the arm arch variants, e.g.,
> >>>
> >>> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
> >>>
> >>>
> >>> and
> >>>
> >>> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
> >>>
> >>>
> >>> i'm not sure if it is that easily done, as there aren't any fallbacks
> >>> for such a case and the code looks to me as if that's at least
> >>> somewhat intentional.
> >>>
> >>> As far as TTM goes, one quick one-line fix to prevent it from using
> >>> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
> >>> above methods) would be to clear the __GFP_WAIT flag from the
> >>> passed gfp_t flags. That would trigger the well working fallback.
> >>> So, is __GFP_WAIT needed
> >>> for those single page allocations that go through __ttm_dma_alloc_page?
> >>>
> >>>
> >>> It would be nice to have such a simple, non-intrusive one-line patch
> >>> that we still could get into 3.17 and then backported to older stable
> >>> kernels to avoid the same desktop hangs there if CMA is enabled. It
> >>> would be 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-11 Thread Thomas Hellstrom
On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
>> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
>>> Resent this time without HTML formatting which lkml doesn't like.
>>> Sorry.
>>>
>>> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
 On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
> On August 9, 2014 1:39:39 AM EDT, Thomas
> Hellstrom  wrote:
>> Hi.
>>
> Hey Thomas!
>
>> IIRC I don't think the TTM DMA pool allocates coherent pages more
>> than
>> one page at a time, and _if that's true_ it's pretty unnecessary for
>> the
>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>> shed
>> some light over this?
> It should allocate in batches and keep them in the TTM DMA pool for
> some time to be reused.
>
> The pages that it gets are in 4kb granularity though.
 Then I feel inclined to say this is a DMA subsystem bug. Single page
 allocations shouldn't get routed to CMA.

 /Thomas
>>> Yes, seems you're both right. I read through the code a bit more and
>>> indeed the TTM DMA pool allocates only one page during each
>>> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
>>> allocators don't check for single page CMA allocations and therefore
>>> try to get it from the CMA area anyway, instead of skipping to the
>>> much cheaper fallback.
>>>
>>> So the callers of dma_alloc_from_contiguous() could need that little
>>> optimization of skipping it if only one page is requested. For
>>>
>>> dma_generic_alloc_coherent and intel_alloc_coherent this
>>> seems easy to do. Looking at the arm arch variants, e.g.,
>>>
>>> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
>>>
>>>
>>> and
>>>
>>> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
>>>
>>>
>>> i'm not sure if it is that easily done, as there aren't any fallbacks
>>> for such a case and the code looks to me as if that's at least
>>> somewhat intentional.
>>>
>>> As far as TTM goes, one quick one-line fix to prevent it from using
>>> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
>>> above methods) would be to clear the __GFP_WAIT flag from the
>>> passed gfp_t flags. That would trigger the well working fallback.
>>> So, is __GFP_WAIT needed
>>> for those single page allocations that go through __ttm_dma_alloc_page?
>>>
>>>
>>> It would be nice to have such a simple, non-intrusive one-line patch
>>> that we still could get into 3.17 and then backported to older stable
>>> kernels to avoid the same desktop hangs there if CMA is enabled. It
>>> would be also nice for actual users of CMA to not use up lots of CMA
>>> space for gpu's which don't need it. I think DMA_CMA was introduced
>>> around 3.12.
>>>
>> I don't think that's a good idea. Omitting __GFP_WAIT would cause
>> 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-11 Thread Thomas Hellstrom
On 08/10/2014 08:02 PM, Mario Kleiner wrote:
 On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
 On 08/10/2014 05:11 AM, Mario Kleiner wrote:
 Resent this time without HTML formatting which lkml doesn't like.
 Sorry.

 On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
 On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
 On August 9, 2014 1:39:39 AM EDT, Thomas
 Hellstromthellst...@vmware.com  wrote:
 Hi.

 Hey Thomas!

 IIRC I don't think the TTM DMA pool allocates coherent pages more
 than
 one page at a time, and _if that's true_ it's pretty unnecessary for
 the
 dma subsystem to route those allocations to CMA. Maybe Konrad could
 shed
 some light over this?
 It should allocate in batches and keep them in the TTM DMA pool for
 some time to be reused.

 The pages that it gets are in 4kb granularity though.
 Then I feel inclined to say this is a DMA subsystem bug. Single page
 allocations shouldn't get routed to CMA.

 /Thomas
 Yes, seems you're both right. I read through the code a bit more and
 indeed the TTM DMA pool allocates only one page during each
 dma_alloc_coherent() call, so it doesn't need CMA memory. The current
 allocators don't check for single page CMA allocations and therefore
 try to get it from the CMA area anyway, instead of skipping to the
 much cheaper fallback.

 So the callers of dma_alloc_from_contiguous() could need that little
 optimization of skipping it if only one page is requested. For

 dma_generic_alloc_coherent (http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent)
 and intel_alloc_coherent (http://lxr.free-electrons.com/ident?i=intel_alloc_coherent) this
 seems easy to do. Looking at the arm arch variants, e.g.,

 http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194


 and

 http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44


 i'm not sure if it is that easily done, as there aren't any fallbacks
 for such a case and the code looks to me as if that's at least
 somewhat intentional.

 As far as TTM goes, one quick one-line fix to prevent it from using
 the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
 above methods) would be to clear the __GFP_WAIT
 (http://lxr.free-electrons.com/ident?i=__GFP_WAIT) flag from the
 passed gfp_t flags. That would trigger the well working fallback.
 So, is __GFP_WAIT needed
 for those single page allocations that go through __ttm_dma_alloc_page
 (http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page)?


 It would be nice to have such a simple, non-intrusive one-line patch
 that we still could get into 3.17 and then backported to older stable
 kernels to avoid the same desktop hangs there if CMA is enabled. It
 would be also nice for actual users of CMA to not use up lots of CMA
 space for gpu's which don't need it. I think DMA_CMA was introduced
 around 3.12.

 I don't think that's a good idea. Omitting __GFP_WAIT would cause
 unnecessary memory allocation errors on systems under stress.
 I think this should be filed as a DMA subsystem kernel bug / regression
 and an appropriate solution should be worked out together with the DMA
 subsystem maintainers and then backported.

 Ok, 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-11 Thread Jerome Glisse
On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
 On 08/10/2014 08:02 PM, Mario Kleiner wrote:
  On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
  On 08/10/2014 05:11 AM, Mario Kleiner wrote:
  Resent this time without HTML formatting which lkml doesn't like.
  Sorry.
 
  On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
  On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
  On August 9, 2014 1:39:39 AM EDT, Thomas
  Hellstromthellst...@vmware.com  wrote:
  Hi.
 
  Hey Thomas!
 
  IIRC I don't think the TTM DMA pool allocates coherent pages more
  than
  one page at a time, and _if that's true_ it's pretty unnecessary for
  the
  dma subsystem to route those allocations to CMA. Maybe Konrad could
  shed
  some light over this?
  It should allocate in batches and keep them in the TTM DMA pool for
  some time to be reused.
 
  The pages that it gets are in 4kb granularity though.
  Then I feel inclined to say this is a DMA subsystem bug. Single page
  allocations shouldn't get routed to CMA.
 
  /Thomas
  Yes, seems you're both right. I read through the code a bit more and
  indeed the TTM DMA pool allocates only one page during each
  dma_alloc_coherent() call, so it doesn't need CMA memory. The current
  allocators don't check for single page CMA allocations and therefore
  try to get it from the CMA area anyway, instead of skipping to the
  much cheaper fallback.
 
  So the callers of dma_alloc_from_contiguous() could need that little
  optimization of skipping it if only one page is requested. For
 
  dma_generic_alloc_coherent (http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent)
  and intel_alloc_coherent (http://lxr.free-electrons.com/ident?i=intel_alloc_coherent) this
  seems easy to do. Looking at the arm arch variants, e.g.,
 
  http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
 
 
  and
 
  http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
 
 
  i'm not sure if it is that easily done, as there aren't any fallbacks
  for such a case and the code looks to me as if that's at least
  somewhat intentional.
 
  As far as TTM goes, one quick one-line fix to prevent it from using
  the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
  above methods) would be to clear the __GFP_WAIT
  (http://lxr.free-electrons.com/ident?i=__GFP_WAIT) flag from the
  passed gfp_t flags. That would trigger the well working fallback.
  So, is __GFP_WAIT needed
  for those single page allocations that go through __ttm_dma_alloc_page
  (http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page)?
 
 
  It would be nice to have such a simple, non-intrusive one-line patch
  that we still could get into 3.17 and then backported to older stable
  kernels to avoid the same desktop hangs there if CMA is enabled. It
  would be also nice for actual users of CMA to not use up lots of CMA
  space for gpu's which don't need it. I think DMA_CMA was introduced
  around 3.12.
 
  I don't think that's a good idea. Omitting __GFP_WAIT would cause
  unnecessary memory allocation errors on systems under stress.
  I think this should be filed as a DMA 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-10 Thread Mario Kleiner

On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:

On 08/10/2014 05:11 AM, Mario Kleiner wrote:

Resent this time without HTML formatting which lkml doesn't like. Sorry.

On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:

On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:

On August 9, 2014 1:39:39 AM EDT, Thomas
Hellstrom  wrote:

Hi.


Hey Thomas!


IIRC I don't think the TTM DMA pool allocates coherent pages more than
one page at a time, and _if that's true_ it's pretty unnecessary for
the
dma subsystem to route those allocations to CMA. Maybe Konrad could
shed
some light over this?

It should allocate in batches and keep them in the TTM DMA pool for
some time to be reused.

The pages that it gets are in 4kb granularity though.

Then I feel inclined to say this is a DMA subsystem bug. Single page
allocations shouldn't get routed to CMA.

/Thomas

Yes, seems you're both right. I read through the code a bit more and
indeed the TTM DMA pool allocates only one page during each
dma_alloc_coherent() call, so it doesn't need CMA memory. The current
allocators don't check for single page CMA allocations and therefore
try to get it from the CMA area anyway, instead of skipping to the
much cheaper fallback.

So the callers of dma_alloc_from_contiguous() could need that little
optimization of skipping it if only one page is requested. For

dma_generic_alloc_coherent and intel_alloc_coherent this
seems easy to do. Looking at the arm arch variants, e.g.,

http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194

and

http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44

i'm not sure if it is that easily done, as there aren't any fallbacks
for such a case and the code looks to me as if that's at least
somewhat intentional.

As far as TTM goes, one quick one-line fix to prevent it from using
the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
above methods) would be to clear the __GFP_WAIT flag from the
passed gfp_t flags. That would trigger the well working fallback. So, is
__GFP_WAIT needed
for those single page allocations that go through __ttm_dma_alloc_page?

It would be nice to have such a simple, non-intrusive one-line patch
that we still could get into 3.17 and then backported to older stable
kernels to avoid the same desktop hangs there if CMA is enabled. It
would be also nice for actual users of CMA to not use up lots of CMA
space for gpu's which don't need it. I think DMA_CMA was introduced
around 3.12.


I don't think that's a good idea. Omitting __GFP_WAIT would cause
unnecessary memory allocation errors on systems under stress.
I think this should be filed as a DMA subsystem kernel bug / regression
and an appropriate solution should be worked out together with the DMA
subsystem maintainers and then backported.


Ok, so it is needed. I'll file a bug report.


The other problem is that probably TTM does not reuse pages from the
DMA pool. If i trace the __ttm_dma_alloc_page and
__ttm_dma_free_page calls for
those single page allocs/frees, then over a 20 second interval of
tracing and switching tabs in firefox, scrolling things around etc. i
find about as many alloc's as i find free's, e.g., 1607 allocs vs.
1648 frees.

This is because historically the pools have been designed to keep only
pages with nonstandard caching attributes since changing page caching
attributes have been very slow but the kernel page allocators have been
reasonably fast.

/Thomas


Ok. A bit more ftraceing showed my hang problem case goes through the 
"if (is_cached)" paths, so the pool doesn't recycle anything and i see 
it bouncing up and down by 4 pages all the time.


But for the non-cached case, which i don't hit with my problem, could 
one of you look at line 954...


http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954

... and tell me why that unconditional npages = count; assignment makes sense? It seems 
to essentially disable all recycling for the dma pool whenever the pool isn't filled up 
to/beyond its maximum with free pages? When the pool is filled up, lots of stuff is 
recycled, but when it is already somewhat below capacity, it gets "punished" by 
not getting refilled? I'd just like to understand the logic behind that line.

thanks,
-mario



This bit of code from ttm_dma_unpopulate() (line
954 in 3.16) looks suspicious:

http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954


Alloc's from a tt_cached cached pool ( if (is_cached)...) always get
freed 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-10 Thread Thomas Hellstrom
On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> Resent this time without HTML formatting which lkml doesn't like. Sorry.
>
> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
>> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
>>> On August 9, 2014 1:39:39 AM EDT, Thomas
>>> Hellstrom  wrote:
 Hi.

>>> Hey Thomas!
>>>
 IIRC I don't think the TTM DMA pool allocates coherent pages more than
 one page at a time, and _if that's true_ it's pretty unnecessary for
 the
 dma subsystem to route those allocations to CMA. Maybe Konrad could
 shed
 some light over this?
>>> It should allocate in batches and keep them in the TTM DMA pool for
>>> some time to be reused.
>>>
>>> The pages that it gets are in 4kb granularity though.
>> Then I feel inclined to say this is a DMA subsystem bug. Single page
>> allocations shouldn't get routed to CMA.
>>
>> /Thomas
>
> Yes, seems you're both right. I read through the code a bit more and
> indeed the TTM DMA pool allocates only one page during each
> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
> allocators don't check for single page CMA allocations and therefore
> try to get it from the CMA area anyway, instead of skipping to the
> much cheaper fallback.
>
> So the callers of dma_alloc_from_contiguous() could need that little
> optimization of skipping it if only one page is requested. For
>
> dma_generic_alloc_coherent and intel_alloc_coherent this
> seems easy to do. Looking at the arm arch variants, e.g.,
>
> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
>
> and
>
> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
>
> i'm not sure if it is that easily done, as there aren't any fallbacks
> for such a case and the code looks to me as if that's at least
> somewhat intentional.
>
> As far as TTM goes, one quick one-line fix to prevent it from using
> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
> above methods) would be to clear the __GFP_WAIT flag from the
> passed gfp_t flags. That would trigger the well working fallback. So, is
>
> __GFP_WAIT needed
> for those single page allocations that go through __ttm_dma_alloc_page?
>
> It would be nice to have such a simple, non-intrusive one-line patch
> that we still could get into 3.17 and then backported to older stable
> kernels to avoid the same desktop hangs there if CMA is enabled. It
> would be also nice for actual users of CMA to not use up lots of CMA
> space for gpu's which don't need it. I think DMA_CMA was introduced
> around 3.12.
>

I don't think that's a good idea. Omitting __GFP_WAIT would cause
unnecessary memory allocation errors on systems under stress.
I think this should be filed as a DMA subsystem kernel bug / regression
and an appropriate solution should be worked out together with the DMA
subsystem maintainers and then backported.

>
> The other problem is that probably TTM does not reuse pages from the
> DMA pool. If i trace the __ttm_dma_alloc_page and
> __ttm_dma_free_page calls for
> those single page allocs/frees, then over a 20 second interval of
> tracing and switching tabs in firefox, scrolling things around etc. i
> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> 1648 frees.

This is because historically the pools have been designed to keep only
pages with nonstandard caching attributes since changing page caching
attributes have been very slow but the kernel page allocators have been
reasonably fast.

/Thomas

>
> This bit of code from ttm_dma_unpopulate() (line
> 954 in 3.16) looks suspicious:
>
> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
>
>
> Alloc's from a tt_cached cached pool ( if (is_cached)...) always get
> freed and are not given back to the cached pool. But in the uncached
> case, there's logic to make sure the pool doesn't grow forever (line
> 955, checking against _manager->options.max_size), but before that
> check in line 954 there's an unconditional assignment of npages =
> count; which seems to force freeing all pages as well, instead of
> recycling? Is this some debug code left over, or intentional and just
> me not understanding what happens there?
>
> thanks,
> -mario
>
>
 /Thomas


 On 08/08/2014 07:42 PM, Mario Kleiner wrote:
> Hi all,
>
> there is a rather severe performance problem i accidentally found
 when
> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-10 Thread Thomas Hellstrom
On 08/10/2014 05:11 AM, Mario Kleiner wrote:
 Resent this time without HTML formatting which lkml doesn't like. Sorry.

 On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
 On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
 On August 9, 2014 1:39:39 AM EDT, Thomas
 Hellstromthellst...@vmware.com  wrote:
 Hi.

 Hey Thomas!

 IIRC I don't think the TTM DMA pool allocates coherent pages more than
 one page at a time, and _if that's true_ it's pretty unnecessary for
 the
 dma subsystem to route those allocations to CMA. Maybe Konrad could
 shed
 some light over this?
 It should allocate in batches and keep them in the TTM DMA pool for
 some time to be reused.

 The pages that it gets are in 4kb granularity though.
 Then I feel inclined to say this is a DMA subsystem bug. Single page
 allocations shouldn't get routed to CMA.

 /Thomas

 Yes, seems you're both right. I read through the code a bit more and
 indeed the TTM DMA pool allocates only one page during each
 dma_alloc_coherent() call, so it doesn't need CMA memory. The current
 allocators don't check for single page CMA allocations and therefore
 try to get it from the CMA area anyway, instead of skipping to the
 much cheaper fallback.

 So the callers of dma_alloc_from_contiguous() could need that little
 optimization of skipping it if only one page is requested. For

 dma_generic_alloc_coherent 
 http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent 
 and intel_alloc_coherent 
 http://lxr.free-electrons.com/ident?i=intel_alloc_coherent  this
 seems easy to do. Looking at the arm arch variants, e.g.,

 http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194

 and

 http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44

 i'm not sure if it is that easily done, as there aren't any fallbacks
 for such a case and the code looks to me as if that's at least
 somewhat intentional.

 As far as TTM goes, one quick one-line fix to prevent it from using
 the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
 above methods) would be to clear the __GFP_WAIT
 http://lxr.free-electrons.com/ident?i=__GFP_WAIT flag from the
 passed gfp_t flags. That would trigger the well working fallback. So, is

 __GFP_WAIT  http://lxr.free-electrons.com/ident?i=__GFP_WAIT  needed
 for those single page allocations that go through __ttm_dma_alloc_page 
 http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page?

 It would be nice to have such a simple, non-intrusive one-line patch
 that we still could get into 3.17 and then backported to older stable
 kernels to avoid the same desktop hangs there if CMA is enabled. It
 would be also nice for actual users of CMA to not use up lots of CMA
 space for gpu's which don't need it. I think DMA_CMA was introduced
 around 3.12.


I don't think that's a good idea. Omitting __GFP_WAIT would cause
unnecessary memory allocation errors on systems under stress.
I think this should be filed as a DMA subsystem kernel bug / regression
and an appropriate solution should be worked out together with the DMA
subsystem maintainers and then backported.


 The other problem is that probably TTM does not reuse pages from the
 DMA pool. If i trace the __ttm_dma_alloc_page
 http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page and
 __ttm_dma_free_page
 http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page calls for
 those single page allocs/frees, then over a 20 second interval of
 tracing and switching tabs in firefox, scrolling things around etc. i
 find about as many alloc's as i find free's, e.g., 1607 allocs vs.
 1648 frees.

This is because historically the pools have been designed to keep only
pages with nonstandard caching attributes since changing page caching
attributes have been very slow but the kernel page allocators have been
reasonably fast.

/Thomas


 This bit of code from ttm_dma_unpopulate
 http://lxr.free-electrons.com/ident?i=ttm_dma_unpopulate()  (line
 954 in 3.16) looks suspicious:

 http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954


 Alloc's from a tt_cached cached pool ( if (is_cached)...) always get
 freed and are not given back to the cached pool. But in the uncached
 case, there's logic to make sure the pool doesn't grow forever (line
 955, checking against _manager->options.max_size), but before that
 check in line 954 there's an unconditional assignment of npages =
 count; which seems to force freeing all pages as well, instead of
 recycling? Is this some debug code left over, or intentional and just
 me not understanding what happens there?

 thanks,
 -mario


 /Thomas


 On 08/08/2014 07:42 PM, Mario Kleiner wrote:
 Hi all,

 there is a rather severe performance problem i accidentally found
 when
 trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
 Ubuntu 14.04 LTS with nouveau as graphics driver.

 I was lazy and just installed the Ubuntu precompiled mainline kernel.
 That kernel happens to have CONFIG_DMA_CMA=y set, 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-10 Thread Mario Kleiner

On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:

On 08/10/2014 05:11 AM, Mario Kleiner wrote:

Resent this time without HTML formatting which lkml doesn't like. Sorry.

On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:

On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:

On August 9, 2014 1:39:39 AM EDT, Thomas
Hellstromthellst...@vmware.com  wrote:

Hi.


Hey Thomas!


IIRC I don't think the TTM DMA pool allocates coherent pages more than
one page at a time, and _if that's true_ it's pretty unnecessary for
the
dma subsystem to route those allocations to CMA. Maybe Konrad could
shed
some light over this?

It should allocate in batches and keep them in the TTM DMA pool for
some time to be reused.

The pages that it gets are in 4kb granularity though.

Then I feel inclined to say this is a DMA subsystem bug. Single page
allocations shouldn't get routed to CMA.

/Thomas

Yes, seems you're both right. I read through the code a bit more and
indeed the TTM DMA pool allocates only one page during each
dma_alloc_coherent() call, so it doesn't need CMA memory. The current
allocators don't check for single page CMA allocations and therefore
try to get it from the CMA area anyway, instead of skipping to the
much cheaper fallback.

So the callers of dma_alloc_from_contiguous() could need that little
optimization of skipping it if only one page is requested. For

dma_generic_alloc_coherent
http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent
and intel_alloc_coherent
http://lxr.free-electrons.com/ident?i=intel_alloc_coherent  this
seems easy to do. Looking at the arm arch variants, e.g.,

http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194

and

http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44

i'm not sure if it is that easily done, as there aren't any fallbacks
for such a case and the code looks to me as if that's at least
somewhat intentional.

As far as TTM goes, one quick one-line fix to prevent it from using
the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
above methods) would be to clear the __GFP_WAIT
http://lxr.free-electrons.com/ident?i=__GFP_WAIT flag from the
passed gfp_t flags. That would trigger the well working fallback. So, is

__GFP_WAIT  http://lxr.free-electrons.com/ident?i=__GFP_WAIT  needed
for those single page allocations that go through __ttm_dma_alloc_page
http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page?

It would be nice to have such a simple, non-intrusive one-line patch
that we still could get into 3.17 and then backported to older stable
kernels to avoid the same desktop hangs there if CMA is enabled. It
would be also nice for actual users of CMA to not use up lots of CMA
space for gpu's which don't need it. I think DMA_CMA was introduced
around 3.12.


I don't think that's a good idea. Omitting __GFP_WAIT would cause
unnecessary memory allocation errors on systems under stress.
I think this should be filed as a DMA subsystem kernel bug / regression
and an appropriate solution should be worked out together with the DMA
subsystem maintainers and then backported.


Ok, so it is needed. I'll file a bug report.


The other problem is that probably TTM does not reuse pages from the
DMA pool. If i trace the __ttm_dma_alloc_page
http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page and
__ttm_dma_free_page
http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page calls for
those single page allocs/frees, then over a 20 second interval of
tracing and switching tabs in firefox, scrolling things around etc. i
find about as many alloc's as i find free's, e.g., 1607 allocs vs.
1648 frees.

This is because historically the pools have been designed to keep only
pages with nonstandard caching attributes since changing page caching
attributes have been very slow but the kernel page allocators have been
reasonably fast.

/Thomas


Ok. A bit more ftraceing showed my hang problem case goes through the 
if (is_cached) paths, so the pool doesn't recycle anything and i see 
it bouncing up and down by 4 pages all the time.


But for the non-cached case, which i don't hit with my problem, could 
one of you look at line 954...


http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954

... and tell me why that unconditional npages = count; assignment makes sense? It seems 
to essentially disable all recycling for the dma pool whenever the pool isn't filled up 
to/beyond its maximum with free pages? When the pool is filled up, lots of stuff is 
recycled, but when it is already somewhat below capacity, it gets punished by 
not getting refilled? I'd just like to understand the logic behind that line.

thanks,
-mario



This bit of code from ttm_dma_unpopulate
http://lxr.free-electrons.com/ident?i=ttm_dma_unpopulate()  (line
954 in 3.16) looks suspicious:

http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954


Alloc's from a tt_cached cached pool ( if (is_cached)...) always get
freed 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-09 Thread Mario Kleiner

Resent this time without HTML formatting which lkml doesn't like. Sorry.

On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:

On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:

On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom  
wrote:

Hi.


Hey Thomas!


IIRC I don't think the TTM DMA pool allocates coherent pages more than
one page at a time, and _if that's true_ it's pretty unnecessary for
the
dma subsystem to route those allocations to CMA. Maybe Konrad could
shed
some light over this?

It should allocate in batches and keep them in the TTM DMA pool for some time 
to be reused.

The pages that it gets are in 4kb granularity though.

Then I feel inclined to say this is a DMA subsystem bug. Single page
allocations shouldn't get routed to CMA.

/Thomas


Yes, seems you're both right. I read through the code a bit more and 
indeed the TTM DMA pool allocates only one page during each 
dma_alloc_coherent() call, so it doesn't need CMA memory. The current 
allocators don't check for single page CMA allocations and therefore try 
to get it from the CMA area anyway, instead of skipping to the much 
cheaper fallback.


So the callers of dma_alloc_from_contiguous() could need that little 
optimization of skipping it if only one page is requested. For


dma_generic_alloc_coherent and intel_alloc_coherent
this seems easy to do. Looking at the arm arch variants, e.g.,

http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194

and

http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44

i'm not sure if it is that easily done, as there aren't any fallbacks 
for such a case and the code looks to me as if that's at least somewhat 
intentional.


As far as TTM goes, one quick one-line fix to prevent it from using the 
CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the above 
methods) would be to clear the __GFP_WAIT flag from the passed 
gfp_t flags. That would trigger the well working fallback. So, is
__GFP_WAIT needed for those 
single page allocations that go through __ttm_dma_alloc_page?
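
Spelled out, that quick hack would amount to something like the following wherever the dma pool builds its allocation flags (a sketch, not a proposed patch; as Thomas's reply in the 2014-08-10 entries above explains, dropping __GFP_WAIT would cause unnecessary allocation failures under memory pressure):

/* Sketch of the quick hack: strip __GFP_WAIT so the dma layer skips
 * dma_alloc_from_contiguous() and takes the fast fallback instead. */
static gfp_t ttm_dma_gfp_without_cma(gfp_t gfp_flags)
{
        return gfp_flags & ~__GFP_WAIT;
}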

It would be nice to have such a simple, non-intrusive one-line patch 
that we still could get into 3.17 and then backported to older stable 
kernels to avoid the same desktop hangs there if CMA is enabled. It 
would be also nice for actual users of CMA to not use up lots of CMA 
space for gpu's which don't need it. I think DMA_CMA was introduced 
around 3.12.



The other problem is that probably TTM does not reuse pages from the DMA 
pool. If i trace the __ttm_dma_alloc_page and 
__ttm_dma_free_page calls for 
those single page allocs/frees, then over a 20 second interval of 
tracing and switching tabs in firefox, scrolling things around etc. i 
find about as many alloc's as i find free's, e.g., 1607 allocs vs. 1648 
frees.


This bit of code from ttm_dma_unpopulate() (line 954 
in 3.16) looks suspicious:


http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954

Alloc's from a tt_cached cached pool ( if (is_cached)...) always get 
freed and are not given back to the cached pool. But in the uncached 
case, there's logic to make sure the pool doesn't grow forever (line 
955, checking against _manager->options.max_size), but before that check 
in line 954 there's an unconditional assignment of npages = count; which 
seems to force freeing all pages as well, instead of recycling? Is this 
some debug code left over, or intentional and just me not understanding 
what happens there?


thanks,
-mario



/Thomas


On 08/08/2014 07:42 PM, Mario Kleiner wrote:

Hi all,

there is a rather severe performance problem i accidentally found

when

trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
Ubuntu 14.04 LTS with nouveau as graphics driver.

I was lazy and just installed the Ubuntu precompiled mainline kernel.
That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
(contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
weren't compiled with CMA, so i only observed this on 3.16, but
previous kernels would likely be affected too.

After a few minutes of regular desktop use like switching workspaces,
scrolling text in a terminal window, Firefox with multiple tabs open,
Thunderbird etc. (tested with KDE/Kwin, with/without desktop
composition), i get chunky desktop updates, then multi-second

freezes,

after a few minutes the desktop hangs for over a minute on almost any
GUI action like switching windows etc. --> Unuseable.

ftrace'ing shows the culprit being 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-09 Thread Thomas Hellstrom


On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
> On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom  
> wrote:
>> Hi.
>>
> Hey Thomas!
>
>> IIRC I don't think the TTM DMA pool allocates coherent pages more than
>> one page at a time, and _if that's true_ it's pretty unnecessary for
>> the
>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>> shed
>> some light over this?
> It should allocate in batches and keep them in the TTM DMA pool for some time 
> to be reused.
>
> The pages that it gets are in 4kb granularity though.

Then I feel inclined to say this is a DMA subsystem bug. Single page
allocations shouldn't get routed to CMA.

/Thomas


>> /Thomas
>>
>>
>> On 08/08/2014 07:42 PM, Mario Kleiner wrote:
>>> Hi all,
>>>
>>> there is a rather severe performance problem i accidentally found
>> when
>>> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
>>> Ubuntu 14.04 LTS with nouveau as graphics driver.
>>>
>>> I was lazy and just installed the Ubuntu precompiled mainline kernel.
>>> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
>>> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
>>> weren't compiled with CMA, so i only observed this on 3.16, but
>>> previous kernels would likely be affected too.
>>>
>>> After a few minutes of regular desktop use like switching workspaces,
>>> scrolling text in a terminal window, Firefox with multiple tabs open,
>>> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
>>> composition), i get chunky desktop updates, then multi-second
>> freezes,
>>> after a few minutes the desktop hangs for over a minute on almost any
>>> GUI action like switching windows etc. --> Unuseable.
>>>
>>> ftrace'ing shows the culprit being this callchain (typical good/bad
>>> example ftrace snippets at the end of this mail):
>>>
>>> ...ttm dma coherent memory allocations, e.g., from
>>> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
>>> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
>>> dma_alloc_from_contiguous()
>>>
>>> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
>> when
>>> the machine is booted with kernel boot cmdline parameter "cma=0", so
>>> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>>>
>>> With CMA, this function becomes progressively more slow with every
>>> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
>>> hundreds or thousands of microseconds (before it gives up and
>>> alloc_pages_node() fallback is used), so this causes the
>>> multi-second/minute hangs of the desktop.
>>>
>>> So it seems ttm memory allocations quickly fragment and/or exhaust
>> the
>>> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
>>> find a fitting hole big enough to satisfy allocations with a retry
>>> loop (see
>>>
>> http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
>>> that takes forever.
> I am curious why it does not end up using the pool. As in use the TTM DMA 
> pool to pick pages instead of allocating (and freeing) new ones?
>
>>> This is not good, also not for other devices which actually need a
>>> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
>>> still need physically contiguous dma memory, maybe with exception of
>>> some embedded gpus?
> Oh. If I understood you correctly - the CMA ends up giving huge chunks of 
> contiguous area. But if the sizes are 4kb I wonder why it would do that?
>
> The modern GPUs on x86 can deal with scatter gather and as you surmise don't 
> need contiguous physical contiguous areas.
>>> My naive approach would be to add a new gfp_t flag a la
>>> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
>>> refrain from doing so if they have some fallback for getting memory.
>>> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
>>> around here:
>>>
>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>>> However i'm not familiar enough with memory management, so likely
>>> greater minds here have much better ideas on how to deal with this?
>>>
> That is a bit of hack to deal with CMA being slow.
>
> Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
>>> thanks,
>>> -mario
>>>
>>> Typical snippet from an example trace of a badly stalling desktop
>> with
>>> CMA (alloc_pages_node() fallback may have been missing in this traces
>>> ftrace_filter 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-09 Thread Konrad Rzeszutek Wilk
On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom  
wrote:
>Hi.
>
Hey Thomas!

>IIRC I don't think the TTM DMA pool allocates coherent pages more than
>one page at a time, and _if that's true_ it's pretty unnecessary for
>the
>dma subsystem to route those allocations to CMA. Maybe Konrad could
>shed
>some light over this?

It should allocate in batches and keep them in the TTM DMA pool for some time 
to be reused.

The pages that it gets are in 4kb granularity though.
>
>/Thomas
>
>
>On 08/08/2014 07:42 PM, Mario Kleiner wrote:
>> Hi all,
>>
>> there is a rather severe performance problem i accidentally found
>when
>> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
>> Ubuntu 14.04 LTS with nouveau as graphics driver.
>>
>> I was lazy and just installed the Ubuntu precompiled mainline kernel.
>> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
>> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
>> weren't compiled with CMA, so i only observed this on 3.16, but
>> previous kernels would likely be affected too.
>>
>> After a few minutes of regular desktop use like switching workspaces,
>> scrolling text in a terminal window, Firefox with multiple tabs open,
>> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
>> composition), i get chunky desktop updates, then multi-second
>freezes,
>> after a few minutes the desktop hangs for over a minute on almost any
>> GUI action like switching windows etc. --> Unuseable.
>>
>> ftrace'ing shows the culprit being this callchain (typical good/bad
>> example ftrace snippets at the end of this mail):
>>
>> ...ttm dma coherent memory allocations, e.g., from
>> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
>> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
>> dma_alloc_from_contiguous()
>>
>> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
>when
>> the machine is booted with kernel boot cmdline parameter "cma=0", so
>> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>>
>> With CMA, this function becomes progressively more slow with every
>> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
>> hundreds or thousands of microseconds (before it gives up and
>> alloc_pages_node() fallback is used), so this causes the
>> multi-second/minute hangs of the desktop.
>>
>> So it seems ttm memory allocations quickly fragment and/or exhaust
>the
>> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
>> find a fitting hole big enough to satisfy allocations with a retry
>> loop (see
>>
>http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
>> that takes forever.

I am curious why it does not end up using the pool. As in use the TTM DMA pool 
to pick pages instead of allocating (and freeing) new ones?

>>
>> This is not good, also not for other devices which actually need a
>> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
>> still need physically contiguous dma memory, maybe with exception of
>> some embedded gpus?

Oh. If I understood you correctly - the CMA ends up giving huge chunks of 
contiguous area. But if the sizes are 4kb I wonder why it would do that?

The modern GPUs on x86 can deal with scatter gather and as you surmise don't 
need contiguous physical contiguous areas.
>>
>> My naive approach would be to add a new gfp_t flag a la
>> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
>> refrain from doing so if they have some fallback for getting memory.
>> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
>> around here:
>>
>http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>>
>> However i'm not familiar enough with memory management, so likely
>> greater minds here have much better ideas on how to deal with this?
>>

That is a bit of hack to deal with CMA being slow.

Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
>> thanks,
>> -mario
>>
>> Typical snippet from an example trace of a badly stalling desktop
>with
>> CMA (alloc_pages_node() fallback may have been missing in this traces
>> ftrace_filter settings):
>>
>> 1)   |  ttm_dma_pool_get_pages
>> [ttm]() {
>>  1)   | ttm_dma_page_pool_fill_locked [ttm]() {
>>  1)   | ttm_dma_pool_alloc_new_pages [ttm]() {
>>  1)   | __ttm_dma_alloc_page [ttm]() {
>>  1)   | dma_generic_alloc_coherent() {
>>  1) ! 1873.071 us | dma_alloc_from_contiguous();
>>  1) ! 1874.292 us |  }
>>  1) ! 1875.400 us |}
>>  1)   | __ttm_dma_alloc_page [ttm]() {
>>  1)   | dma_generic_alloc_coherent() {
>>  1) ! 1868.372 us | dma_alloc_from_contiguous();
>>  1) ! 1869.586 us |  }
>>  1) ! 1870.053 us |}
>>  1)   | 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-09 Thread Konrad Rzeszutek Wilk
On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom thellst...@vmware.com 
wrote:
Hi.

Hey Thomas!

IIRC I don't think the TTM DMA pool allocates coherent pages more than
one page at a time, and _if that's true_ it's pretty unnecessary for
the
dma subsystem to route those allocations to CMA. Maybe Konrad could
shed
some light over this?

It should allocate in batches and keep them in the TTM DMA pool for some time 
to be reused.

The pages that it gets are in 4kb granularity though.

/Thomas


On 08/08/2014 07:42 PM, Mario Kleiner wrote:
 Hi all,

 there is a rather severe performance problem i accidentally found
when
 trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
 Ubuntu 14.04 LTS with nouveau as graphics driver.

 I was lazy and just installed the Ubuntu precompiled mainline kernel.
 That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
 (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
 weren't compiled with CMA, so i only observed this on 3.16, but
 previous kernels would likely be affected too.

 After a few minutes of regular desktop use like switching workspaces,
 scrolling text in a terminal window, Firefox with multiple tabs open,
 Thunderbird etc. (tested with KDE/Kwin, with/without desktop
 composition), i get chunky desktop updates, then multi-second
freezes,
 after a few minutes the desktop hangs for over a minute on almost any
 GUI action like switching windows etc. --> Unuseable.

 ftrace'ing shows the culprit being this callchain (typical good/bad
 example ftrace snippets at the end of this mail):

 ...ttm dma coherent memory allocations, e.g., from
 __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
 specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
 dma_alloc_from_contiguous()

 dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
when
 the machine is booted with kernel boot cmdline parameter "cma=0", so
 it triggers the fast alloc_pages_node() fallback at least on x86_64.

 With CMA, this function becomes progressively more slow with every
 minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
 hundreds or thousands of microseconds (before it gives up and
 alloc_pages_node() fallback is used), so this causes the
 multi-second/minute hangs of the desktop.

 So it seems ttm memory allocations quickly fragment and/or exhaust
the
 CMA memory area, and dma_alloc_from_contiguous() tries very hard to
 find a fitting hole big enough to satisfy allocations with a retry
 loop (see

http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
 that takes forever.

I am curious why it does not end up using the pool. As in use the TTM DMA pool 
to pick pages instead of allocating (and freeing) new ones?


 This is not good, also not for other devices which actually need a
 non-fragmented CMA for DMA, so what to do? I doubt most current gpus
 still need physically contiguous dma memory, maybe with exception of
 some embedded gpus?

Oh. If I understood you correctly - the CMA ends up giving huge chunks of 
contiguous area. But if the sizes are 4kb I wonder why it would do that?

The modern GPUs on x86 can deal with scatter gather and as you surmise don't 
need contiguous physical contiguous areas.

 My naive approach would be to add a new gfp_t flag a la
 ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
 refrain from doing so if they have some fallback for getting memory.
 And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
 around here:

http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884

 However i'm not familiar enough with memory management, so likely
 greater minds here have much better ideas on how to deal with this?
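
For illustration, the call-site shape of that proposal would be roughly the following (sketch only: ___GFP_AVOIDCMA does not exist in the kernel, and the surrounding variables are the usual ones of a coherent-allocation path):

        /* Hypothetical gate for the proposal above; callers that set the
         * (made-up) ___GFP_AVOIDCMA flag are promising they have a fallback. */
        if (!(flag & ___GFP_AVOIDCMA))
                page = dma_alloc_from_contiguous(dev, count, get_order(size));
        if (!page)
                page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));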


That is a bit of hack to deal with CMA being slow.

Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
 thanks,
 -mario

 Typical snippet from an example trace of a badly stalling desktop
with
 CMA (alloc_pages_node() fallback may have been missing in this traces
 ftrace_filter settings):

 1)   |  ttm_dma_pool_get_pages
 [ttm]() {
  1)   | ttm_dma_page_pool_fill_locked [ttm]() {
  1)   | ttm_dma_pool_alloc_new_pages [ttm]() {
  1)   | __ttm_dma_alloc_page [ttm]() {
  1)   | dma_generic_alloc_coherent() {
  1) ! 1873.071 us | dma_alloc_from_contiguous();
  1) ! 1874.292 us |  }
  1) ! 1875.400 us |}
  1)   | __ttm_dma_alloc_page [ttm]() {
  1)   | dma_generic_alloc_coherent() {
  1) ! 1868.372 us | dma_alloc_from_contiguous();
  1) ! 1869.586 us |  }
  1) ! 1870.053 us |}
  1)   | __ttm_dma_alloc_page [ttm]() {
  1)   | dma_generic_alloc_coherent() {
  1) ! 1871.085 us | dma_alloc_from_contiguous();
  1) ! 1872.240 us |

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-09 Thread Thomas Hellstrom


On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
 On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom thellst...@vmware.com 
 wrote:
 Hi.

 Hey Thomas!

 IIRC I don't think the TTM DMA pool allocates coherent pages more than
 one page at a time, and _if that's true_ it's pretty unnecessary for
 the
 dma subsystem to route those allocations to CMA. Maybe Konrad could
 shed
 some light over this?
 It should allocate in batches and keep them in the TTM DMA pool for some time 
 to be reused.

 The pages that it gets are in 4kb granularity though.

Then I feel inclined to say this is a DMA subsystem bug. Single page
allocations shouldn't get routed to CMA.

/Thomas


 /Thomas


 On 08/08/2014 07:42 PM, Mario Kleiner wrote:
 Hi all,

 there is a rather severe performance problem i accidentally found
 when
 trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
 Ubuntu 14.04 LTS with nouveau as graphics driver.

 I was lazy and just installed the Ubuntu precompiled mainline kernel.
 That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
 (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
 weren't compiled with CMA, so i only observed this on 3.16, but
 previous kernels would likely be affected too.

 After a few minutes of regular desktop use like switching workspaces,
 scrolling text in a terminal window, Firefox with multiple tabs open,
 Thunderbird etc. (tested with KDE/Kwin, with/without desktop
 composition), i get chunky desktop updates, then multi-second
 freezes,
 after a few minutes the desktop hangs for over a minute on almost any
 GUI action like switching windows etc. --> Unusable.

 ftrace'ing shows the culprit being this callchain (typical good/bad
 example ftrace snippets at the end of this mail):

 ...ttm dma coherent memory allocations, e.g., from
 __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
 specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
 dma_alloc_from_contiguous()

 dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
 when
 the machine is booted with kernel boot cmdline parameter "cma=0", so
 it triggers the fast alloc_pages_node() fallback at least on x86_64.

 With CMA, this function becomes progressively slower with every
 minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
 hundreds or thousands of microseconds (before it gives up and
 alloc_pages_node() fallback is used), so this causes the
 multi-second/minute hangs of the desktop.

 So it seems ttm memory allocations quickly fragment and/or exhaust
 the
 CMA memory area, and dma_alloc_from_contiguous() tries very hard to
 find a fitting hole big enough to satisfy allocations with a retry
 loop (see

 http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
 that takes forever.
 I am curious why it does not end up using the pool. As in use the TTM DMA 
 pool to pick pages instead of allocating (and freeing) new ones?

 This is not good, also not for other devices which actually need a
 non-fragmented CMA for DMA, so what to do? I doubt most current gpus
 still need physically contiguous dma memory, maybe with exception of
 some embedded gpus?
 Oh. If I understood you correctly - the CMA ends up giving huge chunks of 
 contiguous area. But if the sizes are 4kb I wonder why it would do that?

 The modern GPUs on x86 can deal with scatter gather and as you surmise don't 
 need physically contiguous areas.
 My naive approach would be to add a new gfp_t flag a la
 ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
 refrain from doing so if they have some fallback for getting memory.
 And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
 around here:

 http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
 However i'm not familiar enough with memory management, so likely
 greater minds here have much better ideas on how to deal with this?

 That is a bit of a hack to deal with CMA being slow.

 Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
 thanks,
 -mario

 Typical snippet from an example trace of a badly stalling desktop
 with
 CMA (alloc_pages_node() fallback may have been missing in this trace's
 ftrace_filter settings):

 1)   |  ttm_dma_pool_get_pages
 [ttm]() {
  1)   | ttm_dma_page_pool_fill_locked [ttm]() {
  1)   | ttm_dma_pool_alloc_new_pages [ttm]() {
  1)   

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-09 Thread Mario Kleiner

Resent this time without HTML formatting which lkml doesn't like. Sorry.

On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:

On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:

On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom thellst...@vmware.com
wrote:

Hi.


Hey Thomas!


IIRC I don't think the TTM DMA pool allocates coherent pages more than
one page at a time, and _if that's true_ it's pretty unnecessary for
the
dma subsystem to route those allocations to CMA. Maybe Konrad could
shed
some light over this?

It should allocate in batches and keep them in the TTM DMA pool for some time 
to be reused.

The pages that it gets are in 4kb granularity though.

Then I feel inclined to say this is a DMA subsystem bug. Single page
allocations shouldn't get routed to CMA.

/Thomas


Yes, seems you're both right. I read through the code a bit more and 
indeed the TTM DMA pool allocates only one page during each 
dma_alloc_coherent() call, so it doesn't need CMA memory. The current 
allocators don't check for single page CMA allocations and therefore try 
to get it from the CMA area anyway, instead of skipping to the much 
cheaper fallback.


So the callers of dma_alloc_from_contiguous() could use that little 
optimization of skipping it if only one page is requested. For 
dma_generic_alloc_coherent (http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent) 
and intel_alloc_coherent (http://lxr.free-electrons.com/ident?i=intel_alloc_coherent) 
this seems easy to do. Looking at the arm arch variants, e.g.,

http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194

and

http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44

i'm not sure if it is that easily done, as there aren't any fallbacks 
for such a case and the code looks to me as if that's at least somewhat 
intentional.
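
A minimal sketch of that single-page bypass for the x86 path (simplified; 
names and the surrounding code are assumptions rather than a tested patch 
against dma_generic_alloc_coherent()):

    unsigned int count = PAGE_ALIGN(size) >> PAGE_SHIFT;
    struct page *page = NULL;

    /* only bother CMA for genuinely multi-page contiguous requests */
    if (count > 1)
        page = dma_alloc_from_contiguous(dev, count, get_order(size));
    if (!page)
        page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));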


As far as TTM goes, one quick one-line fix to prevent it from using the 
CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the above 
methods) would be to clear the __GFP_WAIT 
(http://lxr.free-electrons.com/ident?i=__GFP_WAIT) flag from the passed 
gfp_t flags. That would trigger the well-working fallback. So, is 
__GFP_WAIT needed for those single page allocations that go through 
__ttm_dma_alloc_page (http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page)?
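
Roughly this, wherever the gfp mask for __ttm_dma_alloc_page() ends up 
being built (the exact variable name here is an assumption):

    /* strip __GFP_WAIT so the dma_alloc_coherent() backends described
     * above skip the CMA path and use their regular page allocator
     * fallback instead */
    gfp_flags &= ~__GFP_WAIT;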

It would be nice to have such a simple, non-intrusive one-line patch 
that we still could get into 3.17 and then backported to older stable 
kernels to avoid the same desktop hangs there if CMA is enabled. It 
would be also nice for actual users of CMA to not use up lots of CMA 
space for gpu's which don't need it. I think DMA_CMA was introduced 
around 3.12.



The other problem is that probably TTM does not reuse pages from the DMA 
pool. If i trace the __ttm_dma_alloc_page 
(http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page) and 
__ttm_dma_free_page calls for those single page allocs/frees, then over 
a 20 second interval of tracing and switching tabs in firefox, scrolling 
things around etc. i find about as many allocs as i find frees, e.g., 
1607 allocs vs. 1648 frees.


This bit of code from ttm_dma_unpopulate() (line 954 in 3.16) looks 
suspicious:


http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954

Allocs from a tt_cached cached pool (if (is_cached)...) always get 
freed and are not given back to the cached pool. But in the uncached 
case, there's logic to make sure the pool doesn't grow forever (line 
955, checking against _manager->options.max_size), but before that check 
in line 954 there's an unconditional assignment of npages = count; which 
seems to force freeing all pages as well, instead of recycling? Is this 
some debug code left over, or intentional and just me not understanding 
what happens there?
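
In other words, the uncached branch seems to boil down to this 
(paraphrased from the 3.16 code behind the link above, not a verbatim 
copy, field names may be slightly off):

    /* uncached case in ttm_dma_unpopulate(), paraphrased */
    npages = count;                                       /* line 954 */
    if (pool->npages_free > _manager->options.max_size)   /* line 955 */
        npages = pool->npages_free - _manager->options.max_size;
    /* npages pages are then handed to the pool's free routine, so while
     * the pool is still below max_size this appears to free all 'count'
     * pages instead of keeping them in the pool for reuse */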


thanks,
-mario



/Thomas


On 08/08/2014 07:42 PM, Mario Kleiner wrote:

Hi all,

there is a rather severe performance problem i accidentally found

when

trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
Ubuntu 14.04 LTS with nouveau as graphics driver.

I was lazy and just installed the Ubuntu precompiled mainline kernel.
That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
(contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
weren't compiled with CMA, so i only observed this on 3.16, but
previous kernels would likely be affected too.

After a few minutes of regular desktop use like switching workspaces,
scrolling text in a terminal window, Firefox with multiple tabs open,
Thunderbird etc. (tested with KDE/Kwin, with/without desktop
composition), i get chunky desktop updates, then multi-second

freezes,

after a few minutes the desktop hangs for over a minute on almost any
GUI action like switching windows etc. --> Unusable.

ftrace'ing shows the culprit 

Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-08 Thread Thomas Hellstrom
Hi.

IIRC I don't think the TTM DMA pool allocates coherent pages more than
one page at a time, and _if that's true_ it's pretty unnecessary for the
dma subsystem to route those allocations to CMA. Maybe Konrad could shed
some light over this?

/Thomas


On 08/08/2014 07:42 PM, Mario Kleiner wrote:
> Hi all,
>
> there is a rather severe performance problem i accidentally found when
> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
> Ubuntu 14.04 LTS with nouveau as graphics driver.
>
> I was lazy and just installed the Ubuntu precompiled mainline kernel.
> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
> weren't compiled with CMA, so i only observed this on 3.16, but
> previous kernels would likely be affected too.
>
> After a few minutes of regular desktop use like switching workspaces,
> scrolling text in a terminal window, Firefox with multiple tabs open,
> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
> composition), i get chunky desktop updates, then multi-second freezes,
> after a few minutes the desktop hangs for over a minute on almost any
> GUI action like switching windows etc. --> Unusable.
>
> ftrace'ing shows the culprit being this callchain (typical good/bad
> example ftrace snippets at the end of this mail):
>
> ...ttm dma coherent memory allocations, e.g., from
> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
> dma_alloc_from_contiguous()
>
> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or when
> the machine is booted with kernel boot cmdline parameter "cma=0", so
> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>
> With CMA, this function becomes progressively slower with every
> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
> hundreds or thousands of microseconds (before it gives up and
> alloc_pages_node() fallback is used), so this causes the
> multi-second/minute hangs of the desktop.
>
> So it seems ttm memory allocations quickly fragment and/or exhaust the
> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
> find a fitting hole big enough to satisfy allocations with a retry
> loop (see
> http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
> that takes forever.
>
> This is not good, also not for other devices which actually need a
> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
> still need physically contiguous dma memory, maybe with exception of
> some embedded gpus?
>
> My naive approach would be to add a new gfp_t flag a la
> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
> refrain from doing so if they have some fallback for getting memory.
> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
> around here:
> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>
> However i'm not familiar enough with memory management, so likely
> greater minds here have much better ideas on how to deal with this?
>
> thanks,
> -mario
>
> Typical snippet from an example trace of a badly stalling desktop with
> CMA (alloc_pages_node() fallback may have been missing in this trace's
> ftrace_filter settings):
>
> 1)   |  ttm_dma_pool_get_pages
> [ttm]() {
>  1)   | ttm_dma_page_pool_fill_locked [ttm]() {
>  1)   | ttm_dma_pool_alloc_new_pages [ttm]() {
>  1)   | __ttm_dma_alloc_page [ttm]() {
>  1)   | dma_generic_alloc_coherent() {
>  1) ! 1873.071 us | dma_alloc_from_contiguous();
>  1) ! 1874.292 us |  }
>  1) ! 1875.400 us |}
>  1)   | __ttm_dma_alloc_page [ttm]() {
>  1)   | dma_generic_alloc_coherent() {
>  1) ! 1868.372 us | dma_alloc_from_contiguous();
>  1) ! 1869.586 us |  }
>  1) ! 1870.053 us |}
>  1)   | __ttm_dma_alloc_page [ttm]() {
>  1)   | dma_generic_alloc_coherent() {
>  1) ! 1871.085 us | dma_alloc_from_contiguous();
>  1) ! 1872.240 us |  }
>  1) ! 1872.669 us |}
>  1)   | __ttm_dma_alloc_page [ttm]() {
>  1)   | dma_generic_alloc_coherent() {
>  1) ! 1888.934 us | dma_alloc_from_contiguous();
>  1) ! 1890.179 us |  }
>  1) ! 1890.608 us |}
>  1)   0.048 us| ttm_set_pages_caching [ttm]();
>  1) ! 7511.000 us |  }
>  1) ! 7511.306 us |}
>  1) ! 7511.623 us |  }
>
> The good case (with cma=0 kernel cmdline, so
> dma_alloc_from_contiguous() no-ops,)
>
> 0)   |

CONFIG_DMA_CMA causes ttm performance problems/hangs.

2014-08-08 Thread Mario Kleiner

Hi all,

there is a rather severe performance problem i accidentally found when 
trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under 
Ubuntu 14.04 LTS with nouveau as graphics driver.


I was lazy and just installed the Ubuntu precompiled mainline kernel. 
That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA 
(contiguous memory allocator) size of 64 MB. Older Ubuntu kernels 
weren't compiled with CMA, so i only observed this on 3.16, but previous 
kernels would likely be affected too.


After a few minutes of regular desktop use like switching workspaces, 
scrolling text in a terminal window, Firefox with multiple tabs open, 
Thunderbird etc. (tested with KDE/Kwin, with/without desktop 
composition), i get chunky desktop updates, then multi-second freezes, 
after a few minutes the desktop hangs for over a minute on almost any 
GUI action like switching windows etc. --> Unusable.


ftrace'ing shows the culprit being this callchain (typical good/bad 
example ftrace snippets at the end of this mail):


...ttm dma coherent memory allocations, e.g., from 
__ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform 
specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] --> 
dma_alloc_from_contiguous()


dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or when 
the machine is booted with kernel boot cmdline parameter "cma=0", so it 
triggers the fast alloc_pages_node() fallback at least on x86_64.


With CMA, this function becomes progressively slower with every 
minute of desktop use, e.g., runtimes going up from < 0.3 usecs to 
hundreds or thousands of microseconds (before it gives up and 
alloc_pages_node() fallback is used), so this causes the 
multi-second/minute hangs of the desktop.


So it seems ttm memory allocations quickly fragment and/or exhaust the 
CMA memory area, and dma_alloc_from_contiguous() tries very hard to find 
a fitting hole big enough to satisfy allocations with a retry loop (see 
http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339) 
that takes forever.
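
The core of that retry loop has roughly the following shape (paraphrased 
from the 3.16 dma-contiguous.c behind the link, locking and bookkeeping 
trimmed); every -EBUSY from alloc_contig_range() makes it rescan the 
bitmap a bit further along, which is where the time goes once the area 
is fragmented or its pages are busy:

    for (;;) {
        pageno = bitmap_find_next_zero_area(cma->bitmap, cma->count,
                                            start, count, mask);
        if (pageno >= cma->count)
            break;                          /* no suitable hole left */

        pfn = cma->base_pfn + pageno;
        ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
        if (ret == 0)
            break;                          /* success */
        if (ret != -EBUSY)
            break;                          /* hard failure */

        start = pageno + mask + 1;          /* skip busy range, retry */
    }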


This is not good, also not for other devices which actually need a 
non-fragmented CMA for DMA, so what to do? I doubt most current gpus 
still need physically contiguous dma memory, maybe with exception of 
some embedded gpus?


My naive approach would be to add a new gfp_t flag a la ___GFP_AVOIDCMA, 
and make callers of dma_alloc_from_contiguous() refrain from doing so if 
they have some fallback for getting memory. And then add that flag to 
ttm's ttm_dma_populate() gfp_flags, e.g., around here: 
http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884


However i'm not familiar enough with memory management, so likely 
greater minds here have much better ideas on how to deal with this?


thanks,
-mario

Typical snippet from an example trace of a badly stalling desktop with 
CMA (alloc_pages_node() fallback may have been missing in this trace's 
ftrace_filter settings):


1)   |  ttm_dma_pool_get_pages [ttm]() {
 1)   | ttm_dma_page_pool_fill_locked [ttm]() {
 1)   | ttm_dma_pool_alloc_new_pages [ttm]() {
 1)   | __ttm_dma_alloc_page [ttm]() {
 1)   | dma_generic_alloc_coherent() {
 1) ! 1873.071 us | dma_alloc_from_contiguous();
 1) ! 1874.292 us |  }
 1) ! 1875.400 us |}
 1)   | __ttm_dma_alloc_page [ttm]() {
 1)   | dma_generic_alloc_coherent() {
 1) ! 1868.372 us | dma_alloc_from_contiguous();
 1) ! 1869.586 us |  }
 1) ! 1870.053 us |}
 1)   | __ttm_dma_alloc_page [ttm]() {
 1)   | dma_generic_alloc_coherent() {
 1) ! 1871.085 us | dma_alloc_from_contiguous();
 1) ! 1872.240 us |  }
 1) ! 1872.669 us |}
 1)   | __ttm_dma_alloc_page [ttm]() {
 1)   | dma_generic_alloc_coherent() {
 1) ! 1888.934 us | dma_alloc_from_contiguous();
 1) ! 1890.179 us |  }
 1) ! 1890.608 us |}
 1)   0.048 us| ttm_set_pages_caching [ttm]();
 1) ! 7511.000 us |  }
 1) ! 7511.306 us |}
 1) ! 7511.623 us |  }

The good case (with cma=0 kernel cmdline, so dma_alloc_from_contiguous() 
no-ops,)


0)   |  ttm_dma_pool_get_pages [ttm]() {
 0)   | ttm_dma_page_pool_fill_locked [ttm]() {
 0)   | ttm_dma_pool_alloc_new_pages [ttm]() {
 0)   | __ttm_dma_alloc_page [ttm]() {
 0)   | dma_generic_alloc_coherent() {
 0)   0.171 us| dma_alloc_from_contiguous();
 0)   0.849 us| __alloc_pages_nodemask();
 0)   3.029 us|  }
 0)   3.882 us|   
