Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
On 06/03/17 14:36, Benjamin Herrenschmidt wrote:
> On Mon, 2017-03-06 at 12:28 +1100, Alexey Kardashevskiy wrote:
>> 8192*8192*8192*65536>>40 = 32768TB of addressable memory (but there is no
>> good reason not to use huge pages);
>
> No, 39 bits is half a TB. That's not enough.

Ah. My bad. 55 bits it is: 2 "indirect" levels + 1 "direct" level, each of
8192 entries, so 13*3+16=55.

>> 8192*8192*8192*4096>>40 = 2048TB of addressable memory (even with 2
>> indirect levels, but we can have all 5 levels with 4K IOMMU pages).
>>
>> Looks enough to me...
>>
>> And in this particular patch I am not limiting anything, I just replace
>> the already existing EEH condition with -EINVAL. If it is this important
>> to have all 5 levels, then we can switch from alloc_pages_node() to
>> kmem_cache_alloc_node(), in a separate patch.

--
Alexey
Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
On Mon, 2017-03-06 at 12:28 +1100, Alexey Kardashevskiy wrote:
> 8192*8192*8192*65536>>40 = 32768TB of addressable memory (but there is no
> good reason not to use huge pages);

No, 39 bits is half a TB. That's not enough.

> 8192*8192*8192*4096>>40 = 2048TB of addressable memory (even with 2
> indirect levels, but we can have all 5 levels with 4K IOMMU pages).
>
> Looks enough to me...
>
> And in this particular patch I am not limiting anything, I just replace
> the already existing EEH condition with -EINVAL. If it is this important
> to have all 5 levels, then we can switch from alloc_pages_node() to
> kmem_cache_alloc_node(), in a separate patch.
Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
On 06/03/17 10:03, Benjamin Herrenschmidt wrote:
> On Mon, 2017-02-27 at 22:00 +1100, Michael Ellerman wrote:
>>> The alternative would be allocating TCE tables as big as PAGE_SIZE but
>>> only using parts of them; this would slightly complicate the code
>>> responsible for tracking the overall amount of memory used for the TCE
>>> table.
>>>
>>> Or kmem_cache_create() could be used to allocate TCE table levels as big
>>> as we really need, but that API does not seem to support NUMA nodes.
>>
>> kmem_cache_alloc_node() ?
>
> Is that 55 bits of address space (ie, 3 indirect levels + 64k pages)?
> Or only 39 (2 indirect levels + 64k pages)?

39, yes.

> In the former case, I'm happy to limit the levels to 3 for 64K pages,
> 55 bits of TCE space is more than enough. 39 isn't however.

8192*8192*8192*65536>>40 = 32768TB of addressable memory (but there is no
good reason not to use huge pages);

8192*8192*8192*4096>>40 = 2048TB of addressable memory (even with 2
indirect levels, but we can have all 5 levels with 4K IOMMU pages).

Looks enough to me...

And in this particular patch I am not limiting anything, I just replace the
already existing EEH condition with -EINVAL. If it is this important to
have all 5 levels, then we can switch from alloc_pages_node() to
kmem_cache_alloc_node(), in a separate patch.

--
Alexey
Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
On Mon, 2017-02-27 at 22:00 +1100, Michael Ellerman wrote:
>> The alternative would be allocating TCE tables as big as PAGE_SIZE but
>> only using parts of them; this would slightly complicate the code
>> responsible for tracking the overall amount of memory used for the TCE
>> table.
>>
>> Or kmem_cache_create() could be used to allocate TCE table levels as big
>> as we really need, but that API does not seem to support NUMA nodes.
>
> kmem_cache_alloc_node() ?

Is that 55 bits of address space (ie, 3 indirect levels + 64k pages)?
Or only 39 (2 indirect levels + 64k pages)?

In the former case, I'm happy to limit the levels to 3 for 64K pages;
55 bits of TCE space is more than enough. 39 isn't, however.

Cheers,
Ben.
Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
On 27/02/17 22:00, Michael Ellerman wrote:
> Alexey Kardashevskiy writes:
>
>> The IODA2 specification says that a 64-bit DMA address cannot use the
>> top 4 bits (3 are reserved and one is a "TVE select"); the bottom
>> page_shift bits cannot be used for multilevel table addressing either.
>>
>> The existing IODA2 table allocation code aligns the minimum TCE table
>> size to PAGE_SIZE, so in the case of 64K system pages and 4K IOMMU pages
>> we have 64-4-12=48 bits. Since a 64K page stores 8192 TCEs, i.e. needs
>> 13 bits, the maximum number of levels is 48/13 = 3, so we physically
>> cannot address more and EEH happens on DMA accesses.
>>
>> This adds a check that too many levels were requested.
>>
>> It is still possible to have 5 levels in the case of 4K system page size.
>>
>> Signed-off-by: Alexey Kardashevskiy
>> ---
>>
>> The alternative would be allocating TCE tables as big as PAGE_SIZE but
>> only using parts of them; this would slightly complicate the code
>> responsible for tracking the overall amount of memory used for the TCE
>> table.
>>
>> Or kmem_cache_create() could be used to allocate TCE table levels as big
>> as we really need, but that API does not seem to support NUMA nodes.
>
> kmem_cache_alloc_node() ?

Yeah, discovered this later. Still, if a single level is used, then the
table is 4MB and kmem_cache_alloc_node() does not seem the right tool here
(although I cannot find any enforced upper limit). So to keep things
simpler, I decided to stick to alloc_pages_node() and avoid mixing memory
allocation APIs.

--
Alexey
Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
Alexey Kardashevskiy writes:
> The IODA2 specification says that a 64-bit DMA address cannot use the
> top 4 bits (3 are reserved and one is a "TVE select"); the bottom
> page_shift bits cannot be used for multilevel table addressing either.
>
> The existing IODA2 table allocation code aligns the minimum TCE table
> size to PAGE_SIZE, so in the case of 64K system pages and 4K IOMMU pages
> we have 64-4-12=48 bits. Since a 64K page stores 8192 TCEs, i.e. needs
> 13 bits, the maximum number of levels is 48/13 = 3, so we physically
> cannot address more and EEH happens on DMA accesses.
>
> This adds a check that too many levels were requested.
>
> It is still possible to have 5 levels in the case of 4K system page size.
>
> Signed-off-by: Alexey Kardashevskiy
> ---
>
> The alternative would be allocating TCE tables as big as PAGE_SIZE but
> only using parts of them; this would slightly complicate the code
> responsible for tracking the overall amount of memory used for the TCE
> table.
>
> Or kmem_cache_create() could be used to allocate TCE table levels as big
> as we really need, but that API does not seem to support NUMA nodes.

kmem_cache_alloc_node() ?

cheers
Re: [PATCH kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
On Wed, Feb 22, 2017 at 03:43:59PM +1100, Alexey Kardashevskiy wrote:
> The IODA2 specification says that a 64-bit DMA address cannot use the
> top 4 bits (3 are reserved and one is a "TVE select"); the bottom
> page_shift bits cannot be used for multilevel table addressing either.
>
> The existing IODA2 table allocation code aligns the minimum TCE table
> size to PAGE_SIZE, so in the case of 64K system pages and 4K IOMMU pages
> we have 64-4-12=48 bits. Since a 64K page stores 8192 TCEs, i.e. needs
> 13 bits, the maximum number of levels is 48/13 = 3, so we physically
> cannot address more and EEH happens on DMA accesses.
>
> This adds a check that too many levels were requested.
>
> It is still possible to have 5 levels in the case of 4K system page size.
>
> Signed-off-by: Alexey Kardashevskiy
> ---

Acked-by: Gavin Shan