Re: [ewg] Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
On Tue, 16 Jun 2009 09:10:39 -0700 Roland Dreier wrote: > > > Yeah, the notifier code remains untouched as we still do not allow dynamic > > memory operations _while_ our module is loaded. The patch allows the > driver to > > cope with DMEM operations that happened before the module was loaded, which > > might result in a non-contiguous memory layout. When the driver registers > > its global memory region in the system, the memory layout must be > considered. > > > > We chose the term "toleration" instead of "support" to illustrate this. > > I see. So things just silently broke in some cases when the driver was > loaded after operations you didn't tolerate? > > Anyway, thanks for the explanation. Well, things did not break silently. The registration of the MR failed with an error code which was reported to userspace. Will you push the patch for .31 or .32? Thanks, Alex ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [ewg] Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
> Yeah, the notifier code remains untouched as we still do not allow dynamic > memory operations _while_ our module is loaded. The patch allows the driver > to > cope with DMEM operations that happened before the module was loaded, which > might result in a non-contiguous memory layout. When the driver registers > its global memory region in the system, the memory layout must be considered. > > We chose the term "toleration" instead of "support" to illustrate this. I see. So things just silently broke in some cases when the driver was loaded after operations you didn't tolerate? Anyway, thanks for the explanation. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
Hi Roland, thank you for taking a look at the code! On Fri, 12 Jun 2009 21:50:58 -0700 Roland Dreier wrote: > OK, one major issue with this patch and a few minor nits. > > First, the major issue is that I don't see anything in the patch that > changes the code in ehca_mem_notifier() in ehca_main.c: > > case MEM_GOING_ONLINE: > case MEM_GOING_OFFLINE: > /* only ok if no hca is attached to the lpar */ > spin_lock_irqsave(&shca_list_lock, flags); > if (list_empty(&shca_list)) { > spin_unlock_irqrestore(&shca_list_lock, flags); > return NOTIFY_OK; > } else { > spin_unlock_irqrestore(&shca_list_lock, flags); > if (printk_timed_ratelimit(&ehca_dmem_warn_time, > 30 * 1000)) > ehca_gen_err("DMEM operations are not allowed" >"in conjunction with eHCA"); > return NOTIFY_BAD; > } > > But your patch description says: > > > This patch implements toleration of dynamic memory operations > > But it seems you're still going to hit the same NOTIFY_BAD case above > after your patch. So something doesn't compute for me. Could you > explain more? Yeah, the notifier code remains untouched as we still do not allow dynamic memory operations _while_ our module is loaded. The patch allows the driver to cope with DMEM operations that happened before the module was loaded, which might result in a non-contiguous memory layout. When the driver registers its global memory region in the system, the memory layout must be considered. We chose the term "toleration" instead of "support" to illustrate this. I'll put some more details into the changelog, incorporate the other comments and send out a second version of the patch. Thanks, Alex > > Second, a nit: > > > +#define EHCA_REG_MR 0 > > +#define EHCA_REG_BUSMAP_MR (~0) > > and you pass these as the reg_busmap parm in: > > > int ehca_reg_mr(struct ehca_shca *shca, > >struct ehca_mr *e_mr, > >u64 *iova_start, > > @@ -991,7 +1031,8 @@ > >struct ehca_pd *e_pd, > >struct ehca_mr_pginfo *pginfo, > >u32 *lkey, /*OUT*/ > > - u32 *rkey) /*OUT*/ > > + u32 *rkey, /*OUT*/ > > + int reg_busmap) > > and test it as: > > > + if (reg_busmap) > > + ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); > > + else > > + ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); > > So the ~0 for true looks a bit odd. One option would be to make > reg_busmap a bool, since that's how you're using it, but then you lose > the nice self-documenting macro where you call things. > > So I think it would be cleaner to do something like > > enum ehca_reg_type { > EHCA_REG_MR, > EHCA_REG_BUSMAP_MR > }; > > and make the "int reg_busmap" parameter into "enum ehca_reg_type reg_type" > and have the code become > > + if (reg_type == EHCA_REG_BUSMAP_MR) > + ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); > + else if (reg_type == EHCA_REG_MR) > + ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); > + else > + ret = -EINVAL > > or something like that. > > > +struct ib_dma_mapping_ops ehca_dma_mapping_ops = { > > + .mapping_error = ehca_dma_mapping_error, > > + .map_single = ehca_dma_map_single, > > + .unmap_single = ehca_dma_unmap_single, > > + .map_page = ehca_dma_map_page, > > + .unmap_page = ehca_dma_unmap_page, > > + .map_sg = ehca_dma_map_sg, > > + .unmap_sg = ehca_dma_unmap_sg, > > + .dma_address = ehca_dma_address, > > + .dma_len = ehca_dma_len, > > + .sync_single_for_cpu = ehca_dma_sync_single_for_cpu, > > + .sync_single_for_device = ehca_dma_sync_single_for_device, > > + .alloc_coherent = ehca_dma_alloc_coherent, > > + .free_coherent = ehca_dma_free_coherent, > > +}; > > I always think structures like this are easier to read if you align the > '=' signs. But no big deal. > > > + ret = ehca_create_busmap(); > > + if (ret) { > > + ehca_gen_err("Cannot create busmap."); > > + goto module_init2; > > + } > > + > >ret = ibmebus_register_driver(&ehca_driver); > >if (ret) { > >ehca_gen_err("Cannot register eHCA device driver"); > >ret = -EINVAL; > > - goto module_init2; > > + goto module_init3; > >} > > > >ret = register_memory_notifier(&ehca_mem_nb); > >if (ret) { > >ehca_gen_err("Failed registering memory add/remove notifier"); > > - goto module_init3; > > + goto module_init4; > > Having to renumber unrelated things is when something changes is why I > don't like this style of error path labels. But I think it's well and > truly too late to fix that in e
Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
OK, one major issue with this patch and a few minor nits. First, the major issue is that I don't see anything in the patch that changes the code in ehca_mem_notifier() in ehca_main.c: case MEM_GOING_ONLINE: case MEM_GOING_OFFLINE: /* only ok if no hca is attached to the lpar */ spin_lock_irqsave(&shca_list_lock, flags); if (list_empty(&shca_list)) { spin_unlock_irqrestore(&shca_list_lock, flags); return NOTIFY_OK; } else { spin_unlock_irqrestore(&shca_list_lock, flags); if (printk_timed_ratelimit(&ehca_dmem_warn_time, 30 * 1000)) ehca_gen_err("DMEM operations are not allowed" "in conjunction with eHCA"); return NOTIFY_BAD; } But your patch description says: > This patch implements toleration of dynamic memory operations But it seems you're still going to hit the same NOTIFY_BAD case above after your patch. So something doesn't compute for me. Could you explain more? Second, a nit: > +#define EHCA_REG_MR 0 > +#define EHCA_REG_BUSMAP_MR (~0) and you pass these as the reg_busmap parm in: > int ehca_reg_mr(struct ehca_shca *shca, > struct ehca_mr *e_mr, > u64 *iova_start, > @@ -991,7 +1031,8 @@ > struct ehca_pd *e_pd, > struct ehca_mr_pginfo *pginfo, > u32 *lkey, /*OUT*/ > -u32 *rkey) /*OUT*/ > +u32 *rkey, /*OUT*/ > +int reg_busmap) and test it as: > +if (reg_busmap) > +ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); > +else > +ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); So the ~0 for true looks a bit odd. One option would be to make reg_busmap a bool, since that's how you're using it, but then you lose the nice self-documenting macro where you call things. So I think it would be cleaner to do something like enum ehca_reg_type { EHCA_REG_MR, EHCA_REG_BUSMAP_MR }; and make the "int reg_busmap" parameter into "enum ehca_reg_type reg_type" and have the code become + if (reg_type == EHCA_REG_BUSMAP_MR) + ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); + else if (reg_type == EHCA_REG_MR) + ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); + else + ret = -EINVAL or something like that. > +struct ib_dma_mapping_ops ehca_dma_mapping_ops = { > +.mapping_error = ehca_dma_mapping_error, > +.map_single = ehca_dma_map_single, > +.unmap_single = ehca_dma_unmap_single, > +.map_page = ehca_dma_map_page, > +.unmap_page = ehca_dma_unmap_page, > +.map_sg = ehca_dma_map_sg, > +.unmap_sg = ehca_dma_unmap_sg, > +.dma_address = ehca_dma_address, > +.dma_len = ehca_dma_len, > +.sync_single_for_cpu = ehca_dma_sync_single_for_cpu, > +.sync_single_for_device = ehca_dma_sync_single_for_device, > +.alloc_coherent = ehca_dma_alloc_coherent, > +.free_coherent = ehca_dma_free_coherent, > +}; I always think structures like this are easier to read if you align the '=' signs. But no big deal. > +ret = ehca_create_busmap(); > +if (ret) { > +ehca_gen_err("Cannot create busmap."); > +goto module_init2; > +} > + > ret = ibmebus_register_driver(&ehca_driver); > if (ret) { > ehca_gen_err("Cannot register eHCA device driver"); > ret = -EINVAL; > -goto module_init2; > +goto module_init3; > } > > ret = register_memory_notifier(&ehca_mem_nb); > if (ret) { > ehca_gen_err("Failed registering memory add/remove notifier"); > -goto module_init3; > +goto module_init4; Having to renumber unrelated things is when something changes is why I don't like this style of error path labels. But I think it's well and truly too late to fix that in ehca. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
Hi Michael, On Wednesday 10 June 2009 02:02:36 Michael Ellerman wrote: > For those of us who haven't read the HEA spec lately, can you give us > some more detail on that? :) first of all, please note that this patch is actually for the ehca infiniband driver. The ehca driver uses an internal memory region, which is supposed to contain all physical memory. A memory region maps a virtually contiguous adapter address space to the physical or better absolute address space. The limitation is that the memory region cannot map non-contiguous virtual adapter address space. However, on ppc64 machines there is a feature to dynamically add or remove memory to logical partitions during runtime. These operations scatter the absolute memory so that the kernel memory has a non-contiguous layout. This layout cannot be represented in a memory region. The purpose of this code is to detect the memory layout so that the memory region can be made up of the existing memory chunks. It also translates the kernel addresses to the memory region address, which is needed for interaction with the HCA. > How does it interact with kexec/kdump? We never tested the ehca driver with kexec/kdump. This patch also doesn't improve anything in this context. > phys_to_abs() ? As below, or does it come from somewhere else? You're right, actually that isn't needed on System p. On the other hand I needed to choose an address type, which is the base of all mapping. The "busmap" holds all addresses as absolute addresses. The absolute addresses can then be converted in any other type (virt, phys). Regards Hannes ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
On Tue, 2009-06-09 at 15:59 +0200, Hannes Hering wrote: > This patch implements toleration of dynamic memory operations and 16 GB > gigantic pages. On module load the driver walks through available system > memory, checks for available memory ranges and then registers the kernel > internal memory region accordingly. The translation of address ranges is > implemented via a 3-level busmap. Hi Hannes, For those of us who haven't read the HEA spec lately, can you give us some more detail on that? :) How does it interact with kexec/kdump? > +static int ehca_update_busmap(unsigned long pfn, unsigned long nr_pages) > +{ > + unsigned long i, start_section, end_section; > + int top, dir, idx; > + > + if (!nr_pages) > + return 0; > + > + if (!ehca_bmap) { > + ehca_bmap = kmalloc(sizeof(struct ehca_bmap), GFP_KERNEL); > + if (!ehca_bmap) > + return -ENOMEM; > + /* Set map block to 0xFF according to EHCA_INVAL_ADDR */ > + memset(ehca_bmap, 0xFF, EHCA_TOP_MAP_SIZE); > + } > + > + start_section = phys_to_abs(pfn * PAGE_SIZE) / EHCA_SECTSIZE; > + end_section = phys_to_abs((pfn + nr_pages) * PAGE_SIZE) / EHCA_SECTSIZE; phys_to_abs() ? As below, or does it come from somewhere else? arch/powerpc/include/asm/abs_addr.h: 47 static inline unsigned long phys_to_abs(unsigned long pa) 48 { 49 unsigned long chunk; 50 51 /* This is a no-op on non-iSeries */ 52 if (!firmware_has_feature(FW_FEATURE_ISERIES)) 53 return pa; 54 55 chunk = addr_to_chunk(pa); 56 57 if (chunk < mschunks_map.num_chunks) 58 chunk = mschunks_map.mapping[chunk]; 59 60 return chunk_to_addr(chunk) + (pa & MSCHUNKS_OFFSET_MASK); 61 } cheers signature.asc Description: This is a digitally signed message part ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
This patch implements toleration of dynamic memory operations and 16 GB gigantic pages. On module load the driver walks through available system memory, checks for available memory ranges and then registers the kernel internal memory region accordingly. The translation of address ranges is implemented via a 3-level busmap. Signed-off-by: Hannes Hering --- This patch is built and tested against infiniband.git. Please apply for 2.6.31. Regards Hannes Index: infiniband/drivers/infiniband/hw/ehca/ehca_main.c === --- infiniband.orig/drivers/infiniband/hw/ehca/ehca_main.c 2009-06-09 14:20:37.0 +0200 +++ infiniband/drivers/infiniband/hw/ehca/ehca_main.c 2009-06-09 14:20:47.0 +0200 @@ -52,7 +52,7 @@ #include "ehca_tools.h" #include "hcp_if.h" -#define HCAD_VERSION "0026" +#define HCAD_VERSION "0027" MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Christoph Raisch "); @@ -506,6 +506,7 @@ shca->ib_device.detach_mcast= ehca_detach_mcast; shca->ib_device.process_mad = ehca_process_mad; shca->ib_device.mmap= ehca_mmap; + shca->ib_device.dma_ops = &ehca_dma_mapping_ops; if (EHCA_BMASK_GET(HCA_CAP_SRQ, shca->hca_cap)) { shca->ib_device.uverbs_cmd_mask |= @@ -1028,17 +1029,23 @@ goto module_init1; } + ret = ehca_create_busmap(); + if (ret) { + ehca_gen_err("Cannot create busmap."); + goto module_init2; + } + ret = ibmebus_register_driver(&ehca_driver); if (ret) { ehca_gen_err("Cannot register eHCA device driver"); ret = -EINVAL; - goto module_init2; + goto module_init3; } ret = register_memory_notifier(&ehca_mem_nb); if (ret) { ehca_gen_err("Failed registering memory add/remove notifier"); - goto module_init3; + goto module_init4; } if (ehca_poll_all_eqs != 1) { @@ -1053,9 +1060,12 @@ return 0; -module_init3: +module_init4: ibmebus_unregister_driver(&ehca_driver); +module_init3: + ehca_destroy_busmap(); + module_init2: ehca_destroy_slab_caches(); @@ -1073,6 +1083,8 @@ unregister_memory_notifier(&ehca_mem_nb); + ehca_destroy_busmap(); + ehca_destroy_slab_caches(); ehca_destroy_comp_pool(); Index: infiniband/drivers/infiniband/hw/ehca/ehca_mrmw.c === --- infiniband.orig/drivers/infiniband/hw/ehca/ehca_mrmw.c 2009-06-09 14:20:37.0 +0200 +++ infiniband/drivers/infiniband/hw/ehca/ehca_mrmw.c 2009-06-09 14:20:47.0 +0200 @@ -53,6 +53,39 @@ /* max number of rpages (per hcall register_rpages) */ #define MAX_RPAGES 512 +/* DMEM toleration management */ +#define EHCA_SECTSHIFTSECTION_SIZE_BITS +#define EHCA_SECTSIZE (1UL << EHCA_SECTSHIFT) +#define EHCA_HUGEPAGESHIFT 34 +#define EHCA_HUGEPAGE_SIZE (1UL << EHCA_HUGEPAGESHIFT) +#define EHCA_HUGEPAGE_PFN_MASK ((EHCA_HUGEPAGE_SIZE - 1) >> PAGE_SHIFT) +#define EHCA_INVAL_ADDR0xULL +#define EHCA_DIR_INDEX_SHIFT 13 /* 8k Entries in 64k block */ +#define EHCA_TOP_INDEX_SHIFT (EHCA_DIR_INDEX_SHIFT * 2) +#define EHCA_MAP_ENTRIES (1 << EHCA_DIR_INDEX_SHIFT) +#define EHCA_TOP_MAP_SIZE (0x1) /* currently fixed map size */ +#define EHCA_DIR_MAP_SIZE (0x1) +#define EHCA_ENT_MAP_SIZE (0x1) +#define EHCA_INDEX_MASK (EHCA_MAP_ENTRIES - 1) +#define EHCA_REG_MR 0 +#define EHCA_REG_BUSMAP_MR (~0) + +static unsigned long ehca_mr_len; +/* + * Memory map data structures + */ +struct ehca_dir_bmap { + u64 ent[EHCA_MAP_ENTRIES]; +}; +struct ehca_top_bmap { + struct ehca_dir_bmap *dir[EHCA_MAP_ENTRIES]; +}; +struct ehca_bmap { + struct ehca_top_bmap *top[EHCA_MAP_ENTRIES]; +}; + +static struct ehca_bmap *ehca_bmap; + static struct kmem_cache *mr_cache; static struct kmem_cache *mw_cache; @@ -68,6 +101,8 @@ #define EHCA_MR_PGSHIFT1M 20 #define EHCA_MR_PGSHIFT16M 24 +static u64 ehca_map_vaddr(void *caddr); + static u32 ehca_encode_hwpage_size(u32 pgsize) { int log = ilog2(pgsize); @@ -135,7 +170,8 @@ goto get_dma_mr_exit0; } - ret = ehca_reg_maxmr(shca, e_maxmr, (u64 *)KERNELBASE, + ret = ehca_reg_maxmr(shca, e_maxmr, +(void *)ehca_map_vaddr((void *)KERNELBASE), mr_access_flags, e_pd, &e_maxmr->ib.ib_mr.lkey, &e_maxmr->ib.ib_mr.rkey); @@ -251,7 +287,7 @@ ret = ehca_reg_mr(shca, e_mr, iova_start, size, mr_access_flags, e_pd,