On Mon, May 16, 2016 at 11:55 AM, Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
> On 05/13/2016 06:41 PM, Bharata B Rao wrote:
>>
>> On Wed, May 4, 2016 at 12:22 PM, Alexey Kardashevskiy <a...@ozlabs.ru>
>> wrote:
>>
>>> +
>>> +    avail = SPAPR_PCI_DMA_MAX_WINDOWS - spapr_phb_get_active_win_num(sphb);
>>> +
>>> +    rtas_st(rets, 0, RTAS_OUT_SUCCESS);
>>> +    rtas_st(rets, 1, avail);
>>> +    rtas_st(rets, 2, max_window_size);
>>> +    rtas_st(rets, 3, pgmask);
>>> +    rtas_st(rets, 4, 0); /* DMA migration mask, not supported */
>>> +
>>> +    trace_spapr_iommu_ddw_query(buid, addr, avail, max_window_size, pgmask);
>>> +    return;
>>> +
>>> +param_error_exit:
>>> +    rtas_st(rets, 0, RTAS_OUT_PARAM_ERROR);
>>> +}
>>> +
>>> +static void rtas_ibm_create_pe_dma_window(PowerPCCPU *cpu,
>>> +                                          sPAPRMachineState *spapr,
>>> +                                          uint32_t token, uint32_t nargs,
>>> +                                          target_ulong args,
>>> +                                          uint32_t nret, target_ulong rets)
>>> +{
>>> +    sPAPRPHBState *sphb;
>>> +    sPAPRTCETable *tcet = NULL;
>>> +    uint32_t addr, page_shift, window_shift, liobn;
>>> +    uint64_t buid;
>>> +
>>> +    if ((nargs != 5) || (nret != 4)) {
>>> +        goto param_error_exit;
>>> +    }
>>> +
>>> +    buid = ((uint64_t)rtas_ld(args, 1) << 32) | rtas_ld(args, 2);
>>> +    addr = rtas_ld(args, 0);
>>> +    sphb = spapr_pci_find_phb(spapr, buid);
>>> +    if (!sphb || !sphb->ddw_enabled) {
>>> +        goto param_error_exit;
>>> +    }
>>> +
>>> +    page_shift = rtas_ld(args, 3);
>>> +    window_shift = rtas_ld(args, 4);
>>
>> The kernel has a bug due to which a wrong window_shift gets returned
>> here. I have posted a possible fix here:
>> https://patchwork.ozlabs.org/patch/621497/
>>
>> I have tried to work around this issue in QEMU too:
>> https://lists.nongnu.org/archive/html/qemu-ppc/2016-04/msg00226.html
>>
>> But the above workaround involves changing the memory representation
>> in the DT.
>
> What is wrong with this workaround?
The above workaround results in different representations of memory in the
DT before and after the workaround. Currently, for

    -m 2G -numa node,nodeid=0,mem=1G -numa node,nodeid=1,mem=0.5G

we have the following nodes in the DT:

    memory@0
    memory@40000000
    ibm,dynamic-reconfiguration-memory

ibm,dynamic-memory will have only DR LMBs:

[root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
0000000 0000 000a 0000 0000 8000 0000 8000 0008
0000010 0000 0000 ffff ffff 0000 0000 0000 0000
0000020 9000 0000 8000 0009 0000 0000 ffff ffff
0000030 0000 0000 0000 0000 a000 0000 8000 000a
0000040 0000 0000 ffff ffff 0000 0000 0000 0000
0000050 b000 0000 8000 000b 0000 0000 ffff ffff
0000060 0000 0000 0000 0000 c000 0000 8000 000c
0000070 0000 0000 ffff ffff 0000 0000 0000 0000
0000080 d000 0000 8000 000d 0000 0000 ffff ffff
0000090 0000 0000 0000 0000 e000 0000 8000 000e
00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
00000b0 f000 0000 8000 000f 0000 0000 ffff ffff
00000c0 0000 0000 0000 0001 0000 0000 8000 0010
00000d0 0000 0000 ffff ffff 0000 0000 0000 0001
00000e0 1000 0000 8000 0011 0000 0000 ffff ffff
00000f0 0000 0000

The memory region looks like this:

memory-region: system
  0000000000000000-ffffffffffffffff (prio 0, RW): system
    0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
    0000000080000000-000000011fffffff (prio 0, RW): hotplug-memory

After this workaround, all of this changes as below:

    memory@0
    ibm,dynamic-reconfiguration-memory

All LMBs are now in ibm,dynamic-memory:

[root@localhost ibm,dynamic-reconfiguration-memory]# hexdump ibm,dynamic-memory
0000000 0000 0010 0000 0000 0000 0000 8000 0000
0000010 0000 0000 0000 0000 0000 0080 0000 0000
0000020 1000 0000 8000 0001 0000 0000 0000 0000
0000030 0000 0080 0000 0000 2000 0000 8000 0002
0000040 0000 0000 0000 0000 0000 0080 0000 0000
0000050 3000 0000 8000 0003 0000 0000 0000 0000
0000060 0000 0080 0000 0000 4000 0000 8000 0004
0000070 0000 0000 0000 0001 0000 0008 0000 0000
0000080 5000 0000 8000 0005 0000 0000 0000 0001
0000090 0000 0008 0000 0000 6000 0000 8000 0006
00000a0 0000 0000 ffff ffff 0000 0000 0000 0000
00000b0 7000 0000 8000 0007 0000 0000 ffff ffff
00000c0 0000 0000 0000 0000 8000 0000 8000 0008
00000d0 0000 0000 ffff ffff 0000 0000 0000 0000
00000e0 9000 0000 8000 0009 0000 0000 ffff ffff
00000f0 0000 0000 0000 0000 a000 0000 8000 000a
0000100 0000 0000 ffff ffff 0000 0000 0000 0000
0000110 b000 0000 8000 000b 0000 0000 ffff ffff
0000120 0000 0000 0000 0000 c000 0000 8000 000c
0000130 0000 0000 ffff ffff 0000 0000 0000 0000
0000140 d000 0000 8000 000d 0000 0000 ffff ffff
0000150 0000 0000 0000 0000 e000 0000 8000 000e
0000160 0000 0000 ffff ffff 0000 0000 0000 0000
0000170 f000 0000 8000 000f 0000 0000 ffff ffff
0000180 0000 0000

The hotplug memory region gets a new address range now:

memory-region: system
  0000000000000000-ffffffffffffffff (prio 0, RW): system
    0000000000000000-000000005fffffff (prio 0, RW): ppc_spapr.ram
    0000000060000000-00000000ffffffff (prio 0, RW): hotplug-memory

So when a guest that was booted with an older QEMU is migrated to a newer
QEMU that has this workaround, it will start exhibiting the above changes
after the first reboot post migration. If the user has done memory hotplug
by explicitly specifying an address on the source, then even migration
would fail, because the address specified at the target will not be part
of the hotplug-memory range. Hence I believe we shouldn't work around the
issue in this manner, but should instead keep the workaround in the DDW
code, where the window can easily be fixed.
>
>> Hence I feel that until the guest kernel changes are available, a
>> simpler workaround would be to discard the window_shift value above
>> and recalculate the right value as below:
>>
>>     if (machine->ram_size == machine->maxram_size) {
>>         max_window_size = machine->ram_size;
>>     } else {
>>         MemoryHotplugState *hpms = &spapr->hotplug_memory;
>>         max_window_size = hpms->base + memory_region_size(&hpms->mr);
>>     }
>>     window_shift = max_window_size >> SPAPR_TCE_PAGE_SHIFT;
>>
>> and create the DDW based on this calculated window_shift value. Does
>> that sound reasonable?
>
> The workaround should only do that for the second window, at least, or
> for the default one but with a page size of at least 64K; otherwise it is
> going to be a waste of memory (2MB per each 1GB of guest RAM).

Ok, will sync up with you separately to understand more about the 'two'
windows here.

Regards,
Bharata.