On Fri, 2015-11-20 at 19:21 +1100, David Gibson wrote:
> On Wed, Nov 11, 2015 at 11:27:39AM +1100, Benjamin Herrenschmidt wrote:
> > No devices yet, not even an interrupt controller, just to get
> > started.
> > 
> > Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> > ---
> >  default-configs/ppc64-softmmu.mak |   1 +
> >  hw/ppc/Makefile.objs              |   2 +
> >  hw/ppc/pnv.c                      | 600 ++++++++++++++++++++++++++++++++++++++
> >  include/hw/ppc/pnv.h              |  36 +++
> >  4 files changed, 639 insertions(+)
> >  create mode 100644 hw/ppc/pnv.c
> >  create mode 100644 include/hw/ppc/pnv.h
> 
> Many of my comments below may be made irrelevant by later patches in
> the series.
Heh, well there is where the "meat" of the new platform starts showing
up :-)

.../...

> > +#define _FDT(exp)                                                  \
> > +    do {                                                           \
> > +        int ret = (exp);                                           \
> > +        if (ret < 0) {                                             \
> > +            fprintf(stderr, "qemu: error creating device tree: %s: %s\n", \
> > +                    #exp, fdt_strerror(ret));                      \
> > +            exit(1);                                               \
> > +        }                                                          \
> > +    } while (0)
> 
> We should probably make a file where helper routines used by both
> spapr.c and pnv.c can live.

Probably, but I'd see that as a later cleanup rather than doing it in
this series...

.../...

> > +    if (pcc->l1_dcache_size) {
> > +        _FDT((fdt_property_cell(fdt, "d-cache-size",
> > +                                pcc->l1_dcache_size)));
> > +    } else {
> > +        fprintf(stderr, "Warning: Unknown L1 dcache size for cpu\n");
> 
> Hmm (note to self) should probably change a bunch of these both in
> spapr.c and pnv.c from explicit fprintfs() to modern error_report()
> and similar.

That's a train I completely missed, but yes.

.../...

> > +    }
> > +    _FDT((fdt_property(fdt, "ibm,ppc-interrupt-server#s",
> > +                       servers_prop, sizeof(servers_prop))));
> > +    _FDT((fdt_property(fdt, "ibm,ppc-interrupt-gserver#s",
> > +                       gservers_prop, sizeof(gservers_prop))));
> > +
> > +    _FDT((fdt_end_node(fdt)));
> > +}
> > +
> > +static void *powernv_create_fdt(PnvSystem *sys, uint32_t initrd_base,
> > +                                uint32_t initrd_size)
> > +{
> 
> So.. does it make sense for qemu to create a device tree for powernv,
> rather than leaving it to Opal?

Well, OPAL only creates a device-tree if you are on an FSP machine, in
which case it expects a complex data structure (HDAT) coming from the
FSP to use as a source of info.

On OpenPower machines, which is closer to what we simulate here, we do
get a device-tree as an input in OPAL; it's generated by HostBoot. Now,
I am not running HostBoot in qemu because most of what it does is
completely irrelevant to an emulated system (training the various
links, initializing the memory buffer chips, etc...).
However, we do need to pass a number of pieces of platform information
to OPAL which HB does via the device-tree, such as which cores are
enabled, the memory map configured for PCI, which PHBs are enabled,
etc... so creating a DT in qemu makes sense; it mimics HB in essence.
OPAL will enrich that device-tree before starting Linux.

> > +    /*
> > +     * Add info to guest to indentify which host is it being run on
> > +     * and what is the uuid of the guest
> > +     */
> > +    if (kvmppc_get_host_model(&buf)) {
> > +        _FDT((fdt_property_string(fdt, "host-model", buf)));
> > +        g_free(buf);
> > +    }
> > +    if (kvmppc_get_host_serial(&buf)) {
> > +        _FDT((fdt_property_string(fdt, "host-serial", buf)));
> > +        g_free(buf);
> > +    }
> 
> Since you're emulating a "bare metal" machine, surely the host
> properties aren't relevant here.

They may or may not be. But yes, I can take that out.

> > +    buf = g_strdup_printf(UUID_FMT, qemu_uuid[0], qemu_uuid[1],
> > +                          qemu_uuid[2], qemu_uuid[3], qemu_uuid[4],
> > +                          qemu_uuid[5], qemu_uuid[6], qemu_uuid[7],
> > +                          qemu_uuid[8], qemu_uuid[9], qemu_uuid[10],
> > +                          qemu_uuid[11], qemu_uuid[12], qemu_uuid[13],
> > +                          qemu_uuid[14], qemu_uuid[15]);
> > +
> > +    _FDT((fdt_property_string(fdt, "vm,uuid", buf)));
> > +    g_free(buf);
> > +
> > +    _FDT((fdt_begin_node(fdt, "chosen")));
> > +    _FDT((fdt_property(fdt, "linux,initrd-start",
> > +                       &start_prop, sizeof(start_prop))));
> > +    _FDT((fdt_property(fdt, "linux,initrd-end",
> > +                       &end_prop, sizeof(end_prop))));
> > +    _FDT((fdt_end_node(fdt)));
> > +
> > +    _FDT((fdt_property_cell(fdt, "#address-cells", 0x2)));
> > +    _FDT((fdt_property_cell(fdt, "#size-cells", 0x2)));
> > +
> > +    /* cpus */
> > +    _FDT((fdt_begin_node(fdt, "cpus")));
> > +    _FDT((fdt_property_cell(fdt, "#address-cells", 0x1)));
> > +    _FDT((fdt_property_cell(fdt, "#size-cells", 0x0)));
> > +
> > +    CPU_FOREACH(cs) {
> > +        powernv_create_cpu_node(fdt, cs, smt);
> > +    }
> > +
> > +    _FDT((fdt_end_node(fdt)));
> > +
> > +    /* Memory */
> > +    _FDT((powernv_populate_memory(fdt)));
> > +
> > +    /* /hypervisor node */
> > +    if (kvm_enabled()) {
> > +        uint8_t hypercall[16];
> > +
> > +        /* indicate KVM hypercall interface */
> > +        _FDT((fdt_begin_node(fdt, "hypervisor")));
> > +        _FDT((fdt_property_string(fdt, "compatible", "linux,kvm")));
> > +        if (kvmppc_has_cap_fixup_hcalls()) {
> > +            /*
> > +             * Older KVM versions with older guest kernels were broken with the
> > +             * magic page, don't allow the guest to map it.
> > +             */
> > +            kvmppc_get_hypercall(first_cpu->env_ptr, hypercall,
> > +                                 sizeof(hypercall));
> > +            _FDT((fdt_property(fdt, "hcall-instructions", hypercall,
> > +                               sizeof(hypercall))));
> > +        }
> > +        _FDT((fdt_end_node(fdt)));
> > +    }
> 
> And a hypercall interface surely doesn't make sense for powernv.

It's qemu paravirt; it exists on G5 too :-) It's for PR KVM, and it
allows speeding up some bits and pieces. But yeah, we don't yet really
"support" it at this point. However we might.

> > +
> > +    _FDT((fdt_end_node(fdt))); /* close root node */
> > +    _FDT((fdt_finish(fdt)));
> > +
> > +    return fdt;
> > +}
> > +
> > +static void powernv_cpu_reset(void *opaque)
> > +{
> > +    PowerPCCPU *cpu = opaque;
> > +    CPUState *cs = CPU(cpu);
> > +    CPUPPCState *env = &cpu->env;
> > +
> > +    cpu_reset(cs);
> > +
> > +    env->spr[SPR_PIR] = ppc_get_vcpu_dt_id(cpu);
> > +    env->spr[SPR_HIOR] = 0;
> > +    env->gpr[3] = FDT_ADDR;
> > +    env->nip = 0x10;
> > +    env->msr |= MSR_HVB;
> > +}
> 
> So, I believe the qemu-ishly correct way of doing this is to have the
> cpu initialization in the cpu code, rather than the platform code, as
> much as possible. On PAPR we kind of get away with initialization in
> the platform code on the grounds that it's a paravirt platform, but
> powernv doesn't have that excuse.
> 
> But this may well be stuff that changes in later patches, so..

Well no, not really. But here too, we mimic the state as coming out of
HostBoot, which isn't quite the same thing. We need to provide the FDT
entry, etc...
The "real" reset state of a P8 isn't something we can easily simulate:
it runs some microcode from a SEEPROM with a small microcontroller
which initializes a core, which then runs some HB code off its L3
cache, etc... really not something we want to do in qemu, at least for
now.

So the initial state here is somewhat in between full virt and
paravirt: we simulate a platform that has been partially initialized by
HostBoot, to the state it has when it enters OPAL.

> > +static const VMStateDescription vmstate_powernv = {
> > +    .name = "powernv",
> > +    .version_id = 1,
> > +    .minimum_version_id = 1,
> > +};
> 
> It might be best to leave out the vmstate entirely until you're ready
> to implement migration, rather than having a partial, probably not
> working migration implementation.

Ok.

> > +
> > +static void pnv_create_chip(PnvSystem *sys, unsigned int chip_no)
> > +{
> > +    PnvChip *chip = &sys->chips[chip_no];
> > +
> > +    if (chip_no >= PNV_MAX_CHIPS) {
> > +        return;
> > +    }
> > +
> > +    /* XXX Improve chip numbering to better match HW */
> > +    chip->chip_id = chip_no;
> 
> I think modern qemu conventions would suggest creating the chips as
> QOM objects rather than having a fixed array.

Yeah, more code & a much larger memory footprint for the same result
:-) I can look into it, but it's low priority. I still want to rework
some of that chip stuff in future patches anyway.
> > +}
> > +
> > +static void ppc_powernv_init(MachineState *machine)
> > +{
> > +    ram_addr_t ram_size = machine->ram_size;
> > +    const char *cpu_model = machine->cpu_model;
> > +    const char *kernel_filename = machine->kernel_filename;
> > +    const char *initrd_filename = machine->initrd_filename;
> > +    uint32_t initrd_base = 0;
> > +    long initrd_size = 0;
> > +    PowerPCCPU *cpu;
> > +    CPUPPCState *env;
> > +    MemoryRegion *sysmem = get_system_memory();
> > +    MemoryRegion *ram = g_new(MemoryRegion, 1);
> > +    sPowerNVMachineState *pnv_machine = POWERNV_MACHINE(machine);
> > +    PnvSystem *sys = &pnv_machine->sys;
> > +    long fw_size;
> > +    char *filename;
> > +    void *fdt;
> > +    int i;
> > +
> > +    /* init CPUs */
> > +    if (cpu_model == NULL) {
> > +        cpu_model = kvm_enabled() ? "host" : "POWER8";
> > +    }
> > +
> > +    for (i = 0; i < smp_cpus; i++) {
> > +        cpu = cpu_ppc_init(cpu_model);
> > +        if (cpu == NULL) {
> > +            fprintf(stderr, "Unable to find PowerPC CPU definition\n");
> > +            exit(1);
> > +        }
> > +        env = &cpu->env;
> > +
> > +        /* Set time-base frequency to 512 MHz */
> > +        cpu_ppc_tb_init(env, TIMEBASE_FREQ);
> > +
> > +        /* MSR[IP] doesn't exist nowadays */
> > +        env->msr_mask &= ~(1 << 6);
> > +
> > +        qemu_register_reset(powernv_cpu_reset, cpu);
> > +    }
> > +
> > +    /* allocate RAM */
> > +    memory_region_allocate_system_memory(ram, NULL, "ppc_powernv.ram",
> > +                                         ram_size);
> > +    memory_region_add_subregion(sysmem, 0, ram);
> > +
> > +    /* XXX We should decide how many chips to create based on #cores and
> > +     * Venice vs. Murano vs. Naples chip type etc..., for now, just create
> > +     * one chip. Also creation of the CPUs should be done per-chip
> > +     */
> > +    sys->num_chips = 1;
> > +
> > +    /* Create only one PHB for now until I figure out what's wrong
> > +     * when I create more (resource assignment failures in Linux)
> > +     */
> > +    pnv_create_chip(sys, 0);
> > +
> > +    if (bios_name == NULL) {
> > +        bios_name = FW_FILE_NAME;
> > +    }
> > +    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> > +    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> > +    if (fw_size < 0) {
> > +        hw_error("qemu: could not load OPAL '%s'\n", filename);
> > +        exit(1);
> > +    }
> > +    g_free(filename);
> > +
> > +
> > +    if (kernel_filename == NULL) {
> > +        kernel_filename = KERNEL_FILE_NAME;
> > +    }
> > +    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, kernel_filename);
> 
> The commit with the Opal image should go in before this, no?

Now this is a bit of an open discussion at the moment :-)

The way OPAL is built on OPP machines today is by essentially building
a complete flash image which contains HostBoot, OPAL and the petitboot-
based bootloader which contains a Linux kernel etc... We could create a
target without HB and with a slimmed-down Linux, but it would still
probably be about 12MB I reckon, if not more. It feels a bit "big" to
ship as a binary as part of qemu... We would also have to add code to
qemu to "find" OPAL in that image, and then add a model for the flash
controller.

The other option is to bundle just OPAL itself. However, that means you
can't go anywhere without a -kernel argument, which would then be
either a petitboot-based bootloader or your actual target kernel.

Cheers,
Ben.