On Fri, Aug 28, 2020 at 12:29:31PM -0400, Eduardo Habkost wrote:
> On Fri, Aug 28, 2020 at 09:55:33AM +0100, Daniel P. Berrangé wrote:
> > On Thu, Aug 27, 2020 at 10:55:26PM +0200, Igor Mammedov wrote:
> > > On Thu, 27 Aug 2020 15:07:52 -0400 Eduardo Habkost <ehabk...@redhat.com> wrote:
> > > > On Thu, Aug 27, 2020 at 07:03:14PM +0200, Igor Mammedov wrote:
> > > > > On Wed, 26 Aug 2020 16:03:40 +0100 Daniel P. Berrangé <berra...@redhat.com> wrote:
> > > > > > On Wed, Aug 26, 2020 at 04:02:58PM +0200, Igor Mammedov wrote:
> > > > > > > On Wed, 26 Aug 2020 14:36:38 +0100 Daniel P. Berrangé <berra...@redhat.com> wrote:
> > > > > > > > On Wed, Aug 26, 2020 at 03:30:34PM +0200, Igor Mammedov wrote:
> > > > > > > > > On Wed, 26 Aug 2020 13:50:59 +0100 Daniel P. Berrangé <berra...@redhat.com> wrote:
> > > > > > > > > > On Wed, Aug 26, 2020 at 02:38:49PM +0200, Igor Mammedov wrote:
> > > > > > > > > > > On Fri, 21 Aug 2020 17:12:19 -0500 Babu Moger <babu.mo...@amd.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > To support some of the complex topologies, we introduced EPYC mode apicid decode. But EPYC mode decode is running into problems, and it can become quite a maintenance problem in the future. So it was decided to remove that code and use the generic decode, which works for the majority of topologies. Most of the SPECed configurations would work just fine; some non-SPECed user inputs will create a sub-optimal configuration. Here is the discussion thread:
> > > > > > > > > > > > https://lore.kernel.org/qemu-devel/c0bcc1a6-1d84-a6e7-e468-d5b437c1b...@amd.com/
> > > > > > > > > > > >
> > > > > > > > > > > > This series removes all the EPYC mode specific apicid changes and uses the generic apicid decode.
> > > > > > > > > > >
> > > > > > > > > > > The main difference between EPYC and all other CPUs is that it requires NUMA configuration (it's not optional), so we need an extra patch on top of this series to enforce that, i.e.:
> > > > > > > > > > >
> > > > > > > > > > >   if (epyc && !numa)
> > > > > > > > > > >       error("EPYC cpu requires numa to be configured")
> > > > > > > > > >
> > > > > > > > > > Please no. This will break 90% of current usage of the EPYC CPU in real-world QEMU deployments. That is way too user hostile to introduce as a requirement.
> > > > > > > > > >
> > > > > > > > > > Why do we need to force this? People have been successfully using EPYC CPUs without NUMA in QEMU for years now.
> > > > > > > > > >
> > > > > > > > > > It might not match the behaviour of bare metal silicon, but that hasn't obviously caused the world to come crashing down.
> > > > > > > > >
> > > > > > > > > So far it produces a warning in the Linux kernel (RHBZ1728166), and the resulting performance might be suboptimal, but I haven't seen anyone reporting crashes yet.
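To make the trade-off concrete: the EPYC usage that is common today has no -numa at all, while the proposed check would require at least one explicit node. A minimal sketch of both invocations (not taken from the thread; option values are illustrative only):

  # Common today: EPYC guest with no NUMA config at all -- this is what
  # the proposed check would start rejecting.
  qemu-system-x86_64 -machine q35,accel=kvm -m 4G \
      -cpu EPYC -smp 8,sockets=1,cores=8,threads=1

  # Minimal explicit NUMA config that would satisfy such a check: a single
  # node covering all of the RAM and CPUs.
  qemu-system-x86_64 -machine q35,accel=kvm -m 4G \
      -cpu EPYC -smp 8,sockets=1,cores=8,threads=1 \
      -object memory-backend-ram,id=ram0,size=4G \
      -numa node,nodeid=0,memdev=ram0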
> > > > > > > > >
> > > > > > > > > What other options do we have? Perhaps we can turn on the strict check for new machine types only, so old configs can keep the broken topology (CPUID), while new ones would require -numa and produce a correct topology.
> > > > > > > >
> > > > > > > > No, tying this to machine types is not viable either. That is still going to break essentially every single management application that exists today using QEMU.
> > > > > > >
> > > > > > > For that we have the deprecation process, so users could switch to the new CLI that would be required.
> > > > > >
> > > > > > We could, but I don't find the cost/benefit tradeoff compelling.
> > > > > >
> > > > > > There are so many places where we diverge from what bare metal would do that I don't see a good reason to introduce this breakage, even if we notify users via a deprecation message.
> > > > >
> > > > > I find (3) and (4) good enough reasons to use deprecation.
> > > > >
> > > > > > If QEMU wants to require NUMA for EPYC, then QEMU could internally create a single NUMA node if none was specified for new machine types, such that there is no visible change or breakage to any mgmt apps.
> > > > >
> > > > > (1) For configs that started without -numa &&|| without -smp dies>1, QEMU can do just that (enable auto_enable_numa).
> > > >
> > > > Why exactly do we need auto_enable_numa with dies=1?
> > > >
> > > > If I understand correctly, Babu said earlier in this thread[1] that we don't need auto_enable_numa.
> > > >
> > > > [1] https://lore.kernel.org/qemu-devel/11489e5f-2285-ddb4-9c35-c9f522d60...@amd.com/
> > >
> > > In the case of 1 die, -numa is not a must-have, as it's one NUMA node only. Though having auto_enable_numa will allow reusing the CPU.node-id property to compose CPUID_Fn8000001E_ECX, i.e. only code one path instead of numa and non-numa variants.
> > > > >
> > > > > (2) As for configs that are out of spec, I do not care much (junk in, junk out). (Though not having to spend time on bug reports and debugging issues, just to say it's not supported in the end, makes deprecation sound like a reasonable choice.)
> > > > >
> > > > > (3) However, if the config matches bare metal, i.e. the CPU has more than 1 die and is within the dies limits (spec wise), QEMU has to produce valid CPUs. In this case QEMU can't make up multiple NUMA nodes and mappings of RAM/CPUs on the user's behalf. That's where we have to error out and ask for an explicit NUMA configuration.
> > > > >
> > > > > For such configs, the current code (since 5.0) will in the best case produce performance issues due to mismatching data in the APIC ID, CPUID and ACPI tables; in the worst case the issues might be related to an invalid APIC ID if running on an EPYC host and the HW takes into account subfields of the APIC ID (according to Babu, the real CPU uses die_id (aka node_id) internally). I'd rather error out on nonsense configs earlier than debug such issues and then error out anyway later (upsetting more users).
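For the multi-die case in (3), the explicit configuration being asked for would look roughly like the sketch below: one NUMA node per die, with RAM and CPUs mapped to nodes so the ACPI tables and CPUID data stay consistent. The sizes and counts are illustrative only, and mapping CPUs at die-id granularity assumes a machine type that supports the dies property:

  # 1 socket, 2 dies, 4 cores per die; one NUMA node per die.
  qemu-system-x86_64 -machine q35,accel=kvm -m 8G \
      -cpu EPYC \
      -smp 8,sockets=1,dies=2,cores=4,threads=1 \
      -object memory-backend-ram,id=ram0,size=4G \
      -object memory-backend-ram,id=ram1,size=4G \
      -numa node,nodeid=0,memdev=ram0 \
      -numa node,nodeid=1,memdev=ram1 \
      -numa cpu,node-id=0,socket-id=0,die-id=0 \
      -numa cpu,node-id=1,socket-id=0,die-id=1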
> > > >
> > > > The requirements are not clear to me. Is this just about making the CPU die_id match the NUMA node ID, or are there additional constraints?
> > >
> > > die_id is a per-socket NUMA node index, so it's not the NUMA node id in the sense we use it in QEMU (that's where all the confusion started that led to the current code).
> > >
> > > I understood that each die in an EPYC chip is a NUMA node, which encodes the NUMA node ID (system wide) in CPUID_Fn8000001E_ECX; that's why I wrote earlier that EPYC makes -numa non-optional.
> >
> > AFAIK, that isn't a hard requirement. In the bare metal EPYC machines I have used, the BIOS lets you choose whether the dies are exposed as 1, 2 or 4 NUMA nodes. So there's no fixed die == numa node mapping that I see.
>
> If you change that setting, will all CPUID bits be kept the same, or will the die topology seen by the OS change?
I don't know offhand, and I don't currently have access to the hardware. All I know is that I was able to change between 1, 2 and 4 NUMA nodes and that this was reflected in the numactl display; I didn't check the CPUID when I was testing previously.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
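For anyone wanting to check both things at once, the NUMA node count the OS sees and the raw topology leaf can be inspected with standard tools from inside the guest (or on the host). A sketch only; the cpuid utility is a separate package, and 0x8000001e is the CPUID_Fn8000001E leaf discussed above:

  # How many NUMA nodes the OS actually sees.
  numactl --hardware
  lscpu | grep -i numa

  # Raw CPUID data for the AMD node/core identifiers leaf, if the
  # cpuid utility is installed.
  cpuid -1 -l 0x8000001e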