> -----Original Message-----
> From: Igor Mammedov <imamm...@redhat.com>
> Sent: Friday, August 28, 2020 6:25 AM
> To: Daniel P. Berrangé <berra...@redhat.com>
> Cc: Moger, Babu <babu.mo...@amd.com>; Dr. David Alan Gilbert
> <dgilb...@redhat.com>; ehabk...@redhat.com; m...@redhat.com; Michal
> Privoznik <mpriv...@redhat.com>; qemu-devel@nongnu.org;
> pbonz...@redhat.com; r...@twiddle.net
> Subject: Re: [PATCH v5 0/8] Remove EPYC mode apicid decode and use
> generic decode
> 
> On Fri, 28 Aug 2020 09:58:03 +0100
> Daniel P. Berrangé <berra...@redhat.com> wrote:
> 
> > On Thu, Aug 27, 2020 at 10:21:10PM +0200, Igor Mammedov wrote:
> > > On Wed, 26 Aug 2020 13:45:51 -0500
> > > Babu Moger <babu.mo...@amd.com> wrote:
> > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Dr. David Alan Gilbert <dgilb...@redhat.com>
> > > > > Sent: Wednesday, August 26, 2020 1:34 PM
> > > > > To: Moger, Babu <babu.mo...@amd.com>
> > > > > Cc: Igor Mammedov <imamm...@redhat.com>; Daniel P. Berrangé
> > > > > <berra...@redhat.com>; ehabk...@redhat.com; m...@redhat.com;
> > > > > Michal Privoznik <mpriv...@redhat.com>; qemu-
> de...@nongnu.org;
> > > > > pbonz...@redhat.com; r...@twiddle.net
> > > > > Subject: Re: [PATCH v5 0/8] Remove EPYC mode apicid decode and
> > > > > use generic decode
> > > > >
> > > > > * Babu Moger (babu.mo...@amd.com) wrote:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Igor Mammedov <imamm...@redhat.com>
> > > > > > > Sent: Wednesday, August 26, 2020 8:31 AM
> > > > > > > To: Daniel P. Berrangé <berra...@redhat.com>
> > > > > > > Cc: Moger, Babu <babu.mo...@amd.com>;
> pbonz...@redhat.com;
> > > > > > > r...@twiddle.net; ehabk...@redhat.com; qemu-
> de...@nongnu.org;
> > > > > > > m...@redhat.com; Michal Privoznik <mpriv...@redhat.com>
> > > > > > > Subject: Re: [PATCH v5 0/8] Remove EPYC mode apicid decode
> > > > > > > and use generic decode
> > > > > > >
> > > > > > > On Wed, 26 Aug 2020 13:50:59 +0100 Daniel P. Berrangé
> > > > > > > <berra...@redhat.com> wrote:
> > > > > > >
> > > > > > > > On Wed, Aug 26, 2020 at 02:38:49PM +0200, Igor Mammedov
> wrote:
> > > > > > > > > On Fri, 21 Aug 2020 17:12:19 -0500 Babu Moger
> > > > > > > > > <babu.mo...@amd.com> wrote:
> > > > > > > > >
> > > > > > > > > > To support some of the complex topology, we introduced
> > > > > > > > > > EPYC mode
> > > > > > > apicid decode.
> > > > > > > > > > But, EPYC mode decode is running into problems. Also
> > > > > > > > > > it can become quite a maintenance problem in the
> > > > > > > > > > future. So, it was decided to remove that code and use
> > > > > > > > > > the generic decode which works for majority of the
> > > > > > > > > > topology. Most of the SPECed configuration would work
> > > > > > > > > > just fine. With some non-SPECed user inputs, it will
> > > > > > > > > > create some sub-
> > > > > > > optimal configuration.
> > > > > > > > > > Here is the discussion thread.
> > > > > > > > > > https://nam11.safelinks.protection.outlook.com/?url=ht
> > > > > > > > > > tps%3A%252
> > > > > > > > > > F%2F
> > > > > > > > > > lore.kernel.org%2Fqemu-devel%2Fc0bcc1a6-1d84-a6e7-
> e468
> > > > > > > > > > -
> > > > > > > d5b437c1b25
> > > > > > > > > >
> > > > > > >
> > > > >
> 4%40amd.com%2F&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C8a
> 5c
> > > > > > > 52f92
> > > > > > > > > >
> > > > > > >
> > > > >
> 3f04082a40808d849c43d49%7C3dd8961fe4884e608e11a82d994e183d%7C0%7
> > > > > > > C0
> > > > > > > > > >
> > > > > > >
> > > > >
> %7C637340454473508873&amp;sdata=VnW28H1v4XwK3GaNGFxu%2BhwiMe
> A
> > > > > > > YO%2B
> > > > > > > > > > 3WAzo3DeY5Ha8%3D&amp;reserved=0
> > > > > > > > > >
> > > > > > > > > > This series removes all the EPYC mode specific apicid
> > > > > > > > > > changes and use the generic apicid decode.
> > > > > > > > >
> > > > > > > > > the main difference between EPYC and all other CPUs is
> > > > > > > > > that it requires numa configuration (it's not optional)
> > > > > > > > > so we need an extra
> > > > > > No, That is not true. Because of that assumption we made all
> > > > > > these apicid changes. And here we are now.
> > > > > >
> > > > > > AMD supports varies mixed configurations. In case of
> > > > > > EPYC-Rome, we have NPS1, NPS2 and NPS4(Numa Nodes per
> socket).
> > > > > > In case of NPS1, basically we have all the cores in a socket
> > > > > > under one numa node. This is non-numa configuration.
> > > > > > Looking at the various configurations and also discussing
> > > > > > internally, it is not advisable to have (epyc && !numa) check.
> > > > >
> > > > > Indeed on real hardware, I don't think we always see NUMA; my
> > > > > single socket,
> > > > > 16 core/32 thread 7302P Dell box, shows the kernel printing 'No
> > > > > NUMA configuration found...Faking a node.'
> > > looks like firmware bug or maybe it's feature and there is a knob in
> > > fw to turn it on/off in case used OS doesn't like it for some reason.
> > >
> > >
> > > > > So if real hardware hasn't got a NUMA node, what's the real problem?
> > > >
> > > > I don't see any problem once we revert all these changes(patch 1-7).
> > > > We don't need if (epyc && !numa) error check or
> > > > auto_enable_numa=true unconditionally.
> > >
> > > We need revert to unbreak migration from QEMU < 5.0, everything else
> > > (fixes for CPUID_Fn8000001E) could go on top.
> > >
> > > So what's on top (because old code also wasn't correct when
> > > CPUID_Fn8000001E is taken in account, tha's why we are at this
> > > point),
> > >
> > > When starting QEMU without -numa
> > > Indeed we can skip "if (epyc && !numa) error check or
> > > auto_enable_numa=true", in case where there is 1 die (NPS1).
> > > (1) User however may set core/threads number bigger than possible by
> spec,
> > >     in which case CPUID_Fn8000001E_EBX/CPUID_Fn8000001E_ECX will not
> be
> > >     valid spec vise and could trigger OPPs in guest kernel.
> > >     Given we allow go out of spec, perhaps we should add a warning at
> > >     realize time saying that used -smp config is not supported since it
> > >     doesn't match AMD EPYC spec and might not work.
> > >
> > > (2) Earlier we agreed that we can reuse existing die_id instead of 
> > > internal
> > >     (topo_ids.node_id in current code)
> > >     (It's is called DIE_ID and NODE ID in spec interchangeably)
> > >     Same as (1) add a warning when '-smp dies' goes beyond spec limits.
> > >
> > > (3) "-smp dies>1" ''if'' we allow to run it without -numa,
> > >     then system wide NUMA node id in CPUID_Fn8000001E_ECX probably
> doesn't matter.
> > >     could be something like in spec but taking in account die offset, to
> produce
> > >     unique id.
> > >
> > >     Same, add a warning that there are more than 1 dies but numa is not
> enabled,
> > >     suggest to enable numa.
> > >
> > >     With current code it produces invalid APIC ID for valid '-smp'
> combination,
> > >     however if we revert it and switch to die_id than it should produce
> > >     valid APIC ID once again (as in 4.2).
> > >     Given it produces invalid APIC id, maybe we should just ditch the case
> and
> > >     fold it in (4) (i.e. require -numa if "-smp dies>1")
> > >
> > > (4) -numa is used (RHBZ1728166)
> > >     we need to ensure that socket*dies == ms->numa_state->num_nodes
> > >      and make sure that CPUID_Fn8000001E_ECX consistent with
> > >     cpu mapping provided with "-numa cpu=" option.
> >
> > Why do we need to socket*dies == ms->numa_state->num_nodes ? That
> > doesn't seem to be the case in bare metal EPYC nodes I've used which
> > lets you configure how many NUMA nodes in firmware.
> 
> (From dumps Babu has provided earlier, it was dies == nodes and
> CPUID_Fn8000001E_ECX == numa node ids in SRAT.)

Yes, That is correct. In most cases dies == nodes.

But that is going to change. In future(even in EPYC-Rome) with new f/w
BIOS option, users can configure their numa node. It will give the option
to keep NPS1, SPS2 or NSP4(Nodes per socket). In those cases dies and
nodes will not match. That is why I wanted to keep them separate. User can
change dies or -numa to match their bios config.

> 
> dumping CPUID_Fn8000001E and SRAT table for such configs will help us to
> figure out if we need socket*dies != nodes and how to compose config were
> SRAT differs from CPUID_Fn8000001E_ECX.
> 
> Babu, can you provide CPUID_Fn8000001E and SRAT dumps for above configs
> combinations? Or to some spec/guide how it should be.

I dont have the config right now. But I will try to get one.

> 
> 
> >
> >
> > Regards,
> > Daniel


Reply via email to