On 2/5/20 10:56 AM, Igor Mammedov wrote:
> On Wed, 5 Feb 2020 10:10:06 -0600
> Babu Moger <babu.mo...@amd.com> wrote:
>
>> On 2/5/20 3:38 AM, Igor Mammedov wrote:
>>> On Tue, 4 Feb 2020 13:08:58 -0600
>>> Babu Moger <babu.mo...@amd.com> wrote:
>>>
>>>> On 2/4/20 2:02 AM, Igor Mammedov wrote:
>>>>> On Mon, 3 Feb 2020 13:31:29 -0600
>>>>> Babu Moger <babu.mo...@amd.com> wrote:
>>>>>
>>>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:
>>>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
>>>>>>> Babu Moger <babu.mo...@amd.com> wrote:
>>>>>>>
>>>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
>>>>>>>>
>>>>>>>> Currently, the APIC ID is decoded based on the sequence
>>>>>>>> sockets->dies->cores->threads. This works for most standard AMD and
>>>>>>>> other vendors' configurations, but this decoding sequence does not
>>>>>>>> strictly follow AMD's APIC ID enumeration. In some cases this can
>>>>>>>> cause CPU topology inconsistency. When booting a guest VM, the kernel
>>>>>>>> tries to validate the topology and finds it inconsistent with the
>>>>>>>> enumeration of EPYC cpu models.
>>>>>>>>
>>>>>>>> To fix the problem we need to build the topology as per the Processor
>>>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
>>>>>>>> Processors. It is available at
>>>>>>>> https://www.amd.com/system/files/TechDocs/55570-B1_PUB.zip
>>>>>>>>
>>>>>>>> Here is the text from the PPR:
>>>>>>>> Operating systems are expected to use
>>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize], the number of least significant
>>>>>>>> bits in the Initial APIC ID that indicate core ID within a processor,
>>>>>>>> in constructing per-core CPUID masks.
>>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of
>>>>>>>> cores (MNC) that the processor could theoretically support, not the
>>>>>>>> number of cores that are actually implemented or enabled on the
>>>>>>>> processor, as indicated by Core::X86::Cpuid::SizeId[NC].
>>>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
>>>>>>>> • ApicId[6] = Socket ID.
>>>>>>>> • ApicId[5:4] = Node ID.
>>>>>>>> • ApicId[3] = Logical CCX L3 complex ID
>>>>>>>> • ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0],ThreadId} :
>>>>>>>>                {1'b0,LogicalCoreID[1:0]}
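
For illustration, the fixed bit layout quoted above can be written out as a
small encoder. This is only a sketch of the PPR text, with the field widths
hardcoded to the values quoted; QEMU's real helpers derive the widths from the
configured topology instead:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the EPYC APIC ID layout from PPR 55570-B1 quoted above.
 * Field widths are the fixed ones from the PPR text, purely for
 * illustration. */
static uint32_t epyc_apicid(uint32_t socket, uint32_t node, uint32_t ccx,
                            uint32_t core, uint32_t thread, int smt)
{
    /* ApicId[2:0]: {LogicalCoreID[1:0],ThreadId} with SMT,
     * {1'b0,LogicalCoreID[1:0]} without */
    uint32_t low = smt ? (((core & 0x3) << 1) | (thread & 0x1))
                       : (core & 0x3);

    return (socket << 6) |          /* ApicId[6]   = Socket ID */
           ((node & 0x3) << 4) |    /* ApicId[5:4] = Node ID   */
           ((ccx & 0x1) << 3) |     /* ApicId[3]   = CCX ID    */
           low;
}
```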
>>>>>>>
>>>>>>>
>>>>>>> After checking out all the patches and some pondering, the approach
>>>>>>> used here looks too intrusive to me for the task at hand, especially
>>>>>>> where it comes to generic code.
>>>>>>>
>>>>>>> (Skip ahead to ==== to see the suggestion for how to simplify, without
>>>>>>> reading the reasoning behind it first.)
>>>>>>>
>>>>>>> Let's look for a way to simplify it a little bit.
>>>>>>>
>>>>>>> So the problem we are trying to solve:
>>>>>>> 1: calculate APIC IDs based on cpu type (to be more specific: for
>>>>>>> EPYC-based CPUs)
>>>>>>> 2: it depends on knowing the total number of numa nodes.
>>>>>>>
>>>>>>> Externally, the workflow looks like the following:
>>>>>>> 1. the user provides -smp x,sockets,cores,...,maxcpus
>>>>>>>    which is used by the possible_cpu_arch_ids() singleton to build
>>>>>>>    the list of possible CPUs (which is available to the user via the
>>>>>>>    'hotpluggable-cpus' command)
>>>>>>>
>>>>>>> The hook could be called very early, and the possible_cpus data might
>>>>>>> be incomplete. It builds a list of possible CPUs which the user could
>>>>>>> modify later.
>>>>>>>
>>>>>>> 2.1 the user uses "-numa cpu,node-id=x,..." or the legacy
>>>>>>>     "-numa node,node_id=x,cpus=" options to assign cpus to nodes,
>>>>>>>     which one way or another calls machine_set_cpu_numa_node().
>>>>>>>     The latter updates the 'possible_cpus' list with node
>>>>>>>     information. It happens early, when the total number of nodes
>>>>>>>     is not yet available.
>>>>>>>
>>>>>>> 2.2 the user does not provide explicit node mappings for CPUs.
>>>>>>>     QEMU steps in and assigns possible cpus to nodes in
>>>>>>>     machine_numa_finish_cpu_init() (using the same
>>>>>>>     machine_set_cpu_numa_node()) right before calling the
>>>>>>>     board-specific machine init(). At that time the total number
>>>>>>>     of nodes is known.
>>>>>>>
>>>>>>> In cases 1 and 2.1, 'arch_id' in the 'possible_cpus' list doesn't
>>>>>>> have to be defined before the board's init() is run.
>>>>
>>>> In case 2.1, we need to have the arch_id already generated. This is
>>>> done inside possible_cpu_arch_ids. The arch_id is used by
>>>> machine_set_cpu_numa_node to assign the cpus to the correct numa node.
>>>
>>> I might have missed something, but I don't see arch_id itself being used
>>> in machine_set_cpu_numa_node(). It only uses the props part of
>>> possible_cpus.
>>
>> Before calling machine_set_cpu_numa_node, we call
>> cpu_index_to_instance_props -> x86_cpu_index_to_props ->
>> possible_cpu_arch_ids -> x86_possible_cpu_arch_ids.
>>
>> This sequence sets up the arch_id (in x86_cpu_apic_id_from_index) for all
>> the available cpus. Based on the arch_id, it also sets up the props.
>
>
> x86_possible_cpu_arch_ids()
>     arch_id = x86_cpu_apic_id_from_index(x86ms, i)
>     x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,
>                              ms->smp.threads, &topo);
>     // assign socket/die/core/thread from topo
>
> so currently it uses an indirect way to convert an index in
> possible_cpus->cpus[] to socket/die/core/thread ids.
> But essentially it takes the '-smp' options and a [0..max_cpus) number as
> the original data, converts that into an intermediate apic_id, and then
> reverse engineers it back into topo info.
>
> Why not use x86_topo_ids_from_idx() directly to get rid of the 'props'
> dependency on apic_id?
It might work, but this feels like a workaround that just delays the problem.
By rearranging the numa code a little bit, we can address this.
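
For reference, the x86_topo_ids_from_idx() conversion Igor points at is plain
div/mod arithmetic on the cpu index, with no intermediate APIC ID. A
simplified sketch follows; the struct and parameter names are modeled loosely
on QEMU's topology.h, not its exact signatures:

```c
#include <assert.h>

/* Illustrative topo-id struct, loosely following QEMU's X86CPUTopoIDs. */
typedef struct {
    unsigned pkg_id, die_id, core_id, smt_id;
} X86CPUTopoIDs;

/* Direct index -> topo-ids conversion: each id falls straight out of
 * div/mod on the cpu index, so no apic_id round trip is needed. */
static void topo_ids_from_idx(unsigned dies_per_pkg, unsigned cores_per_die,
                              unsigned threads_per_core, unsigned cpu_index,
                              X86CPUTopoIDs *topo)
{
    topo->smt_id  = cpu_index % threads_per_core;
    topo->core_id = (cpu_index / threads_per_core) % cores_per_die;
    topo->die_id  = (cpu_index / (threads_per_core * cores_per_die))
                    % dies_per_pkg;
    topo->pkg_id  = cpu_index
                    / (threads_per_core * cores_per_die * dies_per_pkg);
}
```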
>
>
>
>> And these props values are used to assign the nodes in
>> machine_set_cpu_numa_node.
>>
>> At this point we are still parsing the numa nodes, so we don't know the
>> total number of numa nodes. Without that information, the arch_id
>> generated here will not be correct for EPYC models.
>>
>> This is the reason for changing the generic numa code (patch #12, Split
>> the numa initialization).
>>
>>>
>>>
>>>> If we want to move the arch_id generation into board init(), then we need
>>>> to save the cpu indexes belonging to each node somewhere.
>>>
>>> when cpus are assigned explicitly, the decision of which cpus go to
>>> which nodes is up to the user, and the user-configured mapping is stored
>>> in MachineState::possible_cpus, which is accessed via the
>>> possible_cpu_arch_ids() callback.
>>> Hence I don't see any reason to touch cpu indexes.
>>
>> Please see my reasoning above.
>>
>>>
>>>>
>>>>>>>
>>>>>>> In case 2.2 it calls get_default_cpu_node_id() ->
>>>>>>> x86_get_default_cpu_node_id(), which uses arch_id to calculate the
>>>>>>> numa node.
>>>>>>> But then the question is: does it have to use the APIC id, or could
>>>>>>> it infer the 'pkg_id' it's after from
>>>>>>> ms->possible_cpus->cpus[i].props data?
>>>>>>
>>>>>> Not sure if I got the question right. In this case, because the numa
>>>>>> information is not provided, all the cpus are assigned to only one
>>>>>> node. The apic id is used here to get the correct pkg_id.
>>>>>
>>>>> The apicid was composed from a socket/core/thread[/die] tuple, which
>>>>> is exactly what cpus[i].props holds.
>>>>>
>>>>> The question is whether we can compose just the pkg_id from the same
>>>>> data, without converting it to an apicid and then "reverse engineering"
>>>>> it back into the original data?
>>>>
>>>> Yes. It is possible.
>>>>
>>>>>
>>>>> Or a more direct question: is socket-id the same as pkg_id?
>>>>
>>>> Yes, socket_id and pkg_id are the same.
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> With that out of the way, the APIC ID will be used only during the
>>>>>>> board's init(), so the board could update possible_cpus with valid
>>>>>>> APIC IDs at the start of x86_cpus_init().
>>>>>>>
>>>>>>> ====
>>>>>>> In a nutshell, it would be much easier to do the following:
>>>>>>>
>>>>>>> 1. make x86_get_default_cpu_node_id() APIC ID independent, or, if
>>>>>>>    that's impossible, recompute APIC IDs there if the cpu type is
>>>>>>>    EPYC based (since the number of nodes is already known)
>>>>>>> 2. recompute APIC IDs in x86_cpus_init() if the cpu type is EPYC based
>>>>>>>
>>>>>>> This way one doesn't need to touch generic numa code or introduce an
>>>>>>> x86-specific init_apicid_fn() hook into generic code, and x86/EPYC
>>>>>>> nuances stay contained within x86 code only.
>>>>>>
>>>>>> I was already working in a similar direction in v4.
>>>>>> 1. We have already split the numa initialization in patch #12 (Split
>>>>>> the numa initialization). This way we know exactly how many numa
>>>>>> nodes there are beforehand.
>>>>>
>>>>> I suggest dropping that patch. It's the one that touches generic numa
>>>>> code and adds more legacy-based extensions like cpu_indexes, which I'd
>>>>> like to get rid of to begin with, so that only -numa cpu is left.
>>>>>
>>>>> I think it's not necessary to touch the numa code at all for apicid
>>>>> generation purposes, as I tried to explain above. We should be able to
>>>>> keep this an x86-only business.
>>>>
>>>> This is going to be difficult without touching the generic numa code,
>>>> i.e. patch #12 (Split the numa initialization).
>>>
>>> Looking at the current code, I don't see why one would touch the numa
>>> code. Care to explain in more detail why you'd have to touch it?
>>
>> Please see the reasoning above.
>>>
>>>>>> 2. Planning to remove init_apicid_fn
>>>>>> 3. Insert the handlers inside X86CPUDefinition.
>>>>> what handlers do you mean?
>>>>
>>>> The apicid generation logic can be separated into 3 types of handlers:
>>>> x86_apicid_from_cpu_idx: generate the apicid from the cpu index.
>>>> x86_topo_ids_from_apicid: generate topo ids from the apic id.
>>>> x86_apicid_from_topo_ids: generate the apicid from topo ids.
>>>>
>>>> We should be able to generate one id from the other (see topology.h).
>>>>
>>>> X86CPUDefinition will have the handlers specific to each model, like
>>>> the way we have features now. The above 3 handlers will be used as the
>>>> default handlers.
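
A hypothetical sketch of that handler-table idea: function pointers chosen
per CPU model, defaulting to the generic helper. All names, field layouts,
and bit widths here are illustrative, not QEMU's actual definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Signature for the index -> apicid handler (illustrative). */
typedef uint32_t (*apicid_from_idx_fn)(unsigned nr_cores,
                                       unsigned nr_threads,
                                       unsigned cpu_index);

/* Hypothetical per-model handler table, in the spirit of adding the
 * handlers to X86CPUDefinition. */
typedef struct {
    const char *name;
    apicid_from_idx_fn apicid_from_cpu_idx;
} X86CPUModelHandlers;

/* Default (non-EPYC) encoding: contiguous pkg->core->thread bit fields.
 * Bit widths are hardcoded here for the sketch (up to 4 cores, 2 threads);
 * real code sizes them from the topology. */
static uint32_t apicid_from_cpu_idx(unsigned nr_cores, unsigned nr_threads,
                                    unsigned cpu_index)
{
    const unsigned core_bits = 2, thread_bits = 1;
    unsigned thread = cpu_index % nr_threads;
    unsigned core = (cpu_index / nr_threads) % nr_cores;
    unsigned pkg = cpu_index / (nr_threads * nr_cores);

    return (pkg << (core_bits + thread_bits)) | (core << thread_bits)
           | thread;
}

static const X86CPUModelHandlers models[] = {
    { "qemu64", apicid_from_cpu_idx },
    /* EPYC would point at an x86_apicid_from_cpu_idx_epyc variant here. */
    { "EPYC",   apicid_from_cpu_idx },
};
```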
>>>
>>> it probably shouldn't be a part of X86CPUDefinition,
>>> as it's the machine's responsibility to generate and set the APIC ID.
>>>
>>> What you are doing with these topo functions in this version
>>> looks more than enough to me.
>>
>> These are exactly the same topo functions. I'm only making these
>> functions the handlers inside X86CPUDefinition.
>>
>>>
>>>> The EPYC model will have its corresponding handlers.
>>>>
>>>> x86_apicid_from_cpu_idx_epyc
>>>> x86_topo_ids_from_apicid_epyc
>>>> x86_apicid_from_topo_ids_epyc.
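
To illustrate one of the EPYC variants, here is a sketch of the decode
direction, the inverse of the encode, using the fixed bit positions from the
PPR layout quoted earlier (SMT case). The real x86_topo_ids_from_apicid_epyc
derives the field widths from the configured topology rather than hardcoding
them:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative topo-id struct for the EPYC decode sketch. */
typedef struct {
    unsigned pkg_id, node_id, ccx_id, core_id, smt_id;
} EpycTopoIDs;

/* Decode an APIC ID back into topo ids using the fixed PPR bit
 * positions (SMT enabled): ApicId[6]=socket, [5:4]=node, [3]=CCX,
 * [2:1]=core, [0]=thread. */
static void topo_ids_from_apicid_epyc(uint32_t apicid, EpycTopoIDs *topo)
{
    topo->smt_id  = apicid & 0x1;          /* ApicId[0]   */
    topo->core_id = (apicid >> 1) & 0x3;   /* ApicId[2:1] */
    topo->ccx_id  = (apicid >> 3) & 0x1;   /* ApicId[3]   */
    topo->node_id = (apicid >> 4) & 0x3;   /* ApicId[5:4] */
    topo->pkg_id  = apicid >> 6;           /* ApicId[6]   */
}
```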
>>>
>>> The CPU might use callbacks, but does it have to?
>>> I see cpu_x86_cpuid() uses these functions to decode the apic_id back to
>>> topo info and then compose various leaves based on it.
>>> Within the CPU code I'd just use
>>>     if (i_am_epyc)
>>>         x86_topo_ids_from_apicid_epyc()
>>>     else
>>>         x86_topo_ids_from_apicid()
>>> It's easier to read, and one doesn't have to chase an indirection chain
>>> to figure out what code is called.
>>
>> Eduardo already commented on this idea. Anything specific to cpu models
>> should be part of the X86CPUDefinition. We should not compare against a
>> specific model here; that does not scale. We achieve this by loading the
>> model definition (similar to what we do in x86_cpu_load_model).
>
> ok
>
>>
>>>
>>>>>> 4. The EPYC model will have its own apic id handlers. Everything else
>>>>>> will be initialized with the default handlers (the current ones).
>>>>>> 5. The function pc_possible_cpu_arch_ids will load the model
>>>>>> definition and initialize the PCMachineState data structure with the
>>>>>> model-specific handlers.
>>>>> I'm not sure what you mean here.
>>>>
>>>> PCMachineState will have the function pointers to the above handlers.
>>>> I was going to load the correct handler based on the model type.
>>>
>>> It could be done like this, but considering that within the machine we
>>> need to calculate the apic_id only once, the same 'if' trick would be
>>> simpler:
>>>
>>> x86_cpus_init() {
>>>
>>>     if (cpu == epyc) {
>>>         make_epyc_apic_ids(mc->possible_cpu_arch_ids(ms))
>>>     }
>>
>> Once again, this does not scale. Please see my response above.
>>
>>>
>>>     // go on with creating cpus ...
>>> }
>>>
>>>>>> Does that sound similar to what you are thinking? Thoughts?
>>>>> If you have something to share and can push it to github,
>>>>> I can take a look at whether it has design issues, to spare you a
>>>>> round trip on the list.
>>>>> (It won't be a proper review, but at least I can help pinpoint the
>>>>> most problematic parts.)
>>>>>
>>>> My code for the current approach is more or less ready (yet to be
>>>> tested). I can send it as v3.1 if you want to look, or we can wait for
>>>> our discussion to settle and I will post it afterwards.
>>> OK, let's wait till we finish this discussion.
>>
>> I can post my draft patch to give you a better idea of what I am talking
>> about. Let me know.
>>
>>>
>>>> There is one more problem we need to address. I was going to address
>>>> it later, in v4 or v5.
>>>>
>>>> This works
>>>> -numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7
>>>>
>>>> This does not work
>>>> -numa node,nodeid=0,cpus=0-5 -numa node,nodeid=1,cpus=6-7
>>> Is it supposed to work (i.e. can real hardware do such a topology)?
>>
>> Hardware does not support this configuration. That is why I did not
>> think it was serious enough to fix right now.
>>
>>>
>>>> This requires the generic code to pass the node information to the x86
>>>> code, which requires some handler changes. I was thinking my code would
>>>> simplify the changes needed to address this issue.
>>>
>>> Without more information, it's hard to comment on the issue and whether
>>> the extra complexity of callbacks is justified.
>>>
>>> There could be 2 ways here: add the fixes to this series so we can see
>>> the reason, or keep this series simple to solve the apic_id problem
>>> only, and then send a second series on top of it that solves the other
>>> issue.
>>>
>>> Considering that this series is already big/complicated enough,
>>> personally I'd go for the 2nd option, as it's easier to describe what
>>> the patches are doing and easier to review => should result in reaching
>>> consensus and merging faster.
>>> [...]
>>> [...]
>>>
>>
>