Hi Guy,
After reading more of the code related to bug 6745357, I found that
there may be a better way to fix it.
In uts/i86pc/vm/vm_machdep.c, all of the "mnoderanges" related logic
assumes that the entries in the mnoderanges array are arranged in
ascending order of physical memory address, but nothing currently
enforces that assumption. A quick fix would be to add logic that
guarantees the ordering. Below is a small patch that does so by keeping
mnoderanges in ascending order when they are created in mnode_range_setup().
========================================================
diff -r fd335a2c3bc4 usr/src/uts/i86pc/vm/vm_machdep.c
--- a/usr/src/uts/i86pc/vm/vm_machdep.c Wed Mar 18 00:36:41 2009 +0800
+++ b/usr/src/uts/i86pc/vm/vm_machdep.c Wed Mar 18 12:17:57 2009 +0800
@@ -1250,10 +1250,26 @@
 mnode_range_setup(mnoderange_t *mnoderanges)
 {
         int mnode, mri;
+        int i, max_mnodes = 0;
+        int mnodes[MAX_MEM_NODES];
         for (mnode = 0; mnode < max_mem_nodes; mnode++) {
                 if (mem_node_config[mnode].exists == 0)
                         continue;
+                for (i = max_mnodes; i > 0; i--) {
+                        if (mem_node_config[mnode].physbase >
+                            mem_node_config[mnodes[i - 1]].physbase) {
+                                break;
+                        } else {
+                                mnodes[i] = mnodes[i - 1];
+                        }
+                }
+                mnodes[i] = mnode;
+                max_mnodes++;
+        }
+
+        for (i = 0; i < max_mnodes; i++) {
+                mnode = mnodes[i];
                 mri = nranges - 1;
========================================================
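In case the inline diff is hard to read (the mailer seems to have eaten
its leading whitespace), here is a small, self-contained user-level
sketch of the same insertion-sort idea. The struct and the sample values
are only stand-ins for mem_node_config; none of this is the actual
kernel code:
========================================================
#include <stdio.h>

#define MAX_MEM_NODES   8

/* Simplified stand-in for the kernel's mem_node_config[] entries. */
struct mem_node {
        int             exists;         /* node is populated */
        unsigned long   physbase;       /* first PFN of the node */
};

/* Made-up sample data: node 0 starts above node 1 (the problem case). */
static struct mem_node mem_node_config[MAX_MEM_NODES] = {
        { 1, 0x80000 },
        { 1, 0x00000 },
        { 0, 0 },               /* unpopulated socket, skipped */
        { 1, 0x100000 },
};

int
main(void)
{
        int mnodes[MAX_MEM_NODES];      /* node ids sorted by physbase */
        int max_mnodes = 0;
        int mnode, i;

        /*
         * Insertion sort: slot each existing node into mnodes[] so that
         * physbase is ascending, no matter in what order the nodes were
         * discovered or numbered.
         */
        for (mnode = 0; mnode < MAX_MEM_NODES; mnode++) {
                if (mem_node_config[mnode].exists == 0)
                        continue;
                for (i = max_mnodes; i > 0; i--) {
                        if (mem_node_config[mnode].physbase >
                            mem_node_config[mnodes[i - 1]].physbase)
                                break;
                        mnodes[i] = mnodes[i - 1];
                }
                mnodes[i] = mnode;
                max_mnodes++;
        }

        /* mnoderanges would now be built by walking nodes in this order. */
        for (i = 0; i < max_mnodes; i++)
                printf("mnode %d physbase 0x%lx\n", mnodes[i],
                    mem_node_config[mnodes[i]].physbase);

        return (0);
}
========================================================
For the sample data above it prints the existing nodes in the order
1, 0, 3, i.e. sorted by physbase, which is the order in which the patch
lets mnode_range_setup() build the mnoderanges array.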
The patch above may work for current platforms, but it still has issues
supporting memory migration and hotplug. To really make things right,
the mem_node_config related logic in vm_machdep.c should be cleaned up.
I delayed sending out the patch by one day in order to find a machine to
verify it. On my test machine the patch works correctly and solves
6745357 and related bugs.
Any comments?
Guy <> wrote:
> Hello Gerry,
>
> About your former post :
>
>> The patch is still based on the assumption that a memory node with a
>> bigger node id will have a higher memory address. That assumption is
>> true for most current platforms, but things change fast and it may
>> become broken on future platforms.
>
> This patch proposal addresses a problem for a given set of servers.
> It does not attempt to address an RFE for future evolutions of ACPI
> and/or future NUMA architectures.
>
> nevada has been broken since build 88 on these platforms. s10 has been
> broken since u6 and will stay broken until u8 (at least).
> I do think it's worth fixing this particular problem, even if we know
> a problem could arise on future platforms or ACPI specs. The latter
> could be addressed by a separate RFE; it's less urgent since those
> cases do not exist yet.
>
>> 1) According to the ACPI spec, there's no guarantee that domain ids
>> will be contiguous starting from 0. On a NUMA platform with
>> unpopulated sockets, there may be domains present in SLIT/SRAT but
>> disabled/unused.
>
> I think this situation is already addressed by the current code
> (the "exists" property of the various objects).
>
>> According to my understanding, Gavin's patch should fix a design
>> defect in the x86 lgrp implementation.
>
> The fix referred to by Gavin in this thread doesn't work.
>
> According to Kit Chow, in a discussion we had by email with Jonathan
> Chew:
>
>>>> mnode 0 contains a higher physical address range than mnode 1. This
>>>> breaks various assumptions made by software that deal with physical
>>>> memory. Very likely the reason for the panic...
>>>>
>>>> Jonathan, is this ordering problem caused by what you had
>>>> previously described to me (something like srat index info
>>>> starting at 1 instead of 0 and you grabbed the info from index 2
>>>> first because 2%2 = 0)?
>>>
>>> Yes. If possible, I want to confirm what you suspect and make sure
>>> that we really find the root cause because there seems to be a bunch
>>> of issues associated with 6745357 and none of them seem to have been
>>> root caused (or at least they aren't documented very well).
>>>
>>> Is there some way to tell based on where the kernel died and
>>> examining the relevant data structures to determine what's going on
>>> and pinpoint the root cause?
>>>
>> mem_node_config and mnoderanges referenced below have a range of
>> 0x80000-f57f5 in slot 0. This is bad and needs to be addressed first
>> and foremost, even if there could be other issues. The one assertion
>> I saw, about the calculation of a pfn not matching its mnode, is very,
>> very likely because of the ordering problem.
>>
>> Kit
>
> which leads me to think that changing the code to support situations
> where mnodes are not in ascending order should be addressed in a
> separate RFE.
>
> Thank you
>
> Best regards
>
> Guy
Liu Jiang (Gerry)
OpenSolaris, OTC, SSG, Intel
