On Tue, 12 Oct 2021 15:13:08 +0200 Andrew Jones <drjo...@redhat.com> wrote:
> On Tue, Oct 12, 2021 at 02:27:54PM +0200, Igor Mammedov wrote: > > On Tue, 12 Oct 2021 12:37:54 +0200 > > Andrew Jones <drjo...@redhat.com> wrote: > > > > > On Tue, Oct 12, 2021 at 11:40:16AM +0200, Igor Mammedov wrote: > > > > On Wed, 6 Oct 2021 18:22:08 +0800 > > > > Gavin Shan <gs...@redhat.com> wrote: > > > > > > > > > The following option is used to specify the distance map. It's > > > > > possible the option isn't provided by user. In this case, the > > > > > distance map isn't populated and exposed to platform. On the > > > > > other hand, the empty NUMA node, where no memory resides, is > > > > > allowed on ARM64 virt platform. For these empty NUMA nodes, > > > > > their corresponding device-tree nodes aren't populated, but > > > > > their NUMA IDs should be included in the "/distance-map" > > > > > device-tree node, so that kernel can probe them properly if > > > > > device-tree is used. > > > > > > > > > > -numa,dist,src=<numa_id>,dst=<numa_id>,val=<distance> > > > > > > > > > > So when user doesn't specify distance map, we need to generate > > > > > the default distance map, where the local and remote distances > > > > > are 10 and 20 separately. This adds an extra parameter to the > > > > > exiting complete_init_numa_distance() to generate the default > > > > > distance map for this case. > > > > > > > > > > Signed-off-by: Gavin Shan <gs...@redhat.com> > > > > > > > > > > > > how about error-ing out if distance map is required but > > > > not provided by user explicitly and asking user to fix > > > > command line? > > > > > > > > Reasoning behind this that defaults are hard to maintain > > > > and will require compat hacks and being raod blocks down > > > > the road. > > > > Approach I was taking with generic NUMA code, is deprecating > > > > defaults and replacing them with sanity checks, which bail > > > > out on incorrect configuration and ask user to correct command line. > > > > Hence I dislike approach taken in this patch. > > > > > > > > If you really wish to provide default, push it out of > > > > generic code into ARM specific one > > > > (then I won't oppose it that much (I think PPC does > > > > some magic like this)) > > > > Also behavior seems to be ARM specific so generic > > > > NUMA code isn't a place for it anyways > > > > > > The distance-map DT node and the default 10/20 distance-map values > > > aren't arch-specific. RISCV is using it too. > > > > > > I'm on the fence with this. I see erroring-out to require users > > > to provide explicit command lines as a good thing, but I also > > > see it as potentially an unnecessary burden for those that want > > > the default map anyway. The optional nature of the distance-map > > > node and the specification of the default map is here [1] > > > > > > [1] Linux source: Documentation/devicetree/bindings/numa.txt > > > > Looking at proposed linux patches [ https://lkml.org/lkml/2021/9/27/31 ], > > using optional distance table as source for numa-node-ids, > > looks like a hack around kernel's inability to fish them out > > from CPU &| PCI nodes (using those nodes as source should > > cover memory-less node use-case). > > > > I consider including optional node as a policy decision. > > So user shall include it explicitly on QEMU command line > > if necessary (that works just fine for x86), or guest OS > > can make up defaults on its own in absence of data. > > OK, so erroring-out on configs that must provide distance-maps, rather > than automatically generating them for all configs is better. > > > > > > So, my r-b stands for this patch, but I also wouldn't complain > > > about respinning it to error out instead. > > > > > I would complain about > > > moving the logic to Arm specific code, though, since RISCV would > > > then need to duplicate it. > > > > Instead of putting workaround in QEMU and then making them generic, > > I'd prefer to: > > 1. make QEMU to be able generate DT with memory-less nodes > > How? DT syntax doesn't allow this, because each node needs a unique > name which is derived from its base address, which an empty numa you are talking about memory@foo nodes, aren't you? > node doesn't have. Looking at Documentation/devicetree/bindings/numa.txt mem/cpu/pci nodes also contain numa-node-id attribute, so idea is to collect IDs from all present sources instead of abusing distance map. That would allow QEMU to skip memory@foo elements for memory-less nodes because they obviously do not exist and there is no way to describe them using 'memory' nodes. > > 2. fix guest to get numa-node-id from CPU/PCI nodes if > > memory node isn't present, > > I'm not sure that's possible with DT. If it is, then proposing it > upstream to Linux DT maintainers would be the next step. Added Rob to CC. > > > or use ACPI tables which can > > describe memory-less NUMA nodes if fixing how DT is > > parsed unfeasible. > > We use ACPI already for our guests, but we also generate a DT (which > edk2 consumes). We can't generate a valid DT when empty numa nodes does edk2 actually uses numa info from QEMU? > are put on the command line unless we follow a DT spec saying how > to do that. The current spec says we should have a distance-map > that contains those nodes. can you point out to the spec and place within it, pls? > Thanks, > drew >