Hello, Yinghai.

On Mon, Oct 14, 2013 at 07:25:55PM -0700, Yinghai Lu wrote:
> > Wouldn't that amount be fairly static and restricted?  If you wanna
> > chunk memory init anyway, there's no reason to init more than
> > necessary until smp stage is reached.  The more you do early, the more
> > serialized you're, so wouldn't the goal naturally be initing the
> > minimum possible?
>
> Even we try to go minimum range instead of range that whole range on boot
> node,
> without parsing srat at first, the minimum range could be crossed the boundary
> of nodes.

I guess it depends on how much is the minimum we're talking about, but
let's say it isn't multiple orders of magnitude larger than the kernel
image.  That shouldn't be a problem then, no?

The thing is I don't really see how SRAT would help much.  I don't
know how the existing systems are configured but it's natural to
assume that hardware-wise per-stick removal will be supported, right?
There's no reason memory sticks of the first NUMA node can't be
hot-unplugged.  Likely we'll end up with an SRAT map which splits the
first node into two pieces - the first smaller part which can't be
removed because firmware and stuff depend on it, and the larger
trailing chunk which can be removed.  Allocating early non-migratable
stuff near the kernel image, which can't be moved without an
additional layer of indirection anyway, would be a fairly good choice
regardless, right?

Even if we parse SRAT early, we can't unconditionally make the kernel
allocate early stuff from node0.  We do not know what SRAT will look
like in future configurations.  If what the hotplug people are saying
is true, the first non-hotpluggable node being relatively small seems
actually quite likely.

I don't think we want to factor all those variables into very early
bootstrap stages, and it's not like we're talking about gigabytes of
memory.  E.g. bring up the first half or one gig and go from there.
That part of memory is highly unlikely to be unpluggable anyway.

> > * 4k page mappings.  It'd be nice to keep everything working for 4k
> >   but just following SRAT isn't enough.  What if the non-hotpluggable
> >   boot node doesn't stretch high enough and page table reaches down
> >   too far?  This won't be an optional behavior, so it is actually
> >   *likely* to happen on certain setups.
>
> no, do not assume 4k page. even we are using 1GB mapping, we will still have
> chance to have one node to take 512G RAM, that means we can have one 4k page
> on local node ram.

Sure, the kernel image can also be located such that the last page
spills over to the next node too.  No matter what we do, without an
extra layer of indirection, this can't be a complete solution.  Think
about the usual node configuration and where the kernel image is
usually loaded.  As long as the page table is relatively small,
allocating it there is highly unlikely to increase the chance of such
issues.

Again, it's all about benefit and cost.  Sure, parsing SRAT early will
definitely decrease the chance of such issues.  However, as long as
the size of the page table is small enough, just allocating it on top
of the kernel isn't significantly worse.  Also, following SRAT earlier
not only increases complexity in vulnerable stages of boot but also
carries higher risk with the existing and future configurations,
depending on what their SRAT looks like, if the new behavior is
applied unconditionally.  If we decide to make early SRAT usage
conditional, that's a *LOT* more conditional code than what's added by
bottom-up allocation.

> On x86 system with intel new cpus there is memory controller built-in.,
> could have hotplug modules (with socket and memory) and those hotplug modules
> will be serviced as one single point. Just nowadays like we have pcie
> card hotplugable.
>
> I don't see where is the " a clear performance trade-off".

Because kernel data structures have to be allocated off-node.

Thanks.

--
tejun