On Mon, Nov 12, 2018 at 10:15:46PM +0000, Elliott, Robert (Persistent Memory) wrote:
> >
> > -----Original Message-----
> > From: Daniel Jordan <[email protected]>
> > Sent: Monday, November 12, 2018 11:54 AM
> > To: Elliott, Robert (Persistent Memory) <[email protected]>
> > Cc: Daniel Jordan <[email protected]>; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; [email protected];
> > [email protected]
> > Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > initialization within each node
> >
> > On Sat, Nov 10, 2018 at 03:48:14AM +0000, Elliott, Robert (Persistent
> > Memory) wrote:
> > > > -----Original Message-----
> > > > From: [email protected] <linux-kernel-
> > > > [email protected]> On Behalf Of Daniel Jordan
> > > > Sent: Monday, November 05, 2018 10:56 AM
> > > > Subject: [RFC PATCH v4 11/13] mm: parallelize deferred struct page
> > > > initialization within each node
> > > >
> > > ...
> > > > In testing, a reasonable value turned out to be about a quarter of the
> > > > CPUs on the node.
> > > ...
> > > > +	/*
> > > > +	 * We'd like to know the memory bandwidth of the chip to calculate the
> > > > +	 * most efficient number of threads to start, but we can't.
> > > > +	 * In testing, a good value for a variety of systems was a quarter of the CPUs on the node.
> > > > +	 */
> > > > +	nr_node_cpus = DIV_ROUND_UP(cpumask_weight(cpumask), 4);
> > >
> > >
> > > You might want to base that calculation on and limit the threads to
> > > physical cores, not hyperthreaded cores.
> >
> > Why?  Hyperthreads can be beneficial when waiting on memory.  That said, I
> > don't have data that shows that in this case.
>
> I think that's only if there are some register-based calculations to do while
> waiting. If both threads are just doing memory accesses, they'll both stall, and
> there doesn't seem to be any benefit in having two contexts generate the IOs
> rather than one (at least on the systems I've used). I think it takes longer
> to switch contexts than to just turnaround the next IO.

(Sorry for the delay, Plumbers is over now...)

I guess we're both just waving our hands without data.  I've only got x86, so
using a quarter of the CPUs rules out HT on my end.  Do you have a system that
you can test this on, where using a quarter of the CPUs will involve HT?

Thanks,
Daniel
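
To spell out the arithmetic behind "a quarter of the CPUs rules out HT", here
is a rough userspace sketch, not kernel code. Only DIV_ROUND_UP and the divisor
of 4 come from the patch; physical_cores(), needs_smt_siblings(), and the
assumption of a uniform number of hardware threads per core are hypothetical
and only there for illustration.

```c
/* Same round-up division that the kernel's DIV_ROUND_UP macro performs. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Thread count as in the quoted patch: a quarter of the node's logical CPUs. */
unsigned int quarter_of_cpus(unsigned int logical_cpus)
{
	return DIV_ROUND_UP(logical_cpus, 4);
}

/*
 * Hypothetical helper: number of physical cores, assuming every core
 * exposes the same number of hardware threads (an assumption).
 */
unsigned int physical_cores(unsigned int logical_cpus,
			    unsigned int threads_per_core)
{
	return logical_cpus / threads_per_core;
}

/*
 * Hypothetical helper: returns 1 if there are more threads than physical
 * cores, so at least two threads would have to share a core.
 */
int needs_smt_siblings(unsigned int logical_cpus,
		       unsigned int threads_per_core)
{
	return quarter_of_cpus(logical_cpus) >
	       physical_cores(logical_cpus, threads_per_core);
}
```

With two hardware threads per core, a node with 48 logical CPUs gets 12 threads
for 24 cores, so needs_smt_siblings(48, 2) is 0.  Under these assumptions the
count only exceeds the number of cores when there are more than four threads
per core, e.g. needs_smt_siblings(16, 8) is 1 (4 threads, 2 cores).  Whether
the threads actually land on separate cores is up to the scheduler, which this
sketch does not model.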

