Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-05-01 Thread Fengguang Wu
On Wed, Apr 17, 2019 at 11:17:48AM +0200, Michal Hocko wrote: On Tue 16-04-19 12:19:21, Yang Shi wrote: On 4/16/19 12:47 AM, Michal Hocko wrote: [...] > Why cannot we simply demote in the proximity order? Why do you make > cpuless nodes so special? If other close nodes are vacant then just

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-30 Thread Fengguang Wu
On Thu, Apr 18, 2019 at 11:02:27AM +0200, Michal Hocko wrote: On Wed 17-04-19 13:43:44, Yang Shi wrote: [...] And, I'm wondering whether this optimization is also suitable to general NUMA balancing or not. If there are convincing numbers then this should be a preferable way to deal with it.

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-18 Thread Zi Yan
On 18 Apr 2019, at 15:23, Yang Shi wrote: > On 4/18/19 11:16 AM, Keith Busch wrote: >> On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote: >>> On 4/17/19 2:23 AM, Michal Hocko wrote: yes. This could be achieved by GFP_NOWAIT opportunistic allocation for the migration target.

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-18 Thread Yang Shi
On 4/18/19 11:16 AM, Keith Busch wrote: On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote: On 4/17/19 2:23 AM, Michal Hocko wrote: yes. This could be achieved by GFP_NOWAIT opportunistic allocation for the migration target. That should prevent from loops or artificial nodes

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-18 Thread Keith Busch
On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote: > On 4/17/19 2:23 AM, Michal Hocko wrote: > > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for > > the migration target. That should prevent from loops or artificial nodes > > exhausting quite naturally AFAICS. Maybe

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-18 Thread Yang Shi
On 4/17/19 10:51 AM, Michal Hocko wrote: On Wed 17-04-19 10:26:05, Yang Shi wrote: On 4/17/19 9:39 AM, Michal Hocko wrote: On Wed 17-04-19 09:37:39, Keith Busch wrote: On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: On Wed 17-04-19 09:23:46, Keith Busch wrote: On Wed, Apr

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-18 Thread Michal Hocko
On Wed 17-04-19 13:43:44, Yang Shi wrote: [...] > And, I'm wondering whether this optimization is also suitable to general > NUMA balancing or not. If there are convincing numbers then this should be a preferable way to deal with it. Please note that the number of promotions is not the only

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Yang Shi
I would also not touch the numa balancing logic at this stage and rather see how the current implementation behaves. I agree we would prefer to start from something simpler and see how it works. The "twice access" optimization is aimed to reduce the PMEM bandwidth burden since the

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Michal Hocko
On Wed 17-04-19 10:13:44, Dave Hansen wrote: > On 4/17/19 2:23 AM, Michal Hocko wrote: > >> 3. The demotion path can not have cycles > > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for > > the migration target. That should prevent from loops or artificial nodes > >

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Michal Hocko
On Wed 17-04-19 10:26:05, Yang Shi wrote: > > > On 4/17/19 9:39 AM, Michal Hocko wrote: > > On Wed 17-04-19 09:37:39, Keith Busch wrote: > > > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: > > > > On Wed 17-04-19 09:23:46, Keith Busch wrote: > > > > > On Wed, Apr 17, 2019 at

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Keith Busch
On Wed, Apr 17, 2019 at 10:26:05AM -0700, Yang Shi wrote: > On 4/17/19 9:39 AM, Michal Hocko wrote: > > On Wed 17-04-19 09:37:39, Keith Busch wrote: > > > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: > > > > On Wed 17-04-19 09:23:46, Keith Busch wrote: > > > > > On Wed, Apr 17,

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Yang Shi
On 4/17/19 9:39 AM, Michal Hocko wrote: On Wed 17-04-19 09:37:39, Keith Busch wrote: On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: On Wed 17-04-19 09:23:46, Keith Busch wrote: On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: On Tue 16-04-19 14:22:33, Dave

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Dave Hansen
On 4/17/19 2:23 AM, Michal Hocko wrote: >> 3. The demotion path can not have cycles > yes. This could be achieved by GFP_NOWAIT opportunistic allocation for > the migration target. That should prevent from loops or artificial nodes > exhausting quite naturally AFAICS. Maybe we will need some tricks

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Michal Hocko
On Wed 17-04-19 09:37:39, Keith Busch wrote: > On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: > > On Wed 17-04-19 09:23:46, Keith Busch wrote: > > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > > > > Keith

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Keith Busch
On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote: > On Wed 17-04-19 09:23:46, Keith Busch wrote: > > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > > > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > > > Keith Busch had a set of patches to let you specify the demotion

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Michal Hocko
On Wed 17-04-19 09:23:46, Keith Busch wrote: > On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > > Keith Busch had a set of patches to let you specify the demotion order > > > via sysfs for fun. The rules we came up with were: > >

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Keith Busch
On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote: > On Tue 16-04-19 14:22:33, Dave Hansen wrote: > > Keith Busch had a set of patches to let you specify the demotion order > > via sysfs for fun. The rules we came up with were: > > I am not a fan of any sysfs "fun" I'm hung up on the

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Keith Busch
On Tue, Apr 16, 2019 at 04:17:44PM -0700, Yang Shi wrote: > On 4/16/19 4:04 PM, Dave Hansen wrote: > > On 4/16/19 2:59 PM, Yang Shi wrote: > > > On 4/16/19 2:22 PM, Dave Hansen wrote: > > > > Keith Busch had a set of patches to let you specify the demotion order > > > > via sysfs for fun.  The

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Michal Hocko
On Tue 16-04-19 14:22:33, Dave Hansen wrote: > On 4/16/19 12:19 PM, Yang Shi wrote: > > would we prefer to try all the nodes in the fallback order to find the > > first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)? > > Once a page went to DRAM1, how would we tell that it

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-17 Thread Michal Hocko
On Tue 16-04-19 12:19:21, Yang Shi wrote: > > > On 4/16/19 12:47 AM, Michal Hocko wrote: [...] > > Why cannot we simply demote in the proximity order? Why do you make > > cpuless nodes so special? If other close nodes are vacant then just use > > them. > > We could. But, this raises another

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Yang Shi
Why cannot we start simple and build from there? In other words I do not think we really need anything like N_CPU_MEM at all. In this patchset N_CPU_MEM is used to tell us what nodes are cpuless nodes. They would be the preferred demotion target.  Of course, we could rely on firmware to

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Yang Shi
On 4/16/19 4:04 PM, Dave Hansen wrote: On 4/16/19 2:59 PM, Yang Shi wrote: On 4/16/19 2:22 PM, Dave Hansen wrote: Keith Busch had a set of patches to let you specify the demotion order via sysfs for fun.  The rules we came up with were: 1. Pages keep no history of where they have been 2.

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Dave Hansen
On 4/16/19 2:59 PM, Yang Shi wrote: > On 4/16/19 2:22 PM, Dave Hansen wrote: >> Keith Busch had a set of patches to let you specify the demotion order >> via sysfs for fun.  The rules we came up with were: >> 1. Pages keep no history of where they have been >> 2. Each node can only demote to one

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Yang Shi
On 4/16/19 2:22 PM, Dave Hansen wrote: On 4/16/19 12:19 PM, Yang Shi wrote: would we prefer to try all the nodes in the fallback order to find the first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)? Once a page went to DRAM1, how would we tell that it originated in

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Dave Hansen
On 4/16/19 12:19 PM, Yang Shi wrote: > would we prefer to try all the nodes in the fallback order to find the > first less contended one (i.e. DRAM0 -> PMEM0 -> DRAM1 -> PMEM1 -> Swap)? Once a page went to DRAM1, how would we tell that it originated in DRAM0 and is following the DRAM0 path rather

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Yang Shi
On 4/16/19 12:47 AM, Michal Hocko wrote: On Mon 15-04-19 17:09:07, Yang Shi wrote: On 4/12/19 1:47 AM, Michal Hocko wrote: On Thu 11-04-19 11:56:50, Yang Shi wrote: [...] Design == Basically, the approach is aimed to spread data from DRAM (closest to local CPU) down further to PMEM

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Michal Hocko
On Tue 16-04-19 08:46:56, Dave Hansen wrote: > On 4/16/19 7:39 AM, Michal Hocko wrote: > >> Strict binding also doesn't keep another app from moving the > >> memory. > > I would consider that a bug. > > A bug where, though? Certainly not in the kernel. Kernel should refrain from moving

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Zi Yan
On 16 Apr 2019, at 11:55, Dave Hansen wrote: > On 4/16/19 8:33 AM, Zi Yan wrote: >>> We have a reasonable argument that demotion is better than >>> swapping. So, we could say that even if a VMA has a strict NUMA >>> policy, demoting pages mapped there still beats swapping >>> them or

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Dave Hansen
On 4/16/19 8:33 AM, Zi Yan wrote: >> We have a reasonable argument that demotion is better than >> swapping. So, we could say that even if a VMA has a strict NUMA >> policy, demoting pages mapped there still beats swapping >> them or tossing the page cache. It's doing them a favor to >>

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Dave Hansen
On 4/16/19 7:39 AM, Michal Hocko wrote: >> Strict binding also doesn't keep another app from moving the >> memory. > I would consider that a bug. A bug where, though? Certainly not in the kernel. I'm just saying that if an app has an assumption that strict binding means that its memory can

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Zi Yan
On 16 Apr 2019, at 10:30, Dave Hansen wrote: > On 4/16/19 12:47 AM, Michal Hocko wrote: >> You definitely have to follow policy. You cannot demote to a node which >> is outside of the cpuset/mempolicy because you are breaking contract >> expected by the userspace. That implies doing a rmap walk.

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Michal Hocko
On Tue 16-04-19 07:30:20, Dave Hansen wrote: > On 4/16/19 12:47 AM, Michal Hocko wrote: > > You definitely have to follow policy. You cannot demote to a node which > > is outside of the cpuset/mempolicy because you are breaking contract > > expected by the userspace. That implies doing a rmap

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Dave Hansen
On 4/16/19 12:47 AM, Michal Hocko wrote: > You definitely have to follow policy. You cannot demote to a node which > is outside of the cpuset/mempolicy because you are breaking contract > expected by the userspace. That implies doing a rmap walk. What *is* the contract with userspace, anyway? :)

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-16 Thread Michal Hocko
On Mon 15-04-19 17:09:07, Yang Shi wrote: > > > On 4/12/19 1:47 AM, Michal Hocko wrote: > > On Thu 11-04-19 11:56:50, Yang Shi wrote: > > [...] > > > Design > > > == > > > Basically, the approach is aimed to spread data from DRAM (closest to > > > local > > > CPU) down further to PMEM and

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-15 Thread Yang Shi
On 4/12/19 1:47 AM, Michal Hocko wrote: On Thu 11-04-19 11:56:50, Yang Shi wrote: [...] Design == Basically, the approach is aimed to spread data from DRAM (closest to local CPU) down further to PMEM and disk (typically assume the lower tier storage is slower, larger and cheaper than the

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-12 Thread Michal Hocko
On Thu 11-04-19 11:56:50, Yang Shi wrote: [...] > Design > == > Basically, the approach is aimed to spread data from DRAM (closest to local > CPU) down further to PMEM and disk (typically assume the lower tier storage > is slower, larger and cheaper than the upper tier) by their hotness. The

Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-11 Thread Dave Hansen
This isn't so much another approach as it is some tweaks on top of what's there. Right? This set seems to present a bunch of ideas, like "promote if accessed twice". Seems like a good idea, but I'm a lot more interested in seeing data about it being a good idea. What workloads is it good for?

[v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-10 Thread Yang Shi
With Dave Hansen's patches merged into Linus's tree https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4 PMEM could be hot plugged as NUMA node now. But, how to use PMEM as NUMA node effectively and efficiently is still a