RE: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

2019-04-25 Thread Du, Fan


>-Original Message-
>From: Dan Williams [mailto:dan.j.willi...@intel.com]
>Sent: Thursday, April 25, 2019 11:43 PM
>To: Du, Fan 
>Cc: Michal Hocko ; a...@linux-foundation.org; Wu,
>Fengguang ; Hansen, Dave
>; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
>; linux...@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>memory system
>
>On Thu, Apr 25, 2019 at 1:05 AM Du, Fan  wrote:
>>
>>
>>
>> >-Original Message-
>> >From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
>> >Behalf Of Michal Hocko
>> >Sent: Thursday, April 25, 2019 3:54 PM
>> >To: Du, Fan 
>> >Cc: a...@linux-foundation.org; Wu, Fengguang
>;
>> >Williams, Dan J ; Hansen, Dave
>> >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
>> >; linux...@kvack.org;
>linux-kernel@vger.kernel.org
>> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>> >memory system
>> >
>> >On Thu 25-04-19 07:41:40, Du, Fan wrote:
>> >>
>> >>
>> >> >-Original Message-
>> >> >From: Michal Hocko [mailto:mho...@kernel.org]
>> >> >Sent: Thursday, April 25, 2019 2:37 PM
>> >> >To: Du, Fan 
>> >> >Cc: a...@linux-foundation.org; Wu, Fengguang
>> >;
>> >> >Williams, Dan J ; Hansen, Dave
>> >> >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
>> >> >; linux...@kvack.org;
>> >linux-kernel@vger.kernel.org
>> >> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>> >> >memory system
>> >> >
>> >> >On Thu 25-04-19 09:21:30, Fan Du wrote:
>> >> >[...]
>> >> >> However PMEM has different characteristics from DRAM,
>> >> >> the more reasonable or desirable fallback style would be:
>> >> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
>> >> >> When DRAM is exhausted, try PMEM then.
>> >> >
>> >> >Why and who does care? NUMA is fundamentally about memory nodes
>> >with
>> >> >different access characteristics so why is PMEM any special?
>> >>
>> >> Michal, thanks for your comments!
>> >>
>> >> The "different" lies in the local or remote access, usually the underlying
>> >> memory is the same type, i.e. DRAM.
>> >>
>> >> By "special", PMEM is usually in gigantic capacity than DRAM per dimm,
>> >> while with different read/write access latency than DRAM.
>> >
>> >You are describing NUMA in general here. Yes, access to different NUMA
>> >nodes has different read/write latency, but that doesn't make PMEM
>> >really special compared to regular DRAM.
>>
>> It is not the NUMA distance between the CPU and the PMEM node that makes
>> PMEM different from DRAM. The difference lies in the physical layer: the
>> access latency characteristics come from the media level.
>
>No, there is no such thing as a "PMEM node". I've pushed back on this
>broken concept in the past [1] [2]. Consider that PMEM could be as
>fast as DRAM for technologies like NVDIMM-N or in emulation
>environments. These attempts to look at persistence as an attribute of
>performance are entirely missing the point that the system can have
>multiple varied memory types and the platform firmware needs to
>enumerate these performance properties in the HMAT on ACPI platforms.
>Any scheme that only considers a binary DRAM and not-DRAM property is
>immediately invalidated the moment the OS needs to consider a 3rd or
>4th memory type, or a more varied connection topology.

Dan, thanks for your comments!

I have understood your point since your earlier posts on this topic.
Below is what is on my mind, speaking as a [standalone personal contributor]
only:
a. I fully recognize what HMAT is designed for.
b. I understand your point that the "type" distinction is temporary, and I
   think you are right about that.

A generic approach is indeed required. However, I want to elaborate on the
problem I am trying to solve for customers, not on how we or other people
might solve it one way or another.

Customers require that system memory be fully utilized, whether it is DRAM,
first-generation PMEM, or a future generation of PMEM that beats DRAM.
Customers also require explicit [coarse-grained] control over memory
allocation for different latency/bandwidth characteristics.
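
For illustration only (not part of this patch set): coarse-grained control of
this kind can already be expressed today by binding a mapping to the DRAM
nodes with mbind(2). A minimal standalone sketch, assuming the example
topology discussed in this thread (DRAM nodes 0 and 1, PMEM nodes 2 and 3):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>

int main(void)
{
	size_t len = 64UL << 20;                  /* 64 MiB scratch buffer */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	unsigned long dram_nodes = (1UL << 0) | (1UL << 1); /* nodes 0 and 1 */
	unsigned long maxnode = sizeof(dram_nodes) * 8;     /* bits in mask */

	/* Restrict the mapping to DRAM; fallback happens only within it. */
	if (syscall(SYS_mbind, buf, len, MPOL_BIND, &dram_nodes, maxnode, 0)) {
		perror("mbind");
		return 1;
	}
	printf("mapping bound to DRAM nodes 0-1\n");
	return 0;
}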

Maybe it's more worthwhile to think about what is essentially needed to
solve that problem, and make 

Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

2019-04-25 Thread Dan Williams
On Thu, Apr 25, 2019 at 1:05 AM Du, Fan  wrote:
>
>
>
> >-Original Message-
> >From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
> >Behalf Of Michal Hocko
> >Sent: Thursday, April 25, 2019 3:54 PM
> >To: Du, Fan 
> >Cc: a...@linux-foundation.org; Wu, Fengguang ;
> >Williams, Dan J ; Hansen, Dave
> >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
> >; linux...@kvack.org; linux-kernel@vger.kernel.org
> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
> >memory system
> >
> >On Thu 25-04-19 07:41:40, Du, Fan wrote:
> >>
> >>
> >> >-Original Message-
> >> >From: Michal Hocko [mailto:mho...@kernel.org]
> >> >Sent: Thursday, April 25, 2019 2:37 PM
> >> >To: Du, Fan 
> >> >Cc: a...@linux-foundation.org; Wu, Fengguang
> >;
> >> >Williams, Dan J ; Hansen, Dave
> >> >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
> >> >; linux...@kvack.org;
> >linux-kernel@vger.kernel.org
> >> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
> >> >memory system
> >> >
> >> >On Thu 25-04-19 09:21:30, Fan Du wrote:
> >> >[...]
> >> >> However PMEM has different characteristics from DRAM,
> >> >> the more reasonable or desirable fallback style would be:
> >> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
> >> >> When DRAM is exhausted, try PMEM then.
> >> >
> >> >Why and who does care? NUMA is fundamentally about memory nodes
> >with
> >> >different access characteristics so why is PMEM any special?
> >>
> >> Michal, thanks for your comments!
> >>
> >> The "different" lies in the local or remote access, usually the underlying
> >> memory is the same type, i.e. DRAM.
> >>
> >> By "special", PMEM is usually in gigantic capacity than DRAM per dimm,
> >> while with different read/write access latency than DRAM.
> >
> >You are describing NUMA in general here. Yes, access to different NUMA
> >nodes has different read/write latency, but that doesn't make PMEM
> >really special compared to regular DRAM.
>
> It is not the NUMA distance between the CPU and the PMEM node that makes
> PMEM different from DRAM. The difference lies in the physical layer: the
> access latency characteristics come from the media level.

No, there is no such thing as a "PMEM node". I've pushed back on this
broken concept in the past [1] [2]. Consider that PMEM could be as
fast as DRAM for technologies like NVDIMM-N or in emulation
environments. These attempts to look at persistence as an attribute of
performance are entirely missing the point that the system can have
multiple varied memory types and the platform firmware needs to
enumerate these performance properties in the HMAT on ACPI platforms.
Any scheme that only considers a binary DRAM and not-DRAM property is
immediately invalidated the moment the OS needs to consider a 3rd or
4th memory type, or a more varied connection topology.

[1]: 
https://lore.kernel.org/lkml/CAPcyv4heiUbZvP7Ewoy-Hy=-mprdjcjeusw+0rwdouhdjwe...@mail.gmail.com/

[2]: 
https://lore.kernel.org/lkml/capcyv4it1w7sddvbv24crcvhtlb3s1pvb5+sdm02uw4rbah...@mail.gmail.com/


RE: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

2019-04-25 Thread Du, Fan



>-Original Message-
>From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
>Behalf Of Michal Hocko
>Sent: Thursday, April 25, 2019 3:54 PM
>To: Du, Fan 
>Cc: a...@linux-foundation.org; Wu, Fengguang ;
>Williams, Dan J ; Hansen, Dave
>; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
>; linux...@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>memory system
>
>On Thu 25-04-19 07:41:40, Du, Fan wrote:
>>
>>
>> >-Original Message-
>> >From: Michal Hocko [mailto:mho...@kernel.org]
>> >Sent: Thursday, April 25, 2019 2:37 PM
>> >To: Du, Fan 
>> >Cc: a...@linux-foundation.org; Wu, Fengguang
>;
>> >Williams, Dan J ; Hansen, Dave
>> >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
>> >; linux...@kvack.org;
>linux-kernel@vger.kernel.org
>> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>> >memory system
>> >
>> >On Thu 25-04-19 09:21:30, Fan Du wrote:
>> >[...]
>> >> However PMEM has different characteristics from DRAM,
>> >> the more reasonable or desirable fallback style would be:
>> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
>> >> When DRAM is exhausted, try PMEM then.
>> >
>> >Why and who does care? NUMA is fundamentally about memory nodes
>with
>> >different access characteristics so why is PMEM any special?
>>
>> Michal, thanks for your comments!
>>
>> The "different" lies in the local or remote access, usually the underlying
>> memory is the same type, i.e. DRAM.
>>
>> By "special", PMEM is usually in gigantic capacity than DRAM per dimm,
>> while with different read/write access latency than DRAM.
>
>You are describing NUMA in general here. Yes, access to different NUMA
>nodes has different read/write latency, but that doesn't make PMEM
>really special compared to regular DRAM.

It is not the NUMA distance between the CPU and the PMEM node that makes
PMEM different from DRAM. The difference lies in the physical layer: the
access latency characteristics come from the media level.

>There are a few other people trying to
>work with PMEM as NUMA nodes and these kinds of arguments keep repeating.
>So far I haven't really heard much beyond hand waving.
>Please go and read through those discussions so that we do not have to go
>through the same set of arguments again.
>
>I absolutely do see and understand that people want to find a way to use
>their shiny NVDIMMs, but please step back and try to think in more
>general terms than "PMEM is special and we have to treat it that way".
>We currently have ways to use it as a DAX device and as a NUMA node, so
>let's focus on how to improve our NUMA handling so that we can get the
>maximum out of the HW, rather than make a PMEM NUMA node a special
>snowflake.
>
>Thank you.
>
>--
>Michal Hocko
>SUSE Labs



Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

2019-04-25 Thread Michal Hocko
On Thu 25-04-19 07:41:40, Du, Fan wrote:
> 
> 
> >-Original Message-
> >From: Michal Hocko [mailto:mho...@kernel.org]
> >Sent: Thursday, April 25, 2019 2:37 PM
> >To: Du, Fan 
> >Cc: a...@linux-foundation.org; Wu, Fengguang ;
> >Williams, Dan J ; Hansen, Dave
> >; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
> >; linux...@kvack.org; linux-kernel@vger.kernel.org
> >Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
> >memory system
> >
> >On Thu 25-04-19 09:21:30, Fan Du wrote:
> >[...]
> >> However PMEM has different characteristics from DRAM,
> >> the more reasonable or desirable fallback style would be:
> >> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
> >> When DRAM is exhausted, try PMEM then.
> >
> >Why and who does care? NUMA is fundamentally about memory nodes with
> >different access characteristics so why is PMEM any special?
> 
> Michal, thanks for your comments!
> 
> The "different" lies in the local or remote access, usually the underlying
> memory is the same type, i.e. DRAM.
> 
> By "special", PMEM is usually in gigantic capacity than DRAM per dimm, 
> while with different read/write access latency than DRAM.

You are describing NUMA in general here. Yes, access to different NUMA
nodes has different read/write latency, but that doesn't make PMEM
really special compared to regular DRAM. There are a few other people
trying to work with PMEM as NUMA nodes and these kinds of arguments keep
repeating. So far I haven't really heard much beyond hand waving. Please
go and read through those discussions so that we do not have to go
through the same set of arguments again.

I absolutely do see and understand that people want to find a way to use
their shiny NVDIMMs, but please step back and try to think in more
general terms than "PMEM is special and we have to treat it that way".
We currently have ways to use it as a DAX device and as a NUMA node, so
let's focus on how to improve our NUMA handling so that we can get the
maximum out of the HW, rather than make a PMEM NUMA node a special
snowflake.

Thank you.

-- 
Michal Hocko
SUSE Labs


RE: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

2019-04-25 Thread Du, Fan



>-Original Message-
>From: Michal Hocko [mailto:mho...@kernel.org]
>Sent: Thursday, April 25, 2019 2:37 PM
>To: Du, Fan 
>Cc: a...@linux-foundation.org; Wu, Fengguang ;
>Williams, Dan J ; Hansen, Dave
>; xishi.qiuxi...@alibaba-inc.com; Huang, Ying
>; linux...@kvack.org; linux-kernel@vger.kernel.org
>Subject: Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous
>memory system
>
>On Thu 25-04-19 09:21:30, Fan Du wrote:
>[...]
>> However PMEM has different characteristics from DRAM,
>> the more reasonable or desirable fallback style would be:
>> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
>> When DRAM is exhausted, try PMEM then.
>
>Why and who does care? NUMA is fundamentally about memory nodes with
>different access characteristics so why is PMEM any special?

Michal, thanks for your comments!

The "different" lies in the local or remote access, usually the underlying
memory is the same type, i.e. DRAM.

By "special", PMEM is usually in gigantic capacity than DRAM per dimm, 
while with different read/write access latency than DRAM. Iow PMEM
sits right under DRAM in the memory tier hierarchy.

This makes PMEM to be far memory, or second class memory.
So we give first class DRAM page to user, fallback to PMEM when
necessary.

The Cloud Service Provider can use DRAM + PMEM in their system,
Leveraging method [1] to keep hot page in DRAM and warm or cold
Page in PMEM, achieve optimal performance and reduce total cost
of ownership at the same time.

[1]:
https://github.com/fengguang/memory-optimizer
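
For illustration only (not part of this patch set), here is a minimal
standalone sketch of the demotion step such a policy needs: move a target
process's pages from the DRAM nodes (0 and 1 in this thread's example
topology) to the PMEM nodes (2 and 3) with the existing migrate_pages(2)
syscall. A real policy such as [1] would migrate only the pages it has
classified as cold.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	unsigned long from = (1UL << 0) | (1UL << 1); /* DRAM nodes 0 and 1 */
	unsigned long to   = (1UL << 2) | (1UL << 3); /* PMEM nodes 2 and 3 */
	unsigned long maxnode = sizeof(from) * 8;     /* bits in each mask */

	/* Returns the number of pages that could not be moved, or -1. */
	long left = syscall(SYS_migrate_pages, atoi(argv[1]), maxnode,
			    &from, &to);
	if (left < 0) {
		perror("migrate_pages");
		return 1;
	}
	printf("pages that could not be moved: %ld\n", left);
	return 0;
}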

>--
>Michal Hocko
>SUSE Labs


Re: [RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

2019-04-24 Thread Michal Hocko
On Thu 25-04-19 09:21:30, Fan Du wrote:
[...]
> However PMEM has different characteristics from DRAM,
> the more reasonable or desirable fallback style would be:
> DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
> When DRAM is exhausted, try PMEM then. 

Why and who does care? NUMA is fundamentally about memory nodes with
different access characteristics so why is PMEM any special?

-- 
Michal Hocko
SUSE Labs


[RFC PATCH 0/5] New fallback workflow for heterogeneous memory system

2019-04-24 Thread Fan Du
This is another approach to building the zonelist, based on patch #10 of
patchset [1].

For systems with heterogeneous DRAM and PMEM (persistent memory),

1) change ZONELIST_FALLBACK to fall back first to same-type nodes,
   then to the other types

2) add ZONELIST_FALLBACK_SAME_TYPE to fall back only to same-type nodes,
   to be explicitly selected by __GFP_SAME_NODE_TYPE.

For example, a 2S DRAM+PMEM system may have NUMA distances:
node   0   1   2   3 
  0:  10  21  17  28 
  1:  21  10  28  17 
  2:  17  28  10  28 
  3:  28  17  28  10

Nodes 0 and 1 are DRAM nodes; nodes 2 and 3 are PMEM nodes.

ZONELIST_FALLBACK
=================
The current fallback zonelists are built on NUMA distance only, which
means a page allocation request from node 0 will iterate nodes in the
order: DRAM node 0 -> PMEM node 2 -> DRAM node 1 -> PMEM node 3.

However PMEM has different characteristics from DRAM,
the more reasonable or desirable fallback style would be:
DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3.
When DRAM is exhausted, try PMEM then. 
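
To make the two orderings concrete, here is a small standalone sketch
(illustration only, not kernel code from this series) that derives both
fallback orders for allocations from node 0 from the example distance table
above. Sorting candidate nodes by distance alone reproduces the current
order; grouping nodes of the allocating node's type first, each group still
sorted by distance, reproduces the proposed one.

/*
 * Standalone illustration (not kernel code): derive the node fallback order
 * for allocations from node 0, using the example 2S DRAM+PMEM distance table.
 * Distance-only ordering yields 0 -> 2 -> 1 -> 3; type-first ordering yields
 * 0 -> 1 -> 2 -> 3, matching the proposed ZONELIST_FALLBACK behaviour.
 */
#include <stdio.h>
#include <stdbool.h>

#define NR_NODES 4

static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

/* true = DRAM node, false = PMEM node */
static const bool is_dram[NR_NODES] = { true, true, false, false };

/* Does candidate @a beat candidate @b as the next fallback for @from? */
static bool better(int from, int a, int b, bool type_first)
{
	bool a_same = is_dram[a] == is_dram[from];
	bool b_same = is_dram[b] == is_dram[from];

	/* With type-first ordering, a same-type node always wins. */
	if (type_first && a_same != b_same)
		return a_same;
	return distance[from][a] < distance[from][b];
}

static void print_fallback(int from, bool type_first)
{
	bool used[NR_NODES] = { false };

	for (int i = 0; i < NR_NODES; i++) {
		int best = -1;

		for (int n = 0; n < NR_NODES; n++) {
			if (used[n])
				continue;
			if (best < 0 || better(from, n, best, type_first))
				best = n;
		}
		used[best] = true;
		printf("%s%d", i ? " -> " : "", best);
	}
	printf("\n");
}

int main(void)
{
	printf("distance-only fallback from node 0: ");
	print_fallback(0, false);
	printf("type-first fallback from node 0:    ");
	print_fallback(0, true);
	return 0;
}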

ZONELIST_FALLBACK_SAME_TYPE
===========================
Some workloads fit PMEM characteristics well, e.g. pages that are read
much more frequently than they are written; other workloads may prefer
DRAM only. In either case it does not matter whether the page comes from
the local node or a remote one.

Introduce __GFP_SAME_NODE_TYPE to request a page from a node of the same
type, so we get either DRAM (from node 0 or 1) or PMEM (from node 2 or 3).
It is a kind of extension of the no-fallback list, but restricted to the
same node type.
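
A minimal sketch of how a kernel caller might request such a page once this
series is applied; the helper name below is hypothetical and only illustrates
the intended use of the new flag.

#include <linux/gfp.h>

/*
 * Sketch only: assumes the __GFP_SAME_NODE_TYPE flag introduced by patch 5
 * of this series.  alloc_same_type_page() is a hypothetical helper, not an
 * existing kernel function.
 */
static struct page *alloc_same_type_page(int nid)
{
	/*
	 * Fall back only across nodes of the same type as @nid
	 * (DRAM -> DRAM or PMEM -> PMEM), never across types.
	 */
	return alloc_pages_node(nid, GFP_KERNEL | __GFP_SAME_NODE_TYPE, 0);
}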

This patchset is self-contained, and based on Linux 5.1-rc6.

[1]:
https://lkml.org/lkml/2018/12/26/138

Fan Du (5):
  acpi/numa: memorize NUMA node type from SRAT table
  mmzone: new pgdat flags for DRAM and PMEM
  x86,numa: update numa node type
  mm, page alloc: build fallback list on per node type basis
  mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list

 arch/x86/include/asm/numa.h |  2 ++
 arch/x86/mm/numa.c  |  3 +++
 drivers/acpi/numa.c |  5 
 include/linux/gfp.h |  7 ++
 include/linux/mmzone.h  | 35 
 mm/page_alloc.c | 57 -
 6 files changed, 93 insertions(+), 16 deletions(-)

-- 
1.8.3.1