[PATCH v2] vhost-vdpa: account iommu allocations

2023-12-26 Thread Pasha Tatashin
iommu allocations should be accounted in order to allow admins to
monitor and limit the amount of iommu memory.

Signed-off-by: Pasha Tatashin 
Acked-by: Michael S. Tsirkin 
---
 drivers/vhost/vdpa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Changelog:

v1:
This patch is spun off from the series:
https://lore.kernel.org/all/20231128204938.1453583-1-pasha.tatas...@soleen.com

v2:
- Synced with v6.7-rc7
- Added Acked-by from Michael S. Tsirkin.

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index da7ec77cdaff..a51c69c078d9 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -968,7 +968,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
r = ops->set_map(vdpa, asid, iotlb);
} else {
r = iommu_map(v->domain, iova, pa, size,
- perm_to_iommu_flags(perm), GFP_KERNEL);
+ perm_to_iommu_flags(perm),
+ GFP_KERNEL_ACCOUNT);
}
if (r) {
vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
-- 
2.43.0.472.g3155946c3a-goog
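For reference, GFP_KERNEL_ACCOUNT is GFP_KERNEL plus __GFP_ACCOUNT, which charges the
allocation to the calling task's memory cgroup so kmemcg limits apply. A minimal sketch
of the distinction, using a hypothetical helper that is not part of the patch:

#include <linux/gfp.h>
#include <linux/slab.h>

/*
 * Hypothetical helper, for illustration only: an allocation made with
 * GFP_KERNEL_ACCOUNT (== GFP_KERNEL | __GFP_ACCOUNT) is charged to the
 * current task's memory cgroup, so an admin can monitor and bound it;
 * a plain GFP_KERNEL allocation is not charged anywhere.
 */
static void *alloc_iotlb_metadata(size_t size, bool accounted)
{
	gfp_t gfp = accounted ? GFP_KERNEL_ACCOUNT : GFP_KERNEL;

	return kzalloc(size, gfp);
}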




[PATCH] vhost-vdpa: account iommu allocations

2023-11-30 Thread Pasha Tatashin
iommu allocations should be accounted in order to allow admins to
monitor and limit the amount of iommu memory.

Signed-off-by: Pasha Tatashin 
---
 drivers/vhost/vdpa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

This patch is spun off from the series:
https://lore.kernel.org/all/20231128204938.1453583-1-pasha.tatas...@soleen.com

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index da7ec77cdaff..a51c69c078d9 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -968,7 +968,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
r = ops->set_map(vdpa, asid, iotlb);
} else {
r = iommu_map(v->domain, iova, pa, size,
- perm_to_iommu_flags(perm), GFP_KERNEL);
+ perm_to_iommu_flags(perm),
+ GFP_KERNEL_ACCOUNT);
}
if (r) {
vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
-- 
2.43.0.rc2.451.g8631bc7472-goog




Re: [PATCH v1] virtio_pmem: populate numa information

2022-10-26 Thread Pasha Tatashin
On Mon, Oct 17, 2022 at 1:11 PM Michael Sammler  wrote:
>
> Compute the numa information for a virtio_pmem device from the memory
> range of the device. Previously, the target_node was always 0 since
> the ndr_desc.target_node field was never explicitly set. The code for
> computing the numa node is taken from cxl_pmem_region_probe in
> drivers/cxl/pmem.c.
>
> Signed-off-by: Michael Sammler 

This enables hot-plugging of virtio-pmem memory into the correct memory
nodes. It does not look like it affects FS_DAX.

Reviewed-by: Pasha Tatashin 

Thanks,
Pasha

> ---
>  drivers/nvdimm/virtio_pmem.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index 20da455d2ef6..a92eb172f0e7 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
>  static int virtio_pmem_probe(struct virtio_device *vdev)
>  {
> struct nd_region_desc ndr_desc = {};
> -   int nid = dev_to_node(&vdev->dev);
> struct nd_region *nd_region;
> struct virtio_pmem *vpmem;
> struct resource res;
> @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
>
> ndr_desc.res = &res;
> -   ndr_desc.numa_node = nid;
> +
> +   ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> +   ndr_desc.target_node = phys_to_target_node(res.start);
> +   if (ndr_desc.target_node == NUMA_NO_NODE) {
> +   ndr_desc.target_node = ndr_desc.numa_node;
> +   dev_dbg(&vdev->dev, "changing target node from %d to %d",
> +   NUMA_NO_NODE, ndr_desc.target_node);
> +   }
> +
> ndr_desc.flush = async_pmem_flush;
> ndr_desc.provider_data = vdev;
> set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags);
> --
> 2.38.0.413.g74048e4d9e-goog
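The node selection above, distilled into a standalone sketch (the helper name is
hypothetical; the calls mirror the hunk in virtio_pmem_probe() quoted above):

#include <linux/libnvdimm.h>
#include <linux/memory_hotplug.h>
#include <linux/numa.h>

/*
 * Sketch of the fallback order used by the patch: derive the onlining
 * node from the device's physical start address, use the firmware
 * target node when phys_to_target_node() knows the range, and fall
 * back to numa_node otherwise.
 */
static void vpmem_fill_numa_info(struct nd_region_desc *ndr_desc,
				 phys_addr_t start)
{
	ndr_desc->numa_node = memory_add_physaddr_to_nid(start);
	ndr_desc->target_node = phys_to_target_node(start);
	if (ndr_desc->target_node == NUMA_NO_NODE)
		ndr_desc->target_node = ndr_desc->numa_node;
}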



Re: [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init

2018-10-31 Thread Pasha Tatashin


On 10/31/18 12:05 PM, Alexander Duyck wrote:
> On Wed, 2018-10-31 at 15:40 +0000, Pasha Tatashin wrote:
>>
>> On 10/17/18 7:54 PM, Alexander Duyck wrote:
>>> This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
>>>
>>> This iterator will take care of making sure a given memory range provided
>>> is in fact contained within a zone. It takes care of all the bounds checking
>>> we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
>>> it should help to speed up the search a bit by iterating until the end of a
>>> range is greater than the start of the zone pfn range, and will exit
>>> completely if the start is beyond the end of the zone.
>>>
>>> This patch adds yet another iterator called
>>> for_each_free_mem_range_in_zone_from and then uses it to support
>>> initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
>>> By doing this we can greatly improve the cache locality of the pages while
>>> we do several loops over them in the init and freeing process.
>>>
>>> We are able to tighten the loops as a result since we only really need the
>>> checks for first_init_pfn in our first iteration and after that we can
>>> assume that all future values will be greater than this. So I have added a
>>> function called deferred_init_mem_pfn_range_in_zone that primes the
>>> iterators and if it fails we can just exit.
>>>
>>> On my x86_64 test system with 384GB of memory per node I saw a reduction in
>>> initialization time from 1.85s to 1.38s as a result of this patch.
>>>
>>> Signed-off-by: Alexander Duyck 
>>
>> Hi Alex,
>>
>> Could you please split this patch into two parts:
>>
>> 1. Add deferred_init_maxorder()
>> 2. Add memblock iterator?
>>
>> This would allow better bisecting in case of problems. Changing two
>> loops into deferred_init_maxorder(), while a good idea, is still
>> non-trivial and might lead to bugs.
>>
>> Thank you,
>> Pavel
> 
> I can do that, but I will need to flip the order. I will add the new
> iterator first and then deferred_init_maxorder. Otherwise the
> intermediate step ends up being too much throw-away code.

That sounds good.

Thank you,
Pavel

> 
> - Alex
> 

Re: [mm PATCH v4 3/6] mm: Use memblock/zone specific iterator for handling deferred page init

2018-10-31 Thread Pasha Tatashin


On 10/17/18 7:54 PM, Alexander Duyck wrote:
> This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
> 
> This iterator will take care of making sure a given memory range provided
> is in fact contained within a zone. It takes care of all the bounds checking
> we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
> it should help to speed up the search a bit by iterating until the end of a
> range is greater than the start of the zone pfn range, and will exit
> completely if the start is beyond the end of the zone.
> 
> This patch adds yet another iterator called
> for_each_free_mem_range_in_zone_from and then uses it to support
> initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
> By doing this we can greatly improve the cache locality of the pages while
> we do several loops over them in the init and freeing process.
> 
> We are able to tighten the loops as a result since we only really need the
> checks for first_init_pfn in our first iteration and after that we can
> assume that all future values will be greater than this. So I have added a
> function called deferred_init_mem_pfn_range_in_zone that primes the
> iterators and if it fails we can just exit.
> 
> On my x86_64 test system with 384GB of memory per node I saw a reduction in
> initialization time from 1.85s to 1.38s as a result of this patch.
> 
> Signed-off-by: Alexander Duyck 

Hi Alex,

Could you please split this patch into two parts:

1. Add deferred_init_maxorder()
2. Add memblock iterator?

This would allow better bisecting in case of problems. Changing two
loops into deferred_init_maxorder(), while a good idea, is still
non-trivial and might lead to bugs.

Thank you,
Pavel

> ---
>  include/linux/memblock.h |   58 +++
>  mm/memblock.c|   63 
>  mm/page_alloc.c  |  176 
> --
>  3 files changed, 242 insertions(+), 55 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a6aa76..2ddd1bafdd03 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -178,6 +178,25 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t 
> *out_start,
> p_start, p_end, p_nid))
>  
>  /**
> + * for_each_mem_range_from - iterate through memblock areas from type_a and 
> not
> + * included in type_b. Or just type_a if type_b is NULL.
> + * @i: u64 used as loop variable
> + * @type_a: ptr to memblock_type to iterate
> + * @type_b: ptr to memblock_type which excludes from the iteration
> + * @nid: node selector, %NUMA_NO_NODE for all nodes
> + * @flags: pick from blocks based on memory attributes
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + * @p_nid: ptr to int for nid of the range, can be %NULL
> + */
> +#define for_each_mem_range_from(i, type_a, type_b, nid, flags,   
> \
> +p_start, p_end, p_nid)   \
> + for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,\
> +  p_start, p_end, p_nid);\
> +  i != (u64)ULLONG_MAX;  \
> +  __next_mem_range(&i, nid, flags, type_a, type_b,   \
> +   p_start, p_end, p_nid))
> +/**
>   * for_each_mem_range_rev - reverse iterate through memblock areas from
>   * type_a and not included in type_b. Or just type_a if type_b is NULL.
>   * @i: u64 used as loop variable
> @@ -248,6 +267,45 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned 
> long *out_start_pfn,
>i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> +   unsigned long *out_spfn,
> +   unsigned long *out_epfn);
> +/**
> + * for_each_free_mem_range_in_zone - iterate through zone specific free
> + * memblock areas
> + * @i: u64 used as loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over free (memory && !reserved) areas of memblock in a specific
> + * zone. Available as soon as memblock is initialized.
> + */
> +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end) \
> + for (i = 0, \
> +  __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);\
> +  i != (u64)ULLONG_MAX;  \
> +  __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> +
> +/**
> + * 
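A minimal usage sketch of the iterator being added above, as described in the commit
message (the counting loop body is hypothetical, not taken from the patch):

static unsigned long __init count_free_pfns_in_zone(struct zone *zone)
{
	unsigned long spfn, epfn, nr_free = 0;
	u64 i;

	/*
	 * Each iteration yields one [spfn, epfn) range of free
	 * (memory && !reserved) memblock PFNs clipped to this zone.
	 */
	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn)
		nr_free += epfn - spfn;

	return nr_free;
}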

Re: [PATCH v4 3/5] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-09-21 Thread Pasha Tatashin


On 9/20/18 6:29 PM, Alexander Duyck wrote:
> The ZONE_DEVICE pages were being initialized in two locations. One was with
> the memory_hotplug lock held and another was outside of that lock. The
> problem with this is that it was nearly doubling the memory initialization
> time. Instead of doing this twice, once while holding a global lock and
> once without, I am opting to defer the initialization to the one outside of
> the lock. This allows us to avoid serializing the overhead for memory init
> and we can instead focus on per-node init times.
> 
> One issue I encountered is that devm_memremap_pages and
> hmm_devmem_pages_create were initializing only the pgmap field the same
> way. One wasn't initializing hmm_data, and the other was initializing it to
> a poison value. Since this is something that is exposed to the driver in
> the case of hmm I am opting for a third option and just initializing
> hmm_data to 0 since this is going to be exposed to unknown third party
> drivers.
> 
> Signed-off-by: Alexander Duyck 

> +void __ref memmap_init_zone_device(struct zone *zone,
> +unsigned long start_pfn,
> +unsigned long size,
> +struct dev_pagemap *pgmap)
> +{
> + unsigned long pfn, end_pfn = start_pfn + size;
> + struct pglist_data *pgdat = zone->zone_pgdat;
> + unsigned long zone_idx = zone_idx(zone);
> + unsigned long start = jiffies;
> + int nid = pgdat->node_id;
> +
> + if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
> + return;
> +
> + /*
> +  * The call to memmap_init_zone should have already taken care
> +  * of the pages reserved for the memmap, so we can just jump to
> +  * the end of that region and start processing the device pages.
> +  */
> + if (pgmap->altmap_valid) {
> + struct vmem_altmap *altmap = &pgmap->altmap;
> +
> + start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
> + size = end_pfn - start_pfn;
> + }
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> + struct page *page = pfn_to_page(pfn);
> +
> + __init_single_page(page, pfn, zone_idx, nid);
> +
> + /*
> +  * Mark page reserved as it will need to wait for onlining
> +  * phase for it to be fully associated with a zone.
> +  *
> +  * We can use the non-atomic __set_bit operation for setting
> +  * the flag as we are still initializing the pages.
> +  */
> + __SetPageReserved(page);
> +
> + /*
> +  * ZONE_DEVICE pages union ->lru with a ->pgmap back
> +  * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
> +  * page is ever freed or placed on a driver-private list.
> +  */
> + page->pgmap = pgmap;
> + page->hmm_data = 0;

__init_single_page()
  mm_zero_struct_page()

Takes care of zeroing, no need to do another store here.


Looks good otherwise.

Reviewed-by: Pavel Tatashin 
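The point above in sketch form: __init_single_page() begins with mm_zero_struct_page(),
so every field of the struct page, including the lru/pgmap/hmm_data union, is already
zero afterwards. The wrapper below is hypothetical and simplified; the real loop in the
patch does more than shown:

static void __meminit init_one_device_page(struct page *page, unsigned long pfn,
					   unsigned long zone_idx, int nid,
					   struct dev_pagemap *pgmap)
{
	__init_single_page(page, pfn, zone_idx, nid); /* zeroes the struct page first */
	__SetPageReserved(page);
	page->pgmap = pgmap;
	/* no explicit "page->hmm_data = 0" needed: the field is already zero */
}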



Re: [PATCH v2 3/4] mm/memory_hotplug: Simplify node_states_check_changes_online

2018-09-21 Thread Pasha Tatashin


On 9/21/18 9:26 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> While looking at node_states_check_changes_online, I stumbled
> upon some confusing things.
> 
> Right after entering the function, we find this:
> 
> if (N_MEMORY == N_NORMAL_MEMORY)
> zone_last = ZONE_MOVABLE;
> 
> This is wrong.
> N_MEMORY cannot really be equal to N_NORMAL_MEMORY.
> My guess is that this wanted to be something like:
> 
> if (N_NORMAL_MEMORY == N_HIGH_MEMORY)
> 
> to check if we have CONFIG_HIGHMEM.
> 
> Later on, in the CONFIG_HIGHMEM block, we have:
> 
> if (N_MEMORY == N_HIGH_MEMORY)
> zone_last = ZONE_MOVABLE;
> 
> Again, this is wrong, and will never be evaluated to true.
> 
> Besides removing these wrong if statements, I simplified
> the function a bit.
> 
> Signed-off-by: Oscar Salvador 
> Suggested-by: Pavel Tatashin 

Reviewed-by: Pavel Tatashin 
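For reference, the node state constants in question are defined roughly as below in
include/linux/nodemask.h (slightly abridged), which is why N_MEMORY can never compare
equal to N_NORMAL_MEMORY or N_HIGH_MEMORY:

enum node_states {
	N_POSSIBLE,		/* The node could become online at some point */
	N_ONLINE,		/* The node is online */
	N_NORMAL_MEMORY,	/* The node has regular memory */
#ifdef CONFIG_HIGHMEM
	N_HIGH_MEMORY,		/* The node has regular or high memory */
#else
	N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
	N_MEMORY,		/* The node has memory (regular, high, movable) */
	N_CPU,			/* The node has one or more cpus */
	NR_NODE_STATES
};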

Re: [PATCH 5/5] mm/memory_hotplug: Clean up node_states_check_changes_offline

2018-09-20 Thread Pasha Tatashin
On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> This patch, as the previous one, gets rid of the wrong if statements.
> While at it, I realized that the comments are sometimes very confusing,
> to say the least, and wrong.
> For example:
> 
> ---
> zone_last = ZONE_MOVABLE;
> 
> /*
>  * check whether node_states[N_HIGH_MEMORY] will be changed
>  * If we try to offline the last present @nr_pages from the node,
>  * we can determind we will need to clear the node from
>  * node_states[N_HIGH_MEMORY].
>  */
> 
> for (; zt <= zone_last; zt++)
>   present_pages += pgdat->node_zones[zt].present_pages;
> if (nr_pages >= present_pages)
>   arg->status_change_nid = zone_to_nid(zone);
> else
>   arg->status_change_nid = -1;
> ---
> 
> In case the node gets empty, it must be removed from N_MEMORY.
> We already check N_HIGH_MEMORY a bit above within the CONFIG_HIGHMEM
> ifdef code.
> Not to say that status_change_nid is for N_MEMORY, and not for
> N_HIGH_MEMORY.
> 
> So I re-wrote some of the comments to what I think is better.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  mm/memory_hotplug.c | 71 
> +
>  1 file changed, 28 insertions(+), 43 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index ab3c1de18c5d..15ecf3d7a554 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1485,51 +1485,36 @@ static void 
> node_states_check_changes_offline(unsigned long nr_pages,
>  {
>   struct pglist_data *pgdat = zone->zone_pgdat;
>   unsigned long present_pages = 0;
> - enum zone_type zt, zone_last = ZONE_NORMAL;
> + enum zone_type zt;
>  
>   /*
> -  * If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_NORMAL,
> -  * set zone_last to ZONE_NORMAL.
> -  *
> -  * If we don't have HIGHMEM nor movable node,
> -  * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> -  * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
> +  * Check whether node_states[N_NORMAL_MEMORY] will be changed.
> +  * If the memory to be offline is within the range
> +  * [0..ZONE_NORMAL], and it is the last present memory there,
> +  * the zones in that range will become empty after the offlining,
> +  * thus we can determine that we need to clear the node from
> +  * node_states[N_NORMAL_MEMORY].
>*/
> - if (N_MEMORY == N_NORMAL_MEMORY)
> - zone_last = ZONE_MOVABLE;
> -
> - /*
> -  * check whether node_states[N_NORMAL_MEMORY] will be changed.
> -  * If the memory to be offline is in a zone of 0...zone_last,
> -  * and it is the last present memory, 0...zone_last will
> -  * become empty after offline , thus we can determind we will
> -  * need to clear the node from node_states[N_NORMAL_MEMORY].
> -  */
> - for (zt = 0; zt <= zone_last; zt++)
> + for (zt = 0; zt <= ZONE_NORMAL; zt++)
>   present_pages += pgdat->node_zones[zt].present_pages;
> - if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
> + if (zone_idx(zone) <= ZONE_NORMAL && nr_pages >= present_pages)
>   arg->status_change_nid_normal = zone_to_nid(zone);
>   else
>   arg->status_change_nid_normal = -1;
>  
>  #ifdef CONFIG_HIGHMEM
>   /*
> -  * If we have movable node, node_states[N_HIGH_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_HIGHMEM,
> -  * set zone_last to ZONE_HIGHMEM.
> -  *
> -  * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_MOVABLE,
> -  * set zone_last to ZONE_MOVABLE.
> +  * node_states[N_HIGH_MEMORY] contains nodes which
> +  * have normal memory or high memory.
> +  * Here we add the present_pages belonging to ZONE_HIGHMEM.
> +  * If the zone is within the range of [0..ZONE_HIGHMEM), and
> +  * we determine that the zones in that range become empty,
> +  * we need to clear the node for N_HIGH_MEMORY.
>*/
> - zone_last = ZONE_HIGHMEM;
> - if (N_MEMORY == N_HIGH_MEMORY)
> - zone_last = ZONE_MOVABLE;
> + zt = ZONE_HIGHMEM;
> + present_pages += pgdat->node_zones[zt].present_pages;
>  
> - for (; zt <= zone_last; zt++)
> - present_pages += pgdat->node_zones[zt].present_pages;
> - if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
> + if (zone_idx(zone) <= zt && nr_pages >= present_pages)
>   arg->status_change_nid_high = zone_to_nid(zone);
>   else
>   arg->status_change_nid_high = -1;
> @@ -1542,18 +1527,18 @@ static void 
> node_states_check_changes_offline(unsigned long nr_pages,
>  #endif
>  
>   /*
> -  * node_states[N_HIGH_MEMORY] contains nodes which have 0...ZONE_MOVABLE
> +  * We have accounted the pages from [0..ZONE_NORMAL), and
> +  * in 

Re: [PATCH 4/5] mm/memory_hotplug: Simplify node_states_check_changes_online

2018-09-20 Thread Pasha Tatashin


On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> While looking at node_states_check_changes_online, I stumbled
> upon some confusing things.
> 
> Right after entering the function, we find this:
> 
> if (N_MEMORY == N_NORMAL_MEMORY)
> zone_last = ZONE_MOVABLE;
> 
> This is wrong.
> N_MEMORY cannot really be equal to N_NORMAL_MEMORY.
> My guess is that this wanted to be something like:
> 
> if (N_NORMAL_MEMORY == N_HIGH_MEMORY)
> 
> to check if we have CONFIG_HIGHMEM.
> 
> Later on, in the CONFIG_HIGHMEM block, we have:
> 
> if (N_MEMORY == N_HIGH_MEMORY)
> zone_last = ZONE_MOVABLE;
> 
> Again, this is wrong, and will never be evaluated to true.
> 
> Besides removing these wrong if statements, I simplified
> the function a bit.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  mm/memory_hotplug.c | 71 
> +
>  1 file changed, 23 insertions(+), 48 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 131c08106d54..ab3c1de18c5d 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -687,61 +687,36 @@ static void node_states_check_changes_online(unsigned 
> long nr_pages,
>   struct zone *zone, struct memory_notify *arg)
>  {
>   int nid = zone_to_nid(zone);
> - enum zone_type zone_last = ZONE_NORMAL;
>  
>   /*
> -  * If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_NORMAL,
> -  * set zone_last to ZONE_NORMAL.
> -  *
> -  * If we don't have HIGHMEM nor movable node,
> -  * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> -  * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
> +  * zone_for_pfn_range() can only return a zone within
> +  * (0..ZONE_NORMAL] or ZONE_MOVABLE.

But what if that changes, will this function need to change as well?

> +  * If the zone is within the range (0..ZONE_NORMAL],
> +  * we need to check if:
> +  * 1) We need to set the node for N_NORMAL_MEMORY
> +  * 2) On CONFIG_HIGHMEM systems, we need to also set
> +  *the node for N_HIGH_MEMORY.
> +  * 3) On !CONFIG_HIGHMEM, we can disregard N_HIGH_MEMORY,
> +  *as N_HIGH_MEMORY falls back to N_NORMAL_MEMORY.
>*/
> - if (N_MEMORY == N_NORMAL_MEMORY)
> - zone_last = ZONE_MOVABLE;
>  
> - /*
> -  * if the memory to be online is in a zone of 0...zone_last, and
> -  * the zones of 0...zone_last don't have memory before online, we will
> -  * need to set the node to node_states[N_NORMAL_MEMORY] after
> -  * the memory is online.
> -  */
> - if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
> - arg->status_change_nid_normal = nid;
> - else
> - arg->status_change_nid_normal = -1;
> -
> -#ifdef CONFIG_HIGHMEM
> - /*
> -  * If we have movable node, node_states[N_HIGH_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_HIGHMEM,
> -  * set zone_last to ZONE_HIGHMEM.
> -  *
> -  * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> -  * contains nodes which have zones of 0...ZONE_MOVABLE,
> -  * set zone_last to ZONE_MOVABLE.
> -  */
> - zone_last = ZONE_HIGHMEM;
> - if (N_MEMORY == N_HIGH_MEMORY)
> - zone_last = ZONE_MOVABLE;
> + if (zone_idx(zone) <= ZONE_NORMAL) {
> + if (!node_state(nid, N_NORMAL_MEMORY))
> + arg->status_change_nid_normal = nid;
> + else
> + arg->status_change_nid_normal = -1;
>  
> - if (zone_idx(zone) <= zone_last && !node_state(nid, N_HIGH_MEMORY))
> - arg->status_change_nid_high = nid;
> - else
> - arg->status_change_nid_high = -1;
> -#else
> - /*
> -  * When !CONFIG_HIGHMEM, N_HIGH_MEMORY equals N_NORMAL_MEMORY
> -  * so setting the node for N_NORMAL_MEMORY is enough.
> -  */
> - arg->status_change_nid_high = -1;
> -#endif
> + if (IS_ENABLED(CONFIG_HIGHMEM)) {
> + if (!node_state(nid, N_HIGH_MEMORY))
> + arg->status_change_nid_high = nid;

Shouldn't we have:
else
arg->status_change_nid_high = -1; ?

> + } else
> + arg->status_change_nid_high = -1;


I prefer to have braces in the else part as well when the if has braces.

> + }
>  
>   /*
> -  * if the node don't have memory befor online, we will need to
> -  * set the node to node_states[N_MEMORY] after the memory
> -  * is online.
> +  * if the node doesn't have memory before onlining it, we will need
> +  * to set the node to node_states[N_MEMORY] after the memory
> +  * gets onlined.
>*/
>   if (!node_state(nid, N_MEMORY))
>   arg->status_change_nid = nid;
> 

I think it is simpler to have 

Re: [PATCH 3/5] mm/memory_hotplug: Tidy up node_states_clear_node

2018-09-20 Thread Pasha Tatashin


On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> node_states_clear has the following if statements:
> 
> if ((N_MEMORY != N_NORMAL_MEMORY) &&
> (arg->status_change_nid_high >= 0))
> ...
> 
> if ((N_MEMORY != N_HIGH_MEMORY) &&
> (arg->status_change_nid >= 0))
> ...
> 
> N_MEMORY can never be equal to neither N_NORMAL_MEMORY nor
> N_HIGH_MEMORY.
> 
> Similar problem was found in [1].
> Since this is wrong, let us get rid of it.
> 
> [1] 
> https://patchwork.kernel.org/patch/10579155/
> 
> Signed-off-by: Oscar Salvador 

Reviewed-by: Pavel Tatashin 

> ---
>  mm/memory_hotplug.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index c2c7359bd0a7..131c08106d54 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1590,12 +1590,10 @@ static void node_states_clear_node(int node, struct 
> memory_notify *arg)
>   if (arg->status_change_nid_normal >= 0)
>   node_clear_state(node, N_NORMAL_MEMORY);
>  
> - if ((N_MEMORY != N_NORMAL_MEMORY) &&
> - (arg->status_change_nid_high >= 0))
> + if (arg->status_change_nid_high >= 0)
>   node_clear_state(node, N_HIGH_MEMORY);
>  
> - if ((N_MEMORY != N_HIGH_MEMORY) &&
> - (arg->status_change_nid >= 0))
> + if (arg->status_change_nid >= 0)
>   node_clear_state(node, N_MEMORY);
>  }
>  
> 

Re: [PATCH 2/5] mm/memory_hotplug: Avoid node_set/clear_state(N_HIGH_MEMORY) when !CONFIG_HIGHMEM

2018-09-20 Thread Pasha Tatashin


On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> Currently, when !CONFIG_HIGHMEM, status_change_nid_high is being set
> to status_change_nid_normal, but on such systems N_HIGH_MEMORY falls
> back to N_NORMAL_MEMORY.
> That means that if status_change_nid_normal is not -1,
> we will perform two calls to node_set_state for the same memory type.
> 
> Set status_change_nid_high to -1 for !CONFIG_HIGHMEM, so we skip the
> double call in node_states_set_node.
> 
> The same goes for node_clear_state.
> 
> Signed-off-by: Oscar Salvador 

Reviewed-by: Pavel Tatashin 

This is a rare case where I think comments are unnecessary as the code
is self-explanatory. So, I would remove the comments before:

> + arg->status_change_nid_high = -1;

Pavel

> ---
>  mm/memory_hotplug.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 63facfc57224..c2c7359bd0a7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -731,7 +731,11 @@ static void node_states_check_changes_online(unsigned 
> long nr_pages,
>   else
>   arg->status_change_nid_high = -1;
>  #else
> - arg->status_change_nid_high = arg->status_change_nid_normal;
> + /*
> +  * When !CONFIG_HIGHMEM, N_HIGH_MEMORY equals N_NORMAL_MEMORY
> +  * so setting the node for N_NORMAL_MEMORY is enough.
> +  */
> + arg->status_change_nid_high = -1;
>  #endif
>  
>   /*
> @@ -1555,7 +1559,11 @@ static void node_states_check_changes_offline(unsigned 
> long nr_pages,
>   else
>   arg->status_change_nid_high = -1;
>  #else
> - arg->status_change_nid_high = arg->status_change_nid_normal;
> + /*
> +  * When !CONFIG_HIGHMEM, N_HIGH_MEMORY equals N_NORMAL_MEMORY
> +  * so clearing the node for N_NORMAL_MEMORY is enough.
> +  */
> + arg->status_change_nid_high = -1;
>  #endif
>  
>   /*
> 


Re: [PATCH 1/5] mm/memory_hotplug: Spare unnecessary calls to node_set_state

2018-09-20 Thread Pasha Tatashin


On 9/19/18 6:08 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> In node_states_check_changes_online, we check if the node will
> have to be set for any of the N_*_MEMORY states after the pages
> have been onlined.
> 
> Later on, we perform the activation in node_states_set_node.
> Currently, in node_states_set_node we set the node to N_MEMORY
> unconditionally.
> This means that we call node_set_state for N_MEMORY every time
> pages go online, but we only need to do it if the node has not yet been
> set for N_MEMORY.
> 
> Fix this by checking status_change_nid.
> 
> Signed-off-by: Oscar Salvador 

Reviewed-by: Pavel Tatashin 

> ---
>  mm/memory_hotplug.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 38d94b703e9d..63facfc57224 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -753,7 +753,8 @@ static void node_states_set_node(int node, struct 
> memory_notify *arg)
>   if (arg->status_change_nid_high >= 0)
>   node_set_state(node, N_HIGH_MEMORY);
>  
> - node_set_state(node, N_MEMORY);
> + if (arg->status_change_nid >= 0)
> + node_set_state(node, N_MEMORY);
>  }
>  
>  static void __meminit resize_zone_range(struct zone *zone, unsigned long 
> start_pfn,
> 

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-12 Thread Pasha Tatashin


On 9/12/18 10:27 AM, Gerald Schaefer wrote:
> On Wed, 12 Sep 2018 15:39:33 +0200
> Michal Hocko  wrote:
> 
>> On Wed 12-09-18 15:03:56, Gerald Schaefer wrote:
>> [...]
>>> BTW, those sysfs attributes are world-readable, so anyone can trigger
>>> the panic by simply reading them, or just run lsmem (also available for
>>> x86 since util-linux 2.32). OK, you need a special not-memory-block-aligned
>>> mem= parameter and DEBUG_VM for poison check, but w/o DEBUG_VM you would
>>> still access uninitialized struct pages. This sounds very wrong, and I
>>> think it really should be fixed.  
>>
>> Ohh, absolutely. Nobody is questioning that. The thing is that the
>> code has been likely always broken. We just haven't noticed because
>> those uninitialized parts were zeroed previously. Now that the implicit
>> zeroing is gone it is just visible.
>>
>> All that I am arguing is that there are many places which assume
>> pageblocks to be fully initialized and plugging one place that blows up
>> at the time is just whack a mole. We need to address this much earlier.
>> E.g. by allowing only full pageblocks when adding a memory range.
> 
> Just to make sure we are talking about the same thing: when you say
> "pageblocks", do you mean the MAX_ORDER_NR_PAGES / pageblock_nr_pages
> unit of pages, or do you mean the memory (hotplug) block unit?

From the early discussion, it was about pageblock_nr_pages, not about
memory_block_size_bytes.

> 
> I do not see any issue here with MAX_ORDER_NR_PAGES / pageblock_nr_pages
> pageblocks, and if there was such an issue, of course you are right that
> this would affect many places. If there was such an issue, I would also
> assume that we would see the new page poison warning in many other places.
> 
> The bug that Mikhails patch would fix only affects code that operates
> on / iterates through memory (hotplug) blocks, and that does not happen
> in many places, only in the two functions that his patch fixes.

Just to be clear: memory is pageblock_nr_pages aligned, yet memory
blocks are larger and the panic is still triggered?

I ask because 3075M is not 128M aligned.

> 
> When you say "address this much earlier", do you mean changing the way
> that free_area_init_core()/memmap_init() initialize struct pages, i.e.
> have them not use zone->spanned_pages as limit, but rather align that
> up to the memory block (not pageblock) boundary?
> 

This was my initial proposal: fix memmap_init() and initialize struct
pages beyond the "end", and before the "start", to cover the whole
section. But I think Michal suggested (and he may correct me) rejecting
memory that is not section-aligned much earlier, so that anything that
does not align to the sparse section size is not added to the system at
all.

I think Michal's proposal would simplify and strengthen memory mapping
overall.

Pavel

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-10 Thread Pasha Tatashin


On 9/10/18 10:41 AM, Michal Hocko wrote:
> On Mon 10-09-18 14:32:16, Pavel Tatashin wrote:
>> On Mon, Sep 10, 2018 at 10:19 AM Michal Hocko  wrote:
>>>
>>> On Mon 10-09-18 14:11:45, Pavel Tatashin wrote:
 Hi Michal,

 It is tricky, but probably can be done. Either change
 memmap_init_zone() or its caller to also cover the ends and starts of
 unaligned sections to initialize and reserve pages.

 The same thing would also need to be done in deferred_init_memmap() to
 cover the deferred init case.
>>>
>>> Well, I am not sure TBH. I have to think about that much more. Maybe it
>>> would be much more simple to make sure that we will never add incomplete
>>> memblocks and simply refuse them during the discovery. At least for now.
>>
>> On x86 memblocks can be upto 2G on machines with over 64G of RAM.
> 
> sorry I meant pageblock_nr_pages rather than memblocks.

OK. This sounds reasonable, but to be honest I am not sure how to
achieve this yet; I need to think more about it. In theory, with the
sparse memory model it makes sense to enforce memory alignment to
section sizes, which sounds a lot safer.

> 
>> Also, memory size is way to easy too change via qemu arguments when VM
>> starts. If we simply disable unaligned trailing memblocks, I am sure
>> we would get tons of noise of missing memory.
>>
>> I think, adding check_hotplug_memory_range() would work to fix the
>> immediate problem. But, we do need to figure out  a better solution.
>>
>> memblock design is based on archaic assumption that hotplug units are
>> physical dimms. VMs and hypervisors changed all of that, and we can
>> have much finer hotplug requests on machines with huge DIMMs. Yet, we
>> do not want to pollute sysfs with millions of tiny memory devices. I
>> am not sure what a long term proper solution for this problem should
>> be, but I see that linux hotplug/hotremove subsystems must be
>> redesigned based on the new requirements.
> 
> Not an easy task though. Anyway, sparse memory modely is highly based on
> memory sections so it makes some sense to have hotplug section based as
> well. Memblocks as a higher logical unit on top of that is kinda hack.
> The userspace API has never been properly thought through I am afraid.

I agree memoryblock is a hack; it fails at both things it was
designed to do:

1. On bare metal you cannot free a physical DIMM of memory at
memoryblock granularity, because memory devices do not correspond to
physical DIMMs. Thus, if for some reason a particular DIMM must be
removed/replaced, memoryblock does not help us.

2. On machines with hypervisors it fails to provide adequate
granularity to add/remove memory.

We should define a new user interface where memory can be added/removed
at a finer granularity: sparse section size, but without a memory
device for each section. We should also provide optional access to a
legacy interface where memory devices are exported, but each is of
section size.

So, when the legacy interface is enabled, the current way would still work:

echo offline > /sys/devices/system/memory/memoryXXX/state

And the new interface would allow us to do something like this:

echo offline 256M > /sys/devices/system/node/nodeXXX/memory

With an optional start address for the offlined memory:
echo offline [start_pa] size > /sys/devices/system/node/nodeXXX/memory
where start_pa and size must be section size aligned (128M).

It would probably be a good discussion for the next MM Summit how to
solve the current memory hotplug interface limitations.

Pavel

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-10 Thread Pasha Tatashin
On Mon, Sep 10, 2018 at 10:19 AM Michal Hocko  wrote:
>
> On Mon 10-09-18 14:11:45, Pavel Tatashin wrote:
> > Hi Michal,
> >
> > It is tricky, but probably can be done. Either change
> > memmap_init_zone() or its caller to also cover the ends and starts of
> > unaligned sections to initialize and reserve pages.
> >
> > The same thing would also need to be done in deferred_init_memmap() to
> > cover the deferred init case.
>
> Well, I am not sure TBH. I have to think about that much more. Maybe it
> would be much more simple to make sure that we will never add incomplete
> memblocks and simply refuse them during the discovery. At least for now.

On x86, memblocks can be up to 2G on machines with over 64G of RAM.
Also, the memory size is way too easy to change via qemu arguments when
a VM starts. If we simply disable unaligned trailing memblocks, I am
sure we would get tons of noise about missing memory.

I think adding check_hotplug_memory_range() would work to fix the
immediate problem. But we do need to figure out a better solution.

The memblock design is based on the archaic assumption that hotplug
units are physical DIMMs. VMs and hypervisors changed all of that, and
we can have much finer hotplug requests on machines with huge DIMMs.
Yet we do not want to pollute sysfs with millions of tiny memory
devices. I am not sure what the proper long-term solution for this
problem should be, but I see that the Linux hotplug/hot-remove
subsystems must be redesigned based on the new requirements.

Pavel

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-10 Thread Pasha Tatashin
Hi Michal,

It is tricky, but it can probably be done. Either change
memmap_init_zone() or its caller to also cover the starts and ends of
unaligned sections, i.e. initialize and reserve those extra pages.

The same thing would also need to be done in deferred_init_memmap() to
cover the deferred init case.

For hotplugged memory we do not need to worry, as that one is always
section aligned.

Do you think this would be a better approach?

Thank you,
Pavel
On Mon, Sep 10, 2018 at 10:00 AM Michal Hocko  wrote:
>
> On Mon 10-09-18 13:46:45, Pavel Tatashin wrote:
> >
> >
> > On 9/10/18 9:17 AM, Michal Hocko wrote:
> > > [Cc Pavel]
> > >
> > > On Mon 10-09-18 14:35:27, Mikhail Zaslonko wrote:
> > >> If memory end is not aligned with the linux memory section boundary, such
> > >> a section is only partly initialized. This may lead to VM_BUG_ON due to
> > >> uninitialized struct pages access from is_mem_section_removable() or
> > >> test_pages_in_a_zone() function.
> > >>
> > >> Here is one of the panic examples:
> > >>  CONFIG_DEBUG_VM_PGFLAGS=y
> > >>  kernel parameter mem=3075M
> > >
> > > OK, so the last memory section is not full and we have a partial memory
> > > block right?
> > >
> > >>  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > >
> > > OK, this means that the struct page is not fully initialized. Do you
> > > have a specific place which has triggered this assert?
> > >
> > >>  [ cut here ]
> > >>  Call Trace:
> > >>  ([<0039b8a4>] is_mem_section_removable+0xcc/0x1c0)
> > >>   [<009558ba>] show_mem_removable+0xda/0xe0
> > >>   [<009325fc>] dev_attr_show+0x3c/0x80
> > >>   [<0047e7ea>] sysfs_kf_seq_show+0xda/0x160
> > >>   [<003fc4e0>] seq_read+0x208/0x4c8
> > >>   [<003cb80e>] __vfs_read+0x46/0x180
> > >>   [<003cb9ce>] vfs_read+0x86/0x148
> > >>   [<003cc06a>] ksys_read+0x62/0xc0
> > >>   [<00c001c0>] system_call+0xdc/0x2d8
> > >>
> > >> This fix checks if the page lies within the zone boundaries before
> > >> accessing the struct page data. The check is added to both functions.
> > >> Actually similar check has already been present in
> > >> is_pageblock_removable_nolock() function but only after the struct page
> > >> is accessed.
> > >>
> > >
> > > Well, I am afraid this is not the proper solution. We are relying on the
> > > full pageblock worth of initialized struct pages at many other place. We
> > > used to do that in the past because we have initialized the full
> > > section but this has been changed recently. Pavel, do you have any ideas
> > > how to deal with this partial mem sections now?
> >
> > We have:
> >
> > remove_memory()
> >   BUG_ON(check_hotplug_memory_range(start, size))
> >
> > That supposed to safely check for this condition: if [start, start +
> > size) not block size aligned (and we know block size is section
> > aligned), hot remove is not allowed. The problem is this check is late,
> > and only happens when invalid range has already passed through previous
> > checks.
> >
> > We could add check_hotplug_memory_range() to is_mem_section_removable():
> >
> > is_mem_section_removable(start_pfn, nr_pages)
> >  if (check_hotplug_memory_range(PFN_PHYS(start_pfn), PFN_PHYS(nr_pages)))
> >   return false;
> >
> > I think it should work.
>
> I do not think we want to sprinkle these tests over all pfn walkers. Can
> we simply initialize those uninitialized holes as well and make them
> reserved without handing them over to the page allocator? That would be
> much more robust approach IMHO.
> --
> Michal Hocko
> SUSE Labs
>

Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary

2018-09-10 Thread Pasha Tatashin


On 9/10/18 9:17 AM, Michal Hocko wrote:
> [Cc Pavel]
> 
> On Mon 10-09-18 14:35:27, Mikhail Zaslonko wrote:
>> If memory end is not aligned with the linux memory section boundary, such
>> a section is only partly initialized. This may lead to VM_BUG_ON due to
>> uninitialized struct pages access from is_mem_section_removable() or
>> test_pages_in_a_zone() function.
>>
>> Here is one of the panic examples:
>>  CONFIG_DEBUG_VM_PGFLAGS=y
>>  kernel parameter mem=3075M
> 
> OK, so the last memory section is not full and we have a partial memory
> block right?
> 
>>  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> 
> OK, this means that the struct page is not fully initialized. Do you
> have a specific place which has triggered this assert?
> 
>>  [ cut here ]
>>  Call Trace:
>>  ([<0039b8a4>] is_mem_section_removable+0xcc/0x1c0)
>>   [<009558ba>] show_mem_removable+0xda/0xe0
>>   [<009325fc>] dev_attr_show+0x3c/0x80
>>   [<0047e7ea>] sysfs_kf_seq_show+0xda/0x160
>>   [<003fc4e0>] seq_read+0x208/0x4c8
>>   [<003cb80e>] __vfs_read+0x46/0x180
>>   [<003cb9ce>] vfs_read+0x86/0x148
>>   [<003cc06a>] ksys_read+0x62/0xc0
>>   [<00c001c0>] system_call+0xdc/0x2d8
>>
>> This fix checks if the page lies within the zone boundaries before
>> accessing the struct page data. The check is added to both functions.
>> Actually similar check has already been present in
>> is_pageblock_removable_nolock() function but only after the struct page
>> is accessed.
>>
> 
> Well, I am afraid this is not the proper solution. We are relying on the
> full pageblock worth of initialized struct pages at many other place. We
> used to do that in the past because we have initialized the full
> section but this has been changed recently. Pavel, do you have any ideas
> how to deal with this partial mem sections now?

We have:

remove_memory()
BUG_ON(check_hotplug_memory_range(start, size))

That is supposed to safely check for this condition: if [start, start +
size) is not block size aligned (and we know the block size is section
aligned), hot remove is not allowed. The problem is that this check
comes late, and only happens after an invalid range has already passed
through the previous checks.

We could add check_hotplug_memory_range() to is_mem_section_removable():

is_mem_section_removable(start_pfn, nr_pages)
 if (check_hotplug_memory_range(PFN_PHYS(start_pfn), PFN_PHYS(nr_pages)))
  return false;

I think it should work.
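Spelled out a little more, this is all I mean (illustrative sketch;
check_hotplug_memory_range() already lives in mm/memory_hotplug.c, the
same file, if I remember the layout right):

	bool is_mem_section_removable(unsigned long start_pfn,
				      unsigned long nr_pages)
	{
		struct page *page = pfn_to_page(start_pfn);
		struct page *end_page = page + nr_pages;

		/* Reject ranges that hot-remove would refuse anyway, before
		 * touching any possibly uninitialized struct page. */
		if (check_hotplug_memory_range(PFN_PHYS(start_pfn),
					       PFN_PHYS(nr_pages)))
			return false;

		/* Check the starting page of each pageblock within the range */
		for (; page < end_page; page = next_active_pageblock(page)) {
			if (!is_pageblock_removable_nolock(page))
				return false;
			cond_resched();
		}

		/* All pageblocks in the range are likely hot-removable */
		return true;
	}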

Pavel

> 
>> Signed-off-by: Mikhail Zaslonko 
>> Reviewed-by: Gerald Schaefer 
>> Cc: 
>> ---
>>  mm/memory_hotplug.c | 20 +++-
>>  1 file changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 9eea6e809a4e..8e20e8fcc3b0 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1229,9 +1229,8 @@ static struct page *next_active_pageblock(struct page 
>> *page)
>>  return page + pageblock_nr_pages;
>>  }
>>  
>> -static bool is_pageblock_removable_nolock(struct page *page)
>> +static bool is_pageblock_removable_nolock(struct page *page, struct zone 
>> **zone)
>>  {
>> -struct zone *zone;
>>  unsigned long pfn;
>>  
>>  /*
>> @@ -1241,15 +1240,14 @@ static bool is_pageblock_removable_nolock(struct 
>> page *page)
>>   * We have to take care about the node as well. If the node is offline
>>   * its NODE_DATA will be NULL - see page_zone.
>>   */
>> -if (!node_online(page_to_nid(page)))
>> -return false;
>> -
>> -zone = page_zone(page);
>>  pfn = page_to_pfn(page);
>> -if (!zone_spans_pfn(zone, pfn))
>> +if (*zone && !zone_spans_pfn(*zone, pfn))
>>  return false;
>> +if (!node_online(page_to_nid(page)))
>> +return false;
>> +*zone = page_zone(page);
>>  
>> -return !has_unmovable_pages(zone, page, 0, MIGRATE_MOVABLE, true);
>> +return !has_unmovable_pages(*zone, page, 0, MIGRATE_MOVABLE, true);
>>  }
>>  
>>  /* Checks if this range of memory is likely to be hot-removable. */
>> @@ -1257,10 +1255,11 @@ bool is_mem_section_removable(unsigned long 
>> start_pfn, unsigned long nr_pages)
>>  {
>>  struct page *page = pfn_to_page(start_pfn);
>>  struct page *end_page = page + nr_pages;
>> +struct zone *zone = NULL;
>>  
>>  /* Check the starting page of each pageblock within the range */
>>  for (; page < end_page; page = next_active_pageblock(page)) {
>> -if (!is_pageblock_removable_nolock(page))
>> +if (!is_pageblock_removable_nolock(page, &zone))
>>  return false;
>>  cond_resched();
>>  }
>> @@ -1296,6 +1295,9 @@ int test_pages_in_a_zone(unsigned long start_pfn, 
>> unsigned long end_pfn,
>>  i++;
>>  if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
>>  continue;
>> +/* Check if we got 

Re: [PATCH] Revert "x86/tsc: Consolidate init code"

2018-09-10 Thread Pasha Tatashin
Hi Ville,


The failure is surprising, because the commit is tiny, and almost does
not change the code logic.

From looking through the commit, the only functional difference this
commit makes is:

static_branch_enable(&__use_tsc) was called unconditionally from
tsc_init(), but after the commit it is called there only when tsc_khz == 0.

I wonder if on the P3 static_branch_enable(&__use_tsc) fails to get
enabled early, when it is supposed to? But I would first try to make
that call unconditional again, see if this fixes the problem, and then
figure out why it was not enabled when it was supposed to be.

So, in tsc_init(void), first try to add this one line back:

cyc2ns_init_secondary_cpus();
-   static_branch_enable(&__use_tsc);


See if that fixes everything, and let's work from there.
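In diff form, the experiment is just this (sketch):

	--- a/arch/x86/kernel/tsc.c
	+++ b/arch/x86/kernel/tsc.c
	@@ void __init tsc_init(void)
	 	cyc2ns_init_secondary_cpus();
	+	static_branch_enable(&__use_tsc);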

Thank you,
Pavel

On 9/10/18 8:48 AM, Thomas Gleixner wrote:
> Ville,
> 
> On Mon, 10 Sep 2018, Ville Syrjala wrote:
> 
>> From: Ville Syrjälä 
>>
>> This reverts commit 608008a45798fe9e2aee04f99b5270ea57c1376f.
>>
>> It breaks wifi on my pentium 3 Fujitsu-Siemens Lifebook S6010
>> laptop. Scanning for APs doesn't seem to work most of the time,
>> and, even when it manages to find some APs it never manages to
>> authenticate successfully. dmesg is just littered with:
>> "wlan0: send auth to ... (try 1/3)
>>  wlan0: send auth to ... (try 2/3)
>>  wlan0: send auth to ... (try 3/3)
>>  wlan0: authentication with ... timed out"
> 
> I asked for that before and I really do not understand why you do not even
> make an attempt to report an issue first and allow the developers to work
> with you to figure out what exactly is the problem. All you do is to send
> an revert patch with a changelog which describes symptoms and probably
> breaks more than it cures. Not really helpful, really.
> 
> It's surely helpful to know that you bisected it to that commit and
> reverting it helps. Can you please provide more detailes information like
> dmesg of an good and a bad boot?
> 
> Thanks,
> 
>   tglx
> 
> 
> 

Re: [PATCH v2 1/2] mm: Move page struct poisoning to CONFIG_DEBUG_VM_PAGE_INIT_POISON

2018-09-06 Thread Pasha Tatashin


On 9/6/18 1:03 PM, Michal Hocko wrote:
> On Thu 06-09-18 08:41:52, Alexander Duyck wrote:
>> On Thu, Sep 6, 2018 at 8:13 AM Michal Hocko  wrote:
>>>
>>> On Thu 06-09-18 07:59:03, Dave Hansen wrote:
 On 09/05/2018 10:47 PM, Michal Hocko wrote:
> why do you have to keep DEBUG_VM enabled for workloads where the boot
> time matters so much that few seconds matter?

 There are a number of distributions that run with it enabled in the
 default build.  Fedora, for one.  We've basically assumed for a while
 that we have to live with it in production environments.

 So, where does leave us?  I think we either need a _generic_ debug
 option like:

   CONFIG_DEBUG_VM_SLOW_AS_HECK

 under which we can put this an other really slow VM debugging.  Or, we
 need some kind of boot-time parameter to trigger the extra checking
 instead of a new CONFIG option.
>>>
>>> I strongly suspect nobody will ever enable such a scary looking config
>>> TBH. Besides I am not sure what should go under that config option.
>>> Something that takes few cycles but it is called often or one time stuff
>>> that takes quite a long but less than aggregated overhead of the former?
>>>
>>> Just consider this particular case. It basically re-adds an overhead
>>> that has always been there before the struct page init optimization
>>> went it. The poisoning just returns it in a different form to catch
>>> potential left overs. And we would like to have as many people willing
>>> to running in debug mode to test for those paths because they are
>>> basically impossible to review by the code inspection. More importantnly
>>> the major overhead is boot time so my question still stands. Is this
>>> worth a separate config option almost nobody is going to enable?
>>>
>>> Enabling DEBUG_VM by Fedora and others serves us a very good testing
>>> coverage and I appreciate that because it has generated some useful bug
>>> reports. Those people are paying quite a lot of overhead in runtime
>>> which can aggregate over time is it so much to ask about one time boot
>>> overhead?
>>
>> The kind of boot time add-on I saw as a result of this was about 170
>> seconds, or 2 minutes and 50 seconds on a 12TB system.
> 
> Just curious. How long does it take to get from power on to even reaach
> boot loader on that machine... ;)
> 
>> I spent a
>> couple minutes wondering if I had built a bad kernel or not as I was
>> staring at a dead console the entire time after the grub prompt since
>> I hit this so early in the boot. That is the reason why I am so eager
>> to slice this off and make it something separate. I could easily see
>> this as something that would get in the way of other debugging that is
>> going on in a system.
> 
> But you would get the same overhead a kernel release ago when the
> memmap init optimization was merged. So you are basically back to what
> we used to have for years. Unless I misremember.

You remember this correctly:

2f47a91f4dab1905cdcfced9dfcaf3f5257e has the data for before vs. after
zeroing memory in the memblock allocator.

On SPARC with 15T we saved 55.4s, because SPARC has larger base pages
and thus fewer struct pages.

On x86 with 1T we saved 15.8s, which scales to 189.6s (15.8s x 12) for
the 12T machine Alexander is using, close to the 170s he is seeing; his
CPUs must just be faster.

Pavel

Re: [PATCH v2 1/2] mm: Move page struct poisoning to CONFIG_DEBUG_VM_PAGE_INIT_POISON

2018-09-06 Thread Pasha Tatashin


On 9/6/18 11:41 AM, Alexander Duyck wrote:
> On Thu, Sep 6, 2018 at 8:13 AM Michal Hocko  wrote:
>>
>> On Thu 06-09-18 07:59:03, Dave Hansen wrote:
>>> On 09/05/2018 10:47 PM, Michal Hocko wrote:
 why do you have to keep DEBUG_VM enabled for workloads where the boot
 time matters so much that few seconds matter?
>>>
>>> There are a number of distributions that run with it enabled in the
>>> default build.  Fedora, for one.  We've basically assumed for a while
>>> that we have to live with it in production environments.
>>>
>>> So, where does leave us?  I think we either need a _generic_ debug
>>> option like:
>>>
>>>   CONFIG_DEBUG_VM_SLOW_AS_HECK
>>>
>>> under which we can put this an other really slow VM debugging.  Or, we
>>> need some kind of boot-time parameter to trigger the extra checking
>>> instead of a new CONFIG option.
>>
>> I strongly suspect nobody will ever enable such a scary looking config
>> TBH. Besides I am not sure what should go under that config option.
>> Something that takes few cycles but it is called often or one time stuff
>> that takes quite a long but less than aggregated overhead of the former?
>>
>> Just consider this particular case. It basically re-adds an overhead
>> that has always been there before the struct page init optimization
>> went it. The poisoning just returns it in a different form to catch
>> potential left overs. And we would like to have as many people willing
>> to running in debug mode to test for those paths because they are
>> basically impossible to review by the code inspection. More importantnly
>> the major overhead is boot time so my question still stands. Is this
>> worth a separate config option almost nobody is going to enable?
>>
>> Enabling DEBUG_VM by Fedora and others serves us a very good testing
>> coverage and I appreciate that because it has generated some useful bug
>> reports. Those people are paying quite a lot of overhead in runtime
>> which can aggregate over time is it so much to ask about one time boot
>> overhead?
> 
> The kind of boot time add-on I saw as a result of this was about 170
> seconds, or 2 minutes and 50 seconds on a 12TB system. I spent a
> couple minutes wondering if I had built a bad kernel or not as I was
> staring at a dead console the entire time after the grub prompt since
> I hit this so early in the boot. That is the reason why I am so eager
> to slice this off and make it something separate. I could easily see
> this as something that would get in the way of other debugging that is
> going on in a system.
> 
> If we don't want to do a config option, then what about adding a
> kernel parameter to put a limit on how much memory we will initialize
> like this before we just start skipping it. We could put a default
> limit on it like 256GB and then once we cross that threshold we just
> don't bother poisoning any more memory. With that we would probably be
> able to at least cover most of the early memory init, and that value
> should cover most systems without getting into delays on the order of
> minutes.

I am OK with a boot parameter to optionally disable it when DEBUG_VM is
enabled. But I do not think it is a good idea to make that parameter
"smart". Basically: always poison memory with DEBUG_VM unless booted
with a parameter that says not to poison memory.
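For what it's worth, the opt-out itself would be trivial; something
along these lines (completely untested sketch, the parameter name is
made up):

	static bool want_page_init_poison = true;

	static int __init parse_page_init_poison(char *str)
	{
		/* "page_init_poison=off" (or =0/=n) skips the poisoning */
		return kstrtobool(str, &want_page_init_poison);
	}
	early_param("page_init_poison", parse_page_init_poison);

and page_init_poison() from this patch would then check
want_page_init_poison before doing the memset.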

CONFIG_DEBUG_VM is disabled on:

RedHat, Oracle Linux, CentOS, Ubuntu, Arch Linux, SUSE

Enabled on:

Fedora

Are there other distros where it is enabled? I think this could be
filed as a performance bug against the Fedora distro, and we can let
them decide what to do about it.

I do not want to make this feature less tested. Poisoning memory allowed
us to catch corner case bugs like these:

ab1e8d8960b68f54af42b6484b5950bd13a4054b
mm: don't allow deferred pages with NEED_PER_CPU_KM

e181ae0c5db9544de9c53239eb22bc012ce75033
mm: zero unavailable pages before memmap init

And several more that were fixed by other people.

For a very long time Linux relied on the assumption that boot memory is
zeroed, and I am sure we will continue to detect more bugs over time.

Thank you,
Pavel

> 
> - Alex
> 

Re: [PATCH v2 1/2] mm: Move page struct poisoning to CONFIG_DEBUG_VM_PAGE_INIT_POISON

2018-09-05 Thread Pasha Tatashin


On 9/5/18 5:29 PM, Alexander Duyck wrote:
> On Wed, Sep 5, 2018 at 2:22 PM Pasha Tatashin
>  wrote:
>>
>>
>>
>> On 9/5/18 5:13 PM, Alexander Duyck wrote:
>>> From: Alexander Duyck 
>>>
>>> On systems with a large amount of memory it can take a significant amount
>>> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
>>> value. I have seen it take over 2 minutes to initialize a system with
>>> over 12GB of RAM.
>>>
>>> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
>>> the boot time returned to something much more reasonable as the
>>> arch_add_memory call completed in milliseconds versus seconds. However in
>>> doing that I had to disable all of the other VM debugging on the system.
>>>
>>> Instead of keeping the value in CONFIG_DEBUG_VM I am adding a new CONFIG
>>> value called CONFIG_DEBUG_VM_PAGE_INIT_POISON that will control the page
>>> poisoning independent of the CONFIG_DEBUG_VM option.
>>>
>>> Signed-off-by: Alexander Duyck 
>>> ---
>>>  include/linux/page-flags.h |8 
>>>  lib/Kconfig.debug  |   14 ++
>>>  mm/memblock.c  |5 ++---
>>>  mm/sparse.c|4 +---
>>>  4 files changed, 25 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index 74bee8cecf4c..0e95ca63375a 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -13,6 +13,7 @@
>>>  #include 
>>>  #include 
>>>  #endif /* !__GENERATING_BOUNDS_H */
>>> +#include 
>>>
>>>  /*
>>>   * Various page->flags bits:
>>> @@ -162,6 +163,13 @@ static inline int PagePoisoned(const struct page *page)
>>>   return page->flags == PAGE_POISON_PATTERN;
>>>  }
>>>
>>> +static inline void page_init_poison(struct page *page, size_t size)
>>> +{
>>> +#ifdef CONFIG_DEBUG_VM_PAGE_INIT_POISON
>>> + memset(page, PAGE_POISON_PATTERN, size);
>>> +#endif
>>> +}
>>> +
>>>  /*
>>>   * Page flags policies wrt compound pages
>>>   *
>>> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
>>> index 613316724c6a..3b1277c52fed 100644
>>> --- a/lib/Kconfig.debug
>>> +++ b/lib/Kconfig.debug
>>> @@ -637,6 +637,20 @@ config DEBUG_VM_PGFLAGS
>>>
>>> If unsure, say N.
>>>
>>> +config DEBUG_VM_PAGE_INIT_POISON
>>> + bool "Enable early page metadata poisoning"
>>> + default y
>>> + depends on DEBUG_VM
>>> + help
>>> +   Seed the page metadata with a poison pattern to improve the
>>> +   likelihood of detecting attempts to access the page prior to
>>> +   initialization by the memory subsystem.
>>> +
>>> +   This initialization can result in a longer boot time for systems
>>> +   with a large amount of memory.
>>
>> What happens when DEBUG_VM_PGFLAGS = y and
>> DEBUG_VM_PAGE_INIT_POISON = n ?
>>
>> We are testing for pattern that was not set?
>>
>> I think DEBUG_VM_PAGE_INIT_POISON must depend on DEBUG_VM_PGFLAGS instead.
>>
>> Looks good otherwise.
>>
>> Thank you,
>> Pavel
> 
> The problem is that I then end up in the same situation I had in the
> last patch where you have to have DEBUG_VM_PGFLAGS on in order to do
> the seeding with poison.

OK

> 
> I can wrap the bit of code in PagePoisoned to just always return false
> if we didn't set the pattern. I figure there is value to be had for
> running DEBUG_VM_PGFLAGS regardless of the poison check, or
> DEBUG_VM_PAGE_INIT_POISON without the PGFLAGS check. That is why I
> wanted to leave them independent.

How about:

Remove "depends on DEBUG_VM", but make DEBUG_VM_PGFLAGS to depend on
both DEBUG_VM and DEBUG_VM_PAGE_INIT_POISON?

DEBUG_VM_PGFLAGS is already extremely slow, so having this extra
dependency is OK.
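I.e. something like this in lib/Kconfig.debug (only the depends line is
shown, the rest of the entry stays as is):

	config DEBUG_VM_PGFLAGS
		depends on DEBUG_VM && DEBUG_VM_PAGE_INIT_POISON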

Thank you,
Pavel

> 
> - Alex
> 

Re: [PATCH v2 1/2] mm: Move page struct poisoning to CONFIG_DEBUG_VM_PAGE_INIT_POISON

2018-09-05 Thread Pasha Tatashin


On 9/5/18 5:29 PM, Alexander Duyck wrote:
> On Wed, Sep 5, 2018 at 2:22 PM Pasha Tatashin
>  wrote:
>>
>>
>>
>> On 9/5/18 5:13 PM, Alexander Duyck wrote:
>>> From: Alexander Duyck 
>>>
>>> On systems with a large amount of memory it can take a significant amount
>>> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
>>> value. I have seen it take over 2 minutes to initialize a system with
>>> over 12GB of RAM.
>>>
>>> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
>>> the boot time returned to something much more reasonable as the
>>> arch_add_memory call completed in milliseconds versus seconds. However in
>>> doing that I had to disable all of the other VM debugging on the system.
>>>
>>> Instead of keeping the value in CONFIG_DEBUG_VM I am adding a new CONFIG
>>> value called CONFIG_DEBUG_VM_PAGE_INIT_POISON that will control the page
>>> poisoning independent of the CONFIG_DEBUG_VM option.
>>>
>>> Signed-off-by: Alexander Duyck 
>>> ---
>>>  include/linux/page-flags.h |8 
>>>  lib/Kconfig.debug  |   14 ++
>>>  mm/memblock.c  |5 ++---
>>>  mm/sparse.c|4 +---
>>>  4 files changed, 25 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>>> index 74bee8cecf4c..0e95ca63375a 100644
>>> --- a/include/linux/page-flags.h
>>> +++ b/include/linux/page-flags.h
>>> @@ -13,6 +13,7 @@
>>>  #include 
>>>  #include 
>>>  #endif /* !__GENERATING_BOUNDS_H */
>>> +#include 
>>>
>>>  /*
>>>   * Various page->flags bits:
>>> @@ -162,6 +163,13 @@ static inline int PagePoisoned(const struct page *page)
>>>   return page->flags == PAGE_POISON_PATTERN;
>>>  }
>>>
>>> +static inline void page_init_poison(struct page *page, size_t size)
>>> +{
>>> +#ifdef CONFIG_DEBUG_VM_PAGE_INIT_POISON
>>> + memset(page, PAGE_POISON_PATTERN, size);
>>> +#endif
>>> +}
>>> +
>>>  /*
>>>   * Page flags policies wrt compound pages
>>>   *
>>> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
>>> index 613316724c6a..3b1277c52fed 100644
>>> --- a/lib/Kconfig.debug
>>> +++ b/lib/Kconfig.debug
>>> @@ -637,6 +637,20 @@ config DEBUG_VM_PGFLAGS
>>>
>>> If unsure, say N.
>>>
>>> +config DEBUG_VM_PAGE_INIT_POISON
>>> + bool "Enable early page metadata poisoning"
>>> + default y
>>> + depends on DEBUG_VM
>>> + help
>>> +   Seed the page metadata with a poison pattern to improve the
>>> +   likelihood of detecting attempts to access the page prior to
>>> +   initialization by the memory subsystem.
>>> +
>>> +   This initialization can result in a longer boot time for systems
>>> +   with a large amount of memory.
>>
>> What happens when DEBUG_VM_PGFLAGS = y and
>> DEBUG_VM_PAGE_INIT_POISON = n ?
>>
>> We are testing for pattern that was not set?
>>
>> I think DEBUG_VM_PAGE_INIT_POISON must depend on DEBUG_VM_PGFLAGS instead.
>>
>> Looks good otherwise.
>>
>> Thank you,
>> Pavel
> 
> The problem is that I then end up in the same situation I had in the
> last patch where you have to have DEBUG_VM_PGFLAGS on in order to do
> the seeding with poison.

OK

> 
> I can wrap the bit of code in PagePoisoned to just always return false
> if we didn't set the pattern. I figure there is value to be had for
> running DEBUG_VM_PGFLAGS regardless of the poison check, or
> DEBUG_VM_PAGE_INIT_POISON without the PGFLAGS check. That is why I
> wanted to leave them independent.

How about:

Remove "depends on DEBUG_VM", but make DEBUG_VM_PGFLAGS to depend on
both DEBUG_VM and DEBUG_VM_PAGE_INIT_POISON?

DEBUG_VM_PGFLAGS is already extremely slow, so having this extra
dependency is OK.
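
Something along these lines (Kconfig sketch only, not a tested patch;
help texts shortened):

config DEBUG_VM_PAGE_INIT_POISON
	bool "Enable early page metadata poisoning"
	default y
	help
	  Poison memory returned by the early allocators so that accesses
	  to uninitialized struct pages are more likely to be caught.

config DEBUG_VM_PGFLAGS
	bool "Debug page-flags operations"
	depends on DEBUG_VM && DEBUG_VM_PAGE_INIT_POISON
	help
	  Enables extra validation on page flags operations.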

Thank you,
Pavel

> 
> - Alex
> 

Re: [PATCH v2 1/2] mm: Move page struct poisoning to CONFIG_DEBUG_VM_PAGE_INIT_POISON

2018-09-05 Thread Pasha Tatashin


On 9/5/18 5:13 PM, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> On systems with a large amount of memory it can take a significant amount
> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> value. I have seen it take over 2 minutes to initialize a system with
> over 12TB of RAM.
> 
> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
> the boot time returned to something much more reasonable as the
> arch_add_memory call completed in milliseconds versus seconds. However in
> doing that I had to disable all of the other VM debugging on the system.
> 
> Instead of keeping the value in CONFIG_DEBUG_VM I am adding a new CONFIG
> value called CONFIG_DEBUG_VM_PAGE_INIT_POISON that will control the page
> poisoning independent of the CONFIG_DEBUG_VM option.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  include/linux/page-flags.h |8 
>  lib/Kconfig.debug  |   14 ++
>  mm/memblock.c  |5 ++---
>  mm/sparse.c|4 +---
>  4 files changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 74bee8cecf4c..0e95ca63375a 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #endif /* !__GENERATING_BOUNDS_H */
> +#include 
>  
>  /*
>   * Various page->flags bits:
> @@ -162,6 +163,13 @@ static inline int PagePoisoned(const struct page *page)
>   return page->flags == PAGE_POISON_PATTERN;
>  }
>  
> +static inline void page_init_poison(struct page *page, size_t size)
> +{
> +#ifdef CONFIG_DEBUG_VM_PAGE_INIT_POISON
> + memset(page, PAGE_POISON_PATTERN, size);
> +#endif
> +}
> +
>  /*
>   * Page flags policies wrt compound pages
>   *
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 613316724c6a..3b1277c52fed 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -637,6 +637,20 @@ config DEBUG_VM_PGFLAGS
>  
> If unsure, say N.
>  
> +config DEBUG_VM_PAGE_INIT_POISON
> + bool "Enable early page metadata poisoning"
> + default y
> + depends on DEBUG_VM
> + help
> +   Seed the page metadata with a poison pattern to improve the
> +   likelihood of detecting attempts to access the page prior to
> +   initialization by the memory subsystem.
> +
> +   This initialization can result in a longer boot time for systems
> +   with a large amount of memory.

What happens when DEBUG_VM_PGFLAGS = y and
DEBUG_VM_PAGE_INIT_POISON = n ?

We are testing for a pattern that was not set?

I think DEBUG_VM_PAGE_INIT_POISON must depend on DEBUG_VM_PGFLAGS instead.

Looks good otherwise.

Thank you,
Pavel

> +
> +   If unsure, say Y.
> +
>  config ARCH_HAS_DEBUG_VIRTUAL
>   bool
>  
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 237944479d25..a85315083b5a 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1444,10 +1444,9 @@ void * __init memblock_virt_alloc_try_nid_raw(
>  
>   ptr = memblock_virt_alloc_internal(size, align,
>  min_addr, max_addr, nid);
> -#ifdef CONFIG_DEBUG_VM
>   if (ptr && size > 0)
> - memset(ptr, PAGE_POISON_PATTERN, size);
> -#endif
> + page_init_poison(ptr, size);
> +
>   return ptr;
>  }
>  
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 10b07eea9a6e..67ad061f7fb8 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -696,13 +696,11 @@ int __meminit sparse_add_one_section(struct pglist_data 
> *pgdat,
>   goto out;
>   }
>  
> -#ifdef CONFIG_DEBUG_VM
>   /*
>* Poison uninitialized struct pages in order to catch invalid flags
>* combinations.
>*/
> - memset(memmap, PAGE_POISON_PATTERN, sizeof(struct page) * 
> PAGES_PER_SECTION);
> -#endif
> + page_init_poison(memmap, sizeof(struct page) * PAGES_PER_SECTION);
>  
>   section_mark_present(ms);
>   sparse_init_one_section(ms, section_nr, memmap, usemap);
> 

Re: [PATCH 2/2] mm: Create non-atomic version of SetPageReserved for init use

2018-09-05 Thread Pasha Tatashin


On 9/5/18 4:18 PM, Alexander Duyck wrote:
> On Tue, Sep 4, 2018 at 11:24 PM Michal Hocko  wrote:
>>
>> On Tue 04-09-18 11:33:45, Alexander Duyck wrote:
>>> From: Alexander Duyck 
>>>
>>> It doesn't make much sense to use the atomic SetPageReserved at init time
>>> when we are using memset to clear the memory and manipulating the page
>>> flags via simple "&=" and "|=" operations in __init_single_page.
>>>
>>> This patch adds a non-atomic version __SetPageReserved that can be used
>>> during page init and shows about a 10% improvement in initialization times
>>> on the systems I have available for testing.
>>
>> I agree with Dave about a comment is due. I am also quite surprised that
>> this leads to such a large improvement. Could you be more specific about
>> your test and machines you were testing on?
> 
> So my test case has been just initializing 4 3TB blocks of persistent
> memory with a few trace_printk values added to track total time in
> move_pfn_range_to_zone.
> 
> What I have been seeing is that the time needed for the call drops on
> average from 35-36 seconds down to around 31-32.

Just curious, why is there variance? Boot time is usually pretty
consistent, as there is only one thread and the system is in pretty much
the same state.

A dmesg output in the commit log would be helpful.

Thank you,
Pavel

> 
>> Other than that the patch makes sense to me.
>>
>>> Signed-off-by: Alexander Duyck 
>>
>> With the above addressed, feel free to add
>> Acked-by: Michal Hocko 
>>
>> Thanks!
> 
> As far as adding a comment are we just talking about why it is
> reserved, or do we need a description of the __SetPageReserved versus
> SetPageReserved. For now I was looking at adding a comment like:
> @@ -5517,8 +5517,13 @@ void __meminit memmap_init_zone(unsigned long
> size, int nid, unsigned long zone,
>  not_early:
> page = pfn_to_page(pfn);
> __init_single_page(page, pfn, zone, nid);
> +
> +   /*
> +* Mark page reserved as it will need to wait for onlining
> +* phase for it to be fully associated with a zone.
> +*/
> if (context == MEMMAP_HOTPLUG)
> -   SetPageReserved(page);
> +   __SetPageReserved(page);
> 
> /*
>  * Mark the block movable so that blocks are reserved for
> 
> Any thoughts on this?
> 
> Thanks.
> 
> - Alex
> 

Re: Plumbers 2018 - Performance and Scalability Microconference

2018-09-05 Thread Pasha Tatashin


On 9/5/18 2:38 AM, Mike Rapoport wrote:
> On Tue, Sep 04, 2018 at 05:28:13PM -0400, Daniel Jordan wrote:
>> Pavel Tatashin, Ying Huang, and I are excited to be organizing a performance 
>> and scalability microconference this year at Plumbers[*], which is happening 
>> in Vancouver this year.  The microconference is scheduled for the morning of 
>> the second day (Wed, Nov 14).
>>
>> We have a preliminary agenda and a list of confirmed and interested 
>> attendees (cc'ed), and are seeking more of both!
>>
>> Some of the items on the agenda as it stands now are:
>>
>>  - Promoting huge page usage:  With memory sizes becoming ever larger, huge 
>> pages are becoming more and more important to reduce TLB misses and the 
>> overhead of memory management itself--that is, to make the system scalable 
>> with the memory size.  But there are still some remaining gaps that prevent 
>> huge pages from being deployed in some situations, such as huge page 
>> allocation latency and memory fragmentation.
>>
>>  - Reducing the number of users of mmap_sem:  This semaphore is frequently 
>> used throughout the kernel.  In order to facilitate scaling this 
>> longstanding bottleneck, these uses should be documented and unnecessary 
>> users should be fixed.
>>
>>  - Parallelizing cpu-intensive kernel work:  Resolve problems of past 
>> approaches including extra threads interfering with other processes, playing 
>> well with power management, and proper cgroup accounting for the extra 
>> threads.  Bonus topic: proper accounting of workqueue threads running on 
>> behalf of cgroups.
>>
>>  - Preserving userland during kexec with a hibernation-like mechanism.
> 
> Just some crazy idea: have you considered using checkpoint-restore as a
> replacement or an addition to hibernation?

Hi Mike,

Yes, this is one way I was thinking about: use the kernel to pass the
application's stored state to the new kernel in pmem. The only problem is
that we waste memory: when there is not enough system memory to copy and
pass the application state to the new kernel, this scheme won't work.
Think about a DB that occupies 80% of system memory that we want to
checkpoint/restore.

So, we need another way, where the preserved memory is the memory that
is actually used by the applications, not a copy. One easy way is to give
each application that has a large state that is expensive to recreate a
persistent memory device, and let the application keep its state on that
device (say /dev/pmemN). The only problem is that memory on that device
must be accessible just as fast as regular memory, without any file
system overhead and hopefully without the need for DAX.
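
As a rough illustration of the usage model (sketch only; the device name
and the absence of any on-device layout are made up):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a raw pmem device so that application state placed in it survives
 * a kexec reboot: no file system, no copy at checkpoint time. */
static void *map_persistent_state(const char *dev, size_t size)
{
	int fd = open(dev, O_RDWR);
	void *p;

	if (fd < 0)
		return NULL;
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	return p == MAP_FAILED ? NULL : p;
}

/* e.g.: state = map_persistent_state("/dev/pmem0", 64UL << 30); */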

I just want to get some ideas of what people are thinking about this,
and what would be the best way to achieve it.

Pavel


>  
>> These center around our interests, but having lots of topics to choose from 
>> ensures we cover what's most important to the community, so we would like to 
>> hear about additional topics and extensions to those listed here.  This 
>> includes, but is certainly not limited to, work in progress that would 
>> benefit from in-person discussion, real-world performance problems, and 
>> experimental and academic work.
>>
>> If you haven't already done so, please let us know if you are interested in 
>> attending, or have suggestions for other attendees.
>>
>> Thanks,
>> Daniel
>>
>> [*] https://blog.linuxplumbersconf.org/2018/performance-mc/
>>
> 

Re: [PATCH 1/2] mm: Move page struct poisoning from CONFIG_DEBUG_VM to CONFIG_DEBUG_VM_PGFLAGS

2018-09-04 Thread Pasha Tatashin


On 9/4/18 5:13 PM, Alexander Duyck wrote:
> On Tue, Sep 4, 2018 at 1:07 PM Pasha Tatashin
>  wrote:
>>
>> Hi Alexander,
>>
>> This is a wrong way to do it. memblock_virt_alloc_try_nid_raw() does not
>> initialize allocated memory, and by setting memory to all ones in debug
>> build we ensure that no callers rely on this function to return zeroed
>> memory just by accident.
> 
> I get that, but setting this to all 1's is still just debugging code
> and that is adding significant overhead.

That's correct, it is debugging code on a debugging kernel.

> 
>> And, the accidents are frequent because most of the BIOSes and
>> hypervisors zero memory for us. The exception is kexec reboot.
>>
>> So, the fact that page flags checks this pattern, does not mean that
>> this is the only user. Memory that is returned by
>> memblock_virt_alloc_try_nid_raw() is used for page table as well, and
>> can be used in other places as well that don't want memblock to zero the
>> memory for them for performance reasons.
> 
> The logic behind this statement is confusing. You are saying they
> don't want memblock to zero the memory for performance reasons, yet
> you are setting it to all 1's for debugging reasons? I get that it is
> wrapped, but in my mind just using CONFIG_DEBUG_VM is too broad of a
> brush. Especially with distros like Fedora enabling it by default.

The idea is not to zero memory on a production kernel, and to ensure
that not zeroing memory does not cause any accidental bugs by having
this debug code on a debug kernel.

> 
>> I am surprised that CONFIG_DEBUG_VM is used in production kernel, but if
>> so perhaps a new CONFIG should be added: CONFIG_DEBUG_MEMBLOCK
>>
>> Thank you,
>> Pavel
> 
> I don't know about production. I am running a Fedora kernel on my
> development system and it has it enabled. It looks like it has been
> that way for a while based on a FC20 Bugzilla
> (https://bugzilla.redhat.com/show_bug.cgi?id=1074710). A quick look at
> one of my CentOS systems shows that it doesn't have it set. I suspect
> it will vary from distro to distro. I just know it spooked me when I
> was stuck staring at a blank screen for three minutes when I was
> booting a system with 12TB of memory since this delay can hit you
> early in the boot.

I understand, this is the delay that I fixed when I removed memset(0)
from the struct page initialization code. However, we still need to keep
this debug memset(1) code in order to catch bugs, and it does catch them
from time to time.

For far too long Linux has expected that the memory returned by memblock
and the boot allocator is always zeroed.

> 
> I had considered adding a completely new CONFIG. The only thing is it
> doesn't make much sense to have the logic setting the value to all 1's
> without any logic to test for it. 

When memory is zeroed, page tables work by accident, as the entries are
empty. However, when the entries are all ones and we accidentally try to
use that memory as a page table, the invalid VA in the page table will
crash the debug kernel (and in the past this has helped find some bugs).

So, the testing is not only that uninitialized struct pages are not
accessed, but also that only explicitly initialized page tables are
accessed.


> That is why I thought it made more
> sense to just fold it into CONFIG_DEBUG_VM_PGFLAGS. I suppose I could
> look at something like CONFIG_DEBUG_PAGE_INIT if we want to go that
> route. I figure using something like MEMBLOCK probably wouldn't make
> sense since this also impacts sparse section init.

If distros are using CONFIG_DEBUG_VM in production kernels (as you
pointed out above), it makes sense to add CONFIG_DEBUG_MEMBLOCK.

Thank you,
Pavel

Re: [PATCH 1/2] mm: Move page struct poisoning from CONFIG_DEBUG_VM to CONFIG_DEBUG_VM_PGFLAGS

2018-09-04 Thread Pasha Tatashin
Hi Alexander,

This is a wrong way to do it. memblock_virt_alloc_try_nid_raw() does not
initialize allocated memory, and by setting memory to all ones in debug
build we ensure that no callers rely on this function to return zeroed
memory just by accident.

And, the accidents are frequent because most of the BIOSes and
hypervisors zero memory for us. The exception is kexec reboot.

So, the fact that page flags checks this pattern, does not mean that
this is the only user. Memory that is returned by
memblock_virt_alloc_try_nid_raw() is used for page table as well, and
can be used in other places as well that don't want memblock to zero the
memory for them for performance reasons.
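
In other words, the contract of the _raw variant is that the caller owns
all initialization. A made-up fragment, just to illustrate the contract
(not a real call site; size and nid come from the caller):

	/* Illustration only: a caller that wants zeroed memory from the
	 * _raw variant has to clear it itself; the non-_raw variants do
	 * the memset() internally. */
	void *buf = memblock_virt_alloc_try_nid_raw(size, SMP_CACHE_BYTES, 0,
						    BOOTMEM_ALLOC_ACCESSIBLE,
						    nid);

	if (buf)
		memset(buf, 0, size);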

I am surprised that CONFIG_DEBUG_VM is used in production kernel, but if
so perhaps a new CONFIG should be added: CONFIG_DEBUG_MEMBLOCK

Thank you,
Pavel

On 9/4/18 2:33 PM, Alexander Duyck wrote:
> From: Alexander Duyck 
> 
> On systems with a large amount of memory it can take a significant amount
> of time to initialize all of the page structs with the PAGE_POISON_PATTERN
> value. I have seen it take over 2 minutes to initialize a system with
> over 12TB of RAM.
> 
> In order to work around the issue I had to disable CONFIG_DEBUG_VM and then
> the boot time returned to something much more reasonable as the
> arch_add_memory call completed in milliseconds versus seconds. However in
> doing that I had to disable all of the other VM debugging on the system.
> 
> I did a bit of research and it seems like the only function that checks
> for this poison value is the PagePoisoned function, and it is only called
> in two spots. One is the PF_POISONED_CHECK macro that is only in use when
> CONFIG_DEBUG_VM_PGFLAGS is defined, and the other is as a part of the
> __dump_page function which is using the check to prevent a recursive
> failure in the event of discovering a poisoned page.
> 
> With this being the case I am opting to move the poisoning of the page
> structs from CONFIG_DEBUG_VM to CONFIG_DEBUG_VM_PGFLAGS so that we are
> only performing the memset if it will be used to test for failures.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  mm/memblock.c |2 +-
>  mm/sparse.c   |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 237944479d25..51e8ae927257 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1444,7 +1444,7 @@ void * __init memblock_virt_alloc_try_nid_raw(
>  
>   ptr = memblock_virt_alloc_internal(size, align,
>  min_addr, max_addr, nid);
> -#ifdef CONFIG_DEBUG_VM
> +#ifdef CONFIG_DEBUG_VM_PGFLAGS
>   if (ptr && size > 0)
>   memset(ptr, PAGE_POISON_PATTERN, size);
>  #endif
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 10b07eea9a6e..0fd9ad5021b0 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -696,7 +696,7 @@ int __meminit sparse_add_one_section(struct pglist_data 
> *pgdat,
>   goto out;
>   }
>  
> -#ifdef CONFIG_DEBUG_VM
> +#ifdef CONFIG_DEBUG_VM_PGFLAGS
>   /*
>* Poison uninitialized struct pages in order to catch invalid flags
>* combinations.
> 

[PATCH] mm: Disable deferred struct page for 32-bit arches

2018-08-31 Thread Pasha Tatashin
Deferred struct page init is needed only on systems with a large amount
of physical memory, to improve boot performance. 32-bit systems do not
benefit from this feature.

Jiri reported a problem where deferred struct pages do not work well with
x86-32:

[0.035162] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[0.035725] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[0.036269] Initializing CPU#0
[0.036513] Initializing HighMem for node 0 (00036ffe:0007ffe0)
[0.038459] page:f678 is uninitialized and poisoned
[0.038460] raw:       
 
[0.039509] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
[0.040038] [ cut here ]
[0.040399] kernel BUG at include/linux/page-flags.h:293!
[0.040823] invalid opcode:  [#1] SMP PTI
[0.041166] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc1_pt_jiri #9
[0.041694] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.11.0-20171110_100015-anatol 04/01/2014
[0.042496] EIP: free_highmem_page+0x64/0x80
[0.042839] Code: 13 46 d8 c1 e8 18 5d 83 e0 03 8d 04 c0 c1 e0 06 ff 80 ec 
5f 44 d8 c3 8d b4 26 00 00 00 00 ba 08 65 28 d8 89 d8 e8 fc 71 02 00 <0f> 0b 8d 
76 00 8d bc 27 00 00 00 00 ba d0 b1 26 d8 89 d8 e8 e4 71
[0.044338] EAX: 003c EBX: f678 ECX:  EDX: d856cbe8
[0.044868] ESI: 0007ffe0 EDI: d838df20 EBP: d838df00 ESP: d838defc
[0.045372] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210086
[0.045913] CR0: 80050033 CR2:  CR3: 18556000 CR4: 00040690
[0.046413] DR0:  DR1:  DR2:  DR3: 
[0.046913] DR6: fffe0ff0 DR7: 0400
[0.047220] Call Trace:
[0.047419]  add_highpages_with_active_regions+0xbd/0x10d
[0.047854]  set_highmem_pages_init+0x5b/0x71
[0.048202]  mem_init+0x2b/0x1e8
[0.048460]  start_kernel+0x1d2/0x425
[0.048757]  i386_start_kernel+0x93/0x97
[0.049073]  startup_32_smp+0x164/0x168
[0.049379] Modules linked in:
[0.049626] ---[ end trace 337949378db0abbb ]---

We free highmem pages before their struct pages are initialized:

mem_init()
 set_highmem_pages_init()
  add_highpages_with_active_regions()
   free_highmem_page()
.. Access uninitialized struct page here..

Because there is no reason to have this feature on 32-bit systems, just
disable it.

Fixes: 2e3ca40f03bb ("mm: relax deferred struct page requirements")
Cc: sta...@vger.kernel.org

Reported-by: Jiri Slaby 
Signed-off-by: Pavel Tatashin 
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index a550635ea5c3..de64ea658716 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -637,6 +637,7 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on NO_BOOTMEM
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
+   depends on 64BIT
help
  Ordinarily all struct pages are initialised during early boot in a
  single thread. On very large machines this can take a considerable
-- 
2.18.0


Re: [PATCH] mm/page_alloc: Clean up check_for_memory

2018-08-31 Thread Pasha Tatashin


On 8/31/18 8:24 AM, Oscar Salvador wrote:
> On Thu, Aug 30, 2018 at 01:55:29AM +0000, Pasha Tatashin wrote:
>> I would re-write the above function like this:
>> static void check_for_memory(pg_data_t *pgdat, int nid)
>> {
>> enum zone_type zone_type;
>>
>> for (zone_type = 0; zone_type < ZONE_MOVABLE; zone_type++) {
>> if (populated_zone(&pgdat->node_zones[zone_type])) {
>> node_set_state(nid, zone_type <= ZONE_NORMAL ?
>>N_NORMAL_MEMORY: N_HIGH_MEMORY);
>> break;
>> }
>> }
>> }
> 
> Hi Pavel,
> 
> the above would not work fine.
> You set either N_NORMAL_MEMORY or N_HIGH_MEMORY, but a node can have both
> types of memory at the same time (on CONFIG_HIGHMEM systems).
> 
> N_HIGH_MEMORY stands for regular or high memory
> while N_NORMAL_MEMORY stands only for regular memory,
> that is why we set it only in case the zone is <= ZONE_NORMAL.

Hi Oscar,

Are you saying that the code in mainline is broken? Because we set
node_set_state(nid, N_NORMAL_MEMORY); even on a node with N_HIGH_MEMORY:

6826			if (N_NORMAL_MEMORY != N_HIGH_MEMORY &&
6827			    zone_type <= ZONE_NORMAL)
6828				node_set_state(nid, N_NORMAL_MEMORY);
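
For context, the surrounding mainline loop looks roughly like this
(paraphrased from memory, so treat it as a sketch rather than the exact
code):

static void check_for_memory(pg_data_t *pgdat, int nid)
{
	enum zone_type zone_type;

	for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
		struct zone *zone = &pgdat->node_zones[zone_type];

		if (populated_zone(zone)) {
			node_set_state(nid, N_HIGH_MEMORY);
			if (N_NORMAL_MEMORY != N_HIGH_MEMORY &&
			    zone_type <= ZONE_NORMAL)
				node_set_state(nid, N_NORMAL_MEMORY);
			break;
		}
	}
}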

Thank you,
Pavel

Re: [PATCH v1 5/5] mm/memory_hotplug: print only with DEBUG_VM in online/offline_pages()

2018-08-30 Thread Pasha Tatashin


On 8/20/18 6:46 AM, David Hildenbrand wrote:
> On 16.08.2018 12:06, David Hildenbrand wrote:
>> Let's try to minimze the noise.
>>
>> Signed-off-by: David Hildenbrand 
>> ---
>>  mm/memory_hotplug.c | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index bbbd16f9d877..6fec2dc6a73d 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -966,9 +966,11 @@ int __ref online_pages(unsigned long pfn, unsigned long 
>> nr_pages, int online_typ
>>  return 0;
>>  
>>  failed_addition:
>> +#ifdef CONFIG_DEBUG_VM
>>  pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
>>   (unsigned long long) pfn << PAGE_SHIFT,
>>   (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
>> +#endif
>>  memory_notify(MEM_CANCEL_ONLINE, &arg);
>>  return ret;
>>  }
>> @@ -1660,7 +1662,9 @@ int offline_pages(unsigned long start_pfn, unsigned 
>> long nr_pages)
>>  offlined_pages = check_pages_isolated(start_pfn, end_pfn);
>>  if (offlined_pages < 0)
>>  goto repeat;
>> +#ifdef CONFIG_DEBUG_VM
>>  pr_info("Offlined Pages %ld\n", offlined_pages);
>> +#endif
>>  /* Ok, all of our target is isolated.
>> We cannot do rollback at this point. */
>>  offline_isolated_pages(start_pfn, end_pfn);
>> @@ -1695,9 +1699,11 @@ int offline_pages(unsigned long start_pfn, unsigned 
>> long nr_pages)
>>  return 0;
>>  
>>  failed_removal:
>> +#ifdef CONFIG_DEBUG_VM
>>  pr_debug("memory offlining [mem %#010llx-%#010llx] failed\n",
>>   (unsigned long long) start_pfn << PAGE_SHIFT,
>>   ((unsigned long long) end_pfn << PAGE_SHIFT) - 1);
>> +#endif
>>  memory_notify(MEM_CANCEL_OFFLINE, &arg);
>>  /* pushback to free area */
>>  undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>>
> 
> I'll drop this patch for now, maybe the error messages are actually
> useful when debugging a crashdump of a system without CONFIG_DEBUG_VM.
> 

Yes, please drop it, there is no reason to reduce the number of these
messages. They are useful.

Thank you,
Pavel

Re: [PATCH v1 4/5] mm/memory_hotplug: onlining pages can only fail due to notifiers

2018-08-30 Thread Pasha Tatashin
LGTM

Reviewed-by: Pavel Tatashin 

On 8/16/18 6:06 AM, David Hildenbrand wrote:
> Onlining pages can only fail if a notifier reported a problem (e.g. -ENOMEM).
> online_pages_range() can never fail.
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 3dc6d2a309c2..bbbd16f9d877 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -933,13 +933,8 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages, int online_typ
>   setup_zone_pageset(zone);
>   }
>  
> - ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
> - online_pages_range);
> - if (ret) {
> - if (need_zonelists_rebuild)
> - zone_pcp_reset(zone);
> - goto failed_addition;
> - }
> + walk_system_ram_range(pfn, nr_pages, &onlined_pages,
> +   online_pages_range);
>  
>   zone->present_pages += onlined_pages;
>  
> 

Re: [PATCH v1 3/5] mm/memory_hotplug: check if sections are already online/offline

2018-08-30 Thread Pasha Tatashin
On 8/16/18 7:00 AM, David Hildenbrand wrote:
> On 16.08.2018 12:47, Oscar Salvador wrote:
>> On Thu, Aug 16, 2018 at 12:06:26PM +0200, David Hildenbrand wrote:
>>
>>> +
>>> +/* check if all mem sections are offline */
>>> +bool mem_sections_offline(unsigned long pfn, unsigned long end_pfn)
>>> +{
>>> +   for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>>> +   unsigned long section_nr = pfn_to_section_nr(pfn);
>>> +
>>> +   if (WARN_ON(!valid_section_nr(section_nr)))
>>> +   continue;
>>> +   if (online_section_nr(section_nr))
>>> +   return false;
>>> +   }
>>> +   return true;
>>> +}
>>
>> AFAICS pages_correctly_probed will catch this first.
>> pages_correctly_probed checks for the section to be:
>>
>> - present
>> - valid
>> - !online
> 
> Right, I missed that function.
> 
>>
>> Maybe it makes sense to rename it, and write another pages_correctly_probed 
>> routine
>> for the offline case.
>>
>> So all checks would stay in memory_block_action level, and we would not need
>> the mem_sections_offline/online stuff.

I am OK with that, but will wait for a patch to review.
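
Just to make sure we mean the same thing, the offline-side counterpart
could look roughly like this (purely illustrative, the name is made up
and this is not a patch):

/* Check that every section in the range is present, valid and online,
 * i.e. that the range is actually eligible for offlining. */
static bool pages_correctly_onlined(unsigned long start_pfn,
				    unsigned long nr_pages)
{
	unsigned long pfn, end_pfn = start_pfn + nr_pages;

	for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
		unsigned long nr = pfn_to_section_nr(pfn);

		if (!present_section_nr(nr) || !valid_section_nr(nr) ||
		    !online_section_nr(nr))
			return false;
	}
	return true;
}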

Pavel

Re: [PATCH v1 2/5] mm/memory_hotplug: enforce section alignment when onlining/offlining

2018-08-30 Thread Pasha Tatashin
Hi David,

I am not sure this is needed, because we already have a stricter checker:

check_hotplug_memory_range()

You could call it from online_pages(), if you think there is a reason to
do it, but other than that it is done from add_memory_resource() and
from remove_memory().
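
Roughly, that check enforces memory-block alignment, which is at least
section alignment (simplified sketch, not the exact mainline code):

static int check_hotplug_memory_range(u64 start, u64 size)
{
	u64 block_sz = memory_block_size_bytes();

	/* Hotplugged ranges must be aligned to the memory block size,
	 * which is itself a multiple of the section size. */
	if (!size || !IS_ALIGNED(start, block_sz) ||
	    !IS_ALIGNED(size, block_sz))
		return -EINVAL;

	return 0;
}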

Thank you,
Pavel

On 8/16/18 6:06 AM, David Hildenbrand wrote:
> onlining/offlining code works on whole sections, so let's enforce that.
> Existing code only allows to add memory in memory block size. And only
> whole memory blocks can be onlined/offlined. Memory blocks are always
> aligned to sections, so this should not break anything.
> 
> online_pages/offline_pages will implicitly mark whole sections
> online/offline, so the code really can only handle such granularities.
> 
> (especially offlining code cannot deal with pageblock_nr_pages but
>  theoretically only MAX_ORDER-1)
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 090cf474de87..30d2fa42b0bb 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -897,6 +897,11 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages, int online_typ
>   struct memory_notify arg;
>   struct memory_block *mem;
>  
> + if (!IS_ALIGNED(pfn, PAGES_PER_SECTION))
> + return -EINVAL;
> + if (!IS_ALIGNED(nr_pages, PAGES_PER_SECTION))
> + return -EINVAL;
> +
>   /*
>* We can't use pfn_to_nid() because nid might be stored in struct page
>* which is not yet initialized. Instead, we find nid from memory block.
> @@ -1600,10 +1605,9 @@ int offline_pages(unsigned long start_pfn, unsigned 
> long nr_pages)
>   struct zone *zone;
>   struct memory_notify arg;
>  
> - /* at least, alignment against pageblock is necessary */
> - if (!IS_ALIGNED(start_pfn, pageblock_nr_pages))
> + if (!IS_ALIGNED(start_pfn, PAGES_PER_SECTION))
>   return -EINVAL;
> - if (!IS_ALIGNED(end_pfn, pageblock_nr_pages))
> + if (!IS_ALIGNED(nr_pages, PAGES_PER_SECTION))
>   return -EINVAL;
>   /* This makes hotplug much easier...and readable.
>  we assume this for now. .*/
> 

Re: [PATCH v1 1/5] mm/memory_hotplug: drop intermediate __offline_pages

2018-08-30 Thread Pasha Tatashin


On 8/30/18 4:17 PM, Pasha Tatashin wrote:
> I guess the wrap was done because of __ref, but no reason to have this
> wrap. So looks good to me.
> 
> Reviewed-by: Pavel Tatashin >
> On 8/16/18 6:06 AM, David Hildenbrand wrote:
>> Let's avoid this indirection and just call the function offline_pages().
>>
>> Signed-off-by: David Hildenbrand 
>> ---
>>  mm/memory_hotplug.c | 13 +++--
>>  1 file changed, 3 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 6a2726920ed2..090cf474de87 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1589,10 +1589,10 @@ static void node_states_clear_node(int node, struct 
>> memory_notify *arg)
>>  node_clear_state(node, N_MEMORY);
>>  }
>>  
>> -static int __ref __offline_pages(unsigned long start_pfn,
>> -  unsigned long end_pfn)
>> +/* Must be protected by mem_hotplug_begin() or a device_lock */
>> +int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
  ^^^
I meant to say keep the __ref, otherwise looks good.

Thank you,
Pavel

Re: [PATCH v1 1/5] mm/memory_hotplug: drop intermediate __offline_pages

2018-08-30 Thread Pasha Tatashin
I guess the wrap was done because of __ref, but no reason to have this
wrap. So looks good to me.

Reviewed-by: Pavel Tatashin 

On 8/16/18 6:06 AM, David Hildenbrand wrote:
> Let's avoid this indirection and just call the function offline_pages().
> 
> Signed-off-by: David Hildenbrand 
> ---
>  mm/memory_hotplug.c | 13 +++--
>  1 file changed, 3 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6a2726920ed2..090cf474de87 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1589,10 +1589,10 @@ static void node_states_clear_node(int node, struct 
> memory_notify *arg)
>   node_clear_state(node, N_MEMORY);
>  }
>  
> -static int __ref __offline_pages(unsigned long start_pfn,
> -   unsigned long end_pfn)
> +/* Must be protected by mem_hotplug_begin() or a device_lock */
> +int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
>  {
> - unsigned long pfn, nr_pages;
> + unsigned long pfn, end_pfn = start_pfn + nr_pages;
>   long offlined_pages;
>   int ret, node;
>   unsigned long flags;
> @@ -1612,7 +1612,6 @@ static int __ref __offline_pages(unsigned long 
> start_pfn,
>  
>   zone = page_zone(pfn_to_page(valid_start));
>   node = zone_to_nid(zone);
> - nr_pages = end_pfn - start_pfn;
>  
>   /* set above range as isolated */
>   ret = start_isolate_page_range(start_pfn, end_pfn,
> @@ -1700,12 +1699,6 @@ static int __ref __offline_pages(unsigned long 
> start_pfn,
>   undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>   return ret;
>  }
> -
> -/* Must be protected by mem_hotplug_begin() or a device_lock */
> -int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
> -{
> - return __offline_pages(start_pfn, start_pfn + nr_pages);
> -}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
>  /**
> 

Re: [PATCH] mm/page_alloc: Clean up check_for_memory

2018-08-29 Thread Pasha Tatashin


On 8/28/18 5:01 PM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> check_for_memory looks a bit confusing.
> First of all, we have this:
> 
> if (N_MEMORY == N_NORMAL_MEMORY)
>   return;
> 
> Checking the ENUM declaration, looks like N_MEMORY canot be equal to
> N_NORMAL_MEMORY.
> I could not find where N_MEMORY is set to N_NORMAL_MEMORY, or the other
> way around either, so unless I am missing something, this condition 
> will never evaluate to true.
> It makes sense to get rid of it.
> 
> Moving forward, the operations whithin the loop look a bit confusing
> as well.
> 
> We set N_HIGH_MEMORY unconditionally, and then we set N_NORMAL_MEMORY
> in case we have CONFIG_HIGHMEM (N_NORMAL_MEMORY != N_HIGH_MEMORY)
> and zone <= ZONE_NORMAL.
> (N_HIGH_MEMORY falls back to N_NORMAL_MEMORY on !CONFIG_HIGHMEM systems,
> and that is why we can just go ahead and set N_HIGH_MEMORY unconditionally)
> 
> Although this works, it is a bit subtle.
> 
> I think that this could be easier to follow:
> 
> First, we should only set N_HIGH_MEMORY in case we have
> CONFIG_HIGHMEM.
> And then we should set N_NORMAL_MEMORY in case zone <= ZONE_NORMAL,
> without further checking whether we have CONFIG_HIGHMEM or not.
> 
> Signed-off-by: Oscar Salvador 
> ---
>  mm/page_alloc.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 839e0cc17f2c..6aa947f9e614 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6819,15 +6819,12 @@ static void check_for_memory(pg_data_t *pgdat, int 
> nid)
>  {
>   enum zone_type zone_type;
>  
> - if (N_MEMORY == N_NORMAL_MEMORY)
> - return;
> -
>   for (zone_type = 0; zone_type <= ZONE_MOVABLE - 1; zone_type++) {
>   struct zone *zone = &pgdat->node_zones[zone_type];
>   if (populated_zone(zone)) {
> - node_set_state(nid, N_HIGH_MEMORY);
> - if (N_NORMAL_MEMORY != N_HIGH_MEMORY &&
> - zone_type <= ZONE_NORMAL)
> + if (IS_ENABLED(CONFIG_HIGHMEM))
> + node_set_state(nid, N_HIGH_MEMORY);
> + if (zone_type <= ZONE_NORMAL)
>   node_set_state(nid, N_NORMAL_MEMORY);
>   break;
>   }
> 

I would re-write the above function like this:
static void check_for_memory(pg_data_t *pgdat, int nid)
{
	enum zone_type zone_type;

	for (zone_type = 0; zone_type < ZONE_MOVABLE; zone_type++) {
		if (populated_zone(&pgdat->node_zones[zone_type])) {
			node_set_state(nid, zone_type <= ZONE_NORMAL ?
				       N_NORMAL_MEMORY : N_HIGH_MEMORY);
			break;
		}
	}
}

zone_type <= ZONE_MOVABLE - 1
is the same as:
zone_type < ZONE_MOVABLE

If zone_type > ZONE_NORMAL, it means that CONFIG_HIGHMEM is enabled, so there
is no need to check for it explicitly.
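
For reference, this relies on the ordering of enum zone_type; an abridged
sketch of include/linux/mmzone.h (some config-dependent entries omitted):

enum zone_type {
#ifdef CONFIG_ZONE_DMA
	ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
	ZONE_DMA32,
#endif
	ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
	ZONE_HIGHMEM,	/* only present when CONFIG_HIGHMEM is set */
#endif
	ZONE_MOVABLE,
	__MAX_NR_ZONES
};

A populated zone above ZONE_NORMAL and below ZONE_MOVABLE can only be
ZONE_HIGHMEM, which only exists under CONFIG_HIGHMEM.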

Pavel

Re: [RFC v2 2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory

2018-08-29 Thread Pasha Tatashin

On 8/17/18 11:41 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> Currently, we decrement zone/node spanned_pages when we
> remove memory and not when we offline it.
> 
> This, besides of not being consistent with the current code,
> implies that we can access steal pages if we never get to online
> that memory.
> 
> In order to prevent that, we have to move all zone/pages stuff to
> the offlining memory stage.
> Removing memory path should only care about memory sections and memory
> blocks.
> 
> Another thing to notice here is that this is not so easy to be done
> as HMM/devm have a particular handling of memory-hotplug.
> They do not go through the common path, and so, they do not
> call either offline_pages() nor online_pages().
> 
> All they care about is to add the sections, move the pages to
> ZONE_DEVICE, and in some cases, to create the linear mapping.
> 
> In order to do this more smooth, two new functions are created
> to deal with these particular cases:
> 
> del_device_memory
> add_device_memory
> 
> add_device_memory is in charge of
> 
> a) calling either arch_add_memory() or add_pages(), depending on whether
>we want a linear mapping
> b) online the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() being zone ZONE_DEVICE to
>expand zone/pgdat spanned pages and initialize its pages
> 
> del_device_memory, on the other hand, is in charge of
> 
> a) offline the memory sections that correspond to the pfn range
> b) calling shrink_pages(), which shrinks node/zone spanned pages.
> c) calling either arch_remove_memory() or __remove_pages(), depending on
>whether we need to tear down the linear mapping or not
> 
> These two functions are called from:
> 
> add_device_memory:
>   - devm_memremap_pages()
>   - hmm_devmem_pages_create()
> 
> del_device_memory:
>   - devm_memremap_pages_release()
>   - hmm_devmem_release()
> 
> I think that this will get easier as soon as [1] gets merged.
> 
> Finally, shrink_pages() is moved to offline_pages(), so now,
> all pages/zone handling is being taken care in online/offline_pages stage.
> 
> [1] https://lkml.org/lkml/2018/6/19/110
> 
> Signed-off-by: Oscar Salvador 
> ---
>  arch/ia64/mm/init.c|   4 +-
>  arch/powerpc/mm/mem.c  |  10 +--
>  arch/sh/mm/init.c  |   4 +-
>  arch/x86/mm/init_32.c  |   4 +-
>  arch/x86/mm/init_64.c  |   8 +--
>  include/linux/memory_hotplug.h |   9 ++-
>  kernel/memremap.c  |  14 ++--
>  kernel/resource.c  |  16 +
>  mm/hmm.c   |  32 -
>  mm/memory_hotplug.c| 143 
> +++--
>  mm/sparse.c|   4 +-
>  11 files changed, 145 insertions(+), 103 deletions(-)

Hi Oscar,

I have been studying this patch, and do not see anything bad about it
except that it begs to be split into smaller patches. I think you can
send this work as a series without RFC if this patch is split into 3 or
so patches. I will review that series.

Thank you,
Pavel

Re: [PATCH] memory_hotplug: fix kernel_panic on offline page processing

2018-08-28 Thread Pasha Tatashin


On 8/28/18 7:25 AM, Michal Hocko wrote:
> On Tue 28-08-18 11:05:39, Mikhail Zaslonko wrote:
>> Within show_valid_zones() the function test_pages_in_a_zone() should be
>> called for online memory blocks only. Otherwise it might lead to the
>> VM_BUG_ON due to uninitialized struct pages (when CONFIG_DEBUG_VM_PGFLAGS
>> kernel option is set):
>>
>>  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>  [ cut here ]
>>  Call Trace:
>>  ([<0038f91e>] test_pages_in_a_zone+0xe6/0x168)
>>   [<00923472>] show_valid_zones+0x5a/0x1a8
>>   [<00900284>] dev_attr_show+0x3c/0x78
>>   [<0046f6f0>] sysfs_kf_seq_show+0xd0/0x150
>>   [<003ef662>] seq_read+0x212/0x4b8
>>   [<003bf202>] __vfs_read+0x3a/0x178
>>   [<003bf3ca>] vfs_read+0x8a/0x148
>>   [<003bfa3a>] ksys_read+0x62/0xb8
>>   [<00bc2220>] system_call+0xdc/0x2d8
>>
>> That VM_BUG_ON was triggered by the page poisoning introduced in
>> mm/sparse.c with the git commit d0dc12e86b31 ("mm/memory_hotplug: optimize
>> memory hotplug")
>> With the same commit the new 'nid' field has been added to the struct
>> memory_block in order to store and later on derive the node id for offline
>> pages (instead of accessing struct page which might be uninitialized). But
>> one reference to nid in show_valid_zones() function has been overlooked.
>> Fixed with current commit.
>> Also, nr_pages will not be used any more after test_pages_in_a_zone() call,
>> do not update it.
>>
>> Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug")
>> Cc:  # v4.17+
>> Cc: Pavel Tatashin 
>> Signed-off-by: Mikhail Zaslonko 
> 
> Btw. this land mines which are basically impossible to find during the
> review are the reason why I was not all that happy about d0dc12e86b31.
> It added a margninal improvement but opened a can of warms. On the other

Hi Michal,

I agree, the hotplug code is very fragile. But that only means it
requires improvements.

I specifically added the PagePoisoned() check with the hotplug optimization
changes in order to catch these kinds of bugs, and I am glad we are
catching them.
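
For context, the check in question is roughly the following, a sketch of the
CONFIG_DEBUG_VM_PGFLAGS path in include/linux/page-flags.h (not quoted
verbatim):

/* uninitialized struct pages are pre-filled with PAGE_POISON_PATTERN */
static inline int PagePoisoned(const struct page *page)
{
	return page->flags == PAGE_POISON_PATTERN;
}

/* page-flag accessors funnel through this; the VM_BUG_ON only fires
 * when CONFIG_DEBUG_VM_PGFLAGS is set */
#define PF_POISONED_CHECK(page) ({					\
		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
		page; })

So touching a struct page that was never initialized trips the VM_BUG_ON
seen in the trace above.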

Reviewed-by: Pavel Tatashin 

Thank you,
Pavel

> hand maybe we just had to open that can one day...
> 
> Acked-by: Michal Hocko 
> 
> Thanks!
> 
>> ---
>>  drivers/base/memory.c | 20 +---
>>  1 file changed, 9 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index f5e560188a18..622ab8edc035 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -416,26 +416,24 @@ static ssize_t show_valid_zones(struct device *dev,
>>  struct zone *default_zone;
>>  int nid;
>>  
>> -/*
>> - * The block contains more than one zone can not be offlined.
>> - * This can happen e.g. for ZONE_DMA and ZONE_DMA32
>> - */
>> -if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages, 
>> &valid_start_pfn, &valid_end_pfn))
>> -return sprintf(buf, "none\n");
>> -
>> -start_pfn = valid_start_pfn;
>> -nr_pages = valid_end_pfn - start_pfn;
>> -
>>  /*
>>   * Check the existing zone. Make sure that we do that only on the
>>   * online nodes otherwise the page_zone is not reliable
>>   */
>>  if (mem->state == MEM_ONLINE) {
>> +/*
>> + * The block contains more than one zone can not be offlined.
>> + * This can happen e.g. for ZONE_DMA and ZONE_DMA32
>> + */
>> +if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages,
>> +  &valid_start_pfn, &valid_end_pfn))
>> +return sprintf(buf, "none\n");
>> +start_pfn = valid_start_pfn;
>>  strcat(buf, page_zone(pfn_to_page(start_pfn))->name);
>>  goto out;
>>  }
>>  
>> -nid = pfn_to_nid(start_pfn);
>> +nid = mem->nid;
>>  default_zone = zone_for_pfn_range(MMOP_ONLINE_KEEP, nid, start_pfn, 
>> nr_pages);
>>  strcat(buf, default_zone->name);
>>  
>> -- 
>> 2.16.4
> 

Re: [PATCH v4 3/4] mm/memory_hotplug: Define nodemask_t as a stack variable

2018-08-28 Thread Pasha Tatashin


On 8/17/18 5:00 AM, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> Currently, unregister_mem_sect_under_nodes() tries to allocate a nodemask_t
> in order to check whithin the loop which nodes have already been unlinked,
> so we do not repeat the operation on them.
> 
> NODEMASK_ALLOC calls kmalloc() if NODES_SHIFT > 8, otherwise
> it just declares a nodemask_t variable whithin the stack.
> 
> Since kmalloc() can fail, we actually check whether NODEMASK_ALLOC failed
> or not, and we return -ENOMEM accordingly.
> remove_memory_section() does not check for the return value though.
> It is pretty rare that such a tiny allocation can fail, but if it does,
> we will be left with dangled symlinks under /sys/devices/system/node/,
> since the mem_blk's directories will be removed no matter what
> unregister_mem_sect_under_nodes() returns.
> 
> One way to solve this is to check whether unlinked_nodes is NULL or not.
> In the case it is not, we can just use it as before, but if it is NULL,
> we can just skip the node_test_and_set check, and call sysfs_remove_link()
> unconditionally.
> This is harmless as sysfs_remove_link() backs off somewhere down the chain
> in case the link has already been removed.
> This method was presented in v3 of the path [1].
> 
> But since the maximum number of nodes we can have is 1024,
> when NODES_SHIFT = 10, that gives us a nodemask_t of 128 bytes.
> Although everything depends on how deep the stack is, I think we can afford
> to define the nodemask_t variable whithin the stack.
> 
> That simplifies the code, and we do not need to worry about untested error
> code paths.
> 
> If we see that this causes troubles with the stack, we can always return to 
> [1].
> 
> Signed-off-by: Oscar Salvador 

LGTM:

Reviewed-by: Pavel Tatashin 

Thank you,
Pavel

Re: [PATCH 2/2] mm: zero remaining unavailable struct pages

2018-08-27 Thread Pasha Tatashin
On 8/23/18 2:25 PM, Masayoshi Mizuma wrote:
> From: Naoya Horiguchi 
> 
> There is a kernel panic that is triggered when reading /proc/kpageflags
> on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
> 
>   BUG: unable to handle kernel paging request at fffe
>   PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
>   Oops:  [#1] SMP PTI
>   CPU: 2 PID: 1728 Comm: page-types Not tainted 
> 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 
> 04/01/2014
>   RIP: 0010:stable_page_flags+0x27/0x3c0
>   Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 
> 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 
> 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
>   RSP: 0018:bbd44111fde0 EFLAGS: 00010202
>   RAX: fffe RBX: 7fffeff9 RCX: 
>   RDX: 0001 RSI: 0202 RDI: ed1182fff5c0
>   RBP:  R08: 0001 R09: 0001
>   R10: bbd44111fed8 R11:  R12: ed1182fff5c0
>   R13: 000bffd7 R14: 02fff5c0 R15: bbd44111ff10
>   FS:  7efc4335a500() GS:93a5bfc0() knlGS:
>   CS:  0010 DS:  ES:  CR0: 80050033
>   CR2: fffe CR3: b2a58000 CR4: 001406e0
>   Call Trace:
>kpageflags_read+0xc7/0x120
>proc_reg_read+0x3c/0x60
>__vfs_read+0x36/0x170
>vfs_read+0x89/0x130
>ksys_pread64+0x71/0x90
>do_syscall_64+0x5b/0x160
>entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   RIP: 0033:0x7efc42e75e23
>   Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 
> 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 
> 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
> 
> According to kernel bisection, this problem became visible due to commit
> f7f99100d8d9 which changes how struct pages are initialized.
> 
> Memblock layout affects the pfn ranges covered by node/zone. Consider
> that we have a VM with 2 NUMA nodes and each node has 4GB memory, and
> the default (no memmap= given) memblock layout is like below:
> 
>   MEMBLOCK configuration:
>memory size = 0x0001fff75c00 reserved size = 0x0300c000
>memory.cnt  = 0x4
>memory[0x0] [0x1000-0x0009efff], 
> 0x0009e000 bytes on node 0 flags: 0x0
>memory[0x1] [0x0010-0xbffd6fff], 
> 0xbfed7000 bytes on node 0 flags: 0x0
>memory[0x2] [0x0001-0x00013fff], 
> 0x4000 bytes on node 0 flags: 0x0
>memory[0x3] [0x00014000-0x00023fff], 
> 0x0001 bytes on node 1 flags: 0x0
>...
> 
> If you give memmap=1G!4G (so it just covers memory[0x2]),
> the range [0x1-0x13fff] is gone:
> 
>   MEMBLOCK configuration:
>memory size = 0x0001bff75c00 reserved size = 0x0300c000
>memory.cnt  = 0x3
>memory[0x0] [0x1000-0x0009efff], 
> 0x0009e000 bytes on node 0 flags: 0x0
>memory[0x1] [0x0010-0xbffd6fff], 
> 0xbfed7000 bytes on node 0 flags: 0x0
>memory[0x2] [0x00014000-0x00023fff], 
> 0x0001 bytes on node 1 flags: 0x0
>...
> 
> This causes shrinking node 0's pfn range because it is calculated by
> the address range of memblock.memory. So some of struct pages in the
> gap range are left uninitialized.
> 
> We have a function zero_resv_unavail() which does zeroing the struct
> pages outside memblock.memory, but currently it covers only the reserved
> unavailable range (i.e. memblock.memory && !memblock.reserved).
> This patch extends it to cover all unavailable range, which fixes
> the reported issue.
> 
> Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
> Signed-off-by: Naoya Horiguchi 
> Tested-by: Oscar Salvador 
> Tested-by: Masayoshi Mizuma 

Reviewed-by: Pavel Tatashin 

Also, please review and add the following patch to this series:

From 6d23e66e979244734a06c1b636742c2568121b39 Mon Sep 17 00:00:00 2001
From: Pavel Tatashin 
Date: Mon, 27 Aug 2018 19:10:35 -0400
Subject: [PATCH] mm: return zero_resv_unavail optimization

When checking for valid pfns in zero_resv_unavail(), it is not necessary to
verify that every pfn within a pageblock_nr_pages range is valid; only the
first one needs to be checked. This is because memory for struct pages is
allocated in contiguous chunks that contain pageblock_nr_pages struct pages.

Signed-off-by: Pavel Tatashin 
---
 mm/page_alloc.c | 46 ++
 1 file changed, 26 insertions(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 650d8f16a67e..5dfc206db40e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6441,6 +6441,29 @@ void __init free_area_init_node(int 
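
As an illustration of the idea only (a sketch, not the actual hunk, which is
truncated above), the per-range loop ends up along these lines:

		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
			/*
			 * struct pages are allocated pageblock_nr_pages at a
			 * time, so checking the first pfn of the chunk is
			 * enough to know the rest of the chunk is backed.
			 */
			if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
				continue;
			memset(pfn_to_page(pfn), 0, sizeof(struct page));
		}

i.e. only the pageblock-aligned pfn of each chunk ever needs to have its
validity established; the rest of the chunk is known to be backed.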

Re: [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"

2018-08-27 Thread Pasha Tatashin
On 8/23/18 2:25 PM, Masayoshi Mizuma wrote:
> From: Masayoshi Mizuma 
> 
> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> memblock.reserved") breaks movable_node kernel option because it
> changed the memory gap range to reserved memblock. So, the node
> is marked as Normal zone even if the SRAT has Hot plaggable affinity.
> 
> =
> kernel: BIOS-e820: [mem 0x1800-0x180f] usable
> kernel: BIOS-e820: [mem 0x1c00-0x1c0f] usable
> ...
> kernel: reserved[0x12]#011[0x1810-0x1bff], 
> 0x03f0 bytes flags: 0x0
> ...
> kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x1800-0x1bff] 
> hotplug
> kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c00-0x1fff] 
> hotplug
> ...
> kernel: Movable zone start for each node
> kernel:  Node 3: 0x1c00
> kernel: Early memory node ranges
> ...
> =
> 
> Naoya's v1 patch [*] fixes the original issue and this movable_node
> issue doesn't occur.
> Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
> regions into memblock.reserved") and apply the v1 patch.
> 
> [*] https://lkml.org/lkml/2018/6/13/27
> 
> Signed-off-by: Masayoshi Mizuma 

Reviewed-by: Pavel Tatashin 

Re: [PATCH 1/2] Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"

2018-08-27 Thread Pasha Tatashin
On Mon, Aug 27, 2018 at 8:31 AM Masayoshi Mizuma  wrote:
>
> Hi Pavel,
>
> I would appreciate if you could send the feedback for the patch.

I will study it today.

Pavel

>
> Thanks!
> Masa
>
> On 08/24/2018 04:29 AM, Michal Hocko wrote:
> > On Fri 24-08-18 00:03:25, Naoya Horiguchi wrote:
> >> (CCed related people)
> >
> > Fixup Pavel email.
> >
> >>
> >> Hi Mizuma-san,
> >>
> >> Thank you for the report.
> >> The mentioned patch was created based on feedbacks from 
> >> reviewers/maintainers,
> >> so I'd like to hear from them about how we should handle the issue.
> >>
> >> And one note is that there is a follow-up patch for "x86/e820: put 
> >> !E820_TYPE_RAM
> >> regions into memblock.reserved" which might be affected by your changes.
> >>
> >>> commit e181ae0c5db9544de9c53239eb22bc012ce75033
> >>> Author: Pavel Tatashin 
> >>> Date:   Sat Jul 14 09:15:07 2018 -0400
> >>>
> >>> mm: zero unavailable pages before memmap init
> >>
> >> Thanks,
> >> Naoya Horiguchi
> >>
> >> On Thu, Aug 23, 2018 at 02:25:12PM -0400, Masayoshi Mizuma wrote:
> >>> From: Masayoshi Mizuma 
> >>>
> >>> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> >>> memblock.reserved") breaks movable_node kernel option because it
> >>> changed the memory gap range to reserved memblock. So, the node
> >>> is marked as Normal zone even if the SRAT has Hot plaggable affinity.
> >>>
> >>> =
> >>> kernel: BIOS-e820: [mem 0x1800-0x180f] usable
> >>> kernel: BIOS-e820: [mem 0x1c00-0x1c0f] usable
> >>> ...
> >>> kernel: reserved[0x12]#011[0x1810-0x1bff], 
> >>> 0x03f0 bytes flags: 0x0
> >>> ...
> >>> kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x1800-0x1bff] 
> >>> hotplug
> >>> kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c00-0x1fff] 
> >>> hotplug
> >>> ...
> >>> kernel: Movable zone start for each node
> >>> kernel:  Node 3: 0x1c00
> >>> kernel: Early memory node ranges
> >>> ...
> >>> =
> >>>
> >>> Naoya's v1 patch [*] fixes the original issue and this movable_node
> >>> issue doesn't occur.
> >>> Let's revert commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
> >>> regions into memblock.reserved") and apply the v1 patch.
> >>>
> >>> [*] https://lkml.org/lkml/2018/6/13/27
> >>>
> >>> Signed-off-by: Masayoshi Mizuma 
> >>> ---
> >>>  arch/x86/kernel/e820.c | 15 +++
> >>>  1 file changed, 3 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> >>> index c88c23c658c1..d1f25c831447 100644
> >>> --- a/arch/x86/kernel/e820.c
> >>> +++ b/arch/x86/kernel/e820.c
> >>> @@ -1248,7 +1248,6 @@ void __init e820__memblock_setup(void)
> >>>  {
> >>> int i;
> >>> u64 end;
> >>> -   u64 addr = 0;
> >>>
> >>> /*
> >>>  * The bootstrap memblock region count maximum is 128 entries
> >>> @@ -1265,21 +1264,13 @@ void __init e820__memblock_setup(void)
> >>> struct e820_entry *entry = _table->entries[i];
> >>>
> >>> end = entry->addr + entry->size;
> >>> -   if (addr < entry->addr)
> >>> -   memblock_reserve(addr, entry->addr - addr);
> >>> -   addr = end;
> >>> if (end != (resource_size_t)end)
> >>> continue;
> >>>
> >>> -   /*
> >>> -* all !E820_TYPE_RAM ranges (including gap ranges) are put
> >>> -* into memblock.reserved to make sure that struct pages in
> >>> -* such regions are not left uninitialized after bootup.
> >>> -*/
> >>> if (entry->type != E820_TYPE_RAM && entry->type != 
> >>> E820_TYPE_RESERVED_KERN)
> >>> -   memblock_reserve(entry->addr, entry->size);
> >>> -   else
> >>> -   memblock_add(entry->addr, entry->size);
> >>> +   continue;
> >>> +
> >>> +   memblock_add(entry->addr, entry->size);
> >>> }
> >>>
> >>> /* Throw away partial pages: */
> >>> --
> >>> 2.18.0
> >>>
> >>>
> >
>

Re: [RESEND PATCH v10 6/6] mm: page_alloc: reduce unnecessary binary search in early_pfn_valid()

2018-08-16 Thread Pasha Tatashin


On 7/6/18 5:01 AM, Jia He wrote:
> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> where possible") optimized the loop in memmap_init_zone(). But there is
> still some room for improvement. E.g. in early_pfn_valid(), if pfn and
> pfn+1 are in the same memblock region, we can record the last returned
> memblock region index and check whether pfn++ is still in the same
> region.
> 
> Currently it only improve the performance on arm/arm64 and will have no
> impact on other arches.
> 
> For the performance improvement, after this set, I can see the time
> overhead of memmap_init() is reduced from 27956us to 13537us in my
> armv8a server(QDF2400 with 96G memory, pagesize 64k).

This series would be a lot simpler if patches 4, 5, and 6 were dropped.
The extra complexity does not make sense to save 0.0001s/T during boot.

Patches 1-3 look OK, but without patches 4-5 __init_memblock should be
made a local static, as I suggested earlier.

So, I think Jia should re-spin this series with only 3 patches. Or,
Andrew could remove the series from linux-next before merge.

Thank you,
Pavel

Re: [RESEND PATCH v10 3/6] mm: page_alloc: reduce unnecessary binary search in memblock_next_valid_pfn()

2018-08-16 Thread Pasha Tatashin


On 8/16/18 9:08 PM, Pavel Tatashin wrote:
> 
>> Signed-off-by: Jia He 
>> ---
>>  mm/memblock.c | 37 +
>>  1 file changed, 29 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index ccad225..84f7fa7 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -1140,31 +1140,52 @@ int __init_memblock memblock_set_node(phys_addr_t 
>> base, phys_addr_t size,
>>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>>  
>>  #ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
>> +static int early_region_idx __init_memblock = -1;
> 
> One comment:
> 
> This should be __initdata, but even better bring it inside the function
> as local static variable.

Disregard this comment, this global is used in the next commits. So,
everything is OK. No need for __initdata either.

> 
>>  ulong __init_memblock memblock_next_valid_pfn(ulong pfn)
>>  {
> 
> Otherwise looks good:
> 
> Reviewed-by: Pavel Tatashin 
> 
> 

Re: [RESEND PATCH v10 3/6] mm: page_alloc: reduce unnecessary binary search in memblock_next_valid_pfn()

2018-08-16 Thread Pasha Tatashin

> Signed-off-by: Jia He 
> ---
>  mm/memblock.c | 37 +
>  1 file changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index ccad225..84f7fa7 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1140,31 +1140,52 @@ int __init_memblock memblock_set_node(phys_addr_t 
> base, phys_addr_t size,
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>  
>  #ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
> +static int early_region_idx __init_memblock = -1;

One comment:

This should be __initdata, but even better bring it inside the function
as local static variable.

>  ulong __init_memblock memblock_next_valid_pfn(ulong pfn)
>  {

Otherwise looks good:

Reviewed-by: Pavel Tatashin 



Re: [RESEND PATCH v10 2/6] mm: page_alloc: remain memblock_next_valid_pfn() on arm/arm64

2018-08-16 Thread Pasha Tatashin
On 18-07-06 17:01:11, Jia He wrote:
> From: Jia He 
> 
> Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> where possible") optimized the loop in memmap_init_zone(). But it causes
> possible panic bug. So Daniel Vacek reverted it later.
> 
> But as suggested by Daniel Vacek, it is fine to using memblock to skip
> gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
> Daniel said:
> "On arm and arm64, memblock is used by default. But generic version of
> pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> not always return the next valid one but skips more resulting in some
> valid frames to be skipped (as if they were invalid). And that's why
> kernel was eventually crashing on some !arm machines."
> 
> About the performance consideration:
> As said by James in b92df1de5,
> "I have tested this patch on a virtual model of a Samurai CPU
> with a sparse memory map.  The kernel boot time drops from 109 to
> 62 seconds."
> 
> Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
> 
> Suggested-by: Daniel Vacek 
> Signed-off-by: Jia He 

The version of this patch in linux-next has a few fixes; I reviewed that one
and it looks good to me.

Reviewed-by: Pavel Tatashin 

Re: [RESEND PATCH v10 0/6] optimize memblock_next_valid_pfn and early_pfn_valid on arm and arm64

2018-08-16 Thread Pasha Tatashin
On 18-08-15 15:34:56, Andrew Morton wrote:
> On Fri,  6 Jul 2018 17:01:09 +0800 Jia He  wrote:
> 
> > Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
> > where possible") optimized the loop in memmap_init_zone(). But it causes
> > possible panic bug. So Daniel Vacek reverted it later.
> > 
> > But as suggested by Daniel Vacek, it is fine to using memblock to skip
> > gaps and finding next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
> > 
> > More from what Daniel said:
> > "On arm and arm64, memblock is used by default. But generic version of
> > pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
> > not always return the next valid one but skips more resulting in some
> > valid frames to be skipped (as if they were invalid). And that's why
> > kernel was eventually crashing on some !arm machines."
> > 
> > About the performance consideration:
> > As said by James in b92df1de5,
> > "I have tested this patch on a virtual model of a Samurai CPU with a
> > sparse memory map.  The kernel boot time drops from 109 to 62 seconds."
> > Thus it would be better if we remain memblock_next_valid_pfn on arm/arm64.
> > 
> > Besides we can remain memblock_next_valid_pfn, there is still some room
> > for improvement. After this set, I can see the time overhead of memmap_init
> > is reduced from 27956us to 13537us in my armv8a server(QDF2400 with 96G
> > memory, pagesize 64k). I believe arm server will benefit more if memory is
> > larger than TBs
> 
> This patchset is basically unreviewed at this stage.  Could people
> please find some time to check it carefully?

Working on it.

Pavel

Re: [PATCH v3 4/4] mm/memory_hotplug: Drop node_online check in unregister_mem_sect_under_nodes

2018-08-16 Thread Pasha Tatashin
On 18-08-15 16:42:19, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> We are getting the nid from the pages that are not yet removed,
> but a node can only be offline when its memory/cpu's have been removed.
> Therefore, we know that the node is still online.

Reviewed-by: Pavel Tatashin 

> 
> Signed-off-by: Oscar Salvador 
> ---
>  drivers/base/node.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 81b27b5b1f15..b23769e4fcbb 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -465,8 +465,6 @@ void unregister_mem_sect_under_nodes(struct memory_block 
> *mem_blk,
>  
>   if (nid < 0)
>   continue;
> - if (!node_online(nid))
> - continue;
>   /*
>* It is possible that NODEMASK_ALLOC fails due to memory
>* pressure.
> -- 
> 2.13.6
> 

Re: [PATCH v3 3/4] mm/memory_hotplug: Refactor unregister_mem_sect_under_nodes

2018-08-16 Thread Pasha Tatashin
> > d) What's the maximum number of nodes, ever?  Perhaps we can always
> >fit a nodemask_t onto the stack, dunno.
> 
> Right now, we define the maximum as NODES_SHIFT = 10, so:
> 
> 1 << 10 = 1024 Maximum nodes.
> 
> Since this makes only 128 bytes, I wonder if we can just go ahead and define 
> a nodemask_t
> within the stack.
> 128 bytes is not that much, is it?

Yeah, use the stack here, 128b is tiny. This will also address Andrew's
point about the untested path when the allocation fails, and it simplifies
the patch overall.
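
For illustration, a minimal sketch of that direction (not the actual patch):

	/* on-stack mask: no NODEMASK_ALLOC()/NODEMASK_FREE(), no -ENOMEM path */
	nodemask_t unlinked_nodes = NODE_MASK_NONE;

NODEMASK_ALLOC() only exists to keep large masks off the stack, which the
numbers above show is not a concern here.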

Thank you,
Pavel

Re: [PATCH v3 2/4] mm/memory_hotplug: Drop mem_blk check from unregister_mem_sect_under_nodes

2018-08-16 Thread Pasha Tatashin
On 18-08-15 16:42:17, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> Before calling to unregister_mem_sect_under_nodes(),
> remove_memory_section() already checks if we got a valid memory_block.
> 
> No need to check that again in unregister_mem_sect_under_nodes().
> 
> If more functions start using unregister_mem_sect_under_nodes() in the
> future, we can always place a WARN_ON to catch null mem_blk's so we can
> safely back off.
> 
> For now, let us keep the check in remove_memory_section() since it is the
> only function that uses it.
> 
> Signed-off-by: Oscar Salvador 
> Reviewed-by: Andrew Morton 

Reviewed-by: Pavel Tatashin 

> ---
>  drivers/base/node.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 1ac4c36e13bb..dd3bdab230b2 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -455,10 +455,6 @@ int unregister_mem_sect_under_nodes(struct memory_block 
> *mem_blk,
>   NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
>   unsigned long pfn, sect_start_pfn, sect_end_pfn;
>  
> - if (!mem_blk) {
> - NODEMASK_FREE(unlinked_nodes);
> - return -EFAULT;
> - }
>   if (!unlinked_nodes)
>   return -ENOMEM;
>   nodes_clear(*unlinked_nodes);
> -- 
> 2.13.6
> 

Re: [PATCH v3 1/4] mm/memory-hotplug: Drop unused args from remove_memory_section

2018-08-16 Thread Pasha Tatashin
On 18-08-15 16:42:16, Oscar Salvador wrote:
> From: Oscar Salvador 
> 
> unregister_memory_section() calls remove_memory_section()
> with three arguments:
> 
> * node_id
> * section
> * phys_device
> 
> Neither node_id nor phys_device are used.
> Let us drop them from the function.

Looks good:
Reviewed-by: Pavel Tatashin 

> 
> Signed-off-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 
> Reviewed-by: Andrew Morton 
> ---
>  drivers/base/memory.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index c8a1cb0b6136..2c622a9a7490 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -752,8 +752,7 @@ unregister_memory(struct memory_block *memory)
>   device_unregister(&memory->dev);
>  }
>  
> -static int remove_memory_section(unsigned long node_id,
> -struct mem_section *section, int phys_device)
> +static int remove_memory_section(struct mem_section *section)
>  {
>   struct memory_block *mem;
>  
> @@ -785,7 +784,7 @@ int unregister_memory_section(struct mem_section *section)
>   if (!present_section(section))
>   return -EINVAL;
>  
> - return remove_memory_section(0, section, 0);
> + return remove_memory_section(section);
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
>  
> -- 
> 2.13.6
> 

Re: [PATCH v2 3/4] mm/memory_hotplug: Make register_mem_sect_under_node a cb of walk_memory_range

2018-08-16 Thread Pasha Tatashin
On 18-06-22 13:18:38, osalva...@techadventures.net wrote:
> From: Oscar Salvador 
> 
> link_mem_sections() and walk_memory_range() share most of the code,
> so we can use convert link_mem_sections() into a dummy function that calls
> walk_memory_range() with a callback to register_mem_sect_under_node().
> 
> This patch converts register_mem_sect_under_node() in order to
> match a walk_memory_range's callback, getting rid of the
> check_nid argument and checking instead if the system is still
> boothing, since we only have to check for the nid if the system
> is in such state.
> 
> Signed-off-by: Oscar Salvador 
> Suggested-by: Pavel Tatashin 

Reviewed-by: Pavel Tatashin 
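
For readers not following the whole series, the shape of that conversion is
roughly this (a sketch against the walk_memory_range() interface of that
time; the merged patch may differ in details):

	/*
	 * link_mem_sections() becomes a thin wrapper: the per-section work is
	 * done by register_mem_sect_under_node(), which now matches the
	 * walk_memory_range() callback type and receives the nid through the
	 * opaque argument.
	 */
	static inline int link_mem_sections(int nid, unsigned long start_pfn,
					    unsigned long nr_pages)
	{
		return walk_memory_range(start_pfn, start_pfn + nr_pages,
					 (void *)&nid,
					 register_mem_sect_under_node);
	}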

Re: [PATCH v10 05/10] mm: zero reserved and unavailable struct pages

2017-10-06 Thread Pasha Tatashin

Hi Michal,

> As I've said in other reply this should go in only if the scenario you
> describe is real. I am somehow suspicious to be honest. I simply do not
> see how those weird struct pages would be in a valid pfn range of any
> zone.

There are examples of both when unavailable memory is not part of any 
zone, and where it is part of zones.


I run Linux in kvm with these arguments:

qemu-system-x86_64
-enable-kvm
-cpu kvm64
-kernel $kernel
-initrd $initrd
-m 512
-smp 2
-device e1000,netdev=net0
-netdev user,id=net0
-boot order=nc
-no-reboot
-watchdog i6300esb
-watchdog-action debug
-rtc base=localtime
-serial stdio
-display none
-monitor null

This patch reports that there are 98 unavailable pages.

They are: pfn 0 and pfns in range [159, 255].

Note, trim_low_memory_range() reserves only pfns in range [0, 15], it 
does not reserve [159, 255] ones.


e820__memblock_setup() reports linux that the following physical ranges 
are available:

[1 , 158]
[256, 130783]

Notice, that exactly unavailable pfns are missing!

Now, let's check what we have in zone 0: [1, 131039]

pfn 0 is not part of the zone, but pfns [1, 158] are.

However, the bigger problem with leaving these struct pages uninitialized
is memory hotplug. That path operates at 2M boundaries (section_nr) and
checks whether a 2M range of pages is hot removable. It starts with the
first pfn of the zone and rounds it down to a 2M boundary (struct pages
are allocated at 2M boundaries when vmemmap is created), then checks
whether that section is hot removable. In this case we start with pfn 1
and round it down to pfn 0.
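
In code terms the rounding is roughly this (an illustrative fragment, not
the exact hotplug code; on x86_64 with 4K pages one 2M vmemmap block covers
exactly one memory section worth of struct pages):

	unsigned long pfn = zone->zone_start_pfn;	/* 1 in this example */
	unsigned long start = round_down(pfn, PAGES_PER_SECTION);	/* 0 */
	struct page *page = pfn_to_page(start);
	/*
	 * page now points at the struct page of pfn 0, which e820 never gave
	 * us; unless it was zeroed it holds whatever was left in that part of
	 * the vmemmap.
	 */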


Later the pfn is converted to a struct page and some of its fields are
checked. If we do not zero struct pages, we get unpredictable results.
In fact, with CONFIG_DEBUG_VM enabled and all vmemmap memory explicitly
set to ones, I get the following panic in a kernel test without this
patch applied:


[   23.277793] BUG: unable to handle kernel NULL pointer dereference at 
 (null)

[   23.278863] IP: is_pageblock_removable_nolock+0x35/0x90
[   23.279619] PGD 0 P4D 0
[   23.280031] Oops:  [#1] PREEMPT
[   23.280527] CPU: 0 PID: 249 Comm: udevd Not tainted 
4.14.0-rc3_pt_memset10-00335-g5e2c7478bed5-dirty #8

[   23.281735] task: 88001f4e2900 task.stack: c9314000
[   23.282532] RIP: 0010:is_pageblock_removable_nolock+0x35/0x90
[   23.283275] RSP: 0018:c9317d60 EFLAGS: 00010202
[   23.283948] RAX:  RBX: 88001d92b000 RCX: 

[   23.284862] RDX:  RSI: 0020 RDI: 
88001d92b000
[   23.285771] RBP: c9317d80 R08: 10c8 R09: 

[   23.286542] R10:  R11:  R12: 
88001db2b000
[   23.287264] R13: 81af6d00 R14: 88001f7d5000 R15: 
82a1b6c0
[   23.287971] FS:  7f4eb857f7c0() GS:81c27000() 
knlGS:

[   23.288775] CS:  0010 DS:  ES:  CR0: 80050033
[   23.289355] CR2:  CR3: 1f4e6000 CR4: 
06b0

[   23.290066] Call Trace:
[   23.290323]  ? is_mem_section_removable+0x5a/0xd0
[   23.290798]  show_mem_removable+0x6b/0xa0
[   23.291204]  dev_attr_show+0x1b/0x50
[   23.291565]  sysfs_kf_seq_show+0xa1/0x100
[   23.291967]  kernfs_seq_show+0x22/0x30
[   23.292349]  seq_read+0x1ac/0x3a0
[   23.292687]  kernfs_fop_read+0x36/0x190
[   23.293074]  ? security_file_permission+0x90/0xb0
[   23.293547]  __vfs_read+0x16/0x30
[   23.293884]  vfs_read+0x81/0x130
[   23.294214]  SyS_read+0x44/0xa0
[   23.294537]  entry_SYSCALL_64_fastpath+0x1f/0xbd
[   23.295003] RIP: 0033:0x7f4eb7c660a0
[   23.295364] RSP: 002b:7ffda6cffe28 EFLAGS: 0246 ORIG_RAX: 

[   23.296152] RAX: ffda RBX: 03de RCX: 
7f4eb7c660a0
[   23.296934] RDX: 1000 RSI: 7ffda6cffec8 RDI: 
0005
[   23.297963] RBP: 7ffda6cffde8 R08: 7379732f73656369 R09: 
6f6d656d2f6d6574
[   23.299198] R10: 726f6d656d2f7972 R11: 0246 R12: 
0022
[   23.300400] R13: 561d68ea7710 R14:  R15: 
7ffda6d05c78
[   23.301591] Code: c1 ea 35 49 c1 e8 2b 48 8b 14 d5 c0 b6 a1 82 41 83 
e0 03 48 85 d2 74 0c 48 c1 e8 29 25 f0 0f 00 00 48 01 c2 4d 69 c0 98 
05 00 00 <48> 8b 02 48 89 fa 48 83 e0 f8 49 8b 88 28 b5 d3 81 48 29 c2 49
[   23.304739] RIP: is_pageblock_removable_nolock+0x35/0x90 RSP: 
c9317d60

[   23.305940] CR2: 

