Fwd: trinity test fanotify cause hungtasks on kernel 4.13

2017-07-27 Thread Gu Zheng


Hi all,
When we used trinity to test the fanotify interfaces, it caused many hung tasks.
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
The shell script is simple:
#!/bin/bash

while true
do
    ./trinity -c fanotify_init -l off -C 2 -X > /dev/null 2>&1 &
    sleep 1
    ./trinity -c fanotify_mark -l off -C 2 -X > /dev/null 2>&1 &
    sleep 10
done
We found that trinity enters the D state quickly.
We checked the stuck tasks' stacks:
[root@localhost ~]# ps -aux | grep D
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root   977  0.0  0.0 207992  7904 ?Ss   15:23   0:00 
/usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible 
recursive locking detected ernel BUG at list_del corruption list_add corruption 
do_IRQ: stack overflow: ear stack overflow (cur: eneral protection fault nable 
to handle kernel ouble fault: RTNL: assertion failed eek! page_mapcount(page) 
went negative! adness at NETDEV WATCHDOG ysctl table check failed : nobody 
cared IRQ handler type mismatch Machine Check Exception: Machine check events 
logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment 
not present: invalid opcode: alignment check: stack segment: fpu exception: 
simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops 
-xtD
root   997  0.0  0.0 203360  3188 ?Ssl  15:23   0:00 
/usr/sbin/gssproxy -D
root  1549  0.0  0.0  82552  6012 ?Ss   15:23   0:00 /usr/sbin/sshd 
-D
root  2807  3.5  0.2  59740 35416 pts/0DL   15:24   0:00 ./trinity -c 
fanotify_init -l off -C 2 -X
root  2809  3.1  0.2  53712 35332 pts/0DL   15:24   0:00 ./trinity -c 
fanotify_mark -l off -C 2 -X
root  2915  0.0  0.0 136948  1776 pts/0D15:24   0:00 ps ax
root  2919  0.0  0.0 112656  2100 pts/1S+   15:24   0:00 grep 
--color=auto D
[root@localhost ~]# cat /proc/2807/stack
[] fanotify_handle_event+0x2a1/0x2f0
[] fsnotify+0x2d3/0x4f0
[] security_file_open+0x89/0x90
[] do_dentry_open+0x139/0x330
[] vfs_open+0x4f/0x70
[] path_openat+0x548/0x1350
[] do_filp_open+0x91/0x100
[] do_sys_open+0x124/0x210
[] SyS_open+0x1e/0x20
[] do_syscall_64+0x67/0x150
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0x

[root@localhost ~]# cat /proc/2915/stack
[] fanotify_handle_event+0x2a1/0x2f0
[] fsnotify+0x2d3/0x4f0
[] security_file_open+0x89/0x90
[] do_dentry_open+0x139/0x330
[] vfs_open+0x4f/0x70
[] path_openat+0x548/0x1350
[] do_filp_open+0x91/0x100
[] do_sys_open+0x124/0x210
[] SyS_open+0x1e/0x20
[] do_syscall_64+0x67/0x150
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0x
[root@localhost ~]# cat /proc/2809/stack
[] fanotify_handle_event+0x2a1/0x2f0
[] fsnotify+0x2d3/0x4f0
[] security_file_open+0x89/0x90
[] do_dentry_open+0x139/0x330
[] vfs_open+0x4f/0x70
[] path_openat+0x548/0x1350
[] do_filp_open+0x91/0x100
[] do_sys_open+0x124/0x210
[] SyS_open+0x1e/0x20
[] do_syscall_64+0x67/0x150
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0x

All these processes are waiting for a response in
fanotify_handle_event() -> fanotify_get_response(). Because the monitor does
not respond or has been killed, the wait queue stays blocked, and every other
task that reaches fanotify_get_response() gets stuck as well.
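For illustration (not part of the original report), this is roughly what a
permission-mode listener has to do in user space; if the loop below stops
reading or replying, every open() under the mark blocks in
fanotify_get_response() exactly as in the stacks above:

    /* Hypothetical minimal monitor: watch the mount containing /tmp for
     * FAN_OPEN_PERM events and allow every open.  Error handling trimmed. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/fanotify.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        int fd = fanotify_init(FAN_CLASS_CONTENT, O_RDONLY);

        if (fd < 0 || fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
                                    FAN_OPEN_PERM, AT_FDCWD, "/tmp") < 0) {
            perror("fanotify");
            return 1;
        }

        for (;;) {
            ssize_t len = read(fd, buf, sizeof(buf));
            struct fanotify_event_metadata *md = (void *)buf;

            while (len > 0 && FAN_EVENT_OK(md, len)) {
                if (md->mask & FAN_OPEN_PERM) {
                    struct fanotify_response resp = {
                        .fd = md->fd,
                        .response = FAN_ALLOW,
                    };
                    /* If this write never happens (monitor killed, hung, or
                     * simply not reading), the opener stays in D state
                     * inside fanotify_get_response(). */
                    write(fd, &resp, sizeof(resp));
                }
                close(md->fd);
                md = FAN_EVENT_NEXT(md, len);
            }
        }
    }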



If we switch to wait_event_timeout(), there is no way to guarantee how long the monitor needs to respond.

Do you have any ideas?
Thanks.






Re: trinity test fanotify cause hungtasks on kernel 4.13

2017-07-27 Thread Gu Zheng

Hi all,
Sorry, disabling CONFIG_FANOTIFY_ACCESS_PERMISSIONS makes the problem go away.
It is triggered by the permission checks added through fanotify_mark().

On 2017/7/27 17:55, Gu Zheng wrote:

If we disable CONFIG_FANOTIFY_ACCESS_PERMISSIONS, memory is consumed quickly
instead, because the fsnotify_mark_srcu read lock is always held.




trinity test fanotify cause hungtasks on kernel 4.13

2017-07-27 Thread Gu Zheng

Hi Eric Paris,
When we used trinity to test the fanotify interfaces, it caused many hung tasks.
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
The shell script is simple:
#!/bin/bash

while true
do
    ./trinity -c fanotify_init -l off -C 2 -X > /dev/null 2>&1 &
    sleep 1
    ./trinity -c fanotify_mark -l off -C 2 -X > /dev/null 2>&1 &
    sleep 10
done
We found that trinity enters the D state quickly.
We checked the stuck tasks' stacks:
[root@localhost ~]# ps -aux | grep D
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root   977  0.0  0.0 207992  7904 ?Ss   15:23   0:00 
/usr/bin/abrt-watch-log -F BUG: WARNING: at WARNING: CPU: INFO: possible 
recursive locking detected ernel BUG at list_del corruption list_add corruption 
do_IRQ: stack overflow: ear stack overflow (cur: eneral protection fault nable 
to handle kernel ouble fault: RTNL: assertion failed eek! page_mapcount(page) 
went negative! adness at NETDEV WATCHDOG ysctl table check failed : nobody 
cared IRQ handler type mismatch Machine Check Exception: Machine check events 
logged divide error: bounds: coprocessor segment overrun: invalid TSS: segment 
not present: invalid opcode: alignment check: stack segment: fpu exception: 
simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops 
-xtD
root   997  0.0  0.0 203360  3188 ?Ssl  15:23   0:00 
/usr/sbin/gssproxy -D
root  1549  0.0  0.0  82552  6012 ?Ss   15:23   0:00 /usr/sbin/sshd 
-D
root  2807  3.5  0.2  59740 35416 pts/0DL   15:24   0:00 ./trinity -c 
fanotify_init -l off -C 2 -X
root  2809  3.1  0.2  53712 35332 pts/0DL   15:24   0:00 ./trinity -c 
fanotify_mark -l off -C 2 -X
root  2915  0.0  0.0 136948  1776 pts/0D15:24   0:00 ps ax
root  2919  0.0  0.0 112656  2100 pts/1S+   15:24   0:00 grep 
--color=auto D
[root@localhost ~]# cat /proc/2807/stack
[] fanotify_handle_event+0x2a1/0x2f0
[] fsnotify+0x2d3/0x4f0
[] security_file_open+0x89/0x90
[] do_dentry_open+0x139/0x330
[] vfs_open+0x4f/0x70
[] path_openat+0x548/0x1350
[] do_filp_open+0x91/0x100
[] do_sys_open+0x124/0x210
[] SyS_open+0x1e/0x20
[] do_syscall_64+0x67/0x150
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0x

[root@localhost ~]# cat /proc/2915/stack
[] fanotify_handle_event+0x2a1/0x2f0
[] fsnotify+0x2d3/0x4f0
[] security_file_open+0x89/0x90
[] do_dentry_open+0x139/0x330
[] vfs_open+0x4f/0x70
[] path_openat+0x548/0x1350
[] do_filp_open+0x91/0x100
[] do_sys_open+0x124/0x210
[] SyS_open+0x1e/0x20
[] do_syscall_64+0x67/0x150
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0x
[root@localhost ~]# cat /proc/2809/stack
[] fanotify_handle_event+0x2a1/0x2f0
[] fsnotify+0x2d3/0x4f0
[] security_file_open+0x89/0x90
[] do_dentry_open+0x139/0x330
[] vfs_open+0x4f/0x70
[] path_openat+0x548/0x1350
[] do_filp_open+0x91/0x100
[] do_sys_open+0x124/0x210
[] SyS_open+0x1e/0x20
[] do_syscall_64+0x67/0x150
[] entry_SYSCALL64_slow_path+0x25/0x25
[] 0x

All these tasks wait for a response in
fanotify_handle_event() -> fanotify_get_response(), but the monitor cannot
reply, either because it lacks permission to do so or because it has been
killed. Then every other task that uses fanotify or synchronize_srcu() gets
stuck as well.
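
A heavily simplified view of the 4.13-era code path (paraphrased, not literal
kernel source) shows why one unanswered permission event pins everything
behind it:

    fsnotify()
        idx = srcu_read_lock(&fsnotify_mark_srcu);   /* read-side critical section */
        send_to_group()
            fanotify_handle_event()
                fanotify_get_response()
                    wait_event(group->fanotify_data.access_waitq,
                               event->response);     /* sleeps until the monitor
                                                        writes FAN_ALLOW/FAN_DENY */
        srcu_read_unlock(&fsnotify_mark_srcu, idx);

While the waiter sleeps, the SRCU read section never ends, so
synchronize_srcu(&fsnotify_mark_srcu), run on mark removal and group teardown,
blocks behind it, and so does everything queued after that.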

If we disable CONFIG_FANOTIFY_ACCESS_PERMISSIONS, memory is consumed quickly
instead, because the fsnotify_mark_srcu read lock is always held.

If we add a timeout, safety cannot be guaranteed.

Do you have any ideas?
Thanks.




[PATCH] mtd:avoid blktrans_open/release race and avoid insmod ftl.ko deadlock

2017-03-16 Thread Gu Zheng
This modification fixes the issue addressed in commit
857814ee65dbc942b18b2dc713124043035e ("mtd: fix: avoid race condition
when accessing mtd->usecount"), and also solves the deadlock that happens
when insmoding ftl.ko. The original call path is as follows:
init_ftl
  register_mtd_blktrans
    mutex_lock(&mtd_table_mutex)        // mtd_table_mutex locked
    ftl_add_mtd
      add_mtd_blktrans_dev
        device_add_disk
          register_disk
            blkdev_get
              __blkdev_get
                blktrans_open
                  mutex_lock(&mtd_table_mutex)      // deadlock

This patch also helps catch mtd_table_mutex locking races that have not yet
been discovered.

Signed-off-by: Gu Zheng <guzhe...@huawei.com>
---
 drivers/mtd/mtd_blkdevs.c | 31 
 drivers/mtd/mtdcore.c | 74 +++
 drivers/mtd/mtdcore.h |  4 ++-
 3 files changed, 71 insertions(+), 38 deletions(-)

diff --git a/drivers/mtd/mtd_blkdevs.c b/drivers/mtd/mtd_blkdevs.c
index 6b8d5cd..c194208 100644
--- a/drivers/mtd/mtd_blkdevs.c
+++ b/drivers/mtd/mtd_blkdevs.c
@@ -191,7 +191,7 @@ static int blktrans_open(struct block_device *bdev, fmode_t 
mode)
if (!dev)
return -ERESTARTSYS; /* FIXME: busy loop! -arnd*/
 
-   mutex_lock(&mtd_table_mutex);
+   mtd_table_mutex_lock();
mutex_lock(&dev->lock);
 
if (dev->open)
@@ -217,7 +217,7 @@ static int blktrans_open(struct block_device *bdev, fmode_t 
mode)
 unlock:
dev->open++;
mutex_unlock(&dev->lock);
-   mutex_unlock(&mtd_table_mutex);
+   mtd_table_mutex_unlock();
blktrans_dev_put(dev);
return ret;
 
@@ -228,7 +228,7 @@ static int blktrans_open(struct block_device *bdev, fmode_t 
mode)
module_put(dev->tr->owner);
kref_put(&dev->ref, blktrans_dev_release);
mutex_unlock(&dev->lock);
-   mutex_unlock(&mtd_table_mutex);
+   mtd_table_mutex_unlock();
blktrans_dev_put(dev);
return ret;
 }
@@ -240,7 +240,7 @@ static void blktrans_release(struct gendisk *disk, fmode_t 
mode)
if (!dev)
return;
 
-   mutex_lock(&mtd_table_mutex);
+   mtd_table_mutex_lock();
mutex_lock(&dev->lock);
 
if (--dev->open)
@@ -256,7 +256,7 @@ static void blktrans_release(struct gendisk *disk, fmode_t 
mode)
}
 unlock:
mutex_unlock(&dev->lock);
-   mutex_unlock(&mtd_table_mutex);
+   mtd_table_mutex_unlock();
blktrans_dev_put(dev);
 }
 
@@ -323,10 +323,7 @@ int add_mtd_blktrans_dev(struct mtd_blktrans_dev *new)
struct gendisk *gd;
int ret;
 
-   if (mutex_trylock(&mtd_table_mutex)) {
-   mutex_unlock(&mtd_table_mutex);
-   BUG();
-   }
+   mtd_table_assert_mutex_locked();
 
mutex_lock(&blktrans_ref_mutex);
list_for_each_entry(d, &tr->devs, list) {
@@ -455,11 +452,7 @@ int del_mtd_blktrans_dev(struct mtd_blktrans_dev *old)
 {
unsigned long flags;
 
-   if (mutex_trylock(&mtd_table_mutex)) {
-   mutex_unlock(&mtd_table_mutex);
-   BUG();
-   }
-
+   mtd_table_assert_mutex_locked();
if (old->disk_attributes)
sysfs_remove_group(&disk_to_dev(old->disk)->kobj,
old->disk_attributes);
@@ -531,13 +524,13 @@ int register_mtd_blktrans(struct mtd_blktrans_ops *tr)
register_mtd_user(&blktrans_notifier);
 
 
-   mutex_lock(&mtd_table_mutex);
+   mtd_table_mutex_lock();
 
ret = register_blkdev(tr->major, tr->name);
if (ret < 0) {
printk(KERN_WARNING "Unable to register %s block device on 
major %d: %d\n",
   tr->name, tr->major, ret);
-   mutex_unlock(&mtd_table_mutex);
+   mtd_table_mutex_unlock();
return ret;
}
 
@@ -553,7 +546,7 @@ int register_mtd_blktrans(struct mtd_blktrans_ops *tr)
if (mtd->type != MTD_ABSENT)
tr->add_mtd(tr, mtd);
 
-   mutex_unlock(&mtd_table_mutex);
+   mtd_table_mutex_unlock();
return 0;
 }
 
@@ -561,7 +554,7 @@ int deregister_mtd_blktrans(struct mtd_blktrans_ops *tr)
 {
struct mtd_blktrans_dev *dev, *next;
 
-   mutex_lock(&mtd_table_mutex);
+   mtd_table_mutex_lock();
 
/* Remove it from the list of active majors */
list_del(&tr->list);
@@ -570,7 +563,7 @@ int deregister_mtd_blktrans(struct mtd_blktrans_ops *tr)
tr->remove_dev(dev);
 
unregister_blkdev(tr->major, tr->name);
-   mutex_unlock(&mtd_table_mutex);
+   mtd_table_mutex_unlock();
 
BUG_ON(!list_empty(&tr->devs));
return 0;
diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 66a9ded..f3d5470 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -84,6 +84,8 @@ static DEFINE_IDR(mtd_idr);
should not use them for _anything_ else */
 DEFINE_MUTEX(mtd_table_mutex);
 EXPORT_SYMBOL_GPL(mtd_table_mutex);
+int mtd_table_mutex_depth;
+struct task_struct *mtd_table_mutex_owner;
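
The mtdcore.c hunk is cut off here in the archive. Judging from the two new
fields above and the mtd_table_mutex_lock()/mtd_table_mutex_unlock()/
mtd_table_assert_mutex_locked() calls used in mtd_blkdevs.c, the helpers are
presumably an owner-aware recursive wrapper around mtd_table_mutex, along
these lines (a sketch, not the verbatim patch):

    void mtd_table_mutex_lock(void)
    {
        /* Take the mutex only on first entry by this task. */
        if (mtd_table_mutex_owner != current) {
            mutex_lock(&mtd_table_mutex);
            mtd_table_mutex_owner = current;
        }
        mtd_table_mutex_depth++;
    }

    void mtd_table_mutex_unlock(void)
    {
        if (WARN_ON_ONCE(mtd_table_mutex_owner != current))
            return;
        /* Release only when the outermost lock call unwinds. */
        if (--mtd_table_mutex_depth == 0) {
            mtd_table_mutex_owner = NULL;
            mutex_unlock(&mtd_table_mutex);
        }
    }

    void mtd_table_assert_mutex_locked(void)
    {
        WARN_ON_ONCE(mtd_table_mutex_owner != current);
    }

With such a wrapper, register_mtd_blktrans() can keep the table locked while
blktrans_open(), reached again on the same task via device_add_disk(), takes
it recursively instead of deadlocking.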

[PATCH] tmpfs: clear S_ISGID when setting posix ACLs

2017-01-08 Thread Gu Zheng
The tmpfs case was missed in the CVE-2016-7097 fix, commit
073931017b49 ("posix_acl: Clear SGID bit when setting
file permissions").
It can be tested with xfstest generic/375, which fails to clear the
setgid bit in the following test case on tmpfs:

  touch $testfile
  chown 100:100 $testfile
  chmod 2755 $testfile
  _runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile

Signed-off-by: Gu Zheng <guzhe...@huawei.com>
---
 fs/posix_acl.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 5955220..d014dff 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -922,11 +922,10 @@ int simple_set_acl(struct inode *inode, struct posix_acl 
*acl, int type)
int error;
 
if (type == ACL_TYPE_ACCESS) {
-   error = posix_acl_equiv_mode(acl, &inode->i_mode);
-   if (error < 0)
-   return 0;
-   if (error == 0)
-   acl = NULL;
+   error = posix_acl_update_mode(inode,
+   &inode->i_mode, &acl);
+   if (error)
+   return error;
}
 
inode->i_ctime = current_time(inode);
-- 
2.5.0



Re: [PATCH] tmpfs: clear S_ISGID when setting posix ACLs

2017-01-08 Thread Gu Zheng

Thanks, I will update it.

On 2017/1/6 18:10, Jan Kara wrote:

On Fri 06-01-17 16:12:55, Gu Zheng wrote:

The tmpfs case was missed in the CVE-2016-7097 fix, commit
073931017b49d9458aa351605b43a7e34598caef
("posix_acl: Clear SGID bit when setting file permissions").
It can be tested with xfstest generic/375, which fails to clear the
setgid bit in the following test case on tmpfs:

   touch $testfile
   chown 100:100 $testfile
   chmod 2755 $testfile
   _runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile

Signed-off-by: Gu Zheng <guzhe...@huawei.com>


Ah, good catch. One comment below:


diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 5955220..d014dff 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -922,11 +922,10 @@ int simple_set_acl(struct inode *inode, struct posix_acl 
*acl, int type)
int error;

if (type == ACL_TYPE_ACCESS) {
-   error = posix_acl_equiv_mode(acl, &inode->i_mode);
-   if (error < 0)
-   return 0;
-   if (error == 0)
-   acl = NULL;
+   error = posix_acl_update_mode(inode,
+   &inode->i_mode, &acl);
+   if (error > 0)
+   return error;


Uh, why this error > 0 check? AFAIU it should be:

if (error < 0)
return 0;

As it used to be before...

Honza
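
For reference, the version reposted above in this archive (dated 2017-01-08)
ends up with:

    if (type == ACL_TYPE_ACCESS) {
        error = posix_acl_update_mode(inode,
                &inode->i_mode, &acl);
        if (error)
            return error;
    }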





[PATCH] tmpfs: clear S_ISGID when setting posix ACLs

2017-01-06 Thread Gu Zheng
The tmpfs case was missed in the CVE-2016-7097 fix, commit
073931017b49d9458aa351605b43a7e34598caef
("posix_acl: Clear SGID bit when setting file permissions").
It can be tested with xfstest generic/375, which fails to clear the
setgid bit in the following test case on tmpfs:

  touch $testfile
  chown 100:100 $testfile
  chmod 2755 $testfile
  _runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile

Signed-off-by: Gu Zheng <guzhe...@huawei.com>
---
 fs/posix_acl.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 5955220..d014dff 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -922,11 +922,10 @@ int simple_set_acl(struct inode *inode, struct posix_acl 
*acl, int type)
int error;
 
if (type == ACL_TYPE_ACCESS) {
-   error = posix_acl_equiv_mode(acl, &inode->i_mode);
-   if (error < 0)
-   return 0;
-   if (error == 0)
-   acl = NULL;
+   error = posix_acl_update_mode(inode,
+   &inode->i_mode, &acl);
+   if (error > 0)
+   return error;
}
 
inode->i_ctime = current_time(inode);
-- 
2.5.0



[tip:x86/apic] x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping

2016-09-22 Thread tip-bot for Gu Zheng
Commit-ID:  8f54969dc8d6704632b42cbb5e47730cd75cc713
Gitweb: http://git.kernel.org/tip/8f54969dc8d6704632b42cbb5e47730cd75cc713
Author: Gu Zheng <guz.f...@cn.fujitsu.com>
AuthorDate: Thu, 25 Aug 2016 16:35:16 +0800
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Wed, 21 Sep 2016 21:18:38 +0200

x86/acpi: Introduce persistent storage for cpuid <-> apicid mapping

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 2.

In this patch, we introduce a new static array named cpuid_to_apicid[],
which is large enough to store info for all possible cpus.

And then, we modify the cpuid calculation. In generic_processor_info(),
it simply finds the next unused cpuid. And it is also why the cpuid <-> nodeid
mapping changes with node hotplug.

After this patch, we find the next unused cpuid, map it to an apicid,
and store the mapping in cpuid_to_apicid[], so that cpuid <-> apicid
mapping will be persistent.

And finally we will use this array to make cpuid <-> nodeid persistent.

cpuid <-> apicid mapping is established at local apic registeration time.
But non-present or disabled cpus are ignored.

In this patch, we establish all possible cpuid <-> apicid mapping when
registering local apic.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
Signed-off-by: Dou Liyang <douly.f...@cn.fujitsu.com>
Acked-by: Ingo Molnar <mi...@kernel.org>
Cc: mika.j.pentt...@gmail.com
Cc: len.br...@intel.com
Cc: raf...@kernel.org
Cc: r...@rjwysocki.net
Cc: yasu.isim...@gmail.com
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: isimatu.yasu...@jp.fujitsu.com
Cc: gongzhaog...@inspur.com
Cc: t...@kernel.org
Cc: izumi.t...@jp.fujitsu.com
Cc: c...@linux.com
Cc: chen.t...@easystack.cn
Cc: a...@linux-foundation.org
Cc: kamezawa.hir...@jp.fujitsu.com
Cc: l...@kernel.org
Link: 
http://lkml.kernel.org/r/1472114120-3281-4-git-send-email-douly.f...@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>

---
 arch/x86/include/asm/mpspec.h |  1 +
 arch/x86/kernel/acpi/boot.c   |  7 +
 arch/x86/kernel/apic/apic.c   | 60 ---
 3 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index c2f94dc..3200704 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 0447e31..7d668d1 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -176,15 +176,10 @@ static int acpi_register_lapic(int id, u32 acpiid, u8 
enabled)
return -EINVAL;
}
 
-   if (!enabled) {
-   ++disabled_cpus;
-   return -EINVAL;
-   }
-
if (boot_cpu_physical_apicid != -1U)
ver = boot_cpu_apic_version;
 
-   cpu = generic_processor_info(id, ver);
+   cpu = __generic_processor_info(id, ver, enabled);
if (cpu >= 0)
early_per_cpu(x86_cpu_to_acpiid, cpu) = acpiid;
 
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index a8c94bb..2dc01c3 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2021,7 +2021,53 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-static int __generic_processor_info(int apicid, int version, bool enabled)
+/*
+ * The number of allocated logical CPU IDs. Since logical CPU IDs are allocated
+ * contiguously, it equals to current allocated max logical CPU ID plus 1.
+ * All allocated CPU ID should be in [0, nr_logical_cpuidi), so the maximum of
+ * nr_logical_cpuids is nr_cpu_ids.
+ *
+ * NOTE: Reserve 0 for BSP.
+ */
+static int nr_logical_cpuids = 1;
+
+/*
+ * Used to store mapping between logical CPU IDs and APIC IDs.
+ */
+static int cpuid_to_apicid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
+ * and cpuid_to_apicid[] synchronized.
+ */
+static int allocate_logical_cpuid(int apicid)
+{
+   int i;
+
+   /*
+* cpuid <-> apicid mapping is persistent, so when a cpu is up,
+* check if the kernel has allocated a cpuid for it.
+*/
+   for (i = 0
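
The hunk is cut off here in the archive. Based on the description above
(reuse the cpuid already recorded for a known apicid, otherwise hand out the
next free cpuid and record the pair), allocate_logical_cpuid() presumably
continues roughly as follows; a sketch, not the verbatim commit:

    for (i = 0; i < nr_logical_cpuids; i++) {
        if (cpuid_to_apicid[i] == apicid)
            return i;
    }

    /* Allocate a new cpuid. */
    if (nr_logical_cpuids >= nr_cpu_ids) {
        WARN_ONCE(1, "APIC: CPU limit of %i reached, ignoring APIC ID 0x%x\n",
                  nr_cpu_ids, apicid);
        return -EINVAL;
    }

    cpuid_to_apicid[nr_logical_cpuids] = apicid;
    return nr_logical_cpuids++;
}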

[tip:x86/apic] x86/acpi: Enable acpi to register all possible cpus at boot time

2016-09-22 Thread tip-bot for Gu Zheng
Commit-ID:  f7c28833c252031bc68a29e26a18a661797cf3a3
Gitweb: http://git.kernel.org/tip/f7c28833c252031bc68a29e26a18a661797cf3a3
Author: Gu Zheng <guz.f...@cn.fujitsu.com>
AuthorDate: Thu, 25 Aug 2016 16:35:15 +0800
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Wed, 21 Sep 2016 21:18:38 +0200

x86/acpi: Enable acpi to register all possible cpus at boot time

cpuid <-> nodeid mapping is firstly established at boot time. And workqueue 
caches
the mapping in wq_numa_possible_cpumask in wq_numa_init() at boot time.

When doing node online/offline, cpuid <-> nodeid mapping is 
established/destroyed,
which means, cpuid <-> nodeid mapping will change if node hotplug happens. But
workqueue does not update wq_numa_possible_cpumask.

So here is the problem:

Assume we have the following cpuid <-> nodeid in the beginning:

  Node | CPU


node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 2 | 30-44, 90-104
node 3 | 45-59, 105-119

and we hot-remove node2 and node3, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89

and we hot-add node4 and node5, it becomes:

  Node | CPU

node 0 |  0-14, 60-74
node 1 | 15-29, 75-89
node 4 | 30-59
node 5 | 90-119

But in wq_numa_possible_cpumask, cpu30 is still mapped to node2, and the like.

When a pool workqueue is initialized, if its cpumask belongs to a node, its
pool->node will be mapped to that node. And memory used by this workqueue will
also be allocated on that node.

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs 
*attrs){
...
/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
for_each_node(node) {
if (cpumask_subset(pool->attrs->cpumask,
   wq_numa_possible_cpumask[node])) {
pool->node = node;
break;
}
}
}

Since wq_numa_possible_cpumask is not updated, it could be mapped to an offline 
node,
which will lead to memory allocation failure:

 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min 
order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656

It happens here:

create_worker(struct worker_pool *pool)
 |--> worker = alloc_worker(pool->node);

static struct worker *alloc_worker(int node)
{
struct worker *worker;

worker = kzalloc_node(sizeof(*worker), GFP_KERNEL, node); --> Here, 
useing the wrong node.

..

return worker;
}

[Solution]

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm
2. apicid (physical cpu id)   <->   nodeid
3. cpuid (logical cpu id) <->   apicid
4. cpuid (logical cpu id) <->   nodeid

1. pxm (proximity domain) is provided by ACPI firmware in SRAT, and nodeid <-> 
pxm
   mapping is setup at boot time. This mapping is persistent, won't change.

2. apicid <-> nodeid mapping is setup using info in 1. The mapping is setup at 
boot
   time and CPU hotadd time, and cleared at CPU hotremove time. This mapping is 
also
   persistent.

3. cpuid <-> apicid mapping is setup at boot time and CPU hotadd time. cpuid is
   allocated, lower ids first, and released at CPU hotremove time, reused for 
other
   hotadded CPUs. So this mapping is not persistent.

4. cpuid <-> nodeid mapping is also setup at boot time and CPU hotadd time, and
   cleared at CPU hotremove time. As a result of 3, this mapping is not 
persistent.

To fix this problem, we establish cpuid <-> nodeid mapping for all the possible
cpus at boot time, and make it persistent. And according to init_cpu_to_node(),
cpuid <-> nodeid mapping is based on apicid <-> nodeid mapping and cpuid <-> 
apicid
mapping. So the key point is obtaining all cpus' apicid.

apicid can be obtained by _MAT (Multiple APIC Table Entry) method or found in
MADT (Multiple APIC Description Table). So we finish the job in the following 
steps:

1. Enable apic registeration flow to handle both enabled and disabled cpus.
   This is done by introducing an extra parameter to generic_processor_info to 
let the
   caller control if disabled cpus are ignored.

2. Introduce a new array storing all possible cpuid <-> apicid mapping. And 
also modify
   the way cpuid is calculated. Establish all possible cpuid <-> apicid mapping 
when
   registering local apic. Store the mapping in this array.

3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' 
apicid.
   This is also done by introducing an extra parameter to these apis to let the 
caller
   control if disabled cpus are ignored.

4. Establish all possible cpuid <-> nodeid mapping.
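
Step 1 above, as implemented by the companion patch quoted earlier in this
archive (see the mpspec.h hunk declaring __generic_processor_info()),
effectively splits the old entry point into a thin wrapper plus a helper that
takes the extra 'enabled' flag, roughly:

    int generic_processor_info(int apicid, int version)
    {
        /* Existing callers keep ignoring disabled CPUs. */
        return __generic_processor_info(apicid, version, true);
    }

while acpi_register_lapic() calls __generic_processor_info(id, ver, enabled)
directly, so disabled (hot-pluggable) CPUs also get a persistent
cpuid <-> apicid slot.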

[tip:x86/apic] x86/acpi: Enable MADT APIs to return disabled apicids

2016-09-22 Thread tip-bot for Gu Zheng
Commit-ID:  8ad893faf2eaedb710a3073afbb5d569df2c3e41
Gitweb: http://git.kernel.org/tip/8ad893faf2eaedb710a3073afbb5d569df2c3e41
Author: Gu Zheng <guz.f...@cn.fujitsu.com>
AuthorDate: Thu, 25 Aug 2016 16:35:17 +0800
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Wed, 21 Sep 2016 21:18:39 +0200

x86/acpi: Enable MADT APIs to return disabled apicids

The whole patch-set aims at making cpuid <-> nodeid mapping persistent. So that,
when node online/offline happens, cache based on cpuid <-> nodeid mapping such 
as
wq_numa_possible_cpumask will not cause any problem.
It contains 4 steps:
1. Enable apic registeration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mapping.
3. Enable _MAT and MADT relative apis to return non-present or disabled cpus' 
apicid.
4. Establish all possible cpuid <-> nodeid mapping.

This patch finishes step 3.

There are four mappings in the kernel:
1. nodeid (logical node id)   <->   pxm(persistent)
2. apicid (physical cpu id)   <->   nodeid (persistent)
3. cpuid (logical cpu id) <->   apicid (not persistent, now persistent 
by step 2)
4. cpuid (logical cpu id) <->   nodeid (not persistent)

So, in order to setup persistent cpuid <-> nodeid mapping for all possible CPUs,
we should:
1. Setup cpuid <-> apicid mapping for all possible CPUs, which has been done in 
step 1, 2.
2. Setup cpuid <-> nodeid mapping for all possible CPUs. But before that, we 
should
   obtain all apicids from MADT.

All processors' apicids can be obtained by _MAT method or from MADT in ACPI.
The current code ignores disabled processors and returns -ENODEV.

After this patch, a new parameter will be added to MADT APIs so that caller
is able to control if disabled processors are ignored.

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
Signed-off-by: Dou Liyang <douly.f...@cn.fujitsu.com>
Acked-by: Ingo Molnar <mi...@kernel.org>
Cc: mika.j.pentt...@gmail.com
Cc: len.br...@intel.com
Cc: raf...@kernel.org
Cc: r...@rjwysocki.net
Cc: yasu.isim...@gmail.com
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: isimatu.yasu...@jp.fujitsu.com
Cc: gongzhaog...@inspur.com
Cc: t...@kernel.org
Cc: izumi.t...@jp.fujitsu.com
Cc: c...@linux.com
Cc: chen.t...@easystack.cn
Cc: a...@linux-foundation.org
Cc: kamezawa.hir...@jp.fujitsu.com
Cc: l...@kernel.org
Link: 
http://lkml.kernel.org/r/1472114120-3281-5-git-send-email-douly.f...@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>

---
 drivers/acpi/acpi_processor.c |  5 +++-
 drivers/acpi/processor_core.c | 60 +++
 2 files changed, 42 insertions(+), 23 deletions(-)

diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index c7ba948..02b84aa 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -300,8 +300,11 @@ static int acpi_processor_get_info(struct acpi_device 
*device)
 *  Extra Processor objects may be enumerated on MP systems with
 *  less than the max # of CPUs. They should be ignored _iff
 *  they are physically not present.
+*
+*  NOTE: Even if the processor has a cpuid, it may not be present
+*  because cpuid <-> apicid mapping is persistent now.
 */
-   if (invalid_logical_cpuid(pr->id)) {
+   if (invalid_logical_cpuid(pr->id) || !cpu_present(pr->id)) {
int ret = acpi_processor_hotadd_init(pr);
if (ret)
return ret;
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index 9125d7d..fd59ae8 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -32,12 +32,12 @@ static struct acpi_table_madt *get_madt_table(void)
 }
 
 static int map_lapic_id(struct acpi_subtable_header *entry,
-u32 acpi_id, phys_cpuid_t *apic_id)
+u32 acpi_id, phys_cpuid_t *apic_id, bool ignore_disabled)
 {
struct acpi_madt_local_apic *lapic =
container_of(entry, struct acpi_madt_local_apic, header);
 
-   if (!(lapic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(lapic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
 
if (lapic->processor_id != acpi_id)
@@ -48,12 +48,13 @@ static int map_lapic_id(struct acpi_subtable_header *entry,
 }
 
 static int map_x2apic_id(struct acpi_subtable_header *entry,
-   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id)
+   int device_declaration, u32 acpi_id, phys_cpuid_t *apic_id,
+   bool ignore_disabled)
 {
struct acpi_madt_local_x2apic *apic =
container_of(entry, struct acpi_madt_local_x2apic, header);

-   if (!(apic->lapic_flags & ACPI_MADT_ENABLED))
+   if (ignore_disabled && !(apic->lapic_flags & ACPI_MADT_ENABLED))
return -ENODEV;
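
The diff is cut off at this point in the archive. Mirroring map_lapic_id()
above, the rest of map_x2apic_id() presumably keeps its existing body and only
the enabled check changes, roughly:

    if (device_declaration && (apic->uid == acpi_id)) {
        *apic_id = apic->local_apic_id;
        return 0;
    }

    return -EINVAL;
}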

[tip:x86/apic] x86/acpi: Set persistent cpuid <-> nodeid mapping when booting

2016-09-22 Thread tip-bot for Gu Zheng
Commit-ID:  dc6db24d2476cd09c0ecf2b8d80313539f737a89
Gitweb: http://git.kernel.org/tip/dc6db24d2476cd09c0ecf2b8d80313539f737a89
Author: Gu Zheng <guz.f...@cn.fujitsu.com>
AuthorDate: Thu, 25 Aug 2016 16:35:18 +0800
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Wed, 21 Sep 2016 21:18:39 +0200

x86/acpi: Set persistent cpuid <-> nodeid mapping when booting

The whole patch-set aims at making the cpuid <-> nodeid mapping persistent, so
that when node online/offline happens, caches based on the cpuid <-> nodeid
mapping, such as wq_numa_possible_cpumask, will not cause any problem.
It contains 4 steps:
1. Enable the apic registration flow to handle both enabled and disabled cpus.
2. Introduce a new array storing all possible cpuid <-> apicid mappings.
3. Enable the _MAT and MADT related APIs to return non-present or disabled cpus'
   apicid.
4. Establish all possible cpuid <-> nodeid mappings.

This patch finishes step 4.

This patch sets the persistent cpuid <-> nodeid mapping for all enabled/disabled
processors at boot time via an additional ACPI namespace walk for processors.
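As an editor's sketch of what that extra boot-time walk roughly looks like (the
callback body and the function names here are illustrative only;
acpi_walk_namespace() and acpi_map_cpu2node() are the real interfaces involved):

	static acpi_status __init set_mapping_cb(acpi_handle handle, u32 level,
						 void *ctx, void **retval)
	{
		int cpu = 0, physid = 0; /* resolved from _MAT/MADT in the real
					  * code, now including disabled CPUs */

		/* Pin the cpu <-> node relation while walking the firmware tables. */
		acpi_map_cpu2node(handle, cpu, physid);
		return AE_OK;
	}

	static void __init walk_all_processors(void)
	{
		acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
				    ACPI_UINT32_MAX, set_mapping_cb,
				    NULL, NULL, NULL);
	}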

[ tglx: Remove the unneeded exports ]

Signed-off-by: Gu Zheng <guz.f...@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangc...@cn.fujitsu.com>
Signed-off-by: Zhu Guihua <zhugh.f...@cn.fujitsu.com>
Signed-off-by: Dou Liyang <douly.f...@cn.fujitsu.com>
Acked-by: Ingo Molnar <mi...@kernel.org>
Cc: mika.j.pentt...@gmail.com
Cc: len.br...@intel.com
Cc: raf...@kernel.org
Cc: r...@rjwysocki.net
Cc: yasu.isim...@gmail.com
Cc: linux...@kvack.org
Cc: linux-a...@vger.kernel.org
Cc: isimatu.yasu...@jp.fujitsu.com
Cc: gongzhaog...@inspur.com
Cc: t...@kernel.org
Cc: izumi.t...@jp.fujitsu.com
Cc: c...@linux.com
Cc: chen.t...@easystack.cn
Cc: a...@linux-foundation.org
Cc: kamezawa.hir...@jp.fujitsu.com
Cc: l...@kernel.org
Link: 
http://lkml.kernel.org/r/1472114120-3281-6-git-send-email-douly.f...@cn.fujitsu.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>

---
 arch/ia64/kernel/acpi.c   |  2 +-
 arch/x86/kernel/acpi/boot.c   |  3 +-
 drivers/acpi/acpi_processor.c |  5 
 drivers/acpi/bus.c|  1 +
 drivers/acpi/processor_core.c | 68 +++
 include/linux/acpi.h  |  3 ++
 6 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 92b7bc9..9273e03 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 7d668d1..fc88410 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -702,7 +702,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include 
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
@@ -713,6 +713,7 @@ static void acpi_map_cpu2node(acpi_handle handle, int cpu, 
int physid)
numa_set_node(cpu, nid);
}
 #endif
+   return 0;
 }
 
 int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, int *pcpu)
diff --git a/drivers/acpi/acpi_processor.c b/drivers/acpi/acpi_processor.c
index 02b84aa..f9f23fd 100644
--- a/drivers/acpi/acpi_processor.c
+++ b/drivers/acpi/acpi_processor.c
@@ -182,6 +182,11 @@ int __weak arch_register_cpu(int cpu)
 
 void __weak arch_unregister_cpu(int cpu) {}
 
+int __weak acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+{
+   return -ENODEV;
+}
+
 static int acpi_processor_hotadd_init(struct acpi_processor *pr)
 {
unsigned long long sta;
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index 85b7d07..a760dac 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -1193,6 +1193,7 @@ static int __init acpi_init(void)
acpi_wakeup_device_init();
acpi_debugger_init();
acpi_setup_sb_notify_handler();
+   acpi_set_processor_mapping();
return 0;
 }
 
diff --git a/drivers/acpi/processor_core.c b/drivers/acpi/processor_core.c
index fd59ae8..8801976 100644
--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -280,6 +280,74 @@ int acpi_get_cpuid(acpi_handle handle, int type, u32 
acpi_id)
 }
 EXPORT_SYMBOL_GPL(acpi_get_cpuid);
 
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+static bool __init
+map_processor(acpi_handle handle, phys_cpuid_t *phys_id, int *cpuid)
+{
+   int type;
+   u32 acpi_id;
+   acpi_status status;
+   acpi_object_type acpi_type;
+   unsigned long long tmp;
+   union acpi_o

[PATCH V1] x86, espfix: postpone the initialization of espfix stack for AP

2015-06-04 Thread Gu Zheng
The following lockdep warning occurs when running with the latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [] dump_stack+0x4c/0x65
[3.255000]  [] warn_slowpath_common+0x8a/0xc0
[3.261000]  [] warn_slowpath_fmt+0x55/0x70
[3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [] alloc_page_interleave+0x3a/0x90
[3.295000]  [] alloc_pages_current+0x17d/0x1a0
[3.301000]  [] ? __get_free_pages+0xe/0x50
[3.308000]  [] __get_free_pages+0xe/0x50
[3.314000]  [] init_espfix_ap+0x17b/0x320
[3.32]  [] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

We allocate pages with GFP_KERNEL in init_espfix_ap(), which is called before
local irqs are enabled, so the lockdep sub-system treats this as allocating
memory with GFP_FS while local irqs are disabled and triggers the warning
shown above.

We could allocate the pages on the boot CPU side and hand them over to the
secondary CPU, but that seems a bit wasteful if some of the cpus are offline.
As there is no need for these pages (the espfix stack) until we try to run
user code, postpone the initialization of the espfix stack and let the boot-up
routine init the espfix stack for the target cpu after it has booted, to avoid
the noise.
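A minimal sketch of the node-aware allocation pattern the diff below uses
(editor's illustration; PGALLOC_GFP, cpu_to_node() and alloc_pages_node() are
existing kernel symbols, the local variables are arbitrary):

	int node = cpu_to_node(cpu);	/* cpu = the AP being brought up */
	struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
	pmd_t *pmd_p = (pmd_t *)page_address(page);
	/* The page table page now lives on the target CPU's node instead of
	 * whichever node the booting context happened to run on. */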

Signed-off-by: Gu Zheng 
---
v1:
  Alloc the page on the node the target CPU is on.
RFC:
  Let the boot up routine init the espfix stack for the target cpu after it
  booted.
---
---
 arch/x86/include/asm/espfix.h |2 +-
 arch/x86/kernel/espfix_64.c   |   28 
 arch/x86/kernel/smpboot.c |   14 +++---
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..ca3ce9a 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
 extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
 
 #endif /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..e397583 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,25 +131,24 @@ void __init init_espfix_bsp(void)
init_espfix_random();
 
/* The rest is the same as for any other processor */
-   init_espfix_ap();
+   init_espfix_ap(0);
 }
 
-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
 {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
pte_t pte, *pte_p;
-   int n;
+   int n, node;
void *stack_page;
pteval_t ptemask;
 
/* We only have to do this once... */
-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
 
-   cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
 
@@ -165,12 +164,15 @@ void init_espfix_ap(void)
if (stack_page)
goto unlock_done;
 
+   node = cpu_to_node(cpu);
ptemask = __supported_pte_mask;
 
pud_p = &espfix_pud_page[pud_index(addr)];
pud = *pud_p;
if (!pud_present(pud)) {
-   pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pmd_p = (pmd_t *)page_address(page);
pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pmd(&init_mm, __pa(pmd_p) >> PAGE_SHIFT);
for (n = 0; n < ESPFIX_PUD_CLONES; n++)
@@ -180,7 +182,9 @@ void init_espfix_ap(void)
pmd_p = pmd_offset(&pud, addr);
pmd = *pmd_p;
if (!pmd_present(pmd)) {
-   pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);
+   struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+   pte_p = (pte_t *)page_address(page);
pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask));
paravirt_alloc_pte(&init_mm, __pa(pte_p) >> PAGE_SHIFT);
for (n = 0

Re: [RFC PATCH V2] x86, espfix: postpone the initialization of espfix stack for AP

2015-06-03 Thread Gu Zheng
Hi Ingo,

On 06/02/2015 07:59 PM, Ingo Molnar wrote:

> 
> * Gu Zheng  wrote:
> 
>> The following lockdep warning occurrs when running with latest kernel:
>> [3.178000] [ cut here ]
>> [3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
>> lockdep_trace_alloc+0xdd/0xe0()
>> [3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
>> [3.199000] Modules linked in:
>>
>> [3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
>> [3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
>> 81773f0a
>> [3.23]   88086fd5fc90 88086fd5fc78 
>> 8108c85a
>> [3.238000]  88086fd6 0092 88086fd6 
>> 00d0
>> [3.246000] Call Trace:
>> [3.249000]  [] dump_stack+0x4c/0x65
>> [3.255000]  [] warn_slowpath_common+0x8a/0xc0
>> [3.261000]  [] warn_slowpath_fmt+0x55/0x70
>> [3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
>> [3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
>> [3.281000]  [] ? __lock_acquire+0xf6d/0x1560
>> [3.288000]  [] alloc_page_interleave+0x3a/0x90
>> [3.295000]  [] alloc_pages_current+0x17d/0x1a0
>> [3.301000]  [] ? __get_free_pages+0xe/0x50
>> [3.308000]  [] __get_free_pages+0xe/0x50
>> [3.314000]  [] init_espfix_ap+0x17b/0x320
>> [3.32]  [] start_secondary+0xf1/0x1f0
>> [3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
>>
>> This seems a mis-warning by lockdep, as we alloc pages with GFP_KERNEL in 
>> init_espfix_ap() which is called before enabled local irq, and the lockdep 
>> sub-system considers this behaviour as allocating memory with GFP_FS with 
>> local 
>> irq disabled, then trigger the warning as mentioned about.
> 
> Why should this be a 'mis-warning'? If the GFP_KERNEL allocation sleeps then 
> we'll 
> sleep with irqs disabled => bad.
> 
> This looks like a real (albeit hard to trigger) bug.


You are right.
Thanks for correcting me; I misread the log.

> 
>> Though we could allocate them on the boot CPU side and hand them over to the 
>> secondary CPU, but it seemes a bit waste if some of cpus are offline. As 
>> thers 
>> is no need to these pages(espfix stack) until we try to run user code, so we 
>> postpone the initialization of espfix stack after cpu booted to avoid the 
>> noise.
> 
>> -void init_espfix_ap(void)
>> +void init_espfix_ap(int cpu)
>>  {
> 
> So how about the concern I raised in a former thread, that the allocation 
> should 
> be done for the node the target CPU is on? The 'cpu' parameter should be 
> propagated to the allocation as well, and turned into a node allocation or so.
> 
> Even though some CPUs will share the espfix stack, some won't.


Hmm, sounds reasonable.

Regards,
Gu

> 
> Thanks,
> 
>   Ingo
> .
> 




[RFC PATCH V2] x86, espfix: postpone the initialization of espfix stack for AP

2015-06-02 Thread Gu Zheng
The following lockdep warning occurs when running with the latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [] dump_stack+0x4c/0x65
[3.255000]  [] warn_slowpath_common+0x8a/0xc0
[3.261000]  [] warn_slowpath_fmt+0x55/0x70
[3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [] alloc_page_interleave+0x3a/0x90
[3.295000]  [] alloc_pages_current+0x17d/0x1a0
[3.301000]  [] ? __get_free_pages+0xe/0x50
[3.308000]  [] __get_free_pages+0xe/0x50
[3.314000]  [] init_espfix_ap+0x17b/0x320
[3.32]  [] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

This seems to be a false warning from lockdep: we allocate pages with
GFP_KERNEL in init_espfix_ap(), which is called before local irqs are enabled,
and the lockdep sub-system treats this as allocating memory with GFP_FS while
local irqs are disabled, so it triggers the warning shown above.

We could allocate the pages on the boot CPU side and hand them over to the
secondary CPU, but that seems a bit wasteful if some of the cpus are offline.
As there is no need for these pages (the espfix stack) until we try to run
user code, postpone the initialization of the espfix stack until after the
cpu has booted, to avoid the noise.


Signed-off-by: Gu Zheng 
---
v2:
  Let the boot up routine init the espfix stack for the target cpu after it
  booted.
---
 arch/x86/include/asm/espfix.h |2 +-
 arch/x86/kernel/espfix_64.c   |   15 +++
 arch/x86/kernel/smpboot.c |   14 +++---
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/espfix.h b/arch/x86/include/asm/espfix.h
index 99efebb..b074c4f 100644
--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -9,7 +9,7 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DECLARE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
 extern void init_espfix_bsp(void);
-extern void init_espfix_ap(void);
+extern void init_espfix_ap(int cpu);
 
 #endif /* CONFIG_X86_64 */
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..37a4404 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -131,12 +131,12 @@ void __init init_espfix_bsp(void)
init_espfix_random();
 
/* The rest is the same as for any other processor */
-   init_espfix_ap();
+   init_espfix_ap(0);
 }
 
-void init_espfix_ap(void)
+void init_espfix_ap(int cpu)
 {
-   unsigned int cpu, page;
+   unsigned int page;
unsigned long addr;
pud_t pud, *pud_p;
pmd_t pmd, *pmd_p;
@@ -146,10 +146,9 @@ void init_espfix_ap(void)
pteval_t ptemask;
 
/* We only have to do this once... */
-   if (likely(this_cpu_read(espfix_stack)))
+   if (likely(per_cpu(espfix_stack, cpu)))
return; /* Already initialized */
 
-   cpu = smp_processor_id();
addr = espfix_base_addr(cpu);
page = cpu/ESPFIX_STACKS_PER_PAGE;
 
@@ -199,7 +198,7 @@ void init_espfix_ap(void)
 unlock_done:
mutex_unlock(&espfix_init_mutex);
 done:
-   this_cpu_write(espfix_stack, addr);
-   this_cpu_write(espfix_waddr, (unsigned long)stack_page
-  + (addr & ~PAGE_MASK));
+   per_cpu(espfix_stack, cpu) = addr;
+   per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
+  + (addr & ~PAGE_MASK);
 }
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 50e547e..e9fdd0e 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -240,13 +240,6 @@ static void notrace start_secondary(void *unused)
check_tsc_sync_target();
 
/*
-* Enable the espfix hack for this CPU
-*/
-#ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap();
-#endif
-
-   /*
 * We need to hold vector_lock so there the set of online cpus
 * does not change while we are assigning vectors to cpus.  Holding
 * this lock ensures we don't half assign or remove an irq from a cpu.
@@ -901,6 +894,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
task_struct *idle)
}
}
 
+   /*
+* Enable the espfix hack for this CPU
+*/
+#ifdef CONFIG_X86_ESPFIX64
+   init_espfix_ap(cpu);
+#endif
+
/* mar

Re: [RFC PATCH] x86, espfix: postpone the initialization of espfix stack for AP

2015-06-02 Thread Gu Zheng
Hi Andy,

Sorry for the late reply.
On 05/29/2015 09:07 AM, Andy Lutomirski wrote:

> On Wed, May 27, 2015 at 6:20 PM, Gu Zheng  wrote:
>> ping...
>>
>> On 05/22/2015 06:13 PM, Gu Zheng wrote:
>>
>>> The following lockdep warning occurs when running with 4.1.0-rc3:
>>> [3.178000] [ cut here ]
>>> [3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
>>> lockdep_trace_alloc+0xdd/0xe0()
>>> [3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
>>> [3.199000] Modules linked in:
>>>
>>> [3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
>>> [3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
>>> 81773f0a
>>> [3.23]   88086fd5fc90 88086fd5fc78 
>>> 8108c85a
>>> [3.238000]  88086fd6 0092 88086fd6 
>>> 00d0
>>> [3.246000] Call Trace:
>>> [3.249000]  [] dump_stack+0x4c/0x65
>>> [3.255000]  [] warn_slowpath_common+0x8a/0xc0
>>> [3.261000]  [] warn_slowpath_fmt+0x55/0x70
>>> [3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
>>> [3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
>>> [3.281000]  [] ? __lock_acquire+0xf6d/0x1560
>>> [3.288000]  [] alloc_page_interleave+0x3a/0x90
>>> [3.295000]  [] alloc_pages_current+0x17d/0x1a0
>>> [3.301000]  [] ? __get_free_pages+0xe/0x50
>>> [3.308000]  [] __get_free_pages+0xe/0x50
>>> [3.314000]  [] init_espfix_ap+0x17b/0x320
>>> [3.32]  [] start_secondary+0xf1/0x1f0
>>> [3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
>>>
>>> This seems a mis-warning by lockdep, as we alloc pages with GFP_KERNEL in
>>> init_espfix_ap() which is called before enabled local irq, and the lockdep
>>> sub-system considers this behaviour as allocating memory with GFP_FS with
>>> local irq disabled, then trigger the warning as mentioned about.
>>>
>>> Though we could allocate them on the boot CPU side and hand them over to
>>> the secondary CPU, but it seems a waste if some of cpus are still offline.
>>> As there is no need to these pages(espfix stack) until we try to run user
>>> code, so we can postpone the initialization of espfix stack after cpu
>>> booted to avoid the noise.
> 
> Does this pass the sigreturn_32 test on both 32-bit and 64-bit kernels
> and sigreturn_64 test on 64-bit kernels?  (The test is in
> tools/testing/selftests/x86.)  If so, looks good to me.

It failed the test.
There is a bug in this patch: it allocates the espfix stack in the do_boot_cpu()
routine, not in the context of the target cpu that we want to boot up, so the
simple change is wrong here.
I will send the v2 version soon; it passes the tests you mentioned above.

Thanks again for your comments and suggestion.

Regards,
Gu
 

> 
> --Andy
> 
>>>
>>> Signed-off-by: Gu Zheng 
>>> ---
>>>  arch/x86/kernel/smpboot.c | 14 +++---
>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
>>> index 50e547e..3ce05de 100644
>>> --- a/arch/x86/kernel/smpboot.c
>>> +++ b/arch/x86/kernel/smpboot.c
>>> @@ -240,13 +240,6 @@ static void notrace start_secondary(void *unused)
>>>   check_tsc_sync_target();
>>>
>>>   /*
>>> -  * Enable the espfix hack for this CPU
>>> -  */
>>> -#ifdef CONFIG_X86_ESPFIX64
>>> - init_espfix_ap();
>>> -#endif
>>> -
>>> - /*
>>>* We need to hold vector_lock so there the set of online cpus
>>>* does not change while we are assigning vectors to cpus.  Holding
>>>* this lock ensures we don't half assign or remove an irq from a cpu.
>>> @@ -901,6 +894,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
>>> task_struct *idle)
>>>   }
>>>   }
>>>
>>> + /*
>>> +  * Enable the espfix hack for this CPU
>>> +  */
>>> +#ifdef CONFIG_X86_ESPFIX64
>>> + init_espfix_ap();
>>> +#endif
>>> +
>>>   /* mark "stuck" area as not stuck */
>>>   *trampoline_status = 0;
>>>
>>
>>
> 
> 
> 


Re: [RFC PATCH] x86, espfix: postpone the initialization of espfix stack for AP

2015-05-28 Thread Gu Zheng
Hi Andy,

On 05/29/2015 09:07 AM, Andy Lutomirski wrote:

> On Wed, May 27, 2015 at 6:20 PM, Gu Zheng  wrote:
>> ping...
>>
>> On 05/22/2015 06:13 PM, Gu Zheng wrote:
>>
>>> The following lockdep warning occurs when running with 4.1.0-rc3:
>>> [3.178000] [ cut here ]
>>> [3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
>>> lockdep_trace_alloc+0xdd/0xe0()
>>> [3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
>>> [3.199000] Modules linked in:
>>>
>>> [3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
>>> [3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
>>> 81773f0a
>>> [3.23]   88086fd5fc90 88086fd5fc78 
>>> 8108c85a
>>> [3.238000]  88086fd6 0092 88086fd6 
>>> 00d0
>>> [3.246000] Call Trace:
>>> [3.249000]  [] dump_stack+0x4c/0x65
>>> [3.255000]  [] warn_slowpath_common+0x8a/0xc0
>>> [3.261000]  [] warn_slowpath_fmt+0x55/0x70
>>> [3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
>>> [3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
>>> [3.281000]  [] ? __lock_acquire+0xf6d/0x1560
>>> [3.288000]  [] alloc_page_interleave+0x3a/0x90
>>> [3.295000]  [] alloc_pages_current+0x17d/0x1a0
>>> [3.301000]  [] ? __get_free_pages+0xe/0x50
>>> [3.308000]  [] __get_free_pages+0xe/0x50
>>> [3.314000]  [] init_espfix_ap+0x17b/0x320
>>> [3.32]  [] start_secondary+0xf1/0x1f0
>>> [3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
>>>
>>> This seems a mis-warning by lockdep, as we alloc pages with GFP_KERNEL in
>>> init_espfix_ap() which is called before enabled local irq, and the lockdep
>>> sub-system considers this behaviour as allocating memory with GFP_FS with
>>> local irq disabled, then trigger the warning as mentioned about.
>>>
>>> Though we could allocate them on the boot CPU side and hand them over to
>>> the secondary CPU, but it seems a waste if some of cpus are still offline.
>>> As there is no need to these pages(espfix stack) until we try to run user
>>> code, so we can postpone the initialization of espfix stack after cpu
>>> booted to avoid the noise.
> 
> Does this pass the sigreturn_32 test on both 32-bit and 64-bit kernels
> and sigreturn_64 test on 64-bit kernels?  (The test is in
> tools/testing/selftests/x86.)  If so, looks good to me.

To be honest, I forgot this part, will do it soon.
Thanks for your reminder.

Regards,
Gu

> 
> --Andy
> 
>>>
>>> Signed-off-by: Gu Zheng 
>>> ---
>>>  arch/x86/kernel/smpboot.c | 14 +++---
>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
>>> index 50e547e..3ce05de 100644
>>> --- a/arch/x86/kernel/smpboot.c
>>> +++ b/arch/x86/kernel/smpboot.c
>>> @@ -240,13 +240,6 @@ static void notrace start_secondary(void *unused)
>>>   check_tsc_sync_target();
>>>
>>>   /*
>>> -  * Enable the espfix hack for this CPU
>>> -  */
>>> -#ifdef CONFIG_X86_ESPFIX64
>>> - init_espfix_ap();
>>> -#endif
>>> -
>>> - /*
>>>* We need to hold vector_lock so there the set of online cpus
>>>* does not change while we are assigning vectors to cpus.  Holding
>>>* this lock ensures we don't half assign or remove an irq from a cpu.
>>> @@ -901,6 +894,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
>>> task_struct *idle)
>>>   }
>>>   }
>>>
>>> + /*
>>> +  * Enable the espfix hack for this CPU
>>> +  */
>>> +#ifdef CONFIG_X86_ESPFIX64
>>> + init_espfix_ap();
>>> +#endif
>>> +
>>>   /* mark "stuck" area as not stuck */
>>>   *trampoline_status = 0;
>>>
>>
>>
> 
> 
> 



Re: [RFC PATCH] x86, espfix: postpone the initialization of espfix stack for AP

2015-05-27 Thread Gu Zheng
ping...

On 05/22/2015 06:13 PM, Gu Zheng wrote:

> The following lockdep warning occurs when running with 4.1.0-rc3:
> [3.178000] [ cut here ]
> [3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
> lockdep_trace_alloc+0xdd/0xe0()
> [3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
> [3.199000] Modules linked in:
> 
> [3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
> [3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
> 81773f0a
> [3.23]   88086fd5fc90 88086fd5fc78 
> 8108c85a
> [3.238000]  88086fd6 0092 88086fd6 
> 00d0
> [3.246000] Call Trace:
> [3.249000]  [] dump_stack+0x4c/0x65
> [3.255000]  [] warn_slowpath_common+0x8a/0xc0
> [3.261000]  [] warn_slowpath_fmt+0x55/0x70
> [3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
> [3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
> [3.281000]  [] ? __lock_acquire+0xf6d/0x1560
> [3.288000]  [] alloc_page_interleave+0x3a/0x90
> [3.295000]  [] alloc_pages_current+0x17d/0x1a0
> [3.301000]  [] ? __get_free_pages+0xe/0x50
> [3.308000]  [] __get_free_pages+0xe/0x50
> [3.314000]  [] init_espfix_ap+0x17b/0x320
> [3.32]  [] start_secondary+0xf1/0x1f0
> [3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
> 
> This seems a mis-warning by lockdep, as we alloc pages with GFP_KERNEL in
> init_espfix_ap() which is called before enabled local irq, and the lockdep
> sub-system considers this behaviour as allocating memory with GFP_FS with
> local irq disabled, then trigger the warning as mentioned about.
> 
> Though we could allocate them on the boot CPU side and hand them over to
> the secondary CPU, but it seems a waste if some of cpus are still offline.
> As there is no need to these pages(espfix stack) until we try to run user
> code, so we can postpone the initialization of espfix stack after cpu
> booted to avoid the noise.
> 
> Signed-off-by: Gu Zheng 
> ---
>  arch/x86/kernel/smpboot.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 50e547e..3ce05de 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -240,13 +240,6 @@ static void notrace start_secondary(void *unused)
>   check_tsc_sync_target();
>  
>   /*
> -  * Enable the espfix hack for this CPU
> -  */
> -#ifdef CONFIG_X86_ESPFIX64
> - init_espfix_ap();
> -#endif
> -
> - /*
>* We need to hold vector_lock so there the set of online cpus
>* does not change while we are assigning vectors to cpus.  Holding
>* this lock ensures we don't half assign or remove an irq from a cpu.
> @@ -901,6 +894,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
> task_struct *idle)
>   }
>   }
>  
> + /*
> +  * Enable the espfix hack for this CPU
> +  */
> +#ifdef CONFIG_X86_ESPFIX64
> + init_espfix_ap();
> +#endif
> +
>   /* mark "stuck" area as not stuck */
>   *trampoline_status = 0;
>  




[PATCH] mm/memory_hotplug: set zone->wait_table to null after free it

2015-05-27 Thread Gu Zheng
Izumi found the following oops when hot re-adding a node:
[ 1481.759192] BUG: unable to handle kernel paging request at c90008963690
[ 1481.760192] IP: [] __wake_up_bit+0x20/0x70
[ 1481.770098] PGD 86e919067 PUD 207cf003067 PMD 20796d3b067 PTE 0
[ 1481.770098] Oops:  [#1] SMP
[ 1481.770098] CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
[ 1481.770098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 
Series BIOS Version 1.87 04/28/2015
[ 1481.770098] task: 880838df8000 ti: 880017b94000 task.ti: 
880017b94000
[ 1481.770098] RIP: 0010:[]  [] 
__wake_up_bit+0x20/0x70
[ 1481.770098] RSP: 0018:880017b97be8  EFLAGS: 00010246
[ 1481.770098] RAX: c90008963690 RBX: 003c RCX: a4c9
[ 1481.770098] RDX:  RSI: ea101bffd500 RDI: c90008963648
[ 1481.770098] RBP: 880017b97c08 R08: 0220 R09: 
[ 1481.770098] R10:  R11:  R12: 8a0797c73800
[ 1481.770098] R13: ea101bffd500 R14: 0001 R15: 003c
[ 1481.770098] FS:  7fcc7700() GS:88087480() 
knlGS:
[ 1481.770098] CS:  0010 DS:  ES:  CR0: 80050033
[ 1481.770098] CR2: c90008963690 CR3: 000836761000 CR4: 001407e0
[ 1481.770098] Stack:
[ 1481.770098]  8a0797c73800 ea10 1000 
69c53212
[ 1481.770098]  880017b97c18 811c2a5d 880017b97c68 
8128a0e3
[ 1481.770098]  0001 00281bffd500 003c 
0028
[ 1481.770098] Call Trace:
[ 1481.770098]  [] unlock_page+0x6d/0x70
[ 1481.770098]  [] generic_write_end+0x53/0xb0
[ 1481.770098]  [] xfs_vm_write_end+0x29/0x80 [xfs]
[ 1481.770098]  [] generic_perform_write+0x10a/0x1e0
[ 1481.770098]  [] xfs_file_buffered_aio_write+0x14d/0x3e0 
[xfs]
[ 1481.770098]  [] xfs_file_write_iter+0x79/0x120 [xfs]
[ 1481.770098]  [] __vfs_write+0xd4/0x110
[ 1481.770098]  [] vfs_write+0xac/0x1c0
[ 1481.770098]  [] SyS_write+0x58/0xd0
[ 1481.770098]  [] system_call_fastpath+0x12/0x76
[ 1481.770098] Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5
 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47
 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
[ 1481.770098] RIP  [] __wake_up_bit+0x20/0x70
[ 1481.770098]  RSP 
[ 1481.770098] CR2: c90008963690
[ 1481.770098] ---[ end trace 25c9882ad3f72923 ]---
[ 1481.770098] Kernel panic - not syncing: Fatal exception
[ 1481.770098] Kernel Offset: disabled
[ 1481.770098] drm_kms_helper: panic occurred, switching back to text console
[ 1481.770098] ---[ end Kernel panic - not syncing: Fatal exception

Reproduce method (re-add a node):
Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)

This is a use-after-free problem, and the root cause is that zone->wait_table
was not set to *NULL* after it was freed in try_offline_node().

When hot re-adding a node, we reuse its pgdat, and the zone structs with it.
When pages are added to the target zone, the zone is initialized first
(including the wait_table) if it is not already initialized.
The "zone initialized" check is based on zone->wait_table:
static inline bool zone_is_initialized(struct zone *zone)
{
	return !!zone->wait_table;
}
So if we do not set zone->wait_table to *NULL* after freeing it, the memory
hotplug routine will skip the init of the new zone when the node is hot
re-added, and the wait_table will still point to the freed memory. We then
access an invalid address when trying to wake up the waiters after I/O on a
page completes, as in the oops above.
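A minimal user-space sketch (editor's illustration with simplified types and
sizes, not kernel code) of why clearing the pointer matters: the "already
initialized" test is just a pointer check, so a freed-but-not-cleared table
makes re-initialization a silent no-op.

	#include <stdlib.h>

	struct fake_zone { void *wait_table; };

	static void offline_node(struct fake_zone *z)
	{
		free(z->wait_table);
		z->wait_table = NULL;	/* the one-line fix below; drop this and... */
	}

	static void online_node(struct fake_zone *z)
	{
		if (!z->wait_table)	/* mirrors zone_is_initialized() */
			z->wait_table = malloc(4096);
		/* ...re-add skips this branch, leaving wait_table pointing at
		 * freed memory that later wake-ups dereference. */
	}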

Reported-by: Taku Izumi 
Cc: Stable 
Signed-off-by: Gu Zheng 
---
 mm/memory_hotplug.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 457bde5..9e88f74 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1969,8 +1969,10 @@ void try_offline_node(int nid)
 * wait_table may be allocated from boot memory,
 * here only free if it's allocated by vmalloc.
 */
-   if (is_vmalloc_addr(zone->wait_table))
+   if (is_vmalloc_addr(zone->wait_table)) {
vfree(zone->wait_table);
+   zone->wait_table = NULL;
+   }
}
 }
 EXPORT_SYMBOL(try_offline_node);
-- 
1.8.3.1


Re: [RFC PATCH] x86, espfix: postpone the initialization of espfix stack for AP

2015-05-27 Thread Gu Zheng
ping...

On 05/22/2015 06:13 PM, Gu Zheng wrote:

 The following lockdep warning occurs when running with 4.1.0-rc3:
 [3.178000] [ cut here ]
 [3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
 lockdep_trace_alloc+0xdd/0xe0()
 [3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
 [3.199000] Modules linked in:
 
 [3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
 [3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
 81773f0a
 [3.23]   88086fd5fc90 88086fd5fc78 
 8108c85a
 [3.238000]  88086fd6 0092 88086fd6 
 00d0
 [3.246000] Call Trace:
 [3.249000]  [81773f0a] dump_stack+0x4c/0x65
 [3.255000]  [8108c85a] warn_slowpath_common+0x8a/0xc0
 [3.261000]  [8108c8e5] warn_slowpath_fmt+0x55/0x70
 [3.268000]  [810ee24d] lockdep_trace_alloc+0xdd/0xe0
 [3.274000]  [811cda0d] __alloc_pages_nodemask+0xad/0xca0
 [3.281000]  [810ec7ad] ? __lock_acquire+0xf6d/0x1560
 [3.288000]  [81219c8a] alloc_page_interleave+0x3a/0x90
 [3.295000]  [8121b32d] alloc_pages_current+0x17d/0x1a0
 [3.301000]  [811c869e] ? __get_free_pages+0xe/0x50
 [3.308000]  [811c869e] __get_free_pages+0xe/0x50
 [3.314000]  [8102640b] init_espfix_ap+0x17b/0x320
 [3.32]  [8105c691] start_secondary+0xf1/0x1f0
 [3.327000] ---[ end trace 1b3327d9d6a1d62c ]---
 
 This seems to be a false warning from lockdep: we allocate pages with GFP_KERNEL in
 init_espfix_ap(), which is called before local irqs are enabled, so the lockdep
 sub-system treats this as allocating memory with GFP_FS while local irqs are
 disabled and triggers the warning shown above.
 
 We could allocate the pages on the boot CPU side and hand them over to
 the secondary CPU, but that seems wasteful if some of the cpus stay offline.
 As these pages (the espfix stack) are not needed until we try to run user
 code, we can postpone the initialization of the espfix stack until after the
 cpu has booted, which avoids the noise.
 
 Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
 ---
  arch/x86/kernel/smpboot.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)
 
 diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
 index 50e547e..3ce05de 100644
 --- a/arch/x86/kernel/smpboot.c
 +++ b/arch/x86/kernel/smpboot.c
 @@ -240,13 +240,6 @@ static void notrace start_secondary(void *unused)
   check_tsc_sync_target();
  
   /*
 -  * Enable the espfix hack for this CPU
 -  */
 -#ifdef CONFIG_X86_ESPFIX64
 - init_espfix_ap();
 -#endif
 -
 - /*
* We need to hold vector_lock so there the set of online cpus
* does not change while we are assigning vectors to cpus.  Holding
* this lock ensures we don't half assign or remove an irq from a cpu.
 @@ -901,6 +894,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
 task_struct *idle)
   }
   }
  
 + /*
 +  * Enable the espfix hack for this CPU
 +  */
 +#ifdef CONFIG_X86_ESPFIX64
 + init_espfix_ap();
 +#endif
 +
   /* mark stuck area as not stuck */
   *trampoline_status = 0;
  


Re: [RFC PATCH V2 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-05-26 Thread Gu Zheng
ping...

Any comments or suggestions are welcome.

Regards,
Gu

On 05/14/2015 07:33 PM, Gu Zheng wrote:

> Yasuaki Ishimatsu found that with node online/offline, cpu<->node
> relationship is established. Because workqueue uses a info which
> was established at boot time, but it may be changed by node hotpluging.
> 
> Once pool->node points to a stale node, following allocation failure
> happens.
>   ==
>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>   cache: kmalloc-192, object size: 192, buffer size: 192, default
> order:
> 1, min order: 0
>   node 0: slabs: 6172, objs: 259224, free: 245741
>   node 1: slabs: 3261, objs: 136962, free: 127656
>   ==
> 
> As the apicid <---> pxm and pxm <--> node relationship are persistent, then
> the apicid <--> node mapping is persistent, so the root cause is the
> cpu-id <-> lapicid mapping is not persistent (because the currently
> implementation always choose the first free cpu id for the new added cpu).
> If we can build persistent cpu-id <-> lapicid relationship, this problem
> will be fixed.
> 
> This patch tries to build the whole world mapping cpuid <-> apicid <-> pxm 
> <-> node
> for all possible processor at the boot, the detail implementation are 2 steps:
> 
> Step1: generate a logical cpu id for all the local apic (both enabled and 
> disabled)
>when register local apic
> Step2: map the cpu to the physical node via an additional acpi ns walk for 
> processor.
> 
> Please refer to:
> https://lkml.org/lkml/2015/2/27/145
> https://lkml.org/lkml/2015/3/25/989
> for the previous discussion.
> ---
>  V2: rebase on latest upstream.
> ---
> 
> Signed-off-by: Gu Zheng 
> ---
>  arch/ia64/kernel/acpi.c   |   2 +-
>  arch/x86/include/asm/mpspec.h |   1 +
>  arch/x86/kernel/acpi/boot.c   |   8 ++-
>  arch/x86/kernel/apic/apic.c   |  73 -
>  arch/x86/mm/numa.c|  20 ---
>  drivers/acpi/acpi_processor.c |   2 +-
>  drivers/acpi/bus.c|   3 ++
>  drivers/acpi/processor_core.c | 121 
> ++
>  include/linux/acpi.h  |   2 +
>  9 files changed, 172 insertions(+), 60 deletions(-)
> 
> diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
> index b1698bc..7db5563 100644
> --- a/arch/ia64/kernel/acpi.c
> +++ b/arch/ia64/kernel/acpi.c
> @@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
>   *  ACPI based hotplug CPU support
>   */
>  #ifdef CONFIG_ACPI_HOTPLUG_CPU
> -static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
> +int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>  {
>  #ifdef CONFIG_ACPI_NUMA
>   /*
> diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
> index b07233b..db902d8 100644
> --- a/arch/x86/include/asm/mpspec.h
> +++ b/arch/x86/include/asm/mpspec.h
> @@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
>  #endif
>  
>  int generic_processor_info(int apicid, int version);
> +int __generic_processor_info(int apicid, int version, bool enabled);
>  
>  #define PHYSID_ARRAY_SIZEBITS_TO_LONGS(MAX_LOCAL_APIC)
>  
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index dbe76a1..c79115b 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
>   return -EINVAL;
>   }
>  
> - if (!enabled) {
> + if (!enabled)
>   ++disabled_cpus;
> - return -EINVAL;
> - }
>  
>   if (boot_cpu_physical_apicid != -1U)
>   ver = apic_version[boot_cpu_physical_apicid];
>  
> - return generic_processor_info(id, ver);
> + return __generic_processor_info(id, ver, enabled);
>  }
>  
>  static int __init
> @@ -726,7 +724,7 @@ static void __init acpi_set_irq_model_ioapic(void)
>  #ifdef CONFIG_ACPI_HOTPLUG_CPU
>  #include <acpi/processor.h>
>  
> -static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
> +void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>  {
>  #ifdef CONFIG_ACPI_NUMA
>   int nid;
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index dcb5285..7fbf2cb 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>   apic_write(APIC_LVT1, value);
>  }
>  
> -int generic_processor_info(int apicid, int version)
> +/*
> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
> +


[RFC PATCH] x86, espfix: postpone the initialization of espfix stack for AP

2015-05-22 Thread Gu Zheng
The following lockdep warning occurs when running with 4.1.0-rc3:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [] dump_stack+0x4c/0x65
[3.255000]  [] warn_slowpath_common+0x8a/0xc0
[3.261000]  [] warn_slowpath_fmt+0x55/0x70
[3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [] alloc_page_interleave+0x3a/0x90
[3.295000]  [] alloc_pages_current+0x17d/0x1a0
[3.301000]  [] ? __get_free_pages+0xe/0x50
[3.308000]  [] __get_free_pages+0xe/0x50
[3.314000]  [] init_espfix_ap+0x17b/0x320
[3.32]  [] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

This seems to be a false warning from lockdep: we allocate pages with GFP_KERNEL in
init_espfix_ap(), which is called before local irqs are enabled, so the lockdep
sub-system treats this as allocating memory with GFP_FS while local irqs are
disabled and triggers the warning shown above.

We could allocate the pages on the boot CPU side and hand them over to
the secondary CPU, but that seems wasteful if some of the cpus stay offline.
As these pages (the espfix stack) are not needed until we try to run user
code, we can postpone the initialization of the espfix stack until after the
cpu has booted, which avoids the noise.

Signed-off-by: Gu Zheng 
---
 arch/x86/kernel/smpboot.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 50e547e..3ce05de 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -240,13 +240,6 @@ static void notrace start_secondary(void *unused)
check_tsc_sync_target();
 
/*
-* Enable the espfix hack for this CPU
-*/
-#ifdef CONFIG_X86_ESPFIX64
-   init_espfix_ap();
-#endif
-
-   /*
 * We need to hold vector_lock so there the set of online cpus
 * does not change while we are assigning vectors to cpus.  Holding
 * this lock ensures we don't half assign or remove an irq from a cpu.
@@ -901,6 +894,13 @@ static int do_boot_cpu(int apicid, int cpu, struct 
task_struct *idle)
}
}
 
+   /*
+* Enable the espfix hack for this CPU
+*/
+#ifdef CONFIG_X86_ESPFIX64
+   init_espfix_ap();
+#endif
+
/* mark "stuck" area as not stuck */
*trampoline_status = 0;
 
-- 
1.8.3.1


Re: [RFC PATCH V2 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-05-14 Thread Gu Zheng
Hi Ishimatsu,

On 05/15/2015 12:44 AM, Yasuaki Ishimatsu wrote:

> Hi Gu,
> 
> Before 8 months, I posted the following patch to relate
> cpuid to apicid.
> 
> https://lkml.org/lkml/2014/9/3/1120
> 
> Could you try this patch?


Thanks for the reminder.
It looks similar to https://lkml.org/lkml/2015/3/25/989,
"[PATCH 0/2] workqueue: fix a bug when numa mapping is changed".
Although that approach can also fix the issue, it does not seem to be the
ideal solution, because self-maintained cpumask mappings (or something
like them) are very common in the kernel.
As TJ and Kame suggested, it is feasible to build the mapping
for all possible cpus at boot, so that we can ignore the
effects of cpu/node hotplug, especially for the per-cpu cases.

Regards,
Gu

> 
> Thanks,
> Yasuaki Ishimatsu
> 
> On Thu, 14 May 2015 19:33:33 +0800
> Gu Zheng  wrote:
> 
>> Yasuaki Ishimatsu found that with node online/offline, cpu<->node
>> relationship is established. Because workqueue uses a info which
>> was established at boot time, but it may be changed by node hotpluging.
>>
>> Once pool->node points to a stale node, following allocation failure
>> happens.
>>   ==
>>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>   cache: kmalloc-192, object size: 192, buffer size: 192, default
>> order:
>> 1, min order: 0
>>   node 0: slabs: 6172, objs: 259224, free: 245741
>>   node 1: slabs: 3261, objs: 136962, free: 127656
>>   ==
>>
>> As the apicid <---> pxm and pxm <--> node relationship are persistent, then
>> the apicid <--> node mapping is persistent, so the root cause is the
>> cpu-id <-> lapicid mapping is not persistent (because the currently
>> implementation always choose the first free cpu id for the new added cpu).
>> If we can build persistent cpu-id <-> lapicid relationship, this problem
>> will be fixed.
>>
>> This patch tries to build the whole world mapping cpuid <-> apicid <-> pxm 
>> <-> node
>> for all possible processor at the boot, the detail implementation are 2 
>> steps:
>>
>> Step1: generate a logical cpu id for all the local apic (both enabled and 
>> disabled)
>>when register local apic
>> Step2: map the cpu to the physical node via an additional acpi ns walk for 
>> processor.
>>
>> Please refer to:
>> https://lkml.org/lkml/2015/2/27/145
>> https://lkml.org/lkml/2015/3/25/989
>> for the previous discussion.
>> ---
>>  V2: rebase on latest upstream.
>> ---
>>
>> Signed-off-by: Gu Zheng 
>> ---
>>  arch/ia64/kernel/acpi.c   |   2 +-
>>  arch/x86/include/asm/mpspec.h |   1 +
>>  arch/x86/kernel/acpi/boot.c   |   8 ++-
>>  arch/x86/kernel/apic/apic.c   |  73 -
>>  arch/x86/mm/numa.c|  20 ---
>>  drivers/acpi/acpi_processor.c |   2 +-
>>  drivers/acpi/bus.c|   3 ++
>>  drivers/acpi/processor_core.c | 121 
>> ++
>>  include/linux/acpi.h  |   2 +
>>  9 files changed, 172 insertions(+), 60 deletions(-)
>>
>> diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
>> index b1698bc..7db5563 100644
>> --- a/arch/ia64/kernel/acpi.c
>> +++ b/arch/ia64/kernel/acpi.c
>> @@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
>>   *  ACPI based hotplug CPU support
>>   */
>>  #ifdef CONFIG_ACPI_HOTPLUG_CPU
>> -static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>> +int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>>  {
>>  #ifdef CONFIG_ACPI_NUMA
>>  /*
>> diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
>> index b07233b..db902d8 100644
>> --- a/arch/x86/include/asm/mpspec.h
>> +++ b/arch/x86/include/asm/mpspec.h
>> @@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
>>  #endif
>>  
>>  int generic_processor_info(int apicid, int version);
>> +int __generic_processor_info(int apicid, int version, bool enabled);
>>  
>>  #define PHYSID_ARRAY_SIZE   BITS_TO_LONGS(MAX_LOCAL_APIC)
>>  
>> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
>> index dbe76a1..c79115b 100644
>> --- a/arch/x86/kernel/acpi/boot.c
>> +++ b/arch/x86/kernel/acpi/boot.c
>> @@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
>>  return -EINVAL;
>>  }
>>  
>> -if (!enabled) {
>> +if (!enabled)
>>  ++disabled_cpus;

[PATCH] mm/memory hotplug: init the zones' size when calculating node totalpages

2015-05-14 Thread Gu Zheng
Initialize the zones' sizes when calculating the node's total pages, to avoid
duplicated operations in free_area_init_core.
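
As a quick illustration of the refactor, here is a hedged userspace sketch
(plain C with made-up page counts; the names only loosely mirror the kernel
ones) of computing each zone's spanned/present pages once, storing them in the
zone, and summing the node totals in the same pass:

#include <stdio.h>

#define MAX_NR_ZONES 3

struct zone { unsigned long spanned_pages, present_pages; };
struct pgdat {
	struct zone zones[MAX_NR_ZONES];
	unsigned long node_spanned_pages, node_present_pages;
};

/* made-up per-zone numbers standing in for zone_spanned/absent_pages_in_node() */
static const unsigned long spanned[MAX_NR_ZONES] = { 4096, 262144, 0 };
static const unsigned long absent[MAX_NR_ZONES]  = {  128,   1024, 0 };

static void calculate_node_totalpages(struct pgdat *p)
{
	unsigned long total = 0, realtotal = 0;

	for (int i = 0; i < MAX_NR_ZONES; i++) {
		unsigned long size = spanned[i];
		unsigned long real_size = size - absent[i];

		/* set the zone fields here, so the later init pass
		 * does not have to recompute them */
		p->zones[i].spanned_pages = size;
		p->zones[i].present_pages = real_size;
		total += size;
		realtotal += real_size;
	}
	p->node_spanned_pages = total;
	p->node_present_pages = realtotal;
}

int main(void)
{
	struct pgdat p = { 0 };

	calculate_node_totalpages(&p);
	printf("spanned %lu present %lu\n",
	       p.node_spanned_pages, p.node_present_pages);
	return 0;
}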

Signed-off-by: Gu Zheng 
---
 mm/page_alloc.c |   44 +---
 1 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..0b34aec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4769,22 +4769,28 @@ static void __meminit calculate_node_totalpages(struct 
pglist_data *pgdat,
unsigned long *zones_size,
unsigned long *zholes_size)
 {
-   unsigned long realtotalpages, totalpages = 0;
+   unsigned long realtotalpages = 0, totalpages = 0;
enum zone_type i;
 
-   for (i = 0; i < MAX_NR_ZONES; i++)
-   totalpages += zone_spanned_pages_in_node(pgdat->node_id, i,
-node_start_pfn,
-node_end_pfn,
-zones_size);
-   pgdat->node_spanned_pages = totalpages;
+   for (i = 0; i < MAX_NR_ZONES; i++) {
+   struct zone *zone = pgdat->node_zones + i;
+   unsigned long size, real_size;
 
-   realtotalpages = totalpages;
-   for (i = 0; i < MAX_NR_ZONES; i++)
-   realtotalpages -=
-   zone_absent_pages_in_node(pgdat->node_id, i,
+   size = zone_spanned_pages_in_node(pgdat->node_id, i,
+ node_start_pfn,
+ node_end_pfn,
+ zones_size);
+   real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
  node_start_pfn, node_end_pfn,
  zholes_size);
+   zone->spanned_pages = size;
+   zone->present_pages = real_size;
+
+   totalpages += size;
+   realtotalpages += real_size;
+   }
+
+   pgdat->node_spanned_pages = totalpages;
pgdat->node_present_pages = realtotalpages;
printk(KERN_DEBUG "On node %d totalpages: %lu\n", pgdat->node_id,
realtotalpages);
@@ -4894,8 +4900,7 @@ static unsigned long __paginginit 
calc_memmap_size(unsigned long spanned_pages,
  * NOTE: pgdat should get zeroed by caller.
  */
 static void __paginginit free_area_init_core(struct pglist_data *pgdat,
-   unsigned long node_start_pfn, unsigned long node_end_pfn,
-   unsigned long *zones_size, unsigned long *zholes_size)
+   unsigned long node_start_pfn, unsigned long node_end_pfn)
 {
enum zone_type j;
int nid = pgdat->node_id;
@@ -4916,12 +4921,8 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize, freesize, memmap_pages;
 
-   size = zone_spanned_pages_in_node(nid, j, node_start_pfn,
- node_end_pfn, zones_size);
-   realsize = freesize = size - zone_absent_pages_in_node(nid, j,
-   node_start_pfn,
-   node_end_pfn,
-   zholes_size);
+   size = zone->spanned_pages;
+   realsize = freesize = zone->present_pages;
 
/*
 * Adjust freesize so that it accounts for how much memory
@@ -4956,8 +4957,6 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
nr_kernel_pages -= memmap_pages;
nr_all_pages += freesize;
 
-   zone->spanned_pages = size;
-   zone->present_pages = realsize;
/*
 * Set an approximate value for lowmem here, it will be adjusted
 * when the bootmem allocator frees pages into the buddy system.
@@ -5063,8 +5062,7 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
(unsigned long)pgdat->node_mem_map);
 #endif
 
-   free_area_init_core(pgdat, start_pfn, end_pfn,
-   zones_size, zholes_size);
+   free_area_init_core(pgdat, start_pfn, end_pfn);
 }
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-- 
1.7.7


[RFC PATCH] x86, espfix: use spin_lock rather than mutex

2015-05-14 Thread Gu Zheng
The following lockdep warning occurs when running with the latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [] dump_stack+0x4c/0x65
[3.255000]  [] warn_slowpath_common+0x8a/0xc0
[3.261000]  [] warn_slowpath_fmt+0x55/0x70
[3.268000]  [] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [] alloc_page_interleave+0x3a/0x90
[3.295000]  [] alloc_pages_current+0x17d/0x1a0
[3.301000]  [] ? __get_free_pages+0xe/0x50
[3.308000]  [] __get_free_pages+0xe/0x50
[3.314000]  [] init_espfix_ap+0x17b/0x320
[3.32]  [] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

This seems to be a false warning from lockdep: we allocate pages with GFP_KERNEL in
init_espfix_ap(), which is called before local irqs are enabled, so the lockdep
sub-system treats this as allocating memory with GFP_FS while local irqs are
disabled and triggers the warning shown above.
Though we could use GFP_NOFS rather than GFP_KERNEL here to avoid the warning,
init_espfix_ap() is called with preemption and local irqs disabled, so it is
not a good idea to use a mutex (which might sleep) here in any case.
So we convert the initialization lock to a spinlock to avoid the noise.
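
For what it's worth, here is a hedged userspace sketch of the same
"check, lock, re-check" pattern that init_espfix_ap() uses, with a pthread
spinlock standing in for the kernel spinlock introduced by this patch
(all names below are illustrative, not kernel APIs):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_spinlock_t init_lock;
static void *shared_page;

static void *init_once(void *arg)
{
	(void)arg;
	/* fast path: already set up by another "CPU" */
	if (__atomic_load_n(&shared_page, __ATOMIC_ACQUIRE))
		return NULL;

	pthread_spin_lock(&init_lock);
	if (!shared_page)			/* did we race on the lock? */
		shared_page = malloc(4096);	/* only one thread allocates */
	pthread_spin_unlock(&init_lock);
	return NULL;
}

int main(void)
{
	pthread_t t[4];

	pthread_spin_init(&init_lock, PTHREAD_PROCESS_PRIVATE);
	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, init_once, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	printf("shared page at %p\n", shared_page);
	return 0;
}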

Signed-off-by: Gu Zheng 
Cc: Stable 
---
 arch/x86/kernel/espfix_64.c |   13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..ceb35a3 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -57,14 +57,14 @@
 # error "Need more than one PGD for the ESPFIX hack"
 #endif
 
-#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define PGALLOC_GFP (GFP_ATOMIC | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
 DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
-/* Initialization mutex - should this be a spinlock? */
-static DEFINE_MUTEX(espfix_init_mutex);
+/* Initialization lock */
+static DEFINE_SPINLOCK(espfix_init_lock);
 
 /* Page allocation bitmap - each page serves ESPFIX_STACKS_PER_PAGE CPUs */
 #define ESPFIX_MAX_PAGES  DIV_ROUND_UP(CONFIG_NR_CPUS, ESPFIX_STACKS_PER_PAGE)
@@ -144,6 +144,7 @@ void init_espfix_ap(void)
int n;
void *stack_page;
pteval_t ptemask;
+   unsigned long flags;
 
/* We only have to do this once... */
if (likely(this_cpu_read(espfix_stack)))
@@ -158,7 +159,7 @@ void init_espfix_ap(void)
if (likely(stack_page))
goto done;
 
-   mutex_lock(&espfix_init_mutex);
+   spin_lock_irqsave(&espfix_init_lock, flags);
 
/* Did we race on the lock? */
stack_page = ACCESS_ONCE(espfix_pages[page]);
@@ -188,7 +189,7 @@ void init_espfix_ap(void)
}
 
pte_p = pte_offset_kernel(&pmd, addr);
-   stack_page = (void *)__get_free_page(GFP_KERNEL);
+   stack_page = (void *)__get_free_page(PGALLOC_GFP);
pte = __pte(__pa(stack_page) | (__PAGE_KERNEL_RO & ptemask));
for (n = 0; n < ESPFIX_PTE_CLONES; n++)
set_pte(&pte_p[n*PTE_STRIDE], pte);
@@ -197,7 +198,7 @@ void init_espfix_ap(void)
ACCESS_ONCE(espfix_pages[page]) = stack_page;
 
 unlock_done:
-   mutex_unlock(&espfix_init_mutex);
+   spin_unlock_irqrestore(&espfix_init_lock, flags);
 done:
this_cpu_write(espfix_stack, addr);
this_cpu_write(espfix_waddr, (unsigned long)stack_page
-- 
1.7.7


[RFC PATCH V2 2/2] gfp: use the best near online node if the target node is offline

2015-05-14 Thread Gu Zheng
Since the change to the cpu <--> node mapping (the cpu is mapped to its physical
node for all possible cpus at boot), the node of a cpu may not be present (online),
so in the low level allocation APIs we fall back to the best near online node
when the target node is not online.
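
As a rough, hedged userspace sketch of the fallback logic (the 4x4 distance
table and the online mask below are made up for illustration; only the
selection loop loosely mirrors find_near_online_node() from this patch):

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4

static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};
static const bool online[NR_NODES] = { true, true, false, false };

/* pick the online node with the smallest distance to the target node */
static int find_near_online_node(int node)
{
	int best = -1, min_val = INT_MAX;

	for (int n = 0; n < NR_NODES; n++) {
		if (!online[n])
			continue;
		if (distance[node][n] < min_val) {
			min_val = distance[node][n];
			best = n;
		}
	}
	return best;
}

int main(void)
{
	/* node 3 is offline, so allocations targeted at it fall back to node 1 */
	printf("node 3 -> nearest online node %d\n", find_near_online_node(3));
	return 0;
}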

---
V2: Maintaining a per-cpu cache about the alternative-node
only for x86 arch to avoid additional overhead.
---

Signed-off-by: Gu Zheng 
---
 arch/x86/include/asm/topology.h |  2 ++
 arch/x86/mm/numa.c  | 33 +
 include/linux/gfp.h | 12 +++-
 3 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 0e8f04f..37bb6b6 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -82,6 +82,8 @@ static inline const struct cpumask *cpumask_of_node(int node)
 }
 #endif
 
+extern int get_near_online_node(int node);
+
 extern void setup_node_to_cpumask_map(void);
 
 /*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a733cf9..4126464 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -72,12 +72,34 @@ int numa_cpu_node(int cpu)
 cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
 EXPORT_SYMBOL(node_to_cpumask_map);
 
+cpumask_t node_to_cpuid_mask_map[MAX_NUMNODES];
 /*
  * Map cpu index to node index
  */
 DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE);
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map);
 
+DEFINE_PER_CPU(int, x86_cpu_to_near_online_node);
+EXPORT_PER_CPU_SYMBOL(x86_cpu_to_near_online_node);
+
+static int find_near_online_node(int node)
+{
+   int n, val;
+   int min_val = INT_MAX;
+   int best_node = -1;
+
+   for_each_online_node(n) {
+   val = node_distance(node, n);
+
+   if (val < min_val) {
+   min_val = val;
+   best_node = n;
+   }
+   }
+
+   return best_node;
+}
+
 void numa_set_node(int cpu, int node)
 {
int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
@@ -95,7 +117,11 @@ void numa_set_node(int cpu, int node)
return;
}
 #endif
+
+   per_cpu(x86_cpu_to_near_online_node, cpu) =
+   find_near_online_node(numa_cpu_node(cpu));
per_cpu(x86_cpu_to_node_map, cpu) = node;
+   cpumask_set_cpu(cpu, &node_to_cpuid_mask_map[numa_cpu_node(cpu)]);
 
set_cpu_numa_node(cpu, node);
 }
@@ -105,6 +131,13 @@ void numa_clear_node(int cpu)
numa_set_node(cpu, NUMA_NO_NODE);
 }
 
+int get_near_online_node(int node)
+{
+   return per_cpu(x86_cpu_to_near_online_node,
+  cpumask_first(&node_to_cpuid_mask_map[node]));
+}
+EXPORT_SYMBOL(get_near_online_node);
+
 /*
  * Allocate node_to_cpumask_map based on number of available nodes
  * Requires node_possible_map to be valid.
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 97a9373..b233ea4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -305,13 +305,23 @@ static inline struct page *alloc_pages_node(int nid, 
gfp_t gfp_mask,
if (nid < 0)
nid = numa_node_id();
 
+#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
+   if (!node_online(nid))
+   nid = get_near_online_node(nid);
+#endif
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
 static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
-   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
+   if (!node_online(nid))
+   nid = get_near_online_node(nid);
+#endif
+   VM_BUG_ON(!node_online(nid));
 
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
-- 
1.8.3.1


[RFC PATCH V2 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-05-14 Thread Gu Zheng
Yasuaki Ishimatsu found that with node online/offline, the cpu<->node
relationship can change: workqueue uses info which was established at boot
time, but it may be changed by node hotplugging.

Once pool->node points to a stale node, the following allocation failure
happens.
  ==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
  ==

As the apicid <--> pxm and pxm <--> node relationships are persistent, the
apicid <--> node mapping is persistent too, so the root cause is that the
cpu-id <-> lapicid mapping is not persistent (the current implementation
always chooses the first free cpu id for the newly added cpu).
If we can build a persistent cpu-id <-> lapicid relationship, this problem
will be fixed.

This patch tries to build the whole cpuid <-> apicid <-> pxm <-> node mapping
for all possible processors at boot; the implementation has 2 steps:

Step1: generate a logical cpu id for every local apic (both enabled and
disabled) when registering the local apic
Step2: map the cpu to the physical node via an additional acpi namespace walk
for the processor.

Please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
for the previous discussion.
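
To illustrate the idea behind Step1, here is a hedged userspace model of the
allocation policy (plain C with made-up apicid values and an 8-entry table;
only the persistent-lookup logic loosely mirrors the get_cpuid() helper added
by this patch): an apicid seen before gets its old cpuid back instead of the
first free id.

#include <stdio.h>

#define NR_IDS 8

/* persistent apicid -> cpuid table; entries are never cleared on hot-remove
 * (the range-designated initializer is a GNU extension, as in the patch) */
static int apicid_to_cpuid[NR_IDS] = { [0 ... NR_IDS - 1] = -1 };
static int next_free;

static int persistent_cpuid(int apicid)
{
	for (int i = 0; i < next_free; i++)
		if (apicid_to_cpuid[i] == apicid)
			return i;		/* re-added CPU keeps its old cpuid */
	if (next_free >= NR_IDS)
		return -1;			/* table full */
	apicid_to_cpuid[next_free] = apicid;
	return next_free++;
}

int main(void)
{
	printf("apicid 0x10 -> cpuid %d\n", persistent_cpuid(0x10));
	printf("apicid 0x20 -> cpuid %d\n", persistent_cpuid(0x20));
	/* hot-remove + hot-add of apicid 0x10: the same cpuid comes back */
	printf("apicid 0x10 -> cpuid %d (after re-add)\n", persistent_cpuid(0x10));
	return 0;
}
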
---
 V2: rebase on latest upstream.
---

Signed-off-by: Gu Zheng 
---
 arch/ia64/kernel/acpi.c   |   2 +-
 arch/x86/include/asm/mpspec.h |   1 +
 arch/x86/kernel/acpi/boot.c   |   8 ++-
 arch/x86/kernel/apic/apic.c   |  73 -
 arch/x86/mm/numa.c|  20 ---
 drivers/acpi/acpi_processor.c |   2 +-
 drivers/acpi/bus.c|   3 ++
 drivers/acpi/processor_core.c | 121 ++
 include/linux/acpi.h  |   2 +
 9 files changed, 172 insertions(+), 60 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index b1698bc..7db5563 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index dbe76a1..c79115b 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
@@ -726,7 +724,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include <acpi/processor.h>
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb5285..7fbf2cb 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-int generic_processor_info(int apicid, int version)
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu hot removed.
+ * */
+static int apicid_to_cpuid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Internal cpu id bits, set the bit once cpu present, and never clear it.
+ * */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+   int free_id, i;
+
+   free_id = cpumask_next_zero(-1, &cpuid_mask);
+   if (free_id >= nr_cpu_ids)
+   return -1;
+
+   for (i = 0; i < free_id; i++)
+   if (apicid_to_cpuid[i] == apicid)
+   return i;
+
+   apicid_to_cpuid[free_id] = apicid;
+   cpumask_set_cpu(free_id, &cpuid_mask);
+
+   return free_id;
+}
+
+int


[RFC PATCH] x86, espfix: use spin_lock rather than mutex

2015-05-14 Thread Gu Zheng
The following lockdep warning occurs when running with the latest kernel:
[3.178000] [ cut here ]
[3.183000] WARNING: CPU: 128 PID: 0 at kernel/locking/lockdep.c:2755 
lockdep_trace_alloc+0xdd/0xe0()
[3.193000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[3.199000] Modules linked in:

[3.203000] CPU: 128 PID: 0 Comm: swapper/128 Not tainted 4.1.0-rc3 #70
[3.221000]   2d6601fb3e6d4e4c 88086fd5fc38 
81773f0a
[3.23]   88086fd5fc90 88086fd5fc78 
8108c85a
[3.238000]  88086fd6 0092 88086fd6 
00d0
[3.246000] Call Trace:
[3.249000]  [81773f0a] dump_stack+0x4c/0x65
[3.255000]  [8108c85a] warn_slowpath_common+0x8a/0xc0
[3.261000]  [8108c8e5] warn_slowpath_fmt+0x55/0x70
[3.268000]  [810ee24d] lockdep_trace_alloc+0xdd/0xe0
[3.274000]  [811cda0d] __alloc_pages_nodemask+0xad/0xca0
[3.281000]  [810ec7ad] ? __lock_acquire+0xf6d/0x1560
[3.288000]  [81219c8a] alloc_page_interleave+0x3a/0x90
[3.295000]  [8121b32d] alloc_pages_current+0x17d/0x1a0
[3.301000]  [811c869e] ? __get_free_pages+0xe/0x50
[3.308000]  [811c869e] __get_free_pages+0xe/0x50
[3.314000]  [8102640b] init_espfix_ap+0x17b/0x320
[3.32]  [8105c691] start_secondary+0xf1/0x1f0
[3.327000] ---[ end trace 1b3327d9d6a1d62c ]---

This seems to be a false warning from lockdep: we allocate pages with GFP_KERNEL
in init_espfix_ap(), which is called before local irqs are enabled, and the
lockdep sub-system treats this as allocating memory with GFP_FS while local
irqs are disabled, then triggers the warning mentioned above.
We could use GFP_NOFS rather than GFP_KERNEL to silence the warning, but
init_espfix_ap() is called with preemption and local irqs disabled, so it is
not a good idea to use a mutex (which might sleep) here at all.
So convert the initialization lock to a spinlock to avoid the noise.
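
As a rough illustration of the pattern (a minimal sketch with made-up names,
not the espfix code itself): one-time init that may run with preemption and
local irqs disabled has to avoid sleeping primitives, so it takes a spinlock
with irqsave and allocates with GFP_ATOMIC:

/*
 * Sketch only: demo_lock/demo_page are hypothetical; the point is the
 * spin_lock_irqsave + GFP_ATOMIC combination for non-sleeping init paths.
 */
#include <linux/spinlock.h>
#include <linux/gfp.h>

static DEFINE_SPINLOCK(demo_lock);
static void *demo_page;

static void demo_init_once(void)
{
	unsigned long flags;

	if (demo_page)				/* fast path: already initialized */
		return;

	spin_lock_irqsave(&demo_lock, flags);
	if (!demo_page)				/* re-check after taking the lock */
		demo_page = (void *)__get_free_page(GFP_ATOMIC | __GFP_ZERO);
	spin_unlock_irqrestore(&demo_lock, flags);
}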

Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
Cc: Stable sta...@vger.kernel.org
---
 arch/x86/kernel/espfix_64.c |   13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index f5d0730..ceb35a3 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -57,14 +57,14 @@
 # error "Need more than one PGD for the ESPFIX hack"
 #endif
 
-#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
+#define PGALLOC_GFP (GFP_ATOMIC | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
 DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
 DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
-/* Initialization mutex - should this be a spinlock? */
-static DEFINE_MUTEX(espfix_init_mutex);
+/* Initialization lock */
+static DEFINE_SPINLOCK(espfix_init_lock);
 
 /* Page allocation bitmap - each page serves ESPFIX_STACKS_PER_PAGE CPUs */
 #define ESPFIX_MAX_PAGES  DIV_ROUND_UP(CONFIG_NR_CPUS, ESPFIX_STACKS_PER_PAGE)
@@ -144,6 +144,7 @@ void init_espfix_ap(void)
int n;
void *stack_page;
pteval_t ptemask;
+   unsigned long flags;
 
/* We only have to do this once... */
if (likely(this_cpu_read(espfix_stack)))
@@ -158,7 +159,7 @@ void init_espfix_ap(void)
if (likely(stack_page))
goto done;
 
-   mutex_lock(&espfix_init_mutex);
+   spin_lock_irqsave(&espfix_init_lock, flags);
 
/* Did we race on the lock? */
stack_page = ACCESS_ONCE(espfix_pages[page]);
@@ -188,7 +189,7 @@ void init_espfix_ap(void)
}
 
	pte_p = pte_offset_kernel(&pmd, addr);
-   stack_page = (void *)__get_free_page(GFP_KERNEL);
+   stack_page = (void *)__get_free_page(PGALLOC_GFP);
	pte = __pte(__pa(stack_page) | (__PAGE_KERNEL_RO & ptemask));
	for (n = 0; n < ESPFIX_PTE_CLONES; n++)
		set_pte(&pte_p[n*PTE_STRIDE], pte);
@@ -197,7 +198,7 @@ void init_espfix_ap(void)
ACCESS_ONCE(espfix_pages[page]) = stack_page;
 
 unlock_done:
-   mutex_unlock(&espfix_init_mutex);
+   spin_unlock_irqrestore(&espfix_init_lock, flags);
 done:
this_cpu_write(espfix_stack, addr);
this_cpu_write(espfix_waddr, (unsigned long)stack_page
-- 
1.7.7


Re: [RFC PATCH V2 1/2] x86/cpu hotplug: make apicid -- cpuid mapping persistent

2015-05-14 Thread Gu Zheng
Hi Ishimatsu,

On 05/15/2015 12:44 AM, Yasuaki Ishimatsu wrote:

 Hi Gu,
 
 Before 8 months, I posted the following patch to relate
 cpuid to apicid.
 
 https://lkml.org/lkml/2014/9/3/1120
 
 Could you try this patch?


Thanks for your reminder.
It seems similar to https://lkml.org/lkml/2015/3/25/989
([PATCH 0/2] workqueue: fix a bug when numa mapping is changed);
though that series can also fix the issue, it does not seem to be the
perfect solution, because self-maintained cpumask mappings (or something
like this) are very common in the kernel.
As TJ and Kame suggested, it is possible to build the mapping
for all the possible cpus at boot, so that we can ignore the
effect of cpu/node hotplug, especially for the per-cpu cases.

Regards,
Gu

 
 Thanks,
 Yasuaki Ishimatsu
 
 On Thu, 14 May 2015 19:33:33 +0800
 Gu Zheng guz.f...@cn.fujitsu.com wrote:
 
 Yasuaki Ishimatsu found that with node online/offline, cpu<->node
 relationship is established. Because workqueue uses a info which
 was established at boot time, but it may be changed by node hotpluging.

 Once pool->node points to a stale node, following allocation failure
 happens.
   ==
  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
   cache: kmalloc-192, object size: 192, buffer size: 192, default
 order:
 1, min order: 0
   node 0: slabs: 6172, objs: 259224, free: 245741
   node 1: slabs: 3261, objs: 136962, free: 127656
   ==

 As the apicid <---> pxm and pxm <--> node relationship are persistent, then
 the apicid <--> node mapping is persistent, so the root cause is the
 cpu-id <-> lapicid mapping is not persistent (because the currently
 implementation always choose the first free cpu id for the new added cpu).
 If we can build persistent cpu-id <-> lapicid relationship, this problem
 will be fixed.

 This patch tries to build the whole world mapping cpuid <-> apicid <-> pxm
 <-> node
 for all possible processor at the boot, the detail implementation are 2 
 steps:

 Step1: generate a logic cpu id for all the local apic (both enabled and 
 dsiabled)
when register local apic
 Step2: map the cpu to the phyical node via an additional acpi ns walk for 
 processor.

 Please refer to:
 https://lkml.org/lkml/2015/2/27/145
 https://lkml.org/lkml/2015/3/25/989
 for the previous discussion.
 ---
  V2: rebase on latest upstream.
 ---

 Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
 ---
  arch/ia64/kernel/acpi.c   |   2 +-
  arch/x86/include/asm/mpspec.h |   1 +
  arch/x86/kernel/acpi/boot.c   |   8 ++-
  arch/x86/kernel/apic/apic.c   |  73 -
  arch/x86/mm/numa.c|  20 ---
  drivers/acpi/acpi_processor.c |   2 +-
  drivers/acpi/bus.c|   3 ++
  drivers/acpi/processor_core.c | 121 
 ++
  include/linux/acpi.h  |   2 +
  9 files changed, 172 insertions(+), 60 deletions(-)

 diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
 index b1698bc..7db5563 100644
 --- a/arch/ia64/kernel/acpi.c
 +++ b/arch/ia64/kernel/acpi.c
 @@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
   *  ACPI based hotplug CPU support
   */
  #ifdef CONFIG_ACPI_HOTPLUG_CPU
 -static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 +int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
  {
  #ifdef CONFIG_ACPI_NUMA
  /*
 diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
 index b07233b..db902d8 100644
 --- a/arch/x86/include/asm/mpspec.h
 +++ b/arch/x86/include/asm/mpspec.h
 @@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
  #endif
  
  int generic_processor_info(int apicid, int version);
 +int __generic_processor_info(int apicid, int version, bool enabled);
  
  #define PHYSID_ARRAY_SIZE   BITS_TO_LONGS(MAX_LOCAL_APIC)
  
 diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
 index dbe76a1..c79115b 100644
 --- a/arch/x86/kernel/acpi/boot.c
 +++ b/arch/x86/kernel/acpi/boot.c
 @@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
  return -EINVAL;
  }
  
 -if (!enabled) {
 +if (!enabled)
  ++disabled_cpus;
 -return -EINVAL;
 -}
  
  if (boot_cpu_physical_apicid != -1U)
  ver = apic_version[boot_cpu_physical_apicid];
  
 -return generic_processor_info(id, ver);
 +return __generic_processor_info(id, ver, enabled);
  }
  
  static int __init
 @@ -726,7 +724,7 @@ static void __init acpi_set_irq_model_ioapic(void)
  #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include <acpi/processor.h>
  
 -static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 +void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
  {
  #ifdef CONFIG_ACPI_NUMA
  int nid;
 diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
 index dcb5285..7fbf2cb 100644
 --- a/arch/x86/kernel/apic/apic.c
 +++ b/arch/x86/kernel/apic/apic.c
 @@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int

Re: [PATCH 1/2 V3] memory-hotplug: fix BUG_ON in move_freepages()

2015-05-10 Thread Gu Zheng
Hi Xishi,

What is the current status of this series?

Thanks,
Gu
On 04/22/2015 02:26 PM, Xishi Qiu wrote:

> add CC: Tejun Heo 
> 
> On 2015/4/21 18:15, Xishi Qiu wrote:
> 
>> Hot remove nodeXX, then hot add nodeXX. If BIOS report cpu first, it will 
>> call
>> hotadd_new_pgdat(nid, 0), this will set pgdat->node_start_pfn to 0. As nodeXX
>> exists at boot time, so pgdat->node_spanned_pages is the same as original. 
>> Then
>> free_area_init_core()->memmap_init() will pass a wrong start and a nonzero 
>> size.
>>
>> free_area_init_core()
>>  memmap_init()
>>  memmap_init_zone()
>>  early_pfn_in_nid()
>>  set_page_links()
>>
>> "if (!early_pfn_in_nid(pfn, nid))" will skip the pfn(memory in section), but 
>> it
>> will not skip the pfn(hole in section), this will cover and relink the page 
>> to
>> zone/nid, so page_zone() from memory and hole in the same section are 
>> different.
>>
>> The following call trace shows the bug. This patch add/remove memblk when hot
>> adding/removing memory, so it will set the node size to 0 when hotadd a new 
>> node
>> (original or new). init_currently_empty_zone() and memmap_init() will be 
>> called
>> in add_zone(), so need not to change them.
>>
>> [90476.077469] kernel BUG at mm/page_alloc.c:1042!  // move_freepages() -> 
>> BUG_ON(page_zone(start_page) != page_zone(end_page));
>> [90476.077469] invalid opcode:  [#1] SMP 
>> [90476.077469] Modules linked in: iptable_nat nf_conntrack_ipv4 
>> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack fuse btrfs zlib_deflate 
>> raid6_pq xor msdos ext4 mbcache jbd2 binfmt_misc bridge stp llc 
>> ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables 
>> cfg80211 rfkill sg iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp 
>> intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel 
>> ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd 
>> pcspkr igb vfat i2c_algo_bit dca fat sb_edac edac_core i2c_i801 lpc_ich 
>> i2c_core mfd_core shpchp acpi_pad ipmi_si ipmi_msghandler uinput nfsd 
>> auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod crc_t10dif 
>> crct10dif_common ahci libahci megaraid_sas tg3 ptp libata pps_core dm_mirror 
>> dm_region_hash dm_log dm_mod [last unloaded: rasf]
>> [90476.157382] CPU: 2 PID: 322803 Comm: updatedb Tainted: GF   W  
>> O--   3.10.0-229.1.2.5.hulk.rc14.x86_64 #1
>> [90476.157382] Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei 
>> N1, BIOS V100R001 04/13/2015
>> [90476.157382] task: 88006a6d5b00 ti: 880068eb8000 task.ti: 
>> 880068eb8000
>> [90476.157382] RIP: 0010:[]  [] 
>> move_freepages+0x12f/0x140
>> [90476.157382] RSP: 0018:880068ebb640  EFLAGS: 00010002
>> [90476.157382] RAX: 880002316cc0 RBX: ea0001bd RCX: 
>> 0001
>> [90476.157382] RDX: 880002476e40 RSI:  RDI: 
>> 880002316cc0
>> [90476.157382] RBP: 880068ebb690 R08: 0010 R09: 
>> ea0001bd7fc0
>> [90476.157382] R10: 0006f5ff R11:  R12: 
>> 0001
>> [90476.157382] R13: 0003 R14: 880002316eb8 R15: 
>> ea0001bd7fc0
>> [90476.157382] FS:  7f4d3ab95740() GS:880033a0() 
>> knlGS:
>> [90476.157382] CS:  0010 DS:  ES:  CR0: 80050033
>> [90476.157382] CR2: 7f4d3ae1a808 CR3: 00018907a000 CR4: 
>> 001407e0
>> [90476.157382] DR0:  DR1:  DR2: 
>> 
>> [90476.157382] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [90476.157382] Stack:
>> [90476.157382]  880068ebb698 880002316cc0 a800b5378098 
>> 880068ebb698
>> [90476.157382]  810b11dc 880002316cc0 0001 
>> 0003
>> [90476.157382]  880002316eb8 ea0001bd6420 880068ebb6a0 
>> 8115a003
>> [90476.157382] Call Trace:
>> [90476.157382]  [] ? update_curr+0xcc/0x150
>> [90476.157382]  [] move_freepages_block+0x73/0x80
>> [90476.157382]  [] __rmqueue+0x26a/0x460
>> [90476.157382]  [] ? native_sched_clock+0x13/0x80
>> [90476.157382]  [] get_page_from_freelist+0x7f2/0xd30
>> [90476.157382]  [] ? __switch_to+0x179/0x4a0
>> [90476.157382]  [] ? xfs_iext_bno_to_ext+0xa7/0x1a0 [xfs]
>> [90476.157382]  [] __alloc_pages_nodemask+0x1c1/0xc90
>> [90476.157382]  [] ? _xfs_buf_ioapply+0x31c/0x420 [xfs]
>> [90476.157382]  [] ? down_trylock+0x2d/0x40
>> [90476.157382]  [] ? xfs_buf_trylock+0x1f/0x80 [xfs]
>> [90476.157382]  [] alloc_pages_current+0xa9/0x170
>> [90476.157382]  [] new_slab+0x275/0x300
>> [90476.157382]  [] __slab_alloc+0x315/0x48f
>> [90476.157382]  [] ? kmem_zone_alloc+0x77/0x100 [xfs]
>> [90476.157382]  [] ? xfs_bmap_search_extents+0x5c/0xc0 
>> [xfs]
>> [90476.157382]  [] kmem_cache_alloc+0x193/0x1d0
>> [90476.157382]  [] ? kmem_zone_alloc+0x77/0x100 [xfs]
>> [90476.157382]  [] kmem_zone_alloc+0x77/0x100 [xfs]
>> 

Re: [PATCH] Hotplug: fix the bug that the system is down,when memory is not in node0 and cpu is logically hotadded.

2015-05-10 Thread Gu Zheng
Hi TJ, Song,

Sorry for late reply.

On 05/08/2015 11:23 PM, Tejun Heo wrote:

> Cc'ing Lai, Gu and Kamezawa as they've been working in the area for a
> while now.  Gu, is this related to what you've been working on?


Yes, they are the same. And we are still working on it, please refer to the
following for detail:
https://lkml.org/lkml/2015/4/24/143
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989

Regards,
Gu

> 
> Thanks.
> 
> On Fri, May 08, 2015 at 07:16:40PM +0800, Song Xiumiao wrote:
>> From: songxiumiao 
>>
>> By analysing the bug's function call trace, we find that the create_worker
>> function will allocate memory from node0. Because node0 is offline,
>> the allocation fails. Then we add a condition to ensure the node
>> is online, so the system can allocate memory from a node that is online.
>>
>> Follow is the bug information:
>> [root@localhost ~]# echo 1 > /sys/devices/system/cpu/cpu90/online
>> [  225.611209] smpboot: Booting Node 2 Processor 90 APIC 0x40
>> [18446744029.482996] kvm: enabling virtualization on CPU90
>> [  225.725503] TSC synchronization [CPU#43 -> CPU#90]:
>> [  225.730952] Measured 672516581900 cycles TSC warp between CPUs, turning 
>> off TSC clock.
>> [  225.739800] tsc: Marking TSC unstable due to check_tsc_sync_source failed
>> [  225.755126] BUG: unable to handle kernel paging request at 
>> 1b08
>> [  225.762931] IP: [] __alloc_pages_nodemask+0xb7/0x940
>> [  225.770247] PGD 449bb0067 PUD 46110e067 PMD 0
>> [  225.775248] Oops:  [#1] SMP
>> [  225.778875] Modules linked in: xt_CHECKSUM ip6t_rpfilter ip6t_REJECT 
>> nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 
>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntracd
>> [  225.868198] CPU: 43 PID: 5400 Comm: bash Not tainted 
>> 4.0.0-rc4-bug-fixed-remove #16
>> [  225.876754] Hardware name: Insyde Brickland/Type2 - Board Product Name1, 
>> BIOS Brickland.05.04.15.0024 02/28/2015
>> [  225.888122] task: 88045a3d8da0 ti: 88044612 task.ti: 
>> 88044612
>> [  225.896484] RIP: 0010:[]  [] 
>> __alloc_pages_nodemask+0xb7/0x940
>> [  225.906509] RSP: 0018:880446123918  EFLAGS: 00010246
>> [  225.912443] RAX: 1b00 RBX: 0010 RCX: 
>> 
>> [  225.920416] RDX:  RSI:  RDI: 
>> 002052d0
>> [  225.928388] RBP: 880446123a08 R08: 880460eca0c0 R09: 
>> 60eca101
>> [  225.936361] R10: 88046d007300 R11: 8108dd31 R12: 
>> 0001002a
>> [  225.944334] R13: 002052d0 R14: 0001 R15: 
>> 40d0
>> [  225.952306] FS:  7f9386450740() GS:88046db6() 
>> knlGS:
>> [  225.961346] CS:  0010 DS:  ES:  CR0: 80050033
>> [  225.967765] CR2: 1b08 CR3: 0004612a3000 CR4: 
>> 001407e0
>> [  225.975735] Stack:
>> [  225.977981]  002052d0  0003 
>> 88045a3d8da0
>> [  225.986291]  880446123988 811c7f81 88045a3d8da0 
>> 
>> [  225.994597]  80d2 88046d005500 0003000f 
>> 002052d0002052d0
>> [  226.002904] Call Trace:
>> [  226.005645]  [] ? alloc_pages_current+0x91/0x100
>> [  226.012557]  [] ? deactivate_slab+0x383/0x400
>> [  226.019173]  [] new_slab+0xa7/0x460
>> [  226.024826]  [] __slab_alloc+0x310/0x470
>> [  226.030960]  [] ? get_from_free_list+0x46/0x60
>> [  226.037679]  [] ? alloc_worker+0x21/0x50
>> [  226.043812]  [] kmem_cache_alloc_node_trace+0x91/0x250
>> [  226.051299]  [] alloc_worker+0x21/0x50
>> [  226.057236]  [] create_worker+0x53/0x1e0
>> [  226.063357]  [] alloc_unbound_pwq+0x2a2/0x510
>> [  226.069974]  [] wq_update_unbound_numa+0x1b4/0x220
>> [  226.077076]  [] workqueue_cpu_up_callback+0x308/0x3d0
>> [  226.084468]  [] notifier_call_chain+0x4e/0x80
>> [  226.091084]  [] __raw_notifier_call_chain+0xe/0x10
>> [  226.098189]  [] cpu_notify+0x23/0x50
>> [  226.103929]  [] _cpu_up+0x188/0x1a0
>> [  226.109574]  [] cpu_up+0x89/0xb0
>> [  226.114923]  [] cpu_subsys_online+0x40/0x90
>> [  226.121350]  [] device_online+0x6d/0xa0
>> [  226.127382]  [] online_store+0x95/0xa0
>> [  226.133322]  [] dev_attr_store+0x18/0x30
>> [  226.139457]  [] sysfs_kf_write+0x3d/0x50
>> [  226.145586]  [] kernfs_fop_write+0x12a/0x180
>> [  226.152109]  [] vfs_write+0xb7/0x1f0
>> [  226.157853]  [] ? do_audit_syscall_entry+0x6c/0x70
>> [  226.164954]  [] SyS_write+0x55/0xd0
>> [  226.170595]  [] system_call_fastpath+0x12/0x17
>> [  226.177306] Code: 30 97 00 89 45 bc 83 e1 0f b8 22 01 32 01 01 c9 d3 f8 
>> 83 e0 03 89 9d 6c ff ff ff 83 e3 10 89 45 c0 0f 85 6d 01 00 00 48 8b 45 88 
>> <48> 83 78 08 00 0f 84 51 01 00 00 b8 01
>> [  226.199175] RIP  [] __alloc_pages_nodemask+0xb7/0x940
>> [  226.206576]  RSP 
>> [  226.210471] CR2: 1b08
>> [  226.227939] ---[ end trace 30d753e1e1124696 ]---
>> [  226.412591] Kernel panic - not syncing: Fatal exception
>> [  226.430948] 


Re: [RESEND RFC PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-04-27 Thread Gu Zheng
Hi Hanjun, Rafael,

On 04/25/2015 06:14 PM, Hanjun Guo wrote:

> On 2015/4/24 22:45, Rafael J. Wysocki wrote:
>> On Friday, April 24, 2015 05:58:32 PM Gu Zheng wrote:
>>> Yasuaki Ishimatsu found that with node online/offline, cpu<->node 
>>> relationship
>>> is  established. Because workqueue uses a info which  was established at 
>>> boot
>>> time, but it may be changed by node hotpluging.
>>>
>>> Once pool->node points to a stale node, following allocation failure
>>> happens.
>>>   ==
>>>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>>   cache: kmalloc-192, object size: 192, buffer size: 192, default
>>> order:
>>> 1, min order: 0
>>>   node 0: slabs: 6172, objs: 259224, free: 245741
>>>   node 1: slabs: 3261, objs: 136962, free: 127656
>>>   ==
>>>
>>> As the apicid <---> pxm and pxm <--> node relationship are persistent, then
>>> the apicid <--> node mapping is persistent, so the root cause is the
>>> cpu-id <-> lapicid mapping is not persistent (because the currently 
>>> implementation
>>> always choose the first free cpu id for the new added cpu). If we can build
>>> persistent cpu-id <-> lapicid relationship, this problem will be fixed.
>>>
>>> This patch tries to build the whole world mapping cpuid <-> apicid <-> pxm 
>>> <-> node
>>> for all possible processor at the boot, the detail implementation are 2 
>>> steps:
>>> Step1: generate a logic cpu id for all the local apic (both enabled and 
>>> dsiabled)
>>>when register local apic
>>> Step2: map the cpu to the phyical node via an additional acpi ns walk for 
>>> processor.
>>>
>>> Please refer to:
>>> https://lkml.org/lkml/2015/2/27/145
>>> https://lkml.org/lkml/2015/3/25/989
>>> for the previous discussion.
>>>
>>> Reported-by: Yasuaki Ishimatsu 
>>> Signed-off-by: Gu Zheng 
>> This one will conflict with the ARM64 ACPI material when that goes in, so 
>> it'll
>> need to be rebased on top of that.
> 
> Yes, please. Then I will take a look too.

Thanks for your reminder, will rebase it soon.

Regards,
Gu

> 
> Thanks
> Hanjun
> 
> .
> 



Re: [RESEND RFC PATCH 2/2] gfp: use the best near online node if the target node is offline

2015-04-27 Thread Gu Zheng
Hi Kame-san,

On 04/27/2015 05:44 PM, Kamezawa Hiroyuki wrote:

> On 2015/04/25 5:01, Andrew Morton wrote:
>> On Fri, 24 Apr 2015 17:58:33 +0800 Gu Zheng  wrote:
>>
>>> Since the change to the cpu <--> mapping (map the cpu to the physical
>>> node for all possible at the boot), the node of cpu may be not present,
>>> so we use the best near online node if the node is not present in the low
>>> level allocation APIs.
>>>
>>> ...
>>>
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -298,9 +298,31 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
>>>   return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
>>>   }
>>>
>>> +static int find_near_online_node(int node)
>>> +{
>>> +int n, val;
>>> +int min_val = INT_MAX;
>>> +int best_node = -1;
>>> +
>>> +for_each_online_node(n) {
>>> +val = node_distance(node, n);
>>> +
>>> +if (val < min_val) {
>>> +min_val = val;
>>> +best_node = n;
>>> +}
>>> +}
>>> +
>>> +return best_node;
>>> +}
>>
>> This should be `inline' if it's in a header file.
>>
>> But it is far too large to be inlined anyway - please move it to a .c file.
>>
>> And please document it.  A critical thing to describe is how we
>> determine whether a node is "near".  There are presumably multiple ways
>> in which we could decide that a node is "near" (number of hops, minimum
>> latency, ...).  Which one did you choose, and why?
>>
>>>   static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>>>   unsigned int order)
>>>   {
>>> +/* Offline node, use the best near online node */
>>> +if (!node_online(nid))
>>> +nid = find_near_online_node(nid);
>>> +
>>>   /* Unknown node is current node */
>>>   if (nid < 0)
>>>   nid = numa_node_id();
>>> @@ -311,7 +333,11 @@ static inline struct page *alloc_pages_node(int nid, 
>>> gfp_t gfp_mask,
>>>   static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
>>>   unsigned int order)
>>>   {
>>> -VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>>> +/* Offline node, use the best near online node */
>>> +if (!node_online(nid))
>>> +nid = find_near_online_node(nid);
> 
> In above VM_BUG_ON(), !node_online(nid) is the bug.


But it will be possible here with the change in PATCH 1/2.

> 
>>> +
>>> +VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>>>
>>>   return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>>>   }
>>
>> Ouch.  These functions are called very frequently, and adding overhead
>> to them is a big deal.  And the patch even adds overhead to non-x86
>> architectures which don't benefit from it!
>>
>> Is there no way this problem can be fixed somewhere else?  Preferably
>> by fixing things up at hotplug time.
> 
> I agree. the results should be cached. If necessary, in per-cpu line.

Sounds great, will try this way.

Regards,
Gu

> 
> 
> Thanks,
> -Kame
> 
> 
> .
> 


Re: [RESEND RFC PATCH 2/2] gfp: use the best near online node if the target node is offline

2015-04-27 Thread Gu Zheng
Hi Andrew,

On 04/25/2015 04:01 AM, Andrew Morton wrote:

> On Fri, 24 Apr 2015 17:58:33 +0800 Gu Zheng  wrote:
> 
>> Since the change to the cpu <--> mapping (map the cpu to the physical
>> node for all possible at the boot), the node of cpu may be not present,
>> so we use the best near online node if the node is not present in the low
>> level allocation APIs.
>>
>> ...
>>
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -298,9 +298,31 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
>>  return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
>>  }
>>  
>> +static int find_near_online_node(int node)
>> +{
>> +int n, val;
>> +int min_val = INT_MAX;
>> +int best_node = -1;
>> +
>> +for_each_online_node(n) {
>> +val = node_distance(node, n);
>> +
>> +if (val < min_val) {
>> +min_val = val;
>> +best_node = n;
>> +}
>> +}
>> +
>> +return best_node;
>> +}
> 
> This should be `inline' if it's in a header file.
> 
> But it is far too large to be inlined anyway - please move it to a .c file.

Agree.

> 
> And please document it.  A critical thing to describe is how we
> determine whether a node is "near".  There are presumably multiple ways
> in which we could decide that a node is "near" (number of hops, minimum
> latency, ...).  Which one did you choose, and why?

It just reuses the dropped code from PATCH 1/2, based on the node_distance table,
which is an arch-specific definition; the data mostly comes from the
firmware info, e.g. the SLIT table.
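
For reference, a toy user-space illustration (made-up distance values, not a
real SLIT) of how picking the minimum node_distance() behaves:

#include <stdio.h>
#include <limits.h>

#define NR_NODES 4

/* Hypothetical SLIT-style matrix: 10 = local node, larger = further away. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};
static const int online[NR_NODES] = { 1, 1, 0, 1 };	/* pretend node 2 is offline */

static int nearest_online_node(int node)
{
	int n, best = -1, min_val = INT_MAX;

	for (n = 0; n < NR_NODES; n++) {
		if (!online[n])
			continue;
		if (dist[node][n] < min_val) {
			min_val = dist[node][n];
			best = n;
		}
	}
	return best;
}

int main(void)
{
	/* Requests aimed at offline node 2 fall back to its closest neighbour. */
	printf("fallback for node 2: node %d\n", nearest_online_node(2));
	return 0;
}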

> 
>>  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>>  unsigned int order)
>>  {
>> +/* Offline node, use the best near online node */
>> +if (!node_online(nid))
>> +nid = find_near_online_node(nid);
>> +
>>  /* Unknown node is current node */
>>  if (nid < 0)
>>  nid = numa_node_id();
>> @@ -311,7 +333,11 @@ static inline struct page *alloc_pages_node(int nid, 
>> gfp_t gfp_mask,
>>  static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
>>  unsigned int order)
>>  {
>> -VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>> +/* Offline node, use the best near online node */
>> +if (!node_online(nid))
>> +nid = find_near_online_node(nid);
>> +
>> +VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>>  
>>  return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>>  }
> 
> Ouch.  These functions are called very frequently, and adding overhead
> to them is a big deal.  And the patch even adds overhead to non-x86
> architectures which don't benefit from it!
> 
> Is there no way this problem can be fixed somewhere else?  Preferably
> by fixing things up at hotplug time.

As Kame suggested, maintaining a per-cpu cache of the alternative node,
only for the x86 arch, seems a good choice.
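
A minimal sketch of that direction (hypothetical names; it assumes
find_near_online_node() from patch 2/2 and that the cache is refreshed from a
hotplug callback, keeping the allocator fast path free of the node scan):

#include <linux/cpumask.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/percpu.h>
#include <linux/topology.h>

static DEFINE_PER_CPU(int, cached_near_node) = NUMA_NO_NODE;

/* Call this from a memory/node hotplug notifier, not from the hot path. */
static void refresh_cached_near_nodes(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		int nid = cpu_to_node(cpu);

		if (!node_online(nid))
			nid = find_near_online_node(nid);	/* from patch 2/2 */
		per_cpu(cached_near_node, cpu) = nid;
	}
}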

Regards,
Gu

> .
> 


[RESEND RFC PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-04-24 Thread Gu Zheng
Yasuaki Ishimatsu found that with node online/offline, the cpu<->node relationship
is re-established. Workqueue uses info which was established at boot
time, but it may be changed by node hotplugging.

Once pool->node points to a stale node, following allocation failure
happens.
  ==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
  ==

As the apicid <---> pxm and pxm <--> node relationships are persistent, the
apicid <--> node mapping is persistent too, so the root cause is that the
cpu-id <-> lapicid mapping is not persistent (because the current
implementation always chooses the first free cpu id for the newly added cpu).
If we can build a persistent cpu-id <-> lapicid relationship, this problem
will be fixed.

This patch tries to build the whole-world mapping cpuid <-> apicid <-> pxm <->
node
for all possible processors at boot; the detailed implementation has 2 steps
(a rough sketch of the Step1 idea follows the list):
Step1: generate a logical cpu id for all the local apics (both enabled and
disabled)
   when registering the local apic
Step2: map the cpu to the physical node via an additional acpi ns walk for the
processor.
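
A rough user-space toy model of the Step1 bookkeeping (made-up apicid values;
the real helper is the get_cpuid() added in the diff below):

#include <stdio.h>

#define NR_CPUS 8

static int apicid_of[NR_CPUS];	/* logical cpu id -> apicid, for ids handed out */
static int used;		/* how many logical ids have been handed out */

static int persistent_cpuid(int apicid)
{
	int i;

	for (i = 0; i < used; i++)	/* seen before? reuse the old id */
		if (apicid_of[i] == apicid)
			return i;
	if (used >= NR_CPUS)
		return -1;		/* out of logical cpu ids */
	apicid_of[used] = apicid;	/* record the mapping once, keep it forever */
	return used++;
}

int main(void)
{
	printf("apicid 0x10 -> cpu %d\n", persistent_cpuid(0x10));	/* cpu 0 */
	printf("apicid 0x20 -> cpu %d\n", persistent_cpuid(0x20));	/* cpu 1 */
	/* hot remove + hot add of apicid 0x10: the same logical id comes back */
	printf("apicid 0x10 -> cpu %d\n", persistent_cpuid(0x10));	/* cpu 0 */
	return 0;
}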

Please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
for the previous discussion.

Reported-by: Yasuaki Ishimatsu 
Signed-off-by: Gu Zheng 
---
 arch/ia64/kernel/acpi.c   |2 +-
 arch/x86/include/asm/mpspec.h |1 +
 arch/x86/kernel/acpi/boot.c   |8 +--
 arch/x86/kernel/apic/apic.c   |   71 +++
 arch/x86/mm/numa.c|   20 
 drivers/acpi/acpi_processor.c |2 +-
 drivers/acpi/bus.c|3 +
 drivers/acpi/processor_core.c |  108 ++--
 include/linux/acpi.h  |2 +
 9 files changed, 162 insertions(+), 55 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 2c44989..e7958f8 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 803b684..b084cc0 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
@@ -726,7 +724,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include <acpi/processor.h>
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb5285..7fbf2cb 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-int generic_processor_info(int apicid, int version)
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu hot removed.
+ * */
+static int apicid_to_cpuid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Internal cpu id bits, set the bit once cpu present, and never clear it.
+ * */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+   int free_id, i;
+
+   free_id = cpumask_next_zero(-1, &cpuid_mask);
+   if (free_id >= nr_cpu_ids)
+   return -1;
+
+   for (i = 0; i < free_id; i++)
+   if (apicid_to_cpuid[i] == apicid)
+   return i;
+
+   apicid_to_cpuid[free_id] = apicid;
+   cpumask_set_cpu(free_id, &cpuid_mask);
+
+   return free_id;
+

[RESEND RFC PATCH 2/2] gfp: use the best near online node if the target node is offline

2015-04-24 Thread Gu Zheng
Since the change to the cpu <--> node mapping (map the cpu to the physical
node for all possible cpus at boot), the node of a cpu may not be present,
so we use the best near online node if the node is not present in the low
level allocation APIs.

Signed-off-by: Gu Zheng 
---
 include/linux/gfp.h |   28 +++-
 1 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 97a9373..19684a8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -298,9 +298,31 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
 }
 
+static int find_near_online_node(int node)
+{
+   int n, val;
+   int min_val = INT_MAX;
+   int best_node = -1;
+
+   for_each_online_node(n) {
+   val = node_distance(node, n);
+
+   if (val < min_val) {
+   min_val = val;
+   best_node = n;
+   }
+   }
+
+   return best_node;
+}
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
+   /* Offline node, use the best near online node */
+   if (!node_online(nid))
+   nid = find_near_online_node(nid);
+
/* Unknown node is current node */
if (nid < 0)
nid = numa_node_id();
@@ -311,7 +333,11 @@ static inline struct page *alloc_pages_node(int nid, gfp_t 
gfp_mask,
 static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
-   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+   /* Offline node, use the best near online node */
+   if (!node_online(nid))
+   nid = find_near_online_node(nid);
+
+   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
 
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
-- 
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RESEND RFC PATCH 1/2] x86/cpu hotplug: make apicid -- cpuid mapping persistent

2015-04-24 Thread Gu Zheng
Yasuaki Ishimatsu found that with node online/offline, cpu-node relationship
is  established. Because workqueue uses a info which  was established at boot
time, but it may be changed by node hotpluging.

Once pool-node points to a stale node, following allocation failure
happens.
  ==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
  ==

As the apicid --- pxm and pxm -- node relationship are persistent, then
the apicid -- node mapping is persistent, so the root cause is the
cpu-id - lapicid mapping is not persistent (because the currently 
implementation
always choose the first free cpu id for the new added cpu). If we can build
persistent cpu-id - lapicid relationship, this problem will be fixed.

This patch tries to build the whole world mapping cpuid - apicid - pxm - 
node
for all possible processor at the boot, the detail implementation are 2 steps:
Step1: generate a logic cpu id for all the local apic (both enabled and 
dsiabled)
   when register local apic
Step2: map the cpu to the phyical node via an additional acpi ns walk for 
processor.

Please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
for the previous discussion.

Reported-by: Yasuaki Ishimatsu isimatu.yasu...@jp.fujitsu.com
Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 arch/ia64/kernel/acpi.c   |2 +-
 arch/x86/include/asm/mpspec.h |1 +
 arch/x86/kernel/acpi/boot.c   |8 +--
 arch/x86/kernel/apic/apic.c   |   71 +++
 arch/x86/mm/numa.c|   20 
 drivers/acpi/acpi_processor.c |2 +-
 drivers/acpi/bus.c|3 +
 drivers/acpi/processor_core.c |  108 ++--
 include/linux/acpi.h  |2 +
 9 files changed, 162 insertions(+), 55 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 2c44989..e7958f8 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 803b684..b084cc0 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
@@ -726,7 +724,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include acpi/processor.h
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb5285..7fbf2cb 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-int generic_processor_info(int apicid, int version)
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu hot removed.
+ * */
+static int apicid_to_cpuid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Internal cpu id bits, set the bit once cpu present, and never clear it.
+ * */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+   int free_id, i;
+
+   free_id = cpumask_next_zero(-1, cpuid_mask);
+   if (free_id = nr_cpu_ids)
+   return -1;
+
+   for (i = 0; i  free_id; i++)
+   if (apicid_to_cpuid[i] == apicid)
+   return i;
+
+   apicid_to_cpuid[free_id] = apicid;
+   cpumask_set_cpu(free_id, cpuid_mask);
+
+   return free_id;
+}
+
+int

[RESEND RFC PATCH 2/2] gfp: use the best near online node if the target node is offline

2015-04-24 Thread Gu Zheng
Since the change to the cpu -- mapping (map the cpu to the physical
node for all possible at the boot), the node of cpu may be not present,
so we use the best near online node if the node is not present in the low
level allocation APIs.

Signed-off-by: Gu Zheng guz.f...@cn.fujitsu.com
---
 include/linux/gfp.h |   28 +++-
 1 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 97a9373..19684a8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -298,9 +298,31 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
 }
 
+static int find_near_online_node(int node)
+{
+   int n, val;
+   int min_val = INT_MAX;
+   int best_node = -1;
+
+   for_each_online_node(n) {
+   val = node_distance(node, n);
+
+   if (val < min_val) {
+   min_val = val;
+   best_node = n;
+   }
+   }
+
+   return best_node;
+}
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
+   /* Offline node, use the best near online node */
+   if (!node_online(nid))
+   nid = find_near_online_node(nid);
+
/* Unknown node is current node */
	if (nid < 0)
nid = numa_node_id();
@@ -311,7 +333,11 @@ static inline struct page *alloc_pages_node(int nid, gfp_t 
gfp_mask,
 static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
-   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+   /* Offline node, use the best near online node */
+   if (!node_online(nid))
+   nid = find_near_online_node(nid);
+
+   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
 
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
-- 
1.7.7

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-04-22 Thread Gu Zheng
ping...

On 04/17/2015 08:48 PM, Gu Zheng wrote:

> Yasuaki Ishimatsu found that with node online/offline, cpu<->node relationship
> is  established. Because workqueue uses a info which  was established at boot
> time, but it may be changed by node hotpluging.
> 
> Once pool->node points to a stale node, following allocation failure
> happens.
>   ==
>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>   cache: kmalloc-192, object size: 192, buffer size: 192, default
> order:
> 1, min order: 0
>   node 0: slabs: 6172, objs: 259224, free: 245741
>   node 1: slabs: 3261, objs: 136962, free: 127656
>   ==
> 
> As the apicid <---> pxm and pxm <--> node relationship are persistent, then
> the apicid <--> node mapping is persistent, so the root cause is the
> cpu-id <-> lapicid mapping is not persistent (because the currently 
> implementation
> always choose the first free cpu id for the new added cpu). If we can build
> persistent cpu-id <-> lapicid relationship, this problem will be fixed.
> 
> This patch tries to build the whole world mapping cpuid <-> apicid <-> pxm 
> <-> node
> for all possible processor at the boot, the detail implementation are 2 steps:
> Step1: generate a logic cpu id for all the local apic (both enabled and 
> dsiabled)
>when register local apic
> Step2: map the cpu to the phyical node via an additional acpi ns walk for 
> processor.
> 
> Please refer to:
> https://lkml.org/lkml/2015/2/27/145
> https://lkml.org/lkml/2015/3/25/989
> for the previous discussion.
> 
> Reported-by: Yasuaki Ishimatsu 
> Signed-off-by: Gu Zheng 
> ---
>  arch/ia64/kernel/acpi.c   |2 +-
>  arch/x86/include/asm/mpspec.h |1 +
>  arch/x86/kernel/acpi/boot.c   |8 +--
>  arch/x86/kernel/apic/apic.c   |   71 +++
>  arch/x86/mm/numa.c|   20 
>  drivers/acpi/acpi_processor.c |2 +-
>  drivers/acpi/bus.c|3 +
>  drivers/acpi/processor_core.c |  108 ++--
>  include/linux/acpi.h  |2 +
>  9 files changed, 162 insertions(+), 55 deletions(-)
> 
> diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
> index 2c44989..e7958f8 100644
> --- a/arch/ia64/kernel/acpi.c
> +++ b/arch/ia64/kernel/acpi.c
> @@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
>   *  ACPI based hotplug CPU support
>   */
>  #ifdef CONFIG_ACPI_HOTPLUG_CPU
> -static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
> +int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>  {
>  #ifdef CONFIG_ACPI_NUMA
>   /*
> diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
> index b07233b..db902d8 100644
> --- a/arch/x86/include/asm/mpspec.h
> +++ b/arch/x86/include/asm/mpspec.h
> @@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
>  #endif
>  
>  int generic_processor_info(int apicid, int version);
> +int __generic_processor_info(int apicid, int version, bool enabled);
>  
>  #define PHYSID_ARRAY_SIZE	BITS_TO_LONGS(MAX_LOCAL_APIC)
>  
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index 803b684..b084cc0 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
>   return -EINVAL;
>   }
>  
> - if (!enabled) {
> + if (!enabled)
>   ++disabled_cpus;
> - return -EINVAL;
> - }
>  
>   if (boot_cpu_physical_apicid != -1U)
>   ver = apic_version[boot_cpu_physical_apicid];
>  
> - return generic_processor_info(id, ver);
> + return __generic_processor_info(id, ver, enabled);
>  }
>  
>  static int __init
> @@ -726,7 +724,7 @@ static void __init acpi_set_irq_model_ioapic(void)
>  #ifdef CONFIG_ACPI_HOTPLUG_CPU
>  #include <acpi/processor.h>
>  
> -static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
> +void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
>  {
>  #ifdef CONFIG_ACPI_NUMA
>   int nid;
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index dcb5285..7fbf2cb 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>   apic_write(APIC_LVT1, value);
>  }
>  
> -int generic_processor_info(int apicid, int version)
> +/*
> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
> + * Do not clear the mapping even if cpu hot removed.
> + * */
> +static in

Re: [PATCH 1/2 V2] memory-hotplug: fix BUG_ON in move_freepages()

2015-04-19 Thread Gu Zheng
Hi Ishimatsu, Xishi,

On 04/20/2015 10:11 AM, Yasuaki Ishimatsu wrote:

> 
>> When hot adding memory and creating new node, the node is offline.
>> And after calling node_set_online(), the node becomes online.
>>
>> Oh, sorry. I misread your ptaches.
>>
> 
> Please ignore it...

This also seems like a misreading to me.
Let me clarify my worry here:
if we set the node size to 0 here, it may hide more things than we expect,
and all the init chunks around the size (spanned/present/managed...) will
be nonsense, and the user/caller will not get a summary of the hot-added node
because of the changes here.
I am not sure the worry is necessary, please correct me if I am missing something.

Regards,
Gu

> 
> Thanks,
> Yasuaki Ishimatsu
> 
> On 
> Yasuaki Ishimatsu  wrote:
> 
>>
>> When hot adding memory and creating new node, the node is offline.
>> And after calling node_set_online(), the node becomes online.
>>
>> Oh, sorry. I misread your ptaches.
>>
>> Thanks,
>> Yasuaki Ishimatsu
>>
>> On Mon, 20 Apr 2015 09:33:10 +0800
>> Xishi Qiu  wrote:
>>
>>> On 2015/4/18 4:05, Yasuaki Ishimatsu wrote:
>>>

 Your patches will fix your issue.
 But, if BIOS reports memory first at node hot add, pgdat can
 not be initialized.

 Memory hot add flows are as follows:

 add_memory
   ...
   -> hotadd_new_pgdat()
   ...
   -> node_set_online(nid)

 When calling hotadd_new_pgdat() for a hot added node, the node is
 offline because node_set_online() is not called yet. So if applying
 your patches, the pgdat is not initialized in this case.

 Thanks,
 Yasuaki Ishimatsu

>>>
>>> Hi Yasuaki,
>>>
>>> I'm not quite understand, when BIOS reports memory first, why pgdat
>>> can not be initialized?
>>> When hotadd a new node, hotadd_new_pgdat() will be called too, and
>>> when hotadd memory to a existent node, it's no need to call 
>>> hotadd_new_pgdat(),
>>> right?
>>>
>>> Thanks,
>>> Xishi Qiu
>>>
> .
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2 V2] memory-hotplug: fix BUG_ON in move_freepages()

2015-04-19 Thread Gu Zheng
Hi Xishi,
On 04/18/2015 04:05 AM, Yasuaki Ishimatsu wrote:

> 
> Your patches will fix your issue.
> But, if BIOS reports memory first at node hot add, pgdat can
> not be initialized.
> 
> Memory hot add flows are as follows:
> 
> add_memory
>   ...
>   -> hotadd_new_pgdat()
>   ...
>   -> node_set_online(nid)
> 
> When calling hotadd_new_pgdat() for a hot added node, the node is
> offline because node_set_online() is not called yet. So if applying
> your patches, the pgdat is not initialized in this case.

Ishimatsu's worry is reasonable. And I am afraid the fix here is a bit
overkill.

> 
> Thanks,
> Yasuaki Ishimatsu
> 
> On Fri, 17 Apr 2015 18:50:32 +0800
> Xishi Qiu  wrote:
> 
>> Hot remove nodeXX, then hot add nodeXX. If BIOS report cpu first, it will 
>> call
>> hotadd_new_pgdat(nid, 0), this will set pgdat->node_start_pfn to 0. As nodeXX
>> exists at boot time, so pgdat->node_spanned_pages is the same as original. 
>> Then
>> free_area_init_core()->memmap_init() will pass a wrong start and a nonzero 
>> size.

As your analysis said, the root cause here is passing a *0* as the
node_start_pfn, and then the chaos occurred when initializing the zones.
This only happens to the re-hotadded node, so how about using the saved
*node_start_pfn* (via get_pfn_range_for_nid(nid, &start_pfn, &end_pfn))
instead if we find "pgdat->node_start_pfn == 0 && !node_online(XXX)"?

Thanks,
Gu

>>
>> free_area_init_core()
>>  memmap_init()
>>  memmap_init_zone()
>>  early_pfn_in_nid()
>>  set_page_links()
>>
>> "if (!early_pfn_in_nid(pfn, nid))" will skip the pfn(memory in section), but 
>> it
>> will not skip the pfn(hole in section), this will cover and relink the page 
>> to
>> zone/nid, so page_zone() from memory and hole in the same section are 
>> different.
>> The following call trace shows the bug.
>>
>> This patch will set the node size to 0 when hotadd a new node(original or 
>> new).
>> init_currently_empty_zone() and memmap_init() will be called in add_zone(), 
>> so
>> need not to change it.
>>
>> [90476.077469] kernel BUG at mm/page_alloc.c:1042!  // move_freepages() -> 
>> BUG_ON(page_zone(start_page) != page_zone(end_page));
>> [90476.077469] invalid opcode:  [#1] SMP 
>> [90476.077469] Modules linked in: iptable_nat nf_conntrack_ipv4 
>> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack fuse btrfs zlib_deflate 
>> raid6_pq xor msdos ext4 mbcache jbd2 binfmt_misc bridge stp llc 
>> ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables 
>> cfg80211 rfkill sg iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp 
>> intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel 
>> ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd 
>> pcspkr igb vfat i2c_algo_bit dca fat sb_edac edac_core i2c_i801 lpc_ich 
>> i2c_core mfd_core shpchp acpi_pad ipmi_si ipmi_msghandler uinput nfsd 
>> auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod crc_t10dif 
>> crct10dif_common ahci libahci megaraid_sas tg3 ptp libata pps_core dm_mirror 
>> dm_region_hash dm_log dm_mod [last unloaded: rasf]
>> [90476.157382] CPU: 2 PID: 322803 Comm: updatedb Tainted: GF   W  
>> O--   3.10.0-229.1.2.5.hulk.rc14.x86_64 #1
>> [90476.157382] Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei 
>> N1, BIOS V100R001 04/13/2015
>> [90476.157382] task: 88006a6d5b00 ti: 880068eb8000 task.ti: 
>> 880068eb8000
>> [90476.157382] RIP: 0010:[]  [] 
>> move_freepages+0x12f/0x140
>> [90476.157382] RSP: 0018:880068ebb640  EFLAGS: 00010002
>> [90476.157382] RAX: 880002316cc0 RBX: ea0001bd RCX: 
>> 0001
>> [90476.157382] RDX: 880002476e40 RSI:  RDI: 
>> 880002316cc0
>> [90476.157382] RBP: 880068ebb690 R08: 0010 R09: 
>> ea0001bd7fc0
>> [90476.157382] R10: 0006f5ff R11:  R12: 
>> 0001
>> [90476.157382] R13: 0003 R14: 880002316eb8 R15: 
>> ea0001bd7fc0
>> [90476.157382] FS:  7f4d3ab95740() GS:880033a0() 
>> knlGS:
>> [90476.157382] CS:  0010 DS:  ES:  CR0: 80050033
>> [90476.157382] CR2: 7f4d3ae1a808 CR3: 00018907a000 CR4: 
>> 001407e0
>> [90476.157382] DR0:  DR1:  DR2: 
>> 
>> [90476.157382] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [90476.157382] Stack:
>> [90476.157382]  880068ebb698 880002316cc0 a800b5378098 
>> 880068ebb698
>> [90476.157382]  810b11dc 880002316cc0 0001 
>> 0003
>> [90476.157382]  880002316eb8 ea0001bd6420 880068ebb6a0 
>> 8115a003
>> [90476.157382] Call Trace:
>> [90476.157382]  [] ? update_curr+0xcc/0x150
>> [90476.157382]  [] move_freepages_block+0x73/0x80
>> [90476.157382]  [] __rmqueue+0x26a/0x460
>> [90476.157382]  [] ? native_sched_clock+0x13/0x80
>> 


[RFC PATCH 2/2] gfp: use the best near online node if the target node is offline

2015-04-17 Thread Gu Zheng
Since the change to the cpu <--> node mapping (map the cpu to the physical
node for all possible cpus at boot), the node of a cpu may not be present,
so we use the best near online node if the node is not present in the
low-level allocation APIs.

Signed-off-by: Gu Zheng 
---
 include/linux/gfp.h |   28 +++-
 1 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 97a9373..19684a8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -298,9 +298,31 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
 }
 
+static int find_near_online_node(int node)
+{
+   int n, val;
+   int min_val = INT_MAX;
+   int best_node = -1;
+
+   for_each_online_node(n) {
+   val = node_distance(node, n);
+
+   if (val < min_val) {
+   min_val = val;
+   best_node = n;
+   }
+   }
+
+   return best_node;
+}
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
+   /* Offline node, use the best near online node */
+   if (!node_online(nid))
+   nid = find_near_online_node(nid);
+
/* Unknown node is current node */
if (nid < 0)
nid = numa_node_id();
@@ -311,7 +333,11 @@ static inline struct page *alloc_pages_node(int nid, gfp_t 
gfp_mask,
 static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
unsigned int order)
 {
-   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+   /* Offline node, use the best near online node */
+   if (!node_online(nid))
+   nid = find_near_online_node(nid);
+
+   VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
 
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
-- 
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-04-17 Thread Gu Zheng
Yasuaki Ishimatsu found that with node online/offline, the cpu<->node
relationship can change. But workqueue uses info which was established at boot
time, and it may be changed by node hotplugging.

Once pool->node points to a stale node, the following allocation failure
happens.
  ==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
  ==

As the apicid <--> pxm and pxm <--> node relationships are persistent, the
apicid <--> node mapping is persistent, so the root cause is that the
cpu-id <-> lapicid mapping is not persistent (because the current
implementation always chooses the first free cpu id for a newly added cpu).
If we can build a persistent cpu-id <-> lapicid relationship, this problem
will be fixed.

This patch tries to build the whole-world mapping cpuid <-> apicid <-> pxm <-> node
for all possible processors at boot; the implementation has 2 steps:
Step1: generate a logical cpu id for all local apics (both enabled and disabled)
   when registering the local apic
Step2: map the cpu to the physical node via an additional ACPI namespace walk
   for processors.

Please refer to:
https://lkml.org/lkml/2015/2/27/145
https://lkml.org/lkml/2015/3/25/989
for the previous discussion.

Reported-by: Yasuaki Ishimatsu 
Signed-off-by: Gu Zheng 
---
 arch/ia64/kernel/acpi.c   |2 +-
 arch/x86/include/asm/mpspec.h |1 +
 arch/x86/kernel/acpi/boot.c   |8 +--
 arch/x86/kernel/apic/apic.c   |   71 +++
 arch/x86/mm/numa.c|   20 
 drivers/acpi/acpi_processor.c |2 +-
 drivers/acpi/bus.c|3 +
 drivers/acpi/processor_core.c |  108 ++--
 include/linux/acpi.h  |2 +
 9 files changed, 162 insertions(+), 55 deletions(-)

diff --git a/arch/ia64/kernel/acpi.c b/arch/ia64/kernel/acpi.c
index 2c44989..e7958f8 100644
--- a/arch/ia64/kernel/acpi.c
+++ b/arch/ia64/kernel/acpi.c
@@ -796,7 +796,7 @@ int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
  *  ACPI based hotplug CPU support
  */
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
-static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
/*
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index b07233b..db902d8 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -86,6 +86,7 @@ static inline void early_reserve_e820_mpc_new(void) { }
 #endif
 
 int generic_processor_info(int apicid, int version);
+int __generic_processor_info(int apicid, int version, bool enabled);
 
 #define PHYSID_ARRAY_SIZE  BITS_TO_LONGS(MAX_LOCAL_APIC)
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 803b684..b084cc0 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -174,15 +174,13 @@ static int acpi_register_lapic(int id, u8 enabled)
return -EINVAL;
}
 
-   if (!enabled) {
+   if (!enabled)
++disabled_cpus;
-   return -EINVAL;
-   }
 
if (boot_cpu_physical_apicid != -1U)
ver = apic_version[boot_cpu_physical_apicid];
 
-   return generic_processor_info(id, ver);
+   return __generic_processor_info(id, ver, enabled);
 }
 
 static int __init
@@ -726,7 +724,7 @@ static void __init acpi_set_irq_model_ioapic(void)
 #ifdef CONFIG_ACPI_HOTPLUG_CPU
 #include <acpi/processor.h>
 
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
 {
 #ifdef CONFIG_ACPI_NUMA
int nid;
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index dcb5285..7fbf2cb 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1977,7 +1977,38 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
-int generic_processor_info(int apicid, int version)
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu hot removed.
+ * */
+static int apicid_to_cpuid[] = {
+   [0 ... NR_CPUS - 1] = -1,
+};
+
+/*
+ * Internal cpu id bits, set the bit once cpu present, and never clear it.
+ * */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+   int free_id, i;
+
+   free_id = cpumask_next_zero(-1, &cpuid_mask);
+   if (free_id >= nr_cpu_ids)
+   return -1;
+
+   for (i = 0; i < free_id; i++)
+   if (apicid_to_cpuid[i] == apicid)
+   return i;
+
+   apicid_to_cpuid[free_id] = apicid;
+   cpumask_set_cpu(free_id, &cpuid_mask);
+
+   return free_id;
+


[PATCH] md: fix md io stats accounting broken

2015-04-02 Thread Gu Zheng
Simon reported the md io stats accounting issue:
"
I'm seeing "iostat -x -k 1" print this after a RAID1 rebuild on 4.0-rc5.
It's not abnormal other than it's 3-disk, with one being SSD (sdc) and
the other two being write-mostly:

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
sdb   0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
sdc   0.00 0.000.000.00 0.00 0.00 0.00 
0.000.000.000.00   0.00   0.00
md0   0.00 0.000.000.00 0.00 0.00 0.00   
345.000.000.000.00   0.00 100.00
md2   0.00 0.000.000.00 0.00 0.00 0.00 
58779.000.000.000.00   0.00 100.00
md1   0.00 0.000.000.00 0.00 0.00 0.00
12.000.000.000.00   0.00 100.00
"
The cause is that commit 18c0b223cf9901727ef3b02da6711ac930b4e5d4 uses
generic_start_io_acct() to account the disk stats rather than the open code,
but it also introduced an increase of .in_flight[rw], which md does not need.
So we re-use the open code here to fix it.

Reported-by: Simon Kirby 
Cc:  3.19
Signed-off-by: Gu Zheng 
---
 drivers/md/md.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 717daad..e617878 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -249,6 +249,7 @@ static void md_make_request(struct request_queue *q, struct 
bio *bio)
const int rw = bio_data_dir(bio);
struct mddev *mddev = q->queuedata;
unsigned int sectors;
+   int cpu;
 
if (mddev == NULL || mddev->pers == NULL
|| !mddev->ready) {
@@ -284,7 +285,10 @@ static void md_make_request(struct request_queue *q, 
struct bio *bio)
sectors = bio_sectors(bio);
mddev->pers->make_request(mddev, bio);
 
-   generic_start_io_acct(rw, sectors, &mddev->gendisk->part0);
+   cpu = part_stat_lock();
+   part_stat_inc(cpu, &mddev->gendisk->part0, ios[rw]);
+   part_stat_add(cpu, &mddev->gendisk->part0, sectors[rw], sectors);
+   part_stat_unlock();
 
	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
		wake_up(&mddev->sb_wait);
-- 
1.7.7

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-04-01 Thread Gu Zheng
Hi Kame, TJ,

On 04/01/2015 04:30 PM, Kamezawa Hiroyuki wrote:

> On 2015/04/01 12:02, Tejun Heo wrote:
>> On Wed, Apr 01, 2015 at 11:55:11AM +0900, Kamezawa Hiroyuki wrote:
>>> Now, hot-added cpus will have the lowest free cpu id.
>>>
>>> Because of this, in most of systems which has only cpu-hot-add, cpu-ids are 
>>> always
>>> contiguous even after cpu hot add.
>>> In enterprise, this would be considered as imcompatibility.
>>>
>>> determining cpuid <-> lapicid at boot will make cpuids sparse. That may 
>>> corrupt
>>> exisiting script or configuration/resource management software.
>>
>> Ugh... so, cpu number allocation on hot-add is part of userland
>> interface that we're locked into?
> 
> We checked most of RHEL7 packages and didn't find a problem yet.
> But, for examle, we know some performance test team's test program assumed 
> contiguous
> cpuids and it failed. It was an easy case because we can ask them to fix the 
> application
> but I guess there will be some amount of customers that cpuids are contiguous.
> 
>> Tying hotplug and id allocation
>> order together usually isn't a good idea.  What if the cpu up fails
>> while running the notifiers?  The ID is already allocated and the next
>> cpu being brought up will be after a hole anyway.  Is this even
>> actually gonna affect userland?
>>
> 
> Maybe. It's not fail-safe but
> 
> In general, all kernel engineers (and skilled userland engineers) knows that
> cpuids cannot be always contiguous and cpuids/nodeids should be checked before
> running programs. I think most of engineers should be aware of that but many
> users have their own assumption :(
> 
> Basically, I don't have strong objections, you're right technically.
> 
> In summary...
>  - users should not assume cpuids are contiguous.
>  - all possible ids should be fixed at boot time.
>  - For uses, some clarification document should be somewhere in 
> Documenatation.

Fine to me.
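
To illustrate the first summary point above, a hypothetical userland check
(not part of this thread): enumerate CPUs from the authoritative sysfs mask
instead of assuming ids 0..N-1 are all present and contiguous.

#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/online", "r");

	if (!f)
		return 1;
	/* Prints a range list such as "0-3,8-11"; holes are normal. */
	if (fgets(buf, sizeof(buf), f))
		printf("online cpus: %s", buf);
	fclose(f);
	return 0;
}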

> 
> So, Gu-san
>  1) determine all possible ids at boot.
>  2) clarify cpuid/nodeid can have hole because of 1) in Documenation.
>  3) It would be good if other guys give us ack.

Also fine.
But before this goes forward, could you please reconsider determining the ids
when they first become present (the implementation in this patchset)?
Though it is not perfect in some ways, we can ignore the doubts mentioned
above since cpu/node hotplug is not a frequent behaviour, and there seems to
be nothing harmful to us if we go this way.

Regards,
Gu

> 
> In future,
> I myself thinks naming system like udev for cpuid/numaid is necessary, at 
> last.
> Can that renaming feature can be cgroup/namespace feature ? If possible,
> all container can have cpuids starting from 0.
> 
> Thanks,
> -Kame
> 
> .
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-03-30 Thread Gu Zheng
Hi Kame-san,

On 03/27/2015 12:31 AM, Kamezawa Hiroyuki wrote:

> On 2015/03/26 13:55, Gu Zheng wrote:
>> Hi Kame-san,
>> On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote:
>>
>>> On 2015/03/26 11:17, Gu Zheng wrote:
>>>> Previously, we build the apicid <--> cpuid mapping when the cpu is 
>>>> present, but
>>>> the relationship will be changed if the cpu/node hotplug happenned, 
>>>> because we
>>>> always choose the first free cpuid for the hot added cpu (whether it is 
>>>> new-add
>>>> or re-add), so this the cpuid <--> node mapping changed if node hot plug
>>>> occurred, and it causes the wq sub-system allocation failture:
>>>> ==
>>>>SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>>> cache: kmalloc-192, object size: 192, buffer size: 192, default
>>>> order:
>>>>   1, min order: 0
>>>> node 0: slabs: 6172, objs: 259224, free: 245741
>>>> node 1: slabs: 3261, objs: 136962, free: 127656
>>>> ==
>>>> So here we build the persistent [lapic id] <--> cpuid mapping when the cpu 
>>>> first
>>>> present, and never change it.
>>>>
>>>> Suggested-by: KAMEZAWA Hiroyuki 
>>>> Signed-off-by: Gu Zheng 
>>>> ---
>>>>arch/x86/kernel/apic/apic.c |   31 ++-
>>>>1 files changed, 30 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>>>> index ad3639a..d539ebc 100644
>>>> --- a/arch/x86/kernel/apic/apic.c
>>>> +++ b/arch/x86/kernel/apic/apic.c
>>>> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>>>>apic_write(APIC_LVT1, value);
>>>>}
>>>>
>>>> +/*
>>>> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
>>>> + * Do not clear the mapping even if cpu hot removed.
>>>> + * */
>>>> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
>>>> +  [0 ... MAX_LOCAL_APIC - 1] = -1,
>>>> +};
>>>
>>>
>>> This patch cannot handle x2apic, which is 32bit.
>>
>> IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip
>> generating a logic cpu number for it, so it seems no problem here.
>>
> you mean MAX_LOCAL_APIC=32768 ? isn't it too wasting ?

I use the big array here to keep the same format as the existing ones:
int apic_version[MAX_LOCAL_APIC];

s16 __apicid_to_node[MAX_LOCAL_APIC] = {
[0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
};
Or we should also say "NO" to them?
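
For reference, a minimal sketch of the sparse structure Kame suggests below
(idr-based; the function names and the assumption that updates happen after mm
init in sleepable context are mine, not from any posted patch):

#include <linux/idr.h>

static DEFINE_IDR(apicid_cpuid_idr);	/* hypothetical: apicid -> cpuid + 1 */

static int sparse_get_cpuid(int apicid)
{
	void *entry = idr_find(&apicid_cpuid_idr, apicid);

	/* Entries are stored as cpuid + 1 so that 0 can mean "not mapped" */
	return entry ? (int)(long)entry - 1 : -1;
}

static int sparse_set_cpuid(int apicid, int cpuid)
{
	/* Reserve exactly slot 'apicid'; fails if it is already mapped */
	return idr_alloc(&apicid_cpuid_idr, (void *)(long)(cpuid + 1),
			 apicid, apicid + 1, GFP_KERNEL);
}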

Regards,
Gu

> 
> Anyway, APIC IDs are sparse values. Please use proper structure.
> 
> Thanks,
> -Kame
> 
>>>
>>> As far as I understand, it depends on CPU's spec and the newest cpu has 
>>> 9bit apicid, at least.
>>>
>>> But you can't create inifinit array.
>>>
>>> If you can't allocate the array dynamically, How about adding
>>>
>>>   static int cpuid_to_apicid[MAX_CPU] = {}
>>>
>>> or using idr library ? (please see lib/idr.c)
>>>
>>> I guess you can update this map after boot(after mm initialization)
>>> and make use of idr library.
>>>
>>> About this patch, Nack.
>>>
>>> -Kame
>>>
>>>
>>>
>>>> +
>>>> +/*
>>>> + * Internal cpu id bits, set the bit once cpu present, and never clear it.
>>>> + * */
>>>> +static cpumask_t cpuid_mask = CPU_MASK_NONE;
>>>> +
>>>> +static int get_cpuid(int apicid)
>>>> +{
>>>> +  int cpuid;
>>>> +
>>>> +  cpuid = apicid_to_x86_cpu[apicid];
>>>> +  if (cpuid == -1)
>>>> +  cpuid = cpumask_next_zero(-1, &cpuid_mask);
>>>> +
>>>> +  return cpuid;
>>>> +}
>>>> +
>>>>int generic_processor_info(int apicid, int version)
>>>>{
>>>>int cpu, max = nr_cpu_ids;
>>>> @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version)
>>>> */
>>>>cpu = 0;
>>>>} else
>>>> -  cpu = cpumask_next_zero(-1, cpu_present_mask);
>>>> +  cpu = get_cpuid(apicid);
>>>> +
>>>> +  /* Store the mapping */
>>>> +  apicid_to_x86_cpu[apicid] = cpu;
>>>>
>>>>/*
>>>> * Validate version
>>>> @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version)
>>>>early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
>>>>apic->x86_32_early_logical_apicid(cpu);
>>>>#endif
>>>> +  /* Mark this cpu id as used (already mapping a local apic id) */
>>>> +  cpumask_set_cpu(cpu, &cpuid_mask);
>>>>set_cpu_possible(cpu, true);
>>>>set_cpu_present(cpu, true);
>>>>
>>>>
>>>
>>>
>>> .
>>>
>>
>>
> 
> 
> .
> 
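
Below is a minimal, self-contained userspace C sketch of the sparse mapping
Kame asks for in the exchange above, assuming only that the goal is a
persistent apicid -> cpuid binding that does not need a MAX_LOCAL_APIC-sized
array. The linked list is a stand-in for the kernel's idr; none of the names
come from the kernel code quoted above.

/*
 * Hypothetical userspace model: keep apicid -> cpuid pairs in a small
 * dynamic structure instead of a fixed MAX_LOCAL_APIC-sized array.
 */
#include <stdio.h>
#include <stdlib.h>

struct apic_map {
	int apicid;
	int cpuid;
	struct apic_map *next;
};

static struct apic_map *map_head;
static int next_free_cpuid;

/* Return the cpuid already bound to this apicid, or bind a new one. */
static int get_cpuid(int apicid)
{
	struct apic_map *m;

	for (m = map_head; m; m = m->next)
		if (m->apicid == apicid)
			return m->cpuid;	/* re-add: old cpuid comes back */

	m = malloc(sizeof(*m));
	if (!m)
		return -1;
	m->apicid = apicid;
	m->cpuid = next_free_cpuid++;		/* first time seen: new cpuid */
	m->next = map_head;
	map_head = m;
	return m->cpuid;
}

int main(void)
{
	printf("apicid 0x100010 -> cpu%d\n", get_cpuid(0x100010)); /* a large x2apic-style id fits */
	printf("apicid 0x30     -> cpu%d\n", get_cpuid(0x30));
	printf("apicid 0x100010 -> cpu%d\n", get_cpuid(0x100010)); /* persistent on re-add */
	return 0;
}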




Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-03-30 Thread Gu Zheng
Hi Kame-san,

On 03/27/2015 12:42 AM, Kamezawa Hiroyuki wrote:

> On 2015/03/27 0:18, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote:
>>> wq generates the numa affinity (pool->node) for all the possible cpu's
>>> per cpu workqueue at init stage, that means the affinity of currently 
>>> un-present
>>> ones' may be incorrect, so we need to update the pool->node for the new 
>>> added cpu
>>> to the correct node when preparing online, otherwise it will try to create 
>>> worker
>>> on invalid node if node hotplug occurred.
>>
>> If the mapping is gonna be static once the cpus show up, any chance we
>> can initialize that for all possible cpus during boot?
>>
> 
> I think the kernel can define all possible
> 
>  cpuid <-> lapicid <-> pxm <-> nodeid
> 
> mapping at boot with using firmware table information.

Could you explain more?

> 
> One concern is current x86 logic for memory-less node v.s. memory hotplug.
> (as I explained before)
> 
> My idea is
>   step1. build all possible mapping at boot cpuid <-> apicid <-> pxm <-> node 
> id at boot.
> 
> But this may be overwritten by x86's memory less node logic. So,
>   step2. check node is online or not before calling kmalloc. If offline, use 
> -1.
>  rather than updating workqueue's attribute.
> 
> Thanks,
> -Kame
> 
> .
> 
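
A rough userspace C sketch of the "step2" fallback described in the quote
above: if the node recorded for an allocation is offline, fall back to -1
(NUMA_NO_NODE) rather than rewriting the workqueue attributes. The names
node_online_map, node_online() and pool_alloc_node() are illustrative
stand-ins, not the kernel's implementation.

#include <stdio.h>
#include <stdbool.h>

#define MAX_NODES    4
#define NUMA_NO_NODE (-1)

static bool node_online_map[MAX_NODES] = { true, true, false, false };

static bool node_online(int nid)
{
	return nid >= 0 && nid < MAX_NODES && node_online_map[nid];
}

/* Pick the node id to hand to the allocator for a worker pool. */
static int pool_alloc_node(int pool_node)
{
	if (node_online(pool_node))
		return pool_node;
	return NUMA_NO_NODE;	/* stale or offline node: let the allocator pick */
}

int main(void)
{
	printf("pool on node 1 -> allocate on node %d\n", pool_alloc_node(1));
	printf("pool on node 2 -> allocate on node %d\n", pool_alloc_node(2));
	return 0;
}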




Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-03-30 Thread Gu Zheng
Hi Kame-san,

On 03/27/2015 12:42 AM, Kamezawa Hiroyuki wrote:

> On 2015/03/27 0:18, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, Mar 26, 2015 at 01:04:00PM +0800, Gu Zheng wrote:
>>> wq generates the numa affinity (pool->node) for all the possible cpu's
>>> per cpu workqueue at init stage, that means the affinity of currently 
>>> un-present
>>> ones' may be incorrect, so we need to update the pool->node for the new 
>>> added cpu
>>> to the correct node when preparing online, otherwise it will try to create 
>>> worker
>>> on invalid node if node hotplug occurred.
>>
>> If the mapping is gonna be static once the cpus show up, any chance we
>> can initialize that for all possible cpus during boot?
>>
> 
> I think the kernel can define all possible
> 
>  cpuid <-> lapicid <-> pxm <-> nodeid
> 
> mapping at boot with using firmware table information.

Could you explain more?

Regards,
Gu

> 
> One concern is current x86 logic for memory-less node v.s. memory hotplug.
> (as I explained before)
> 
> My idea is
>   step1. build all possible mapping at boot cpuid <-> apicid <-> pxm <-> node 
> id at boot.
> 
> But this may be overwritten by x86's memory less node logic. So,
>   step2. check node is online or not before calling kmalloc. If offline, use 
> -1.
>  rather than updating workqueue's attribute.
> 
> Thanks,
> -Kame
> 
> .
> 


Re: [PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-03-26 Thread Gu Zheng
Hi Kame-san,

On 03/26/2015 11:12 AM, Kamezawa Hiroyuki wrote:

> On 2015/03/26 11:17, Gu Zheng wrote:
>> Yasuaki Ishimatsu found that with node online/offline, cpu<->node
>> relationship is established. Because workqueue uses a info which was
>> established at boot time, but it may be changed by node hotpluging.
>>
>> Once pool->node points to a stale node, following allocation failure
>> happens.
>>==
>>   SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>cache: kmalloc-192, object size: 192, buffer size: 192, default
>> order:
>>  1, min order: 0
>>node 0: slabs: 6172, objs: 259224, free: 245741
>>node 1: slabs: 3261, objs: 136962, free: 127656
>>==
>>
>> As the apicid <--> node relationship is persistent, so the root cause is the
>  ^^^
>pxm.
> 
>> cpu-id <-> lapicid mapping is not persistent (because the currently 
>> implementation
>> always choose the first free cpu id for the new added cpu), so if we can 
>> build
>> persistent cpu-id <-> lapicid relationship, this problem will be fixed.
>>
>> Please refer to https://lkml.org/lkml/2015/2/27/145 for the previous 
>> discussion.
>>
>> Gu Zheng (2):
>>x86/cpu hotplug: make lapicid <-> cpuid mapping persistent
>>workqueue: update per cpu workqueue's numa affinity when cpu
>>  preparing online
> 
> why patch(2/2) required ?

wq generates the numa affinity (pool->node) for all possible cpus' per-cpu
workqueues at the init stage, which means the affinity of the currently
un-present ones may be incorrect, so we need to update pool->node for the
newly added cpu to the correct node when it is preparing to go online;
otherwise it will try to create a worker on an invalid node if node hotplug
occurred.

Regards,
Gu

> 
> Thanks,
> -Kame
> 
>>
>>   arch/x86/kernel/apic/apic.c |   31 ++-
>>   kernel/workqueue.c  |1 +
>>   2 files changed, 31 insertions(+), 1 deletions(-)
>>
> 
> 
> .
> 


Re: [PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-03-25 Thread Gu Zheng
Hi Kame-san,
On 03/26/2015 11:19 AM, Kamezawa Hiroyuki wrote:

> On 2015/03/26 11:17, Gu Zheng wrote:
>> Previously, we built the apicid <--> cpuid mapping when the cpu became
>> present, but the relationship will change if cpu/node hotplug happens,
>> because we always choose the first free cpuid for the hot-added cpu
>> (whether it is new-add or re-add), so the cpuid <--> node mapping changes
>> if node hotplug occurs, and it causes the wq sub-system allocation failure:
>>==
>>   SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>>cache: kmalloc-192, object size: 192, buffer size: 192, default
>> order:
>>  1, min order: 0
>>node 0: slabs: 6172, objs: 259224, free: 245741
>>node 1: slabs: 3261, objs: 136962, free: 127656
>>==
>> So here we build the persistent [lapic id] <--> cpuid mapping when the cpu
>> first becomes present, and never change it.
>>
>> Suggested-by: KAMEZAWA Hiroyuki 
>> Signed-off-by: Gu Zheng 
>> ---
>>   arch/x86/kernel/apic/apic.c |   31 ++-
>>   1 files changed, 30 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
>> index ad3639a..d539ebc 100644
>> --- a/arch/x86/kernel/apic/apic.c
>> +++ b/arch/x86/kernel/apic/apic.c
>> @@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
>>  apic_write(APIC_LVT1, value);
>>   }
>>   
>> +/*
>> + * Logic cpu number(cpuid) to local APIC id persistent mappings.
>> + * Do not clear the mapping even if cpu hot removed.
>> + * */
>> +static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
>> +[0 ... MAX_LOCAL_APIC - 1] = -1,
>> +};
> 
> 
> This patch cannot handle x2apic, which is 32bit.

IMO, if the apicid is too big (larger than MAX_LOCAL_APIC), we will skip
generating a logical cpu number for it, so it seems there is no problem here.

> 
> As far as I understand, it depends on CPU's spec and the newest cpu has 9bit 
> apicid, at least.
> 
> But you can't create inifinit array.
> 
> If you can't allocate the array dynamically, How about adding
> 
>  static int cpuid_to_apicid[MAX_CPU] = {}
> 
> or using idr library ? (please see lib/idr.c)
> 
> I guess you can update this map after boot(after mm initialization)
> and make use of idr library.
> 
> About this patch, Nack.
> 
> -Kame
> 
> 
> 
>> +
>> +/*
>> + * Internal cpu id bits, set the bit once cpu present, and never clear it.
>> + * */
>> +static cpumask_t cpuid_mask = CPU_MASK_NONE;
>> +
>> +static int get_cpuid(int apicid)
>> +{
>> +int cpuid;
>> +
>> +cpuid = apicid_to_x86_cpu[apicid];
>> +if (cpuid == -1)
>> +cpuid = cpumask_next_zero(-1, &cpuid_mask);
>> +
>> +return cpuid;
>> +}
>> +
>>   int generic_processor_info(int apicid, int version)
>>   {
>>  int cpu, max = nr_cpu_ids;
>> @@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version)
>>   */
>>  cpu = 0;
>>  } else
>> -cpu = cpumask_next_zero(-1, cpu_present_mask);
>> +cpu = get_cpuid(apicid);
>> +
>> +/* Store the mapping */
>> +apicid_to_x86_cpu[apicid] = cpu;
>>   
>>  /*
>>   * Validate version
>> @@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version)
>>  early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
>>  apic->x86_32_early_logical_apicid(cpu);
>>   #endif
>> +/* Mark this cpu id as used (already mapping a local apic id) */
>> +cpumask_set_cpu(cpu, &cpuid_mask);
>>  set_cpu_possible(cpu, true);
>>  set_cpu_present(cpu, true);
>>   
>>
> 
> 
> .
> 
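
A small userspace C illustration of the guard Gu's argument above relies on:
an apicid that does not fit the fixed-size table (for example a large 32-bit
x2apic id) is skipped instead of indexing out of bounds. Whether and where the
real generic_processor_info() path performs such a check is exactly what is
being debated here, so treat this purely as a model; the constant and names
are illustrative.

#include <stdio.h>

#define MAX_LOCAL_APIC 256	/* illustrative size only, not the kernel's */

static int apicid_to_cpu[MAX_LOCAL_APIC];

static int map_apicid(int apicid)
{
	if (apicid < 0 || apicid >= MAX_LOCAL_APIC) {
		printf("apicid %#x out of range, no logical cpu id generated\n",
		       apicid);
		return -1;
	}
	printf("apicid %#x indexes the table safely\n", apicid);
	return apicid_to_cpu[apicid];
}

int main(void)
{
	map_apicid(0x1f);	/* fits in the table */
	map_apicid(0x10000);	/* 32-bit x2apic-style id: skipped */
	return 0;
}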




[PATCH 2/2] workqueue: update per cpu workqueue's numa affinity when cpu preparing online

2015-03-25 Thread Gu Zheng
Update the per-cpu workqueue's numa affinity when the cpu is preparing to go
online, so that the worker is created on the correct node.

Signed-off-by: Gu Zheng 
---
 kernel/workqueue.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 41ff75b..4c65953 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4618,6 +4618,7 @@ static int workqueue_cpu_up_callback(struct 
notifier_block *nfb,
switch (action & ~CPU_TASKS_FROZEN) {
case CPU_UP_PREPARE:
for_each_cpu_worker_pool(pool, cpu) {
+   pool->node = cpu_to_node(cpu);
if (pool->nr_workers)
continue;
if (!create_worker(pool))
-- 
1.7.7
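
A toy userspace C model of what the one-line change above accomplishes: the
per-cpu pool caches a node id computed at init time, and the up-prepare path
refreshes it from the (possibly changed) cpu-to-node table before any worker
is created. Everything here (table, struct, callback name) is illustrative,
not the kernel's workqueue code.

#include <stdio.h>

#define NR_CPUS 4

/* cpu -> node table as firmware/hotplug currently reports it */
static int cpu_to_node_tbl[NR_CPUS] = { 0, 0, 1, 1 };

struct worker_pool {
	int cpu;
	int node;	/* cached at init, may go stale after node hotplug */
};

static struct worker_pool pools[NR_CPUS];

static void init_pools(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		pools[cpu].cpu = cpu;
		pools[cpu].node = cpu_to_node_tbl[cpu];
	}
}

/* CPU_UP_PREPARE-style callback: refresh the cached node before use. */
static void cpu_up_prepare(int cpu)
{
	pools[cpu].node = cpu_to_node_tbl[cpu];
	printf("cpu%d: create worker on node %d\n", cpu, pools[cpu].node);
}

int main(void)
{
	init_pools();
	cpu_to_node_tbl[3] = 2;	/* node hotplug moves cpu3 after init */
	cpu_up_prepare(3);	/* without the refresh it would still use node 1 */
	return 0;
}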



[PATCH 0/2] workqueue: fix a bug when numa mapping is changed

2015-03-25 Thread Gu Zheng
Yasuaki Ishimatsu found that with node online/offline, the cpu<->node
relationship is re-established. Workqueue uses info which was established
at boot time, but it may be changed by node hotplugging.

Once pool->node points to a stale node, the following allocation failure
happens.
  ==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
  ==

As the apicid <--> node relationship is persistent, the root cause is that the
cpu-id <-> lapicid mapping is not persistent (because the current
implementation always chooses the first free cpu id for the newly added cpu),
so if we can build a persistent cpu-id <-> lapicid relationship, this problem
will be fixed.

Please refer to https://lkml.org/lkml/2015/2/27/145 for the previous discussion.

Gu Zheng (2):
  x86/cpu hotplug: make lapicid <-> cpuid mapping persistent
  workqueue: update per cpu workqueue's numa affinity when cpu
preparing online

 arch/x86/kernel/apic/apic.c |   31 ++-
 kernel/workqueue.c  |1 +
 2 files changed, 31 insertions(+), 1 deletions(-)

-- 
1.7.7



[PATCH 1/2] x86/cpu hotplug: make apicid <--> cpuid mapping persistent

2015-03-25 Thread Gu Zheng
Previously, we built the apicid <--> cpuid mapping when the cpu became present,
but the relationship will change if cpu/node hotplug happens, because we
always choose the first free cpuid for the hot-added cpu (whether it is new-add
or re-add), so the cpuid <--> node mapping changes if node hotplug occurs,
and it causes the wq sub-system allocation failure:
  ==
 SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
  cache: kmalloc-192, object size: 192, buffer size: 192, default
order:
1, min order: 0
  node 0: slabs: 6172, objs: 259224, free: 245741
  node 1: slabs: 3261, objs: 136962, free: 127656
  ==
So here we build the persistent [lapic id] <--> cpuid mapping when the cpu first
becomes present, and never change it.

Suggested-by: KAMEZAWA Hiroyuki 
Signed-off-by: Gu Zheng 
---
 arch/x86/kernel/apic/apic.c |   31 ++-
 1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index ad3639a..d539ebc 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2038,6 +2038,30 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
 }
 
+/*
+ * Logic cpu number(cpuid) to local APIC id persistent mappings.
+ * Do not clear the mapping even if cpu hot removed.
+ * */
+static int apicid_to_x86_cpu[MAX_LOCAL_APIC] = {
+   [0 ... MAX_LOCAL_APIC - 1] = -1,
+};
+
+/*
+ * Internal cpu id bits, set the bit once cpu present, and never clear it.
+ * */
+static cpumask_t cpuid_mask = CPU_MASK_NONE;
+
+static int get_cpuid(int apicid)
+{
+   int cpuid;
+
+   cpuid = apicid_to_x86_cpu[apicid];
+   if (cpuid == -1)
+   cpuid = cpumask_next_zero(-1, &cpuid_mask);
+
+   return cpuid;
+}
+
 int generic_processor_info(int apicid, int version)
 {
int cpu, max = nr_cpu_ids;
@@ -2115,7 +2139,10 @@ int generic_processor_info(int apicid, int version)
 */
cpu = 0;
} else
-   cpu = cpumask_next_zero(-1, cpu_present_mask);
+   cpu = get_cpuid(apicid);
+
+   /* Store the mapping */
+   apicid_to_x86_cpu[apicid] = cpu;
 
/*
 * Validate version
@@ -2144,6 +2171,8 @@ int generic_processor_info(int apicid, int version)
early_per_cpu(x86_cpu_to_logical_apicid, cpu) =
apic->x86_32_early_logical_apicid(cpu);
 #endif
+   /* Mark this cpu id as used (already mapping a local apic id) */
+   cpumask_set_cpu(cpu, &cpuid_mask);
set_cpu_possible(cpu, true);
set_cpu_present(cpu, true);
 
-- 
1.7.7
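
A self-contained userspace C model of the scheme in the patch above: a
persistent apicid -> cpuid array plus a cpuid bitmap that is set once a cpu
has ever been present and never cleared, so a removed-and-re-added apicid gets
its old cpuid back while a brand-new apicid gets the next free id. Sizes and
names are illustrative, not the kernel's.

#include <stdio.h>

#define MAX_LOCAL_APIC 256	/* illustrative, not the kernel's value */
#define NR_CPUS        8

static int apicid_to_cpu[MAX_LOCAL_APIC];
static unsigned long cpuid_mask;	/* bit i set once cpu i was ever present */

static int next_zero_bit(unsigned long mask)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		if (!(mask & (1UL << i)))
			return i;
	return -1;
}

static int get_cpuid(int apicid)
{
	int cpuid = apicid_to_cpu[apicid];

	if (cpuid == -1)
		cpuid = next_zero_bit(cpuid_mask);	/* first time seen */
	return cpuid;
}

static void processor_info(int apicid)
{
	int cpu = get_cpuid(apicid);

	apicid_to_cpu[apicid] = cpu;	/* store the persistent mapping */
	cpuid_mask |= 1UL << cpu;	/* never cleared on hot-remove */
	printf("apicid %#x -> cpu%d\n", apicid, cpu);
}

int main(void)
{
	int i;

	for (i = 0; i < MAX_LOCAL_APIC; i++)
		apicid_to_cpu[i] = -1;

	processor_info(0x20);	/* cpu0 */
	processor_info(0x24);	/* cpu1 */
	/* hot-remove apicid 0x20, add a new apicid 0x28, then re-add 0x20 */
	processor_info(0x28);	/* cpu2: next id that was never used */
	processor_info(0x20);	/* cpu0 again: the mapping is persistent */
	return 0;
}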

[PATCH] mm/memory hotplug: postpone the reset of obsolete pgdat

2015-03-11 Thread Gu Zheng
Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition.
[ 1422.011064] BUG: unable to handle kernel paging request at 00025f60
[ 1422.011086] IP: [] next_online_pgdat+0x1/0x50
[ 1422.011178] PGD 0
[ 1422.011180] Oops:  [#1] SMP
[ 1422.011409] ACPI: Device does not support D3cold
[ 1422.011961] Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop 
dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper 
cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca 
i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support 
tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd 
mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc 
scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
[ 1422.012006] CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G   O 
3.10.15-5885-euler0302 #1
[ 1422.012024] Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, 
BIOS V100R001 03/02/2015
[ 1422.012065] Workqueue: events vmstat_update
[ 1422.012084] task: a800d32c ti: a800d32ae000 task.ti: 
a800d32ae000
[ 1422.012165] RIP: 0010:[]  [] 
next_online_pgdat+0x1/0x50
[ 1422.012205] RSP: 0018:a800d32afce8  EFLAGS: 00010286
[ 1422.012225] RAX: 1440 RBX: 81da53b8 RCX: 0082
[ 1422.012226] RDX:  RSI: 0082 RDI: 
[ 1422.012254] RBP: a800d32afd28 R08: 81c93bfc R09: 81cbdc96
[ 1422.012272] R10: 40ec R11: 00a0 R12: a800fffb3440
[ 1422.012290] R13: a800d32afd38 R14: 0017 R15: a800e6616800
[ 1422.012292] FS:  () GS:a800e660() 
knlGS:
[ 1422.012314] CS:  0010 DS:  ES:  CR0: 80050033
[ 1422.012328] CR2: 00025f60 CR3: 01a0b000 CR4: 001407e0
[ 1422.012328] DR0:  DR1:  DR2: 
[ 1422.012328] DR3:  DR6: fffe0ff0 DR7: 0400
[ 1422.012328] Stack:
[ 1422.012328]  a800d32afd28 81126ca5 a800 
814b4314
[ 1422.012328]  a800d32ae010  a800e6616180 
a800fffb3440
[ 1422.012328]  a800d32afde8 81128220 0013 
0038
[ 1422.012328] Call Trace:
[ 1422.012328]  [] ? next_zone+0xc5/0x150
[ 1422.012328]  [] ? __schedule+0x544/0x780
[ 1422.012328]  [] refresh_cpu_vm_stats+0xd0/0x140
[ 1422.012328]  [] vmstat_update+0x11/0x50
[ 1422.012328]  [] process_one_work+0x194/0x3d0
[ 1422.012328]  [] worker_thread+0x12b/0x410
[ 1422.012328]  [] ? manage_workers+0x1a0/0x1a0
[ 1422.012328]  [] kthread+0xc6/0xd0
[ 1422.012328]  [] ? kthread_freezable_should_stop+0x70/0x70
[ 1422.012328]  [] ret_from_fork+0x7c/0xb0
[ 1422.012328]  [] ? kthread_freezable_should_stop+0x70/0x70

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node(), which resets all the content of pgdat to 0. As the pgdat
is accessed lock-lessly, users still using the pgdat, such as the
vmstat_update routine, will panic.

So the solution here is postponing the reset of the obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting pgdat->nr_zones
and pgdat->classzone_idx to 0 rather than memset()ing the whole structure to
0, to avoid breaking pointer information in pgdat.

Reported-by: Xishi Qiu 
Suggested-by: KAMEZAWA Hiroyuki 
Cc: 
Signed-off-by: Gu Zheng 
---
 mm/memory_hotplug.c |   13 -
 1 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9fab107..65842d6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1092,6 +1092,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
start)
return NULL;
 
arch_refresh_nodedata(nid, pgdat);
+   } else {
+   /* Reset the nr_zones and classzone_idx to 0 before reuse */
+   pgdat->nr_zones = 0;
+   pgdat->classzone_idx = 0;
}
 
/* we can use NODE_DATA(nid) from here */
@@ -1977,15 +1981,6 @@ void try_offline_node(int nid)
if (is_vmalloc_addr(zone->wait_table))
vfree(zone->wait_table);
}
-
-   /*
-* Since there is no way to guarentee the address of pgdat/zone is not
-* on stack of any kernel threads or used by other kernel objects
-* without reference counting or other symchronizing method, do not
-* reset node_data and free pgdat here. Just reset it to 0 and reuse
-* the memory when the node is online again.
-*/
-   memset(pgdat, 0, sizeof(*pgdat));
 }
 EXPORT_SYMBOL(try_offline_node);
 
-- 
1.7.7
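
A userspace C illustration of the difference the patch above makes, using a
toy stand-in for pg_data_t (the real struct layout is not reproduced): a full
memset also wipes pointers that lock-less readers may still follow, while the
selective reset only clears what must be rebuilt on reuse. All names here are
illustrative.

#include <stdio.h>
#include <string.h>

struct toy_pgdat {
	int *zonelist;		/* readers such as vmstat may still use this */
	int nr_zones;
	int classzone_idx;
};

static int zonelist_storage[4] = { 1, 2, 3, 4 };
static struct toy_pgdat node_data = { zonelist_storage, 3, 2 };

/* Patched behaviour: reset only the fields that must be rebuilt on reuse. */
static void reuse_pgdat(struct toy_pgdat *pgdat)
{
	pgdat->nr_zones = 0;
	pgdat->classzone_idx = 0;
}

/* Old behaviour: wipes the pointers too, so a lock-less reader would oops. */
static void offline_node_memset(struct toy_pgdat *pgdat)
{
	memset(pgdat, 0, sizeof(*pgdat));
}

int main(void)
{
	reuse_pgdat(&node_data);
	printf("selective reset: zonelist=%p nr_zones=%d\n",
	       (void *)node_data.zonelist, node_data.nr_zones);

	offline_node_memset(&node_data);
	printf("full memset:     zonelist=%p (now NULL for any reader)\n",
	       (void *)node_data.zonelist);
	return 0;
}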

Re: node-hotplug: is memset 0 safe in try_offline_node()?

2015-03-10 Thread Gu Zheng
Hi Xishi,

What is the status of this problem now?

Regards,
Gu
On 03/05/2015 05:39 PM, Xishi Qiu wrote:

> On 2015/3/5 16:26, Gu Zheng wrote:
> 
>> Hi Xishi,
>> Could you please try the following one?
>> It postpones the reset of obsolete pgdat from try_offline_node() to
>> hotadd_new_pgdat(), and just resetting pgdat->nr_zones and
>> pgdat->classzone_idx to be 0 rather than the whole reset by memset()
>> as Kame suggested.
>>
>> Regards,
>> Gu
>>
>> ---
>>  mm/memory_hotplug.c |   13 -
>>  1 files changed, 4 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 1778628..c17eebf 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1092,6 +1092,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
>> start)
>>  return NULL;
>>  
>>  arch_refresh_nodedata(nid, pgdat);
>> +} else {
>> +/* Reset the nr_zones and classzone_idx to 0 before reuse */
>> +pgdat->nr_zones = 0;
>> +pgdat->classzone_idx = 0;
> 
> Hi Gu,
> 
> This is just to avoid the warning, I think it's no meaning.
> Here is the changlog from the original patch:
> 
> commit 88fdf75d1bb51d85ba00c466391770056d44bc03
> ...
> Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero
> when it is allocated.  Arch code and memory hotplug already initiailize
> pg_data_t.  So this warning should never happen.  I select fields 
> *randomly*
> near the beginning, middle and end of pg_data_t for checking.
> ...
> 
> Thanks,
> Xishi Qiu
> 
>>  }
>>  
>>  /* we can use NODE_DATA(nid) from here */
>> @@ -2021,15 +2025,6 @@ void try_offline_node(int nid)
>>  
>>  /* notify that the node is down */
>>  call_node_notify(NODE_DOWN, (void *)(long)nid);
>> -
>> -/*
>> - * Since there is no way to guarentee the address of pgdat/zone is not
>> - * on stack of any kernel threads or used by other kernel objects
>> - * without reference counting or other symchronizing method, do not
>> - * reset node_data and free pgdat here. Just reset it to 0 and reuse
>> - * the memory when the node is online again.
>> - */
>> -memset(pgdat, 0, sizeof(*pgdat));
>>  }
>>  EXPORT_SYMBOL(try_offline_node);
>>  
> 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org
> .
> 


  1   2   3   4   5   6   7   8   9   >