date:20180423

Re: [PATCH 9/9] Protect SELinux initialized state with pmalloc

2018-04-23 Thread kbuild test robot

Hi Igor,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mmotm/master]
[also build test ERROR on v4.17-rc2]
[cannot apply to next-20180423]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Igor-Stoppa/struct-page-add-field-for-vm_struct/20180424-065001
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: mips-jz4740 (attached as .config)
compiler: mipsel-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=mips 

All errors (new ones prefixed by >>):

   security/selinux/ss/services.o: In function `selinux_ss_init':
>> services.c:(.text+0x21f4): undefined reference to `pmalloc_create_custom_pool'
>> services.c:(.text+0x2218): undefined reference to `pmalloc'
>> services.c:(.text+0x223c): undefined reference to `pmalloc_protect_pool'
   security/selinux/ss/services.o: In function `security_load_policy':
>> services.c:(.text+0x4ab8): undefined reference to `pmalloc_rare_write'
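
These are link-time failures: services.c calls the pmalloc functions, but no object in this mips-jz4740 build defines them. The report does not state the cause. One plausible cause (an assumption, not confirmed by this report) is that the pmalloc implementation is only built under a config option that this .config leaves off. A minimal userspace sketch of the usual stub pattern for that situation follows; `CONFIG_PMALLOC_SKETCH`, `pmalloc_sketch()` and `init_state_sketch()` are hypothetical names for illustration, not the API of the patch series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical switch standing in for a Kconfig symbol. It is left
 * undefined here to model a config that does not build the allocator. */
/* #define CONFIG_PMALLOC_SKETCH 1 */

#ifdef CONFIG_PMALLOC_SKETCH
/* The real implementation would live in its own object file. */
void *pmalloc_sketch(size_t size);
#else
/* Stub: callers still compile and link, and must handle the failure. */
static inline void *pmalloc_sketch(size_t size)
{
	(void)size;
	return NULL;
}
#endif

/* Caller side: report -ENOMEM when no protected memory is available. */
int init_state_sketch(size_t size)
{
	assert(size > 0);
	return pmalloc_sketch(size) ? 0 : -ENOMEM;
}
```

With the switch undefined, `init_state_sketch()` fails cleanly at run time instead of the build failing at link time. Whether a stub or a Kconfig dependency is the right fix for the series is a design decision for the patch author.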

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: application/gzip

[PATCH] f2fs: give message and set need_fsck given broken node id

2018-04-23 Thread Jaegeuk Kim

syzbot hit the following crash on upstream commit
83beed7b2b26f232d782127792dd0cd4362fdc41 (Fri Apr 20 17:56:32 2018 +)
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=d154ec99402c6f628887

C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5414336294027264
syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5471683234234368
Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5436660795834368
Kernel config: https://syzkaller.appspot.com/x/.config?id=1808800213120130118
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+d154ec99402c6f628...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.
If you forward the report, please keep this part and the footer.

F2FS-fs (loop0): Magic Mismatch, valid(0xf2f52010) - read(0x0)
F2FS-fs (loop0): Can't find valid F2FS filesystem in 1th superblock
F2FS-fs (loop0): invalid crc value
[ cut here ]
kernel BUG at fs/f2fs/node.c:1185!
invalid opcode:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4549 Comm: syzkaller704305 Not tainted 4.17.0-rc1+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185
RSP: 0018:8801d960e820 EFLAGS: 00010293
RAX: 8801d88205c0 RBX: 0003 RCX: 82f6cc06
RDX:  RSI: 82f6d5e8 RDI: 0004
RBP: 8801d960ec30 R08: 8801d88205c0 R09: ed003b5e46c2
R10: 0003 R11: 0003 R12: 8801a86e00c0
R13: 0001 R14: 8801a86e0530 R15: 8801d9745240
FS:  0072c880() GS:8801daf0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f3d403209b8 CR3: 0001d8f3f000 CR4: 001406e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 get_node_page fs/f2fs/node.c:1237 [inline]
 truncate_xattr_node+0x152/0x2e0 fs/f2fs/node.c:1014
 remove_inode_page+0x200/0xaf0 fs/f2fs/node.c:1039
 f2fs_evict_inode+0xe86/0x1710 fs/f2fs/inode.c:547
 evict+0x4a6/0x960 fs/inode.c:557
 iput_final fs/inode.c:1519 [inline]
 iput+0x62d/0xa80 fs/inode.c:1545
 f2fs_fill_super+0x5f4e/0x7bf0 fs/f2fs/super.c:2849
 mount_bdev+0x30c/0x3e0 fs/super.c:1164
 f2fs_mount+0x34/0x40 fs/f2fs/super.c:3020
 mount_fs+0xae/0x328 fs/super.c:1267
 vfs_kern_mount.part.34+0xd4/0x4d0 fs/namespace.c:1037
 vfs_kern_mount fs/namespace.c:1027 [inline]
 do_new_mount fs/namespace.c:2518 [inline]
 do_mount+0x564/0x3070 fs/namespace.c:2848
 ksys_mount+0x12d/0x140 fs/namespace.c:3064
 __do_sys_mount fs/namespace.c:3078 [inline]
 __se_sys_mount fs/namespace.c:3075 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3075
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x443dea
RSP: 002b:7ffcc7882368 EFLAGS: 0297 ORIG_RAX: 00a5
RAX: ffda RBX: 2c00 RCX: 00443dea
RDX: 2000 RSI: 2100 RDI: 7ffcc7882370
RBP: 0003 R08: 20016a00 R09: 000a
R10:  R11: 0297 R12: 0004
R13: 00402ce0 R14:  R15: 
RIP: __get_node_page+0xb68/0x16e0 fs/f2fs/node.c:1185 RSP: 8801d960e820
---[ end trace 4edbeb71f002bb76 ]---

Reported-by: syzbot+d154ec99402c6f628...@syzkaller.appspotmail.com
Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/f2fs.h  | 13 +
 fs/f2fs/inode.c |  6 +-
 fs/f2fs/node.c  | 23 +--
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 8f3ad9662d13..d26aae5bf00d 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -1583,18 +1583,6 @@ static inline bool __exist_node_summaries(struct f2fs_sb_info *sbi)
is_set_ckpt_flags(sbi, CP_FASTBOOT_FLAG));
 }
 
-/*
- * Check whether the given nid is within node id range.
- */
-static inline int check_nid_range(struct f2fs_sb_info *sbi, nid_t nid)
-{
-   if (unlikely(nid < F2FS_ROOT_INO(sbi)))
-   return -EINVAL;
-   if (unlikely(nid >= NM_I(sbi)->max_nid))
-   return -EINVAL;
-   return 0;
-}
-
 /*
  * Check whether the inode has blocks or not
  */
@@ -2768,6 +2756,7 @@ f2fs_hash_t f2fs_dentry_hash(const struct qstr *name_info,
 struct dnode_of_data;
 struct node_info;
 
+int check_nid_range(struct f2fs_sb_info *sbi, nid_t nid);
 bool available_free_memory(struct f2fs_sb_info *sbi, int type);
 int need_dentry_mark(struct f2fs_sb_info *sbi, nid_t nid);
 bool is_checkpointed_node(struct f2fs_sb_info
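
The subject line says the out-of-line `check_nid_range()` should print a message and set need_fsck when it sees a broken node id; the node.c hunk that would show this is cut off above. A minimal userspace sketch of that behaviour follows, assuming a hypothetical `struct sbi_sketch` and a plain `need_fsck` flag in place of the real f2fs structures and helpers. The message text is also made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t nid_t;

/* Hypothetical stand-in for the parts of f2fs_sb_info used here. */
struct sbi_sketch {
	nid_t root_ino;  /* lowest valid node id */
	nid_t max_nid;   /* one past the highest valid node id */
	bool need_fsck;  /* set when on-disk corruption is suspected */
};

/* Return 0 if nid lies in [root_ino, max_nid). Otherwise log a message,
 * flag the filesystem for fsck and return -EINVAL. */
int check_nid_range_sketch(struct sbi_sketch *sbi, nid_t nid)
{
	assert(sbi != NULL);

	if (nid < sbi->root_ino || nid >= sbi->max_nid) {
		sbi->need_fsck = true;
		fprintf(stderr, "out-of-range nid=%u, run fsck to fix\n",
			(unsigned int)nid);
		return -EINVAL;
	}
	return 0;
}
```

The range test is the same as in the inline helper removed from f2fs.h; only the logging and the flag are added, so that a caller can return an error for a corrupted image.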

Re: [PATCH] Input: xen-kbdfront - allow better run-time configuration

2018-04-23 Thread Oleksandr Andrushchenko


On 04/23/2018 09:53 PM, Dmitry Torokhov wrote:
> On Thu, Apr 19, 2018 at 02:44:19PM +0300, Oleksandr Andrushchenko wrote:
>> On 04/19/2018 02:25 PM, Juergen Gross wrote:
>>> On 18/04/18 17:04, Oleksandr Andrushchenko wrote:
>>>> From: Oleksandr Andrushchenko
>>>>
>>>> It is now only possible to control if multi-touch virtual device
>>>> is created or not (via the corresponding XenStore entries),
>>>> but keyboard and pointer devices are always created.
>>>
>>> Why don't you want to go that route for keyboard and mouse, too?
>>> Or does this really make no sense?
>>
>> Well, I would prefer not to touch anything outside Linux and
>> this driver. And these settings seem to be implementation specific.
>> So, this is why introduce Linux module parameters and don't extend
>> the kbdif protocol.
>
> Why do you consider this implementation specific? How other guests
> decide to forego creation of relative pointer device or keyboard-like
> device?
>
> You already have "features" for absolute pointing device and multitouch,
> so please extend the protocol properly so you indeed do not code
> something implementation-specific (i.e. module parameters).

Ok, but in order to preserve the default behavior, i.e.
pointer and keyboard devices are always created now, I'll have
to have reverse features in the protocol:
 - feature-no-pointer
 - feature-no-keyboard
The above may be set as a part of frontend's configuration and
if missing are considered to be set to false.

> Thanks.
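
The proposed reverse features can be sketched as follows: a missing `feature-no-pointer` or `feature-no-keyboard` entry reads as false, so both devices are still created by default. This is a userspace illustration only. `struct kv_sketch` and the lookup helper are hypothetical stand-ins for XenStore access, not the kbdif protocol or the xenbus API, and treating the string "1" as true is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key/value pair standing in for a XenStore entry. */
struct kv_sketch {
	const char *key;
	const char *value;
};

/* Read a boolean feature; a missing key is treated as false. */
bool feature_set_sketch(const struct kv_sketch *cfg, size_t n,
			const char *key)
{
	assert(key != NULL);

	for (size_t i = 0; i < n; i++)
		if (strcmp(cfg[i].key, key) == 0)
			return strcmp(cfg[i].value, "1") == 0;
	return false;
}

/* Devices are created unless the matching "no" feature is set. */
bool want_pointer_sketch(const struct kv_sketch *cfg, size_t n)
{
	return !feature_set_sketch(cfg, n, "feature-no-pointer");
}

bool want_keyboard_sketch(const struct kv_sketch *cfg, size_t n)
{
	return !feature_set_sketch(cfg, n, "feature-no-keyboard");
}
```

Because the features are negative, an existing configuration with neither entry behaves as today, which is the point of the reverse naming.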

Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs

2018-04-23 Thread David Rientjes

On Sat, 21 Apr 2018, Alexey Dobriyan wrote:

> > On Thu, Apr 19, 2018 at 04:23:02PM -0700, Joel Fernandes (Google) wrote:
> > > Can we not just remove per-IRQ stats from /proc/stat (since I gather
> > > from this discussion it isn't scalable), and just have applications
> > > that need per-IRQ stats use /proc/interrupts ?
> > 
> > If you can prove no one is using them in /proc/stat...
> 
> And you can't even stick WARN into /proc/stat to find out.
> 

FWIW, removing per irq counts from /proc/stat would break some of our 
scripts.  We could adapt to that, but everybody else would have to as 
well, so I'm afraid it's not going to be possible.

It would probably be better to extract out the stats that you're actually 
interested in to a new file.
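
For context, the per-IRQ counts under discussion are the values on the `intr` line of /proc/stat: the first number is the total, and the numbers after it are per-interrupt counts. A minimal sketch of the kind of consumer that would break if they were removed is shown below. It parses one such line from a string; the function name and the caller-supplied array size are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse "intr <total> <irq0> <irq1> ..." into *total and counts[].
 * Returns the number of per-IRQ counts stored (at most max_counts),
 * or -1 if the line is not an intr line. */
int parse_intr_line_sketch(const char *line, unsigned long long *total,
			   unsigned long long *counts, int max_counts)
{
	char *end;
	int n = 0;

	assert(line != NULL && total != NULL);

	if (strncmp(line, "intr ", 5) != 0)
		return -1;
	line += 5;

	*total = strtoull(line, &end, 10);
	if (end == line)
		return -1;  /* no total present */

	while (n < max_counts) {
		unsigned long long v;

		line = end;
		v = strtoull(line, &end, 10);
		if (end == line)
			break;  /* no more numbers on the line */
		counts[n++] = v;
	}
	return n;
}
```

A script that only needs the total would keep working if the per-IRQ values moved to a new file; one that indexes into the per-IRQ values would not.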

Re: [Xen-devel] [RFC 0/2] To introduce xenwatch multithreading (xen mtwatch)

2018-04-23 Thread Dongli Zhang

Hi Juergen,

On 04/24/2018 01:22 PM, Juergen Gross wrote:
> On 24/04/18 01:55, Dongli Zhang wrote:
>> Hi Wei,
>>
>> On 04/23/2018 10:09 PM, Wei Liu wrote:
>>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>>> About per-domU xenwatch thread create/destroy, a new type of xenstore node is
>>>> introduced: '/local/domain/0/mtwatch/'.
>>>>
>>>> Suppose the new domid is 7. During the domU (domid=7) creation, the xen
>>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the insertion
>>>> of '/local/domain/7'. When the domid=7 is destroyed, the last xenstore
>>>> operation by xen toolstack is to remove '/local/domain/0/mtwatch/7'.
>>>>
>>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'.  Kernel
>>>> thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7' is inserted,
>>>> while this kernel thread is destroyed when the corresponding xenstore node is
>>>> removed.
>>>
>>> Instead of inventing yet another node, can you not watch /local/domain
>>> directly?
>>
>> Would you like to watch at /local/domain directly? Or is your question "is there
>> any other way to not watch at /local/domain, while no extra xenstore node will
>> be introduced"?
>>
>> Actually, the first prototype of this idea was to watch at /local/domain
>> directly to get aware of the domU create/destroy, so that xen toolstack will not
>> get involved. Joao Martins (CCed) had a concern on the performance as watching
>> at /local/domain would lead to large amount of xenwatch events.
> 
> That's what the special watches "@introduceDomain" and "@releaseDomain"
> are meant for.

I did consider watching "@introduceDomain". However, no domain information is
appended to "@introduceDomain", so the dom0 kernel would still have to
proactively confirm which domain was created.

Dongli Zhang
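
The node scheme quoted above maps a XenStore path of the form '/local/domain/0/mtwatch/<domid>' to a per-domU kernel thread. A small sketch of how a watch callback might extract the domid from such a path follows. The helper name is hypothetical, and this is plain userspace C rather than code from the RFC.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MTWATCH_PREFIX "/local/domain/0/mtwatch/"

/* Return the domid encoded in path, or -1 if path is not of the form
 * "/local/domain/0/mtwatch/<decimal domid>". */
long mtwatch_domid_sketch(const char *path)
{
	const size_t plen = sizeof(MTWATCH_PREFIX) - 1;
	char *end;
	long domid;

	assert(path != NULL);

	if (strncmp(path, MTWATCH_PREFIX, plen) != 0)
		return -1;
	path += plen;

	/* Require at least one digit and nothing after the number. */
	if (*path < '0' || *path > '9')
		return -1;
	domid = strtol(path, &end, 10);
	if (*end != '\0')
		return -1;
	return domid;
}
```

Unlike a watch on "@introduceDomain", a path of this form carries the domid, so the callback would not need a second lookup to learn which domain was created.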

Re: [v2 00/10] arm/arm64: mediatek: Fix mmsys device probing

2018-04-23 Thread Lee Jones

On Mon, 23 Apr 2018, matthias@kernel.org wrote:

> From: Matthias Brugger 
> 
> Changes since v1:
> - add binding documentation
> - ddp: use regmap_update_bits
> - ddp: ignore EPROBE_DEFER on clock probing
> - mfd: delete mmsys_private
> - add Reviewed-by and Acked-by tags

I'm confused by the double submission.

Please can you send it again completely detached from the first and
the second submissions please?

> ---
> 
> MMSYS in Mediatek SoCs has some registers to control clock gates (which are
> used in the clk driver) and some registers to set the routing and enable
> the different blocks of the display subsystem.
> 
> Up to now both drivers, clock and drm, are probed with the same device tree
> compatible. But only the first driver gets probed, which in effect breaks
> graphics on mt8173 and mt2701.
> 
> This patch set introduces a new mfd device, which binds against the mmsys
> compatible and takes care of probing the needed devices. It was tested on the
> bananapi-r2 and the Acer R13 Chromebook.
> 
> 
> Matthias Brugger (10):
>   dt-bindings: mediatek: mmsys: Add support for mfd
>   drm/mediatek: Use regmap for register access
>   mfd: mtk-mmsys: Add mmsys driver
>   drm/mediatek: mt2701: switch to mfd probing.
>   clk: mediatek: mt2701-mm: switch to mfd device
>   mfd: mtk-mmsys: Add mt8173 nodes
>   drm/mediatek: Add mfd support for mt8173
>   clk: mediatek: mt8173-mm: switch to mfd device
>   drm: mediatek: Omit warning on probe defers
>   MAINTAINERS: update Mediatek Soc entry
> 
>  .../bindings/arm/mediatek/mediatek,mmsys.txt       |  2 -
>  .../bindings/display/mediatek/mediatek,disp.txt    |  2 +-
>  .../devicetree/bindings/mfd/mediatek,mmsys.txt     | 27 +++
>  MAINTAINERS                                        |  2 +
>  drivers/clk/mediatek/clk-mt2701-mm.c               | 10 +--
>  drivers/clk/mediatek/clk-mt8173.c                  | 17 +++-
>  drivers/gpu/drm/mediatek/mtk_drm_crtc.c            |  4 +-
>  drivers/gpu/drm/mediatek/mtk_drm_ddp.c             | 41 --
>  drivers/gpu/drm/mediatek/mtk_drm_ddp.h             |  4 +-
>  drivers/gpu/drm/mediatek/mtk_drm_drv.c             | 33
>  drivers/gpu/drm/mediatek/mtk_drm_drv.h             |  2 +-
>  drivers/mfd/Kconfig                                |  9 +++
>  drivers/mfd/Makefile                               |  2 +
>  drivers/mfd/mtk-mmsys.c                            | 93 ++
>  14 files changed, 189 insertions(+), 59 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/mfd/mediatek,mmsys.txt
>  create mode 100644 drivers/mfd/mtk-mmsys.c
> 

-- 
Lee Jones [李琼斯]
Linaro Services Technical Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog
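
The v2 changelog quoted above mentions moving the ddp code to `regmap_update_bits`. For readers unfamiliar with it, that operation is a masked read-modify-write: only the bits selected by the mask are replaced. A sketch of the semantics on a plain variable follows. The helper name is hypothetical, and a real regmap call additionally handles locking, caching and bus access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace the bits of *reg selected by mask with the matching bits of
 * val, leaving all other bits untouched. */
void update_bits_sketch(uint32_t *reg, uint32_t mask, uint32_t val)
{
	assert(reg != NULL);
	*reg = (*reg & ~mask) | (val & mask);
}
```

This matters for a register block shared between a clock driver and a display driver, since each side can change its own bits without overwriting the other's.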

[PATCH v2 2/3] lightnvm: pblk: garbage collect lines with failed writes

2018-04-23 Thread Hans Holmberg

From: Hans Holmberg 

Write failures should not happen under normal circumstances,
so in order to bring the chunk back into a known state as soon
as possible, evacuate all the valid data out of the line and let the
fw judge if the block can be written to in the next reset cycle.

Do this by introducing a new gc list for lines with failed writes,
and ensure that the rate limiter allocates a small portion of
the write bandwidth to get the job done.

The lba list is saved in memory for use during gc as we
cannot guarantee that the emeta data is readable if a write
error occurred.

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-core.c  |  45 ++-
 drivers/lightnvm/pblk-gc.c| 102 +++---
 drivers/lightnvm/pblk-init.c  |  45 ---
 drivers/lightnvm/pblk-rl.c|  29 ++--
 drivers/lightnvm/pblk-sysfs.c |  15 ++-
 drivers/lightnvm/pblk-write.c |   2 +
 drivers/lightnvm/pblk.h   |  25 +--
 7 files changed, 199 insertions(+), 64 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 7762e89..413cf3b 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -373,7 +373,13 @@ struct list_head *pblk_line_gc_list(struct pblk *pblk, struct pblk_line *line)
 
lockdep_assert_held(&line->lock);
 
-   if (!vsc) {
+   if (line->w_err_gc->has_write_err) {
+   if (line->gc_group != PBLK_LINEGC_WERR) {
+   line->gc_group = PBLK_LINEGC_WERR;
+   move_list = &l_mg->gc_werr_list;
+   pblk_rl_werr_line_in(&pblk->rl);
+   }
+   } else if (!vsc) {
if (line->gc_group != PBLK_LINEGC_FULL) {
line->gc_group = PBLK_LINEGC_FULL;
move_list = &l_mg->gc_full_list;
@@ -1603,8 +1609,13 @@ static void __pblk_line_put(struct pblk *pblk, struct pblk_line *line)
line->state = PBLK_LINESTATE_FREE;
line->gc_group = PBLK_LINEGC_NONE;
pblk_line_free(line);
-   spin_unlock(&line->lock);
 
+   if (line->w_err_gc->has_write_err) {
+   pblk_rl_werr_line_out(&pblk->rl);
+   line->w_err_gc->has_write_err = 0;
+   }
+
+   spin_unlock(&line->lock);
atomic_dec(&gc->pipeline_gc);
 
spin_lock(&l_mg->free_lock);
@@ -1767,11 +1778,34 @@ void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line)
 
spin_lock(&l_mg->close_lock);
spin_lock(&line->lock);
+
+   /* Update the in-memory start address for emeta, in case it has
+* shifted due to write errors
+*/
+   if (line->emeta_ssec != line->cur_sec)
+   line->emeta_ssec = line->cur_sec;
+
list_add_tail(&line->list, &l_mg->emeta_list);
spin_unlock(&line->lock);
spin_unlock(&l_mg->close_lock);
 
pblk_line_should_sync_meta(pblk);
+
+
+}
+
+static void pblk_save_lba_list(struct pblk *pblk, struct pblk_line *line)
+{
+   struct pblk_line_meta *lm = &pblk->lm;
+   struct pblk_line_mgmt *l_mg = &pblk->l_mg;
+   unsigned int lba_list_size = lm->emeta_len[2];
+   struct pblk_w_err_gc *w_err_gc = line->w_err_gc;
+   struct pblk_emeta *emeta = line->emeta;
+
+   w_err_gc->lba_list = pblk_malloc(lba_list_size,
+l_mg->emeta_alloc_type, GFP_KERNEL);
+   memcpy(w_err_gc->lba_list, emeta_to_lbas(pblk, emeta->buf),
+   lba_list_size);
 }
 
 void pblk_line_close_ws(struct work_struct *work)
@@ -1780,6 +1814,13 @@ void pblk_line_close_ws(struct work_struct *work)
ws);
struct pblk *pblk = line_ws->pblk;
struct pblk_line *line = line_ws->line;
+   struct pblk_w_err_gc *w_err_gc = line->w_err_gc;
+
+   /* Write errors makes the emeta start address stored in smeta invalid,
+* so keep a copy of the lba list until we've gc'd the line
+*/
+   if (w_err_gc->has_write_err)
+   pblk_save_lba_list(pblk, line);
 
pblk_line_close(pblk, line);
mempool_free(line_ws, pblk->gen_ws_pool);
diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index b0cc277..df88f1b 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -129,6 +129,53 @@ static void pblk_gc_line_ws(struct work_struct *work)
kfree(gc_rq_ws);
 }
 
+static __le64 *get_lba_list_from_emeta(struct pblk *pblk,
+  struct pblk_line *line)
+{
+   struct line_emeta *emeta_buf;
+   struct pblk_line_mgmt *l_mg = &pblk->l_mg;
+   struct pblk_line_meta *lm = &pblk->lm;
+   unsigned int lba_list_size = lm->emeta_len[2];
+   __le64 *lba_list;
+   int ret;
+
+   emeta_buf = pblk_malloc(lm->emeta_len[0],
+   l_mg->emeta_alloc_type, GFP_KERNEL);
+   if (!emeta_buf)
+
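
The `pblk_line_gc_list()` hunk above checks for a write error before it looks at the valid sector count, and `pblk_save_lba_list()` keeps a private copy of the lba list. A userspace sketch of both ideas follows. The enum and function names are hypothetical, only the two branches visible in the diff are modelled (the other vsc thresholds are outside the quoted context), and the allocation failure check is an addition of this sketch rather than something shown in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the PBLK_LINEGC_* groups. */
enum gc_group_sketch {
	GC_SKETCH_WERR,   /* line had a failed write: evacuate first */
	GC_SKETCH_FULL,   /* no valid sectors left */
	GC_SKETCH_OTHER,  /* any other group, chosen by vsc thresholds */
};

/* A write error takes priority over the valid sector count (vsc). */
enum gc_group_sketch pick_gc_group_sketch(bool has_write_err,
					  unsigned int vsc)
{
	if (has_write_err)
		return GC_SKETCH_WERR;
	if (vsc == 0)
		return GC_SKETCH_FULL;
	return GC_SKETCH_OTHER;
}

/* Keep an in-memory copy of the lba list for later gc.
 * Returns NULL on allocation failure; the caller frees the copy. */
uint64_t *save_lba_list_sketch(const uint64_t *lbas, size_t nr)
{
	uint64_t *copy;

	assert(lbas != NULL && nr > 0);

	copy = malloc(nr * sizeof(*copy));
	if (!copy)
		return NULL;
	memcpy(copy, lbas, nr * sizeof(*copy));
	return copy;
}
```

Checking the write error first means a line with failed writes lands on its own list regardless of how many valid sectors it holds, which matches the commit message's goal of evacuating such lines as soon as possible.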

[PATCH v2 2/3] lightnvm: pblk: garbage collect lines with failed writes

2018-04-23 Thread Hans Holmberg

From: Hans Holmberg 

Write failures should not happen under normal circumstances,
so in order to bring the chunk back into a known state as soon
as possible, evacuate all the valid data out of the line and let the
fw judge if the block can be written to in the next reset cycle.

Do this by introducing a new gc list for lines with failed writes,
and ensure that the rate limiter allocates a small portion of
the write bandwidth to get the job done.

The lba list is saved in memory for use during gc as we
cannot gurantee that the emeta data is readable if a write
error occurred.

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-core.c  |  45 ++-
 drivers/lightnvm/pblk-gc.c| 102 +++---
 drivers/lightnvm/pblk-init.c  |  45 ---
 drivers/lightnvm/pblk-rl.c|  29 ++--
 drivers/lightnvm/pblk-sysfs.c |  15 ++-
 drivers/lightnvm/pblk-write.c |   2 +
 drivers/lightnvm/pblk.h   |  25 +--
 7 files changed, 199 insertions(+), 64 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 7762e89..413cf3b 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -373,7 +373,13 @@ struct list_head *pblk_line_gc_list(struct pblk *pblk, 
struct pblk_line *line)
 
lockdep_assert_held(>lock);
 
-   if (!vsc) {
+   if (line->w_err_gc->has_write_err) {
+   if (line->gc_group != PBLK_LINEGC_WERR) {
+   line->gc_group = PBLK_LINEGC_WERR;
+   move_list = _mg->gc_werr_list;
+   pblk_rl_werr_line_in(>rl);
+   }
+   } else if (!vsc) {
if (line->gc_group != PBLK_LINEGC_FULL) {
line->gc_group = PBLK_LINEGC_FULL;
move_list = _mg->gc_full_list;
@@ -1603,8 +1609,13 @@ static void __pblk_line_put(struct pblk *pblk, struct 
pblk_line *line)
line->state = PBLK_LINESTATE_FREE;
line->gc_group = PBLK_LINEGC_NONE;
pblk_line_free(line);
-   spin_unlock(&line->lock);
 
+   if (line->w_err_gc->has_write_err) {
+   pblk_rl_werr_line_out(&pblk->rl);
+   line->w_err_gc->has_write_err = 0;
+   }
+
+   spin_unlock(&line->lock);
atomic_dec(&gc->pipeline_gc);
 
spin_lock(&l_mg->free_lock);
@@ -1767,11 +1778,34 @@ void pblk_line_close_meta(struct pblk *pblk, struct 
pblk_line *line)
 
spin_lock(&l_mg->close_lock);
spin_lock(&line->lock);
+
+   /* Update the in-memory start address for emeta, in case it has
+* shifted due to write errors
+*/
+   if (line->emeta_ssec != line->cur_sec)
+   line->emeta_ssec = line->cur_sec;
+
list_add_tail(&line->list, &l_mg->emeta_list);
spin_unlock(&line->lock);
spin_unlock(&l_mg->close_lock);
 
pblk_line_should_sync_meta(pblk);
+
+
+}
+
+static void pblk_save_lba_list(struct pblk *pblk, struct pblk_line *line)
+{
+   struct pblk_line_meta *lm = &pblk->lm;
+   struct pblk_line_mgmt *l_mg = &pblk->l_mg;
+   unsigned int lba_list_size = lm->emeta_len[2];
+   struct pblk_w_err_gc *w_err_gc = line->w_err_gc;
+   struct pblk_emeta *emeta = line->emeta;
+
+   w_err_gc->lba_list = pblk_malloc(lba_list_size,
+l_mg->emeta_alloc_type, GFP_KERNEL);
+   memcpy(w_err_gc->lba_list, emeta_to_lbas(pblk, emeta->buf),
+   lba_list_size);
 }
 
 void pblk_line_close_ws(struct work_struct *work)
@@ -1780,6 +1814,13 @@ void pblk_line_close_ws(struct work_struct *work)
ws);
struct pblk *pblk = line_ws->pblk;
struct pblk_line *line = line_ws->line;
+   struct pblk_w_err_gc *w_err_gc = line->w_err_gc;
+
+   /* Write errors make the emeta start address stored in smeta invalid,
+* so keep a copy of the lba list until we've gc'd the line
+*/
+   if (w_err_gc->has_write_err)
+   pblk_save_lba_list(pblk, line);
 
pblk_line_close(pblk, line);
mempool_free(line_ws, pblk->gen_ws_pool);
diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index b0cc277..df88f1b 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -129,6 +129,53 @@ static void pblk_gc_line_ws(struct work_struct *work)
kfree(gc_rq_ws);
 }
 
+static __le64 *get_lba_list_from_emeta(struct pblk *pblk,
+  struct pblk_line *line)
+{
+   struct line_emeta *emeta_buf;
+   struct pblk_line_mgmt *l_mg = &pblk->l_mg;
+   struct pblk_line_meta *lm = &pblk->lm;
+   unsigned int lba_list_size = lm->emeta_len[2];
+   __le64 *lba_list;
+   int ret;
+
+   emeta_buf = pblk_malloc(lm->emeta_len[0],
+   l_mg->emeta_alloc_type, GFP_KERNEL);
+   if (!emeta_buf)
+   return NULL;
+
+   ret =

[PATCH v2 1/3] lightnvm: pblk: rework write error recovery path

2018-04-23 Thread Hans Holmberg

From: Hans Holmberg 

The write error recovery path is incomplete, so rework
the write error recovery handling to do resubmits directly
from the write buffer.

When a write error occurs, the remaining sectors in the chunk are
mapped out and invalidated and the request inserted in a resubmit list.

The writer thread checks if there are any requests to resubmit,
scans and invalidates any lbas that have been overwritten by later
writes and resubmits the failed entries.
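
The scan-and-invalidate step can be sketched in plain userspace C. This is
not the pblk implementation: struct failed_entry, the per-lba sequence
table and mark_for_resubmit() are hypothetical, and using sequence numbers
to detect a later overwrite is an assumption made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one failed write-buffer entry (not a pblk type). */
struct failed_entry {
	uint64_t lba;	/* logical block address the entry was written for */
	uint64_t seq;	/* sequence number assigned when it was buffered */
	bool resubmit;	/* true if the entry still holds the newest data */
};

/*
 * latest_seq[lba] holds the sequence number of the newest write seen for
 * each lba. An entry with a different (older) sequence number has been
 * overwritten by a later write and is skipped instead of resubmitted.
 * Returns the number of entries that remain to be resubmitted.
 */
static size_t mark_for_resubmit(struct failed_entry *entries, size_t n,
				const uint64_t *latest_seq, size_t nr_lbas)
{
	size_t kept = 0;

	assert(entries != NULL && latest_seq != NULL);

	for (size_t i = 0; i < n; i++) {
		bool current = entries[i].lba < nr_lbas &&
			       latest_seq[entries[i].lba] == entries[i].seq;

		entries[i].resubmit = current;
		if (current)
			kept++;
	}

	return kept;
}
```

Only entries that still carry the newest data for their lba are written
again; stale entries are dropped so that an old copy never replaces a newer
one.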

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-init.c |   2 +
 drivers/lightnvm/pblk-rb.c   |  39 --
 drivers/lightnvm/pblk-recovery.c |  91 -
 drivers/lightnvm/pblk-write.c| 267 ++-
 drivers/lightnvm/pblk.h  |  11 +-
 5 files changed, 181 insertions(+), 229 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index bfc488d..6f06727 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -426,6 +426,7 @@ static int pblk_core_init(struct pblk *pblk)
goto free_r_end_wq;
 
INIT_LIST_HEAD(&pblk->compl_list);
+   INIT_LIST_HEAD(&pblk->resubmit_list);
 
return 0;
 
@@ -1185,6 +1186,7 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
pblk->state = PBLK_STATE_RUNNING;
pblk->gc.gc_enabled = 0;
 
+   spin_lock_init(&pblk->resubmit_lock);
spin_lock_init(&pblk->trans_lock);
spin_lock_init(&pblk->lock);
 
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 024a366..00cd1f2 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -503,45 +503,6 @@ int pblk_rb_may_write_gc(struct pblk_rb *rb, unsigned int 
nr_entries,
 }
 
 /*
- * The caller of this function must ensure that the backpointer will not
- * overwrite the entries passed on the list.
- */
-unsigned int pblk_rb_read_to_bio_list(struct pblk_rb *rb, struct bio *bio,
- struct list_head *list,
- unsigned int max)
-{
-   struct pblk_rb_entry *entry, *tentry;
-   struct page *page;
-   unsigned int read = 0;
-   int ret;
-
-   list_for_each_entry_safe(entry, tentry, list, index) {
-   if (read > max) {
-   pr_err("pblk: too many entries on list\n");
-   goto out;
-   }
-
-   page = virt_to_page(entry->data);
-   if (!page) {
-   pr_err("pblk: could not allocate write bio page\n");
-   goto out;
-   }
-
-   ret = bio_add_page(bio, page, rb->seg_size, 0);
-   if (ret != rb->seg_size) {
-   pr_err("pblk: could not add page to write bio\n");
-   goto out;
-   }
-
-   list_del(&entry->index);
-   read++;
-   }
-
-out:
-   return read;
-}
-
-/*
  * Read available entries on rb and add them to the given bio. To avoid a 
memory
  * copy, a page reference to the write buffer is used to be added to the bio.
  *
diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 9cb6d5d..5983428 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -16,97 +16,6 @@
 
 #include "pblk.h"
 
-void pblk_submit_rec(struct work_struct *work)
-{
-   struct pblk_rec_ctx *recovery =
-   container_of(work, struct pblk_rec_ctx, ws_rec);
-   struct pblk *pblk = recovery->pblk;
-   struct nvm_rq *rqd = recovery->rqd;
-   struct pblk_c_ctx *c_ctx = nvm_rq_to_pdu(rqd);
-   struct bio *bio;
-   unsigned int nr_rec_secs;
-   unsigned int pgs_read;
-   int ret;
-
-   nr_rec_secs = bitmap_weight((unsigned long int *)&rqd->ppa_status,
-   NVM_MAX_VLBA);
-
-   bio = bio_alloc(GFP_KERNEL, nr_rec_secs);
-
-   bio->bi_iter.bi_sector = 0;
-   bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
-   rqd->bio = bio;
-   rqd->nr_ppas = nr_rec_secs;
-
-   pgs_read = pblk_rb_read_to_bio_list(&pblk->rwb, bio, &recovery->failed,
-   nr_rec_secs);
-   if (pgs_read != nr_rec_secs) {
-   pr_err("pblk: could not read recovery entries\n");
-   goto err;
-   }
-
-   if (pblk_setup_w_rec_rq(pblk, rqd, c_ctx)) {
-   pr_err("pblk: could not setup recovery request\n");
-   goto err;
-   }
-
-#ifdef CONFIG_NVM_DEBUG
-   atomic_long_add(nr_rec_secs, &pblk->recov_writes);
-#endif
-
-   ret = pblk_submit_io(pblk, rqd);
-   if (ret) {
-   pr_err("pblk: I/O submission failed: %d\n", ret);
-   goto err;
-   }
-
-   mempool_free(recovery, pblk->rec_pool);
-   return;
-
-err:
-   bio_put(bio);
-   pblk_free_rqd(pblk, rqd, PBLK_WRITE);
-}
-
-int

Re: [PATCH] mfd: tps65911-comparator: Fix a build error

2018-04-23 Thread Lee Jones

On Fri, 20 Apr 2018, Dan Carpenter wrote:

> In 2012, we changed the tps65910 API and fixed most drivers but forgot
> to update this one.
> 
> Fixes: 3f7e82759c69 ("mfd: Commonize tps65910 regmap access through header")
> Signed-off-by: Dan Carpenter 

Applied, thanks.

-- 
Lee Jones [李琼斯]
Linaro Services Technical Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog

Re: [PATCH v2] mfd: tps65911-comparator: Fix an off by one bug

2018-04-23 Thread Lee Jones

On Mon, 23 Apr 2018, Dan Carpenter wrote:

> I really don't want authorship credit for a patch that does:
> 
> #define COMP1 0
> #define COMP2 1
> 
> It just makes me itch to look at it...

As I say, I get where you're coming from, but it was the decision made
by the H/W engineers.  Doesn't mean it's not the correct thing to do
from a S/W POV.

If you're not happy putting your SOB on the patch, I'll draft it and
sign it off.  /me has no shame!

-- 
Lee Jones [李琼斯]
Linaro Services Technical Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog

[PATCH v2 0/3] Rework write error handling in pblk

2018-04-23 Thread Hans Holmberg

From: Hans Holmberg 

This patch series fixes the (currently incomplete) write error handling
in pblk by:

 * queuing and re-submitting failed writes in the write buffer
 * evacuating valid data in lines with write failures, so the
   chunk(s) with write failures can be reset to a known state by the fw

Lines with failures in smeta are put back on the free list.
Failed chunks will be reset on the next use.

If a write fails in emeta, the lba list is cached so the line can be
garbage collected without scanning the out-of-band area.

Changes in V2:
- Added the recov_writes counter increase to the new path
- Moved lba list emeta reading during gc to a separate function
- Allocating the saved lba list with pblk_malloc instead of kmalloc
- Fixed formatting issues
- Removed dead code

Hans Holmberg (3):
  lightnvm: pblk: rework write error recovery path
  lightnvm: pblk: garbage collect lines with failed writes
  lightnvm: pblk: fix smeta write error path

 drivers/lightnvm/pblk-core.c |  52 +++-
 drivers/lightnvm/pblk-gc.c   | 102 +--
 drivers/lightnvm/pblk-init.c |  47 ---
 drivers/lightnvm/pblk-rb.c   |  39 --
 drivers/lightnvm/pblk-recovery.c |  91 -
 drivers/lightnvm/pblk-rl.c   |  29 -
 drivers/lightnvm/pblk-sysfs.c|  15 ++-
 drivers/lightnvm/pblk-write.c| 269 ++-
 drivers/lightnvm/pblk.h  |  36 --
 9 files changed, 384 insertions(+), 296 deletions(-)

-- 
2.7.4

[PATCH v2 3/3] lightnvm: pblk: fix smeta write error path

2018-04-23 Thread Hans Holmberg

From: Hans Holmberg 

Smeta write errors were previously ignored. Skip these
lines instead and throw them back on the free
list, so the chunks will go through a reset cycle
before we attempt to use the line again.

Signed-off-by: Hans Holmberg 
---
 drivers/lightnvm/pblk-core.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 413cf3b..dec1bb4 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -849,9 +849,10 @@ static int pblk_line_submit_smeta_io(struct pblk *pblk, 
struct pblk_line *line,
atomic_dec(&pblk->inflight_io);
 
if (rqd.error) {
-   if (dir == PBLK_WRITE)
+   if (dir == PBLK_WRITE) {
pblk_log_write_err(pblk, &rqd);
-   else if (dir == PBLK_READ)
+   ret = 1;
+   } else if (dir == PBLK_READ)
pblk_log_read_err(pblk, &rqd);
}
 
@@ -1120,7 +1121,7 @@ static int pblk_line_init_bb(struct pblk *pblk, struct 
pblk_line *line,
 
if (init && pblk_line_submit_smeta_io(pblk, line, off, PBLK_WRITE)) {
pr_debug("pblk: line smeta I/O failed. Retry\n");
-   return 1;
+   return 0;
}
 
bitmap_copy(line->invalid_bitmap, line->map_bitmap, lm->sec_per_line);
-- 
2.7.4

Re: [PATCH 1/6] powerpc/64s: Add barrier_nospec

2018-04-23 Thread Nicholas Piggin

On Tue, 24 Apr 2018 14:15:54 +1000
Michael Ellerman  wrote:

> From: Michal Suchanek 
> 
> A no-op form of ori (or immediate of 0 into r31 and the result stored
> in r31) has been re-tasked as a speculation barrier. The instruction
> only acts as a barrier on newer machines with appropriate firmware
> support. On older CPUs it remains a harmless no-op.
> 
> Implement barrier_nospec using this instruction.
> 
> mpe: The semantics of the instruction are believed to be that it
> prevents execution of subsequent instructions until preceding branches
> have been fully resolved and are no longer executing speculatively.
> There is no further documentation available at this time.
> 
> Signed-off-by: Michal Suchanek 
> Signed-off-by: Michael Ellerman 
> ---
> mpe: Make it Book3S64 only, update comment & change log, add a
>  memory clobber to the asm.

These all seem good to me. Thanks Michal.

We should (eventually) work on the module patching problem too.

Thanks,
Nick
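
For readers following along, here is a rough userspace sketch of how such a
barrier is typically placed between a bounds check and the dependent load.
The `ori 31,31,0` encoding and the memory clobber come from the quoted
patch description; read_checked() and the non-powerpc fallback (a plain
compiler barrier) are assumptions added so the example builds anywhere, and
they are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__powerpc64__)
/*
 * ori 31,31,0 is architecturally a no-op. Per the patch description it
 * only acts as a speculation barrier on newer machines with appropriate
 * firmware support.
 */
#define barrier_nospec() __asm__ __volatile__("ori 31,31,0" ::: "memory")
#else
/* Assumed fallback for other architectures: compiler barrier only. */
#define barrier_nospec() __asm__ __volatile__("" ::: "memory")
#endif

/*
 * Hypothetical helper: read table[idx] only after the bounds check has
 * been resolved, so the load is not issued speculatively with an
 * out-of-range index. Returns 0 on success and -1 if idx is out of range.
 */
static int read_checked(const int *table, size_t size, size_t idx, int *out)
{
	assert(table != NULL && out != NULL);

	if (idx >= size)
		return -1;

	barrier_nospec();
	*out = table[idx];
	return 0;
}
```

The barrier sits after the branch and before the load, which is the
placement implied by the description of the instruction's semantics in the
quoted changelog.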

Re: [Xen-devel] [PATCH 0/1] drm/xen-zcopy: Add Xen zero-copy helper DRM driver

2018-04-23 Thread Oleksandr Andrushchenko


On 04/24/2018 01:41 AM, Boris Ostrovsky wrote:

On 04/23/2018 08:10 AM, Oleksandr Andrushchenko wrote:

On 04/23/2018 02:52 PM, Wei Liu wrote:

On Fri, Apr 20, 2018 at 02:25:20PM +0300, Oleksandr Andrushchenko wrote:

  the gntdev.

I think this is generic enough that it could be implemented by a
device not tied to Xen. AFAICT the hyper_dma guys also wanted
something similar to this.

You can't just wrap random userspace memory into a dma-buf. We've
just had
this discussion with kvm/qemu folks, who proposed just that, and
after a
bit of discussion they'll now try to have a driver which just wraps a
memfd into a dma-buf.

So, we have to decide whether we introduce a new driver
(say, under drivers/xen/xen-dma-buf) or extend the existing
gntdev/balloon to support dma-buf use-cases.

Can anybody from Xen community express their preference here?


Oleksandr talked to me on IRC about this, he said a few IOCTLs need to
be added to either existing drivers or a new driver.

I went through this thread twice and skimmed through the relevant
documents, but I couldn't see any obvious pros and cons for either
approach. So I don't really have an opinion on this.

But if they are implemented in existing drivers, those IOCTLs would need
to be added to different drivers, which means a userspace program needs
more code and more handles. From that perspective it would be slightly
better to implement a new driver.

If gntdev/balloon extension is still considered:

All the IOCTLs will be in gntdev driver (in current xen-zcopy
terminology):
  - DRM_IOCTL_XEN_ZCOPY_DUMB_FROM_REFS
  - DRM_IOCTL_XEN_ZCOPY_DUMB_TO_REFS
  - DRM_IOCTL_XEN_ZCOPY_DUMB_WAIT_FREE

Balloon driver extension, which is needed for contiguous/DMA
buffers, will be to provide new *kernel API*, no UAPI is needed.



So I am obviously a bit late to this thread, but why do you need to add
new ioctls to gntdev and balloon? Doesn't this driver manage to do what
you want without any extensions?

1. I only (may) need to add IOCTLs to gntdev
2. The balloon driver needs to be extended so it can allocate
contiguous (DMA) memory; no IOCTLs/UAPI here, it all lives
in the kernel.
3. The reason I need to extend gnttab with new IOCTLs is to
provide new functionality to create a dma-buf from grant references
and to produce grant references for a dma-buf. This is what I have as UAPI
description for xen-zcopy driver:

1. DRM_IOCTL_XEN_ZCOPY_DUMB_FROM_REFS
This will create a DRM dumb buffer from grant references provided
by the frontend. The intended usage is:
  - Frontend
    - creates a dumb/display buffer and allocates memory
    - grants foreign access to the buffer pages
    - passes granted references to the backend
  - Backend
    - issues DRM_XEN_ZCOPY_DUMB_FROM_REFS ioctl to map
  granted references and create a dumb buffer
    - requests handle to fd conversion via DRM_IOCTL_PRIME_HANDLE_TO_FD
    - requests real HW driver/consumer to import the PRIME buffer with
  DRM_IOCTL_PRIME_FD_TO_HANDLE
    - uses handle returned by the real HW driver
  - at the end:
    o closes real HW driver's handle with DRM_IOCTL_GEM_CLOSE
    o closes zero-copy driver's handle with DRM_IOCTL_GEM_CLOSE
    o closes file descriptor of the exported buffer

2. DRM_IOCTL_XEN_ZCOPY_DUMB_TO_REFS
This will grant references to a dumb/display buffer's memory provided by the
backend. The intended usage is:
  - Frontend
    - requests backend to allocate dumb/display buffer and grant references
  to its pages
  - Backend
    - requests real HW driver to create a dumb with
      DRM_IOCTL_MODE_CREATE_DUMB
    - requests handle to fd conversion via DRM_IOCTL_PRIME_HANDLE_TO_FD
    - requests zero-copy driver to import the PRIME buffer with
  DRM_IOCTL_PRIME_FD_TO_HANDLE
    - issues DRM_XEN_ZCOPY_DUMB_TO_REFS ioctl to
  grant references to the buffer's memory.
    - passes grant references to the frontend
 - at the end:
    - closes zero-copy driver's handle with DRM_IOCTL_GEM_CLOSE
    - closes real HW driver's handle with DRM_IOCTL_GEM_CLOSE
    - closes file descriptor of the imported buffer

3. DRM_XEN_ZCOPY_DUMB_WAIT_FREE
This will block until the dumb buffer with the wait handle provided is
freed. This is needed for synchronization between frontend and backend in
case the frontend provides grant references of the buffer via the
DRM_XEN_ZCOPY_DUMB_FROM_REFS IOCTL, which must be released before the
backend replies with an XENDISPL_OP_DBUF_DESTROY response.
wait_handle must be the same value returned while calling
DRM_XEN_ZCOPY_DUMB_FROM_REFS IOCTL.

So, as you can see, the above functionality is not covered by the existing
UAPI of the gntdev driver.
Now, if we change dumb -> dma-buf and remove the DRM code (which is only a
wrapper here on top of dma-buf), we get a new driver for dma-buf for Xen.

This is why I have 2 options here: either create a dedicated driver for this
(e.g. re-work xen-zcopy to be DRM independent and put it under
drivers/xen/xen-dma-buf, for example) or extend the existing

Re: [PATCH] rave-sp: Remove VLA

2018-04-23 Thread Lee Jones

On Mon, 23 Apr 2018, Kyle Spiers wrote:

> As part of the effort to remove VLAs from the kernel[1], this creates
> constants for the checksum lengths of CCITT and 8B2C and changes
> crc_calculated to be the maximum size of a checksum.
> 
> https://lkml.org/lkml/2018/3/7/621
> 
> Signed-off-by: Kyle Spiers 
> ---
>  drivers/mfd/rave-sp.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mfd/rave-sp.c b/drivers/mfd/rave-sp.c
> index 5c858e784a89..99fa482419f9 100644
> --- a/drivers/mfd/rave-sp.c
> +++ b/drivers/mfd/rave-sp.c
> @@ -45,7 +45,9 @@
>  #define RAVE_SP_DLE  0x10
>  
>  #define RAVE_SP_MAX_DATA_SIZE 64
> -#define RAVE_SP_CHECKSUM_SIZE 2  /* Worst case scenario on RDU2 */
> +#define RAVE_SP_CHECKSUM_8B2C 1
> +#define RAVE_SP_CHECKSUM_CCITT 2
> +#define RAVE_SP_CHECKSUM_SIZE RAVE_SP_CHECKSUM_CCITT
>  /*
>   * We don't store STX, ETX and unescaped bytes, so Rx is only
>   * DATA + CSUM
> @@ -415,7 +417,12 @@ static void rave_sp_receive_frame(struct rave_sp *sp,
>   const size_t payload_length  = length - checksum_length;
>   const u8 *crc_reported   = &data[payload_length];
>   struct device *dev   = &sp->serdev->dev;
> - u8 crc_calculated[checksum_length];
> + u8 crc_calculated[RAVE_SP_CHECKSUM_SIZE];
> +
> + if (unlikely(length > sizeof(crc_calculated))) {

Forgive me if I have this wrong (it's still very early here), but this
doesn't leave any room for the payload?

<--   length -->
<- payload length ->
[CK][CK][D][A][T][A] .. [64]

It is my hope that length would always be larger than the size of the
checksum, or else there would never be any data?

Should this not be:

if (unlikely(length > RAVE_SP_MAX_DATA_SIZE))

Nit: Adding the check is also an unrelated change, so would require a
separate patch.
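
To make the concern concrete, here is a small standalone sketch of a frame
length check that keeps room for the payload. It is not the driver code:
frame_payload_length() and the MODEL_* constants are hypothetical (the
values mirror RAVE_SP_MAX_DATA_SIZE and RAVE_SP_CHECKSUM_SIZE from the
quoted patch), and applying the upper bound to the payload rather than to
the whole frame is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_DATA_SIZE	64	/* mirrors RAVE_SP_MAX_DATA_SIZE */
#define MODEL_CHECKSUM_SIZE	2	/* mirrors RAVE_SP_CHECKSUM_SIZE */

/*
 * Split a received frame of 'length' bytes into payload and checksum.
 * Returns 0 and stores the payload length on success, or -1 if the frame
 * cannot contain both a payload and a checksum of 'checksum_length' bytes.
 */
static int frame_payload_length(size_t length, size_t checksum_length,
				size_t *payload_length)
{
	assert(payload_length != NULL);

	/* The fixed-size checksum buffer must be able to hold the checksum. */
	if (checksum_length > MODEL_CHECKSUM_SIZE)
		return -1;

	/* A frame no longer than its checksum carries no data at all. */
	if (length <= checksum_length)
		return -1;

	/* Assumed upper bound: the payload may not exceed the data buffer. */
	if (length - checksum_length > MODEL_MAX_DATA_SIZE)
		return -1;

	*payload_length = length - checksum_length;
	return 0;
}
```

With this shape, a check such as length > sizeof(crc_calculated) would
reject every frame that carries data, which is the problem raised above.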

> + dev_warn(dev, "Dropping oversized frame\n");
> + return;
> + }
>  
>   print_hex_dump(KERN_DEBUG, "rave-sp rx: ", DUMP_PREFIX_NONE,
>  16, 1, data, length, false);

-- 
Lee Jones [李琼斯]
Linaro Services Technical Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog

Re: [Xen-devel] [PATCH 0/1] drm/xen-zcopy: Add Xen zero-copy helper DRM driver

2018-04-23 Thread Oleksandr Andrushchenko


On 04/24/2018 01:41 AM, Boris Ostrovsky wrote:

On 04/23/2018 08:10 AM, Oleksandr Andrushchenko wrote:

On 04/23/2018 02:52 PM, Wei Liu wrote:

On Fri, Apr 20, 2018 at 02:25:20PM +0300, Oleksandr Andrushchenko wrote:

  the gntdev.

I think this is generic enough that it could be implemented by a
device not tied to Xen. AFAICT the hyper_dma guys also wanted
something similar to this.

You can't just wrap random userspace memory into a dma-buf. We've
just had
this discussion with kvm/qemu folks, who proposed just that, and
after a
bit of discussion they'll now try to have a driver which just wraps a
memfd into a dma-buf.

So, we have to decide either we introduce a new driver
(say, under drivers/xen/xen-dma-buf) or extend the existing
gntdev/balloon to support dma-buf use-cases.

Can anybody from Xen community express their preference here?


Oleksandr talked to me on IRC about this, he said a few IOCTLs need to
be added to either existing drivers or a new driver.

I went through this thread twice and skimmed through the relevant
documents, but I couldn't see any obvious pros and cons for either
approach. So I don't really have an opinion on this.

But, assuming if implemented in existing drivers, those IOCTLs need to
be added to different drivers, which means userspace program needs to
write more code and get more handles, it would be slightly better to
implement a new driver from that perspective.

If gntdev/balloon extension is still considered:

All the IOCTLs will be in gntdev driver (in current xen-zcopy
terminology):
  - DRM_ICOTL_XEN_ZCOPY_DUMB_FROM_REFS
  - DRM_IOCTL_XEN_ZCOPY_DUMB_TO_REFS
  - DRM_IOCTL_XEN_ZCOPY_DUMB_WAIT_FREE

Balloon driver extension, which is needed for contiguous/DMA
buffers, will be to provide new *kernel API*, no UAPI is needed.



So I am obviously a bit late to this thread, but why do you need to add
new ioctls to gntdev and balloon? Doesn't this driver manage to do what
you want without any extensions?

1. I only (may) need to add IOCTLs to gntdev
2. balloon driver needs to be extended, so it can allocate
contiguous (DMA) memory, not IOCTLs/UAPI here, all lives
in the kernel.
3. The reason I need to extend gnttab with new IOCTLs is to
provide new functionality to create a dma-buf from grant references
and to produce grant references for a dma-buf. This is what I have as UAPI
description for xen-zcopy driver:

1. DRM_IOCTL_XEN_ZCOPY_DUMB_FROM_REFS
This will create a DRM dumb buffer from grant references provided
by the frontend. The intended usage is:
  - Frontend
    - creates a dumb/display buffer and allocates memory
    - grants foreign access to the buffer pages
    - passes granted references to the backend
  - Backend
    - issues DRM_XEN_ZCOPY_DUMB_FROM_REFS ioctl to map
  granted references and create a dumb buffer
    - requests handle to fd conversion via DRM_IOCTL_PRIME_HANDLE_TO_FD
    - requests real HW driver/consumer to import the PRIME buffer with
  DRM_IOCTL_PRIME_FD_TO_HANDLE
    - uses handle returned by the real HW driver
  - at the end:
    o closes real HW driver's handle with DRM_IOCTL_GEM_CLOSE
    o closes zero-copy driver's handle with DRM_IOCTL_GEM_CLOSE
    o closes file descriptor of the exported buffer

2. DRM_IOCTL_XEN_ZCOPY_DUMB_TO_REFS
This will grant references to a dumb/display buffer's memory provided by the
backend. The intended usage is:
  - Frontend
    - requests backend to allocate dumb/display buffer and grant references
  to its pages
  - Backend
    - requests real HW driver to create a dumb with 
DRM_IOCTL_MODE_CREATE_DUMB

    - requests handle to fd conversion via DRM_IOCTL_PRIME_HANDLE_TO_FD
    - requests zero-copy driver to import the PRIME buffer with
  DRM_IOCTL_PRIME_FD_TO_HANDLE
    - issues DRM_XEN_ZCOPY_DUMB_TO_REFS ioctl to
  grant references to the buffer's memory.
    - passes grant references to the frontend
 - at the end:
    - closes zero-copy driver's handle with DRM_IOCTL_GEM_CLOSE
    - closes real HW driver's handle with DRM_IOCTL_GEM_CLOSE
    - closes file descriptor of the imported buffer

3. DRM_XEN_ZCOPY_DUMB_WAIT_FREE
This will block until the dumb buffer with the provided wait handle is
freed. This is needed for synchronization between frontend and backend in
case the frontend provides grant references of the buffer via the
DRM_XEN_ZCOPY_DUMB_FROM_REFS IOCTL, which must be released before the
backend replies with the XENDISPL_OP_DBUF_DESTROY response.
wait_handle must be the same value returned by the
DRM_XEN_ZCOPY_DUMB_FROM_REFS IOCTL.

So, as you can see, the above functionality is not covered by the
existing UAPI of the gntdev driver.
Now, if we change dumb -> dma-buf and remove the DRM code (which is only a
wrapper here on top of dma-buf), we get a new dma-buf driver for Xen.

This is why I have 2 options here: either create a dedicated driver for this
(e.g. re-work xen-zcopy to be DRM independent and put it under
drivers/xen/xen-dma-buf, for example) or extend the existing

Re: [PATCH] rave-sp: Remove VLA

2018-04-23 Thread Lee Jones

On Mon, 23 Apr 2018, Kyle Spiers wrote:

> As part of the effort to remove VLAs from the kernel[1], this creates
> constants for the checksum lengths of CCITT and 8B2C and changes
> crc_calculated to be the maximum size of a checksum.
> 
> [1] https://lkml.org/lkml/2018/3/7/621
> 
> Signed-off-by: Kyle Spiers 
> ---
>  drivers/mfd/rave-sp.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mfd/rave-sp.c b/drivers/mfd/rave-sp.c
> index 5c858e784a89..99fa482419f9 100644
> --- a/drivers/mfd/rave-sp.c
> +++ b/drivers/mfd/rave-sp.c
> @@ -45,7 +45,9 @@
>  #define RAVE_SP_DLE  0x10
>  
>  #define RAVE_SP_MAX_DATA_SIZE   64
> -#define RAVE_SP_CHECKSUM_SIZE   2  /* Worst case scenario on RDU2 */
> +#define RAVE_SP_CHECKSUM_8B2C   1
> +#define RAVE_SP_CHECKSUM_CCITT  2
> +#define RAVE_SP_CHECKSUM_SIZE   RAVE_SP_CHECKSUM_CCITT
>  /*
>   * We don't store STX, ETX and unescaped bytes, so Rx is only
>   * DATA + CSUM
> @@ -415,7 +417,12 @@ static void rave_sp_receive_frame(struct rave_sp *sp,
>   const size_t payload_length  = length - checksum_length;
>   const u8 *crc_reported   = &data[payload_length];
>   struct device *dev   = &sp->serdev->dev;
> - u8 crc_calculated[checksum_length];
> + u8 crc_calculated[RAVE_SP_CHECKSUM_SIZE];
> +
> + if (unlikely(length > sizeof(crc_calculated))) {

Forgive me if I have this wrong (it's still very early here), but this
doesn't leave any room for the payload?

<--   length -->
<- payload length ->
[CK][CK][D][A][T][A] .. [64]

It is my hope that length would always be larger than the size of the
checksum, or else there would never be any data?

Should this not be:

if (unlikely(length > RAVE_SP_MAX_DATA_SIZE))

Nit: Adding the check is also an unrelated change, so would require a
separate patch.
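
To make the concern concrete, here is a small user-space sketch comparing the check as posted with a bounded alternative. The two macro values are taken from the quoted diff; the helper names are made up for illustration, and the upper bound of data plus checksum is an assumption based on the "Rx is only DATA + CSUM" comment in the driver. Whether the right limit is RAVE_SP_MAX_DATA_SIZE alone, as suggested above, is for the driver author to confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RAVE_SP_MAX_DATA_SIZE  64
#define RAVE_SP_CHECKSUM_SIZE  2   /* largest checksum, i.e. CCITT */

/* The check as posted: the whole frame length (payload + checksum) is
 * compared against the size of the checksum buffer alone. */
static bool dropped_by_posted_check(size_t length)
{
	return length > RAVE_SP_CHECKSUM_SIZE;
}

/* Assumed alternative: a frame must at least hold its checksum and must
 * not exceed the largest payload plus the largest checksum. */
static bool dropped_by_bounded_check(size_t length, size_t checksum_length)
{
	assert(checksum_length <= RAVE_SP_CHECKSUM_SIZE);

	return length < checksum_length ||
	       length > RAVE_SP_MAX_DATA_SIZE + RAVE_SP_CHECKSUM_SIZE;
}
```

With a 2-byte checksum, a 3-byte frame (one payload byte) is dropped by the posted check but accepted by the bounded one.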

> + dev_warn(dev, "Dropping oversized frame\n");
> + return;
> + }
>  
>   print_hex_dump(KERN_DEBUG, "rave-sp rx: ", DUMP_PREFIX_NONE,
>  16, 1, data, length, false);

-- 
Lee Jones [李琼斯]
Linaro Services Technical Lead
Linaro.org │ Open source software for ARM SoCs
Follow Linaro: Facebook | Twitter | Blog

Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap

2018-04-23 Thread David Rientjes

On Tue, 24 Apr 2018, Tetsuo Handa wrote:

> > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> > > exit_mmap() holds mmap_sem for write. Then, at least memory which could
> > > have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> > > be guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> > > 
> > 
> > I think that's an exceptionally good idea and will mitigate the concerns 
> > of others.
> > 
> > It can be done without holding mm->mmap_sem in exit_mmap() and uses the 
> > same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we 
> > don't get dozens of unnecessary oom kills.
> > 
> > What do you think about this?  It passes preliminary testing on powerpc 
> > and I've enqueued it for much more intensive testing.  (I'm wishing there 
> > was a better way to acknowledge your contribution to fixing this issue, 
> > especially since you brought up the exact problem this is addressing in 
> > previous emails.)
> > 
> 
> I don't think this patch is safe, for exit_mmap() is calling
> mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock
> held when oom_reap_task_mm() is waiting for oom_lock held by exit_mmap().

One of the reasons that I extracted __oom_reap_task_mm() out of the new 
oom_reap_task_mm() is to avoid the checks that would be unnecessary when 
called from exit_mmap().  In this case, we can ignore the 
mm_has_blockable_invalidate_notifiers() check because exit_mmap() has 
already done mmu_notifier_release().  So I don't think there's a concern 
about __oom_reap_task_mm() blocking while holding oom_lock.  Unless you 
are referring to something else?
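
For reference, the ordering in the suggestion quoted below can be modelled in user space roughly like this. It is a sequential toy model, not kernel code: the struct, the counters and both function names are hypothetical, a pthread mutex stands in for oom_lock, and a bool stands in for MMF_OOM_SKIP.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's oom_lock. */
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical model of the state discussed in the thread. */
struct mm_model {
	bool oom_skip;        /* models MMF_OOM_SKIP */
	int reaped_by_exit;   /* how often the exit path reaped */
	int reaped_by_reaper; /* how often the OOM reaper reaped */
};

/* Exit path as in the suggestion: reap without oom_lock, then set the
 * skip flag while holding it. */
static void exit_mmap_model(struct mm_model *mm)
{
	assert(mm != NULL);

	mm->reaped_by_exit++; /* models __oom_reap_task_mm(mm) */

	pthread_mutex_lock(&oom_lock);
	mm->oom_skip = true;
	pthread_mutex_unlock(&oom_lock);
}

/* Reaper: tests the flag under oom_lock before doing any work.
 * Returns true if it actually reaped. */
static bool oom_reap_model(struct mm_model *mm)
{
	bool reaped = false;

	assert(mm != NULL);

	pthread_mutex_lock(&oom_lock);
	if (!mm->oom_skip) {
		mm->reaped_by_reaper++;
		reaped = true;
	}
	pthread_mutex_unlock(&oom_lock);
	return reaped;
}
```

The point of the model is only that, once the exit path has set the flag under the lock, a later reaper pass sees it and gives up without touching the mm.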

> exit_mmap() must not block while holding oom_lock in order to guarantee that
> oom_reap_task_mm() can give up.
> 
> Some suggestion on top of your patch:
> 
>  mm/mmap.c | 13 +
>  mm/oom_kill.c | 51 ++-
>  2 files changed, 31 insertions(+), 33 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 981eed4..7b31357 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm)
>   /*
>* Manually reap the mm to free as much memory as possible.
>* Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
> -  * mm from further consideration.  Taking mm->mmap_sem for write
> -  * after setting MMF_OOM_SKIP will guarantee that the oom reaper
> -  * will not run on this mm again after mmap_sem is dropped.
> +  * mm from further consideration. Setting MMF_OOM_SKIP under
> +  * oom_lock held will guarantee that the OOM reaper will not
> +  * run on this mm again.
>*
>* This needs to be done before calling munlock_vma_pages_all(),
>* which clears VM_LOCKED, otherwise the oom reaper cannot
>* reliably test it.
>*/
> - mutex_lock(&oom_lock);
>   __oom_reap_task_mm(mm);
> - mutex_unlock(&oom_lock);
> -
> + mutex_lock(&oom_lock);
>   set_bit(MMF_OOM_SKIP, &mm->flags);
> - down_write(&mm->mmap_sem);
> - up_write(&mm->mmap_sem);
> + mutex_unlock(&oom_lock);
>   }
>  
>   if (mm->locked_vm) {
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8ba6cb8..9a29df8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  {
>   bool ret = true;
>  
> + mutex_lock(&oom_lock);
> +
>   /*
> -  * We have to make sure to not race with the victim exit path
> -  * and cause premature new oom victim selection:
> -  * oom_reap_task_mm exit_mm
> -  *   mmget_not_zero
> -  *mmput
> -  *  atomic_dec_and_test
> -  *exit_oom_victim
> -  *  [...]
> -  *  out_of_memory
> -  *select_bad_process
> -  *  # no TIF_MEMDIE task selects new victim
> -  *  unmap_page_range # frees some memory
> +  * MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't
> +  * work on the mm anymore. The check for MMF_OOM_SKIP must run
> +  * under oom_lock held.
>*/
> - mutex_lock(&oom_lock);
> + if (test_bit(MMF_OOM_SKIP, &mm->flags))
> + goto unlock_oom;
>  
>   if (!down_read_trylock(&mm->mmap_sem)) {
>   ret = false;
> @@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   goto unlock_oom;
>   }
>  
> - /*
> -  * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
> -  * work on the mm anymore. The check for MMF_OOM_SKIP must run

Re: [PATCH v7 2/5] of: change overlay apply input data from unflattened to FDT

2018-04-23 Thread Jan Kiszka

On 2018-04-24 00:38, Frank Rowand wrote:
> Hi Jan,
> 
> + Alan Tull for fpga perspective
> 
> On 04/22/18 03:30, Jan Kiszka wrote:
>> On 2018-04-11 07:42, Jan Kiszka wrote:
>>> On 2018-04-05 23:12, Rob Herring wrote:
>>>> On Thu, Apr 5, 2018 at 2:28 PM, Frank Rowand wrote:
>>>>> On 04/05/18 12:13, Jan Kiszka wrote:
>>>>>> On 2018-04-05 20:59, Frank Rowand wrote:
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> On 04/04/18 15:35, Jan Kiszka wrote:
>>>>>>>> Hi Frank,
>>>>>>>>
>>>>>>>> On 2018-03-04 01:17, frowand.l...@gmail.com wrote:
>>>>>>>>> From: Frank Rowand
>>>>>>>>>
>>>>>>>>> Move duplicating and unflattening of an overlay flattened devicetree
>>>>>>>>> (FDT) into the overlay application code.  To accomplish this,
>>>>>>>>> of_overlay_apply() is replaced by of_overlay_fdt_apply().
>>>>>>>>>
>>>>>>>>> The copy of the FDT (aka "duplicate FDT") now belongs to devicetree
>>>>>>>>> code, which is thus responsible for freeing the duplicate FDT.  The
>>>>>>>>> caller of of_overlay_fdt_apply() remains responsible for freeing the
>>>>>>>>> original FDT.
>>>>>>>>>
>>>>>>>>> The unflattened devicetree now belongs to devicetree code, which is
>>>>>>>>> thus responsible for freeing the unflattened devicetree.
>>>>>>>>>
>>>>>>>>> These ownership changes prevent early freeing of the duplicated FDT
>>>>>>>>> or the unflattened devicetree, which could result in use after free
>>>>>>>>> errors.
>>>>>>>>>
>>>>>>>>> of_overlay_fdt_apply() is a private function for the anticipated
>>>>>>>>> overlay loader.
>>>>>>>>
>>>>>>>> We are using of_fdt_unflatten_tree + of_overlay_apply in the
>>>>>>>> (out-of-tree) Jailhouse loader driver in order to register a virtual
>>>>>>>> device during hypervisor activation with Linux. The DT overlay is
>>>>>>>> created from a template but modified prior to application to account
>>>>>>>> for runtime-specific parameters. See [1] for the current
>>>>>>>> implementation.
>>>>>>>>
>>>>>>>> I'm now wondering how to model that scenario best with the new API.
>>>>>>>> Given that the loader lost ownership of the unflattened tree but the
>>>>>>>> modification API exists only for that DT state, I'm not yet seeing a
>>>>>>>> clear solution. Should we apply the template in disabled form (status =
>>>>>>>> "disabled"), modify it, and then activate it while it is already
>>>>>>>> applied?
>>>>>>>
>>>>>>> Thank you for the pointer to the driver - that makes it much easier to
>>>>>>> understand the use case and consider solutions.
>>>>>>>
>>>>>>> If you can make the changes directly on the FDT instead of on the
>>>>>>> expanded devicetree, then you could move to the new API.
>>>>>>
>>>>>> Are there some examples/references on how to edit FDTs in-place in the
>>>>>> kernel? I'd like to avoid writing the n-th FDT parser/generator.
>>>>>
>>>>> I don't know of any existing in-kernel edits of the FDT (but they might
>>>>> exist).  The functions to access an FDT are in libfdt, which is in
>>>>> scripts/dtc/libfdt/.
>>>>
>>>> Let's please not go down that route of doing FDT modifications. There
>>>> is little reason to other than for early boot changes. And it is much
>>>> easier to work on unflattened trees.
>>>
>>> I just briefly looked into libfdt, and it would have meant building it
>>> into the module as there are no library functions exported by the kernel
>>> either. Another reason to drop that.
>>>
>>> What's apparently working now is the pattern I initially suggested:
>>> Register template with status = "disabled" as overlay, then prepare and
>>> apply changeset that contains all needed modifications and sets the
>>> status to "ok". I might be leaking additional resources, but to find
>>> that out, I will now finally have to resolve clean unbinding of the
>>> generic PCI host controller [1] first.
>>
>> static void free_overlay_changeset(struct overlay_changeset *ovcs)
>> {
>>  [...]
>>  /*
>>   * TODO
>>   *
>>   * would like to: kfree(ovcs->overlay_tree);
>>   * but can not since drivers may have pointers into this data
>>   *
>>   * would like to: kfree(ovcs->fdt);
>>   * but can not since drivers may have pointers into this data
>>   */
>>
>>  kfree(ovcs);
>> }
>>
>> What's this? I have kmemleak now jumping at me over this. Who is supposed
>> to plug these leaks? The caller of of_overlay_fdt_apply has no pointers
>> to those objects. I would say that's a regression of the new API.
> 
> The problem already existed but it was hidden.  We have never been able to
> kfree() these objects because we do not know if there are any pointers into
> these objects.  The new API makes the problem visible to kmemleak.

My old code didn't have the problem because there was no one stealing
pointers to my overlay, and I was able to safely release all the
resources that I or the core on my behalf allocated. In fact, I recently
even dropped the duplication of the fdt prior to unflattening it because I
got its lifecycle under control (and both kmemleak as well as kasan
confirmed

Re: [PATCH] mmc: mediatek: add 64G DRAM DMA support

2018-04-23 Thread kbuild test robot

Hi Chaotian,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc2 next-20180423]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Chaotian-Jing/mmc-mediatek-add-64G-DRAM-DMA-support/20180423-231743
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All warnings (new ones prefixed by >>):

   drivers/mmc/host/mtk-sd.c: In function 'msdc_dma_setup':
>> drivers/mmc/host/mtk-sd.c:573:35: warning: right shift count >= width of type [-Wshift-count-overflow]
     bd[j].bd_info |= ((dma_address >> 32) & 0xf) << 28;
                                    ^~
   drivers/mmc/host/mtk-sd.c: In function 'msdc_init_gpd_bd':
   drivers/mmc/host/mtk-sd.c:1440:31: warning: right shift count >= width of type [-Wshift-count-overflow]
     gpd->gpd_info |= ((dma_addr >> 32) & 0xf) << 24;
                                 ^~
   drivers/mmc/host/mtk-sd.c:1445:31: warning: right shift count >= width of type [-Wshift-count-overflow]
     gpd->gpd_info |= ((dma_addr >> 32) & 0xf) << 28;
                                 ^~
   drivers/mmc/host/mtk-sd.c:1452:32: warning: right shift count >= width of type [-Wshift-count-overflow]
     bd[i].bd_info |= ((dma_addr >> 32) & 0xf) << 24;
                                 ^~

vim +573 drivers/mmc/host/mtk-sd.c

   539  
   540  static inline void msdc_dma_setup(struct msdc_host *host, struct msdc_dma *dma,
   541                  struct mmc_data *data)
   542  {
   543          unsigned int j, dma_len;
   544          dma_addr_t dma_address;
   545          u32 dma_ctrl;
   546          struct scatterlist *sg;
   547          struct mt_gpdma_desc *gpd;
   548          struct mt_bdma_desc *bd;
   549  
   550          sg = data->sg;
   551  
   552          gpd = dma->gpd;
   553          bd = dma->bd;
   554  
   555          /* modify gpd */
   556          gpd->gpd_info |= GPDMA_DESC_HWO;
   557          gpd->gpd_info |= GPDMA_DESC_BDP;
   558          /* need to clear first. use these bits to calc checksum */
   559          gpd->gpd_info &= ~GPDMA_DESC_CHECKSUM;
   560          gpd->gpd_info |= msdc_dma_calcs((u8 *) gpd, 16) << 8;
   561  
   562          /* modify bd */
   563          for_each_sg(data->sg, sg, data->sg_count, j) {
   564                  dma_address = sg_dma_address(sg);
   565                  dma_len = sg_dma_len(sg);
   566  
   567                  /* init bd */
   568                  bd[j].bd_info &= ~BDMA_DESC_BLKPAD;
   569                  bd[j].bd_info &= ~BDMA_DESC_DWPAD;
   570                  bd[j].ptr = (u32)dma_address;
   571                  if (host->dev_comp->support_64g) {
   572                          bd[j].bd_info &= ~BDMA_DESC_PTR_H4;
 > 573                          bd[j].bd_info |= ((dma_address >> 32) & 0xf) << 28;
   574                  }
   575                  bd[j].bd_data_len &= ~BDMA_DESC_BUFLEN;
   576                  bd[j].bd_data_len |= (dma_len & BDMA_DESC_BUFLEN);
   577  
   578                  if (j == data->sg_count - 1) /* the last bd */
   579                          bd[j].bd_info |= BDMA_DESC_EOL;
   580                  else
   581                          bd[j].bd_info &= ~BDMA_DESC_EOL;
   582  
   583                  /* checksume need to clear first */
   584                  bd[j].bd_info &= ~BDMA_DESC_CHECKSUM;
   585                  bd[j].bd_info |= msdc_dma_calcs((u8 *)(&bd[j]), 16) << 8;
   586          }
   587  
   588          sdr_set_field(host->base + MSDC_DMA_CFG, MSDC_DMA_CFG_DECSEN, 1);
   589          dma_ctrl = readl_relaxed(host->base + MSDC_DMA_CTRL);
   590          dma_ctrl &= ~(MSDC_DMA_CTRL_BRUSTSZ | MSDC_DMA_CTRL_MODE);
   591          dma_ctrl |= (MSDC_BURST_64B << 12 | 1 << 8);
   592          writel_relaxed(dma_ctrl, host->base + MSDC_DMA_CTRL);
   593          writel((u32)dma->gpd_addr, host->base + MSDC_DMA_SA);
   594  }
   595  
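
The warning suggests that dma_addr_t is only 32 bits wide in this configuration, so shifting it right by 32 is not well defined. One common way to avoid that is the kernel's upper_32_bits() helper, which shifts by 16 twice. The sketch below reproduces that macro in user space; set_ptr_h4() is a hypothetical helper written for illustration and is not part of the driver, and the 36-bit limit is an assumption derived from "64G" in the patch subject.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as the kernel's upper_32_bits() helper: two 16-bit shifts are
 * well defined for both 32-bit and 64-bit operands, unlike ">> 32". */
#define upper_32_bits(n) ((uint32_t)(((n) >> 16) >> 16))

/* Hypothetical helper: put address bits 35:32 into bits 31:28 of a
 * descriptor info word, as line 573 intends. Other bits are preserved. */
static uint32_t set_ptr_h4(uint32_t info, uint64_t dma_address)
{
	/* 64G implies a 36-bit address; anything wider would be truncated. */
	assert((dma_address >> 36) == 0);

	info &= ~(UINT32_C(0xf) << 28);
	info |= (upper_32_bits(dma_address) & 0xf) << 28;
	return info;
}
```

With a 32-bit operand the macro simply evaluates to 0, so the same expression builds cleanly whether or not the platform has 64-bit DMA addresses.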

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


Re: [PATCH 9/9] ath10k: re-enable the firmware fallback mechanism for testmode

2018-04-23 Thread Kalle Valo

Andres Rodriguez  writes:

> The ath10k testmode uses request_firmware_direct() in order to avoid
> producing firmware load warnings. Disabling the fallback mechanism was a
> side effect of disabling warnings.
>
> We now have a new API that allows us to avoid warnings while keeping the
> fallback mechanism enabled. So use that instead.
>
> Signed-off-by: Andres Rodriguez 

Thanks!

Acked-by: Kalle Valo 

-- 
Kalle Valo

Re: [PATCH 3/5] powerpc: use time64_t in read_persistent_clock

2018-04-23 Thread kbuild test robot

Hi Arnd,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on v4.17-rc2 next-20180423]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Arnd-Bergmann/powerpc-always-enable-RTC_LIB/20180423-223504
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc64-defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc64 

All errors (new ones prefixed by >>):

   arch/powerpc/platforms/maple/time.c: In function 'maple_get_boot_time':
>> arch/powerpc/platforms/maple/time.c:173:9: error: implicit declaration of function 'rtc_tm_to_time66'; did you mean 'rtc_tm_to_time64'? [-Werror=implicit-function-declaration]
     return rtc_tm_to_time66(&tm);
            ^~~~
            rtc_tm_to_time64
   cc1: all warnings being treated as errors

vim +173 arch/powerpc/platforms/maple/time.c

   139  
   140  time64_t __init maple_get_boot_time(void)
   141  {
   142          struct rtc_time tm;
   143          struct device_node *rtcs;
   144  
   145          rtcs = of_find_compatible_node(NULL, "rtc", "pnpPNP,b00");
   146          if (rtcs) {
   147                  struct resource r;
   148                  if (of_address_to_resource(rtcs, 0, &r)) {
   149                          printk(KERN_EMERG "Maple: Unable to translate RTC"
   150                                 " address\n");
   151                          goto bail;
   152                  }
   153                  if (!(r.flags & IORESOURCE_IO)) {
   154                          printk(KERN_EMERG "Maple: RTC address isn't PIO!\n");
   155                          goto bail;
   156                  }
   157                  maple_rtc_addr = r.start;
   158                  printk(KERN_INFO "Maple: Found RTC at IO 0x%x\n",
   159                         maple_rtc_addr);
   160          }
   161   bail:
   162          if (maple_rtc_addr == 0) {
   163                  maple_rtc_addr = RTC_PORT(0); /* legacy address */
   164                  printk(KERN_INFO "Maple: No device node for RTC, assuming "
   165                         "legacy address (0x%x)\n", maple_rtc_addr);
   166          }
   167  
   168          rtc_iores.start = maple_rtc_addr;
   169          rtc_iores.end = maple_rtc_addr + 7;
   170          request_resource(&ioport_resource, &rtc_iores);
   171  
   172          maple_get_rtc_time(&tm);
 > 173          return rtc_tm_to_time66(&tm);
   174  }
   175  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: application/gzip
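
A note on the function the compiler suggests: rtc_tm_to_time64() converts a broken-down RTC time into a 64-bit count of seconds since 1970. The sketch below is a userspace illustration of that kind of conversion, not the kernel's code. `struct sketch_tm` and `sketch_tm_to_time64()` are hypothetical names, the fields are assumed to hold the full year and a month in 1..12 (struct rtc_time stores offsets instead), and only UTC dates from 1970 onward are handled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of a broken-down time. */
struct sketch_tm {
	int year;	/* full year, e.g. 2018 */
	int mon;	/* 1..12 */
	int mday;	/* 1..31 */
	int hour, min, sec;
};

/* Convert a UTC Gregorian date to seconds since 1970-01-01 00:00:00. */
static int64_t sketch_tm_to_time64(const struct sketch_tm *tm)
{
	int64_t year = tm->year;
	int64_t mon = tm->mon;
	int64_t days;

	assert(tm->year >= 1970 && tm->mon >= 1 && tm->mon <= 12);

	/* Count Jan/Feb as months 11/12 of the previous year so the leap day comes last. */
	mon -= 2;
	if (mon <= 0) {
		mon += 12;
		year -= 1;
	}

	/* Leap-day corrections, days in preceding months, day of month, whole years. */
	days = year / 4 - year / 100 + year / 400 + 367 * mon / 12 + tm->mday
	       + year * 365 - 719499;

	return ((days * 24 + tm->hour) * 60 + tm->min) * 60 + tm->sec;
}
```

With this, 1970-01-01 00:00:00 maps to 0 and 2038-01-19 03:14:08 maps to 2147483648, one more than a signed 32-bit counter can hold, which is presumably why the series moves to time64_t.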

Re: [PATCH 9/9] ath10k: re-enable the firmware fallback mechanism for testmode

2018-04-23 Thread Kalle Valo

Andres Rodriguez  writes:

> The ath10k testmode uses request_firmware_direct() in order to avoid
> producing firmware load warnings. Disabling the fallback mechanism was a
> side effect of disabling warnings.
>
> We now have a new API that allows us to avoid warnings while keeping the
> fallback mechanism enabled. So use that instead.
>
> Signed-off-by: Andres Rodriguez 

Thanks!

Acked-by: Kalle Valo 

-- 
Kalle Valo
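
The difference between the two request styles in the quoted commit message can be modelled in a few lines. This is a hypothetical userspace model, not the firmware loader: the option names, `model_request()` and its two wrappers are invented for illustration, and the only behaviour assumed is what the commit message states (the direct variant suppresses warnings but also skips the fallback, the new API suppresses warnings and keeps the fallback).

```c
#include <stdbool.h>

/* Illustrative option bits; these are not the kernel's definitions. */
#define MODEL_OPT_NO_WARN	(1 << 0)	/* stay silent when loading fails */
#define MODEL_OPT_NO_FALLBACK	(1 << 1)	/* do not try the fallback mechanism */

struct model_result {
	bool loaded;
	bool warned;
};

/* One firmware request: direct lookup first, then the optional fallback. */
static struct model_result model_request(int opts, bool on_disk, bool fallback_has_it)
{
	struct model_result res = { false, false };

	if (on_disk)
		res.loaded = true;
	else if (!(opts & MODEL_OPT_NO_FALLBACK) && fallback_has_it)
		res.loaded = true;

	if (!res.loaded && !(opts & MODEL_OPT_NO_WARN))
		res.warned = true;

	return res;
}

/* Behaves like the direct variant described above: quiet, no fallback. */
static struct model_result model_request_direct(bool on_disk, bool fallback_has_it)
{
	return model_request(MODEL_OPT_NO_WARN | MODEL_OPT_NO_FALLBACK,
			     on_disk, fallback_has_it);
}

/* Behaves like the new API described above: quiet, fallback kept. */
static struct model_result model_request_nowarn(bool on_disk, bool fallback_has_it)
{
	return model_request(MODEL_OPT_NO_WARN, on_disk, fallback_has_it);
}
```

In this model a file that only the fallback can supply is missed by `model_request_direct()` and found by `model_request_nowarn()`, and neither warns.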

Re: [PATCH 1/3] PM / devfreq: Actually support providing freq_table

2018-04-23 Thread Bjorn Andersson

On Mon 23 Apr 19:48 PDT 2018, Chanwoo Choi wrote:

> Hi,
> 
> On 2018-04-24 09:20, Bjorn Andersson wrote:
> > The code in devfreq_add_device() handles the case where a freq_table is
> > passed by the client, but then requests min and max frequencies from
> > the, in this case absent, opp tables.
> > 
> > Read the min and max frequencies from the frequency table, which has
> > been built from the opp table if one exists, instead of querying the
> > opp table.
> > 
> > Signed-off-by: Bjorn Andersson 
> > ---
> > 
> > An alternative approach is to clarify in the devfreq code that it's not
> > possible to pass a freq_table and then in patch 3 create an opp table for
> > the device at runtime; although the error handling of this becomes
> > non-trivial.
> > 
> > Transitioning the UFSHCD to use opp tables directly is hindered by the fact
> > that the Qualcomm UFS hardware has two different clocks that need to be
> > running at different rates, so we would need a way to describe the two
> > rates in the opp table. (And would force us to change the DT binding)
> > 
> >  drivers/devfreq/devfreq.c | 22 ++++------------------
> >  1 file changed, 4 insertions(+), 18 deletions(-)
> > 
> > diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
> > index fe2af6aa88fc..086ced50a13d 100644
> > --- a/drivers/devfreq/devfreq.c
> > +++ b/drivers/devfreq/devfreq.c
> > @@ -74,30 +74,16 @@ static struct devfreq *find_device_devfreq(struct device *dev)
> >  
> >  static unsigned long find_available_min_freq(struct devfreq *devfreq)
> >  {
> > -   struct dev_pm_opp *opp;
> > -   unsigned long min_freq = 0;
> > -
> > -   opp = dev_pm_opp_find_freq_ceil(devfreq->dev.parent, &min_freq);
> > -   if (IS_ERR(opp))
> > -   min_freq = 0;
> > -   else
> > -   dev_pm_opp_put(opp);
> > +   struct devfreq_dev_profile *profile = devfreq->profile;
> >  
> > -   return min_freq;
> > +   return profile->freq_table[0];
> 
> It is wrong. The thermal framework supports the devfreq-cooling device,
> which uses dev_pm_opp_enable/disable().
> 

Okay, that makes sense. So rather than registering a custom freq_table I
should register the min and max frequency using dev_pm_opp_add().

> In order to find the correct available min frequency,
> devfreq has to use the OPP function instead of using the first entry
> of the freq_table array.
> 

Based on this there seems to be room for cleaning out the freq_table
from devfreq, to reduce the confusion. I will review this further.

Thanks,
Bjorn
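
Chanwoo's objection can be illustrated with a small model. It is only a sketch under assumptions: `struct model_opp` and both helpers are hypothetical, the table is assumed sorted by ascending frequency, and the `enabled` flag stands in for whatever dev_pm_opp_enable()/dev_pm_opp_disable() toggle in the real OPP core.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an OPP entry: a frequency plus an enabled flag. */
struct model_opp {
	unsigned long freq_hz;
	bool enabled;	/* cleared when e.g. a cooling device disables the OPP */
};

/* Lowest enabled frequency, or 0 if nothing is enabled. */
static unsigned long model_available_min_freq(const struct model_opp *tbl, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].enabled)
			return tbl[i].freq_hz;
	return 0;
}

/* Highest enabled frequency, or 0 if nothing is enabled. */
static unsigned long model_available_max_freq(const struct model_opp *tbl, size_t n)
{
	for (size_t i = n; i-- > 0; )
		if (tbl[i].enabled)
			return tbl[i].freq_hz;
	return 0;
}
```

If the lowest entry has been disabled, `tbl[0].freq_hz` still reports it while `model_available_min_freq()` skips it, which is the discrepancy pointed out above.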

Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

2018-04-23 Thread Nicholas Piggin

On Mon, 23 Apr 2018 15:47:40 +0530
Pavan Kondeti  wrote:

> Hi Nick,
> 
> On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> > This is a quick hack for comments, but I've always wondered --
> > if we have short-term polling idle states in cpuidle for performance
> > -- why not skip the context switch and entry into all the idle states,
> > and just wait for a bit to see if something wakes up again.
> > 
> > It's not uncommon to see various going-to-idle work in kernel profiles.
> > This might be a way to reduce that (and just the cost of switching
> > registers and kernel stack to idle thread). This can be an important
> > path for single thread request-response throughput.
> > 
> > tbench bandwidth seems to be improved (the numbers aren't too stable
> > but they pretty consistently show some gain). 10-20% would be a pretty
> > nice gain for such workloads
> > 
> > clients     1     2     4     8    16   128
> > vanilla   232   467   823  1819  3218  9065
> > patched   310   503   962  2465  3743  9820
> >   
> 
> 
> 
> > +idle_spin_end:
> > /* Promote REQ to ACT */
> > rq->clock_update_flags <<= 1;
> > update_rq_clock(rq);
> > @@ -3437,6 +3439,32 @@ static void __sched notrace __schedule(bool preempt)
> > if (unlikely(signal_pending_state(prev->state, prev))) {
> > prev->state = TASK_RUNNING;
> > } else {
> > +   /*
> > +* Busy wait before switching to idle thread. This
> > +* is marked unlikely because we're idle so jumping
> > +* out of line doesn't matter too much.
> > +*/
> > +   if (unlikely(do_idle_spin && rq->nr_running == 1)) {
> > +   u64 start;
> > +
> > +   do_idle_spin = false;
> > +
> > +   rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
> > +   rq_unlock_irq(rq, &rf);
> > +
> > +   spin_begin();
> > +   start = local_clock();
> > +   while (!need_resched() && prev->state &&
> > +   !signal_pending_state(prev->state, prev)) {
> > +   spin_cpu_relax();
> > +   if (local_clock() - start > 100)
> > +   break;
> > +   }  
> 
> Couple of comments/questions.
> 
> When a RT task is doing this busy loop, 
> 
> (1) need_resched() may not be set even if a fair/normal task is enqueued on
> this CPU.

This is true, it should probably spin on nr_running == 1, good catch.

> 
> (2) Any lower prio RT task waking up on this CPU may migrate to another CPU
> thinking this CPU is busy with higher prio RT task.

Also true. If we completely replaced the polling idle states with a
spin here, this would not be acceptable and it would have to be quite
a lot more work to interact with load calculations etc.

On the other hand if it is a much smaller spin on the order of
context switch latency that could be considered part of the cost
of context switching for the purposes of load balancing, *maybe*
not much else is needed.

Thanks,
Nick
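
The bounded spin in the quoted hunk boils down to polling a wake-up condition until a small time budget runs out. A minimal userspace sketch, assuming C11 atomics and timespec_get(): `spin_before_idle()` and `now_ns()` are hypothetical names, the wall clock stands in for local_clock(), and the real patch additionally checks need_resched() and pending signals.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Current time in nanoseconds; a stand-in for the kernel's local_clock(). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Poll *wakeup for at most budget_ns nanoseconds.
 * Returns true if work arrived during the spin, false if the budget expired.
 */
static bool spin_before_idle(atomic_bool *wakeup, uint64_t budget_ns)
{
	uint64_t start = now_ns();

	while (!atomic_load_explicit(wakeup, memory_order_acquire)) {
		if (now_ns() - start > budget_ns)
			return false;	/* give up and take the normal idle path */
	}
	return true;
}
```

A caller would fall through to the regular idle path when this returns false. Per the discussion above, a real implementation would likely watch the runqueue length rather than a single flag.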

Re: [PATCH] mfd: qcom-spmi-pmic: Add support for pm8005,pm8998,pmi8998

2018-04-23 Thread Doug Anderson

Hi,

On Thu, Apr 19, 2018 at 4:00 PM, Stephen Boyd  wrote:
> diff --git a/drivers/mfd/qcom-spmi-pmic.c b/drivers/mfd/qcom-spmi-pmic.c
> index 2022bdfa7ab4..0b26387c22e7 100644
> --- a/drivers/mfd/qcom-spmi-pmic.c
> +++ b/drivers/mfd/qcom-spmi-pmic.c
> @@ -39,6 +39,9 @@
>  #define PM8916_SUBTYPE 0x0b
>  #define PM8004_SUBTYPE 0x0c
>  #define PM8909_SUBTYPE 0x0d
> +#define PM8998_SUBTYPE 0x14
> +#define PMI8998_SUBTYPE 0x15
> +#define PM8005_SUBTYPE 0x18

I was being overly paranoid and double-checking these numbers.  I
confirmed PMI8998 and PM8005 from the docs (yay!).  The PM8998 docs
didn't have this, but I confirmed that I was talking to PM8998 by
confirming it was on the right USID and then printing out the
value at probe time.  All look good.

>  static const struct of_device_id pmic_spmi_id_table[] = {
> { .compatible = "qcom,spmi-pmic", .data = (void *)COMMON_SUBTYPE },
> @@ -54,7 +57,10 @@ static const struct of_device_id pmic_spmi_id_table[] = {
> { .compatible = "qcom,pmi8994",   .data = (void *)PMI8994_SUBTYPE },
> { .compatible = "qcom,pm8916",    .data = (void *)PM8916_SUBTYPE },
> { .compatible = "qcom,pm8004",    .data = (void *)PM8004_SUBTYPE },
> +   { .compatible = "qcom,pmi8998",   .data = (void *)PMI8998_SUBTYPE },
> +   { .compatible = "qcom,pm8005",    .data = (void *)PM8005_SUBTYPE },
> { .compatible = "qcom,pm8909",    .data = (void *)PM8909_SUBTYPE },
> +   { .compatible = "qcom,pm8998",    .data = (void *)PM8998_SUBTYPE },

nit: It appears that the above table was previously sorted by SUBTYPE
ID.  Could you perhaps move your 3 new PMICs to the bottom to maintain
this?  Other than that, you can add my Reviewed-by if you would like
(not that I have _any_ real expertise on SPMI, so might not be worth
it).

-Doug
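
The ordering nit can be checked mechanically. A small sketch, assuming only the subtype values quoted in the patch: `struct pmic_entry`, `sketch_table` and `sorted_by_subtype()` are hypothetical, and the table is deliberately reduced to the entries whose values appear above.

```c
#include <stdbool.h>
#include <stddef.h>

struct pmic_entry {
	const char *compatible;
	unsigned int subtype;
};

/* The three new PMICs moved to the bottom, as suggested in the review. */
static const struct pmic_entry sketch_table[] = {
	{ "qcom,pm8916",  0x0b },
	{ "qcom,pm8004",  0x0c },
	{ "qcom,pm8909",  0x0d },
	{ "qcom,pm8998",  0x14 },
	{ "qcom,pmi8998", 0x15 },
	{ "qcom,pm8005",  0x18 },
};

/* True if the subtype IDs never decrease from one entry to the next. */
static bool sorted_by_subtype(const struct pmic_entry *tbl, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (tbl[i - 1].subtype > tbl[i].subtype)
			return false;
	return true;
}
```

Run against the order in the posted hunk (pmi8998 and pm8005 ahead of pm8909), the same check fails, which matches the review comment.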

Re: [PATCH] cpufreq: powernv: Fix the hardlockup by synchronus smp_call in timer interrupt

2018-04-23 Thread Shilpasri G Bhat

Hi,

On 04/24/2018 10:40 AM, Stewart Smith wrote:
> Shilpasri G Bhat  writes:
>> gpstate_timer_handler() uses synchronous smp_call to set the pstate
>> on the requested core. This causes the below hard lockup:
>>
>> [c03fe566b320] [c01d5340] smp_call_function_single+0x110/0x180 (unreliable)
>> [c03fe566b390] [c01d55e0] smp_call_function_any+0x180/0x250
>> [c03fe566b3f0] [c0acd3e8] gpstate_timer_handler+0x1e8/0x580
>> [c03fe566b4a0] [c01b46b0] call_timer_fn+0x50/0x1c0
>> [c03fe566b520] [c01b4958] expire_timers+0x138/0x1f0
>> [c03fe566b590] [c01b4bf8] run_timer_softirq+0x1e8/0x270
>> [c03fe566b630] [c0d0d6c8] __do_softirq+0x158/0x3e4
>> [c03fe566b710] [c0114be8] irq_exit+0xe8/0x120
>> [c03fe566b730] [c0024d0c] timer_interrupt+0x9c/0xe0
>> [c03fe566b760] [c0009014] decrementer_common+0x114/0x120
>> --- interrupt: 901 at doorbell_global_ipi+0x34/0x50
>> LR = arch_send_call_function_ipi_mask+0x120/0x130
>> [c03fe566ba50] [c004876c] arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
>> [c03fe566ba90] [c01d59f0] smp_call_function_many+0x340/0x450
>> [c03fe566bb00] [c0075f18] pmdp_invalidate+0x98/0xe0
>> [c03fe566bb30] [c03a1120] change_huge_pmd+0xe0/0x270
>> [c03fe566bba0] [c0349278] change_protection_range+0xb88/0xe40
>> [c03fe566bcf0] [c03496c0] mprotect_fixup+0x140/0x340
>> [c03fe566bdb0] [c0349a74] SyS_mprotect+0x1b4/0x350
>> [c03fe566be30] [c000b184] system_call+0x58/0x6c
>>
>> Fix this by using the asynchronous smp_call in the timer interrupt handler.
>> We don't have to wait in this handler until the pstates are changed on
>> the core. This change will not have any impact on the global pstate
>> ramp-down algorithm.
>>
>> Reported-by: Nicholas Piggin 
>> Reported-by: Pridhiviraj Paidipeddi 
>> Signed-off-by: Shilpasri G Bhat 
>> ---
>>  drivers/cpufreq/powernv-cpufreq.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
>> index 0591874..7e0c752 100644
>> --- a/drivers/cpufreq/powernv-cpufreq.c
>> +++ b/drivers/cpufreq/powernv-cpufreq.c
>> @@ -721,7 +721,7 @@ void gpstate_timer_handler(struct timer_list *t)
>>  spin_unlock(&gpstates->gpstate_lock);
>>
>>  /* Timer may get migrated to a different cpu on cpu hot unplug */
>> -smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
>> +smp_call_function_any(policy->cpus, set_pstate, &freq_data, 0);
>>  }
> 
> Should this have:
> Fixes: eaa2c3aeef83f
> and CC stable v4.7+ ?
> 

Yeah this is required.

Fixes: eaa2c3aeef83 (cpufreq: powernv: Ramp-down global pstate slower than
local-pstate)

Thanks and Regards,
Shilpa

Re: [Xen-devel] [RFC 0/2] To introduce xenwatch multithreading (xen mtwatch)

2018-04-23 Thread Juergen Gross

On 24/04/18 01:55, Dongli Zhang wrote:
> Hi Wei,
> 
> On 04/23/2018 10:09 PM, Wei Liu wrote:
>> On Sat, Apr 07, 2018 at 07:25:53PM +0800, Dongli Zhang wrote:
>>> About per-domU xenwatch thread create/destroy, a new type of xenstore node
>>> is introduced: '/local/domain/0/mtwatch/'.
>>>
>>> Suppose the new domid is 7. During the domU (domid=7) creation, the xen
>>> toolstack writes '/local/domain/0/mtwatch/7' to xenstore before the
>>> insertion of '/local/domain/7'. When the domid=7 is destroyed, the last
>>> xenstore operation by xen toolstack is to remove
>>> '/local/domain/0/mtwatch/7'.
>>>
>>> The dom0 kernel subscribes a watch at node '/local/domain/0/mtwatch'.
>>> Kernel thread [xen-mtwatch-7] is created when '/local/domain/0/mtwatch/7'
>>> is inserted, while this kernel thread is destroyed when the corresponding
>>> xenstore node is removed.
>>
>> Instead of inventing yet another node, can you not watch /local/domain
>> directly?
> 
> Would you like to watch at /local/domain directly? Or is your question "is
> there any other way to not watch at /local/domain, while no extra xenstore
> node will be introduced"?
> 
> Actually, the first prototype of this idea was to watch at /local/domain
> directly to get aware of the domU create/destroy, so that xen toolstack will
> not get involved. Joao Martins (CCed) had a concern on the performance as
> watching at /local/domain would lead to large amount of xenwatch events.

That's what the special watches "@introduceDomain" and "@releaseDomain"
are meant for.


Juergen
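
The node layout in the quoted proposal implies that a watch handler has to recover the domid from the path of the node that changed. A sketch of that step in plain C, assuming the path format shown in the quoted text: `mtwatch_path_to_domid()` and `MTWATCH_PREFIX` are hypothetical, not part of the posted series or of any Xen library.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define MTWATCH_PREFIX "/local/domain/0/mtwatch/"

/*
 * Extract the domid from a path such as "/local/domain/0/mtwatch/7".
 * Returns the domid, or -1 if the path is not a direct child of the node.
 */
static int mtwatch_path_to_domid(const char *path)
{
	const size_t plen = strlen(MTWATCH_PREFIX);
	unsigned int domid = 0;
	int consumed = 0;

	if (strncmp(path, MTWATCH_PREFIX, plen) != 0)
		return -1;
	/* Require a digit first so sscanf() cannot accept a sign or blanks. */
	if (path[plen] < '0' || path[plen] > '9')
		return -1;
	if (sscanf(path + plen, "%u%n", &domid, &consumed) != 1)
		return -1;
	/* Reject trailing components such as ".../7/x" and oversized values. */
	if (path[plen + consumed] != '\0' || domid > (unsigned int)INT_MAX)
		return -1;
	return (int)domid;
}
```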

Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap

2018-04-23 Thread Tetsuo Handa

> On Sun, 22 Apr 2018, Tetsuo Handa wrote:
> 
> > > I'm wondering why you do not see oom killing of many processes if the 
> > > victim is a very large process that takes a long time to free memory in 
> > > exit_mmap() as I do because the oom reaper gives up trying to acquire 
> > > mm->mmap_sem and just sets MMF_OOM_SKIP itself.
> > > 
> > 
> > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> > exit_mmap() holds mmap_sem for write. Then, at least memory which could
> > have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> > be guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> > 
> 
> I think that's an exceptionally good idea and will mitigate the concerns 
> of others.
> 
> It can be done without holding mm->mmap_sem in exit_mmap() and uses the 
> same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we 
> don't get dozens of unnecessary oom kills.
> 
> What do you think about this?  It passes preliminary testing on powerpc 
> and I've enqueued it for much more intensive testing.  (I'm wishing there 
> was a better way to acknowledge your contribution to fixing this issue, 
> especially since you brought up the exact problem this is addressing in 
> previous emails.)
> 

I don't think this patch is safe, for exit_mmap() is calling
mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock
held when oom_reap_task_mm() is waiting for oom_lock held by exit_mmap().
exit_mmap() must not block while holding oom_lock in order to guarantee that
oom_reap_task_mm() can give up.
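
The locking rule argued for here (do the potentially blocking reap outside oom_lock, and take the lock only to set and test the flag) can be sketched in userspace. This is a model under assumptions, not kernel code: a pthread mutex stands in for oom_lock, a bool for MMF_OOM_SKIP, and all `model_*` names are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the mm and its MMF_OOM_SKIP bit. */
struct model_mm {
	bool oom_skip;		/* written and read only under model_oom_lock */
	int reaped_by_exit;	/* how often the exit path reaped */
	int reaped_by_reaper;	/* how often the reaper path reaped */
};

static pthread_mutex_t model_oom_lock = PTHREAD_MUTEX_INITIALIZER;

/* Exit path: reap without the lock (this may block), then publish the flag. */
static void model_exit_mmap(struct model_mm *mm)
{
	mm->reaped_by_exit++;	/* stands in for __oom_reap_task_mm(mm) */

	pthread_mutex_lock(&model_oom_lock);
	mm->oom_skip = true;
	pthread_mutex_unlock(&model_oom_lock);
}

/* Reaper path: test the flag under the lock and back off if it is set. */
static bool model_oom_reap(struct model_mm *mm)
{
	bool did_reap = false;

	pthread_mutex_lock(&model_oom_lock);
	if (!mm->oom_skip) {
		mm->reaped_by_reaper++;
		did_reap = true;
	}
	pthread_mutex_unlock(&model_oom_lock);
	return did_reap;
}
```

Because the exit path never blocks while holding the lock in this model, the reaper can always acquire it, see the flag, and give up.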

Some suggestion on top of your patch:

 mm/mmap.c | 13 +
 mm/oom_kill.c | 51 ++-
 2 files changed, 31 insertions(+), 33 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 981eed4..7b31357 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm)
/*
 * Manually reap the mm to free as much memory as possible.
 * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
-* mm from further consideration.  Taking mm->mmap_sem for write
-* after setting MMF_OOM_SKIP will guarantee that the oom reaper
-* will not run on this mm again after mmap_sem is dropped.
+* mm from further consideration. Setting MMF_OOM_SKIP under
+* oom_lock held will guarantee that the OOM reaper will not
+* run on this mm again.
 *
 * This needs to be done before calling munlock_vma_pages_all(),
 * which clears VM_LOCKED, otherwise the oom reaper cannot
 * reliably test it.
 */
-   mutex_lock(&oom_lock);
__oom_reap_task_mm(mm);
-   mutex_unlock(&oom_lock);
-
+   mutex_lock(&oom_lock);
set_bit(MMF_OOM_SKIP, &mm->flags);
-   down_write(&mm->mmap_sem);
-   up_write(&mm->mmap_sem);
+   mutex_unlock(&oom_lock);
}
 
if (mm->locked_vm) {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ba6cb8..9a29df8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
bool ret = true;
 
+   mutex_lock(&oom_lock);
+
/*
-* We have to make sure to not race with the victim exit path
-* and cause premature new oom victim selection:
-* oom_reap_task_mm exit_mm
-*   mmget_not_zero
-*mmput
-*  atomic_dec_and_test
-*exit_oom_victim
-*  [...]
-*  out_of_memory
-*select_bad_process
-*  # no TIF_MEMDIE task selects new victim
-*  unmap_page_range # frees some memory
+* MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't
+* work on the mm anymore. The check for MMF_OOM_SKIP must run
+* under oom_lock held.
 */
-   mutex_lock(&oom_lock);
+   if (test_bit(MMF_OOM_SKIP, &mm->flags))
+   goto unlock_oom;
 
if (!down_read_trylock(&mm->mmap_sem)) {
ret = false;
@@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
goto unlock_oom;
}
 
-   /*
-* MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-* work on the mm anymore. The check for MMF_OOM_SKIP must run
-* under mmap_sem for reading because it serializes against the
-* down_write();up_write() cycle in exit_mmap().
-*/
-   if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-   up_read(&mm->mmap_sem);
-

Re: [PATCH] cpufreq: powernv: Fix the hardlockup by synchronus smp_call in timer interrupt

2018-04-23 Thread Stewart Smith

Shilpasri G Bhat  writes:
> gpstate_timer_handler() uses synchronous smp_call to set the pstate
> on the requested core. This causes the below hard lockup:
>
> [c03fe566b320] [c01d5340] smp_call_function_single+0x110/0x180 (unreliable)
> [c03fe566b390] [c01d55e0] smp_call_function_any+0x180/0x250
> [c03fe566b3f0] [c0acd3e8] gpstate_timer_handler+0x1e8/0x580
> [c03fe566b4a0] [c01b46b0] call_timer_fn+0x50/0x1c0
> [c03fe566b520] [c01b4958] expire_timers+0x138/0x1f0
> [c03fe566b590] [c01b4bf8] run_timer_softirq+0x1e8/0x270
> [c03fe566b630] [c0d0d6c8] __do_softirq+0x158/0x3e4
> [c03fe566b710] [c0114be8] irq_exit+0xe8/0x120
> [c03fe566b730] [c0024d0c] timer_interrupt+0x9c/0xe0
> [c03fe566b760] [c0009014] decrementer_common+0x114/0x120
> --- interrupt: 901 at doorbell_global_ipi+0x34/0x50
> LR = arch_send_call_function_ipi_mask+0x120/0x130
> [c03fe566ba50] [c004876c] arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
> [c03fe566ba90] [c01d59f0] smp_call_function_many+0x340/0x450
> [c03fe566bb00] [c0075f18] pmdp_invalidate+0x98/0xe0
> [c03fe566bb30] [c03a1120] change_huge_pmd+0xe0/0x270
> [c03fe566bba0] [c0349278] change_protection_range+0xb88/0xe40
> [c03fe566bcf0] [c03496c0] mprotect_fixup+0x140/0x340
> [c03fe566bdb0] [c0349a74] SyS_mprotect+0x1b4/0x350
> [c03fe566be30] [c000b184] system_call+0x58/0x6c
>
> Fix this by using the asynchronous smp_call in the timer interrupt handler.
> We don't have to wait in this handler until the pstates are changed on
> the core. This change will not have any impact on the global pstate
> ramp-down algorithm.
>
> Reported-by: Nicholas Piggin 
> Reported-by: Pridhiviraj Paidipeddi 
> Signed-off-by: Shilpasri G Bhat 
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
> index 0591874..7e0c752 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -721,7 +721,7 @@ void gpstate_timer_handler(struct timer_list *t)
>   spin_unlock(&gpstates->gpstate_lock);
>
>   /* Timer may get migrated to a different cpu on cpu hot unplug */
> - smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
> + smp_call_function_any(policy->cpus, set_pstate, &freq_data, 0);
>  }

Should this have:
Fixes: eaa2c3aeef83f
and CC stable v4.7+ ?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] KVM: X86: Allow userspace to define the microcode version

2018-04-23 Thread Paolo Bonzini

On 24/04/2018 05:14, Konrad Rzeszutek Wilk wrote:
> You would need to include the microcode version in the migration stream.
> 
> But this brings another point - what if we want to manifest certain
> new CPUID bits?

You don't do that across migration.  Generally if you want to do live
migration and you set up the guest to know everything about the host
(down to the microcode level), you should make sure your hosts are pretty
much identical.

Paolo

linux-next: Tree for Apr 24

2018-04-23 Thread Stephen Rothwell

Hi all,

News:  There will be no linux-next release tomorrow.

Changes since 20180423:

Non-merge commits (relative to Linus' tree): 2197
 1973 files changed, 81135 insertions(+), 37356 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a
multi_v7_defconfig for arm and a native build of tools/perf. After
the final fixups (if any), I do an x86_64 modules_install followed by
builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit),
ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc
and sparc64 defconfig. And finally, a simple boot test of the powerpc
pseries_le_defconfig kernel in qemu (with and without kvm enabled).

Below is a summary of the state of the merge.

I am currently merging 258 trees (counting Linus' and 44 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (6d08b06e67cd Linux 4.17-rc2)
Merging fixes/master (147a89bc71e7 Merge tag 'kconfig-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild)
Merging kbuild-current/fixes (28913ee8191a netfilter: nf_nat_snmp_basic: add correct dependency to Makefile)
Merging arc-current/for-curr (661e50bc8532 Linux 4.16-rc4)
Merging arm-current/fixes (30cfae461581 ARM: replace unnecessary perl with sed and the shell $(( )) operator)
Merging arm64-fixes/for-next/fixes (b2d71b3cda19 arm64: signal: don't force known signals to SIGKILL)
Merging m68k-current/for-linus (ecd685580c8f m68k/mac: Remove bogus "FIXME" comment)
Merging powerpc-fixes/fixes (56376c5864f8 powerpc/kvm: Fix lockups when running KVM guests on Power8)
Merging sparc/master (17dec0a94915 Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace)
Merging fscrypt-current/for-stable (ae64f9bd1d36 Linux 4.15-rc2)
Merging net/master (77621f024d6b Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf)
Merging bpf/master (b3f8adee85e8 Merge branch 'bpf-sockmap-fixes')
Merging ipsec/master (b48c05ab5d32 xfrm: Fix warning in xfrm6_tunnel_net_exit.)
Merging netfilter/master (5a786232eb69 netfilter: xt_connmark: do not cast xt_connmark_tginfo1 to xt_connmark_tginfo2)
Merging ipvs/master (765cca91b895 netfilter: conntrack: include kmemleak.h for kmemleak_not_leak())
Merging wireless-drivers/master (77e30e10ee28 iwlwifi: mvm: query regdb for wmm rule if needed)
Merging mac80211/master (2f0605a697f4 nl80211: Free connkeys on external authentication failure)
Merging rdma-fixes/for-rc (60cc43fc8884 Linux 4.17-rc1)
Merging sound-current/for-linus (c40937e5de05 ALSA: usb-audio: Fix missing endian conversion)
Merging pci-current/for-linus (0cf22d6b317c PCI: Add "PCIe" to pcie_print_link_status() messages)
Merging driver-core.current/driver-core-linus (3e14c6abbfb5 kobject: don't use WARN for registration failures)
Merging tty.current/tty-linus (66dd99c20336 tty: serial: xuartps: Setup early console when uartclk is also passed)
Merging usb.current/usb-linus (be75d8f1da08 USB: musb: dsps: drop duplicate phy initialisation)
Merging usb-gadget-fixes/fixes (c6ba5084ce0d usb: gadget: udc: renesas_usb3: add binging for r8a77965)
Merging usb-serial-fixes/usb-linus (ebe37f322acb USB: serial: option: fix dwm-158 3g modem interface)
Merging usb-chipidea-fixes/ci-for-usb-stable (964728f9f407 USB: chipidea: msm: fix ulpi-node lookup)
Merging phy/fixes (60cc43fc8884 Linux 4.17-rc1)
Merging staging.current/staging-linus (b00e2fd10429 staging: wilc1000: fix NULL pointer exception in host_int_parse_assoc_resp_info())
Merging char-misc.current/char-misc-linus (19972dd568f8 virt: vbox: Log an error when we fail to get the host version)
Merging input-current/for-linus (664b0bae0b87 Merge branch 'next' into for-linus)
Merging crypto-current/master (eea0d

Re: [PATCH 1/3] big key: get rid of stack array allocation

2018-04-23 Thread Eric Biggers

Hi Tycho,

On Mon, Apr 23, 2018 at 07:03:19PM -0600, Tycho Andersen wrote:
> We're interested in getting rid of all of the stack allocated arrays in the
> kernel [1]. This patch simply hardcodes the iv length to match that of the
> hardcoded cipher.
> 
> [1]: https://lkml.org/lkml/2018/3/7/621
> 
> v2: hardcode the length of the nonce to be the GCM AES IV length, and do a
> sanity check in init(), Eric Biggers
> 
> Signed-off-by: Tycho Andersen 
> CC: David Howells 
> CC: James Morris 
> CC: "Serge E. Hallyn" 
> CC: Jason A. Donenfeld 
> CC: Eric Biggers 
> ---
>  security/keys/big_key.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/security/keys/big_key.c b/security/keys/big_key.c
> index 933623784ccd..75c46786a166 100644
> --- a/security/keys/big_key.c
> +++ b/security/keys/big_key.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include <crypto/gcm.h>
>  
>  struct big_key_buf {
>   unsigned int nr_pages;
> @@ -109,7 +110,7 @@ static int big_key_crypt(enum big_key_op op, struct big_key_buf *buf, size_t dat
>* an .update function, so there's no chance we'll wind up reusing the
>* key to encrypt updated data. Simply put: one key, one encryption.
>*/
> - u8 zero_nonce[crypto_aead_ivsize(big_key_aead)];
> + u8 zero_nonce[GCM_AES_IV_SIZE];
>  
>   aead_req = aead_request_alloc(big_key_aead, GFP_KERNEL);
>   if (!aead_req)
> @@ -425,6 +426,12 @@ static int __init big_key_init(void)
>   pr_err("Can't alloc crypto: %d\n", ret);
>   return ret;
>   }
> +
> + if (unlikely(crypto_aead_ivsize(big_key_aead) != GCM_AES_IV_SIZE)) {
> + WARN(1, "big key algorithm changed?");
> + return -EINVAL;
> + }
> +

'big_key_aead' needs to be freed on error.

err = -EINVAL;
goto free_aead;

Also how about defining the IV size next to the algorithm name?
Then all the algorithm details would be on adjacent lines:

static const char big_key_alg_name[] = "gcm(aes)";
#define BIG_KEY_IV_SIZE GCM_AES_IV_SIZE

- Eric
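
The two review suggestions above (release the transform on the error path through a cleanup label, and keep the IV size next to the algorithm name) can be illustrated with a small userspace sketch. Everything in it is an assumption made for illustration: `aead_model`, `aead_model_alloc()`, `aead_model_free()` and `init_model()` are hypothetical stand-ins for the kernel objects, and `MODEL_IV_SIZE` is set to 12 as a stand-in for GCM_AES_IV_SIZE.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Algorithm details kept on adjacent lines, as suggested above. */
static const char model_alg_name[] = "gcm(aes)";
#define MODEL_IV_SIZE 12 /* assumed stand-in for GCM_AES_IV_SIZE */

/* Hypothetical stand-in for the allocated AEAD transform. */
struct aead_model {
	unsigned int ivsize;
};

static struct aead_model *model_aead; /* kept after a successful init */
static int live_allocations;          /* lets a caller check for leaks */

static struct aead_model *aead_model_alloc(const char *alg, unsigned int ivsize)
{
	struct aead_model *aead = malloc(sizeof(*aead));

	(void)alg; /* a real allocator would look the algorithm up by name */
	if (aead) {
		aead->ivsize = ivsize;
		live_allocations++;
	}
	return aead;
}

static void aead_model_free(struct aead_model *aead)
{
	assert(aead != NULL && live_allocations > 0);
	free(aead);
	live_allocations--;
}

/* reported_ivsize plays the role of crypto_aead_ivsize(big_key_aead). */
static int init_model(unsigned int reported_ivsize)
{
	struct aead_model *aead;
	int ret;

	aead = aead_model_alloc(model_alg_name, reported_ivsize);
	if (!aead)
		return -ENOMEM;

	if (aead->ivsize != MODEL_IV_SIZE) {
		ret = -EINVAL;
		goto free_aead; /* a bare return here would leak 'aead' */
	}

	model_aead = aead;
	return 0;

free_aead:
	aead_model_free(aead);
	return ret;
}
```

Calling `init_model(16)` in this sketch returns -EINVAL and leaves `live_allocations` at zero, which is the property the review comment asks for.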

Re: [PATCH v14 8/9] PCI/AER/DPC: Align FATAL error handling for AER and DPC

2018-04-23 Thread kbuild test robot

Hi Oza,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v4.16]
[cannot apply to pci/next linus/master v4.17-rc1 next-20180423]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Oza-Pawandeep/Address-error-and-recovery-for-AER-and-DPC/20180424-090411
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/pci/pcie/err.c:276:18: sparse: symbol 'pcie_do_fatal_recovery' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[RFC PATCH] PCI/AER/DPC: pcie_do_fatal_recovery() can be static

2018-04-23 Thread kbuild test robot


Fixes: 3d7db543cb99 ("PCI/AER/DPC: Align FATAL error handling for AER and DPC")
Signed-off-by: Fengguang Wu 
---
 err.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 99d52a0..9d8d7ef 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -273,7 +273,7 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
return result_data.result;
 }
 
-pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev, int severity)
+static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev, int severity)
 {
struct pci_dev *udev;
struct pci_bus *parent;
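
The sparse message above has two usual resolutions, and which one applies depends on whether the function has callers outside its own file, something this excerpt does not show. A minimal sketch of both cases follows; `severity_is_fatal()` and `recover_model()` are hypothetical names and are not taken from the PCI code.

```c
#include <assert.h>

/*
 * Case 1: used only inside this file, so internal linkage is enough
 * and 'static' is the right answer to the warning.
 */
static int severity_is_fatal(int severity)
{
	return severity >= 2;
}

/*
 * Case 2: called from other files. The prototype below would normally
 * live in a shared header; having it visible before the definition is
 * what satisfies the "was not declared" check.
 */
int recover_model(int severity);

int recover_model(int severity)
{
	assert(severity >= 0);
	return severity_is_fatal(severity) ? 1 : 0;
}
```

If the function is in fact shared between files, adding the declaration to a common header is the fix, and marking it static would break the other callers at link time.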

[PATCH] cpufreq: powernv: Fix the hardlockup by synchronus smp_call in timer interrupt

2018-04-23 Thread Shilpasri G Bhat

gpstate_timer_handler() uses synchronous smp_call to set the pstate
on the requested core. This causes the below hard lockup:

[c03fe566b320] [c01d5340] smp_call_function_single+0x110/0x180 (unreliable)
[c03fe566b390] [c01d55e0] smp_call_function_any+0x180/0x250
[c03fe566b3f0] [c0acd3e8] gpstate_timer_handler+0x1e8/0x580
[c03fe566b4a0] [c01b46b0] call_timer_fn+0x50/0x1c0
[c03fe566b520] [c01b4958] expire_timers+0x138/0x1f0
[c03fe566b590] [c01b4bf8] run_timer_softirq+0x1e8/0x270
[c03fe566b630] [c0d0d6c8] __do_softirq+0x158/0x3e4
[c03fe566b710] [c0114be8] irq_exit+0xe8/0x120
[c03fe566b730] [c0024d0c] timer_interrupt+0x9c/0xe0
[c03fe566b760] [c0009014] decrementer_common+0x114/0x120
--- interrupt: 901 at doorbell_global_ipi+0x34/0x50
LR = arch_send_call_function_ipi_mask+0x120/0x130
[c03fe566ba50] [c004876c] arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
[c03fe566ba90] [c01d59f0] smp_call_function_many+0x340/0x450
[c03fe566bb00] [c0075f18] pmdp_invalidate+0x98/0xe0
[c03fe566bb30] [c03a1120] change_huge_pmd+0xe0/0x270
[c03fe566bba0] [c0349278] change_protection_range+0xb88/0xe40
[c03fe566bcf0] [c03496c0] mprotect_fixup+0x140/0x340
[c03fe566bdb0] [c0349a74] SyS_mprotect+0x1b4/0x350
[c03fe566be30] [c000b184] system_call+0x58/0x6c

Fix this by using the asynchronous smp_call in the timer interrupt handler.
We don't have to wait in this handler until the pstates are changed on
the core. This change will not have any impact on the global pstate
ramp-down algorithm.

Reported-by: Nicholas Piggin 
Reported-by: Pridhiviraj Paidipeddi 
Signed-off-by: Shilpasri G Bhat 
---
 drivers/cpufreq/powernv-cpufreq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 0591874..7e0c752 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -721,7 +721,7 @@ void gpstate_timer_handler(struct timer_list *t)
spin_unlock(&gpstates->gpstate_lock);
 
/* Timer may get migrated to a different cpu on cpu hot unplug */
-   smp_call_function_any(policy->cpus, set_pstate, &freq_data, 1);
+   smp_call_function_any(policy->cpus, set_pstate, &freq_data, 0);
 }
 
 /*
-- 
1.8.3.1

Re: [PATCH net-next] net: init sk_cookie for inet socket

2018-04-23 Thread Yafang Shao

On Tue, Apr 24, 2018 at 12:09 AM, Eric Dumazet  wrote:
>
>
> On 04/23/2018 08:58 AM, David Miller wrote:
>> From: Yafang Shao 
>> Date: Sun, 22 Apr 2018 21:50:04 +0800
>>
>>> With sk_cookie we can identify a socket, which is very helpful for
>>> tracing and statistics, e.g. the TCP tracepoint and eBPF.
>>> So we'd better init it by default for inet sockets.
>>> When using it, we just need to call atomic64_read(&sk->sk_cookie).
>>>
>>> Signed-off-by: Yafang Shao 
>>
>> Applied, thank you.
>>
>
> This is adding yet another atomic_inc on a global cache line.
>

That's a trade-off.

> Most applications do not need the cookie being ever set.
>
> The existing mechanism was fine. Set it on demand.

There are some drawbacks in the existing mechanism.
- We have to set net->cookie_gen and then sk->sk_cookie when we
want to get the sk_cookie, which is a little expensive as well.
  After this change, sock_gen_cookie() could be replaced by
atomic64_read(&sk->sk_cookie) in most places.

- If the application wants to get the sk_cookie, it must set it first.
  What if the application doesn't have permission to write?
  Furthermore, maybe it is a security concern?


Thanks
Yafang
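
For reference, the on-demand scheme under discussion can be sketched in
user-space C11 roughly as below. This is only an illustration of lazy
assignment, not the kernel code: struct fake_sock, cookie_gen and
gen_cookie() are hypothetical stand-ins, and C11 atomics take the place of
the kernel's atomic64 helpers. The shared counter is touched only the first
time a cookie is requested for a given socket; initializing the cookie at
socket creation would move that increment to every socket.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a socket; 0 means "no cookie assigned yet". */
struct fake_sock {
	_Atomic uint64_t cookie;
};

/* Hypothetical global generator shared by all sockets. */
static _Atomic uint64_t cookie_gen;

/* Assign a cookie on first use and return it; later calls only read. */
static uint64_t gen_cookie(struct fake_sock *sk)
{
	for (;;) {
		uint64_t res = atomic_load(&sk->cookie);
		uint64_t expected = 0;
		uint64_t val;

		if (res)
			return res;

		/* The global counter is only touched on this slow path. */
		val = atomic_fetch_add(&cookie_gen, 1) + 1;

		/* If another thread won the race, its value stays and is re-read. */
		atomic_compare_exchange_strong(&sk->cookie, &expected, val);
	}
}
```

With this shape a reader that never asks for the cookie never causes a write
to the shared counter, which is the cost being weighed against the simpler
"always initialized" read.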

Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue

2018-04-23 Thread Eric Dumazet



On 04/23/2018 07:04 PM, Andy Lutomirski wrote:
> On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet  wrote:
>> Hi Andy
>>
>> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
> 
>>> I would suggest that you rework the interface a bit.  First a user would 
>>> call mmap() on a TCP socket, which would create an empty VMA.  (It would 
>>> set vm_ops to point to tcp_vm_ops or similar so that the TCP code could 
>>> recognize it, but it would have no effect whatsoever on the TCP state 
>>> machine.  Reading the VMA would get SIGBUS.)  Then a user would call a new 
>>> ioctl() or setsockopt() function and pass something like:
>>
>>
>>>
>>> struct tcp_zerocopy_receive {
>>>   void *address;
>>>   size_t length;
>>> };
>>>
>>> The kernel would verify that [address, address+length) is entirely inside a 
>>> single TCP VMA and then would do the vm_insert_range magic.
>>
>> I have no idea what the proper API for that is.
>> Where would the TCP VMA(s) be stored?
>> In the TCP socket, or in the MM layer?
> 
> MM layer.  I haven't tested this at all, and the error handling is
> totally wrong, but I think you'd do something like:
> 
> len = get_user(...);
> 
> down_read(&current->mm->mmap_sem);
> 
> vma = find_vma(mm, start);
> if (!vma || vma->vm_start > start)
>   return -EFAULT;
> 
> /* This is buggy.  You also need to check that the file is a socket.
> This is probably trivial. */
> if (vma->vm_file->private_data != sock)
>   return -EINVAL;
> 
> if (len > vma->vm_end - start)
>   return -EFAULT;  /* too big a request. */
> 
> and now you'd do the vm_insert_page() dance, except that you don't
> have to abort the whole procedure if you discover that something isn't
> aligned right.  Instead you'd just stop and tell the caller that you
> didn't map the full requested size.  You might also need to add some
> code to charge the caller for the pages that get pinned, but that's an
> orthogonal issue.
> 
> You also need to provide some way for user programs to signal that
> they're done with the page in question.  MADV_DONTNEED might be
> sufficient.
> 
> In the mmap() helper, you might want to restrict the mapped size to
> something reasonable.  And it might be nice to hook mremap() to
> prevent user code from causing too much trouble.
> 
> With my x86-writer-of-TLB-code hat on, I expect the major performance
> costs to be the generic costs of mmap() and munmap() (which only
> happen once per socket instead of once per read if you like my idea),
> the cost of a TLB miss when the data gets read (really not so bad on
> modern hardware), and the cost of the TLB invalidation when user code
> is done with the buffers.  The latter is awful, especially in
> multithreaded programs.  In fact, it's so bad that it might be worth
> mentioning in the documentation for this code that it just shouldn't
> be used in multithreaded processes.  (Also, on non-PCID hardware,
> there's an annoying situation in which a recently-migrated thread that
> removes a mapping sends an IPI to the CPU that the thread used to be
> on.  I thought I had a clever idea to get rid of that IPI once, but it
> turned out to be wrong.)
> 
> Architectures like ARM that have superior TLB handling primitives will
> not be hurt as badly if this is used by a multithreaded program.
> 
>>
>>
>> And I am not sure why the error handling would be better (point 4), unless 
>> we can return smaller @length than requested maybe ?
> 
> Exactly.  If I request 10MB mapped and only the first 9MB are aligned
> right, I still want the first 9 MB.
> 
>>
>> Also how the VMA space would be accounted (point 3) when creating an empty 
>> VMA (no pages in there yet)
> 
> There's nothing to account.  It's the same as mapping /dev/null or
> similar -- the mm core should take care of it for you.
> 

Thanks Andy, I am working on all this, and the initial patch looks sane enough.

 include/uapi/linux/tcp.h |7 +
 net/ipv4/tcp.c   |  175 +++
 2 files changed, 93 insertions(+), 89 deletions(-)


I will test all this before sending for review asap.

( I have not done the compat code yet, this can be done later I guess)
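
The range check and the "map what you can" behaviour discussed above can be
sketched in user-space C as follows. This is a model of the logic only:
struct fake_vma, map_range(), the 'mappable' parameter and the 4 KiB page
size are assumptions made for illustration, and nothing here is the
interface of the actual patch.

```c
#include <errno.h>

#define FAKE_PAGE_SIZE 4096UL	/* assumed page size */

/* Hypothetical stand-in for the [vm_start, vm_end) range of a VMA. */
struct fake_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Return the number of bytes that would be mapped, or a negative errno.
 * 'mappable' models how many leading bytes of the received data turned
 * out to be suitably aligned for page mapping.
 */
static long map_range(const struct fake_vma *vma, unsigned long start,
		      unsigned long len, unsigned long mappable)
{
	unsigned long n;

	/* The start address must fall inside the VMA. */
	if (!vma || start < vma->vm_start || start >= vma->vm_end)
		return -EFAULT;

	/* The whole request must fit inside that single VMA. */
	if (len > vma->vm_end - start)
		return -EFAULT;

	/* Stop at the first unmappable byte instead of failing outright. */
	n = len < mappable ? len : mappable;

	/* Only whole pages can be mapped. */
	return (long)(n & ~(FAKE_PAGE_SIZE - 1));
}
```

A caller asking for 10MB when only the first 9MB are aligned gets 9MB back
as a short count and can fall back to a regular copy for the rest.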

Re: [PATCH v6 03/17] media: rkisp1: Add user space ABI definitions

2018-04-23 Thread Tomasz Figa

On Thu, Mar 8, 2018 at 6:48 PM Jacob Chen  wrote:
[snip]
> +/**
> + * struct cifisp_lsc_config - Configuration used by Lens shading correction
> + *
> + * refer to REF_01 for details
> + */
> +struct cifisp_lsc_config {
> +   __u32 r_data_tbl[CIFISP_LSC_DATA_TBL_SIZE];
> +   __u32 gr_data_tbl[CIFISP_LSC_DATA_TBL_SIZE];
> +   __u32 gb_data_tbl[CIFISP_LSC_DATA_TBL_SIZE];
> +   __u32 b_data_tbl[CIFISP_LSC_DATA_TBL_SIZE];
> +
> +   __u32 x_grad_tbl[CIFISP_LSC_GRAD_TBL_SIZE];
> +   __u32 y_grad_tbl[CIFISP_LSC_GRAD_TBL_SIZE];
> +
> +   __u32 x_size_tbl[CIFISP_LSC_SIZE_TBL_SIZE];
> +   __u32 y_size_tbl[CIFISP_LSC_SIZE_TBL_SIZE];

Looking at the code, we only need 12 bits of each, so perhaps it could make
sense to make those __u16? Also, the natural layout for these seems to be
two-dimensional, i.e. [CIFISP_LSC_NUM_SECTORS][CIFISP_LSC_NUM_SECTORS]. I
think it wouldn't be a problem to define it this way for UAPI too.

> +   __u16 config_width;
> +   __u16 config_height;

These 2 seem unused. Just making sure. If they are part of hardware LSC
configuration, then we should keep them.

[snip]
> +struct cifisp_awb_meas_config {
> +   /*
> +* Note: currently the h and v offsets are mapped to grid offsets
> +*/

Perhaps should be part of the kerneldoc comment above? Also, I don't seem
to understand what this means.

> +   struct cifisp_window awb_wnd;
> +   __u32 awb_mode;
> +   __u8 max_y;
> +   __u8 min_y;
> +   __u8 max_csum;
> +   __u8 min_c;
> +   __u8 frames;
> +   __u8 awb_ref_cr;
> +   __u8 awb_ref_cb;
> +   __u8 enable_ymax_cmp;
> +} __attribute__ ((packed));
> +
> +/**
> + * struct cifisp_awb_gain_config - Configuration used by auto white balance gain
> + *
> + * out_data_x = ( AWB_GEAIN_X * in_data + 128) >> 8

typo: AWB_GAIN?

> + */
> +struct cifisp_awb_gain_config {
> +   __u16 gain_red;
> +   __u16 gain_green_r;
> +   __u16 gain_blue;
> +   __u16 gain_green_b;
> +} __attribute__ ((packed));
> +
> +/**
> + * struct cifisp_flt_config - Configuration used by ISP filtering
> + *
> + * @mode: ISP_FILT_MODE register fields (from enum cifisp_flt_mode)
> + * @grn_stage1: ISP_FILT_MODE register fields
> + * @chr_h_mode: ISP_FILT_MODE register fields
> + * @chr_v_mode: ISP_FILT_MODE register fields

Missing documentation for remaining fields.

> + *
> + * refer to REF_01 for details.
> + */
> +
> +struct cifisp_flt_config {
> +   __u32 mode;
> +   __u8 grn_stage1;
> +   __u8 chr_h_mode;
> +   __u8 chr_v_mode;

Maybe we could move u8 below u32 to optimize the alignment?

[snip]
> +/**
> + * struct cifisp_hst_config - Configuration used by Histogram
> + *
> + * @mode: histogram mode (from enum cifisp_histogram_mode)
> + * @histogram_predivider: process every stepsize pixel, all other pixels are skipped
> + * @meas_window: coordinates of the measure window
> + * @hist_weight: weighting factor for sub-windows
> + */
> +struct cifisp_hst_config {
> +   __u32 mode;
> +   __u8 histogram_predivider;
> +   struct cifisp_window meas_window;

Perhaps could swap the two above for better alignment?

> +   __u8 hist_weight[CIFISP_HISTOGRAM_WEIGHT_GRIDS_SIZE];
> +} __attribute__ ((packed));
> +
> +/**
> + * struct cifisp_aec_config - Configuration used by Auto Exposure Control
> + *
> + * @mode: Exposure measure mode (from enum cifisp_exp_meas_mode)
> + * @autostop: stop mode (from enum cifisp_exp_ctrl_autostop)
> + * @meas_window: coordinates of the measure window
> + */
> +struct cifisp_aec_config {
> +   __u32 mode;
> +   __u32 autostop;
> +   struct cifisp_window meas_window;
> +} __attribute__ ((packed));
> +
> +/**
> + * struct cifisp_afc_config - Configuration used by Auto Focus Control
> + *
> + * @num_afm_win: max CIFISP_AFM_MAX_WINDOWS
> + * @afm_win: coordinates of the meas window
> + * @thres: threshold used for minimizing the influence of noise
> + * @var_shift: the number of bits for the shift operation at the end of the calculation chain.
> + */
> +struct cifisp_afc_config {
> +   __u8 num_afm_win;
> +   struct cifisp_window afm_win[CIFISP_AFM_MAX_WINDOWS];
> +   __u32 thres;
> +   __u32 var_shift;

Perhaps could put afm_win[] and then num_afm_win here, for better alignment?

Best regards,
Tomasz
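
The effect of the reordering suggested for cifisp_afc_config can be checked
with offsetof(). The sketch below rests on several assumptions that are not
visible in the quoted part of the patch: a window made of four 16-bit
fields (the field names are made up), three AF windows, and a packed struct
like its neighbours. The *_orig and *_reordered names exist only for this
comparison, and the packed attribute is the GCC/Clang spelling.

```c
#include <stddef.h>
#include <stdint.h>

#define AFM_MAX_WINDOWS 3	/* assumed value of CIFISP_AFM_MAX_WINDOWS */

/* Assumed layout of struct cifisp_window: four 16-bit fields. */
struct window {
	uint16_t h_offs, v_offs, h_size, v_size;
} __attribute__((packed));

/* Field order as quoted in the patch. */
struct afc_orig {
	uint8_t num_afm_win;
	struct window afm_win[AFM_MAX_WINDOWS];
	uint32_t thres;
	uint32_t var_shift;
} __attribute__((packed));

/* Field order as suggested in the review: the lone u8 goes last. */
struct afc_reordered {
	struct window afm_win[AFM_MAX_WINDOWS];
	uint32_t thres;
	uint32_t var_shift;
	uint8_t num_afm_win;
} __attribute__((packed));

/* Non-zero when both 32-bit members start on a 4-byte boundary. */
static int orig_u32_aligned(void)
{
	return offsetof(struct afc_orig, thres) % 4 == 0 &&
	       offsetof(struct afc_orig, var_shift) % 4 == 0;
}

static int reordered_u32_aligned(void)
{
	return offsetof(struct afc_reordered, thres) % 4 == 0 &&
	       offsetof(struct afc_reordered, var_shift) % 4 == 0;
}
```

Under these assumptions the packed size stays the same, but the leading
__u8 no longer pushes the window array and the 32-bit members to odd
offsets.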

RE: [PATCH 6/6] devfreq: rk3399_dmc: register devfreq notification to dmc driver.

2018-04-23 Thread MyungJoo Ham

>From: Lin Huang 
>
>Because dmc may also access the PMU_BUS_IDLE_REQ register, we need to
>ensure that the pd driver and the dmc driver will not access this
>register at the same time.
>
>Signed-off-by: Lin Huang 
>Signed-off-by: Enric Balletbo i Serra 
>---
>
> drivers/devfreq/rk3399_dmc.c  | 47 +--
> drivers/soc/rockchip/pm_domains.c | 31 +++
> include/soc/rockchip/rk3399_dmc.h | 63 +++
> 3 files changed, 96 insertions(+), 45 deletions(-)
> create mode 100644 include/soc/rockchip/rk3399_dmc.h
>
>diff --git a/drivers/devfreq/rk3399_dmc.c b/drivers/devfreq/rk3399_dmc.c
>index 5bfca028eaaf..a1f320634d69 100644
>--- a/drivers/devfreq/rk3399_dmc.c
>+++ b/drivers/devfreq/rk3399_dmc.c
[]
>diff --git a/drivers/soc/rockchip/pm_domains.c b/drivers/soc/rockchip/pm_domains.c
>index 53efc386b1ad..7acc836e7eb7 100644
>--- a/drivers/soc/rockchip/pm_domains.c
>+++ b/drivers/soc/rockchip/pm_domains.c
[]
>+static int dmc_notify(struct notifier_block *nb, unsigned long event,
>+void *data)
>+{
>+	if (event == DEVFREQ_PRECHANGE)
>+		mutex_lock(&dmc_pmu->mutex);
>+	else if (event == DEVFREQ_POSTCHANGE)
>+		mutex_unlock(&dmc_pmu->mutex);
>+
>+  return NOTIFY_OK;
>+}
>+

Doesn't this incur a deadlock?

1. devfreq.c:update_freq calls devfreq_notify_transition(DEVFREQ_PRECHANGE)
2. pm_domains.c:dmc_notify calls mutex_lock(&dmc_pmu->mutex)
3. devfreq.c:update_freq calls target callback
4. rk3399_dmc.c:rk3399_dmcfreq_target calls mutex_lock(&dmcfreq->lock)
   >> update_freq cannot proceed.


Cheers,
MyungJoo

Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.

2018-04-23 Thread Alex G.




On 04/22/2018 05:48 AM, Borislav Petkov wrote:

On Thu, Apr 19, 2018 at 05:55:08PM -0500, Alex G. wrote:

How does such an error look like, in detail?


It's green on the soft side, with lots of red accents, as well as some
textured white shades:

[   51.414616] pciehp :b0:06.0:pcie204: Slot(176): Link Down
[   51.414634] pciehp :b0:05.0:pcie204: Slot(179): Link Down
[   52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able to correct
[   52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
[   52.703347] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   52.703358] pciehp :b0:06.0:pcie204: Slot(176): Link Up
[   52.711616] {1}[Hardware Error]: event severity: fatal
[   52.716754] {1}[Hardware Error]:  Error 0, type: fatal
[   52.721891] {1}[Hardware Error]:   section_type: PCIe error
[   52.727463] {1}[Hardware Error]:   port_type: 6, downstream switch port
[   52.734075] {1}[Hardware Error]:   version: 3.0
[   52.738607] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
[   52.744786] {1}[Hardware Error]:   device_id: :b0:06.0
[   52.750271] {1}[Hardware Error]:   slot: 4
[   52.754371] {1}[Hardware Error]:   secondary_bus: 0xb3
[   52.759509] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x9733
[   52.766123] {1}[Hardware Error]:   class_code: 000406
[   52.771182] {1}[Hardware Error]:   bridge: secondary_status: 0x, control: 0x0003
[   52.779038] pcieport :b0:06.0: aer_status: 0x0010, aer_mask: 0x01a1
[   52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
[   52.786348] pcieport :b0:06.0:[20] Unsupported Request
[   52.786349] pcieport :b0:06.0: aer_layer=Transaction Layer, aer_agent=Requester ID
[   52.786350] pcieport :b0:06.0: aer_uncor_severity: 0x004eb030
[   52.786352] pcieport :b0:06.0:   TLP Header: 4001 020f e12023bc 0100
[   52.786357] pcieport :b0:06.0: broadcast error_detected message
[   52.883895] pci :b3:00.0: device has no driver
[   52.883976] pciehp :b0:06.0:pcie204: Slot(176): Link Down
[   52.884184] pciehp :b0:06.0:pcie204: Slot(176): Link Down event queued; currently getting powered on
[   52.967175] pciehp :b0:06.0:pcie204: Slot(176): Link Up


Btw, from another discussion we're having with Yazen:

@Yazen, do you see how this error record is worth shit?

  class_code: 000406
  command: 0x0407, status: 0x0010
  bridge: secondary_status: 0x, control: 0x0003
  aer_status: 0x0010, aer_mask: 0x01a1
  aer_uncor_severity: 0x004eb030


That tells you what FFS said about the error. Keep in mind that FFS has 
cleared the hardware error bits, which the AER handler would normally 
read from the PCI device.



those above are only some of the fields which are purely useless
undecoded. Makes me wonder what's worse for the user: dump the
half-decoded error or not dump an error at all...


It's immediately obvious if there's a glaring FFS bug and if we get 
bogus data. If you distrust firmware as much as I do, then you will find 
great value in having such info in the logs. It's probably not too 
useful to a casual user, but then neither is a majority of the system log.



Anyway, Alex, I see this in the logs:

[   66.581121] pciehp :b0:06.0:pcie204: Slot(176): Link Down
[   66.591939] pciehp :b0:05.0:pcie204: Slot(179): Card not present
[   66.592102] pciehp :b0:06.0:pcie204: Slot(176): Card not present

and that comes from that pciehp_isr() interrupt handler AFAICT.

So there *is* a way to know that the card is not present anymore. So,
theoretically, and ignoring the code layering for now, we can connect
that error to the card not present event and then ignore the error...


You're missing the timing and assuming you will get the hotplug 
interrupt. In this example, you have 22ms between the link down and 
presence detect state change. This is a fairly fast removal.


Hotplug dependencies aside (you can have the kernel run without PCIe 
hotplug support), I don't think you want to just linger in NMI for 
dozens of milliseconds waiting for presence detect confirmation.


For enterprise SFF NVMe drives, the data lanes will disconnect before 
the presence detect. FFS relies on presence detect, and these are two of 
the reasons why slow removal is such a problem. You might not get a 
presence detect interrupt at all.


Presence detect is optional for PCIe. PD is such a reliable heuristic
that it guarantees worse error handling than the crackmonkey firmware. I
don't see how it might be useful in a way which gives us better handling
than firmware.



Hmmm.


Hmmm

Anyway, heuristics about PCIe error recovery belong in the recovery
handler. I don't think it's smart to apply policy before we get there.


Alex

[PATCH 3/6] powerpc/64s: Patch barrier_nospec in modules

2018-04-23 Thread Michael Ellerman

From: Michal Suchanek 

Note that, unlike RFI, which is patched only in the kernel, the nospec state
reflects the settings at the time the module was loaded.

Iterating all modules and re-patching every time the settings change
is not implemented.

Based on lwsync patching.

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/setup.h  |  6 ++
 arch/powerpc/kernel/module.c  |  6 ++
 arch/powerpc/lib/feature-fixups.c | 16 +---
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index afc7280cce3b..4335cddc1cf2 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -54,6 +54,12 @@ void setup_rfi_flush(enum l1d_flush_type, bool enable);
 void do_rfi_flush_fixups(enum l1d_flush_type types);
 void do_barrier_nospec_fixups(bool enable);
 
+#ifdef CONFIG_PPC_BOOK3S_64
+void do_barrier_nospec_fixups_range(bool enable, void *start, void *end);
+#else
+static inline void do_barrier_nospec_fixups_range(bool enable, void *start, void *end) { };
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_SETUP_H */
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index 3f7ba0f5bf29..a72698cd3dd0 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -72,6 +72,12 @@ int module_finalize(const Elf_Ehdr *hdr,
do_feature_fixups(powerpc_firmware_features,
  (void *)sect->sh_addr,
  (void *)sect->sh_addr + sect->sh_size);
+
+   sect = find_section(hdr, sechdrs, "__spec_barrier_fixup");
+   if (sect != NULL)
+   do_barrier_nospec_fixups_range(true,
+ (void *)sect->sh_addr,
+ (void *)sect->sh_addr + sect->sh_size);
 #endif
 
sect = find_section(hdr, sechdrs, "__lwsync_fixup");
diff --git a/arch/powerpc/lib/feature-fixups.c b/arch/powerpc/lib/feature-fixups.c
index 093c1d2ea5fd..3b37529f82f8 100644
--- a/arch/powerpc/lib/feature-fixups.c
+++ b/arch/powerpc/lib/feature-fixups.c
@@ -163,14 +163,14 @@ void do_rfi_flush_fixups(enum l1d_flush_type types)
: "unknown");
 }
 
-void do_barrier_nospec_fixups(bool enable)
+void do_barrier_nospec_fixups_range(bool enable, void *fixup_start, void *fixup_end)
 {
unsigned int instr, *dest;
long *start, *end;
int i;
 
-   start = PTRRELOC(&__start___barrier_nospec_fixup),
-   end = PTRRELOC(&__stop___barrier_nospec_fixup);
+   start = fixup_start;
+   end = fixup_end;
 
instr = 0x60000000; /* nop */
 
@@ -189,6 +189,16 @@ void do_barrier_nospec_fixups(bool enable)
printk(KERN_DEBUG "barrier-nospec: patched %d locations\n", i);
 }
 
+void do_barrier_nospec_fixups(bool enable)
+{
+   void *start, *end;
+
+   start = PTRRELOC(&__start___barrier_nospec_fixup),
+   end = PTRRELOC(&__stop___barrier_nospec_fixup);
+
+   do_barrier_nospec_fixups_range(enable, start, end);
+}
+
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 void do_lwsync_fixups(unsigned long value, void *fixup_start, void *fixup_end)
-- 
2.14.1
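
The split made by this patch can be illustrated outside the kernel. What
follows is a minimal user-space sketch, not the kernel code. It assumes a
fixup table in which each entry holds the signed distance from the entry
itself to the instruction word it describes, and the names
patch_fixups_range, patch_fixups_all, demo_text, demo_table and
demo_build_table are hypothetical. One function patches an arbitrary
[start, end) range, so a module loader could pass a module's own section,
and a thin wrapper only supplies the bounds of the built-in table.

```c
#include <assert.h>
#include <stdint.h>

#define PPC_NOP            0x60000000u /* ori 0,0,0 */
#define PPC_BARRIER_NOSPEC 0x63ff0000u /* ori 31,31,0 */
#define PPC_BLR            0x4e800020u /* stands in for unrelated code */

/* Hypothetical stand-ins for a text section and its fixup table. */
static uint32_t demo_text[4];
static long demo_table[2];

/* Patch every site listed in [fixup_start, fixup_end); return the count. */
static int patch_fixups_range(int enable, void *fixup_start, void *fixup_end)
{
	long *start = fixup_start, *end = fixup_end;
	uint32_t instr = enable ? PPC_BARRIER_NOSPEC : PPC_NOP;
	int i;

	assert(start <= end);
	for (i = 0; start < end; start++, i++) {
		/* Each entry stores the distance from itself to its target. */
		uint32_t *dest = (uint32_t *)((intptr_t)start + *start);

		*dest = instr;
	}
	return i;
}

/* Whole-image variant: only supplies the bounds of the built-in table. */
static int patch_fixups_all(int enable)
{
	return patch_fixups_range(enable, demo_table, demo_table + 2);
}

/* Fill the demo text and record two patch sites (words 1 and 3). */
static void demo_build_table(void)
{
	int i;

	for (i = 0; i < 4; i++)
		demo_text[i] = PPC_BLR;
	demo_table[0] = (long)((intptr_t)&demo_text[1] - (intptr_t)&demo_table[0]);
	demo_table[1] = (long)((intptr_t)&demo_text[3] - (intptr_t)&demo_table[1]);
}
```

Keeping the loop in the range function means the wrapper and any
per-module caller share one implementation and differ only in the bounds
they pass.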

[PATCH 2/6] powerpc/64s: Add support for ori barrier_nospec patching

2018-04-23 Thread Michael Ellerman

From: Michal Suchanek 

Based on the RFI patching. This is required to be able to disable the
speculation barrier.

Only one barrier type is supported and it does nothing when the
firmware does not enable it. Also, re-patching modules is not supported.
So the only meaningful thing that can be done is patching out the
speculation barrier at boot when the user says it is not wanted.
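
For readers unfamiliar with the two instruction words involved, here is a
small sketch of how they are encoded. It assumes the usual PowerPC D-form
layout for ori (primary opcode 24, then RS, RA and a 16-bit immediate),
and the helper names ppc_encode_ori and barrier_nospec_instr are
hypothetical, not taken from the patch. The plain nop is ori 0,0,0 and the
barrier form is ori 31,31,0, so patching a site in or out amounts to
rewriting one 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a PowerPC D-form "ori RA,RS,UI" (primary opcode 24). */
static uint32_t ppc_encode_ori(unsigned int ra, unsigned int rs, unsigned int ui)
{
	assert(ra < 32 && rs < 32 && ui <= 0xffff);
	return (24u << 26) | ((uint32_t)rs << 21) | ((uint32_t)ra << 16) | ui;
}

/* Word to write at each patch site for the requested setting. */
static uint32_t barrier_nospec_instr(int enable)
{
	/* enabled: ori 31,31,0 (barrier form); disabled: ori 0,0,0 (nop) */
	return enable ? ppc_encode_ori(31, 31, 0) : ppc_encode_ori(0, 0, 0);
}
```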

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/barrier.h|  2 +-
 arch/powerpc/include/asm/feature-fixups.h |  9 +
 arch/powerpc/include/asm/setup.h  |  1 +
 arch/powerpc/kernel/security.c|  9 +
 arch/powerpc/kernel/vmlinux.lds.S |  7 +++
 arch/powerpc/lib/feature-fixups.c | 27 +++
 6 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index e582d2c88092..f67b3f6e36be 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -81,7 +81,7 @@ do { \
  * Prevent execution of subsequent instructions until preceding branches have
  * been fully resolved and are no longer executing speculatively.
  */
-#define barrier_nospec_asm ori 31,31,0
+#define barrier_nospec_asm NOSPEC_BARRIER_FIXUP_SECTION; nop
 
 // This also acts as a compiler barrier due to the memory clobber.
 #define barrier_nospec() asm (stringify_in_c(barrier_nospec_asm) ::: "memory")
diff --git a/arch/powerpc/include/asm/feature-fixups.h b/arch/powerpc/include/asm/feature-fixups.h
index 1e82eb3caabd..86ac59e75f36 100644
--- a/arch/powerpc/include/asm/feature-fixups.h
+++ b/arch/powerpc/include/asm/feature-fixups.h
@@ -195,11 +195,20 @@ label##3: \
FTR_ENTRY_OFFSET 951b-952b; \
.popsection;
 
+#define NOSPEC_BARRIER_FIXUP_SECTION   \
+953:   \
+   .pushsection __barrier_nospec_fixup,"a";\
+   .align 2;   \
+954:   \
+   FTR_ENTRY_OFFSET 953b-954b; \
+   .popsection;
+
 
 #ifndef __ASSEMBLY__
 #include 
 
 extern long __start___rfi_flush_fixup, __stop___rfi_flush_fixup;
+extern long __start___barrier_nospec_fixup, __stop___barrier_nospec_fixup;
 
 void apply_feature_fixups(void);
 void setup_feature_keys(void);
diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 27fa52ed6d00..afc7280cce3b 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -52,6 +52,7 @@ enum l1d_flush_type {
 
 void setup_rfi_flush(enum l1d_flush_type, bool enable);
 void do_rfi_flush_fixups(enum l1d_flush_type types);
+void do_barrier_nospec_fixups(bool enable);
 
 #endif /* !__ASSEMBLY__ */
 
diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index bab5a27ea805..b963eae0b0a0 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -9,10 +9,19 @@
 #include 
 
 #include 
+#include 
 
 
 unsigned long powerpc_security_features __read_mostly = SEC_FTR_DEFAULT;
 
+static bool barrier_nospec_enabled;
+
+static void enable_barrier_nospec(bool enable)
+{
+   barrier_nospec_enabled = enable;
+   do_barrier_nospec_fixups(enable);
+}
+
 ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf)
 {
bool thread_priv;
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index c8af90ff49f0..ff73f498568c 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -139,6 +139,13 @@ SECTIONS
*(__rfi_flush_fixup)
__stop___rfi_flush_fixup = .;
}
+
+   . = ALIGN(8);
+   __spec_barrier_fixup : AT(ADDR(__spec_barrier_fixup) - LOAD_OFFSET) {
+   __start___barrier_nospec_fixup = .;
+   *(__barrier_nospec_fixup)
+   __stop___barrier_nospec_fixup = .;
+   }
 #endif
 
EXCEPTION_TABLE(0)
diff --git a/arch/powerpc/lib/feature-fixups.c b/arch/powerpc/lib/feature-fixups.c
index 288fe4f0db4e..093c1d2ea5fd 100644
--- a/arch/powerpc/lib/feature-fixups.c
+++ b/arch/powerpc/lib/feature-fixups.c
@@ -162,6 +162,33 @@ void do_rfi_flush_fixups(enum l1d_flush_type types)
(types &  L1D_FLUSH_MTTRIG) ? "mttrig type"
: "unknown");
 }
+
+void do_barrier_nospec_fixups(bool enable)
+{
+   unsigned int instr, *dest;
+   long *start, *end;
+   int i;
+
+   start = PTRRELOC(&__start___barrier_nospec_fixup),
+   end = PTRRELOC(&__stop___barrier_nospec_fixup);
+
+   instr = 0x60000000; /* nop */
+
+   if (enable) {
+   pr_info("barrier-nospec: using ORI

[PATCH 2/2] ata: ahci: mvebu: override ahci_stop_engine for mvebu AHCI

2018-04-23 Thread xswang

From: Evan Wang 

There is an issue (Errata Ref#226) where a SATA disk cannot be
detected through a SATA Port Multiplier (PMP), with the following
error log:
  ata1.15: PMP product ID mismatch
  ata1.15: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
  ata1.15: Port Multiplier vendor mismatch '0x1b4b'!='0x0'
  ata1.15: PMP revalidation failed (errno=-19)

Debugging showed that the value of the Port-x FIS-based Switching
Control register (PxFBS@0x40) becomes wrong.
By design, bits[11:8, 0] of register PxFBS are cleared when
Port Command and Status (0x18) bit[0] changes its value from
1 to 0, i.e. the falling edge of Port Command and Status bit[0]
sends a pulse that resets PxFBS bits[11:8, 0].
So a mvebu SATA workaround (WA) is needed to save the port PxFBS
register before the PxCMD ST write and restore it afterwards.

This patch implements the WA in a separate function,
ahci_mvebu_stop_engine, which overrides ahci_stop_engine.

Signed-off-by: Evan Wang 
Suggested-by: Ofer Heifetz 
Cc: Tejun Heo 
Cc: Thomas Petazzoni 
---
 drivers/ata/ahci_mvebu.c | 56 
 1 file changed, 56 insertions(+)

diff --git a/drivers/ata/ahci_mvebu.c b/drivers/ata/ahci_mvebu.c
index de7128d..0045dac 100644
--- a/drivers/ata/ahci_mvebu.c
+++ b/drivers/ata/ahci_mvebu.c
@@ -62,6 +62,60 @@ static void ahci_mvebu_regret_option(struct ahci_host_priv *hpriv)
writel(0x80, hpriv->mmio + AHCI_VENDOR_SPECIFIC_0_DATA);
 }
 
+/**
+ * ahci_mvebu_stop_engine
+ *
+ * @ap:Target ata port
+ *
+ * Errata Ref#226 - SATA Disk HOT swap issue when connected through
+ * Port Multiplier in FIS-based Switching mode.
+ *
+ * To avoid the issue, according to design, the bits[11:8, 0] of
+ * register PxFBS are cleared when Port Command and Status (0x18) bit[0]
+ * changes its value from 1 to 0, i.e. falling edge of Port
+ * Command and Status bit[0] sends PULSE that resets PxFBS
+ * bits[11:8; 0].
+ *
+ * This function is used to override function of "ahci_stop_engine"
+ * from libahci.c by adding the mvebu work around(WA) to save PxFBS
+ * value before the PxCMD ST write of 0, then restore PxFBS value.
+ *
+ * Return: 0 on success; Error code otherwise.
+ */
+int ahci_mvebu_stop_engine(struct ata_port *ap)
+{
+   void __iomem *port_mmio = ahci_port_base(ap);
+   u32 tmp, port_fbs;
+
+   tmp = readl(port_mmio + PORT_CMD);
+
+   /* check if the HBA is idle */
+   if ((tmp & (PORT_CMD_START | PORT_CMD_LIST_ON)) == 0)
+   return 0;
+
+   /* save the port PxFBS register for later restore */
+   port_fbs = readl(port_mmio + PORT_FBS);
+
+   /* setting HBA to idle */
+   tmp &= ~PORT_CMD_START;
+   writel(tmp, port_mmio + PORT_CMD);
+
+   /*
+* bit #15 PxCMD signal doesn't clear PxFBS,
+* restore the PxFBS register right after clearing the PxCMD ST,
+* no need to wait for the PxCMD bit #15.
+*/
+   writel(port_fbs, port_mmio + PORT_FBS);
+
+   /* wait for engine to stop. This could be as long as 500 msec */
+   tmp = ata_wait_register(ap, port_mmio + PORT_CMD,
+   PORT_CMD_LIST_ON, PORT_CMD_LIST_ON, 1, 500);
+   if (tmp & PORT_CMD_LIST_ON)
+   return -EIO;
+
+   return 0;
+}
+
 #ifdef CONFIG_PM_SLEEP
 static int ahci_mvebu_suspend(struct platform_device *pdev, pm_message_t state)
 {
@@ -112,6 +166,8 @@ static int ahci_mvebu_probe(struct platform_device *pdev)
if (rc)
return rc;
 
+   hpriv->stop_engine = ahci_mvebu_stop_engine;
+
if (of_device_is_compatible(pdev->dev.of_node,
"marvell,armada-380-ahci")) {
dram = mv_mbus_dram_info();
-- 
1.9.1
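
The save/restore idea behind the workaround can be tried in user space
with a toy register model. This is only a sketch under stated
assumptions: struct fake_port, fake_write_cmd, stop_plain and
stop_with_workaround are hypothetical names, and the model simply clears
PxFBS bits[11:8, 0] on a 1-to-0 transition of the ST bit, which is the
behavior the commit message describes. It does not touch real hardware
and omits the wait for the engine to stop.

```c
#include <assert.h>
#include <stdint.h>

#define CMD_ST         (1u << 0)    /* PxCMD start bit */
#define FBS_RESET_MASK 0x00000f01u  /* PxFBS bits[11:8] and bit 0 */

/* Toy model of the two port registers involved. */
struct fake_port {
	uint32_t cmd;
	uint32_t fbs;
};

/* Model of the erratum: a 1->0 edge on ST clears PxFBS bits[11:8, 0]. */
static void fake_write_cmd(struct fake_port *p, uint32_t val)
{
	assert(p);
	if ((p->cmd & CMD_ST) && !(val & CMD_ST))
		p->fbs &= ~FBS_RESET_MASK;
	p->cmd = val;
}

/* Stop without the workaround: the PxFBS bits are lost. */
static void stop_plain(struct fake_port *p)
{
	fake_write_cmd(p, p->cmd & ~CMD_ST);
}

/* Stop with the workaround: save PxFBS, clear ST, restore PxFBS. */
static void stop_with_workaround(struct fake_port *p)
{
	uint32_t saved_fbs = p->fbs;

	fake_write_cmd(p, p->cmd & ~CMD_ST);
	p->fbs = saved_fbs;
}
```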

[PATCH 1/2] libahci: Allow drivers to override stop_engine

2018-04-23 Thread xswang

From: Evan Wang 

Marvell armada37xx, armada7k and armada8k share the same
AHCI SATA controller IP, and currently there is an issue
(Errata Ref#226) where a SATA disk cannot be detected through a
SATA Port Multiplier (PMP). Debugging showed that the value of
the Port-x FIS-based Switching Control register (PxFBS@0x40)
becomes wrong.
By design, bits[11:8, 0] of register PxFBS are cleared when
Port Command and Status (0x18) bit[0] changes its value from
1 to 0, i.e. the falling edge of Port Command and Status bit[0]
sends a pulse that resets PxFBS bits[11:8, 0].
So the port PxFBS register needs to be saved before the PxCMD
ST write and restored afterwards in ahci_stop_engine().

This commit allows drivers to override the ahci_stop_engine
behavior, for use by the Marvell AHCI driver (and potentially
other drivers in the future).
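
The override pattern itself is an optional function pointer that falls
back to a default. A minimal user-space sketch follows. The types and
names here (struct port, struct host_priv, default_stop_engine,
quirky_stop_engine, save_initial_config) are hypothetical simplifications
and not the libahci definitions. A driver that needs special handling
sets the pointer before the common setup runs, every other driver gets
the default, and callers always go through the pointer.

```c
#include <assert.h>
#include <stddef.h>

struct port {
	int engine_running;
	int quirk_applied;
};

struct host_priv {
	/* Optional override; NULL means "use the default". */
	int (*stop_engine)(struct port *p);
};

static int default_stop_engine(struct port *p)
{
	p->engine_running = 0;
	return 0;
}

/* A driver-specific variant that wraps the default behavior. */
static int quirky_stop_engine(struct port *p)
{
	p->quirk_applied = 1;
	return default_stop_engine(p);
}

/* Common setup: install the default only if nothing was set. */
static void save_initial_config(struct host_priv *h)
{
	assert(h != NULL);
	if (!h->stop_engine)
		h->stop_engine = default_stop_engine;
}
```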

Signed-off-by: Evan Wang 
Suggested-by: Ofer Heifetz 
Cc: Tejun Heo 
Cc: Thomas Petazzoni 
---
 drivers/ata/ahci.c  |  6 +++---
 drivers/ata/ahci.h  |  7 +++
 drivers/ata/ahci_qoriq.c|  2 +-
 drivers/ata/ahci_xgene.c|  4 ++--
 drivers/ata/libahci.c   | 20 
 drivers/ata/sata_highbank.c |  2 +-
 6 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 1ff1779..6389c88 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -698,7 +698,7 @@ static int ahci_vt8251_hardreset(struct ata_link *link, unsigned int *class,
 
DPRINTK("ENTER\n");
 
-   ahci_stop_engine(ap);
+   hpriv->stop_engine(ap);
 
rc = sata_link_hardreset(link, sata_ehc_deb_timing(&link->eh_context),
 deadline, &online, NULL);
@@ -724,7 +724,7 @@ static int ahci_p5wdh_hardreset(struct ata_link *link, unsigned int *class,
bool online;
int rc;
 
-   ahci_stop_engine(ap);
+   hpriv->stop_engine(ap);
 
/* clear D2H reception area to properly wait for D2H FIS */
ata_tf_init(link->device, &tf);
@@ -788,7 +788,7 @@ static int ahci_avn_hardreset(struct ata_link *link, unsigned int *class,
 
DPRINTK("ENTER\n");
 
-   ahci_stop_engine(ap);
+   hpriv->stop_engine(ap);
 
for (i = 0; i < 2; i++) {
u16 val;
diff --git a/drivers/ata/ahci.h b/drivers/ata/ahci.h
index 4356ef1..7d88ede 100644
--- a/drivers/ata/ahci.h
+++ b/drivers/ata/ahci.h
@@ -366,6 +366,13 @@ struct ahci_host_priv {
 * be overridden anytime before the host is activated.
 */
void(*start_engine)(struct ata_port *ap);
+   /*
+* Optional ahci_stop_engine override, if not set this gets set to the
+* default ahci_stop_engine during ahci_save_initial_config, this can
+* be overridden anytime before the host is activated.
+*/
+   int (*stop_engine)(struct ata_port *ap);
+
irqreturn_t (*irq_handler)(int irq, void *dev_instance);
 
/* only required for per-port MSI(-X) support */
diff --git a/drivers/ata/ahci_qoriq.c b/drivers/ata/ahci_qoriq.c
index 2685f28..cfdef4d 100644
--- a/drivers/ata/ahci_qoriq.c
+++ b/drivers/ata/ahci_qoriq.c
@@ -96,7 +96,7 @@ static int ahci_qoriq_hardreset(struct ata_link *link, unsigned int *class,
 
DPRINTK("ENTER\n");
 
-   ahci_stop_engine(ap);
+   hpriv->stop_engine(ap);
 
/*
 * There is a errata on ls1021a Rev1.0 and Rev2.0 which is:
diff --git a/drivers/ata/ahci_xgene.c b/drivers/ata/ahci_xgene.c
index c2b5941..ad58da7 100644
--- a/drivers/ata/ahci_xgene.c
+++ b/drivers/ata/ahci_xgene.c
@@ -165,7 +165,7 @@ static int xgene_ahci_restart_engine(struct ata_port *ap)
PORT_CMD_ISSUE, 0x0, 1, 100))
  return -EBUSY;
 
-   ahci_stop_engine(ap);
+   hpriv->stop_engine(ap);
ahci_start_fis_rx(ap);
 
/*
@@ -421,7 +421,7 @@ static int xgene_ahci_hardreset(struct ata_link *link, unsigned int *class,
portrxfis_saved = readl(port_mmio + PORT_FIS_ADDR);
portrxfishi_saved = readl(port_mmio + PORT_FIS_ADDR_HI);
 
-   ahci_stop_engine(ap);
+   hpriv->stop_engine(ap);
 
rc = xgene_ahci_do_hardreset(link, deadline, &online);
 
diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index 7adcf3c..e5d9097 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -560,6 +560,9 @@ void ahci_save_initial_config(struct device *dev, struct ahci_host_priv *hpriv)
if (!hpriv->start_engine)
hpriv->start_engine = ahci_start_engine;
 
+   if (!hpriv->stop_engine)
+   hpriv->stop_engine = ahci_stop_engine;
+
if (!hpriv->irq_handler)
hpriv->irq_handler = ahci_single_level_irq_intr;
 }
@@ -897,9 +900,10 @@ static void ahci_start_port(struct ata_port *ap)
 static int ahci_deinit_port(struct ata_port *ap, const char **emsg)
 {
int rc;
+   struct ahci_host_priv *hpriv = ap->host->private_data;

[PATCH 1/6] powerpc/64s: Add barrier_nospec

2018-04-23 Thread Michael Ellerman

From: Michal Suchanek 

A no-op form of ori (or immediate of 0 into r31 and the result stored
in r31) has been re-tasked as a speculation barrier. The instruction
only acts as a barrier on newer machines with appropriate firmware
support. On older CPUs it remains a harmless no-op.

Implement barrier_nospec using this instruction.

mpe: The semantics of the instruction are believed to be that it
prevents execution of subsequent instructions until preceding branches
have been fully resolved and are no longer executing speculatively.
There is no further documentation available at this time.

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
mpe: Make it Book3S64 only, update comment & change log, add a
 memory clobber to the asm.
---
 arch/powerpc/include/asm/barrier.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index c7c63959ba91..e582d2c88092 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -76,6 +76,21 @@ do { \
___p1;  \
 })
 
+#ifdef CONFIG_PPC_BOOK3S_64
+/*
+ * Prevent execution of subsequent instructions until preceding branches have
+ * been fully resolved and are no longer executing speculatively.
+ */
+#define barrier_nospec_asm ori 31,31,0
+
+// This also acts as a compiler barrier due to the memory clobber.
+#define barrier_nospec() asm (stringify_in_c(barrier_nospec_asm) ::: "memory")
+
+#else /* !CONFIG_PPC_BOOK3S_64 */
+#define barrier_nospec_asm
+#define barrier_nospec()
+#endif
+
 #include 
 
 #endif /* _ASM_POWERPC_BARRIER_H */
-- 
2.14.1
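
The pattern this barrier targets can be sketched in user space as a
bounds check followed by a dependent load. This is only an illustration
under stated assumptions: table_lookup() is a hypothetical helper, and the
non-powerpc fallback (a plain compiler barrier) exists only so the sketch
builds on any architecture. Neither is part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative stand-in for the kernel macro. On 64-bit powerpc it emits
 * the same "ori 31,31,0" instruction as the patch; elsewhere it is only a
 * compiler barrier (an assumption, so that the sketch builds anywhere).
 */
#if defined(__powerpc64__)
#define barrier_nospec() __asm__ __volatile__("ori 31,31,0" ::: "memory")
#else
#define barrier_nospec() __asm__ __volatile__("" ::: "memory")
#endif

/*
 * Hypothetical helper: copy table[idx] into *out if idx is in range.
 * Returns 0 on success, -1 if idx is out of bounds.
 */
int table_lookup(const int *table, size_t len, size_t idx, int *out)
{
	assert(table != NULL && out != NULL);

	if (idx >= len)
		return -1;

	/* Keep the load below from running ahead of the bounds check. */
	barrier_nospec();

	*out = table[idx];
	return 0;
}
```

On machines where the instruction is a no-op the function behaves exactly
like an unprotected lookup, which matches the "harmless no-op" wording in
the change log.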

[PATCH 6/6] powerpc/64: Use barrier_nospec in syscall entry

2018-04-23 Thread Michael Ellerman

Our syscall entry is done in assembly so patch in an explicit
barrier_nospec.

Based on a patch by Michal Suchanek.

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
mpe: Move the barrier to immediately prior to the vulnerable load,
 and add a comment trying to explain why. Drop the barrier from
 syscall_dotrace, because that syscall number comes from the
 kernel.
---
 arch/powerpc/kernel/entry_64.S | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 51695608c68b..de30f9a34c0c 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #ifdef CONFIG_PPC_BOOK3S
 #include 
@@ -178,6 +179,15 @@ system_call:   /* label this so stack 
traces look sane */
clrldi  r8,r8,32
 15:
slwi    r0,r0,4
+
+   barrier_nospec_asm
+   /*
+* Prevent the load of the handler below (based on the user-passed
+* system call number) being speculatively executed until the test
+* against NR_syscalls and branch to .Lsyscall_enosys above has
+* committed.
+*/
+
ldx r12,r11,r0  /* Fetch system call handler [ptr] */
mtctr   r12
bctrl   /* Call handler */
-- 
2.14.1

[PATCH 5/6] powerpc: Use barrier_nospec in copy_from_user()

2018-04-23 Thread Michael Ellerman

Based on the x86 commit doing the same.

See commit 304ec1b05031 ("x86/uaccess: Use __uaccess_begin_nospec()
and uaccess_try_nospec") and b3bbfb3fb5d2 ("x86: Introduce
__uaccess_begin_nospec() and uaccess_try_nospec") for more detail.

In all cases we are ordering the load from the potentially
user-controlled pointer vs a previous branch based on an access_ok()
check or similar.

Based on a patch from Michal Suchanek.

Signed-off-by: Michal Suchanek 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/uaccess.h | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index a62ee663b2c8..6dc3d2eeea4a 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -252,6 +252,7 @@ do {
\
__chk_user_ptr(ptr);\
if (!is_kernel_addr((unsigned long)__gu_addr))  \
might_fault();  \
+   barrier_nospec();   \
__get_user_size(__gu_val, __gu_addr, (size), __gu_err); \
(x) = (__typeof__(*(ptr)))__gu_val; \
__gu_err;   \
@@ -263,8 +264,10 @@ do {   
\
unsigned long  __gu_val = 0;\
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
might_fault();  \
-   if (access_ok(VERIFY_READ, __gu_addr, (size)))  \
+   if (access_ok(VERIFY_READ, __gu_addr, (size))) {\
+   barrier_nospec();   \
__get_user_size(__gu_val, __gu_addr, (size), __gu_err); \
+   }   \
(x) = (__force __typeof__(*(ptr)))__gu_val; 
\
__gu_err;   \
 })
@@ -275,6 +278,7 @@ do {
\
unsigned long __gu_val; \
const __typeof__(*(ptr)) __user *__gu_addr = (ptr); \
__chk_user_ptr(ptr);\
+   barrier_nospec();   \
__get_user_size(__gu_val, __gu_addr, (size), __gu_err); \
(x) = (__force __typeof__(*(ptr)))__gu_val; \
__gu_err;   \
@@ -302,15 +306,19 @@ static inline unsigned long raw_copy_from_user(void *to,
 
switch (n) {
case 1:
+   barrier_nospec();
__get_user_size(*(u8 *)to, from, 1, ret);
break;
case 2:
+   barrier_nospec();
__get_user_size(*(u16 *)to, from, 2, ret);
break;
case 4:
+   barrier_nospec();
__get_user_size(*(u32 *)to, from, 4, ret);
break;
case 8:
+   barrier_nospec();
__get_user_size(*(u64 *)to, from, 8, ret);
break;
}
@@ -318,6 +326,7 @@ static inline unsigned long raw_copy_from_user(void *to,
return 0;
}
 
+   barrier_nospec();
return __copy_tofrom_user((__force void __user *)to, from, n);
 }
 
-- 
2.14.1

[PATCH 4/6] powerpc/64s: Enable barrier_nospec based on firmware settings

2018-04-23 Thread Michael Ellerman

From: Michal Suchanek 

Check what firmware told us and enable/disable the barrier_nospec as
appropriate.

We err on the side of enabling the barrier, as it's a no-op on older
systems; see the comment for more detail.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/setup.h   |  1 +
 arch/powerpc/kernel/security.c | 60 ++
 arch/powerpc/platforms/powernv/setup.c |  1 +
 arch/powerpc/platforms/pseries/setup.c |  1 +
 4 files changed, 63 insertions(+)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index 4335cddc1cf2..aeb175e8a525 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -52,6 +52,7 @@ enum l1d_flush_type {
 
 void setup_rfi_flush(enum l1d_flush_type, bool enable);
 void do_rfi_flush_fixups(enum l1d_flush_type types);
+void setup_barrier_nospec(void);
 void do_barrier_nospec_fixups(bool enable);
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c
index b963eae0b0a0..d1b9639e5e24 100644
--- a/arch/powerpc/kernel/security.c
+++ b/arch/powerpc/kernel/security.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 
@@ -22,6 +23,65 @@ static void enable_barrier_nospec(bool enable)
do_barrier_nospec_fixups(enable);
 }
 
+void setup_barrier_nospec(void)
+{
+   bool enable;
+
+   /*
+* It would make sense to check SEC_FTR_SPEC_BAR_ORI31 below as well.
+* But there's a good reason not to. The two flags we check below are
+* both enabled by default in the kernel, so if the hcall is not
+* functional they will be enabled.
+* On a system where the host firmware has been updated (so the ori
+* functions as a barrier), but on which the hypervisor (KVM/Qemu) has
+* not been updated, we would like to enable the barrier. Dropping the
+* check for SEC_FTR_SPEC_BAR_ORI31 achieves that. The only downside is
+* we potentially enable the barrier on systems where the host firmware
+* is not updated, but that's harmless as it's a no-op.
+*/
+   enable = security_ftr_enabled(SEC_FTR_FAVOUR_SECURITY) &&
+security_ftr_enabled(SEC_FTR_BNDS_CHK_SPEC_BAR);
+
+   enable_barrier_nospec(enable);
+}
+
+#ifdef CONFIG_DEBUG_FS
+static int barrier_nospec_set(void *data, u64 val)
+{
+   switch (val) {
+   case 0:
+   case 1:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   if (!!val == !!barrier_nospec_enabled)
+   return 0;
+
+   enable_barrier_nospec(!!val);
+
+   return 0;
+}
+
+static int barrier_nospec_get(void *data, u64 *val)
+{
+   *val = barrier_nospec_enabled ? 1 : 0;
+   return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_barrier_nospec,
+   barrier_nospec_get, barrier_nospec_set, "%llu\n");
+
+static __init int barrier_nospec_debugfs_init(void)
+{
+   debugfs_create_file("barrier_nospec", 0600, powerpc_debugfs_root, NULL,
+   &fops_barrier_nospec);
+   return 0;
+}
+device_initcall(barrier_nospec_debugfs_init);
+#endif /* CONFIG_DEBUG_FS */
+
 ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, 
char *buf)
 {
bool thread_priv;
diff --git a/arch/powerpc/platforms/powernv/setup.c 
b/arch/powerpc/platforms/powernv/setup.c
index ef8c9ce53a61..e2ca5f77a55f 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -124,6 +124,7 @@ static void pnv_setup_rfi_flush(void)
  security_ftr_enabled(SEC_FTR_L1D_FLUSH_HV));
 
setup_rfi_flush(type, enable);
+   setup_barrier_nospec();
 }
 
 static void __init pnv_setup_arch(void)
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index b55ad4286dc7..63b1f0d10ef0 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -534,6 +534,7 @@ void pseries_setup_rfi_flush(void)
 security_ftr_enabled(SEC_FTR_L1D_FLUSH_PR);
 
setup_rfi_flush(types, enable);
+   setup_barrier_nospec();
 }
 
 #ifdef CONFIG_PCI_IOV
-- 
2.14.1
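
The enable decision and the debugfs setter above can be sketched in plain
C as follows. The FTR_* bit values and both function names are
hypothetical stand-ins chosen for illustration (the kernel's SEC_FTR_*
constants and helpers differ); only the logic follows the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values; the kernel's SEC_FTR_* constants differ. */
#define FTR_FAVOUR_SECURITY   (1u << 0)
#define FTR_BNDS_CHK_SPEC_BAR (1u << 1)
#define FTR_SPEC_BAR_ORI31    (1u << 2)

/*
 * Mirrors the decision in setup_barrier_nospec(): both flags must be set,
 * and FTR_SPEC_BAR_ORI31 is deliberately not consulted.
 */
bool want_barrier_nospec(uint32_t features)
{
	return (features & FTR_FAVOUR_SECURITY) &&
	       (features & FTR_BNDS_CHK_SPEC_BAR);
}

/*
 * Mirrors barrier_nospec_set(): only 0 and 1 are accepted.
 * Returns 0 on success, -1 (standing in for -EINVAL) otherwise.
 */
int set_barrier_nospec(bool *enabled, uint64_t val)
{
	assert(enabled != NULL);

	if (val > 1)
		return -1;

	*enabled = (val == 1);
	return 0;
}
```

Leaving the ORI31 flag out of the test means a guest under a
not-yet-updated hypervisor still gets the barrier, at the cost of
executing a no-op on hosts whose firmware is not updated.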

[PATCH 0/2] Allow drivers to override stop_engine

2018-04-23 Thread xswang

From: Evan Wang 

There may be some special operations needed when stopping the engine
for AHCI IP from different vendors, so it is necessary to allow overriding
ahci_stop_engine(), in the same way start_engine() can already be overridden.

Evan Wang (2):
  libahci: Allow drivers to override stop_engine
  ata: ahci: mvebu: override ahci_stop_engine for mvebu AHCI

 drivers/ata/ahci.c  |  6 ++---
 drivers/ata/ahci.h  |  7 ++
 drivers/ata/ahci_mvebu.c| 56 +
 drivers/ata/ahci_qoriq.c|  2 +-
 drivers/ata/ahci_xgene.c|  4 ++--
 drivers/ata/libahci.c   | 20 +---
 drivers/ata/sata_highbank.c |  2 +-
 7 files changed, 82 insertions(+), 15 deletions(-)

-- 
1.9.1
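
The override mechanism described in this cover letter is a function
pointer with a generic fallback. A minimal stand-alone sketch, assuming
simplified hypothetical types (struct port, struct host_priv) and function
names that only imitate the libahci ones:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the libahci structures. */
struct port {
	int stopped;
	int quirk_applied;
};

struct host_priv {
	int (*stop_engine)(struct port *p);
};

/* Generic implementation, used when a driver supplies nothing. */
int generic_stop_engine(struct port *p)
{
	p->stopped = 1;
	return 0;
}

/* A vendor driver does its extra step, then reuses the generic code. */
int vendor_stop_engine(struct port *p)
{
	p->quirk_applied = 1;
	return generic_stop_engine(p);
}

/* Fill in the default only if the driver did not set its own hook. */
void fill_default_ops(struct host_priv *hpriv)
{
	assert(hpriv != NULL);

	if (!hpriv->stop_engine)
		hpriv->stop_engine = generic_stop_engine;
}

/* Callers always go through the pointer, never the generic function. */
int stop_port(struct host_priv *hpriv, struct port *p)
{
	return hpriv->stop_engine(p);
}
```

The point of the indirection is that every call site has to use the
pointer; any caller that still invokes the generic function directly
would bypass the vendor hook.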

Re: [PATCH 6/6] devfreq: rk3399_dmc: register devfreq notification to dmc driver.

2018-04-23 Thread Chanwoo Choi

Hi,

On 2018-04-19 19:40, Enric Balletbo i Serra wrote:
> From: Lin Huang 
> 
> Because dmc may also access the PMU_BUS_IDLE_REQ register, we need to
> ensure that the pd driver and the dmc driver will not access at this
> register at the same time.
> 
> Signed-off-by: Lin Huang 
> Signed-off-by: Enric Balletbo i Serra 
> ---
> 
>  drivers/devfreq/rk3399_dmc.c  | 47 +--
>  drivers/soc/rockchip/pm_domains.c | 31 +++
>  include/soc/rockchip/rk3399_dmc.h | 63 +++
>  3 files changed, 96 insertions(+), 45 deletions(-)
>  create mode 100644 include/soc/rockchip/rk3399_dmc.h
> 
> diff --git a/drivers/devfreq/rk3399_dmc.c b/drivers/devfreq/rk3399_dmc.c
> index 5bfca028eaaf..a1f320634d69 100644
> --- a/drivers/devfreq/rk3399_dmc.c
> +++ b/drivers/devfreq/rk3399_dmc.c
> @@ -27,51 +27,7 @@
>  #include 
>  
>  #include 
> -
> -struct dram_timing {
> - unsigned int ddr3_speed_bin;
> - unsigned int pd_idle;
> - unsigned int sr_idle;
> - unsigned int sr_mc_gate_idle;
> - unsigned int srpd_lite_idle;
> - unsigned int standby_idle;
> - unsigned int auto_pd_dis_freq;
> - unsigned int dram_dll_dis_freq;
> - unsigned int phy_dll_dis_freq;
> - unsigned int ddr3_odt_dis_freq;
> - unsigned int ddr3_drv;
> - unsigned int ddr3_odt;
> - unsigned int phy_ddr3_ca_drv;
> - unsigned int phy_ddr3_dq_drv;
> - unsigned int phy_ddr3_odt;
> - unsigned int lpddr3_odt_dis_freq;
> - unsigned int lpddr3_drv;
> - unsigned int lpddr3_odt;
> - unsigned int phy_lpddr3_ca_drv;
> - unsigned int phy_lpddr3_dq_drv;
> - unsigned int phy_lpddr3_odt;
> - unsigned int lpddr4_odt_dis_freq;
> - unsigned int lpddr4_drv;
> - unsigned int lpddr4_dq_odt;
> - unsigned int lpddr4_ca_odt;
> - unsigned int phy_lpddr4_ca_drv;
> - unsigned int phy_lpddr4_ck_cs_drv;
> - unsigned int phy_lpddr4_dq_drv;
> - unsigned int phy_lpddr4_odt;
> -};
> -
> -struct rk3399_dmcfreq {
> - struct device *dev;
> - struct devfreq *devfreq;
> - struct devfreq_simple_ondemand_data ondemand_data;
> - struct clk *dmc_clk;
> - struct devfreq_event_dev *edev;
> - struct mutex lock;
> - struct dram_timing timing;
> - struct regulator *vdd_center;
> - unsigned long rate, target_rate;
> - unsigned long volt, target_volt;
> -};
> +#include 
>  
>  static int rk3399_dmcfreq_target(struct device *dev, unsigned long *freq,
>u32 flags)
> @@ -394,6 +350,7 @@ static int rk3399_dmcfreq_probe(struct platform_device 
> *pdev)
>  
>   data->dev = dev;
>   platform_set_drvdata(pdev, data);
> + pd_register_notify_to_dmc(data->devfreq);
>  
>   return 0;
>  }
> diff --git a/drivers/soc/rockchip/pm_domains.c 
> b/drivers/soc/rockchip/pm_domains.c
> index 53efc386b1ad..7acc836e7eb7 100644
> --- a/drivers/soc/rockchip/pm_domains.c
> +++ b/drivers/soc/rockchip/pm_domains.c
> @@ -8,6 +8,7 @@
>   * published by the Free Software Foundation.
>   */
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -76,9 +77,13 @@ struct rockchip_pmu {
>   const struct rockchip_pmu_info *info;
>   struct mutex mutex; /* mutex lock for pmu */
>   struct genpd_onecell_data genpd_data;
> + struct devfreq *devfreq;
> + struct notifier_block dmc_nb;
>   struct generic_pm_domain *domains[];
>  };
>  
> +static struct rockchip_pmu *dmc_pmu;
> +
>  #define to_rockchip_pd(gpd) container_of(gpd, struct rockchip_pm_domain, 
> genpd)
>  
>  #define DOMAIN(pwr, status, req, idle, ack, wakeup)  \
> @@ -601,6 +606,30 @@ static int rockchip_pm_add_subdomain(struct rockchip_pmu 
> *pmu,
>   return error;
>  }
>  
> +static int dmc_notify(struct notifier_block *nb, unsigned long event,
> +   void *data)
> +{
> + if (event == DEVFREQ_PRECHANGE)
> + mutex_lock(&dmc_pmu->mutex);
> + else if (event == DEVFREQ_POSTCHANGE)
> + mutex_unlock(&dmc_pmu->mutex);
> +
> + return NOTIFY_OK;
> +}
> +
> +int pd_register_notify_to_dmc(struct devfreq *devfreq)
> +{
> + if (!dmc_pmu)
> + return -EPROBE_DEFER;
> +
> + dmc_pmu->devfreq = devfreq;
> + dmc_pmu->dmc_nb.notifier_call = dmc_notify;
> + devfreq_register_notifier(dmc_pmu->devfreq, &dmc_pmu->dmc_nb,
> +   DEVFREQ_TRANSITION_NOTIFIER);
> + return 0;
> +}
> +EXPORT_SYMBOL(pd_register_notify_to_dmc);

I don't think it is appropriate to define a nonstandard exported function
for just one specific device driver; it makes the code more complicated.
When kernel frameworks interact, we should use the functions that those
frameworks already define.

If drivers/soc/rockchip/pm_domains.c is able to get the devfreq instance
through devicetree, the exported function is not necessary. Sorry, I'm not
sure what the best alternative is.

[snip]

> diff --git

[RFC v4 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on

2018-04-23 Thread Yang Shi

Since tmpfs gained THP support in 4.8, hugetlbfs is no longer the only
filesystem with huge page support. tmpfs can use huge pages via
THP when mounted with the "huge=" mount option.

When applications use huge pages on hugetlbfs, they just need to check the
filesystem magic number, but that is not enough for tmpfs. Make
stat.st_blksize return the huge page size if the filesystem is mounted
with an appropriate "huge=" option, to give applications a hint for
optimizing their behavior with THP.

Some applications may not use THP wisely. For example, QEMU may mmap a
file at a non-huge-page-aligned hint address with MAP_FIXED, which results
in no pages being PMD mapped even though THP is used. Some applications
may mmap a file with a non-huge-page-aligned offset. Both behaviors make
THP pointless.

statfs.f_bsize still returns 4KB for tmpfs since THP could be split, and it
may also fall back to 4KB pages silently if there are not enough huge pages.
Furthermore, a different f_bsize makes the max_blocks and free_blocks
calculation harder without much benefit. Returning the huge page
size via stat.st_blksize sounds good enough.

Since PUD-size huge pages are not yet supported for THP, it just
returns HPAGE_PMD_SIZE for now.

Signed-off-by: Yang Shi 
Cc: "Kirill A. Shutemov" 
Cc: Hugh Dickins 
Cc: Michal Hocko 
Cc: Alexander Viro 
Suggested-by: Christoph Hellwig 
---
v3 --> v4:
* Rework the commit log per the feedback from Michal and Kirill
* Fix build error if CONFIG_TRANSPARENT_HUGEPAGE is disabled
v2 --> v3:
* Use shmem_sb_info.huge instead of global variable per Michal's comment
v1 --> v2:
* Adopted the suggestion from hch to return huge page size via st_blksize
  instead of creating a new flag.

 mm/shmem.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index b859192..19b8055 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -988,6 +988,7 @@ static int shmem_getattr(const struct path *path, struct 
kstat *stat,
 {
struct inode *inode = path->dentry->d_inode;
struct shmem_inode_info *info = SHMEM_I(inode);
+   struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 
if (info->alloced - info->swapped != inode->i_mapping->nrpages) {
spin_lock_irq(&info->lock);
@@ -995,6 +996,11 @@ static int shmem_getattr(const struct path *path, struct 
kstat *stat,
spin_unlock_irq(&info->lock);
}
generic_fillattr(inode, stat);
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
+   if (sbinfo->huge > 0)
+   stat->blksize = HPAGE_PMD_SIZE;
+#endif
+   
return 0;
 }
 
-- 
1.8.3.1
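
From the application side, the hint could be consumed roughly as in the
sketch below. preferred_block_size() and align_up() are hypothetical
helper names, not an existing API; the only interface relied on is stat(2)
and its st_blksize field. Whether st_blksize actually reports the huge
page size depends on this patch being applied and on the mount options.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Return st_blksize for path, or -1 if stat() fails. */
long preferred_block_size(const char *path)
{
	struct stat st;

	assert(path != NULL);

	if (stat(path, &st) != 0)
		return -1;

	return (long)st.st_blksize;
}

/* Round offset up to a multiple of blksize (must be a power of two). */
size_t align_up(size_t offset, size_t blksize)
{
	assert(blksize != 0 && (blksize & (blksize - 1)) == 0);

	return (offset + blksize - 1) & ~(blksize - 1);
}
```

An application would then align its mmap() hint address and file offset
to the reported size, so that the mapping stays eligible for PMD mapping.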

[RFC v4 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on

2018-04-23 Thread Yang Shi

Since tmpfs THP was supported in 4.8, hugetlbfs is not the only
filesystem with huge page support anymore. tmpfs can use huge page via
THP when mounting by "huge=" mount option.

When applications use huge page on hugetlbfs, it just need check the
filesystem magic number, but it is not enough for tmpfs. Make
stat.st_blksize return huge page size if it is mounted by appropriate
"huge=" option to give applications a hint to optimize the behavior with
THP.

Some applications may not do wisely with THP. For example, QEMU may mmap
file on non huge page aligned hint address with MAP_FIXED, which results
in no pages are PMD mapped even though THP is used. Some applications
may mmap file with non huge page aligned offset. Both behaviors make THP
pointless.

statfs.f_bsize still returns 4KB for tmpfs since THP could be split, and it
also may fallback to 4KB page silently if there is not enough huge page.
Furthermore, different f_bsize makes max_blocks and free_blocks
calculation harder but without too much benefit. Returning huge page
size via stat.st_blksize sounds good enough.

Since PUD size huge page for THP has not been supported, now it just
returns HPAGE_PMD_SIZE.

Signed-off-by: Yang Shi 
Cc: "Kirill A. Shutemov" 
Cc: Hugh Dickins 
Cc: Michal Hocko 
Cc: Alexander Viro 
Suggested-by: Christoph Hellwig 
---
v3 --> v4:
* Rework the commit log per the education from Michal and Kirill
* Fix build error if CONFIG_TRANSPARENT_HUGEPAGE is disabled
v2 --> v3:
* Use shmem_sb_info.huge instead of global variable per Michal's comment
v1 --> v2:
* Adopted the suggestion from hch to return huge page size via st_blksize
  instead of creating a new flag.

 mm/shmem.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index b859192..19b8055 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -988,6 +988,7 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
 {
 	struct inode *inode = path->dentry->d_inode;
 	struct shmem_inode_info *info = SHMEM_I(inode);
+	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 
 	if (info->alloced - info->swapped != inode->i_mapping->nrpages) {
 		spin_lock_irq(&info->lock);
@@ -995,6 +996,11 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
 		spin_unlock_irq(&info->lock);
 	}
 	generic_fillattr(inode, stat);
+#ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
+	if (sbinfo->huge > 0)
+		stat->blksize = HPAGE_PMD_SIZE;
+#endif
+
 	return 0;
 }
 
-- 
1.8.3.1

Re: [PATCH 4/6] dt-bindings: devfreq: rk3399_dmc: remove interrupts as is not required.

2018-04-23 Thread Chanwoo Choi

Hi,

On 2018-04-19 19:40, Enric Balletbo i Serra wrote:
> In ATF we already wait for DDR dvfs finish, so don't need to do this in
> kernel, so remove the interrupts properties as is not longer required.
> 
> Signed-off-by: Enric Balletbo i Serra 
> ---
> 
>  Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt b/Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt
> index d83ef821d282..e5307155e901 100644
> --- a/Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt
> +++ b/Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt
> @@ -5,10 +5,6 @@ Required properties:
>  - devfreq-events: Node to get DDR loading, Refer to
>Documentation/devicetree/bindings/devfreq/event/
>rockchip-dfi.txt
> -- interrupts: The CPU interrupt number. The interrupt specifier
> -  format depends on the interrupt controller.
> -  It should be a DCF interrupt. When DDR DVFS finishes
> -  a DCF interrupt is triggered.
>  - clocks: Phandles for clock specified in "clock-names" property
>  - clock-names :   The name of clock used by the DFI, must be
>"pclk_ddr_mon";
> @@ -172,7 +168,6 @@ Example:
>   dmc: dmc {
>   compatible = "rockchip,rk3399-dmc";
>   devfreq-events = <>;
> - interrupts = ;
>   clocks = < SCLK_DDRCLK>;
>   clock-names = "dmc_clk";
>   operating-points-v2 = <_opp_table>;
> 

Patch 3 [1] removes the code related to the IRQ, so this looks good to me.
[1] "[PATCH 3/6] devfreq: rk3399_dmc: remove wait for dcf irq event."

Reviewed-by: Chanwoo Choi 

-- 
Best Regards,
Chanwoo Choi
Samsung Electronics
