Re: [PATCH] concrete /proc/mounts
Al Viro: > Translation: let's generate the entire contents on the first read() and keep > it until the sucker's closed; that way userland wont' see anything changing > under it. Oh, wait... > > NAK. Do you mean that the change can hide other mountpoints which are kept unchanged before/after read()? J. R. Okajima
[PATCH] concrete /proc/mounts
commit 1e83f8634c6efe7dd4e6036ee202ca10bdbca0b3 Author: J. R. Okajima Date: Sat May 25 18:35:13 2019 +0900 concrete /proc/mounts When the size of /proc/mounts exceeds PAGE_SIZE, seq_read() has to release namespace_sem via mounts_op.m_stop(). It means if someone else issues mount(2) or umount(2) and the mounts list got changed, then the continuous getmntent(3) calls show the incomplete mounts list and some entries may not appear in it. This patch generates the full mounts list when mounts_op.m_start() is called, and keep it in the seq_file buffer until the file is closed. The size of the buffer increases if necessary. Other operations m_next, m_stop, m_show become meaningless, but still necessary for the seq_file manner. I don't think the size of the buffer matters because many /proc entries already keep the similar PAGE_SIZE buffer. Increasing /proc/mounts buffer is to keep the correctness of the mount list. Reported-by: Kirill Kolyshkin See-also: https://github.com/kolyshkin/procfs-test Signed-off-by: J. R. Okajima diff --git a/fs/mount.h b/fs/mount.h index f39bc9da4d73..1ffd97696ca9 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -131,9 +131,7 @@ struct proc_mounts { struct mnt_namespace *ns; struct path root; int (*show)(struct seq_file *, struct vfsmount *); - void *cached_mount; - u64 cached_event; - loff_t cached_index; + bool filled; }; extern const struct seq_operations mounts_op; diff --git a/fs/namespace.c b/fs/namespace.c index d18deb4c410b..2984a48cd40f 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1298,46 +1298,77 @@ struct vfsmount *mnt_clone_internal(const struct path *path) #ifdef CONFIG_PROC_FS /* iterator; we want it to have access to namespace_sem, thus here... */ -static void *m_start(struct seq_file *m, loff_t *pos) +static int m_start_fill(struct seq_file *m) { + int err; + size_t last_count; + char *buf; + struct mount *r; struct proc_mounts *p = m->private; down_read(_sem); - if (p->cached_event == p->ns->event) { - void *v = p->cached_mount; - if (*pos == p->cached_index) - return v; - if (*pos == p->cached_index + 1) { - v = seq_list_next(v, >ns->list, >cached_index); - return p->cached_mount = v; + list_for_each_entry(r, >ns->list, mnt_list) { + last_count = m->count; + err = p->show(m, >mnt); + if (unlikely(err < 0)) + break; + if (!seq_has_overflowed(m)) + continue; + + /* expand the buffer */ + buf = kvmalloc(m->size + PAGE_SIZE, GFP_KERNEL); + if (unlikely(!buf)) { + err = -ENOMEM; + break; + } + memcpy(buf, m->buf, last_count); + kvfree(m->buf); + m->buf = buf; + m->size += PAGE_SIZE; + m->count = last_count; + + err = p->show(m, >mnt); + if (unlikely(err < 0)) + break; + else if (unlikely(seq_has_overflowed(m))) { + err = -EFBIG; + break; } } + up_read(_sem); - p->cached_event = p->ns->event; - p->cached_mount = seq_list_start(>ns->list, *pos); - p->cached_index = *pos; - return p->cached_mount; + if (!err) + p->filled = true; + return err; } -static void *m_next(struct seq_file *m, void *v, loff_t *pos) +static void *m_start(struct seq_file *m, loff_t *pos) { + int err; struct proc_mounts *p = m->private; - p->cached_mount = seq_list_next(v, >ns->list, pos); - p->cached_index = *pos; - return p->cached_mount; + if (!p->filled) { + err = m_start_fill(m); + if (unlikely(err)) + return ERR_PTR(err); + } + + return m_start; /* any valid pointer */ +} + +static void *m_next(struct seq_file *m, void *v, loff_t *pos) +{ + return NULL; } static void m_stop(struct seq_file *m, void *v) { - up_read(_sem); + /* empty */ } static int m_show(struct seq_file *m, void *v) { - struct proc_mounts *p = m->private; - struct mount *r = list_entry(v, struct mount, mnt_list); - return p->show(m, >mnt); + return 0; } const struct seq_operations mounts_op = { diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c index 7626ee11b06c..f8aee5cca1b1 100644 --- a/fs/proc_namespace.c +++ b/fs/proc_namespace.c @@ -279,7 +279,6 @@ static int mo
Re: [PATCH] OVL: add honoracl=off mount option.
NeilBrown: > If the upper and lower layers use incompatible ACL formats, it is not > possible to copy the ACL xttr from one to the other, so overlayfs > cannot work with them. > This happens particularly with NFSv4 which uses system.nfs4_acl, and > ext4 which uses system.posix_acl_access. FYI, Aufs had met the same problem many years ago, and I introduced some options called ICEX (Ignore Copyup Error on Xattr). (from the design doc in aufs) For example, the src branch supports ACL but the dst branch doesn't because the dst branch may natively un-support it or temporary un-support it due to "noacl" mount option. Of course, the dst branch fs may NOT return an error even if the XATTR is not supported. It is totally up to the branch fs. Anyway when the aufs internal copy-up gets an error from the dst branch fs, then aufs tries removing the just copied entry and returns the error to the userspace. The worst case of this situation will be all copy-up will fail. For the copy-up operation, there two basic approaches. - copy the specified XATTR only (by category above), and return the error unconditionally if it happens. - copy all XATTR, and ignore the error on the specified category only. In order to support XATTR and to implement the correct behaviour, aufs chooses the latter approach and introduces some new branch attributes, "icexsec", "icexsys", "icextr", "icexusr", and "icexoth". They correspond to the XATTR namespaces (see above). Additionally, to be convenient, "icex" is also provided which means all "icex*" attributes are set (here the word "icex" stands for "ignore copy-error on XATTR"). The meaning of these attributes is to ignore the error from setting XATTR on that branch. Note that aufs tries copying all XATTR unconditionally, and ignores the error from the dst branch according to the specified attributes. But recent nfs4 got changed its behaviour around ACL, and it didn't pass my local tests. I had posted about that, but got no reply. J. R. Okajima
fix v5.0-rc1, SUNRPC: init machine_cred.magic under CONFIG_DEBUG_CREDENTIALS
By the commit, a52458b48af1 2018-12-19 NFS/NFSD/SUNRPC: replace generic creds with 'struct cred'. struct rpc_cred machine_cred was converted to struct cred, but machine_cred.magic is still uninitialized. Without initializing it, I got 'Invalid credentials' error. It is necessary when CONFIG_DEBUG_CREDENTIALS is enabled. CRED: Invalid credentials CRED: At /proj/aufs/aufs4-linux.git/include/linux/cred.h:253 CRED: Specified credentials: 6ca067d8 CRED: ->magic=0, put_addr= (null) CRED: ->usage=1, subscr=0 CRED: ->*uid = { 0,0,0,0 } CRED: ->*gid = { 0,0,0,0 } CRED: ->security is (null) [ cut here ] kernel BUG at /proj/aufs/aufs4-linux.git/kernel/cred.c:825! invalid opcode: [#1] PREEMPT SMP PTI CPU: 0 PID: 24923 Comm: mount.nfs4 Tainted: GW 5.0.0-rc1aufsD+ #906 Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009 RIP: 0010:__invalid_creds+0x4d/0x60 Code: 44 89 ea 4c 89 e6 48 c7 c7 cf 92 49 82 e8 5e 21 05 00 48 c7 c6 e1 92 49 82 48 89 df 65 48 8b 14 25 80 4e 01 00 e8 23 fe ff ff <0f> 0b 48 c7 c7 00 ac 68 82 e8 25 6a 53 00 0f 1f 44 00 00 66 66 66 RSP: 0018:88810bedf918 EFLAGS: 00010293 RAX: RBX: 829016c0 RCX: RDX: RSI: RDI: 8294c428 RBP: 88810bedf930 R08: 0001 R09: R10: 88810bedf930 R11: R12: 824964d0 R13: 00fd R14: 829016c0 R15: 0001 FS: 7f8616b55480() GS:88811ba0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fda79fe7170 CR3: aca9 CR4: 000406f0 Call Trace: nfs4_discover_server_trunking+0x2cb/0x330 nfs4_init_client+0x16e/0x210 ? lockdep_init_map+0x57/0x1d0 ? rpc_wake_up_task_on_wq_queue_action_locked+0x60/0x60 ? nfs_get_client+0x500/0x680 nfs_get_client+0x51d/0x680 nfs4_set_client+0xb9/0x130 nfs4_create_server+0x10d/0x290 nfs4_remote_mount+0x30/0x90 mount_fs+0x51/0x220 ? __init_waitqueue_head+0x3b/0x50 vfs_kern_mount+0x6b/0x190 ? nfs_do_root_mount+0x3c/0xc0 nfs_do_root_mount+0x84/0xc0 nfs4_try_mount+0x37/0x50 nfs_fs_mount+0x2a1/0xa40 ? nfs_clone_super+0x80/0x80 ? nfs_free_parsed_mount_data+0x60/0x60 mount_fs+0x51/0x220 ? nfs_alloc_parsed_mount_data+0xd0/0xd0 ? mount_fs+0x51/0x220 ? __init_waitqueue_head+0x3b/0x50 vfs_kern_mount+0x6b/0x190 ? ns_capable_common+0xc3/0x110 do_mount+0x220/0xf90 ksys_mount+0xea/0x170 __x64_sys_mount+0x25/0x30 do_syscall_64+0x79/0x360 entry_SYSCALL_64_after_hwframe+0x49/0xbe Here is a patch. diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c index 1ff9768f5456..0e5236a3b5e0 100644 --- a/net/sunrpc/auth.c +++ b/net/sunrpc/auth.c @@ -41,6 +41,9 @@ static unsigned long number_cred_unused; static struct cred machine_cred = { .usage = ATOMIC_INIT(1), +#ifdef CONFIG_DEBUG_CREDENTIALS + .magic = CRED_MAGIC +#endif }; /*
fix v5.0-rc1, SUNRPC: init machine_cred.magic under CONFIG_DEBUG_CREDENTIALS
By the commit, a52458b48af1 2018-12-19 NFS/NFSD/SUNRPC: replace generic creds with 'struct cred'. struct rpc_cred machine_cred was converted to struct cred, but machine_cred.magic is still uninitialized. Without initializing it, I got 'Invalid credentials' error. It is necessary when CONFIG_DEBUG_CREDENTIALS is enabled. CRED: Invalid credentials CRED: At /proj/aufs/aufs4-linux.git/include/linux/cred.h:253 CRED: Specified credentials: 6ca067d8 CRED: ->magic=0, put_addr= (null) CRED: ->usage=1, subscr=0 CRED: ->*uid = { 0,0,0,0 } CRED: ->*gid = { 0,0,0,0 } CRED: ->security is (null) [ cut here ] kernel BUG at /proj/aufs/aufs4-linux.git/kernel/cred.c:825! invalid opcode: [#1] PREEMPT SMP PTI CPU: 0 PID: 24923 Comm: mount.nfs4 Tainted: GW 5.0.0-rc1aufsD+ #906 Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009 RIP: 0010:__invalid_creds+0x4d/0x60 Code: 44 89 ea 4c 89 e6 48 c7 c7 cf 92 49 82 e8 5e 21 05 00 48 c7 c6 e1 92 49 82 48 89 df 65 48 8b 14 25 80 4e 01 00 e8 23 fe ff ff <0f> 0b 48 c7 c7 00 ac 68 82 e8 25 6a 53 00 0f 1f 44 00 00 66 66 66 RSP: 0018:88810bedf918 EFLAGS: 00010293 RAX: RBX: 829016c0 RCX: RDX: RSI: RDI: 8294c428 RBP: 88810bedf930 R08: 0001 R09: R10: 88810bedf930 R11: R12: 824964d0 R13: 00fd R14: 829016c0 R15: 0001 FS: 7f8616b55480() GS:88811ba0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fda79fe7170 CR3: aca9 CR4: 000406f0 Call Trace: nfs4_discover_server_trunking+0x2cb/0x330 nfs4_init_client+0x16e/0x210 ? lockdep_init_map+0x57/0x1d0 ? rpc_wake_up_task_on_wq_queue_action_locked+0x60/0x60 ? nfs_get_client+0x500/0x680 nfs_get_client+0x51d/0x680 nfs4_set_client+0xb9/0x130 nfs4_create_server+0x10d/0x290 nfs4_remote_mount+0x30/0x90 mount_fs+0x51/0x220 ? __init_waitqueue_head+0x3b/0x50 vfs_kern_mount+0x6b/0x190 ? nfs_do_root_mount+0x3c/0xc0 nfs_do_root_mount+0x84/0xc0 nfs4_try_mount+0x37/0x50 nfs_fs_mount+0x2a1/0xa40 ? nfs_clone_super+0x80/0x80 ? nfs_free_parsed_mount_data+0x60/0x60 mount_fs+0x51/0x220 ? nfs_alloc_parsed_mount_data+0xd0/0xd0 ? mount_fs+0x51/0x220 ? __init_waitqueue_head+0x3b/0x50 vfs_kern_mount+0x6b/0x190 ? ns_capable_common+0xc3/0x110 do_mount+0x220/0xf90 ksys_mount+0xea/0x170 __x64_sys_mount+0x25/0x30 do_syscall_64+0x79/0x360 entry_SYSCALL_64_after_hwframe+0x49/0xbe Here is a patch. diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c index 1ff9768f5456..0e5236a3b5e0 100644 --- a/net/sunrpc/auth.c +++ b/net/sunrpc/auth.c @@ -41,6 +41,9 @@ static unsigned long number_cred_unused; static struct cred machine_cred = { .usage = ATOMIC_INIT(1), +#ifdef CONFIG_DEBUG_CREDENTIALS + .magic = CRED_MAGIC +#endif }; /*
Q. re-using lock_classes[]
Hello, "Troubleshooting" section in Documentation/locking/lockdep-design.txt describes 1. Repeated module loading and unloading while running the validator will result in lock-class leakage. The issue here is that each load of the module will create a new set of lock classes for that module's locks, but module unloading does not remove old classes (see below discussion of reuse of lock classes for why). Therefore, if that module is loaded and unloaded repeatedly, the number of lock classes will eventually reach the maximum. ::: One might argue that the validator should be modified to allow lock classes to be reused. However, if you are tempted to make this argument, first review the code and think through the changes that would be required, keeping in mind that the lock classes to be removed are likely to be linked into the lock-dependency graph. This turns out to be harder to do than to say. I am wondering these "module unloading does not remove old classes" "the lock classes to be removed are likely to be linked into the lock-dependency graph" sentences are still valid? Does "the lock-dependency graph" mean class->hash_entry class->lock_entry and/or list_entries[]? Those are handled by zap_class() at unloading the module. Here is my question. Doesn't zap_class() make the slot in lock_classes[] logically unused? If so, can we re-use the unused slot by searchng and testing some members in struct lock_class? For example, bool test_unused(class) { return !rcu_access_pointer(class->name) && !rcu_access_pointer(class->key) && list_empty(>lock_entry) && hlist_unhashed(>hash_entry); } Though a new function list_del_init_rcu() for zap_class() will be necessary. J. R. Okajima
Re: [PATCH 14/39] ovl: stack file ops
Al Viro: > I'd managed to push that particular nest of horrors out of mind ;-/ > Having dug out my notes from back then and grepped around... The real > mess is not even /proc/*/maps - it's /proc/*/map_files/* and yes, the > reasons for that kludge are still valid ;-/ ::: > Uses of ->vm_file (and rules for those) are too convoluted to untangle > at the moment. I still would love to get that straightened out, but > it's not this cycle fodder, more's the pity... I don't fully read this thread, but the discussion is related to the file path printed in /proc/$$/maps? If so, as just for your information, here is an approach that aufs took. In linux-v2.6 era, aufs tried implementing mmap by customzing address_space ops, but it was not good and failed completing the implementation. As wel as overlayfs, aufs has two struct file objects for a single a regular file. One is for a virtual aufs' entry, and the other is for a real layer's entry. When a user issues mmap(2) for the virtual file, aufs redirects the request to the real file on the layer internally. So the vm_file points to the real file. It means /proc/$$/maps prints the unexpected file path. Aufs added another struct file* vm_prfile in struct vma. It points to the virtual aufs file, and /proc/$$/maps prints vm_prfile instead of vm_file. Of cource, maintaining vm_prfile is important since vma may be merged or splitted. Still I don't like this approach, but I don't have another better idea, also it works for many years. You can get the patch in aufs4-standalone.git on sourceforge if you want. J. R. Okajima
Re: [PATCH 14/39] ovl: stack file ops
Al Viro: > I'd managed to push that particular nest of horrors out of mind ;-/ > Having dug out my notes from back then and grepped around... The real > mess is not even /proc/*/maps - it's /proc/*/map_files/* and yes, the > reasons for that kludge are still valid ;-/ ::: > Uses of ->vm_file (and rules for those) are too convoluted to untangle > at the moment. I still would love to get that straightened out, but > it's not this cycle fodder, more's the pity... I don't fully read this thread, but the discussion is related to the file path printed in /proc/$$/maps? If so, as just for your information, here is an approach that aufs took. In linux-v2.6 era, aufs tried implementing mmap by customzing address_space ops, but it was not good and failed completing the implementation. As wel as overlayfs, aufs has two struct file objects for a single a regular file. One is for a virtual aufs' entry, and the other is for a real layer's entry. When a user issues mmap(2) for the virtual file, aufs redirects the request to the real file on the layer internally. So the vm_file points to the real file. It means /proc/$$/maps prints the unexpected file path. Aufs added another struct file* vm_prfile in struct vma. It points to the virtual aufs file, and /proc/$$/maps prints vm_prfile instead of vm_file. Of cource, maintaining vm_prfile is important since vma may be merged or splitted. Still I don't like this approach, but I don't have another better idea, also it works for many years. You can get the patch in aufs4-standalone.git on sourceforge if you want. J. R. Okajima
Re: Why is the length of max mount option a page size??
Jungsub Shin: > Do you mean using symlinks or bind mount to reduce length of mount > options?? I can remove "~/tca_agent/image_layer_dir" string. But the > other character means unique ID of layer and i can't compress it or > replace it to another string. This is not fundamental solution. Docker=20 Can't it be a solution? # mount -o bind /long/path0 /br0 # mount -o bind /long/path1 /br1 ::: # mount -t aufs -o br=/br0:/br1:... none /mntpnt or # ln -s /long/path0 /br0 # ln -s /long/path1 /br1 ::: # mount -t aufs -o br=/br0:/br1:... none /mntpnt J. R. Okajima
Re: Why is the length of max mount option a page size??
Jungsub Shin: > Do you mean using symlinks or bind mount to reduce length of mount > options?? I can remove "~/tca_agent/image_layer_dir" string. But the > other character means unique ID of layer and i can't compress it or > replace it to another string. This is not fundamental solution. Docker=20 Can't it be a solution? # mount -o bind /long/path0 /br0 # mount -o bind /long/path1 /br1 ::: # mount -t aufs -o br=/br0:/br1:... none /mntpnt or # ln -s /long/path0 /br0 # ln -s /long/path1 /br1 ::: # mount -t aufs -o br=/br0:/br1:... none /mntpnt J. R. Okajima
Re: Why is the length of max mount option a page size??
Jungsub Shin: > I suffered this problem below environments. > > OS : Ubuntu 18.04 LTS / 4.15.0-20-generic / 64bit > AUFS module parameter : brs =3D 1 / allow_userns =3D N > Mount command : mount -t aufs -o br=3D~/tca_agent/container_rw_dir/51509032= ::: (very long command line) For such long branches, try "remount,append" option. # mount -t aufs -o br=br0:br1:br2...:brN none /mntpnt # mount -o remount,append=brN+1 /mntpnt # mount -o remount,append=brN+2 /mntpnt ::: J. R. Okajima
Re: Why is the length of max mount option a page size??
Jungsub Shin: > I suffered this problem below environments. > > OS : Ubuntu 18.04 LTS / 4.15.0-20-generic / 64bit > AUFS module parameter : brs =3D 1 / allow_userns =3D N > Mount command : mount -t aufs -o br=3D~/tca_agent/container_rw_dir/51509032= ::: (very long command line) For such long branches, try "remount,append" option. # mount -t aufs -o br=br0:br1:br2...:brN none /mntpnt # mount -o remount,append=brN+1 /mntpnt # mount -o remount,append=brN+2 /mntpnt ::: J. R. Okajima
Re: Why is the length of max mount option a page size??
Jungsub Shin: > I know that almost mount option short and page size is enough to > contain almost mount options. but because of options for > unionfs(overlay, aufs), lenght of mount option could be exceed it > and cut. > > I suffer this problem with aufs. aufs's branch option is cut and > fail to aufs mount. I know that aufs is not offical fs in linux, For aufs, it should not happen because of the module parameter 'brs.' (from the aufs manual) -- .SH Module Parameters .TP .B brs=1 | 0 Specifies to use the branch path data file under sysfs or not. If the number of your branches is large or their path is long and you meet the limitation of mount(8) ro /etc/mtab, you need to enable CONFIG_SYSFS and set aufs module parameter brs=1. When this parameter is set as 1, aufs does not show `br:' (or dirs=) mount option through /proc/mounts (and /etc/mtab). So you can keep yourself from the page limitation of mount(8) or /etc/mtab. Aufs shows branch paths through /fs/aufs/si_XXX/brNNN. Actually the file under sysfs has also a size limitation, but I don't think it is harmful. There is one more side effect in setting 1 to this parameter. If you rename your branch, the branch path written in /etc/mtab will be obsoleted and the future remount will meet some error due to the unmatched parameters (Remember that mount(8) may take the options from /etc/mtab and pass them to the systemcall). If you set 1, /etc/mtab will not hold the branch path and you will not meet such trouble. On the other hand, the entries for the branch path under sysfs are generated dynamically. So it must not be obsoleted. But I don't think users want to rename branches so often. If CONFIG_SYSFS is disable, this parameter is always set to 0. -- If you post the details of your environment to aufs-users ML, then I may be able to describe more. J. R. Okajima
Re: Why is the length of max mount option a page size??
Jungsub Shin: > I know that almost mount option short and page size is enough to > contain almost mount options. but because of options for > unionfs(overlay, aufs), lenght of mount option could be exceed it > and cut. > > I suffer this problem with aufs. aufs's branch option is cut and > fail to aufs mount. I know that aufs is not offical fs in linux, For aufs, it should not happen because of the module parameter 'brs.' (from the aufs manual) -- .SH Module Parameters .TP .B brs=1 | 0 Specifies to use the branch path data file under sysfs or not. If the number of your branches is large or their path is long and you meet the limitation of mount(8) ro /etc/mtab, you need to enable CONFIG_SYSFS and set aufs module parameter brs=1. When this parameter is set as 1, aufs does not show `br:' (or dirs=) mount option through /proc/mounts (and /etc/mtab). So you can keep yourself from the page limitation of mount(8) or /etc/mtab. Aufs shows branch paths through /fs/aufs/si_XXX/brNNN. Actually the file under sysfs has also a size limitation, but I don't think it is harmful. There is one more side effect in setting 1 to this parameter. If you rename your branch, the branch path written in /etc/mtab will be obsoleted and the future remount will meet some error due to the unmatched parameters (Remember that mount(8) may take the options from /etc/mtab and pass them to the systemcall). If you set 1, /etc/mtab will not hold the branch path and you will not meet such trouble. On the other hand, the entries for the branch path under sysfs are generated dynamically. So it must not be obsoleted. But I don't think users want to rename branches so often. If CONFIG_SYSFS is disable, this parameter is always set to 0. -- If you post the details of your environment to aufs-users ML, then I may be able to describe more. J. R. Okajima
Re: [RFC PATCH 00/35] overlayfs: stack file operations
Miklos Szeredi: > This patch series reverts the VFS hacks (with the exception of d_path) and I totally agree with removing d_real things. It must be good to the world. If I understand correctly, this series affects file_inode() too. So there may exist more commits to revert such as fea6d2a6 2017-02-14 vfs: Use upper filesystem inode in bprm_fill_uid() Here is another question. Does overlayfs support atomic_open? I remember implementing atomic_open on aufs was rather tricky many years ago, and I am interested in how overlayfs addresses it. J. R. Okajima
Re: [RFC PATCH 00/35] overlayfs: stack file operations
Miklos Szeredi: > This patch series reverts the VFS hacks (with the exception of d_path) and I totally agree with removing d_real things. It must be good to the world. If I understand correctly, this series affects file_inode() too. So there may exist more commits to revert such as fea6d2a6 2017-02-14 vfs: Use upper filesystem inode in bprm_fill_uid() Here is another question. Does overlayfs support atomic_open? I remember implementing atomic_open on aufs was rather tricky many years ago, and I am interested in how overlayfs addresses it. J. R. Okajima
Re: [GIT PULL] overlayfs update for 4.14
Linus Torvalds: > On Wed, Sep 13, 2017 at 11:46 PM, Miklos Szeredi <mik...@szeredi.hu> wrote: > > > > d_real() is currently is *the* overlayfs-op: > > I know. And it's ugly as #%^! hell. > > We don't want to make it uglier. > > And honestly, if you think that "it's only for overlayfs, so I can do > anything I damn well want to this p[iece of shit" is valid, then I > want to re-educate you. That is *not* how the VFS layer works. Miklos, I've just read your experimental patch to remove ->d_real(). I believe it is the way to go. Moreover I'd suggest you to re-consider these things. - d_real_inode(), d_backing_inode(), and locks_inode() I don't think it a good idea for VFS and other modules to care "which inode I should use" issue. Ideally they should call d_inode() and file_inode() only as they used to. - RENAME_WHITEOUT flag for user-space Why does this flag visible to users? It is a pure internal flag, isn't it? When and what for do users specify this flag? Git-grepping the overlayfs related changes in other modules, I'd also suggest you to re-consider these. v4.2 4bacc9c 2015-06-19 overlayfs: Make f_path always point to the overlay and f_inode to the underlay v4.6 54d5ca8 2016-05-10 vfs: add vfs_select_inode() helper d101a12 2016-03-26 fs: add file_dentry() v4.7 a118084 2016-05-20 vfs: add d_real_inode() helper v4.8 c1892c3 2016-08-03 vfs: fix deadlock in file_remove_privs() on overlayfs v4.9 4d0c5ba 2016-09-16 vfs: do get_write_access() on upper layer of overlayfs c568d68 2016-09-16 locks: fix file locking on overlayfs f3fbbb0 2016-09-16 fsnotify: support overlayfs 598e3c8 2016-09-16 vfs: update ovl inode before relatime check v4.12 78757af 2017-04-20 vfs: ftruncate check IS_APPEND() on real upper inode v4.13 ad0af71 2017-07-04 vfs: introduce inode 'inuse' lock J. R. Okajima
Re: [GIT PULL] overlayfs update for 4.14
Linus Torvalds: > On Wed, Sep 13, 2017 at 11:46 PM, Miklos Szeredi wrote: > > > > d_real() is currently is *the* overlayfs-op: > > I know. And it's ugly as #%^! hell. > > We don't want to make it uglier. > > And honestly, if you think that "it's only for overlayfs, so I can do > anything I damn well want to this p[iece of shit" is valid, then I > want to re-educate you. That is *not* how the VFS layer works. Miklos, I've just read your experimental patch to remove ->d_real(). I believe it is the way to go. Moreover I'd suggest you to re-consider these things. - d_real_inode(), d_backing_inode(), and locks_inode() I don't think it a good idea for VFS and other modules to care "which inode I should use" issue. Ideally they should call d_inode() and file_inode() only as they used to. - RENAME_WHITEOUT flag for user-space Why does this flag visible to users? It is a pure internal flag, isn't it? When and what for do users specify this flag? Git-grepping the overlayfs related changes in other modules, I'd also suggest you to re-consider these. v4.2 4bacc9c 2015-06-19 overlayfs: Make f_path always point to the overlay and f_inode to the underlay v4.6 54d5ca8 2016-05-10 vfs: add vfs_select_inode() helper d101a12 2016-03-26 fs: add file_dentry() v4.7 a118084 2016-05-20 vfs: add d_real_inode() helper v4.8 c1892c3 2016-08-03 vfs: fix deadlock in file_remove_privs() on overlayfs v4.9 4d0c5ba 2016-09-16 vfs: do get_write_access() on upper layer of overlayfs c568d68 2016-09-16 locks: fix file locking on overlayfs f3fbbb0 2016-09-16 fsnotify: support overlayfs 598e3c8 2016-09-16 vfs: update ovl inode before relatime check v4.12 78757af 2017-04-20 vfs: ftruncate check IS_APPEND() on real upper inode v4.13 ad0af71 2017-07-04 vfs: introduce inode 'inuse' lock J. R. Okajima
Re: fanotify hangs with AUFS
Ariel Zelivansky: > I just opened bug 196815 on the kernel bugzilla > (https://bugzilla.kernel.org/show_bug.cgi?id=196815). Let me know if > there's anything else I can do It seems that the problem exists in aufs. I will dive into it. Mainline people should not worry about it. Ariel, Aufs is not in mainline, so I don't think it is appropriate to book the report into kernel bugzilla. Read aufs README file please. J. R. Okajima
Re: fanotify hangs with AUFS
Ariel Zelivansky: > I just opened bug 196815 on the kernel bugzilla > (https://bugzilla.kernel.org/show_bug.cgi?id=196815). Let me know if > there's anything else I can do It seems that the problem exists in aufs. I will dive into it. Mainline people should not worry about it. Ariel, Aufs is not in mainline, so I don't think it is appropriate to book the report into kernel bugzilla. Read aufs README file please. J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Jani Nikula: > On Thu, 15 Jun 2017, "J. R. Okajima" <hooanon...@gmail.com> wrote: > > How about v4.11.x series? > > I got v4.11.5, but it doesn't contain the fix. > > Do you have a plan? > > The upstream commit has the proper Cc: stable and Fixes: tags in place, > it just takes a while for the patches to trickle to stable kernels. Now I can see the fix exists in v4.11.7. Thank you. J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Jani Nikula: > On Thu, 15 Jun 2017, "J. R. Okajima" wrote: > > How about v4.11.x series? > > I got v4.11.5, but it doesn't contain the fix. > > Do you have a plan? > > The upstream commit has the proper Cc: stable and Fixes: tags in place, > it just takes a while for the patches to trickle to stable kernels. Now I can see the fix exists in v4.11.7. Thank you. J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Joonas Lahtinen: > Don't worry, it's not lost. It was merged to drm-intel-fixes and thus is in > the pipeline. > > There were some unexpected delays getting fixes in, sorry for that. Thanx, I got linux-v4.12-rc4 and it contains 4681ee2 2017-05-18 drm/i915: Do not sync RCU during shrinking How about v4.11.x series? I got v4.11.5, but it doesn't contain the fix. Do you have a plan? J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Joonas Lahtinen: > Don't worry, it's not lost. It was merged to drm-intel-fixes and thus is in > the pipeline. > > There were some unexpected delays getting fixes in, sorry for that. Thanx, I got linux-v4.12-rc4 and it contains 4681ee2 2017-05-18 drm/i915: Do not sync RCU during shrinking How about v4.11.x series? I got v4.11.5, but it doesn't contain the fix. Do you have a plan? J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
"J. R. Okajima": > I don't know whether the fix is good to me or not yet. I will test your > fix, but I am busy now and my test will be a few weeks later. Other > people may want the fix soon. So I'd suggest you to reproduce the > problem on your side. I guess "mem=1G" or "mem=512M" will make it easier > to reproduce the problem. > Of course, if you are sure the fix is correct, then you don't have to > wait for my test. Release it soon for other people. Now I am going to able to run my local test. But V3 patch failed to apply onto v4.11.0. Would you provide us two versions of the patch, one for v4.11 series and the other of v4.12-rcN? J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
"J. R. Okajima": > I don't know whether the fix is good to me or not yet. I will test your > fix, but I am busy now and my test will be a few weeks later. Other > people may want the fix soon. So I'd suggest you to reproduce the > problem on your side. I guess "mem=1G" or "mem=512M" will make it easier > to reproduce the problem. > Of course, if you are sure the fix is correct, then you don't have to > wait for my test. Release it soon for other people. Now I am going to able to run my local test. But V3 patch failed to apply onto v4.11.0. Would you provide us two versions of the patch, one for v4.11 series and the other of v4.12-rcN? J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Joonas Lahtinen: > Filing a bug in freedesktop.org with all the details is the fastest way > of getting help. Without the bug (and with such little information as > the previous e-mail) it's hard to estimate the extent and nature of the > bug. My original report was http://marc.info/?l=linux-kernel=149313183203325=2 The report contained - kernel command line options - lockdep msg - call trace but it didn't look drm/i915 shrinker is related. It was git-bisect which lead me to drm/i915 shrinker. > I've anyway gone and prepared a patch to drop the RCU sync completely > from shrinker phase, as discussed originally with Chris. Thank you. I don't know whether the fix is good to me or not yet. I will test your fix, but I am busy now and my test will be a few weeks later. Other people may want the fix soon. So I'd suggest you to reproduce the problem on your side. I guess "mem=1G" or "mem=512M" will make it easier to reproduce the problem. Of course, if you are sure the fix is correct, then you don't have to wait for my test. Release it soon for other people. J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Joonas Lahtinen: > Filing a bug in freedesktop.org with all the details is the fastest way > of getting help. Without the bug (and with such little information as > the previous e-mail) it's hard to estimate the extent and nature of the > bug. My original report was http://marc.info/?l=linux-kernel=149313183203325=2 The report contained - kernel command line options - lockdep msg - call trace but it didn't look drm/i915 shrinker is related. It was git-bisect which lead me to drm/i915 shrinker. > I've anyway gone and prepared a patch to drop the RCU sync completely > from shrinker phase, as discussed originally with Chris. Thank you. I don't know whether the fix is good to me or not yet. I will test your fix, but I am busy now and my test will be a few weeks later. Other people may want the fix soon. So I'd suggest you to reproduce the problem on your side. I guess "mem=1G" or "mem=512M" will make it easier to reproduce the problem. Of course, if you are sure the fix is correct, then you don't have to wait for my test. Release it soon for other people. J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Thanx for the reply. Andrea Arcangeli: > Yes I already reported this, my original fix was way more efficient > (and also safer considering the above) than what landed upstream. My > feedback was ignored though. > > https://lists.freedesktop.org/archives/intel-gfx/2017-April/125414.html I see. Actually on my test system for v4.11-rc8, kthreadd, kworker, kswapd and others all stopped working due to the synchronize_rcu_expedited call from i915_gem_shrinker_count. It is definitly a show stopper for me as an i915 user. It was a few weeks ago when you posted. It is a pity the fix was not merged before v4.11 comes out. I know v4.11 will appear soon. So I'd ask i915 developers, would you test Andrea Arcangeli's fix and release it as v4.11.1 as soon as possible? J. R. Okajima
Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Thanx for the reply. Andrea Arcangeli: > Yes I already reported this, my original fix was way more efficient > (and also safer considering the above) than what landed upstream. My > feedback was ignored though. > > https://lists.freedesktop.org/archives/intel-gfx/2017-April/125414.html I see. Actually on my test system for v4.11-rc8, kthreadd, kworker, kswapd and others all stopped working due to the synchronize_rcu_expedited call from i915_gem_shrinker_count. It is definitly a show stopper for me as an i915 user. It was a few weeks ago when you posted. It is a pity the fix was not merged before v4.11 comes out. I know v4.11 will appear soon. So I'd ask i915 developers, would you test Andrea Arcangeli's fix and release it as v4.11.1 as soon as possible? J. R. Okajima
Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Hello, Since v4.11-rc7 I can see the workqueue stops on my development/test system. Git-bisecting tells me the suspicious commit is c053b5a 2017-04-11 drm/i915: Don't call synchronize_rcu_expedited under struct_mutex I am not sure whether this is the real cause or not of my problem, but I have a question. By the commit, the shrinker handlers ->scan_objects() and ->count_objects() both calls synchronize_rcu_expedited() unconditionally. Is it a legal RCU bahavour? I know dev->struct_mutex is unlocked now, but before the commit, these two handlers were not calling synchronize_rcu_expedited(). J. R. Okajima
Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers
Hello, Since v4.11-rc7 I can see the workqueue stops on my development/test system. Git-bisecting tells me the suspicious commit is c053b5a 2017-04-11 drm/i915: Don't call synchronize_rcu_expedited under struct_mutex I am not sure whether this is the real cause or not of my problem, but I have a question. By the commit, the shrinker handlers ->scan_objects() and ->count_objects() both calls synchronize_rcu_expedited() unconditionally. Is it a legal RCU bahavour? I know dev->struct_mutex is unlocked now, but before the commit, these two handlers were not calling synchronize_rcu_expedited(). J. R. Okajima
v4.11-rc7, deadlock? workqueue pool->lock and serial8250 port_lock_key
Hello, I've found this kernel log from v4.11.0-rc7, and all(?) workqueues seem stopped working. The system is using a serial console by the kernel command line root=3D/dev/sda8 ro console=3DttyS0,115200 console=3Dtty0 mem=3D1G Does it mean that show_workqueue_state() on a serial console (almost?) always cause a deadlock? Who I can ask to check this log? I am not sure whether this is related, but I added a workqueue for my development. There is no such message in the log about my workqueue, but I am afraid my development is related to any change around workqueue. J. R. Okajima =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D [ INFO: possible circular locking dependency detected ] 4.11.0-rc7aufsD+ #114 Not tainted --- dockerd/777 is trying to acquire lock: (_lock_key){-.-.-.}, at: [] serial8250_console_wri= te+0x1ad/0x570 but task is already holding lock: (&(>lock)->rlock){-.-.-.}, at: [] show_workqueue_= state+0x1e0/0xd90 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&(>lock)->rlock){-.-.-.}: lock_acquire+0xd5/0x380 _raw_spin_lock+0x32/0x50 __queue_work+0x143/0x1090 queue_work_on+0x38/0x80 put_pwq+0x152/0x2a0 put_pwq_unlocked+0x22/0x30 destroy_workqueue+0x3bd/0x680 nfs4_state_shutdown+0x15/0x30 [nfsd] nfsd_shutdown_generic+0x26/0x50 [nfsd] nfsd_shutdown_net+0x135/0x2b0 [nfsd] nfsd_last_thread+0x21e/0x2d0 [nfsd] svc_shutdown_net+0x4c/0xa0 nfsd_destroy+0x1b3/0x2f0 [nfsd] nfsd+0x28f/0x480 [nfsd] kthread+0x1a0/0x2d0 ret_from_fork+0x2e/0x40 -> #1 (>lock/1){-.-.-.}: lock_acquire+0xd5/0x380 _raw_spin_lock+0x32/0x50 __queue_work+0x143/0x1090 queue_work_on+0x38/0x80 tty_schedule_flip+0x5b/0xb0 tty_flip_buffer_push+0xe/0x10 serial8250_rx_chars+0x19c/0x580 serial8250_handle_irq.part.17+0xd5/0x190 serial8250_default_handle_irq+0x5c/0xa0 serial8250_interrupt+0x83/0x160 __handle_irq_event_percpu+0x73/0x8f0 handle_irq_event_percpu+0x25/0x90 handle_irq_event+0x7d/0x160 handle_edge_irq+0x139/0x9e0 handle_irq+0xa1/0x400 do_IRQ+0x82/0x190 ret_from_intr+0x0/0x19 cpuidle_enter_state+0x150/0x820 cpuidle_enter+0x17/0x20 do_idle+0x25a/0x480 cpu_startup_entry+0x70/0x90 rest_init+0x15e/0x170 start_kernel+0x4e1/0x4ee x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0xe7/0xee verify_cpu+0x0/0xfc -> #0 (_lock_key){-.-.-.}: __lock_acquire+0x1fb7/0x2e00 lock_acquire+0xd5/0x380 _raw_spin_lock_irqsave+0x3d/0x60 serial8250_console_write+0x1ad/0x570 univ8250_console_write+0x36/0x60 console_unlock+0x59b/0xa00 vprintk_emit+0x241/0x430 vprintk_default+0x1f/0x30 vprintk_func+0x2c/0x80 printk+0x43/0x4b show_workqueue_state+0x2a1/0xd90 wq_watchdog_timer_fn+0x245/0x3e0 call_timer_fn+0xd4/0x760 run_timer_softirq+0x541/0xbb0 __do_softirq+0x117/0xcb0 irq_exit+0xf9/0x170 smp_apic_timer_interrupt+0x44/0x70 apic_timer_interrupt+0x90/0xa0 read_hpet+0x107/0x120 ktime_get_ts64+0x115/0x460 poll_select_copy_remaining+0x7e/0x240 SyS_select+0x117/0x160 entry_SYSCALL_64_fastpath+0x18/0xad other info that might help us debug this: Chain exists of: _lock_key --> >lock/1 --> &(>lock)->rlock Possible unsafe locking scenario: CPU0CPU1 lock(&(>lock)->rlock); lock(>lock/1); lock(&(>lock)->rlock); lock(_lock_key); *** DEADLOCK *** 4 locks held by dockerd/777: #0: (/proj/aufs/aufs4-linux.git/kernel/workqueue.c:5320){+.-.-.}, at: [<= 811d278b>] call_timer_fn+0x3b/0x760 #1: (rcu_read_lock_sched){..}, at: [] show_workque= ue_state+0x5/0xd90 #2: (&(>lock)->rlock){-.-.-.}, at: [] show_workq= ueue_state+0x1e0/0xd90 #3: (console_lock){+.+.+.}, at: [] vprintk_emit+0x234/= 0x430 stack backtrace: CPU: 0 PID: 777 Comm: dockerd Not tainted 4.11.0-rc7aufsD+ #114 Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009 Call Trace: dump_stack+0x68/0x9d print_circular_bug+0x2a6/0x2b7 __lock_acquire+0x1fb7/0x2e00 lock_acquire+0xd5/0x380 ? lock_acquire+0xd5/0x380 ? serial8250_console_write+0x1ad/0x570 _raw_spin_lock_irqsave+0x3d/0x60 ? serial8250_console_write+0x1ad/0x570 serial8250_console_write+0x1ad/0x570 ? _raw_spin_unlock+0x69/0x80 univ8250_console_write+0x36/0x60 console_unlock+0x59b/0xa00 ? __down_trylock_console
v4.11-rc7, deadlock? workqueue pool->lock and serial8250 port_lock_key
Hello, I've found this kernel log from v4.11.0-rc7, and all(?) workqueues seem stopped working. The system is using a serial console by the kernel command line root=3D/dev/sda8 ro console=3DttyS0,115200 console=3Dtty0 mem=3D1G Does it mean that show_workqueue_state() on a serial console (almost?) always cause a deadlock? Who I can ask to check this log? I am not sure whether this is related, but I added a workqueue for my development. There is no such message in the log about my workqueue, but I am afraid my development is related to any change around workqueue. J. R. Okajima =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D [ INFO: possible circular locking dependency detected ] 4.11.0-rc7aufsD+ #114 Not tainted --- dockerd/777 is trying to acquire lock: (_lock_key){-.-.-.}, at: [] serial8250_console_wri= te+0x1ad/0x570 but task is already holding lock: (&(>lock)->rlock){-.-.-.}, at: [] show_workqueue_= state+0x1e0/0xd90 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&(>lock)->rlock){-.-.-.}: lock_acquire+0xd5/0x380 _raw_spin_lock+0x32/0x50 __queue_work+0x143/0x1090 queue_work_on+0x38/0x80 put_pwq+0x152/0x2a0 put_pwq_unlocked+0x22/0x30 destroy_workqueue+0x3bd/0x680 nfs4_state_shutdown+0x15/0x30 [nfsd] nfsd_shutdown_generic+0x26/0x50 [nfsd] nfsd_shutdown_net+0x135/0x2b0 [nfsd] nfsd_last_thread+0x21e/0x2d0 [nfsd] svc_shutdown_net+0x4c/0xa0 nfsd_destroy+0x1b3/0x2f0 [nfsd] nfsd+0x28f/0x480 [nfsd] kthread+0x1a0/0x2d0 ret_from_fork+0x2e/0x40 -> #1 (>lock/1){-.-.-.}: lock_acquire+0xd5/0x380 _raw_spin_lock+0x32/0x50 __queue_work+0x143/0x1090 queue_work_on+0x38/0x80 tty_schedule_flip+0x5b/0xb0 tty_flip_buffer_push+0xe/0x10 serial8250_rx_chars+0x19c/0x580 serial8250_handle_irq.part.17+0xd5/0x190 serial8250_default_handle_irq+0x5c/0xa0 serial8250_interrupt+0x83/0x160 __handle_irq_event_percpu+0x73/0x8f0 handle_irq_event_percpu+0x25/0x90 handle_irq_event+0x7d/0x160 handle_edge_irq+0x139/0x9e0 handle_irq+0xa1/0x400 do_IRQ+0x82/0x190 ret_from_intr+0x0/0x19 cpuidle_enter_state+0x150/0x820 cpuidle_enter+0x17/0x20 do_idle+0x25a/0x480 cpu_startup_entry+0x70/0x90 rest_init+0x15e/0x170 start_kernel+0x4e1/0x4ee x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0xe7/0xee verify_cpu+0x0/0xfc -> #0 (_lock_key){-.-.-.}: __lock_acquire+0x1fb7/0x2e00 lock_acquire+0xd5/0x380 _raw_spin_lock_irqsave+0x3d/0x60 serial8250_console_write+0x1ad/0x570 univ8250_console_write+0x36/0x60 console_unlock+0x59b/0xa00 vprintk_emit+0x241/0x430 vprintk_default+0x1f/0x30 vprintk_func+0x2c/0x80 printk+0x43/0x4b show_workqueue_state+0x2a1/0xd90 wq_watchdog_timer_fn+0x245/0x3e0 call_timer_fn+0xd4/0x760 run_timer_softirq+0x541/0xbb0 __do_softirq+0x117/0xcb0 irq_exit+0xf9/0x170 smp_apic_timer_interrupt+0x44/0x70 apic_timer_interrupt+0x90/0xa0 read_hpet+0x107/0x120 ktime_get_ts64+0x115/0x460 poll_select_copy_remaining+0x7e/0x240 SyS_select+0x117/0x160 entry_SYSCALL_64_fastpath+0x18/0xad other info that might help us debug this: Chain exists of: _lock_key --> >lock/1 --> &(>lock)->rlock Possible unsafe locking scenario: CPU0CPU1 lock(&(>lock)->rlock); lock(>lock/1); lock(&(>lock)->rlock); lock(_lock_key); *** DEADLOCK *** 4 locks held by dockerd/777: #0: (/proj/aufs/aufs4-linux.git/kernel/workqueue.c:5320){+.-.-.}, at: [<= 811d278b>] call_timer_fn+0x3b/0x760 #1: (rcu_read_lock_sched){..}, at: [] show_workque= ue_state+0x5/0xd90 #2: (&(>lock)->rlock){-.-.-.}, at: [] show_workq= ueue_state+0x1e0/0xd90 #3: (console_lock){+.+.+.}, at: [] vprintk_emit+0x234/= 0x430 stack backtrace: CPU: 0 PID: 777 Comm: dockerd Not tainted 4.11.0-rc7aufsD+ #114 Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009 Call Trace: dump_stack+0x68/0x9d print_circular_bug+0x2a6/0x2b7 __lock_acquire+0x1fb7/0x2e00 lock_acquire+0xd5/0x380 ? lock_acquire+0xd5/0x380 ? serial8250_console_write+0x1ad/0x570 _raw_spin_lock_irqsave+0x3d/0x60 ? serial8250_console_write+0x1ad/0x570 serial8250_console_write+0x1ad/0x570 ? _raw_spin_unlock+0x69/0x80 univ8250_console_write+0x36/0x60 console_unlock+0x59b/0xa00 ? __down_trylock_console
[tip:locking/core] locking/lockdep: Add new check to lock_downgrade()
Commit-ID: 6419c4af777a773a45a1b1af735de0fcd9a7dcc7 Gitweb: http://git.kernel.org/tip/6419c4af777a773a45a1b1af735de0fcd9a7dcc7 Author: J. R. Okajima <hooanon...@gmail.com> AuthorDate: Fri, 3 Feb 2017 01:38:17 +0900 Committer: Ingo Molnar <mi...@kernel.org> CommitDate: Thu, 16 Mar 2017 09:57:07 +0100 locking/lockdep: Add new check to lock_downgrade() Commit: f8319483f57f ("locking/lockdep: Provide a type check for lock_is_held") didn't fully cover rwsems as downgrade_write() was left out. Introduce lock_downgrade() and use it to add new checks. See-also: http://marc.info/?l=linux-kernel=148581164003149=2 Originally-written-by: Peter Zijlstra <pet...@infradead.org> Signed-off-by: J. R. Okajima <hooanon...@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org> Cc: Andrew Morton <a...@linux-foundation.org> Cc: Linus Torvalds <torva...@linux-foundation.org> Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Thomas Gleixner <t...@linutronix.de> Link: http://lkml.kernel.org/r/1486053497-9948-3-git-send-email-hooanon...@gmail.com [ Rewrote the changelog. ] Signed-off-by: Ingo Molnar <mi...@kernel.org> --- include/linux/lockdep.h | 3 +++ kernel/locking/lockdep.c | 55 kernel/locking/rwsem.c | 6 ++ 3 files changed, 60 insertions(+), 4 deletions(-) diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 1e327bb..fffe49f 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_map *lock, lock_set_class(lock, lock->name, lock->key, subclass, ip); } +extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip); + extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask); extern void lockdep_clear_current_reclaim_state(void); extern void lockdep_trace_alloc(gfp_t mask); @@ -411,6 +413,7 @@ static inline void lockdep_on(void) # define lock_acquire(l, s, t, r, c, n, i) do { } while (0) # define lock_release(l, n, i) do { } while (0) +# define lock_downgrade(l, i) do { } while (0) # define lock_set_class(l, n, k, s, i) do { } while (0) # define lock_set_subclass(l, s, i)do { } while (0) # define lockdep_set_current_reclaim_state(g) do { } while (0) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index da79548..b1a1cef 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3533,6 +3533,44 @@ __lock_set_class(struct lockdep_map *lock, const char *name, return 1; } +static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + struct task_struct *curr = current; + struct held_lock *hlock; + unsigned int depth; + int i; + + depth = curr->lockdep_depth; + /* +* This function is about (re)setting the class of a held lock, +* yet we're not actually holding any locks. Naughty user! +*/ + if (DEBUG_LOCKS_WARN_ON(!depth)) + return 0; + + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); + + curr->lockdep_depth = i; + curr->curr_chain_key = hlock->prev_chain_key; + + WARN(hlock->read, "downgrading a read lock"); + hlock->read = 1; + hlock->acquire_ip = ip; + + if (reacquire_held_locks(curr, depth, i)) + return 0; + + /* +* I took it apart and put it back together again, except now I have +* these 'spare' parts.. where shall I put them. +*/ + if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth)) + return 0; + return 1; +} + /* * Remove the lock to the list of currently held locks - this gets * called on mutex_unlock()/spin_unlock*() (or on a failed @@ -3759,6 +3797,23 @@ void lock_set_class(struct lockdep_map *lock, const char *name, } EXPORT_SYMBOL_GPL(lock_set_class); +void lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + unsigned long flags; + + if (unlikely(current->lockdep_recursion)) + return; + + raw_local_irq_save(flags); + current->lockdep_recursion = 1; + check_flags(flags); + if (__lock_downgrade(lock, ip)) + check_chain_key(current); + current->lockdep_recursion = 0; + raw_local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(lock_downgrade); + /* * We are not always called with irqs disabled - do that here, * and also avoid lockdep recursion: diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 90a74cc..4d48b1c 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -124,10 +124,8 @@ EXPORT_SYMBOL(up_write); */
[tip:locking/core] locking/lockdep: Add new check to lock_downgrade()
Commit-ID: 6419c4af777a773a45a1b1af735de0fcd9a7dcc7 Gitweb: http://git.kernel.org/tip/6419c4af777a773a45a1b1af735de0fcd9a7dcc7 Author: J. R. Okajima AuthorDate: Fri, 3 Feb 2017 01:38:17 +0900 Committer: Ingo Molnar CommitDate: Thu, 16 Mar 2017 09:57:07 +0100 locking/lockdep: Add new check to lock_downgrade() Commit: f8319483f57f ("locking/lockdep: Provide a type check for lock_is_held") didn't fully cover rwsems as downgrade_write() was left out. Introduce lock_downgrade() and use it to add new checks. See-also: http://marc.info/?l=linux-kernel=148581164003149=2 Originally-written-by: Peter Zijlstra Signed-off-by: J. R. Okajima Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1486053497-9948-3-git-send-email-hooanon...@gmail.com [ Rewrote the changelog. ] Signed-off-by: Ingo Molnar --- include/linux/lockdep.h | 3 +++ kernel/locking/lockdep.c | 55 kernel/locking/rwsem.c | 6 ++ 3 files changed, 60 insertions(+), 4 deletions(-) diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 1e327bb..fffe49f 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_map *lock, lock_set_class(lock, lock->name, lock->key, subclass, ip); } +extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip); + extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask); extern void lockdep_clear_current_reclaim_state(void); extern void lockdep_trace_alloc(gfp_t mask); @@ -411,6 +413,7 @@ static inline void lockdep_on(void) # define lock_acquire(l, s, t, r, c, n, i) do { } while (0) # define lock_release(l, n, i) do { } while (0) +# define lock_downgrade(l, i) do { } while (0) # define lock_set_class(l, n, k, s, i) do { } while (0) # define lock_set_subclass(l, s, i)do { } while (0) # define lockdep_set_current_reclaim_state(g) do { } while (0) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index da79548..b1a1cef 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3533,6 +3533,44 @@ __lock_set_class(struct lockdep_map *lock, const char *name, return 1; } +static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + struct task_struct *curr = current; + struct held_lock *hlock; + unsigned int depth; + int i; + + depth = curr->lockdep_depth; + /* +* This function is about (re)setting the class of a held lock, +* yet we're not actually holding any locks. Naughty user! +*/ + if (DEBUG_LOCKS_WARN_ON(!depth)) + return 0; + + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); + + curr->lockdep_depth = i; + curr->curr_chain_key = hlock->prev_chain_key; + + WARN(hlock->read, "downgrading a read lock"); + hlock->read = 1; + hlock->acquire_ip = ip; + + if (reacquire_held_locks(curr, depth, i)) + return 0; + + /* +* I took it apart and put it back together again, except now I have +* these 'spare' parts.. where shall I put them. +*/ + if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth)) + return 0; + return 1; +} + /* * Remove the lock to the list of currently held locks - this gets * called on mutex_unlock()/spin_unlock*() (or on a failed @@ -3759,6 +3797,23 @@ void lock_set_class(struct lockdep_map *lock, const char *name, } EXPORT_SYMBOL_GPL(lock_set_class); +void lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + unsigned long flags; + + if (unlikely(current->lockdep_recursion)) + return; + + raw_local_irq_save(flags); + current->lockdep_recursion = 1; + check_flags(flags); + if (__lock_downgrade(lock, ip)) + check_chain_key(current); + current->lockdep_recursion = 0; + raw_local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(lock_downgrade); + /* * We are not always called with irqs disabled - do that here, * and also avoid lockdep recursion: diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 90a74cc..4d48b1c 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -124,10 +124,8 @@ EXPORT_SYMBOL(up_write); */ void downgrade_write(struct rw_semaphore *sem) { - /* -* lockdep: a downgraded write will live on as a write -* dependency. -*/ + lock_downgrade(>dep_map, _RET_IP_); + rwsem_set_reader_owned(sem); __downgrade_write(sem); }
[tip:locking/core] locking/lockdep: Factor out the find_held_lock() helper function
Commit-ID: 41c2c5b86a5e1a691ddacfc03b631b87a0b19043 Gitweb: http://git.kernel.org/tip/41c2c5b86a5e1a691ddacfc03b631b87a0b19043 Author: J. R. Okajima <hooanon...@gmail.com> AuthorDate: Fri, 3 Feb 2017 01:38:15 +0900 Committer: Ingo Molnar <mi...@kernel.org> CommitDate: Thu, 16 Mar 2017 09:57:06 +0100 locking/lockdep: Factor out the find_held_lock() helper function A simple consolidataion to factor out repeated patterns. The behaviour should not change. Signed-off-by: J. R. Okajima <hooanon...@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org> Cc: Andrew Morton <a...@linux-foundation.org> Cc: Linus Torvalds <torva...@linux-foundation.org> Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Thomas Gleixner <t...@linutronix.de> Link: http://lkml.kernel.org/r/1486053497-9948-1-git-send-email-hooanon...@gmail.com Signed-off-by: Ingo Molnar <mi...@kernel.org> --- kernel/locking/lockdep.c | 114 ++- 1 file changed, 54 insertions(+), 60 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index a95e5d1..0d28b82 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3437,13 +3437,49 @@ static int match_held_lock(struct held_lock *hlock, struct lockdep_map *lock) return 0; } +/* @depth must not be zero */ +static struct held_lock *find_held_lock(struct task_struct *curr, + struct lockdep_map *lock, + unsigned int depth, int *idx) +{ + struct held_lock *ret, *hlock, *prev_hlock; + int i; + + i = depth - 1; + hlock = curr->held_locks + i; + ret = hlock; + if (match_held_lock(hlock, lock)) + goto out; + + ret = NULL; + for (i--, prev_hlock = hlock--; +i >= 0; +i--, prev_hlock = hlock--) { + /* +* We must not cross into another context: +*/ + if (prev_hlock->irq_context != hlock->irq_context) { + ret = NULL; + break; + } + if (match_held_lock(hlock, lock)) { + ret = hlock; + break; + } + } + +out: + *idx = i; + return ret; +} + static int __lock_set_class(struct lockdep_map *lock, const char *name, struct lock_class_key *key, unsigned int subclass, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; struct lock_class *class; unsigned int depth; int i; @@ -3456,21 +3492,10 @@ __lock_set_class(struct lockdep_map *lock, const char *name, if (DEBUG_LOCKS_WARN_ON(!depth)) return 0; - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock && prev_hlock->irq_context != hlock->irq_context) - break; - if (match_held_lock(hlock, lock)) - goto found_it; - prev_hlock = hlock; - } - return print_unlock_imbalance_bug(curr, lock, ip); + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); -found_it: lockdep_init_map(lock, name, key, 0); class = register_lock_class(lock, subclass, 0); hlock->class_idx = class - lock_classes + 1; @@ -3508,7 +3533,7 @@ static int __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; unsigned int depth; int i; @@ -3527,21 +3552,10 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) * Check whether the lock exists in the current stack * of held locks: */ - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock && prev_hlock->irq_context != hlock->irq_context) - break; - if (match_held_lock(hlock, lock)) - goto found_it; - prev_hlock = hlock; - } - return print_unlock_imbalance_bug(curr, lock, ip); + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock
[tip:locking/core] locking/lockdep: Factor out the validate_held_lock() helper function
Commit-ID: e969970be033841d4c16b2e8ec8a3608347db861 Gitweb: http://git.kernel.org/tip/e969970be033841d4c16b2e8ec8a3608347db861 Author: J. R. Okajima <hooanon...@gmail.com> AuthorDate: Fri, 3 Feb 2017 01:38:16 +0900 Committer: Ingo Molnar <mi...@kernel.org> CommitDate: Thu, 16 Mar 2017 09:57:07 +0100 locking/lockdep: Factor out the validate_held_lock() helper function Behaviour should not change. Signed-off-by: J. R. Okajima <hooanon...@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org> Cc: Andrew Morton <a...@linux-foundation.org> Cc: Linus Torvalds <torva...@linux-foundation.org> Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Thomas Gleixner <t...@linutronix.de> Link: http://lkml.kernel.org/r/1486053497-9948-2-git-send-email-hooanon...@gmail.com Signed-off-by: Ingo Molnar <mi...@kernel.org> --- kernel/locking/lockdep.c | 40 ++-- 1 file changed, 22 insertions(+), 18 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 0d28b82..da79548 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3473,6 +3473,24 @@ out: return ret; } +static int reacquire_held_locks(struct task_struct *curr, unsigned int depth, + int idx) +{ + struct held_lock *hlock; + + for (hlock = curr->held_locks + idx; idx < depth; idx++, hlock++) { + if (!__lock_acquire(hlock->instance, + hlock_class(hlock)->subclass, + hlock->trylock, + hlock->read, hlock->check, + hlock->hardirqs_off, + hlock->nest_lock, hlock->acquire_ip, + hlock->references, hlock->pin_count)) + return 1; + } + return 0; +} + static int __lock_set_class(struct lockdep_map *lock, const char *name, struct lock_class_key *key, unsigned int subclass, @@ -3503,15 +3521,8 @@ __lock_set_class(struct lockdep_map *lock, const char *name, curr->lockdep_depth = i; curr->curr_chain_key = hlock->prev_chain_key; - for (; i < depth; i++) { - hlock = curr->held_locks + i; - if (!__lock_acquire(hlock->instance, - hlock_class(hlock)->subclass, hlock->trylock, - hlock->read, hlock->check, hlock->hardirqs_off, - hlock->nest_lock, hlock->acquire_ip, - hlock->references, hlock->pin_count)) - return 0; - } + if (reacquire_held_locks(curr, depth, i)) + return 0; /* * I took it apart and put it back together again, except now I have @@ -3582,15 +3593,8 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) curr->lockdep_depth = i; curr->curr_chain_key = hlock->prev_chain_key; - for (i++; i < depth; i++) { - hlock = curr->held_locks + i; - if (!__lock_acquire(hlock->instance, - hlock_class(hlock)->subclass, hlock->trylock, - hlock->read, hlock->check, hlock->hardirqs_off, - hlock->nest_lock, hlock->acquire_ip, - hlock->references, hlock->pin_count)) - return 0; - } + if (reacquire_held_locks(curr, depth, i + 1)) + return 0; /* * We had N bottles of beer on the wall, we drank one, but now
[tip:locking/core] locking/lockdep: Factor out the find_held_lock() helper function
Commit-ID: 41c2c5b86a5e1a691ddacfc03b631b87a0b19043 Gitweb: http://git.kernel.org/tip/41c2c5b86a5e1a691ddacfc03b631b87a0b19043 Author: J. R. Okajima AuthorDate: Fri, 3 Feb 2017 01:38:15 +0900 Committer: Ingo Molnar CommitDate: Thu, 16 Mar 2017 09:57:06 +0100 locking/lockdep: Factor out the find_held_lock() helper function A simple consolidataion to factor out repeated patterns. The behaviour should not change. Signed-off-by: J. R. Okajima Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1486053497-9948-1-git-send-email-hooanon...@gmail.com Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 114 ++- 1 file changed, 54 insertions(+), 60 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index a95e5d1..0d28b82 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3437,13 +3437,49 @@ static int match_held_lock(struct held_lock *hlock, struct lockdep_map *lock) return 0; } +/* @depth must not be zero */ +static struct held_lock *find_held_lock(struct task_struct *curr, + struct lockdep_map *lock, + unsigned int depth, int *idx) +{ + struct held_lock *ret, *hlock, *prev_hlock; + int i; + + i = depth - 1; + hlock = curr->held_locks + i; + ret = hlock; + if (match_held_lock(hlock, lock)) + goto out; + + ret = NULL; + for (i--, prev_hlock = hlock--; +i >= 0; +i--, prev_hlock = hlock--) { + /* +* We must not cross into another context: +*/ + if (prev_hlock->irq_context != hlock->irq_context) { + ret = NULL; + break; + } + if (match_held_lock(hlock, lock)) { + ret = hlock; + break; + } + } + +out: + *idx = i; + return ret; +} + static int __lock_set_class(struct lockdep_map *lock, const char *name, struct lock_class_key *key, unsigned int subclass, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; struct lock_class *class; unsigned int depth; int i; @@ -3456,21 +3492,10 @@ __lock_set_class(struct lockdep_map *lock, const char *name, if (DEBUG_LOCKS_WARN_ON(!depth)) return 0; - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock && prev_hlock->irq_context != hlock->irq_context) - break; - if (match_held_lock(hlock, lock)) - goto found_it; - prev_hlock = hlock; - } - return print_unlock_imbalance_bug(curr, lock, ip); + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); -found_it: lockdep_init_map(lock, name, key, 0); class = register_lock_class(lock, subclass, 0); hlock->class_idx = class - lock_classes + 1; @@ -3508,7 +3533,7 @@ static int __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; unsigned int depth; int i; @@ -3527,21 +3552,10 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) * Check whether the lock exists in the current stack * of held locks: */ - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock && prev_hlock->irq_context != hlock->irq_context) - break; - if (match_held_lock(hlock, lock)) - goto found_it; - prev_hlock = hlock; - } - return print_unlock_imbalance_bug(curr, lock, ip); + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); -found_it: if (hlock->instance == lock) lock_release_holdtime(hlock); @@ -3903,7 +3917,7 @@ static void __lock_contended(struct lockdep_map *lock, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *p
[tip:locking/core] locking/lockdep: Factor out the validate_held_lock() helper function
Commit-ID: e969970be033841d4c16b2e8ec8a3608347db861 Gitweb: http://git.kernel.org/tip/e969970be033841d4c16b2e8ec8a3608347db861 Author: J. R. Okajima AuthorDate: Fri, 3 Feb 2017 01:38:16 +0900 Committer: Ingo Molnar CommitDate: Thu, 16 Mar 2017 09:57:07 +0100 locking/lockdep: Factor out the validate_held_lock() helper function Behaviour should not change. Signed-off-by: J. R. Okajima Signed-off-by: Peter Zijlstra (Intel) Cc: Andrew Morton Cc: Linus Torvalds Cc: Paul E. McKenney Cc: Peter Zijlstra Cc: Thomas Gleixner Link: http://lkml.kernel.org/r/1486053497-9948-2-git-send-email-hooanon...@gmail.com Signed-off-by: Ingo Molnar --- kernel/locking/lockdep.c | 40 ++-- 1 file changed, 22 insertions(+), 18 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 0d28b82..da79548 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3473,6 +3473,24 @@ out: return ret; } +static int reacquire_held_locks(struct task_struct *curr, unsigned int depth, + int idx) +{ + struct held_lock *hlock; + + for (hlock = curr->held_locks + idx; idx < depth; idx++, hlock++) { + if (!__lock_acquire(hlock->instance, + hlock_class(hlock)->subclass, + hlock->trylock, + hlock->read, hlock->check, + hlock->hardirqs_off, + hlock->nest_lock, hlock->acquire_ip, + hlock->references, hlock->pin_count)) + return 1; + } + return 0; +} + static int __lock_set_class(struct lockdep_map *lock, const char *name, struct lock_class_key *key, unsigned int subclass, @@ -3503,15 +3521,8 @@ __lock_set_class(struct lockdep_map *lock, const char *name, curr->lockdep_depth = i; curr->curr_chain_key = hlock->prev_chain_key; - for (; i < depth; i++) { - hlock = curr->held_locks + i; - if (!__lock_acquire(hlock->instance, - hlock_class(hlock)->subclass, hlock->trylock, - hlock->read, hlock->check, hlock->hardirqs_off, - hlock->nest_lock, hlock->acquire_ip, - hlock->references, hlock->pin_count)) - return 0; - } + if (reacquire_held_locks(curr, depth, i)) + return 0; /* * I took it apart and put it back together again, except now I have @@ -3582,15 +3593,8 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) curr->lockdep_depth = i; curr->curr_chain_key = hlock->prev_chain_key; - for (i++; i < depth; i++) { - hlock = curr->held_locks + i; - if (!__lock_acquire(hlock->instance, - hlock_class(hlock)->subclass, hlock->trylock, - hlock->read, hlock->check, hlock->hardirqs_off, - hlock->nest_lock, hlock->acquire_ip, - hlock->references, hlock->pin_count)) - return 0; - } + if (reacquire_held_locks(curr, depth, i + 1)) + return 0; /* * We had N bottles of beer on the wall, we drank one, but now
Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
James Bottomley: > With a different owner view, but that's irrelevant to the underlying > inode. Ok, the different ownership is limited within shitfs (or userns, container). Good. I might forget that shiftfs wants to behave like bind-mount. And I noticed that shiftfs setattr() converts uid/gid before calling backend fs' ->setattr(). It is good too. But how about acl? Won't such conversion be necessary for acl too? J. R. Okajima
Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
James Bottomley: > With a different owner view, but that's irrelevant to the underlying > inode. Ok, the different ownership is limited within shitfs (or userns, container). Good. I might forget that shiftfs wants to behave like bind-mount. And I noticed that shiftfs setattr() converts uid/gid before calling backend fs' ->setattr(). It is good too. But how about acl? Won't such conversion be necessary for acl too? J. R. Okajima
Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
James Bottomley: > I realised as I was trimming down the vestigial inode properties in the > patch that actually shiftfs does use the i_ino from the underlying for > userspace. The reason why is that it comes from the getattr call in > stat and that's fully what the underlying filesystem returns (including > the inode number). Let me make sure. - shiftfs has its own inode, but it will never be visible to userspace. - the inode attr visible to users are equivalent to the underlying one, includeing dev:ino pair. right? If so, I am afraid it will make users confused. The dev:ino pair is a system-wide identity, but shiftfs creates the same dev:ino pair with different owner. Though I don't know whether the actual application or LSM exists or not who will be damaged by this situation. For git-status case which I wrote previously, it might not be a problem as long as dev:ino is unchanged from git index. But such filesystem looks weird. J. R. Okajima
Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
James Bottomley: > I realised as I was trimming down the vestigial inode properties in the > patch that actually shiftfs does use the i_ino from the underlying for > userspace. The reason why is that it comes from the getattr call in > stat and that's fully what the underlying filesystem returns (including > the inode number). Let me make sure. - shiftfs has its own inode, but it will never be visible to userspace. - the inode attr visible to users are equivalent to the underlying one, includeing dev:ino pair. right? If so, I am afraid it will make users confused. The dev:ino pair is a system-wide identity, but shiftfs creates the same dev:ino pair with different owner. Though I don't know whether the actual application or LSM exists or not who will be damaged by this situation. For git-status case which I wrote previously, it might not be a problem as long as dev:ino is unchanged from git index. But such filesystem looks weird. J. R. Okajima
Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
James Bottomley: > Yes, I know the problem. However, I believe most current linux > filesystems no longer guarantee stable, for the lifetime of the file, > inode numbers. The usual docker container root is overlayfs, which, > similarly doesn't support stable inode numbers. I see the odd > complaint about docker with overlayfs having unstable inode numbers, > but none seems to have any serious repercussions. I think it serious. Reusing the backend fs' inum is a good approach which Amir wrote. Based on this, I'd suggest you to support the hardlinks. bakend_dentry = lookup_one_len() if (d_inode->i_nlink != 1) shiftfs_inode = ilookup(); if (!shiftfs_inode) { shiftfs_inode = new_inode(); shiftfs_inode->i_ino = bakend_dentry->d_inode->i_ino; } J. R. Okajima
Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
James Bottomley: > Yes, I know the problem. However, I believe most current linux > filesystems no longer guarantee stable, for the lifetime of the file, > inode numbers. The usual docker container root is overlayfs, which, > similarly doesn't support stable inode numbers. I see the odd > complaint about docker with overlayfs having unstable inode numbers, > but none seems to have any serious repercussions. I think it serious. Reusing the backend fs' inum is a good approach which Amir wrote. Based on this, I'd suggest you to support the hardlinks. bakend_dentry = lookup_one_len() if (d_inode->i_nlink != 1) shiftfs_inode = ilookup(); if (!shiftfs_inode) { shiftfs_inode = new_inode(); shiftfs_inode->i_ino = bakend_dentry->d_inode->i_ino; } J. R. Okajima
Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
James Bottomley: > This allows any subtree to be uid/gid shifted and bound elsewhere. It ::: Interesting. But I am afraid that the inconsistency problem of the inode numbers will happen. shiftfs_new_inode() uses get_next_ino() which means - 1st time: inodeA is created and cached, inumA is assigned - after using inodeA, it will be discarded from the cache - 2nd time: inodeA is looked-up again, and another inode number (inumB) is assgined. This inconsistency will not be a problem for the "pure virtual" fs such as procfs and sysfs. But your shiftfs is not pure as them. Shiftfs will be used as a wrapper (or "binder" which means bind-mount) of an orginary filesystem. The symptom of this problem from users perspective will be - find -inum doesn't work - git-status doesn't work, which keeps st_dev and st_ino and compares the current files. Of course they will be limited to when the target dir is huge and/or system memory is low. As long as the inode cache is large enough to hold all necessary inodes, the problem won't happen. If shiftfs will supports exporting via NFS in the future, the consistency of inum will be important too. J. R. Okajima
Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount
James Bottomley: > This allows any subtree to be uid/gid shifted and bound elsewhere. It ::: Interesting. But I am afraid that the inconsistency problem of the inode numbers will happen. shiftfs_new_inode() uses get_next_ino() which means - 1st time: inodeA is created and cached, inumA is assigned - after using inodeA, it will be discarded from the cache - 2nd time: inodeA is looked-up again, and another inode number (inumB) is assgined. This inconsistency will not be a problem for the "pure virtual" fs such as procfs and sysfs. But your shiftfs is not pure as them. Shiftfs will be used as a wrapper (or "binder" which means bind-mount) of an orginary filesystem. The symptom of this problem from users perspective will be - find -inum doesn't work - git-status doesn't work, which keeps st_dev and st_ino and compares the current files. Of course they will be limited to when the target dir is huge and/or system memory is low. As long as the inode cache is large enough to hold all necessary inodes, the problem won't happen. If shiftfs will supports exporting via NFS in the future, the consistency of inum will be important too. J. R. Okajima
Re: [PATCH 3/3] lockdep: new annotation lock_downgrade()
Peter Zijlstra: > > >> kernel/locking/rwsem.c:126:2: error: implicit declaration of functi= on 'lock_downgrade' [-Werror=3Dimplicit-function-declaration] > > lock_downgrade(>dep_map, _RET_IP_); > > ^~ ::: > This is what you need !LOCKDEP stubs for ;-) Ok, here is the update. Just one line added. J. R. Okajima commit 6874cbfb3c4f757efecbeb800bfd4db1050698f6 Author: J. R. Okajima <hooanon...@gmail.com> Date: Thu Feb 2 23:41:38 2017 +0900 lockdep: new annotation lock_downgrade() = The commit f831948 2016-11-30 locking/lockdep: Provide a type check for lock_is_= held didn't fully support rwsem. Here downgrade_write() supports the added = type. = Originally-written-by: Peter Zijlstra <pet...@infradead.org> See-also: http://marc.info/?l=3Dlinux-kernel=3D148581164003149=3D2 Signed-off-by: J. R. Okajima <hooanon...@gmail.com> diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 0345cbf..ba75d06 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_ma= p *lock, lock_set_class(lock, lock->name, lock->key, subclass, ip); } = +extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip); + extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask); extern void lockdep_clear_current_reclaim_state(void); extern void lockdep_trace_alloc(gfp_t mask); @@ -413,6 +415,7 @@ static inline void lockdep_on(void) = # define lock_acquire(l, s, t, r, c, n, i) do { } while (0) # define lock_release(l, n, i) do { } while (0) +# define lock_downgrade(l, i) do { } while (0) # define lock_set_class(l, n, k, s, i) do { } while (0) # define lock_set_subclass(l, s, i)do { } while (0) # define lockdep_set_current_reclaim_state(g) do { } while (0) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 7dc8f8e..6a4a740 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3523,6 +3523,44 @@ __lock_set_class(struct lockdep_map *lock, const ch= ar *name, return 1; } = +static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + struct task_struct *curr =3D current; + struct held_lock *hlock; + unsigned int depth; + int i; + + depth =3D curr->lockdep_depth; + /* +* This function is about (re)setting the class of a held lock, +* yet we're not actually holding any locks. Naughty user! +*/ + if (DEBUG_LOCKS_WARN_ON(!depth)) + return 0; + + hlock =3D find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); + + curr->lockdep_depth =3D i; + curr->curr_chain_key =3D hlock->prev_chain_key; + + WARN(hlock->read, "downgrading a read lock"); + hlock->read =3D 1; + hlock->acquire_ip =3D ip; + + if (validate_held_lock(curr, depth, i)) + return 0; + + /* +* I took it apart and put it back together again, except now I have +* these 'spare' parts.. where shall I put them. +*/ + if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth !=3D depth)) + return 0; + return 1; +} + /* * Remove the lock to the list of currently held locks - this gets * called on mutex_unlock()/spin_unlock*() (or on a failed @@ -3749,6 +3787,23 @@ void lock_set_class(struct lockdep_map *lock, const= char *name, } EXPORT_SYMBOL_GPL(lock_set_class); = +void lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + unsigned long flags; + + if (unlikely(current->lockdep_recursion)) + return; + + raw_local_irq_save(flags); + current->lockdep_recursion =3D 1; + check_flags(flags); + if (__lock_downgrade(lock, ip)) + check_chain_key(current); + current->lockdep_recursion =3D 0; + raw_local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(lock_downgrade); + /* * We are not always called with irqs disabled - do that here, * and also avoid lockdep recursion: diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 45ba475..31db3ef 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -123,10 +123,8 @@ EXPORT_SYMBOL(up_write); */ void downgrade_write(struct rw_semaphore *sem) { - /* -* lockdep: a downgraded write will live on as a write -* dependency. -*/ + lock_downgrade(>dep_map, _RET_IP_); + rwsem_set_reader_owned(sem); __downgrade_write(sem); }
Re: [PATCH 3/3] lockdep: new annotation lock_downgrade()
Peter Zijlstra: > > >> kernel/locking/rwsem.c:126:2: error: implicit declaration of functi= on 'lock_downgrade' [-Werror=3Dimplicit-function-declaration] > > lock_downgrade(>dep_map, _RET_IP_); > > ^~ ::: > This is what you need !LOCKDEP stubs for ;-) Ok, here is the update. Just one line added. J. R. Okajima commit 6874cbfb3c4f757efecbeb800bfd4db1050698f6 Author: J. R. Okajima Date: Thu Feb 2 23:41:38 2017 +0900 lockdep: new annotation lock_downgrade() = The commit f831948 2016-11-30 locking/lockdep: Provide a type check for lock_is_= held didn't fully support rwsem. Here downgrade_write() supports the added = type. = Originally-written-by: Peter Zijlstra See-also: http://marc.info/?l=3Dlinux-kernel=3D148581164003149=3D2 Signed-off-by: J. R. Okajima diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 0345cbf..ba75d06 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_ma= p *lock, lock_set_class(lock, lock->name, lock->key, subclass, ip); } = +extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip); + extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask); extern void lockdep_clear_current_reclaim_state(void); extern void lockdep_trace_alloc(gfp_t mask); @@ -413,6 +415,7 @@ static inline void lockdep_on(void) = # define lock_acquire(l, s, t, r, c, n, i) do { } while (0) # define lock_release(l, n, i) do { } while (0) +# define lock_downgrade(l, i) do { } while (0) # define lock_set_class(l, n, k, s, i) do { } while (0) # define lock_set_subclass(l, s, i)do { } while (0) # define lockdep_set_current_reclaim_state(g) do { } while (0) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 7dc8f8e..6a4a740 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3523,6 +3523,44 @@ __lock_set_class(struct lockdep_map *lock, const ch= ar *name, return 1; } = +static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + struct task_struct *curr =3D current; + struct held_lock *hlock; + unsigned int depth; + int i; + + depth =3D curr->lockdep_depth; + /* +* This function is about (re)setting the class of a held lock, +* yet we're not actually holding any locks. Naughty user! +*/ + if (DEBUG_LOCKS_WARN_ON(!depth)) + return 0; + + hlock =3D find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); + + curr->lockdep_depth =3D i; + curr->curr_chain_key =3D hlock->prev_chain_key; + + WARN(hlock->read, "downgrading a read lock"); + hlock->read =3D 1; + hlock->acquire_ip =3D ip; + + if (validate_held_lock(curr, depth, i)) + return 0; + + /* +* I took it apart and put it back together again, except now I have +* these 'spare' parts.. where shall I put them. +*/ + if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth !=3D depth)) + return 0; + return 1; +} + /* * Remove the lock to the list of currently held locks - this gets * called on mutex_unlock()/spin_unlock*() (or on a failed @@ -3749,6 +3787,23 @@ void lock_set_class(struct lockdep_map *lock, const= char *name, } EXPORT_SYMBOL_GPL(lock_set_class); = +void lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + unsigned long flags; + + if (unlikely(current->lockdep_recursion)) + return; + + raw_local_irq_save(flags); + current->lockdep_recursion =3D 1; + check_flags(flags); + if (__lock_downgrade(lock, ip)) + check_chain_key(current); + current->lockdep_recursion =3D 0; + raw_local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(lock_downgrade); + /* * We are not always called with irqs disabled - do that here, * and also avoid lockdep recursion: diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 45ba475..31db3ef 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -123,10 +123,8 @@ EXPORT_SYMBOL(up_write); */ void downgrade_write(struct rw_semaphore *sem) { - /* -* lockdep: a downgraded write will live on as a write -* dependency. -*/ + lock_downgrade(>dep_map, _RET_IP_); + rwsem_set_reader_owned(sem); __downgrade_write(sem); }
[PATCH 1/3] lockdep: consolidate by new find_held_lock()
A simple consolidataion. The behaviour should not change. Signed-off-by: J. R. Okajima <hooanon...@gmail.com> --- kernel/locking/lockdep.c | 114 ++- 1 file changed, 54 insertions(+), 60 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 85d9222..b7a2001 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3428,13 +3428,49 @@ static int match_held_lock(struct held_lock *hlock, struct lockdep_map *lock) return 0; } +/* @depth must not be zero */ +static struct held_lock *find_held_lock(struct task_struct *curr, + struct lockdep_map *lock, + unsigned int depth, int *idx) +{ + struct held_lock *ret, *hlock, *prev_hlock; + int i; + + i = depth - 1; + hlock = curr->held_locks + i; + ret = hlock; + if (match_held_lock(hlock, lock)) + goto out; + + ret = NULL; + for (i--, prev_hlock = hlock--; +i >= 0; +i--, prev_hlock = hlock--) { + /* +* We must not cross into another context: +*/ + if (prev_hlock->irq_context != hlock->irq_context) { + ret = NULL; + break; + } + if (match_held_lock(hlock, lock)) { + ret = hlock; + break; + } + } + +out: + *idx = i; + return ret; +} + static int __lock_set_class(struct lockdep_map *lock, const char *name, struct lock_class_key *key, unsigned int subclass, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; struct lock_class *class; unsigned int depth; int i; @@ -3447,21 +3483,10 @@ __lock_set_class(struct lockdep_map *lock, const char *name, if (DEBUG_LOCKS_WARN_ON(!depth)) return 0; - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock && prev_hlock->irq_context != hlock->irq_context) - break; - if (match_held_lock(hlock, lock)) - goto found_it; - prev_hlock = hlock; - } - return print_unlock_imbalance_bug(curr, lock, ip); + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); -found_it: lockdep_init_map(lock, name, key, 0); class = register_lock_class(lock, subclass, 0); hlock->class_idx = class - lock_classes + 1; @@ -3499,7 +3524,7 @@ static int __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; unsigned int depth; int i; @@ -3518,21 +3543,10 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) * Check whether the lock exists in the current stack * of held locks: */ - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock && prev_hlock->irq_context != hlock->irq_context) - break; - if (match_held_lock(hlock, lock)) - goto found_it; - prev_hlock = hlock; - } - return print_unlock_imbalance_bug(curr, lock, ip); + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); -found_it: if (hlock->instance == lock) lock_release_holdtime(hlock); @@ -3894,7 +3908,7 @@ static void __lock_contended(struct lockdep_map *lock, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; struct lock_class_stats *stats; unsigned int depth; int i, contention_point, contending_point; @@ -3907,22 +3921,12 @@ __lock_contended(struct lockdep_map *lock, unsigned long ip) if (DEBUG_LOCKS_WARN_ON(!depth)) return; - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock &&
[PATCH 1/3] lockdep: consolidate by new find_held_lock()
A simple consolidataion. The behaviour should not change. Signed-off-by: J. R. Okajima --- kernel/locking/lockdep.c | 114 ++- 1 file changed, 54 insertions(+), 60 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 85d9222..b7a2001 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3428,13 +3428,49 @@ static int match_held_lock(struct held_lock *hlock, struct lockdep_map *lock) return 0; } +/* @depth must not be zero */ +static struct held_lock *find_held_lock(struct task_struct *curr, + struct lockdep_map *lock, + unsigned int depth, int *idx) +{ + struct held_lock *ret, *hlock, *prev_hlock; + int i; + + i = depth - 1; + hlock = curr->held_locks + i; + ret = hlock; + if (match_held_lock(hlock, lock)) + goto out; + + ret = NULL; + for (i--, prev_hlock = hlock--; +i >= 0; +i--, prev_hlock = hlock--) { + /* +* We must not cross into another context: +*/ + if (prev_hlock->irq_context != hlock->irq_context) { + ret = NULL; + break; + } + if (match_held_lock(hlock, lock)) { + ret = hlock; + break; + } + } + +out: + *idx = i; + return ret; +} + static int __lock_set_class(struct lockdep_map *lock, const char *name, struct lock_class_key *key, unsigned int subclass, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; struct lock_class *class; unsigned int depth; int i; @@ -3447,21 +3483,10 @@ __lock_set_class(struct lockdep_map *lock, const char *name, if (DEBUG_LOCKS_WARN_ON(!depth)) return 0; - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock && prev_hlock->irq_context != hlock->irq_context) - break; - if (match_held_lock(hlock, lock)) - goto found_it; - prev_hlock = hlock; - } - return print_unlock_imbalance_bug(curr, lock, ip); + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); -found_it: lockdep_init_map(lock, name, key, 0); class = register_lock_class(lock, subclass, 0); hlock->class_idx = class - lock_classes + 1; @@ -3499,7 +3524,7 @@ static int __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; unsigned int depth; int i; @@ -3518,21 +3543,10 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) * Check whether the lock exists in the current stack * of held locks: */ - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock && prev_hlock->irq_context != hlock->irq_context) - break; - if (match_held_lock(hlock, lock)) - goto found_it; - prev_hlock = hlock; - } - return print_unlock_imbalance_bug(curr, lock, ip); + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); -found_it: if (hlock->instance == lock) lock_release_holdtime(hlock); @@ -3894,7 +3908,7 @@ static void __lock_contended(struct lockdep_map *lock, unsigned long ip) { struct task_struct *curr = current; - struct held_lock *hlock, *prev_hlock; + struct held_lock *hlock; struct lock_class_stats *stats; unsigned int depth; int i, contention_point, contending_point; @@ -3907,22 +3921,12 @@ __lock_contended(struct lockdep_map *lock, unsigned long ip) if (DEBUG_LOCKS_WARN_ON(!depth)) return; - prev_hlock = NULL; - for (i = depth-1; i >= 0; i--) { - hlock = curr->held_locks + i; - /* -* We must not cross into another context: -*/ - if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
[PATCH 2/3] lockdep: consolidate by new validate_held_lock()
A simple consolidataion. The behaviour should not change. Signed-off-by: J. R. Okajima <hooanon...@gmail.com> --- kernel/locking/lockdep.c | 39 +-- 1 file changed, 21 insertions(+), 18 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index b7a2001..7dc8f8e 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3464,6 +3464,23 @@ static struct held_lock *find_held_lock(struct task_struct *curr, return ret; } +static int validate_held_lock(struct task_struct *curr, unsigned int depth, + int idx) +{ + struct held_lock *hlock; + + for (hlock = curr->held_locks + idx; idx < depth; idx++, hlock++) + if (!__lock_acquire(hlock->instance, + hlock_class(hlock)->subclass, + hlock->trylock, + hlock->read, hlock->check, + hlock->hardirqs_off, + hlock->nest_lock, hlock->acquire_ip, + hlock->references, hlock->pin_count)) + return 1; + return 0; +} + static int __lock_set_class(struct lockdep_map *lock, const char *name, struct lock_class_key *key, unsigned int subclass, @@ -3494,15 +3511,8 @@ __lock_set_class(struct lockdep_map *lock, const char *name, curr->lockdep_depth = i; curr->curr_chain_key = hlock->prev_chain_key; - for (; i < depth; i++) { - hlock = curr->held_locks + i; - if (!__lock_acquire(hlock->instance, - hlock_class(hlock)->subclass, hlock->trylock, - hlock->read, hlock->check, hlock->hardirqs_off, - hlock->nest_lock, hlock->acquire_ip, - hlock->references, hlock->pin_count)) - return 0; - } + if (validate_held_lock(curr, depth, i)) + return 0; /* * I took it apart and put it back together again, except now I have @@ -3573,15 +3583,8 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) curr->lockdep_depth = i; curr->curr_chain_key = hlock->prev_chain_key; - for (i++; i < depth; i++) { - hlock = curr->held_locks + i; - if (!__lock_acquire(hlock->instance, - hlock_class(hlock)->subclass, hlock->trylock, - hlock->read, hlock->check, hlock->hardirqs_off, - hlock->nest_lock, hlock->acquire_ip, - hlock->references, hlock->pin_count)) - return 0; - } + if (validate_held_lock(curr, depth, i + 1)) + return 0; /* * We had N bottles of beer on the wall, we drank one, but now -- 2.1.4
[PATCH 2/3] lockdep: consolidate by new validate_held_lock()
A simple consolidataion. The behaviour should not change. Signed-off-by: J. R. Okajima --- kernel/locking/lockdep.c | 39 +-- 1 file changed, 21 insertions(+), 18 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index b7a2001..7dc8f8e 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3464,6 +3464,23 @@ static struct held_lock *find_held_lock(struct task_struct *curr, return ret; } +static int validate_held_lock(struct task_struct *curr, unsigned int depth, + int idx) +{ + struct held_lock *hlock; + + for (hlock = curr->held_locks + idx; idx < depth; idx++, hlock++) + if (!__lock_acquire(hlock->instance, + hlock_class(hlock)->subclass, + hlock->trylock, + hlock->read, hlock->check, + hlock->hardirqs_off, + hlock->nest_lock, hlock->acquire_ip, + hlock->references, hlock->pin_count)) + return 1; + return 0; +} + static int __lock_set_class(struct lockdep_map *lock, const char *name, struct lock_class_key *key, unsigned int subclass, @@ -3494,15 +3511,8 @@ __lock_set_class(struct lockdep_map *lock, const char *name, curr->lockdep_depth = i; curr->curr_chain_key = hlock->prev_chain_key; - for (; i < depth; i++) { - hlock = curr->held_locks + i; - if (!__lock_acquire(hlock->instance, - hlock_class(hlock)->subclass, hlock->trylock, - hlock->read, hlock->check, hlock->hardirqs_off, - hlock->nest_lock, hlock->acquire_ip, - hlock->references, hlock->pin_count)) - return 0; - } + if (validate_held_lock(curr, depth, i)) + return 0; /* * I took it apart and put it back together again, except now I have @@ -3573,15 +3583,8 @@ __lock_release(struct lockdep_map *lock, int nested, unsigned long ip) curr->lockdep_depth = i; curr->curr_chain_key = hlock->prev_chain_key; - for (i++; i < depth; i++) { - hlock = curr->held_locks + i; - if (!__lock_acquire(hlock->instance, - hlock_class(hlock)->subclass, hlock->trylock, - hlock->read, hlock->check, hlock->hardirqs_off, - hlock->nest_lock, hlock->acquire_ip, - hlock->references, hlock->pin_count)) - return 0; - } + if (validate_held_lock(curr, depth, i + 1)) + return 0; /* * We had N bottles of beer on the wall, we drank one, but now -- 2.1.4
[PATCH 3/3] lockdep: new annotation lock_downgrade()
The commit f831948 2016-11-30 locking/lockdep: Provide a type check for lock_is_held didn't fully support rwsem. Here downgrade_write() supports the added type. Originally-written-by: Peter Zijlstra <pet...@infradead.org> See-also: http://marc.info/?l=linux-kernel=148581164003149=2 Signed-off-by: J. R. Okajima <hooanon...@gmail.com> --- include/linux/lockdep.h | 2 ++ kernel/locking/lockdep.c | 55 kernel/locking/rwsem.c | 6 ++ 3 files changed, 59 insertions(+), 4 deletions(-) diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 0345cbf..22f304c 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_map *lock, lock_set_class(lock, lock->name, lock->key, subclass, ip); } +extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip); + extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask); extern void lockdep_clear_current_reclaim_state(void); extern void lockdep_trace_alloc(gfp_t mask); diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 7dc8f8e..6a4a740 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3523,6 +3523,44 @@ __lock_set_class(struct lockdep_map *lock, const char *name, return 1; } +static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + struct task_struct *curr = current; + struct held_lock *hlock; + unsigned int depth; + int i; + + depth = curr->lockdep_depth; + /* +* This function is about (re)setting the class of a held lock, +* yet we're not actually holding any locks. Naughty user! +*/ + if (DEBUG_LOCKS_WARN_ON(!depth)) + return 0; + + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); + + curr->lockdep_depth = i; + curr->curr_chain_key = hlock->prev_chain_key; + + WARN(hlock->read, "downgrading a read lock"); + hlock->read = 1; + hlock->acquire_ip = ip; + + if (validate_held_lock(curr, depth, i)) + return 0; + + /* +* I took it apart and put it back together again, except now I have +* these 'spare' parts.. where shall I put them. +*/ + if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth)) + return 0; + return 1; +} + /* * Remove the lock to the list of currently held locks - this gets * called on mutex_unlock()/spin_unlock*() (or on a failed @@ -3749,6 +3787,23 @@ void lock_set_class(struct lockdep_map *lock, const char *name, } EXPORT_SYMBOL_GPL(lock_set_class); +void lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + unsigned long flags; + + if (unlikely(current->lockdep_recursion)) + return; + + raw_local_irq_save(flags); + current->lockdep_recursion = 1; + check_flags(flags); + if (__lock_downgrade(lock, ip)) + check_chain_key(current); + current->lockdep_recursion = 0; + raw_local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(lock_downgrade); + /* * We are not always called with irqs disabled - do that here, * and also avoid lockdep recursion: diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 45ba475..31db3ef 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -123,10 +123,8 @@ EXPORT_SYMBOL(up_write); */ void downgrade_write(struct rw_semaphore *sem) { - /* -* lockdep: a downgraded write will live on as a write -* dependency. -*/ + lock_downgrade(>dep_map, _RET_IP_); + rwsem_set_reader_owned(sem); __downgrade_write(sem); } -- 2.1.4
[PATCH 3/3] lockdep: new annotation lock_downgrade()
The commit f831948 2016-11-30 locking/lockdep: Provide a type check for lock_is_held didn't fully support rwsem. Here downgrade_write() supports the added type. Originally-written-by: Peter Zijlstra See-also: http://marc.info/?l=linux-kernel=148581164003149=2 Signed-off-by: J. R. Okajima --- include/linux/lockdep.h | 2 ++ kernel/locking/lockdep.c | 55 kernel/locking/rwsem.c | 6 ++ 3 files changed, 59 insertions(+), 4 deletions(-) diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 0345cbf..22f304c 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_map *lock, lock_set_class(lock, lock->name, lock->key, subclass, ip); } +extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip); + extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask); extern void lockdep_clear_current_reclaim_state(void); extern void lockdep_trace_alloc(gfp_t mask); diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 7dc8f8e..6a4a740 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -3523,6 +3523,44 @@ __lock_set_class(struct lockdep_map *lock, const char *name, return 1; } +static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + struct task_struct *curr = current; + struct held_lock *hlock; + unsigned int depth; + int i; + + depth = curr->lockdep_depth; + /* +* This function is about (re)setting the class of a held lock, +* yet we're not actually holding any locks. Naughty user! +*/ + if (DEBUG_LOCKS_WARN_ON(!depth)) + return 0; + + hlock = find_held_lock(curr, lock, depth, ); + if (!hlock) + return print_unlock_imbalance_bug(curr, lock, ip); + + curr->lockdep_depth = i; + curr->curr_chain_key = hlock->prev_chain_key; + + WARN(hlock->read, "downgrading a read lock"); + hlock->read = 1; + hlock->acquire_ip = ip; + + if (validate_held_lock(curr, depth, i)) + return 0; + + /* +* I took it apart and put it back together again, except now I have +* these 'spare' parts.. where shall I put them. +*/ + if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth)) + return 0; + return 1; +} + /* * Remove the lock to the list of currently held locks - this gets * called on mutex_unlock()/spin_unlock*() (or on a failed @@ -3749,6 +3787,23 @@ void lock_set_class(struct lockdep_map *lock, const char *name, } EXPORT_SYMBOL_GPL(lock_set_class); +void lock_downgrade(struct lockdep_map *lock, unsigned long ip) +{ + unsigned long flags; + + if (unlikely(current->lockdep_recursion)) + return; + + raw_local_irq_save(flags); + current->lockdep_recursion = 1; + check_flags(flags); + if (__lock_downgrade(lock, ip)) + check_chain_key(current); + current->lockdep_recursion = 0; + raw_local_irq_restore(flags); +} +EXPORT_SYMBOL_GPL(lock_downgrade); + /* * We are not always called with irqs disabled - do that here, * and also avoid lockdep recursion: diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 45ba475..31db3ef 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -123,10 +123,8 @@ EXPORT_SYMBOL(up_write); */ void downgrade_write(struct rw_semaphore *sem) { - /* -* lockdep: a downgraded write will live on as a write -* dependency. -*/ + lock_downgrade(>dep_map, _RET_IP_); + rwsem_set_reader_owned(sem); __downgrade_write(sem); } -- 2.1.4
Re: Q: lockdep_assert_held_read() after downgrade_write()
Peter Zijlstra: > Does something like the below work better? The annotation in > downgrade_write() would look something like: > > + lock_downgrade(>dep_map, 1, _RET_IP_); > > Not even compile tested and lacks the !LOCKDEP build bits. Thanks for the patch. It seems working expectedly. I began writing a similar patch locally with minor consolidations by adding a new function or two. I will send a patch series. Please review and merge them into v4.10. If you don't like the patch, especially the new function name, feel free to change it. I don't know whether !LOCKDEP build bits are necessary or not. J. R. Okajima
Re: Q: lockdep_assert_held_read() after downgrade_write()
Peter Zijlstra: > Does something like the below work better? The annotation in > downgrade_write() would look something like: > > + lock_downgrade(>dep_map, 1, _RET_IP_); > > Not even compile tested and lacks the !LOCKDEP build bits. Thanks for the patch. It seems working expectedly. I began writing a similar patch locally with minor consolidations by adding a new function or two. I will send a patch series. Please review and merge them into v4.10. If you don't like the patch, especially the new function name, feel free to change it. I don't know whether !LOCKDEP build bits are necessary or not. J. R. Okajima
Re: Q: lockdep_assert_held_read() after downgrade_write()
Jens Axboe: > I don't think you understand how it works. downgrade_write() turns a write > lock into read held. To make that last sequence valid, you'd need: > > down_write(); > downgrade_write(); > lockdep_assert_held_read() > up_read(); > > or just not drop up_write() from the last section. Arg... It is my bonehead mistake that I inserted up_write() before downgrade_write(). Sorry about that. Fortunately Peter Zijlstra reviewed downgrade_write() and sent a patch. Thank you, it passed my first test. Now allow me going on the second test (based upon Peter's patch) - two rwsem, rwA and rwB. - the locking order is rwA first, and then rwB. - good case down_read(rwA) down_read(rwB) up_read(rwB) up_read(rwA) down_write(rwA) down_write(rwB) up_write(rwB) up_write(rwA) - questionable case down_write(rwA) down_write(rwB) downgrade_write(rwA) downgrade_write(rwB) up_read(rwB) up_read(rwA) These two downgrade_write() have their strict order? If so, what is that? Do the added two lines + rwsem_release(>dep_map, 1, _RET_IP_); + rwsem_acquire_read(>dep_map, 0, 0, _RET_IP_); produce a traditional AB-BA deadlock warning, don't they? J. R. Okajima
Re: Q: lockdep_assert_held_read() after downgrade_write()
Jens Axboe: > I don't think you understand how it works. downgrade_write() turns a write > lock into read held. To make that last sequence valid, you'd need: > > down_write(); > downgrade_write(); > lockdep_assert_held_read() > up_read(); > > or just not drop up_write() from the last section. Arg... It is my bonehead mistake that I inserted up_write() before downgrade_write(). Sorry about that. Fortunately Peter Zijlstra reviewed downgrade_write() and sent a patch. Thank you, it passed my first test. Now allow me going on the second test (based upon Peter's patch) - two rwsem, rwA and rwB. - the locking order is rwA first, and then rwB. - good case down_read(rwA) down_read(rwB) up_read(rwB) up_read(rwA) down_write(rwA) down_write(rwB) up_write(rwB) up_write(rwA) - questionable case down_write(rwA) down_write(rwB) downgrade_write(rwA) downgrade_write(rwB) up_read(rwB) up_read(rwA) These two downgrade_write() have their strict order? If so, what is that? Do the added two lines + rwsem_release(>dep_map, 1, _RET_IP_); + rwsem_acquire_read(>dep_map, 0, 0, _RET_IP_); produce a traditional AB-BA deadlock warning, don't they? J. R. Okajima
Q: lockdep_assert_held_read() after downgrade_write()
Peter Zijlstra, May I ask you a question? v4.10-rc1 got a commit f831948 2016-11-30 locking/lockdep: Provide a type check for lock_is_held I've tested a little and lockdep splat a stack trace. { DECLARE_RWSEM(rw); static struct lock_class_key key; lockdep_set_class(, ); down_read(); lockdep_assert_held_read(); up_read(); down_write(); lockdep_assert_held_exclusive(); up_write(); downgrade_write(); lockdep_assert_held_read(); <-- here up_read(); } I was expecting that lockdep_assert_held_read() splat nothing after downgrade_write(). Is this warning an intentional behaviour? Also the final up_read() gives me a warning too. It is produced at lockdep.c:3514:lock_release(): DEBUG_LOCKS_WARN_ON(depth <= 0) As an additional information, I increased some lockdep constants. Do you think this is related? include/linux/lockdep.h +#define MAX_LOCKDEP_SUBCLASSES (8UL + 4) +#define MAX_LOCKDEP_KEYS_BITS (13 + 3) kernel/locking/lockdep_internals.h +#define MAX_LOCKDEP_ENTRIES(32768UL << 5) +#define MAX_LOCKDEP_CHAINS_BITS(16 + 5) +#define MAX_STACK_TRACE_ENTRIES(524288UL << 5) J. R. Okajima
Q: lockdep_assert_held_read() after downgrade_write()
Peter Zijlstra, May I ask you a question? v4.10-rc1 got a commit f831948 2016-11-30 locking/lockdep: Provide a type check for lock_is_held I've tested a little and lockdep splat a stack trace. { DECLARE_RWSEM(rw); static struct lock_class_key key; lockdep_set_class(, ); down_read(); lockdep_assert_held_read(); up_read(); down_write(); lockdep_assert_held_exclusive(); up_write(); downgrade_write(); lockdep_assert_held_read(); <-- here up_read(); } I was expecting that lockdep_assert_held_read() splat nothing after downgrade_write(). Is this warning an intentional behaviour? Also the final up_read() gives me a warning too. It is produced at lockdep.c:3514:lock_release(): DEBUG_LOCKS_WARN_ON(depth <= 0) As an additional information, I increased some lockdep constants. Do you think this is related? include/linux/lockdep.h +#define MAX_LOCKDEP_SUBCLASSES (8UL + 4) +#define MAX_LOCKDEP_KEYS_BITS (13 + 3) kernel/locking/lockdep_internals.h +#define MAX_LOCKDEP_ENTRIES(32768UL << 5) +#define MAX_LOCKDEP_CHAINS_BITS(16 + 5) +#define MAX_STACK_TRACE_ENTRIES(524288UL << 5) J. R. Okajima
Re: [PATCH v4] add u64 number parser
Hello James, Your commit a317178 2016-12-06 parser: add u64 number parser was merged into v4.10-rc1. Good. I have a very similar function and used for a long time. Here I'd suggest you a tiny optimization based on my version. If you think another value is more apropriate as the size of an internal array, you can change it freely. J. R. Okajima commit 030361e78d327d9d89254dc3b320c092221c7cd0 Author: J. R. Okajima <hooanon...@gmail.com> Date: Tue Jan 31 00:01:16 2017 +0900 parser: match_u64, short string optimization = When the given string is short enough, we can skip kmalloc/kfree. Signed-off-by: J. R. Okajima <hooanon...@gmail.com> diff --git a/lib/parser.c b/lib/parser.c index 3278958..cfbc6ec 100644 --- a/lib/parser.c +++ b/lib/parser.c @@ -163,21 +163,25 @@ static int match_number(substring_t *s, int *result,= int base) */ static int match_u64int(substring_t *s, u64 *result, int base) { - char *buf; + char *buf, a[32]; int ret; u64 val; size_t len =3D s->to - s->from; = - buf =3D kmalloc(len + 1, GFP_KERNEL); - if (!buf) - return -ENOMEM; + buf =3D a; + if (len + 1 > sizeof(a)) { + buf =3D kmalloc(len + 1, GFP_KERNEL); + if (unlikely(!buf)) + return -ENOMEM; + } memcpy(buf, s->from, len); buf[len] =3D '\0'; = ret =3D kstrtoull(buf, base, ); if (!ret) *result =3D val; - kfree(buf); + if (buf !=3D a) + kfree(buf); return ret; } =
Re: [PATCH v4] add u64 number parser
Hello James, Your commit a317178 2016-12-06 parser: add u64 number parser was merged into v4.10-rc1. Good. I have a very similar function and used for a long time. Here I'd suggest you a tiny optimization based on my version. If you think another value is more apropriate as the size of an internal array, you can change it freely. J. R. Okajima commit 030361e78d327d9d89254dc3b320c092221c7cd0 Author: J. R. Okajima Date: Tue Jan 31 00:01:16 2017 +0900 parser: match_u64, short string optimization = When the given string is short enough, we can skip kmalloc/kfree. Signed-off-by: J. R. Okajima diff --git a/lib/parser.c b/lib/parser.c index 3278958..cfbc6ec 100644 --- a/lib/parser.c +++ b/lib/parser.c @@ -163,21 +163,25 @@ static int match_number(substring_t *s, int *result,= int base) */ static int match_u64int(substring_t *s, u64 *result, int base) { - char *buf; + char *buf, a[32]; int ret; u64 val; size_t len =3D s->to - s->from; = - buf =3D kmalloc(len + 1, GFP_KERNEL); - if (!buf) - return -ENOMEM; + buf =3D a; + if (len + 1 > sizeof(a)) { + buf =3D kmalloc(len + 1, GFP_KERNEL); + if (unlikely(!buf)) + return -ENOMEM; + } memcpy(buf, s->from, len); buf[len] =3D '\0'; = ret =3D kstrtoull(buf, base, ); if (!ret) *result =3D val; - kfree(buf); + if (buf !=3D a) + kfree(buf); return ret; } =
Re: v4.3-rc2, fault in sock_release via nfs_put_client
Trond Myklebust: > On Sun, Sep 27, 2015 at 8:47 AM, J. R. Okajima wrote: > > I don't know whether this is a known issue or not (since I was off from > > development for a few months), but I've got a "general protection fault: > > " message from linux-4.3-rc2. ::: > There is a fix for this issue in v4.3-rc3. Please see > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=79234c3db6842a3de03817211d891e0c2878f756 Thank you. Confirmed the problem is gone. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
v4.3-rc2, fault in sock_release via nfs_put_client
Hello NFS folks, I don't know whether this is a known issue or not (since I was off from development for a few months), but I've got a "general protection fault: " message from linux-4.3-rc2. Here are the reproducible script and the log. Would you check them please? Notes: - The script cannot reproduce the problem perfectly. Not always. But if you try several times, you will able to see. On my test system, I had to try a few times. - As far as I know, if bind-mount did not happen, then the problem did not happen either. But I am not sure whether bind-mount is really the trigger of this issue. J. R. Okajima -- #!/bin/sh set -eu Stat() # path { stat -f --printf="%n %T\n" $1 } uname -a s=/dev/shm Stat $s c=/tmp/c mkdir -p $c showmount -e > /tmp/e sudo exportfs -i -o rw,async,no_subtree_check,no_root_squash,fsid=99 localhost:$s showmount -e | diff /tmp/e - || : sudo mount -t nfs localhost:$s $c Stat $c b=/tmp/b mkdir -p $b sudo mount -o bind $c $b Stat $b > $b/f cat /proc/mounts > /tmp/m sudo umount -l $b diff /tmp/m /proc/mounts || : sync sleep 1 cat /proc/mounts > /tmp/m sudo umount -l $c diff /tmp/m /proc/mounts sudo exportfs -u localhost:$s showmount -e | diff /tmp/e - -- $ sh -x ./nfs-4.3-rc2.sh + set -eu + s=/dev/shm + Stat /dev/shm + stat -f --printf=%n %T\n /dev/shm /dev/shm tmpfs + c=/tmp/c + mkdir -p /tmp/c + showmount -e + sudo exportfs -i -o rw,async,no_subtree_check,no_root_squash,fsid=99 localhost:/dev/shm + showmount -e + diff /tmp/e - 3a4 > /dev/shm localhost + : + sudo mount -t nfs localhost:/dev/shm /tmp/c + Stat /tmp/c + stat -f --printf=%n %T\n /tmp/c /tmp/c nfs + b=/tmp/b + mkdir -p /tmp/b + sudo mount -o bind /tmp/c /tmp/b + Stat /tmp/b + stat -f --printf=%n %T\n /tmp/b /tmp/b nfs + + cat /proc/mounts + sudo umount -l /tmp/b + diff /tmp/m /proc/mounts 26d25 < localhost:/dev/shm /tmp/b nfs4 rw,relatime,vers=4.0,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0 + : + sync + sleep 1 + cat /proc/mounts + sudo umount -l /tmp/c general protection fault: [#7] PREEMPT SMP Modules linked in: oprofile configs autofs4 nfsd [last unloaded: brd] CPU: 0 PID: 4325 Comm: umount.nfs Tainted: G D 4.3.0-rc2aufsD+ #67 Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009 task: 88002d69ea00 ti: 88002ca0 task.ti: 88002ca0 RIP: 0010:[] [] sock_release+0x21/0x90 RSP: 0018:88002ca03bd8 EFLAGS: 00010202 RAX: 6b6b6b6b6b6b6b6b RBX: 88002812f7c0 RCX: RDX: RSI: RDI: 88002812f7c0 RBP: 88002ca03be8 R08: R09: R10: R11: 0001 R12: R13: 88002812f7c0 R14: 88002b8eeea0 R15: 0001 FS: 7f763d4f57e0() GS:88002fc0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 01176db8 CR3: 2d76c000 CR4: 000406f0 Stack: 88002b87 88002ca03c18 81715afc 88002b87 88002b87 88002ca03d08 88002ca03c38 81715bd6 88002b87 Call Trace: [] xs_reset_transport+0x18c/0x250 [] xs_close+0x16/0x30 [] xs_destroy+0x16/0x30 [] xprt_destroy+0x6f/0x80 [] xprt_put+0x14/0x20 [] rpc_free_client+0x84/0xc0 [] rpc_release_client+0x5a/0x90 [] rpc_shutdown_client+0xf1/0x100 [] nfs_free_client+0x97/0xa0 [] nfs4_free_client+0xa4/0xc0 [] nfs_put_client+0x295/0x430 [] ? nfs4_alloc_client+0x380/0x380 [] nfs_free_server+0x7c/0xd0 [] nfs_kill_super+0x2c/0x40 [] deactivate_locked_super+0x51/0x90 [] deactivate_super+0x84/0x90 [] cleanup_mnt+0x97/0xe0 [] __cleanup_mnt+0x12/0x20 [] task_work_run+0x72/0xa0 [] prepare_exit_to_usermode+0x10b/0x150 [] ? mntput_no_expire+0x5/0x2c0 [] syscall_return_slowpath+0x96/0x2f0 [] int_ret_from_sys_call+0x25/0x9f Code: 5b 5d c3 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 48 83 ec 10 48 89 5d f0 48 89 fb 4c 89 65 f8 48 8b 47 28 48 85 c0 74 17 <4c> 8b 60 08 ff 50 10 48 c7 43 28 00 00 00 00 4c 89 e7 e8 b8 7d RIP [] sock_release+0x21/0x90 RSP ---[ end trace 1bcdd4036690d082 ]--- umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted $ -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: v4.3-rc2, fault in sock_release via nfs_put_client
Trond Myklebust: > On Sun, Sep 27, 2015 at 8:47 AM, J. R. Okajima <hooanon...@gmail.com> wrote: > > I don't know whether this is a known issue or not (since I was off from > > development for a few months), but I've got a "general protection fault: > > " message from linux-4.3-rc2. ::: > There is a fix for this issue in v4.3-rc3. Please see > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=79234c3db6842a3de03817211d891e0c2878f756 Thank you. Confirmed the problem is gone. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
v4.3-rc2, fault in sock_release via nfs_put_client
Hello NFS folks, I don't know whether this is a known issue or not (since I was off from development for a few months), but I've got a "general protection fault: " message from linux-4.3-rc2. Here are the reproducible script and the log. Would you check them please? Notes: - The script cannot reproduce the problem perfectly. Not always. But if you try several times, you will able to see. On my test system, I had to try a few times. - As far as I know, if bind-mount did not happen, then the problem did not happen either. But I am not sure whether bind-mount is really the trigger of this issue. J. R. Okajima -- #!/bin/sh set -eu Stat() # path { stat -f --printf="%n %T\n" $1 } uname -a s=/dev/shm Stat $s c=/tmp/c mkdir -p $c showmount -e > /tmp/e sudo exportfs -i -o rw,async,no_subtree_check,no_root_squash,fsid=99 localhost:$s showmount -e | diff /tmp/e - || : sudo mount -t nfs localhost:$s $c Stat $c b=/tmp/b mkdir -p $b sudo mount -o bind $c $b Stat $b > $b/f cat /proc/mounts > /tmp/m sudo umount -l $b diff /tmp/m /proc/mounts || : sync sleep 1 cat /proc/mounts > /tmp/m sudo umount -l $c diff /tmp/m /proc/mounts sudo exportfs -u localhost:$s showmount -e | diff /tmp/e - -- $ sh -x ./nfs-4.3-rc2.sh + set -eu + s=/dev/shm + Stat /dev/shm + stat -f --printf=%n %T\n /dev/shm /dev/shm tmpfs + c=/tmp/c + mkdir -p /tmp/c + showmount -e + sudo exportfs -i -o rw,async,no_subtree_check,no_root_squash,fsid=99 localhost:/dev/shm + showmount -e + diff /tmp/e - 3a4 > /dev/shm localhost + : + sudo mount -t nfs localhost:/dev/shm /tmp/c + Stat /tmp/c + stat -f --printf=%n %T\n /tmp/c /tmp/c nfs + b=/tmp/b + mkdir -p /tmp/b + sudo mount -o bind /tmp/c /tmp/b + Stat /tmp/b + stat -f --printf=%n %T\n /tmp/b /tmp/b nfs + + cat /proc/mounts + sudo umount -l /tmp/b + diff /tmp/m /proc/mounts 26d25 < localhost:/dev/shm /tmp/b nfs4 rw,relatime,vers=4.0,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1 0 0 + : + sync + sleep 1 + cat /proc/mounts + sudo umount -l /tmp/c general protection fault: [#7] PREEMPT SMP Modules linked in: oprofile configs autofs4 nfsd [last unloaded: brd] CPU: 0 PID: 4325 Comm: umount.nfs Tainted: G D 4.3.0-rc2aufsD+ #67 Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009 task: 88002d69ea00 ti: 88002ca0 task.ti: 88002ca0 RIP: 0010:[] [] sock_release+0x21/0x90 RSP: 0018:88002ca03bd8 EFLAGS: 00010202 RAX: 6b6b6b6b6b6b6b6b RBX: 88002812f7c0 RCX: RDX: RSI: RDI: 88002812f7c0 RBP: 88002ca03be8 R08: R09: R10: R11: 0001 R12: R13: 88002812f7c0 R14: 88002b8eeea0 R15: 0001 FS: 7f763d4f57e0() GS:88002fc0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 01176db8 CR3: 2d76c000 CR4: 000406f0 Stack: 88002b87 88002ca03c18 81715afc 88002b87 88002b87 88002ca03d08 88002ca03c38 81715bd6 88002b87 Call Trace: [] xs_reset_transport+0x18c/0x250 [] xs_close+0x16/0x30 [] xs_destroy+0x16/0x30 [] xprt_destroy+0x6f/0x80 [] xprt_put+0x14/0x20 [] rpc_free_client+0x84/0xc0 [] rpc_release_client+0x5a/0x90 [] rpc_shutdown_client+0xf1/0x100 [] nfs_free_client+0x97/0xa0 [] nfs4_free_client+0xa4/0xc0 [] nfs_put_client+0x295/0x430 [] ? nfs4_alloc_client+0x380/0x380 [] nfs_free_server+0x7c/0xd0 [] nfs_kill_super+0x2c/0x40 [] deactivate_locked_super+0x51/0x90 [] deactivate_super+0x84/0x90 [] cleanup_mnt+0x97/0xe0 [] __cleanup_mnt+0x12/0x20 [] task_work_run+0x72/0xa0 [] prepare_exit_to_usermode+0x10b/0x150 [] ? mntput_no_expire+0x5/0x2c0 [] syscall_return_slowpath+0x96/0x2f0 [] int_ret_from_sys_call+0x25/0x9f Code: 5b 5d c3 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 48 83 ec 10 48 89 5d f0 48 89 fb 4c 89 65 f8 48 8b 47 28 48 85 c0 74 17 <4c> 8b 60 08 ff 50 10 48 c7 43 28 00 00 00 00 4c 89 e7 e8 b8 7d RIP [] sock_release+0x21/0x90 RSP ---[ end trace 1bcdd4036690d082 ]--- umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted umount.nfs: /tmp/c: not mounted $ -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fs: create and use seq_show_option for escaping
Kees Cook: > This fixes the problem by adding new seq_show_option and seq_show_option_n > helpers, and updating the vulnerable show_option handlers to use them as > needed. Some, like SELinux, need to be open coded due to unusual existing > escape mechanisms. How about other ctrl chars such as CR or FF? I am using the similar function for many years, and it might be more generic because it supports all cntrl chars other than "\t\n\\" (see below). Many of other ctrl chars may not be necessary. But some people uses non-ASCII chars for their pathnames which may contain ESC or other chars. Any crazy chars can corrupt the output of /proc/mount and others. So it might be better to consider all ctrl chars. -- static char au_esc_chars[0x20 + 3]; /* 0x01-0x20, backslash, del, and NULL */ int au_seq_path(struct seq_file *seq, struct path *path) { return seq_path(seq, path, au_esc_chars); } module_init(void) { ::: p = au_esc_chars; for (i = 1; i <= ' '; i++) *p++ = i; *p++ = '\\'; *p++ = '\x7f'; *p = 0; ::: } J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fs: create and use seq_show_option for escaping
Kees Cook: This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. How about other ctrl chars such as CR or FF? I am using the similar function for many years, and it might be more generic because it supports all cntrl chars other than \t\n\\ (see below). Many of other ctrl chars may not be necessary. But some people uses non-ASCII chars for their pathnames which may contain ESC or other chars. Any crazy chars can corrupt the output of /proc/mount and others. So it might be better to consider all ctrl chars. -- static char au_esc_chars[0x20 + 3]; /* 0x01-0x20, backslash, del, and NULL */ int au_seq_path(struct seq_file *seq, struct path *path) { return seq_path(seq, path, au_esc_chars); } module_init(void) { ::: p = au_esc_chars; for (i = 1; i = ' '; i++) *p++ = i; *p++ = '\\'; *p++ = '\x7f'; *p = 0; ::: } J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Repercussions of overflow in get_next_ino()
Hello, Nikolay Borisov: > My question is what are the repercussions of get_next_ino overflowing > and at some point having possibly multiple inodes on my system with the > same i_ino id? And why is it safe to have the inode id's overflow and > wrap around? I am afraid some applications won't work correctly. As far as I know, ls(1) and find(1) don't show the file whose inum is zero. See also Subject: [PATCH v2] vfs: get_next_ino(), never inum=0 Date: 2014-05-28 14:06:32 http://marc.info/?l=linux-fsdevel=140128600801771=2 and their thread. For tmpfs, I have another patch. Just FYI, here attached. J. R. Okajima a.patch.bz2 Description: BZip2 compressed data
Re: Repercussions of overflow in get_next_ino()
Hello, Nikolay Borisov: My question is what are the repercussions of get_next_ino overflowing and at some point having possibly multiple inodes on my system with the same i_ino id? And why is it safe to have the inode id's overflow and wrap around? I am afraid some applications won't work correctly. As far as I know, ls(1) and find(1) don't show the file whose inum is zero. See also Subject: [PATCH v2] vfs: get_next_ino(), never inum=0 Date: 2014-05-28 14:06:32 http://marc.info/?l=linux-fsdevelm=140128600801771w=2 and their thread. For tmpfs, I have another patch. Just FYI, here attached. J. R. Okajima a.patch.bz2 Description: BZip2 compressed data
Re: [PULL for 3.18] overlay filesystem v24
Miklos Szeredi: > It would be good to have an inode based union as well. I suggest (as > I suggested many times) to try slimming aufs to a bare minimum and > submit that. It may be easier to just start from scratch instead > trying to drop features from the existing codebase. I'd be happy to > review and generally help with such an effort. ?? In 2009, I have posted the "feature-reduced" version of aufs2 as following your advice. Also a git-branch called "aufs2-tmp-ro" was created whose size was a half of aufs2's. Have you ever posted your review comments about them? It was 2005 when I got an idea of aufs. Its history is almost a decade. In my current local aufs3 develpment branch, I got about 350 commits. And 680 and 1120 commits for aufs2 and aufs1 individually. aufs3-linux.git#aufs3.x-rcN/25lktr$ git log1 --no-merges fs/aufs/*.c | wc -l 348 aufs2-2.6.git#aufs2.2-stdalone-38-lktr$ git log1 --no-merges fs/aufs/*.c | wc -l 685 aufs1.git#master$ git log1 --no-merges aufs/fs/aufs/*.c | wc -l 1126 Of cource it won't be 2000 commits to create the patch series you wrote. I don't know how large the number will be but I am sure this is incredibly tough work both in commiting and reviewing. Is it still helpful and really meaningful for you to review? Don't you think it will be easier to review the resulted code? For example, this branch was created for linux-3.9 when Al Viro wrote about the dir mutex lock in copy-up. It has only 20 commits. aufs3-linux.git#mainline-v3.9-rc8-20130428$ git log1 --no-merges fs/aufs/*.c | wc -l 20 J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PULL for 3.18] overlay filesystem v24
Miklos Szeredi: It would be good to have an inode based union as well. I suggest (as I suggested many times) to try slimming aufs to a bare minimum and submit that. It may be easier to just start from scratch instead trying to drop features from the existing codebase. I'd be happy to review and generally help with such an effort. ?? In 2009, I have posted the feature-reduced version of aufs2 as following your advice. Also a git-branch called aufs2-tmp-ro was created whose size was a half of aufs2's. Have you ever posted your review comments about them? It was 2005 when I got an idea of aufs. Its history is almost a decade. In my current local aufs3 develpment branch, I got about 350 commits. And 680 and 1120 commits for aufs2 and aufs1 individually. aufs3-linux.git#aufs3.x-rcN/25lktr$ git log1 --no-merges fs/aufs/*.c | wc -l 348 aufs2-2.6.git#aufs2.2-stdalone-38-lktr$ git log1 --no-merges fs/aufs/*.c | wc -l 685 aufs1.git#master$ git log1 --no-merges aufs/fs/aufs/*.c | wc -l 1126 Of cource it won't be 2000 commits to create the patch series you wrote. I don't know how large the number will be but I am sure this is incredibly tough work both in commiting and reviewing. Is it still helpful and really meaningful for you to review? Don't you think it will be easier to review the resulted code? For example, this branch was created for linux-3.9 when Al Viro wrote about the dir mutex lock in copy-up. It has only 20 commits. aufs3-linux.git#mainline-v3.9-rc8-20130428$ git log1 --no-merges fs/aufs/*.c | wc -l 20 J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PULL for 3.18] overlay filesystem v24
David Howells: > Miklos Szeredi wrote: > > > I'd like to propose overlayfs for inclusion into 3.18. > > > > Al, would you mind giving it a review? > > > > Git tree is here: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git > > overlayfs.current > > Tested-by: David Howells Does it mean overlayfs passed all your unionmount-testsuite? And does the test suite contain tests for "inode-based" union? For example, - read(2) may get the obsoleted filedata (fstat(2) for metadata too). - fcntl(F_SETLK) may be broken by copy-up. - inotify may not work when it refers to the file before being copied-up. - unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after open(O_RDWR). - exporting via NFS and fhandle systemcalls will not work. A few releases ago, OFD file-lock was introduced to improve the behaviour of POSIX lock. POSIX lock has made users confused and I am afraid that the similar story will come up because of the "name-based" union behaviour. Of course the story is not limited to the file-lock. If I remember correctly, are you the one who consitunes the development of UnionMount? Is the development totally stopped? Next paragraph is what I wrote several times. AUFS is an "inode-based" stackable filesystem and solved them many years ago. But I have to admit that AUFS is big. Yes it is grown up. I don't stop including overlayfs into mainline, but if the development of UnionMount is really stopped, then I'd ask people to consider merging aufs as well as overlayfs. http://aufs.sf.net J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PULL for 3.18] overlay filesystem v24
David Howells: Miklos Szeredi mik...@szeredi.hu wrote: I'd like to propose overlayfs for inclusion into 3.18. Al, would you mind giving it a review? Git tree is here: git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.current Tested-by: David Howells dhowe...@redhat.com Does it mean overlayfs passed all your unionmount-testsuite? And does the test suite contain tests for inode-based union? For example, - read(2) may get the obsoleted filedata (fstat(2) for metadata too). - fcntl(F_SETLK) may be broken by copy-up. - inotify may not work when it refers to the file before being copied-up. - unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after open(O_RDWR). - exporting via NFS and fhandle systemcalls will not work. A few releases ago, OFD file-lock was introduced to improve the behaviour of POSIX lock. POSIX lock has made users confused and I am afraid that the similar story will come up because of the name-based union behaviour. Of course the story is not limited to the file-lock. If I remember correctly, are you the one who consitunes the development of UnionMount? Is the development totally stopped? Next paragraph is what I wrote several times. AUFS is an inode-based stackable filesystem and solved them many years ago. But I have to admit that AUFS is big. Yes it is grown up. I don't stop including overlayfs into mainline, but if the development of UnionMount is really stopped, then I'd ask people to consider merging aufs as well as overlayfs. http://aufs.sf.net J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 3.15 .. and continuation of merge window
Al Viro: > The former guarantees that the address we are doing trylock on would be that > of a live dentry. The latter makes sure that anything assigned to > dentry->d_parent after we drop ->d_lock will not be freed until we drop > rcu_read_lock. Although I don't know how to reproduce the problem for the latter case, I think the patch is good. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 3.15 .. and continuation of merge window
Al Viro: > So I suspect that the right fix is a bit trickier - in addition to check > on the fast path (i.e. when trylock gets us the lock on parent), we need > to > * get rcu_read_lock() before dropping ->d_lock. > * check if dentry is already doomed right after taking rcu_read_lock(); > if not, any value we might see in ->d_parent afterwards will point to object > not freed until we drop rcu_read_lock. > > IOW, something like the delta below. Comments? I will try testing later. For now, as a comment before testing, the patch looks weird for me. It checks d_lockref.count twice during d_lockref.lock held. It must be the same result, isn't it? Or does it mean that denty can be handled by lockref_mark_dead() even if d_lockref.lock is held? J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 3.15 .. and continuation of merge window
Al Viro: So I suspect that the right fix is a bit trickier - in addition to check on the fast path (i.e. when trylock gets us the lock on parent), we need to * get rcu_read_lock() before dropping -d_lock. * check if dentry is already doomed right after taking rcu_read_lock(); if not, any value we might see in -d_parent afterwards will point to object not freed until we drop rcu_read_lock. IOW, something like the delta below. Comments? I will try testing later. For now, as a comment before testing, the patch looks weird for me. It checks d_lockref.count twice during d_lockref.lock held. It must be the same result, isn't it? Or does it mean that denty can be handled by lockref_mark_dead() even if d_lockref.lock is held? J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 3.15 .. and continuation of merge window
Al Viro: The former guarantees that the address we are doing trylock on would be that of a live dentry. The latter makes sure that anything assigned to dentry-d_parent after we drop -d_lock will not be freed until we drop rcu_read_lock. Although I don't know how to reproduce the problem for the latter case, I think the patch is good. J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 3.15 .. and continuation of merge window
Linus Torvalds: > So I ended up doing an rc8 because I was a bit worried about some > last-minute dcache fixes, but it turns out that nobody seemed to even > notice those. We did have other issues during the week, though, so it ::: I am afraid there is a problem in dcache. Please read http://marc.info/?l=linux-fsdevel=140214911608925=2 For 3.16, please consider these patches. - for vfs [PATCH v2] vfs: get_next_ino(), never inum=0 http://marc.info/?l=linux-fsdevel=140128600801771=2 - for tmpfs [RFC PATCH v4 0/2] tmpfs: manage the inode-number by IDR (performance measure) http://marc.info/?l=linux-fsdevel=140197128021025=2 [RFC PATCH v4 1/2] tmpfs: manage the inode-number by IDR, signed int inum http://marc.info/?l=linux-fsdevel=140197128321029=2 [RFC PATCH v4 2/2] tmpfs: refine a file handle for NFS-exporting http://marc.info/?l=linux-fsdevel=140197128121026=2 J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 3.15 .. and continuation of merge window
Linus Torvalds: So I ended up doing an rc8 because I was a bit worried about some last-minute dcache fixes, but it turns out that nobody seemed to even notice those. We did have other issues during the week, though, so it ::: I am afraid there is a problem in dcache. Please read http://marc.info/?l=linux-fsdevelm=140214911608925w=2 For 3.16, please consider these patches. - for vfs [PATCH v2] vfs: get_next_ino(), never inum=0 http://marc.info/?l=linux-fsdevelm=140128600801771w=2 - for tmpfs [RFC PATCH v4 0/2] tmpfs: manage the inode-number by IDR (performance measure) http://marc.info/?l=linux-fsdevelm=140197128021025w=2 [RFC PATCH v4 1/2] tmpfs: manage the inode-number by IDR, signed int inum http://marc.info/?l=linux-fsdevelm=140197128321029w=2 [RFC PATCH v4 2/2] tmpfs: refine a file handle for NFS-exporting http://marc.info/?l=linux-fsdevelm=140197128121026w=2 J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Unionmount and overlayfs testsuite
David Howells: > > Does readlink(2) return "/u/fileA" instead of /ro/fileA?" > > No. > > The test suite sets the lower symlink to point to the union path for its > target. > > [root@andromeda union-testsuite]# readlink /lower/a/indirect_dir_sym100 > /mnt/a/direct_dir_sym100 Now I've found your latest commit fixed the path. No problemo. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Unionmount and overlayfs testsuite
David Howells: Does readlink(2) return /u/fileA instead of /ro/fileA? No. The test suite sets the lower symlink to point to the union path for its target. [root@andromeda union-testsuite]# readlink /lower/a/indirect_dir_sym100 /mnt/a/direct_dir_sym100 Now I've found your latest commit fixed the path. No problemo. J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Unionmount and overlayfs testsuite
"J. R. Okajima": > - readlink.test, > fs_op readlink $file -R $testdir/direct_dir_sym100 ${termslash:+-E > EINVAL} > It expects "$testdir/direct_dir_sym100". Does it mean UnionMount > converts the target path? > For example, > - /u = /rw + /ro > - /rw/symlinkA doesn't exist > - /ro/symlinkA points to /ro/fileA > Does readlink(2) return "/u/fileA" instead of "/u/fileA"? Ouch! I meant "Does ... instead of /ro/fileA?" J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Unionmount and overlayfs testsuite
David Howells: > http://git.infradead.org/users/dhowells/unionmount-testsuite.git I've found some interesting cases. - impermissible.test, open_file_as_bin -t -w $file -E EACCES When $termslash is "/", a '/' is appended to the expanded $file, such as "/path/fileA/". If fileA is a regular file, that path is obviously wrong. Does UnionMount expect EACCES in this case too? Should it be ENOTDIR? It might be better to change errcode=EACCES test ! "$termslash" = "" && errcode=ENOTDIR open_file_as_bin -t -w $file -E $errcode - readlink.test, fs_op readlink $file -R $testdir/direct_dir_sym100 ${termslash:+-E EINVAL} It expects "$testdir/direct_dir_sym100". Does it mean UnionMount converts the target path? For example, - /u = /rw + /ro - /rw/symlinkA doesn't exist - /ro/symlinkA points to /ro/fileA Does readlink(2) return "/u/fileA" instead of "/u/fileA"? And all tests should be done by a superuser? J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Unionmount and overlayfs testsuite
David Howells: http://git.infradead.org/users/dhowells/unionmount-testsuite.git I've found some interesting cases. - impermissible.test, open_file_as_bin -t -w $file -E EACCES When $termslash is /, a '/' is appended to the expanded $file, such as /path/fileA/. If fileA is a regular file, that path is obviously wrong. Does UnionMount expect EACCES in this case too? Should it be ENOTDIR? It might be better to change errcode=EACCES test ! $termslash = errcode=ENOTDIR open_file_as_bin -t -w $file -E $errcode - readlink.test, fs_op readlink $file -R $testdir/direct_dir_sym100 ${termslash:+-E EINVAL} It expects $testdir/direct_dir_sym100. Does it mean UnionMount converts the target path? For example, - /u = /rw + /ro - /rw/symlinkA doesn't exist - /ro/symlinkA points to /ro/fileA Does readlink(2) return /u/fileA instead of /u/fileA? And all tests should be done by a superuser? J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Unionmount and overlayfs testsuite
J. R. Okajima: - readlink.test, fs_op readlink $file -R $testdir/direct_dir_sym100 ${termslash:+-E EINVAL} It expects $testdir/direct_dir_sym100. Does it mean UnionMount converts the target path? For example, - /u = /rw + /ro - /rw/symlinkA doesn't exist - /ro/symlinkA points to /ro/fileA Does readlink(2) return /u/fileA instead of /u/fileA? Ouch! I meant Does ... instead of /ro/fileA? J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/13] overlay filesystem v22
Thanks for CC-ing me. Here are some comments. - I have no objection about the 0:0 char-dev whiteout, but you don't have to have the inode for each whiteout. The hardlink is better. In this version, you have now. How about creating a "base" whiteout under workdir at the mount-time? Maybe it is possible by user-space "mount.overlayfs" or kernel-space. When the whiteout meets EMLINK, create a non-hardlink for that target synchronously and re-create the "base" asynchronously. - Is /work always visible to users? If a user accidentally removes it or its children, then some operations will fail. And he cannot recover it anymore. I know it cannot easily happen since its mode is 0. But I am afraid it will be a source of troubles. For example, find(1) or "ls -R /overlayfs" will almost always fail. - If I remember correctly, the length of the dir mutex is held time has been pointed out. This version has still a long mutex held time, a whole copy-up operation includeing lookup, create, copy filedata, copy xattr/attr, and then rename. How about unlock the dir before copying filedata and re-lock and confirm after copying attr? - When two processes copy-up a similar dir hierarcy, for example /dirA/dirB/fileC and /dirA/dirB/dirC/fileD, may a race condition happen? Two processes begin copying-up dirA, first processA succeeds it and second processB fails and returns EIO? - All copy-up operations will be serialized due to lock. - In fstat(2) for a dir, is nlink set to 1 even it is removed? - In readdir, it lookup or getattr to determine whether the found char dev entry is a whiteout or not. I know a char dev is not so many, so this overhead won't be large. But if its name represented "I am a whiteout", then the extra lookup or getattr would be unnecessary. My personal impression for overall is overlayfs starts growing. Also several parts look like towarding aufs. For example, - a means an overlayfs specific work. Aufs has such special dir for copying-up an unlinked file and a pseudo-link. Both are unnecessary for overlayfs because overlayfs copies-up a file in open(2) time, and doesn't support the hardlink between layers. - many small wrapper functions for VFS helpers resemble to aufs too. In aufs, all they have lockdep_off/on. - the internal cache for readdir allocating extra memory. Aufs adopts a simple hashing, while overlayfs uses rbtree. But of course the fundamental design differences between overlayfs and aufs are kept, such as - a name-based union .vs. an inode-aware union - multiple layers - allow users to access the layers directly - etc... If LKML people consider merging overlayfs, then I'd ask to consier aufs too. The basic design is unchanged since when I posted a few years ago. http://marc.info/?l=linux-kernel=123934927611907=2 And latest one is http://aufs.sourceforge.net J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 00/13] overlay filesystem v22
Thanks for CC-ing me. Here are some comments. - I have no objection about the 0:0 char-dev whiteout, but you don't have to have the inode for each whiteout. The hardlink is better. In this version, you have workdir now. How about creating a base whiteout under workdir at the mount-time? Maybe it is possible by user-space mount.overlayfs or kernel-space. When the whiteout meets EMLINK, create a non-hardlink for that target synchronously and re-create the base asynchronously. - Is workdir/work always visible to users? If a user accidentally removes it or its children, then some operations will fail. And he cannot recover it anymore. I know it cannot easily happen since its mode is 0. But I am afraid it will be a source of troubles. For example, find(1) or ls -R /overlayfs will almost always fail. - If I remember correctly, the length of the dir mutex is held time has been pointed out. This version has still a long mutex held time, a whole copy-up operation includeing lookup, create, copy filedata, copy xattr/attr, and then rename. How about unlock the dir before copying filedata and re-lock and confirm after copying attr? - When two processes copy-up a similar dir hierarcy, for example /dirA/dirB/fileC and /dirA/dirB/dirC/fileD, may a race condition happen? Two processes begin copying-up dirA, first processA succeeds it and second processB fails and returns EIO? - All copy-up operations will be serialized due to workdir lock. - In fstat(2) for a dir, is nlink set to 1 even it is removed? - In readdir, it lookup or getattr to determine whether the found char dev entry is a whiteout or not. I know a char dev is not so many, so this overhead won't be large. But if its name represented I am a whiteout, then the extra lookup or getattr would be unnecessary. My personal impression for overall is overlayfs starts growing. Also several parts look like towarding aufs. For example, - a workdir means an overlayfs specific work. Aufs has such special dir for copying-up an unlinked file and a pseudo-link. Both are unnecessary for overlayfs because overlayfs copies-up a file in open(2) time, and doesn't support the hardlink between layers. - many small wrapper functions for VFS helpers resemble to aufs too. In aufs, all they have lockdep_off/on. - the internal cache for readdir allocating extra memory. Aufs adopts a simple hashing, while overlayfs uses rbtree. But of course the fundamental design differences between overlayfs and aufs are kept, such as - a name-based union .vs. an inode-aware union - multiple layers - allow users to access the layers directly - etc... If LKML people consider merging overlayfs, then I'd ask to consier aufs too. The basic design is unchanged since when I posted a few years ago. http://marc.info/?l=linux-kernelm=123934927611907w=2 And latest one is http://aufs.sourceforge.net J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] ima: audit log files opened with O_DIRECT flag
Mimi Zohar: > As a temporary fix, do not measure, appraise, or audit files > opened with the O_DIRECT flag set. Just audit log it. I have no objection about the patch, but have a question. Are you intending to put it into mainline now (and stable too)? Or is this a local bandage for whoever have met the problem (like me)? In other words, should I wait for another lock free solution from Dmitry Kasatkin? By the way, the mail is not delivered to stable-ML while there is "Cc: stable..." line in the commit log. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] ima: audit log files opened with O_DIRECT flag
Mimi Zohar: As a temporary fix, do not measure, appraise, or audit files opened with the O_DIRECT flag set. Just audit log it. I have no objection about the patch, but have a question. Are you intending to put it into mainline now (and stable too)? Or is this a local bandage for whoever have met the problem (like me)? In other words, should I wait for another lock free solution from Dmitry Kasatkin? By the way, the mail is not delivered to stable-ML while there is Cc: stable... line in the commit log. J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)
Mimi Zohar: > I assume so, as there wasn't any comment. As a temporary fix, would it > make sense not to measure/appraise/audit files opened with the direct-io > flag based policy? Define a new IMA policy option 'directio'. A sample > rule would look like: > > dont_appraise bprm_check directio fsuuid=... I prefer such approach or anything addressing in IMA only, so it makes sense. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)
"J. R. Okajima": > do_blockdev_direct_IO(), I'd suggest > - make two new static inline functions like > r = ima_aware_file_inode_mutex_lock(file) and ..._unlock(r, file). > - these new functions are complied when CONFIG_IMA is enabled, otherwise > they are plain mutex_lock/unlock(). > - then do_blockdev_direct_IO() can call them blindly. > - of course, O_DIRECT_HAVELOCK should be complied only when CONFIG_IMA > is enabled too. One more thing. Since struct file is a sharable object, it might be better to put task-id into struct file. Hmm, then should we support for multiple tasks by list or something? Oh the code grows... J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)
Mimi Zohar: > Another approach was posted here > http://marc.info/?l=linux-security-module=138919062430367=2 which > also was not upstreamed. It might be better a little than previous one which handles the flag temporarily. But, in order to make the code cleaner particulary for do_blockdev_direct_IO(), I'd suggest - make two new static inline functions like r = ima_aware_file_inode_mutex_lock(file) and ..._unlock(r, file). - these new functions are complied when CONFIG_IMA is enabled, otherwise they are plain mutex_lock/unlock(). - then do_blockdev_direct_IO() can call them blindly. - of course, O_DIRECT_HAVELOCK should be complied only when CONFIG_IMA is enabled too. I can guess that several people thinks that is still "ugly locking", but the deadlock is much ugly in real world. And we need some workaround for it. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)
Dmitry Kasatkin: > It is a different issue. > > I made patch more than a year ago which fix the problem > > https://lkml.org/lkml/2013/2/20/601 Yes, I know this is a different issue, but I didn't know the patch. While I am not sure how ugly (or beautiful) is the approache the patch took, as long as IMA needs to read the file, we should merge the merge. Or we should wait for your another patch. J. R. Okajima -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)
Dmitry Kasatkin: It is a different issue. I made patch more than a year ago which fix the problem https://lkml.org/lkml/2013/2/20/601 Yes, I know this is a different issue, but I didn't know the patch. While I am not sure how ugly (or beautiful) is the approache the patch took, as long as IMA needs to read the file, we should merge the merge. Or we should wait for your another patch. J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)
Mimi Zohar: Another approach was posted here http://marc.info/?l=linux-security-modulem=138919062430367w=2 which also was not upstreamed. It might be better a little than previous one which handles the flag temporarily. But, in order to make the code cleaner particulary for do_blockdev_direct_IO(), I'd suggest - make two new static inline functions like r = ima_aware_file_inode_mutex_lock(file) and ..._unlock(r, file). - these new functions are complied when CONFIG_IMA is enabled, otherwise they are plain mutex_lock/unlock(). - then do_blockdev_direct_IO() can call them blindly. - of course, O_DIRECT_HAVELOCK should be complied only when CONFIG_IMA is enabled too. I can guess that several people thinks that is still ugly locking, but the deadlock is much ugly in real world. And we need some workaround for it. J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)
J. R. Okajima: do_blockdev_direct_IO(), I'd suggest - make two new static inline functions like r = ima_aware_file_inode_mutex_lock(file) and ..._unlock(r, file). - these new functions are complied when CONFIG_IMA is enabled, otherwise they are plain mutex_lock/unlock(). - then do_blockdev_direct_IO() can call them blindly. - of course, O_DIRECT_HAVELOCK should be complied only when CONFIG_IMA is enabled too. One more thing. Since struct file is a sharable object, it might be better to put task-id into struct file. Hmm, then should we support for multiple tasks by list or something? Oh the code grows... J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)
Mimi Zohar: I assume so, as there wasn't any comment. As a temporary fix, would it make sense not to measure/appraise/audit files opened with the direct-io flag based policy? Define a new IMA policy option 'directio'. A sample rule would look like: dont_appraise bprm_check directio fsuuid=... I prefer such approach or anything addressing in IMA only, so it makes sense. J. R. Okajima -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/