Re: [PATCH] concrete /proc/mounts

2019-05-26 Thread J. R. Okajima
Al Viro:
> Translation: let's generate the entire contents on the first read() and keep
> it until the sucker's closed; that way userland wont' see anything changing
> under it.  Oh, wait...
>
> NAK.

Do you mean that the change can hide other mountpoints which are
kept unchanged before/after read()?


J. R. Okajima


[PATCH] concrete /proc/mounts

2019-05-26 Thread J. R. Okajima
commit 1e83f8634c6efe7dd4e6036ee202ca10bdbca0b3
Author: J. R. Okajima 
Date:   Sat May 25 18:35:13 2019 +0900

concrete /proc/mounts

When the size of /proc/mounts exceeds PAGE_SIZE, seq_read() has to
release namespace_sem via mounts_op.m_stop().  It means if someone else
issues mount(2) or umount(2) and the mounts list got changed, then the
continuous getmntent(3) calls show the incomplete mounts list and some
entries may not appear in it.

This patch generates the full mounts list when mounts_op.m_start() is
called, and keep it in the seq_file buffer until the file is closed.
The size of the buffer increases if necessary.  Other operations m_next,
m_stop, m_show become meaningless, but still necessary for the seq_file
manner.

I don't think the size of the buffer matters because many /proc entries
already keep the similar PAGE_SIZE buffer.  Increasing /proc/mounts
buffer is to keep the correctness of the mount list.

Reported-by: Kirill Kolyshkin 
See-also: https://github.com/kolyshkin/procfs-test
Signed-off-by: J. R. Okajima 

diff --git a/fs/mount.h b/fs/mount.h
index f39bc9da4d73..1ffd97696ca9 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -131,9 +131,7 @@ struct proc_mounts {
struct mnt_namespace *ns;
struct path root;
int (*show)(struct seq_file *, struct vfsmount *);
-   void *cached_mount;
-   u64 cached_event;
-   loff_t cached_index;
+   bool filled;
 };
 
 extern const struct seq_operations mounts_op;
diff --git a/fs/namespace.c b/fs/namespace.c
index d18deb4c410b..2984a48cd40f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1298,46 +1298,77 @@ struct vfsmount *mnt_clone_internal(const struct path 
*path)
 
 #ifdef CONFIG_PROC_FS
 /* iterator; we want it to have access to namespace_sem, thus here... */
-static void *m_start(struct seq_file *m, loff_t *pos)
+static int m_start_fill(struct seq_file *m)
 {
+   int err;
+   size_t last_count;
+   char *buf;
+   struct mount *r;
struct proc_mounts *p = m->private;
 
down_read(_sem);
-   if (p->cached_event == p->ns->event) {
-   void *v = p->cached_mount;
-   if (*pos == p->cached_index)
-   return v;
-   if (*pos == p->cached_index + 1) {
-   v = seq_list_next(v, >ns->list, >cached_index);
-   return p->cached_mount = v;
+   list_for_each_entry(r, >ns->list, mnt_list) {
+   last_count = m->count;
+   err = p->show(m, >mnt);
+   if (unlikely(err < 0))
+   break;
+   if (!seq_has_overflowed(m))
+   continue;
+
+   /* expand the buffer */
+   buf = kvmalloc(m->size + PAGE_SIZE, GFP_KERNEL);
+   if (unlikely(!buf)) {
+   err = -ENOMEM;
+   break;
+   }
+   memcpy(buf, m->buf, last_count);
+   kvfree(m->buf);
+   m->buf = buf;
+   m->size += PAGE_SIZE;
+   m->count = last_count;
+
+   err = p->show(m, >mnt);
+   if (unlikely(err < 0))
+   break;
+   else if (unlikely(seq_has_overflowed(m))) {
+   err = -EFBIG;
+   break;
}
}
+   up_read(_sem);
 
-   p->cached_event = p->ns->event;
-   p->cached_mount = seq_list_start(>ns->list, *pos);
-   p->cached_index = *pos;
-   return p->cached_mount;
+   if (!err)
+   p->filled = true;
+   return err;
 }
 
-static void *m_next(struct seq_file *m, void *v, loff_t *pos)
+static void *m_start(struct seq_file *m, loff_t *pos)
 {
+   int err;
struct proc_mounts *p = m->private;
 
-   p->cached_mount = seq_list_next(v, >ns->list, pos);
-   p->cached_index = *pos;
-   return p->cached_mount;
+   if (!p->filled) {
+   err = m_start_fill(m);
+   if (unlikely(err))
+   return ERR_PTR(err);
+   }
+
+   return m_start; /* any valid pointer */
+}
+
+static void *m_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   return NULL;
 }
 
 static void m_stop(struct seq_file *m, void *v)
 {
-   up_read(_sem);
+   /* empty */
 }
 
 static int m_show(struct seq_file *m, void *v)
 {
-   struct proc_mounts *p = m->private;
-   struct mount *r = list_entry(v, struct mount, mnt_list);
-   return p->show(m, >mnt);
+   return 0;
 }
 
 const struct seq_operations mounts_op = {
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 7626ee11b06c..f8aee5cca1b1 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -279,7 +279,6 @@ static int mo

Re: [PATCH] OVL: add honoracl=off mount option.

2019-05-02 Thread J. R. Okajima
NeilBrown:
> If the upper and lower layers use incompatible ACL formats, it is not
> possible to copy the ACL xttr from one to the other, so overlayfs
> cannot work with them.
> This happens particularly with NFSv4 which uses system.nfs4_acl, and
> ext4 which uses system.posix_acl_access.

FYI,
Aufs had met the same problem many years ago, and I introduced some
options called ICEX (Ignore Copyup Error on Xattr).

(from the design doc in aufs)

For example, the src branch supports ACL but the dst branch doesn't
because the dst branch may natively un-support it or temporary
un-support it due to "noacl" mount option. Of course, the dst branch fs
may NOT return an error even if the XATTR is not supported. It is
totally up to the branch fs.

Anyway when the aufs internal copy-up gets an error from the dst branch
fs, then aufs tries removing the just copied entry and returns the error
to the userspace. The worst case of this situation will be all copy-up
will fail.

For the copy-up operation, there two basic approaches.
- copy the specified XATTR only (by category above), and return the
  error unconditionally if it happens.
- copy all XATTR, and ignore the error on the specified category only.

In order to support XATTR and to implement the correct behaviour, aufs
chooses the latter approach and introduces some new branch attributes,
"icexsec", "icexsys", "icextr", "icexusr", and "icexoth".
They correspond to the XATTR namespaces (see above). Additionally, to be
convenient, "icex" is also provided which means all "icex*" attributes
are set (here the word "icex" stands for "ignore copy-error on XATTR").

The meaning of these attributes is to ignore the error from setting
XATTR on that branch.
Note that aufs tries copying all XATTR unconditionally, and ignores the
error from the dst branch according to the specified attributes.


But recent nfs4 got changed its behaviour around ACL, and it didn't pass
my local tests.  I had posted about that, but got no reply.


J. R. Okajima


fix v5.0-rc1, SUNRPC: init machine_cred.magic under CONFIG_DEBUG_CREDENTIALS

2019-01-27 Thread J. R. Okajima
By the commit,
a52458b48af1 2018-12-19 NFS/NFSD/SUNRPC: replace generic creds with 'struct 
cred'.

struct rpc_cred machine_cred was converted to struct cred, but
machine_cred.magic is still uninitialized.  Without initializing it, I
got 'Invalid credentials' error.  It is necessary when
CONFIG_DEBUG_CREDENTIALS is enabled.

CRED: Invalid credentials
CRED: At /proj/aufs/aufs4-linux.git/include/linux/cred.h:253
CRED: Specified credentials: 6ca067d8 
CRED: ->magic=0, put_addr=  (null)
CRED: ->usage=1, subscr=0
CRED: ->*uid = { 0,0,0,0 }
CRED: ->*gid = { 0,0,0,0 }
CRED: ->security is   (null)
[ cut here ]
kernel BUG at /proj/aufs/aufs4-linux.git/kernel/cred.c:825!
invalid opcode:  [#1] PREEMPT SMP PTI
CPU: 0 PID: 24923 Comm: mount.nfs4 Tainted: GW 5.0.0-rc1aufsD+ 
#906
Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009
RIP: 0010:__invalid_creds+0x4d/0x60
Code: 44 89 ea 4c 89 e6 48 c7 c7 cf 92 49 82 e8 5e 21 05 00 48 c7 c6 e1 92 49 
82 48 89 df 65 48 8b 14 25 80 4e 01 00 e8 23 fe ff ff <0f> 0b 48 c7 c7 00 ac 68 
82 e8 25 6a 53 00 0f 1f 44 00 00 66 66 66
RSP: 0018:88810bedf918 EFLAGS: 00010293
RAX:  RBX: 829016c0 RCX: 
RDX:  RSI:  RDI: 8294c428
RBP: 88810bedf930 R08: 0001 R09: 
R10: 88810bedf930 R11:  R12: 824964d0
R13: 00fd R14: 829016c0 R15: 0001
FS:  7f8616b55480() GS:88811ba0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fda79fe7170 CR3: aca9 CR4: 000406f0
Call Trace:
 nfs4_discover_server_trunking+0x2cb/0x330
 nfs4_init_client+0x16e/0x210
 ? lockdep_init_map+0x57/0x1d0
 ? rpc_wake_up_task_on_wq_queue_action_locked+0x60/0x60
 ? nfs_get_client+0x500/0x680
 nfs_get_client+0x51d/0x680
 nfs4_set_client+0xb9/0x130
 nfs4_create_server+0x10d/0x290
 nfs4_remote_mount+0x30/0x90
 mount_fs+0x51/0x220
 ? __init_waitqueue_head+0x3b/0x50
 vfs_kern_mount+0x6b/0x190
 ? nfs_do_root_mount+0x3c/0xc0
 nfs_do_root_mount+0x84/0xc0
 nfs4_try_mount+0x37/0x50
 nfs_fs_mount+0x2a1/0xa40
 ? nfs_clone_super+0x80/0x80
 ? nfs_free_parsed_mount_data+0x60/0x60
 mount_fs+0x51/0x220
 ? nfs_alloc_parsed_mount_data+0xd0/0xd0
 ? mount_fs+0x51/0x220
 ? __init_waitqueue_head+0x3b/0x50
 vfs_kern_mount+0x6b/0x190
 ? ns_capable_common+0xc3/0x110
 do_mount+0x220/0xf90
 ksys_mount+0xea/0x170
 __x64_sys_mount+0x25/0x30
 do_syscall_64+0x79/0x360
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Here is a patch.

diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 1ff9768f5456..0e5236a3b5e0 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -41,6 +41,9 @@ static unsigned long number_cred_unused;
 
 static struct cred machine_cred = {
.usage = ATOMIC_INIT(1),
+#ifdef CONFIG_DEBUG_CREDENTIALS
+   .magic = CRED_MAGIC
+#endif
 };
 
 /*


fix v5.0-rc1, SUNRPC: init machine_cred.magic under CONFIG_DEBUG_CREDENTIALS

2019-01-27 Thread J. R. Okajima
By the commit,
a52458b48af1 2018-12-19 NFS/NFSD/SUNRPC: replace generic creds with 'struct 
cred'.

struct rpc_cred machine_cred was converted to struct cred, but
machine_cred.magic is still uninitialized.  Without initializing it, I
got 'Invalid credentials' error.  It is necessary when
CONFIG_DEBUG_CREDENTIALS is enabled.

CRED: Invalid credentials
CRED: At /proj/aufs/aufs4-linux.git/include/linux/cred.h:253
CRED: Specified credentials: 6ca067d8 
CRED: ->magic=0, put_addr=  (null)
CRED: ->usage=1, subscr=0
CRED: ->*uid = { 0,0,0,0 }
CRED: ->*gid = { 0,0,0,0 }
CRED: ->security is   (null)
[ cut here ]
kernel BUG at /proj/aufs/aufs4-linux.git/kernel/cred.c:825!
invalid opcode:  [#1] PREEMPT SMP PTI
CPU: 0 PID: 24923 Comm: mount.nfs4 Tainted: GW 5.0.0-rc1aufsD+ 
#906
Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009
RIP: 0010:__invalid_creds+0x4d/0x60
Code: 44 89 ea 4c 89 e6 48 c7 c7 cf 92 49 82 e8 5e 21 05 00 48 c7 c6 e1 92 49 
82 48 89 df 65 48 8b 14 25 80 4e 01 00 e8 23 fe ff ff <0f> 0b 48 c7 c7 00 ac 68 
82 e8 25 6a 53 00 0f 1f 44 00 00 66 66 66
RSP: 0018:88810bedf918 EFLAGS: 00010293
RAX:  RBX: 829016c0 RCX: 
RDX:  RSI:  RDI: 8294c428
RBP: 88810bedf930 R08: 0001 R09: 
R10: 88810bedf930 R11:  R12: 824964d0
R13: 00fd R14: 829016c0 R15: 0001
FS:  7f8616b55480() GS:88811ba0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fda79fe7170 CR3: aca9 CR4: 000406f0
Call Trace:
 nfs4_discover_server_trunking+0x2cb/0x330
 nfs4_init_client+0x16e/0x210
 ? lockdep_init_map+0x57/0x1d0
 ? rpc_wake_up_task_on_wq_queue_action_locked+0x60/0x60
 ? nfs_get_client+0x500/0x680
 nfs_get_client+0x51d/0x680
 nfs4_set_client+0xb9/0x130
 nfs4_create_server+0x10d/0x290
 nfs4_remote_mount+0x30/0x90
 mount_fs+0x51/0x220
 ? __init_waitqueue_head+0x3b/0x50
 vfs_kern_mount+0x6b/0x190
 ? nfs_do_root_mount+0x3c/0xc0
 nfs_do_root_mount+0x84/0xc0
 nfs4_try_mount+0x37/0x50
 nfs_fs_mount+0x2a1/0xa40
 ? nfs_clone_super+0x80/0x80
 ? nfs_free_parsed_mount_data+0x60/0x60
 mount_fs+0x51/0x220
 ? nfs_alloc_parsed_mount_data+0xd0/0xd0
 ? mount_fs+0x51/0x220
 ? __init_waitqueue_head+0x3b/0x50
 vfs_kern_mount+0x6b/0x190
 ? ns_capable_common+0xc3/0x110
 do_mount+0x220/0xf90
 ksys_mount+0xea/0x170
 __x64_sys_mount+0x25/0x30
 do_syscall_64+0x79/0x360
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Here is a patch.

diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index 1ff9768f5456..0e5236a3b5e0 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -41,6 +41,9 @@ static unsigned long number_cred_unused;
 
 static struct cred machine_cred = {
.usage = ATOMIC_INIT(1),
+#ifdef CONFIG_DEBUG_CREDENTIALS
+   .magic = CRED_MAGIC
+#endif
 };
 
 /*


Q. re-using lock_classes[]

2018-12-10 Thread J. R. Okajima
Hello,

"Troubleshooting" section in Documentation/locking/lockdep-design.txt
describes

1.  Repeated module loading and unloading while running the validator
will result in lock-class leakage.  The issue here is that each
load of the module will create a new set of lock classes for
that module's locks, but module unloading does not remove old
classes (see below discussion of reuse of lock classes for why).
Therefore, if that module is loaded and unloaded repeatedly,
the number of lock classes will eventually reach the maximum.
:::
One might argue that the validator should be modified to allow
lock classes to be reused.  However, if you are tempted to make this
argument, first review the code and think through the changes that would
be required, keeping in mind that the lock classes to be removed are
likely to be linked into the lock-dependency graph.  This turns out to
be harder to do than to say.


I am wondering these
"module unloading does not remove old classes"
"the lock classes to be removed are likely to be linked into the
lock-dependency graph"
sentences are still valid?
Does "the lock-dependency graph" mean
class->hash_entry
class->lock_entry
and/or
list_entries[]?
Those are handled by zap_class() at unloading the module.

Here is my question.
Doesn't zap_class() make the slot in lock_classes[] logically unused?
If so, can we re-use the unused slot by searchng and testing some
members in struct lock_class?  For example,

bool test_unused(class)
{
return !rcu_access_pointer(class->name)
&& !rcu_access_pointer(class->key)
&& list_empty(>lock_entry)
&& hlist_unhashed(>hash_entry);
}

Though a new function list_del_init_rcu() for zap_class() will be
necessary.


J. R. Okajima


Re: [PATCH 14/39] ovl: stack file ops

2018-06-13 Thread J. R. Okajima
Al Viro:
> I'd managed to push that particular nest of horrors out of mind ;-/
> Having dug out my notes from back then and grepped around...  The real
> mess is not even /proc/*/maps - it's /proc/*/map_files/* and yes, the
> reasons for that kludge are still valid ;-/
:::
> Uses of ->vm_file (and rules for those) are too convoluted to untangle
> at the moment.  I still would love to get that straightened out, but
> it's not this cycle fodder, more's the pity...

I don't fully read this thread, but the discussion is related to the
file path printed in /proc/$$/maps?  If so, as just for your
information, here is an approach that aufs took.

In linux-v2.6 era, aufs tried implementing mmap by customzing
address_space ops, but it was not good and failed completing the
implementation.
As wel as overlayfs, aufs has two struct file objects for a single
a regular file.  One is for a virtual aufs' entry, and the other is for
a real layer's entry.  When a user issues mmap(2) for the virtual file,
aufs redirects the request to the real file on the layer internally.  So
the vm_file points to the real file.  It means /proc/$$/maps prints the
unexpected file path.

Aufs added another struct file* vm_prfile in struct vma.  It points to
the virtual aufs file, and /proc/$$/maps prints vm_prfile instead of
vm_file. Of cource, maintaining vm_prfile is important since vma may be
merged or splitted.
Still I don't like this approach, but I don't have another better idea,
also it works for many years.  You can get the patch in
aufs4-standalone.git on sourceforge if you want.


J. R. Okajima


Re: [PATCH 14/39] ovl: stack file ops

2018-06-13 Thread J. R. Okajima
Al Viro:
> I'd managed to push that particular nest of horrors out of mind ;-/
> Having dug out my notes from back then and grepped around...  The real
> mess is not even /proc/*/maps - it's /proc/*/map_files/* and yes, the
> reasons for that kludge are still valid ;-/
:::
> Uses of ->vm_file (and rules for those) are too convoluted to untangle
> at the moment.  I still would love to get that straightened out, but
> it's not this cycle fodder, more's the pity...

I don't fully read this thread, but the discussion is related to the
file path printed in /proc/$$/maps?  If so, as just for your
information, here is an approach that aufs took.

In linux-v2.6 era, aufs tried implementing mmap by customzing
address_space ops, but it was not good and failed completing the
implementation.
As wel as overlayfs, aufs has two struct file objects for a single
a regular file.  One is for a virtual aufs' entry, and the other is for
a real layer's entry.  When a user issues mmap(2) for the virtual file,
aufs redirects the request to the real file on the layer internally.  So
the vm_file points to the real file.  It means /proc/$$/maps prints the
unexpected file path.

Aufs added another struct file* vm_prfile in struct vma.  It points to
the virtual aufs file, and /proc/$$/maps prints vm_prfile instead of
vm_file. Of cource, maintaining vm_prfile is important since vma may be
merged or splitted.
Still I don't like this approach, but I don't have another better idea,
also it works for many years.  You can get the patch in
aufs4-standalone.git on sourceforge if you want.


J. R. Okajima


Re: Why is the length of max mount option a page size??

2018-05-29 Thread J. R. Okajima
Jungsub Shin:
> Do you mean using symlinks or bind mount to reduce length of mount
> options?? I can remove "~/tca_agent/image_layer_dir" string. But the
> other character means unique ID of layer and i can't compress it or
> replace it to another string. This is not fundamental solution. Docker=20

Can't it be a solution?

# mount -o bind /long/path0 /br0
# mount -o bind /long/path1 /br1
:::
# mount -t aufs -o br=/br0:/br1:... none /mntpnt

or

# ln -s /long/path0 /br0
# ln -s /long/path1 /br1
:::
# mount -t aufs -o br=/br0:/br1:... none /mntpnt


J. R. Okajima


Re: Why is the length of max mount option a page size??

2018-05-29 Thread J. R. Okajima
Jungsub Shin:
> Do you mean using symlinks or bind mount to reduce length of mount
> options?? I can remove "~/tca_agent/image_layer_dir" string. But the
> other character means unique ID of layer and i can't compress it or
> replace it to another string. This is not fundamental solution. Docker=20

Can't it be a solution?

# mount -o bind /long/path0 /br0
# mount -o bind /long/path1 /br1
:::
# mount -t aufs -o br=/br0:/br1:... none /mntpnt

or

# ln -s /long/path0 /br0
# ln -s /long/path1 /br1
:::
# mount -t aufs -o br=/br0:/br1:... none /mntpnt


J. R. Okajima


Re: Why is the length of max mount option a page size??

2018-05-28 Thread J. R. Okajima
Jungsub Shin:
> I suffered this problem below environments.
> 
> OS : Ubuntu 18.04 LTS / 4.15.0-20-generic / 64bit
> AUFS module parameter : brs =3D 1 / allow_userns =3D N
> Mount command : mount -t aufs -o br=3D~/tca_agent/container_rw_dir/51509032=
:::
(very long command line)

For such long branches, try "remount,append" option.

# mount -t aufs -o br=br0:br1:br2...:brN none /mntpnt
# mount -o remount,append=brN+1 /mntpnt
# mount -o remount,append=brN+2 /mntpnt
    :::


J. R. Okajima



Re: Why is the length of max mount option a page size??

2018-05-28 Thread J. R. Okajima
Jungsub Shin:
> I suffered this problem below environments.
> 
> OS : Ubuntu 18.04 LTS / 4.15.0-20-generic / 64bit
> AUFS module parameter : brs =3D 1 / allow_userns =3D N
> Mount command : mount -t aufs -o br=3D~/tca_agent/container_rw_dir/51509032=
:::
(very long command line)

For such long branches, try "remount,append" option.

# mount -t aufs -o br=br0:br1:br2...:brN none /mntpnt
# mount -o remount,append=brN+1 /mntpnt
# mount -o remount,append=brN+2 /mntpnt
    :::


J. R. Okajima



Re: Why is the length of max mount option a page size??

2018-05-25 Thread J. R. Okajima
Jungsub Shin:
> I know that almost mount option short and page size is enough to
> contain almost mount options. but because of options for 
> unionfs(overlay, aufs), lenght of mount option could be exceed it
> and cut.
>
> I suffer this problem with aufs. aufs's branch option is cut and 
> fail to aufs mount. I know that aufs is not offical fs in linux, 

For aufs, it should not happen because of the module parameter 'brs.'

(from the aufs manual)
--
.SH Module Parameters
.TP
.B brs=1 | 0
Specifies to use the branch path data file under sysfs or not.

If the number of your branches is large or their path is long
and you meet the limitation of mount(8) ro /etc/mtab, you need to
enable CONFIG_SYSFS and set aufs module parameter brs=1.

When this parameter is set as 1, aufs does not show `br:' (or dirs=)
mount option through /proc/mounts (and /etc/mtab). So you can
keep yourself from the page limitation of
mount(8) or /etc/mtab.
Aufs shows branch paths through /fs/aufs/si_XXX/brNNN.
Actually the file under sysfs has also a size limitation, but I don't
think it is harmful.

There is one more side effect in setting 1 to this parameter.
If you rename your branch, the branch path written in /etc/mtab will be
obsoleted and the future remount will meet some error due to the
unmatched parameters (Remember that mount(8) may take the options from
/etc/mtab and pass them to the systemcall).
If you set 1, /etc/mtab will not hold the branch path and you will not
meet such trouble. On the other hand, the entries for the
branch path under sysfs are generated dynamically. So it must not be obsoleted.
But I don't think users want to rename branches so often.

If CONFIG_SYSFS is disable, this parameter is always set to 0.
--


If you post the details of your environment to aufs-users ML, then I
may be able to describe more.


J. R. Okajima


Re: Why is the length of max mount option a page size??

2018-05-25 Thread J. R. Okajima
Jungsub Shin:
> I know that almost mount option short and page size is enough to
> contain almost mount options. but because of options for 
> unionfs(overlay, aufs), lenght of mount option could be exceed it
> and cut.
>
> I suffer this problem with aufs. aufs's branch option is cut and 
> fail to aufs mount. I know that aufs is not offical fs in linux, 

For aufs, it should not happen because of the module parameter 'brs.'

(from the aufs manual)
--
.SH Module Parameters
.TP
.B brs=1 | 0
Specifies to use the branch path data file under sysfs or not.

If the number of your branches is large or their path is long
and you meet the limitation of mount(8) ro /etc/mtab, you need to
enable CONFIG_SYSFS and set aufs module parameter brs=1.

When this parameter is set as 1, aufs does not show `br:' (or dirs=)
mount option through /proc/mounts (and /etc/mtab). So you can
keep yourself from the page limitation of
mount(8) or /etc/mtab.
Aufs shows branch paths through /fs/aufs/si_XXX/brNNN.
Actually the file under sysfs has also a size limitation, but I don't
think it is harmful.

There is one more side effect in setting 1 to this parameter.
If you rename your branch, the branch path written in /etc/mtab will be
obsoleted and the future remount will meet some error due to the
unmatched parameters (Remember that mount(8) may take the options from
/etc/mtab and pass them to the systemcall).
If you set 1, /etc/mtab will not hold the branch path and you will not
meet such trouble. On the other hand, the entries for the
branch path under sysfs are generated dynamically. So it must not be obsoleted.
But I don't think users want to rename branches so often.

If CONFIG_SYSFS is disable, this parameter is always set to 0.
--


If you post the details of your environment to aufs-users ML, then I
may be able to describe more.


J. R. Okajima


Re: [RFC PATCH 00/35] overlayfs: stack file operations

2018-04-25 Thread J. R. Okajima
Miklos Szeredi:
> This patch series reverts the VFS hacks (with the exception of d_path) and

I totally agree with removing d_real things.
It must be good to the world.

If I understand correctly, this series affects file_inode() too.
So there may exist more commits to revert such as
fea6d2a6 2017-02-14 vfs: Use upper filesystem inode in bprm_fill_uid()

Here is another question.
Does overlayfs support atomic_open?
I remember implementing atomic_open on aufs was rather tricky many years
ago, and I am interested in how overlayfs addresses it.


J. R. Okajima


Re: [RFC PATCH 00/35] overlayfs: stack file operations

2018-04-25 Thread J. R. Okajima
Miklos Szeredi:
> This patch series reverts the VFS hacks (with the exception of d_path) and

I totally agree with removing d_real things.
It must be good to the world.

If I understand correctly, this series affects file_inode() too.
So there may exist more commits to revert such as
fea6d2a6 2017-02-14 vfs: Use upper filesystem inode in bprm_fill_uid()

Here is another question.
Does overlayfs support atomic_open?
I remember implementing atomic_open on aufs was rather tricky many years
ago, and I am interested in how overlayfs addresses it.


J. R. Okajima


Re: [GIT PULL] overlayfs update for 4.14

2017-09-17 Thread J. R. Okajima
Linus Torvalds:
> On Wed, Sep 13, 2017 at 11:46 PM, Miklos Szeredi <mik...@szeredi.hu> wrote:
> >
> > d_real() is currently is *the* overlayfs-op:
>
> I know. And it's ugly as #%^! hell.
>
> We don't want to  make it uglier.
>
> And honestly, if you think that "it's only for overlayfs, so I can do
> anything I damn well want to this p[iece of shit" is valid, then I
> want to re-educate you. That is *not* how the VFS layer works.

Miklos,

I've just read your experimental patch to remove ->d_real(). I believe
it is the way to go. Moreover I'd suggest you to re-consider these
things.

- d_real_inode(), d_backing_inode(), and locks_inode()
  I don't think it a good idea for VFS and other modules to care "which
  inode I should use" issue. Ideally they should call d_inode() and
  file_inode() only as they used to.

- RENAME_WHITEOUT flag for user-space
  Why does this flag visible to users? It is a pure internal flag, isn't
  it? When and what for do users specify this flag?

Git-grepping the overlayfs related changes in other modules, I'd also
suggest you to re-consider these.

v4.2
4bacc9c 2015-06-19 overlayfs: Make f_path always point to the overlay and 
f_inode to the underlay

v4.6
54d5ca8 2016-05-10 vfs: add vfs_select_inode() helper
d101a12 2016-03-26 fs: add file_dentry()

v4.7
a118084 2016-05-20 vfs: add d_real_inode() helper

v4.8
c1892c3 2016-08-03 vfs: fix deadlock in file_remove_privs() on overlayfs

v4.9
4d0c5ba 2016-09-16 vfs: do get_write_access() on upper layer of overlayfs
c568d68 2016-09-16 locks: fix file locking on overlayfs
f3fbbb0 2016-09-16 fsnotify: support overlayfs
598e3c8 2016-09-16 vfs: update ovl inode before relatime check

v4.12
78757af 2017-04-20 vfs: ftruncate check IS_APPEND() on real upper inode

v4.13
ad0af71 2017-07-04 vfs: introduce inode 'inuse' lock


J. R. Okajima


Re: [GIT PULL] overlayfs update for 4.14

2017-09-17 Thread J. R. Okajima
Linus Torvalds:
> On Wed, Sep 13, 2017 at 11:46 PM, Miklos Szeredi  wrote:
> >
> > d_real() is currently is *the* overlayfs-op:
>
> I know. And it's ugly as #%^! hell.
>
> We don't want to  make it uglier.
>
> And honestly, if you think that "it's only for overlayfs, so I can do
> anything I damn well want to this p[iece of shit" is valid, then I
> want to re-educate you. That is *not* how the VFS layer works.

Miklos,

I've just read your experimental patch to remove ->d_real(). I believe
it is the way to go. Moreover I'd suggest you to re-consider these
things.

- d_real_inode(), d_backing_inode(), and locks_inode()
  I don't think it a good idea for VFS and other modules to care "which
  inode I should use" issue. Ideally they should call d_inode() and
  file_inode() only as they used to.

- RENAME_WHITEOUT flag for user-space
  Why does this flag visible to users? It is a pure internal flag, isn't
  it? When and what for do users specify this flag?

Git-grepping the overlayfs related changes in other modules, I'd also
suggest you to re-consider these.

v4.2
4bacc9c 2015-06-19 overlayfs: Make f_path always point to the overlay and 
f_inode to the underlay

v4.6
54d5ca8 2016-05-10 vfs: add vfs_select_inode() helper
d101a12 2016-03-26 fs: add file_dentry()

v4.7
a118084 2016-05-20 vfs: add d_real_inode() helper

v4.8
c1892c3 2016-08-03 vfs: fix deadlock in file_remove_privs() on overlayfs

v4.9
4d0c5ba 2016-09-16 vfs: do get_write_access() on upper layer of overlayfs
c568d68 2016-09-16 locks: fix file locking on overlayfs
f3fbbb0 2016-09-16 fsnotify: support overlayfs
598e3c8 2016-09-16 vfs: update ovl inode before relatime check

v4.12
78757af 2017-04-20 vfs: ftruncate check IS_APPEND() on real upper inode

v4.13
ad0af71 2017-07-04 vfs: introduce inode 'inuse' lock


J. R. Okajima


Re: fanotify hangs with AUFS

2017-09-03 Thread J. R. Okajima
Ariel Zelivansky:
> I just opened bug 196815 on the kernel bugzilla
> (https://bugzilla.kernel.org/show_bug.cgi?id=196815). Let me know if
> there's anything else I can do

It seems that the problem exists in aufs.
I will dive into it.
Mainline people should not worry about it.

Ariel,
Aufs is not in mainline, so I don't think it is appropriate to book the
report into kernel bugzilla. Read aufs README file please.


J. R. Okajima


Re: fanotify hangs with AUFS

2017-09-03 Thread J. R. Okajima
Ariel Zelivansky:
> I just opened bug 196815 on the kernel bugzilla
> (https://bugzilla.kernel.org/show_bug.cgi?id=196815). Let me know if
> there's anything else I can do

It seems that the problem exists in aufs.
I will dive into it.
Mainline people should not worry about it.

Ariel,
Aufs is not in mainline, so I don't think it is appropriate to book the
report into kernel bugzilla. Read aufs README file please.


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-06-26 Thread J. R. Okajima
Jani Nikula:
> On Thu, 15 Jun 2017, "J. R. Okajima" <hooanon...@gmail.com> wrote:
> > How about v4.11.x series?
> > I got v4.11.5, but it doesn't contain the fix.
> > Do you have a plan?
>
> The upstream commit has the proper Cc: stable and Fixes: tags in place,
> it just takes a while for the patches to trickle to stable kernels.

Now I can see the fix exists in v4.11.7.
Thank you.


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-06-26 Thread J. R. Okajima
Jani Nikula:
> On Thu, 15 Jun 2017, "J. R. Okajima"  wrote:
> > How about v4.11.x series?
> > I got v4.11.5, but it doesn't contain the fix.
> > Do you have a plan?
>
> The upstream commit has the proper Cc: stable and Fixes: tags in place,
> it just takes a while for the patches to trickle to stable kernels.

Now I can see the fix exists in v4.11.7.
Thank you.


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-06-14 Thread J. R. Okajima
Joonas Lahtinen:
> Don't worry, it's not lost. It was merged to drm-intel-fixes and thus is in 
> the pipeline.
>
> There were some unexpected delays getting fixes in, sorry for that.

Thanx, I got linux-v4.12-rc4 and it contains
4681ee2 2017-05-18 drm/i915: Do not sync RCU during shrinking

How about v4.11.x series?
I got v4.11.5, but it doesn't contain the fix.
Do you have a plan?


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-06-14 Thread J. R. Okajima
Joonas Lahtinen:
> Don't worry, it's not lost. It was merged to drm-intel-fixes and thus is in 
> the pipeline.
>
> There were some unexpected delays getting fixes in, sorry for that.

Thanx, I got linux-v4.12-rc4 and it contains
4681ee2 2017-05-18 drm/i915: Do not sync RCU during shrinking

How about v4.11.x series?
I got v4.11.5, but it doesn't contain the fix.
Do you have a plan?


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-05-19 Thread J. R. Okajima
"J. R. Okajima":
> I don't know whether the fix is good to me or not yet. I will test your
> fix, but I am busy now and my test will be a few weeks later. Other
> people may want the fix soon. So I'd suggest you to reproduce the
> problem on your side. I guess "mem=1G" or "mem=512M" will make it easier
> to reproduce the problem.
> Of course, if you are sure the fix is correct, then you don't have to
> wait for my test. Release it soon for other people.

Now I am going to able to run my local test.
But V3 patch failed to apply onto v4.11.0.

Would you provide us two versions of the patch, one for v4.11 series and
the other of v4.12-rcN?


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-05-19 Thread J. R. Okajima
"J. R. Okajima":
> I don't know whether the fix is good to me or not yet. I will test your
> fix, but I am busy now and my test will be a few weeks later. Other
> people may want the fix soon. So I'd suggest you to reproduce the
> problem on your side. I guess "mem=1G" or "mem=512M" will make it easier
> to reproduce the problem.
> Of course, if you are sure the fix is correct, then you don't have to
> wait for my test. Release it soon for other people.

Now I am going to able to run my local test.
But V3 patch failed to apply onto v4.11.0.

Would you provide us two versions of the patch, one for v4.11 series and
the other of v4.12-rcN?


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-05-05 Thread J. R. Okajima
Joonas Lahtinen:
> Filing a bug in freedesktop.org with all the details is the fastest way
> of getting help. Without the bug (and with such little information as
> the previous e-mail) it's hard to estimate the extent and nature of the
> bug.

My original report was
http://marc.info/?l=linux-kernel=149313183203325=2

The report contained
- kernel command line options
- lockdep msg
- call trace
but it didn't look drm/i915 shrinker is related. It was git-bisect which
lead me to drm/i915 shrinker.


> I've anyway gone and prepared a patch to drop the RCU sync completely
> from shrinker phase, as discussed originally with Chris.

Thank you.
I don't know whether the fix is good to me or not yet. I will test your
fix, but I am busy now and my test will be a few weeks later. Other
people may want the fix soon. So I'd suggest you to reproduce the
problem on your side. I guess "mem=1G" or "mem=512M" will make it easier
to reproduce the problem.
Of course, if you are sure the fix is correct, then you don't have to
wait for my test. Release it soon for other people.


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-05-05 Thread J. R. Okajima
Joonas Lahtinen:
> Filing a bug in freedesktop.org with all the details is the fastest way
> of getting help. Without the bug (and with such little information as
> the previous e-mail) it's hard to estimate the extent and nature of the
> bug.

My original report was
http://marc.info/?l=linux-kernel=149313183203325=2

The report contained
- kernel command line options
- lockdep msg
- call trace
but it didn't look drm/i915 shrinker is related. It was git-bisect which
lead me to drm/i915 shrinker.


> I've anyway gone and prepared a patch to drop the RCU sync completely
> from shrinker phase, as discussed originally with Chris.

Thank you.
I don't know whether the fix is good to me or not yet. I will test your
fix, but I am busy now and my test will be a few weeks later. Other
people may want the fix soon. So I'd suggest you to reproduce the
problem on your side. I guess "mem=1G" or "mem=512M" will make it easier
to reproduce the problem.
Of course, if you are sure the fix is correct, then you don't have to
wait for my test. Release it soon for other people.


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-04-30 Thread J. R. Okajima
Thanx for the reply.

Andrea Arcangeli:
> Yes I already reported this, my original fix was way more efficient
> (and also safer considering the above) than what landed upstream. My
> feedback was ignored though.
>
> https://lists.freedesktop.org/archives/intel-gfx/2017-April/125414.html

I see.
Actually on my test system for v4.11-rc8, kthreadd, kworker, kswapd and
others all stopped working due to the synchronize_rcu_expedited call
from i915_gem_shrinker_count. It is definitly a show stopper for me as
an i915 user.

It was a few weeks ago when you posted. It is a pity the fix was not
merged before v4.11 comes out. I know v4.11 will appear soon. So I'd ask
i915 developers, would you test Andrea Arcangeli's fix and release it as
v4.11.1 as soon as possible?


J. R. Okajima


Re: Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-04-30 Thread J. R. Okajima
Thanx for the reply.

Andrea Arcangeli:
> Yes I already reported this, my original fix was way more efficient
> (and also safer considering the above) than what landed upstream. My
> feedback was ignored though.
>
> https://lists.freedesktop.org/archives/intel-gfx/2017-April/125414.html

I see.
Actually on my test system for v4.11-rc8, kthreadd, kworker, kswapd and
others all stopped working due to the synchronize_rcu_expedited call
from i915_gem_shrinker_count. It is definitly a show stopper for me as
an i915 user.

It was a few weeks ago when you posted. It is a pity the fix was not
merged before v4.11 comes out. I know v4.11 will appear soon. So I'd ask
i915 developers, would you test Andrea Arcangeli's fix and release it as
v4.11.1 as soon as possible?


J. R. Okajima


Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-04-30 Thread J. R. Okajima
Hello,

Since v4.11-rc7 I can see the workqueue stops on my development/test system.
Git-bisecting tells me the suspicious commit is
c053b5a 2017-04-11 drm/i915: Don't call synchronize_rcu_expedited under 
struct_mutex

I am not sure whether this is the real cause or not of my problem, but I
have a question.
By the commit, the shrinker handlers ->scan_objects() and
->count_objects() both calls synchronize_rcu_expedited()
unconditionally. Is it a legal RCU bahavour?

I know dev->struct_mutex is unlocked now, but before the commit, these
two handlers were not calling synchronize_rcu_expedited().


J. R. Okajima


Q. drm/i915 shrinker, synchronize_rcu_expedited() from handlers

2017-04-30 Thread J. R. Okajima
Hello,

Since v4.11-rc7 I can see the workqueue stops on my development/test system.
Git-bisecting tells me the suspicious commit is
c053b5a 2017-04-11 drm/i915: Don't call synchronize_rcu_expedited under 
struct_mutex

I am not sure whether this is the real cause or not of my problem, but I
have a question.
By the commit, the shrinker handlers ->scan_objects() and
->count_objects() both calls synchronize_rcu_expedited()
unconditionally. Is it a legal RCU bahavour?

I know dev->struct_mutex is unlocked now, but before the commit, these
two handlers were not calling synchronize_rcu_expedited().


J. R. Okajima


v4.11-rc7, deadlock? workqueue pool->lock and serial8250 port_lock_key

2017-04-25 Thread J. R. Okajima
Hello,

I've found this kernel log from v4.11.0-rc7, and all(?) workqueues
seem stopped working. The system is using a serial console by the
kernel command line
root=3D/dev/sda8 ro console=3DttyS0,115200 console=3Dtty0 mem=3D1G

Does it mean that show_workqueue_state() on a serial console (almost?)
always cause a deadlock?
Who I can ask to check this log?

I am not sure whether this is related, but I added a workqueue for my
development. There is no such message in the log about my workqueue, but
I am afraid my development is related to any change around workqueue.


J. R. Okajima


=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D
[ INFO: possible circular locking dependency detected ]
4.11.0-rc7aufsD+ #114 Not tainted
---
dockerd/777 is trying to acquire lock:
 (_lock_key){-.-.-.}, at: [] serial8250_console_wri=
te+0x1ad/0x570

but task is already holding lock:
 (&(>lock)->rlock){-.-.-.}, at: [] show_workqueue_=
state+0x1e0/0xd90

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&(>lock)->rlock){-.-.-.}:
   lock_acquire+0xd5/0x380
   _raw_spin_lock+0x32/0x50
   __queue_work+0x143/0x1090
   queue_work_on+0x38/0x80
   put_pwq+0x152/0x2a0
   put_pwq_unlocked+0x22/0x30
   destroy_workqueue+0x3bd/0x680
   nfs4_state_shutdown+0x15/0x30 [nfsd]
   nfsd_shutdown_generic+0x26/0x50 [nfsd]
   nfsd_shutdown_net+0x135/0x2b0 [nfsd]
   nfsd_last_thread+0x21e/0x2d0 [nfsd]
   svc_shutdown_net+0x4c/0xa0
   nfsd_destroy+0x1b3/0x2f0 [nfsd]
   nfsd+0x28f/0x480 [nfsd]
   kthread+0x1a0/0x2d0
   ret_from_fork+0x2e/0x40

-> #1 (>lock/1){-.-.-.}:
   lock_acquire+0xd5/0x380
   _raw_spin_lock+0x32/0x50
   __queue_work+0x143/0x1090
   queue_work_on+0x38/0x80
   tty_schedule_flip+0x5b/0xb0
   tty_flip_buffer_push+0xe/0x10
   serial8250_rx_chars+0x19c/0x580
   serial8250_handle_irq.part.17+0xd5/0x190
   serial8250_default_handle_irq+0x5c/0xa0
   serial8250_interrupt+0x83/0x160
   __handle_irq_event_percpu+0x73/0x8f0
   handle_irq_event_percpu+0x25/0x90
   handle_irq_event+0x7d/0x160
   handle_edge_irq+0x139/0x9e0
   handle_irq+0xa1/0x400
   do_IRQ+0x82/0x190
   ret_from_intr+0x0/0x19
   cpuidle_enter_state+0x150/0x820
   cpuidle_enter+0x17/0x20
   do_idle+0x25a/0x480
   cpu_startup_entry+0x70/0x90
   rest_init+0x15e/0x170
   start_kernel+0x4e1/0x4ee
   x86_64_start_reservations+0x2a/0x2c
   x86_64_start_kernel+0xe7/0xee
   verify_cpu+0x0/0xfc

-> #0 (_lock_key){-.-.-.}:
   __lock_acquire+0x1fb7/0x2e00
   lock_acquire+0xd5/0x380
   _raw_spin_lock_irqsave+0x3d/0x60
   serial8250_console_write+0x1ad/0x570
   univ8250_console_write+0x36/0x60
   console_unlock+0x59b/0xa00
   vprintk_emit+0x241/0x430
   vprintk_default+0x1f/0x30
   vprintk_func+0x2c/0x80
   printk+0x43/0x4b
   show_workqueue_state+0x2a1/0xd90
   wq_watchdog_timer_fn+0x245/0x3e0
   call_timer_fn+0xd4/0x760
   run_timer_softirq+0x541/0xbb0
   __do_softirq+0x117/0xcb0
   irq_exit+0xf9/0x170
   smp_apic_timer_interrupt+0x44/0x70
   apic_timer_interrupt+0x90/0xa0
   read_hpet+0x107/0x120
   ktime_get_ts64+0x115/0x460
   poll_select_copy_remaining+0x7e/0x240
   SyS_select+0x117/0x160
   entry_SYSCALL_64_fastpath+0x18/0xad

other info that might help us debug this:

Chain exists of:
  _lock_key --> >lock/1 --> &(>lock)->rlock

 Possible unsafe locking scenario:

   CPU0CPU1
   
  lock(&(>lock)->rlock);
   lock(>lock/1);
   lock(&(>lock)->rlock);
  lock(_lock_key);

 *** DEADLOCK ***

4 locks held by dockerd/777:
 #0:  (/proj/aufs/aufs4-linux.git/kernel/workqueue.c:5320){+.-.-.}, at: [<=
811d278b>] call_timer_fn+0x3b/0x760
 #1:  (rcu_read_lock_sched){..}, at: [] show_workque=
ue_state+0x5/0xd90
 #2:  (&(>lock)->rlock){-.-.-.}, at: [] show_workq=
ueue_state+0x1e0/0xd90
 #3:  (console_lock){+.+.+.}, at: [] vprintk_emit+0x234/=
0x430

stack backtrace:
CPU: 0 PID: 777 Comm: dockerd Not tainted 4.11.0-rc7aufsD+ #114
Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009
Call Trace:
 
 dump_stack+0x68/0x9d
 print_circular_bug+0x2a6/0x2b7
 __lock_acquire+0x1fb7/0x2e00
 lock_acquire+0xd5/0x380
 ? lock_acquire+0xd5/0x380
 ? serial8250_console_write+0x1ad/0x570
 _raw_spin_lock_irqsave+0x3d/0x60
 ? serial8250_console_write+0x1ad/0x570
 serial8250_console_write+0x1ad/0x570
 ? _raw_spin_unlock+0x69/0x80
 univ8250_console_write+0x36/0x60
 console_unlock+0x59b/0xa00
 ? __down_trylock_console

v4.11-rc7, deadlock? workqueue pool->lock and serial8250 port_lock_key

2017-04-25 Thread J. R. Okajima
Hello,

I've found this kernel log from v4.11.0-rc7, and all(?) workqueues
seem stopped working. The system is using a serial console by the
kernel command line
root=3D/dev/sda8 ro console=3DttyS0,115200 console=3Dtty0 mem=3D1G

Does it mean that show_workqueue_state() on a serial console (almost?)
always cause a deadlock?
Who I can ask to check this log?

I am not sure whether this is related, but I added a workqueue for my
development. There is no such message in the log about my workqueue, but
I am afraid my development is related to any change around workqueue.


J. R. Okajima


=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D
[ INFO: possible circular locking dependency detected ]
4.11.0-rc7aufsD+ #114 Not tainted
---
dockerd/777 is trying to acquire lock:
 (_lock_key){-.-.-.}, at: [] serial8250_console_wri=
te+0x1ad/0x570

but task is already holding lock:
 (&(>lock)->rlock){-.-.-.}, at: [] show_workqueue_=
state+0x1e0/0xd90

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&(>lock)->rlock){-.-.-.}:
   lock_acquire+0xd5/0x380
   _raw_spin_lock+0x32/0x50
   __queue_work+0x143/0x1090
   queue_work_on+0x38/0x80
   put_pwq+0x152/0x2a0
   put_pwq_unlocked+0x22/0x30
   destroy_workqueue+0x3bd/0x680
   nfs4_state_shutdown+0x15/0x30 [nfsd]
   nfsd_shutdown_generic+0x26/0x50 [nfsd]
   nfsd_shutdown_net+0x135/0x2b0 [nfsd]
   nfsd_last_thread+0x21e/0x2d0 [nfsd]
   svc_shutdown_net+0x4c/0xa0
   nfsd_destroy+0x1b3/0x2f0 [nfsd]
   nfsd+0x28f/0x480 [nfsd]
   kthread+0x1a0/0x2d0
   ret_from_fork+0x2e/0x40

-> #1 (>lock/1){-.-.-.}:
   lock_acquire+0xd5/0x380
   _raw_spin_lock+0x32/0x50
   __queue_work+0x143/0x1090
   queue_work_on+0x38/0x80
   tty_schedule_flip+0x5b/0xb0
   tty_flip_buffer_push+0xe/0x10
   serial8250_rx_chars+0x19c/0x580
   serial8250_handle_irq.part.17+0xd5/0x190
   serial8250_default_handle_irq+0x5c/0xa0
   serial8250_interrupt+0x83/0x160
   __handle_irq_event_percpu+0x73/0x8f0
   handle_irq_event_percpu+0x25/0x90
   handle_irq_event+0x7d/0x160
   handle_edge_irq+0x139/0x9e0
   handle_irq+0xa1/0x400
   do_IRQ+0x82/0x190
   ret_from_intr+0x0/0x19
   cpuidle_enter_state+0x150/0x820
   cpuidle_enter+0x17/0x20
   do_idle+0x25a/0x480
   cpu_startup_entry+0x70/0x90
   rest_init+0x15e/0x170
   start_kernel+0x4e1/0x4ee
   x86_64_start_reservations+0x2a/0x2c
   x86_64_start_kernel+0xe7/0xee
   verify_cpu+0x0/0xfc

-> #0 (_lock_key){-.-.-.}:
   __lock_acquire+0x1fb7/0x2e00
   lock_acquire+0xd5/0x380
   _raw_spin_lock_irqsave+0x3d/0x60
   serial8250_console_write+0x1ad/0x570
   univ8250_console_write+0x36/0x60
   console_unlock+0x59b/0xa00
   vprintk_emit+0x241/0x430
   vprintk_default+0x1f/0x30
   vprintk_func+0x2c/0x80
   printk+0x43/0x4b
   show_workqueue_state+0x2a1/0xd90
   wq_watchdog_timer_fn+0x245/0x3e0
   call_timer_fn+0xd4/0x760
   run_timer_softirq+0x541/0xbb0
   __do_softirq+0x117/0xcb0
   irq_exit+0xf9/0x170
   smp_apic_timer_interrupt+0x44/0x70
   apic_timer_interrupt+0x90/0xa0
   read_hpet+0x107/0x120
   ktime_get_ts64+0x115/0x460
   poll_select_copy_remaining+0x7e/0x240
   SyS_select+0x117/0x160
   entry_SYSCALL_64_fastpath+0x18/0xad

other info that might help us debug this:

Chain exists of:
  _lock_key --> >lock/1 --> &(>lock)->rlock

 Possible unsafe locking scenario:

   CPU0CPU1
   
  lock(&(>lock)->rlock);
   lock(>lock/1);
   lock(&(>lock)->rlock);
  lock(_lock_key);

 *** DEADLOCK ***

4 locks held by dockerd/777:
 #0:  (/proj/aufs/aufs4-linux.git/kernel/workqueue.c:5320){+.-.-.}, at: [<=
811d278b>] call_timer_fn+0x3b/0x760
 #1:  (rcu_read_lock_sched){..}, at: [] show_workque=
ue_state+0x5/0xd90
 #2:  (&(>lock)->rlock){-.-.-.}, at: [] show_workq=
ueue_state+0x1e0/0xd90
 #3:  (console_lock){+.+.+.}, at: [] vprintk_emit+0x234/=
0x430

stack backtrace:
CPU: 0 PID: 777 Comm: dockerd Not tainted 4.11.0-rc7aufsD+ #114
Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009
Call Trace:
 
 dump_stack+0x68/0x9d
 print_circular_bug+0x2a6/0x2b7
 __lock_acquire+0x1fb7/0x2e00
 lock_acquire+0xd5/0x380
 ? lock_acquire+0xd5/0x380
 ? serial8250_console_write+0x1ad/0x570
 _raw_spin_lock_irqsave+0x3d/0x60
 ? serial8250_console_write+0x1ad/0x570
 serial8250_console_write+0x1ad/0x570
 ? _raw_spin_unlock+0x69/0x80
 univ8250_console_write+0x36/0x60
 console_unlock+0x59b/0xa00
 ? __down_trylock_console

[tip:locking/core] locking/lockdep: Add new check to lock_downgrade()

2017-03-16 Thread tip-bot for J. R. Okajima
Commit-ID:  6419c4af777a773a45a1b1af735de0fcd9a7dcc7
Gitweb: http://git.kernel.org/tip/6419c4af777a773a45a1b1af735de0fcd9a7dcc7
Author: J. R. Okajima <hooanon...@gmail.com>
AuthorDate: Fri, 3 Feb 2017 01:38:17 +0900
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 16 Mar 2017 09:57:07 +0100

locking/lockdep: Add new check to lock_downgrade()

Commit:

  f8319483f57f ("locking/lockdep: Provide a type check for lock_is_held")

didn't fully cover rwsems as downgrade_write() was left out.

Introduce lock_downgrade() and use it to add new checks.

See-also: http://marc.info/?l=linux-kernel=148581164003149=2
Originally-written-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: J. R. Okajima <hooanon...@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Link: 
http://lkml.kernel.org/r/1486053497-9948-3-git-send-email-hooanon...@gmail.com
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 include/linux/lockdep.h  |  3 +++
 kernel/locking/lockdep.c | 55 
 kernel/locking/rwsem.c   |  6 ++
 3 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 1e327bb..fffe49f 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_map 
*lock,
lock_set_class(lock, lock->name, lock->key, subclass, ip);
 }
 
+extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip);
+
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
 extern void lockdep_trace_alloc(gfp_t mask);
@@ -411,6 +413,7 @@ static inline void lockdep_on(void)
 
 # define lock_acquire(l, s, t, r, c, n, i) do { } while (0)
 # define lock_release(l, n, i) do { } while (0)
+# define lock_downgrade(l, i)  do { } while (0)
 # define lock_set_class(l, n, k, s, i) do { } while (0)
 # define lock_set_subclass(l, s, i)do { } while (0)
 # define lockdep_set_current_reclaim_state(g)  do { } while (0)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index da79548..b1a1cef 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3533,6 +3533,44 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
return 1;
 }
 
+static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   struct task_struct *curr = current;
+   struct held_lock *hlock;
+   unsigned int depth;
+   int i;
+
+   depth = curr->lockdep_depth;
+   /*
+* This function is about (re)setting the class of a held lock,
+* yet we're not actually holding any locks. Naughty user!
+*/
+   if (DEBUG_LOCKS_WARN_ON(!depth))
+   return 0;
+
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
+
+   curr->lockdep_depth = i;
+   curr->curr_chain_key = hlock->prev_chain_key;
+
+   WARN(hlock->read, "downgrading a read lock");
+   hlock->read = 1;
+   hlock->acquire_ip = ip;
+
+   if (reacquire_held_locks(curr, depth, i))
+   return 0;
+
+   /*
+* I took it apart and put it back together again, except now I have
+* these 'spare' parts.. where shall I put them.
+*/
+   if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth))
+   return 0;
+   return 1;
+}
+
 /*
  * Remove the lock to the list of currently held locks - this gets
  * called on mutex_unlock()/spin_unlock*() (or on a failed
@@ -3759,6 +3797,23 @@ void lock_set_class(struct lockdep_map *lock, const char 
*name,
 }
 EXPORT_SYMBOL_GPL(lock_set_class);
 
+void lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   unsigned long flags;
+
+   if (unlikely(current->lockdep_recursion))
+   return;
+
+   raw_local_irq_save(flags);
+   current->lockdep_recursion = 1;
+   check_flags(flags);
+   if (__lock_downgrade(lock, ip))
+   check_chain_key(current);
+   current->lockdep_recursion = 0;
+   raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_downgrade);
+
 /*
  * We are not always called with irqs disabled - do that here,
  * and also avoid lockdep recursion:
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 90a74cc..4d48b1c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -124,10 +124,8 @@ EXPORT_SYMBOL(up_write);
  */

[tip:locking/core] locking/lockdep: Add new check to lock_downgrade()

2017-03-16 Thread tip-bot for J. R. Okajima
Commit-ID:  6419c4af777a773a45a1b1af735de0fcd9a7dcc7
Gitweb: http://git.kernel.org/tip/6419c4af777a773a45a1b1af735de0fcd9a7dcc7
Author: J. R. Okajima 
AuthorDate: Fri, 3 Feb 2017 01:38:17 +0900
Committer:  Ingo Molnar 
CommitDate: Thu, 16 Mar 2017 09:57:07 +0100

locking/lockdep: Add new check to lock_downgrade()

Commit:

  f8319483f57f ("locking/lockdep: Provide a type check for lock_is_held")

didn't fully cover rwsems as downgrade_write() was left out.

Introduce lock_downgrade() and use it to add new checks.

See-also: http://marc.info/?l=linux-kernel=148581164003149=2
Originally-written-by: Peter Zijlstra 
Signed-off-by: J. R. Okajima 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1486053497-9948-3-git-send-email-hooanon...@gmail.com
[ Rewrote the changelog. ]
Signed-off-by: Ingo Molnar 
---
 include/linux/lockdep.h  |  3 +++
 kernel/locking/lockdep.c | 55 
 kernel/locking/rwsem.c   |  6 ++
 3 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 1e327bb..fffe49f 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_map 
*lock,
lock_set_class(lock, lock->name, lock->key, subclass, ip);
 }
 
+extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip);
+
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
 extern void lockdep_trace_alloc(gfp_t mask);
@@ -411,6 +413,7 @@ static inline void lockdep_on(void)
 
 # define lock_acquire(l, s, t, r, c, n, i) do { } while (0)
 # define lock_release(l, n, i) do { } while (0)
+# define lock_downgrade(l, i)  do { } while (0)
 # define lock_set_class(l, n, k, s, i) do { } while (0)
 # define lock_set_subclass(l, s, i)do { } while (0)
 # define lockdep_set_current_reclaim_state(g)  do { } while (0)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index da79548..b1a1cef 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3533,6 +3533,44 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
return 1;
 }
 
+static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   struct task_struct *curr = current;
+   struct held_lock *hlock;
+   unsigned int depth;
+   int i;
+
+   depth = curr->lockdep_depth;
+   /*
+* This function is about (re)setting the class of a held lock,
+* yet we're not actually holding any locks. Naughty user!
+*/
+   if (DEBUG_LOCKS_WARN_ON(!depth))
+   return 0;
+
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
+
+   curr->lockdep_depth = i;
+   curr->curr_chain_key = hlock->prev_chain_key;
+
+   WARN(hlock->read, "downgrading a read lock");
+   hlock->read = 1;
+   hlock->acquire_ip = ip;
+
+   if (reacquire_held_locks(curr, depth, i))
+   return 0;
+
+   /*
+* I took it apart and put it back together again, except now I have
+* these 'spare' parts.. where shall I put them.
+*/
+   if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth))
+   return 0;
+   return 1;
+}
+
 /*
  * Remove the lock to the list of currently held locks - this gets
  * called on mutex_unlock()/spin_unlock*() (or on a failed
@@ -3759,6 +3797,23 @@ void lock_set_class(struct lockdep_map *lock, const char 
*name,
 }
 EXPORT_SYMBOL_GPL(lock_set_class);
 
+void lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   unsigned long flags;
+
+   if (unlikely(current->lockdep_recursion))
+   return;
+
+   raw_local_irq_save(flags);
+   current->lockdep_recursion = 1;
+   check_flags(flags);
+   if (__lock_downgrade(lock, ip))
+   check_chain_key(current);
+   current->lockdep_recursion = 0;
+   raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_downgrade);
+
 /*
  * We are not always called with irqs disabled - do that here,
  * and also avoid lockdep recursion:
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 90a74cc..4d48b1c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -124,10 +124,8 @@ EXPORT_SYMBOL(up_write);
  */
 void downgrade_write(struct rw_semaphore *sem)
 {
-   /*
-* lockdep: a downgraded write will live on as a write
-* dependency.
-*/
+   lock_downgrade(>dep_map, _RET_IP_);
+
rwsem_set_reader_owned(sem);
__downgrade_write(sem);
 }


[tip:locking/core] locking/lockdep: Factor out the find_held_lock() helper function

2017-03-16 Thread tip-bot for J. R. Okajima
Commit-ID:  41c2c5b86a5e1a691ddacfc03b631b87a0b19043
Gitweb: http://git.kernel.org/tip/41c2c5b86a5e1a691ddacfc03b631b87a0b19043
Author: J. R. Okajima <hooanon...@gmail.com>
AuthorDate: Fri, 3 Feb 2017 01:38:15 +0900
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 16 Mar 2017 09:57:06 +0100

locking/lockdep: Factor out the find_held_lock() helper function

A simple consolidataion to factor out repeated patterns.

The behaviour should not change.

Signed-off-by: J. R. Okajima <hooanon...@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Link: 
http://lkml.kernel.org/r/1486053497-9948-1-git-send-email-hooanon...@gmail.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/locking/lockdep.c | 114 ++-
 1 file changed, 54 insertions(+), 60 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index a95e5d1..0d28b82 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3437,13 +3437,49 @@ static int match_held_lock(struct held_lock *hlock, 
struct lockdep_map *lock)
return 0;
 }
 
+/* @depth must not be zero */
+static struct held_lock *find_held_lock(struct task_struct *curr,
+   struct lockdep_map *lock,
+   unsigned int depth, int *idx)
+{
+   struct held_lock *ret, *hlock, *prev_hlock;
+   int i;
+
+   i = depth - 1;
+   hlock = curr->held_locks + i;
+   ret = hlock;
+   if (match_held_lock(hlock, lock))
+   goto out;
+
+   ret = NULL;
+   for (i--, prev_hlock = hlock--;
+i >= 0;
+i--, prev_hlock = hlock--) {
+   /*
+* We must not cross into another context:
+*/
+   if (prev_hlock->irq_context != hlock->irq_context) {
+   ret = NULL;
+   break;
+   }
+   if (match_held_lock(hlock, lock)) {
+   ret = hlock;
+   break;
+   }
+   }
+
+out:
+   *idx = i;
+   return ret;
+}
+
 static int
 __lock_set_class(struct lockdep_map *lock, const char *name,
 struct lock_class_key *key, unsigned int subclass,
 unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
struct lock_class *class;
unsigned int depth;
int i;
@@ -3456,21 +3492,10 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
if (DEBUG_LOCKS_WARN_ON(!depth))
return 0;
 
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
-   break;
-   if (match_held_lock(hlock, lock))
-   goto found_it;
-   prev_hlock = hlock;
-   }
-   return print_unlock_imbalance_bug(curr, lock, ip);
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
 
-found_it:
lockdep_init_map(lock, name, key, 0);
class = register_lock_class(lock, subclass, 0);
hlock->class_idx = class - lock_classes + 1;
@@ -3508,7 +3533,7 @@ static int
 __lock_release(struct lockdep_map *lock, int nested, unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
unsigned int depth;
int i;
 
@@ -3527,21 +3552,10 @@ __lock_release(struct lockdep_map *lock, int nested, 
unsigned long ip)
 * Check whether the lock exists in the current stack
 * of held locks:
 */
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
-   break;
-   if (match_held_lock(hlock, lock))
-   goto found_it;
-   prev_hlock = hlock;
-   }
-   return print_unlock_imbalance_bug(curr, lock, ip);
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock

[tip:locking/core] locking/lockdep: Factor out the validate_held_lock() helper function

2017-03-16 Thread tip-bot for J. R. Okajima
Commit-ID:  e969970be033841d4c16b2e8ec8a3608347db861
Gitweb: http://git.kernel.org/tip/e969970be033841d4c16b2e8ec8a3608347db861
Author: J. R. Okajima <hooanon...@gmail.com>
AuthorDate: Fri, 3 Feb 2017 01:38:16 +0900
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Thu, 16 Mar 2017 09:57:07 +0100

locking/lockdep: Factor out the validate_held_lock() helper function

Behaviour should not change.

Signed-off-by: J. R. Okajima <hooanon...@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Link: 
http://lkml.kernel.org/r/1486053497-9948-2-git-send-email-hooanon...@gmail.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/locking/lockdep.c | 40 ++--
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 0d28b82..da79548 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3473,6 +3473,24 @@ out:
return ret;
 }
 
+static int reacquire_held_locks(struct task_struct *curr, unsigned int depth,
+ int idx)
+{
+   struct held_lock *hlock;
+
+   for (hlock = curr->held_locks + idx; idx < depth; idx++, hlock++) {
+   if (!__lock_acquire(hlock->instance,
+   hlock_class(hlock)->subclass,
+   hlock->trylock,
+   hlock->read, hlock->check,
+   hlock->hardirqs_off,
+   hlock->nest_lock, hlock->acquire_ip,
+   hlock->references, hlock->pin_count))
+   return 1;
+   }
+   return 0;
+}
+
 static int
 __lock_set_class(struct lockdep_map *lock, const char *name,
 struct lock_class_key *key, unsigned int subclass,
@@ -3503,15 +3521,8 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
curr->lockdep_depth = i;
curr->curr_chain_key = hlock->prev_chain_key;
 
-   for (; i < depth; i++) {
-   hlock = curr->held_locks + i;
-   if (!__lock_acquire(hlock->instance,
-   hlock_class(hlock)->subclass, hlock->trylock,
-   hlock->read, hlock->check, hlock->hardirqs_off,
-   hlock->nest_lock, hlock->acquire_ip,
-   hlock->references, hlock->pin_count))
-   return 0;
-   }
+   if (reacquire_held_locks(curr, depth, i))
+   return 0;
 
/*
 * I took it apart and put it back together again, except now I have
@@ -3582,15 +3593,8 @@ __lock_release(struct lockdep_map *lock, int nested, 
unsigned long ip)
curr->lockdep_depth = i;
curr->curr_chain_key = hlock->prev_chain_key;
 
-   for (i++; i < depth; i++) {
-   hlock = curr->held_locks + i;
-   if (!__lock_acquire(hlock->instance,
-   hlock_class(hlock)->subclass, hlock->trylock,
-   hlock->read, hlock->check, hlock->hardirqs_off,
-   hlock->nest_lock, hlock->acquire_ip,
-   hlock->references, hlock->pin_count))
-   return 0;
-   }
+   if (reacquire_held_locks(curr, depth, i + 1))
+   return 0;
 
/*
 * We had N bottles of beer on the wall, we drank one, but now


[tip:locking/core] locking/lockdep: Factor out the find_held_lock() helper function

2017-03-16 Thread tip-bot for J. R. Okajima
Commit-ID:  41c2c5b86a5e1a691ddacfc03b631b87a0b19043
Gitweb: http://git.kernel.org/tip/41c2c5b86a5e1a691ddacfc03b631b87a0b19043
Author: J. R. Okajima 
AuthorDate: Fri, 3 Feb 2017 01:38:15 +0900
Committer:  Ingo Molnar 
CommitDate: Thu, 16 Mar 2017 09:57:06 +0100

locking/lockdep: Factor out the find_held_lock() helper function

A simple consolidataion to factor out repeated patterns.

The behaviour should not change.

Signed-off-by: J. R. Okajima 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1486053497-9948-1-git-send-email-hooanon...@gmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/locking/lockdep.c | 114 ++-
 1 file changed, 54 insertions(+), 60 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index a95e5d1..0d28b82 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3437,13 +3437,49 @@ static int match_held_lock(struct held_lock *hlock, 
struct lockdep_map *lock)
return 0;
 }
 
+/* @depth must not be zero */
+static struct held_lock *find_held_lock(struct task_struct *curr,
+   struct lockdep_map *lock,
+   unsigned int depth, int *idx)
+{
+   struct held_lock *ret, *hlock, *prev_hlock;
+   int i;
+
+   i = depth - 1;
+   hlock = curr->held_locks + i;
+   ret = hlock;
+   if (match_held_lock(hlock, lock))
+   goto out;
+
+   ret = NULL;
+   for (i--, prev_hlock = hlock--;
+i >= 0;
+i--, prev_hlock = hlock--) {
+   /*
+* We must not cross into another context:
+*/
+   if (prev_hlock->irq_context != hlock->irq_context) {
+   ret = NULL;
+   break;
+   }
+   if (match_held_lock(hlock, lock)) {
+   ret = hlock;
+   break;
+   }
+   }
+
+out:
+   *idx = i;
+   return ret;
+}
+
 static int
 __lock_set_class(struct lockdep_map *lock, const char *name,
 struct lock_class_key *key, unsigned int subclass,
 unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
struct lock_class *class;
unsigned int depth;
int i;
@@ -3456,21 +3492,10 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
if (DEBUG_LOCKS_WARN_ON(!depth))
return 0;
 
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
-   break;
-   if (match_held_lock(hlock, lock))
-   goto found_it;
-   prev_hlock = hlock;
-   }
-   return print_unlock_imbalance_bug(curr, lock, ip);
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
 
-found_it:
lockdep_init_map(lock, name, key, 0);
class = register_lock_class(lock, subclass, 0);
hlock->class_idx = class - lock_classes + 1;
@@ -3508,7 +3533,7 @@ static int
 __lock_release(struct lockdep_map *lock, int nested, unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
unsigned int depth;
int i;
 
@@ -3527,21 +3552,10 @@ __lock_release(struct lockdep_map *lock, int nested, 
unsigned long ip)
 * Check whether the lock exists in the current stack
 * of held locks:
 */
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
-   break;
-   if (match_held_lock(hlock, lock))
-   goto found_it;
-   prev_hlock = hlock;
-   }
-   return print_unlock_imbalance_bug(curr, lock, ip);
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
 
-found_it:
if (hlock->instance == lock)
lock_release_holdtime(hlock);
 
@@ -3903,7 +3917,7 @@ static void
 __lock_contended(struct lockdep_map *lock, unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *p

[tip:locking/core] locking/lockdep: Factor out the validate_held_lock() helper function

2017-03-16 Thread tip-bot for J. R. Okajima
Commit-ID:  e969970be033841d4c16b2e8ec8a3608347db861
Gitweb: http://git.kernel.org/tip/e969970be033841d4c16b2e8ec8a3608347db861
Author: J. R. Okajima 
AuthorDate: Fri, 3 Feb 2017 01:38:16 +0900
Committer:  Ingo Molnar 
CommitDate: Thu, 16 Mar 2017 09:57:07 +0100

locking/lockdep: Factor out the validate_held_lock() helper function

Behaviour should not change.

Signed-off-by: J. R. Okajima 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1486053497-9948-2-git-send-email-hooanon...@gmail.com
Signed-off-by: Ingo Molnar 
---
 kernel/locking/lockdep.c | 40 ++--
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 0d28b82..da79548 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3473,6 +3473,24 @@ out:
return ret;
 }
 
+static int reacquire_held_locks(struct task_struct *curr, unsigned int depth,
+ int idx)
+{
+   struct held_lock *hlock;
+
+   for (hlock = curr->held_locks + idx; idx < depth; idx++, hlock++) {
+   if (!__lock_acquire(hlock->instance,
+   hlock_class(hlock)->subclass,
+   hlock->trylock,
+   hlock->read, hlock->check,
+   hlock->hardirqs_off,
+   hlock->nest_lock, hlock->acquire_ip,
+   hlock->references, hlock->pin_count))
+   return 1;
+   }
+   return 0;
+}
+
 static int
 __lock_set_class(struct lockdep_map *lock, const char *name,
 struct lock_class_key *key, unsigned int subclass,
@@ -3503,15 +3521,8 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
curr->lockdep_depth = i;
curr->curr_chain_key = hlock->prev_chain_key;
 
-   for (; i < depth; i++) {
-   hlock = curr->held_locks + i;
-   if (!__lock_acquire(hlock->instance,
-   hlock_class(hlock)->subclass, hlock->trylock,
-   hlock->read, hlock->check, hlock->hardirqs_off,
-   hlock->nest_lock, hlock->acquire_ip,
-   hlock->references, hlock->pin_count))
-   return 0;
-   }
+   if (reacquire_held_locks(curr, depth, i))
+   return 0;
 
/*
 * I took it apart and put it back together again, except now I have
@@ -3582,15 +3593,8 @@ __lock_release(struct lockdep_map *lock, int nested, 
unsigned long ip)
curr->lockdep_depth = i;
curr->curr_chain_key = hlock->prev_chain_key;
 
-   for (i++; i < depth; i++) {
-   hlock = curr->held_locks + i;
-   if (!__lock_acquire(hlock->instance,
-   hlock_class(hlock)->subclass, hlock->trylock,
-   hlock->read, hlock->check, hlock->hardirqs_off,
-   hlock->nest_lock, hlock->acquire_ip,
-   hlock->references, hlock->pin_count))
-   return 0;
-   }
+   if (reacquire_held_locks(curr, depth, i + 1))
+   return 0;
 
/*
 * We had N bottles of beer on the wall, we drank one, but now


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread J. R. Okajima
James Bottomley:
> With a different owner view, but that's irrelevant to the underlying
> inode.

Ok, the different ownership is limited within shitfs (or userns,
container). Good. I might forget that shiftfs wants to behave like
bind-mount.

And I noticed that shiftfs setattr() converts uid/gid before calling
backend fs' ->setattr(). It is good too.
But how about acl? Won't such conversion be necessary for acl too?


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread J. R. Okajima
James Bottomley:
> With a different owner view, but that's irrelevant to the underlying
> inode.

Ok, the different ownership is limited within shitfs (or userns,
container). Good. I might forget that shiftfs wants to behave like
bind-mount.

And I noticed that shiftfs setattr() converts uid/gid before calling
backend fs' ->setattr(). It is good too.
But how about acl? Won't such conversion be necessary for acl too?


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread J. R. Okajima
James Bottomley:
> I realised as I was trimming down the vestigial inode properties in the
> patch that actually shiftfs does use the i_ino from the underlying for
> userspace.  The reason why is that it comes from the getattr call in
> stat and that's fully what the underlying filesystem returns (including
> the inode number).

Let me make sure.
- shiftfs has its own inode, but it will never be visible to userspace.
- the inode attr visible to users are equivalent to the underlying one,
  includeing dev:ino pair.
right?
If so, I am afraid it will make users confused. The dev:ino pair is a
system-wide identity, but shiftfs creates the same dev:ino pair with
different owner. Though I don't know whether the actual application or
LSM exists or not who will be damaged by this situation.
For git-status case which I wrote previously, it might not be a problem
as long as dev:ino is unchanged from git index.
But such filesystem looks weird.


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-20 Thread J. R. Okajima
James Bottomley:
> I realised as I was trimming down the vestigial inode properties in the
> patch that actually shiftfs does use the i_ino from the underlying for
> userspace.  The reason why is that it comes from the getattr call in
> stat and that's fully what the underlying filesystem returns (including
> the inode number).

Let me make sure.
- shiftfs has its own inode, but it will never be visible to userspace.
- the inode attr visible to users are equivalent to the underlying one,
  includeing dev:ino pair.
right?
If so, I am afraid it will make users confused. The dev:ino pair is a
system-wide identity, but shiftfs creates the same dev:ino pair with
different owner. Though I don't know whether the actual application or
LSM exists or not who will be damaged by this situation.
For git-status case which I wrote previously, it might not be a problem
as long as dev:ino is unchanged from git index.
But such filesystem looks weird.


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-06 Thread J. R. Okajima
James Bottomley:
> Yes, I know the problem.  However, I believe most current linux
> filesystems no longer guarantee stable, for the lifetime of the file,
> inode numbers.  The usual docker container root is overlayfs, which,
> similarly doesn't support stable inode numbers.  I see the odd
> complaint about docker with overlayfs having unstable inode numbers,
> but none seems to have any serious repercussions.

I think it serious.
Reusing the backend fs' inum is a good approach which Amir wrote.
Based on this, I'd suggest you to support the hardlinks.

bakend_dentry = lookup_one_len()
if (d_inode->i_nlink != 1)
shiftfs_inode = ilookup();
if (!shiftfs_inode) {
shiftfs_inode = new_inode();
shiftfs_inode->i_ino = bakend_dentry->d_inode->i_ino;
}


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-06 Thread J. R. Okajima
James Bottomley:
> Yes, I know the problem.  However, I believe most current linux
> filesystems no longer guarantee stable, for the lifetime of the file,
> inode numbers.  The usual docker container root is overlayfs, which,
> similarly doesn't support stable inode numbers.  I see the odd
> complaint about docker with overlayfs having unstable inode numbers,
> but none seems to have any serious repercussions.

I think it serious.
Reusing the backend fs' inum is a good approach which Amir wrote.
Based on this, I'd suggest you to support the hardlinks.

bakend_dentry = lookup_one_len()
if (d_inode->i_nlink != 1)
shiftfs_inode = ilookup();
if (!shiftfs_inode) {
shiftfs_inode = new_inode();
shiftfs_inode->i_ino = bakend_dentry->d_inode->i_ino;
}


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-05 Thread J. R. Okajima
James Bottomley:
> This allows any subtree to be uid/gid shifted and bound elsewhere.  It
:::

Interesting.
But I am afraid that the inconsistency problem of the inode numbers will
happen.

shiftfs_new_inode() uses get_next_ino() which means
- 1st time: inodeA is created and cached, inumA is assigned
- after using inodeA, it will be discarded from the cache
- 2nd time: inodeA is looked-up again, and another inode number (inumB)
  is assgined.

This inconsistency will not be a problem for the "pure virtual" fs such
as procfs and sysfs. But your shiftfs is not pure as them. Shiftfs will
be used as a wrapper (or "binder" which means bind-mount) of an orginary
filesystem.
The symptom of this problem from users perspective will be
- find -inum doesn't work
- git-status doesn't work, which keeps st_dev and st_ino and compares
  the current files.
Of course they will be limited to when the target dir is huge and/or
system memory is low. As long as the inode cache is large enough to hold
all necessary inodes, the problem won't happen.

If shiftfs will supports exporting via NFS in the future, the
consistency of inum will be important too.


J. R. Okajima


Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

2017-02-05 Thread J. R. Okajima
James Bottomley:
> This allows any subtree to be uid/gid shifted and bound elsewhere.  It
:::

Interesting.
But I am afraid that the inconsistency problem of the inode numbers will
happen.

shiftfs_new_inode() uses get_next_ino() which means
- 1st time: inodeA is created and cached, inumA is assigned
- after using inodeA, it will be discarded from the cache
- 2nd time: inodeA is looked-up again, and another inode number (inumB)
  is assgined.

This inconsistency will not be a problem for the "pure virtual" fs such
as procfs and sysfs. But your shiftfs is not pure as them. Shiftfs will
be used as a wrapper (or "binder" which means bind-mount) of an orginary
filesystem.
The symptom of this problem from users perspective will be
- find -inum doesn't work
- git-status doesn't work, which keeps st_dev and st_ino and compares
  the current files.
Of course they will be limited to when the target dir is huge and/or
system memory is low. As long as the inode cache is large enough to hold
all necessary inodes, the problem won't happen.

If shiftfs will supports exporting via NFS in the future, the
consistency of inum will be important too.


J. R. Okajima


Re: [PATCH 3/3] lockdep: new annotation lock_downgrade()

2017-02-02 Thread J. R. Okajima
Peter Zijlstra:
> > >> kernel/locking/rwsem.c:126:2: error: implicit declaration of functi=
on 'lock_downgrade' [-Werror=3Dimplicit-function-declaration]
> >  lock_downgrade(>dep_map, _RET_IP_);
> >  ^~
:::
> This is what you need !LOCKDEP stubs for ;-)

Ok, here is the update.
Just one line added.


J. R. Okajima


commit 6874cbfb3c4f757efecbeb800bfd4db1050698f6
Author: J. R. Okajima <hooanon...@gmail.com>
Date:   Thu Feb 2 23:41:38 2017 +0900

lockdep: new annotation lock_downgrade()
=

The commit
f831948 2016-11-30 locking/lockdep: Provide a type check for lock_is_=
held
didn't fully support rwsem. Here downgrade_write() supports the added =
type.
=

Originally-written-by: Peter Zijlstra <pet...@infradead.org>
See-also: http://marc.info/?l=3Dlinux-kernel=3D148581164003149=3D2
Signed-off-by: J. R. Okajima <hooanon...@gmail.com>

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 0345cbf..ba75d06 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_ma=
p *lock,
lock_set_class(lock, lock->name, lock->key, subclass, ip);
 }
 =

+extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip);
+
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
 extern void lockdep_trace_alloc(gfp_t mask);
@@ -413,6 +415,7 @@ static inline void lockdep_on(void)
 =

 # define lock_acquire(l, s, t, r, c, n, i) do { } while (0)
 # define lock_release(l, n, i) do { } while (0)
+# define lock_downgrade(l, i)  do { } while (0)
 # define lock_set_class(l, n, k, s, i) do { } while (0)
 # define lock_set_subclass(l, s, i)do { } while (0)
 # define lockdep_set_current_reclaim_state(g)  do { } while (0)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7dc8f8e..6a4a740 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3523,6 +3523,44 @@ __lock_set_class(struct lockdep_map *lock, const ch=
ar *name,
return 1;
 }
 =

+static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   struct task_struct *curr =3D current;
+   struct held_lock *hlock;
+   unsigned int depth;
+   int i;
+
+   depth =3D curr->lockdep_depth;
+   /*
+* This function is about (re)setting the class of a held lock,
+* yet we're not actually holding any locks. Naughty user!
+*/
+   if (DEBUG_LOCKS_WARN_ON(!depth))
+   return 0;
+
+   hlock =3D find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
+
+   curr->lockdep_depth =3D i;
+   curr->curr_chain_key =3D hlock->prev_chain_key;
+
+   WARN(hlock->read, "downgrading a read lock");
+   hlock->read =3D 1;
+   hlock->acquire_ip =3D ip;
+
+   if (validate_held_lock(curr, depth, i))
+   return 0;
+
+   /*
+* I took it apart and put it back together again, except now I have
+* these 'spare' parts.. where shall I put them.
+*/
+   if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth !=3D depth))
+   return 0;
+   return 1;
+}
+
 /*
  * Remove the lock to the list of currently held locks - this gets
  * called on mutex_unlock()/spin_unlock*() (or on a failed
@@ -3749,6 +3787,23 @@ void lock_set_class(struct lockdep_map *lock, const=
 char *name,
 }
 EXPORT_SYMBOL_GPL(lock_set_class);
 =

+void lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   unsigned long flags;
+
+   if (unlikely(current->lockdep_recursion))
+   return;
+
+   raw_local_irq_save(flags);
+   current->lockdep_recursion =3D 1;
+   check_flags(flags);
+   if (__lock_downgrade(lock, ip))
+   check_chain_key(current);
+   current->lockdep_recursion =3D 0;
+   raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_downgrade);
+
 /*
  * We are not always called with irqs disabled - do that here,
  * and also avoid lockdep recursion:
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 45ba475..31db3ef 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -123,10 +123,8 @@ EXPORT_SYMBOL(up_write);
  */
 void downgrade_write(struct rw_semaphore *sem)
 {
-   /*
-* lockdep: a downgraded write will live on as a write
-* dependency.
-*/
+   lock_downgrade(>dep_map, _RET_IP_);
+
rwsem_set_reader_owned(sem);
__downgrade_write(sem);
 }


Re: [PATCH 3/3] lockdep: new annotation lock_downgrade()

2017-02-02 Thread J. R. Okajima
Peter Zijlstra:
> > >> kernel/locking/rwsem.c:126:2: error: implicit declaration of functi=
on 'lock_downgrade' [-Werror=3Dimplicit-function-declaration]
> >  lock_downgrade(>dep_map, _RET_IP_);
> >  ^~
:::
> This is what you need !LOCKDEP stubs for ;-)

Ok, here is the update.
Just one line added.


J. R. Okajima


commit 6874cbfb3c4f757efecbeb800bfd4db1050698f6
Author: J. R. Okajima 
Date:   Thu Feb 2 23:41:38 2017 +0900

lockdep: new annotation lock_downgrade()
=

The commit
f831948 2016-11-30 locking/lockdep: Provide a type check for lock_is_=
held
didn't fully support rwsem. Here downgrade_write() supports the added =
type.
=

Originally-written-by: Peter Zijlstra 
See-also: http://marc.info/?l=3Dlinux-kernel=3D148581164003149=3D2
    Signed-off-by: J. R. Okajima 

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 0345cbf..ba75d06 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_ma=
p *lock,
lock_set_class(lock, lock->name, lock->key, subclass, ip);
 }
 =

+extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip);
+
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
 extern void lockdep_trace_alloc(gfp_t mask);
@@ -413,6 +415,7 @@ static inline void lockdep_on(void)
 =

 # define lock_acquire(l, s, t, r, c, n, i) do { } while (0)
 # define lock_release(l, n, i) do { } while (0)
+# define lock_downgrade(l, i)  do { } while (0)
 # define lock_set_class(l, n, k, s, i) do { } while (0)
 # define lock_set_subclass(l, s, i)do { } while (0)
 # define lockdep_set_current_reclaim_state(g)  do { } while (0)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7dc8f8e..6a4a740 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3523,6 +3523,44 @@ __lock_set_class(struct lockdep_map *lock, const ch=
ar *name,
return 1;
 }
 =

+static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   struct task_struct *curr =3D current;
+   struct held_lock *hlock;
+   unsigned int depth;
+   int i;
+
+   depth =3D curr->lockdep_depth;
+   /*
+* This function is about (re)setting the class of a held lock,
+* yet we're not actually holding any locks. Naughty user!
+*/
+   if (DEBUG_LOCKS_WARN_ON(!depth))
+   return 0;
+
+   hlock =3D find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
+
+   curr->lockdep_depth =3D i;
+   curr->curr_chain_key =3D hlock->prev_chain_key;
+
+   WARN(hlock->read, "downgrading a read lock");
+   hlock->read =3D 1;
+   hlock->acquire_ip =3D ip;
+
+   if (validate_held_lock(curr, depth, i))
+   return 0;
+
+   /*
+* I took it apart and put it back together again, except now I have
+* these 'spare' parts.. where shall I put them.
+*/
+   if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth !=3D depth))
+   return 0;
+   return 1;
+}
+
 /*
  * Remove the lock to the list of currently held locks - this gets
  * called on mutex_unlock()/spin_unlock*() (or on a failed
@@ -3749,6 +3787,23 @@ void lock_set_class(struct lockdep_map *lock, const=
 char *name,
 }
 EXPORT_SYMBOL_GPL(lock_set_class);
 =

+void lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   unsigned long flags;
+
+   if (unlikely(current->lockdep_recursion))
+   return;
+
+   raw_local_irq_save(flags);
+   current->lockdep_recursion =3D 1;
+   check_flags(flags);
+   if (__lock_downgrade(lock, ip))
+   check_chain_key(current);
+   current->lockdep_recursion =3D 0;
+   raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_downgrade);
+
 /*
  * We are not always called with irqs disabled - do that here,
  * and also avoid lockdep recursion:
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 45ba475..31db3ef 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -123,10 +123,8 @@ EXPORT_SYMBOL(up_write);
  */
 void downgrade_write(struct rw_semaphore *sem)
 {
-   /*
-* lockdep: a downgraded write will live on as a write
-* dependency.
-*/
+   lock_downgrade(>dep_map, _RET_IP_);
+
rwsem_set_reader_owned(sem);
__downgrade_write(sem);
 }


[PATCH 1/3] lockdep: consolidate by new find_held_lock()

2017-02-02 Thread J. R. Okajima
A simple consolidataion. The behaviour should not change.

Signed-off-by: J. R. Okajima <hooanon...@gmail.com>
---
 kernel/locking/lockdep.c | 114 ++-
 1 file changed, 54 insertions(+), 60 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 85d9222..b7a2001 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3428,13 +3428,49 @@ static int match_held_lock(struct held_lock *hlock, 
struct lockdep_map *lock)
return 0;
 }
 
+/* @depth must not be zero */
+static struct held_lock *find_held_lock(struct task_struct *curr,
+   struct lockdep_map *lock,
+   unsigned int depth, int *idx)
+{
+   struct held_lock *ret, *hlock, *prev_hlock;
+   int i;
+
+   i = depth - 1;
+   hlock = curr->held_locks + i;
+   ret = hlock;
+   if (match_held_lock(hlock, lock))
+   goto out;
+
+   ret = NULL;
+   for (i--, prev_hlock = hlock--;
+i >= 0;
+i--, prev_hlock = hlock--) {
+   /*
+* We must not cross into another context:
+*/
+   if (prev_hlock->irq_context != hlock->irq_context) {
+   ret = NULL;
+   break;
+   }
+   if (match_held_lock(hlock, lock)) {
+   ret = hlock;
+   break;
+   }
+   }
+
+out:
+   *idx = i;
+   return ret;
+}
+
 static int
 __lock_set_class(struct lockdep_map *lock, const char *name,
 struct lock_class_key *key, unsigned int subclass,
 unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
struct lock_class *class;
unsigned int depth;
int i;
@@ -3447,21 +3483,10 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
if (DEBUG_LOCKS_WARN_ON(!depth))
return 0;
 
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
-   break;
-   if (match_held_lock(hlock, lock))
-   goto found_it;
-   prev_hlock = hlock;
-   }
-   return print_unlock_imbalance_bug(curr, lock, ip);
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
 
-found_it:
lockdep_init_map(lock, name, key, 0);
class = register_lock_class(lock, subclass, 0);
hlock->class_idx = class - lock_classes + 1;
@@ -3499,7 +3524,7 @@ static int
 __lock_release(struct lockdep_map *lock, int nested, unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
unsigned int depth;
int i;
 
@@ -3518,21 +3543,10 @@ __lock_release(struct lockdep_map *lock, int nested, 
unsigned long ip)
 * Check whether the lock exists in the current stack
 * of held locks:
 */
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
-   break;
-   if (match_held_lock(hlock, lock))
-   goto found_it;
-   prev_hlock = hlock;
-   }
-   return print_unlock_imbalance_bug(curr, lock, ip);
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
 
-found_it:
if (hlock->instance == lock)
lock_release_holdtime(hlock);
 
@@ -3894,7 +3908,7 @@ static void
 __lock_contended(struct lockdep_map *lock, unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
struct lock_class_stats *stats;
unsigned int depth;
int i, contention_point, contending_point;
@@ -3907,22 +3921,12 @@ __lock_contended(struct lockdep_map *lock, unsigned 
long ip)
if (DEBUG_LOCKS_WARN_ON(!depth))
return;
 
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock &&

[PATCH 1/3] lockdep: consolidate by new find_held_lock()

2017-02-02 Thread J. R. Okajima
A simple consolidataion. The behaviour should not change.

Signed-off-by: J. R. Okajima 
---
 kernel/locking/lockdep.c | 114 ++-
 1 file changed, 54 insertions(+), 60 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 85d9222..b7a2001 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3428,13 +3428,49 @@ static int match_held_lock(struct held_lock *hlock, 
struct lockdep_map *lock)
return 0;
 }
 
+/* @depth must not be zero */
+static struct held_lock *find_held_lock(struct task_struct *curr,
+   struct lockdep_map *lock,
+   unsigned int depth, int *idx)
+{
+   struct held_lock *ret, *hlock, *prev_hlock;
+   int i;
+
+   i = depth - 1;
+   hlock = curr->held_locks + i;
+   ret = hlock;
+   if (match_held_lock(hlock, lock))
+   goto out;
+
+   ret = NULL;
+   for (i--, prev_hlock = hlock--;
+i >= 0;
+i--, prev_hlock = hlock--) {
+   /*
+* We must not cross into another context:
+*/
+   if (prev_hlock->irq_context != hlock->irq_context) {
+   ret = NULL;
+   break;
+   }
+   if (match_held_lock(hlock, lock)) {
+   ret = hlock;
+   break;
+   }
+   }
+
+out:
+   *idx = i;
+   return ret;
+}
+
 static int
 __lock_set_class(struct lockdep_map *lock, const char *name,
 struct lock_class_key *key, unsigned int subclass,
 unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
struct lock_class *class;
unsigned int depth;
int i;
@@ -3447,21 +3483,10 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
if (DEBUG_LOCKS_WARN_ON(!depth))
return 0;
 
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
-   break;
-   if (match_held_lock(hlock, lock))
-   goto found_it;
-   prev_hlock = hlock;
-   }
-   return print_unlock_imbalance_bug(curr, lock, ip);
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
 
-found_it:
lockdep_init_map(lock, name, key, 0);
class = register_lock_class(lock, subclass, 0);
hlock->class_idx = class - lock_classes + 1;
@@ -3499,7 +3524,7 @@ static int
 __lock_release(struct lockdep_map *lock, int nested, unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
unsigned int depth;
int i;
 
@@ -3518,21 +3543,10 @@ __lock_release(struct lockdep_map *lock, int nested, 
unsigned long ip)
 * Check whether the lock exists in the current stack
 * of held locks:
 */
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
-   break;
-   if (match_held_lock(hlock, lock))
-   goto found_it;
-   prev_hlock = hlock;
-   }
-   return print_unlock_imbalance_bug(curr, lock, ip);
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
 
-found_it:
if (hlock->instance == lock)
lock_release_holdtime(hlock);
 
@@ -3894,7 +3908,7 @@ static void
 __lock_contended(struct lockdep_map *lock, unsigned long ip)
 {
struct task_struct *curr = current;
-   struct held_lock *hlock, *prev_hlock;
+   struct held_lock *hlock;
struct lock_class_stats *stats;
unsigned int depth;
int i, contention_point, contending_point;
@@ -3907,22 +3921,12 @@ __lock_contended(struct lockdep_map *lock, unsigned 
long ip)
if (DEBUG_LOCKS_WARN_ON(!depth))
return;
 
-   prev_hlock = NULL;
-   for (i = depth-1; i >= 0; i--) {
-   hlock = curr->held_locks + i;
-   /*
-* We must not cross into another context:
-*/
-   if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)

[PATCH 2/3] lockdep: consolidate by new validate_held_lock()

2017-02-02 Thread J. R. Okajima
A simple consolidataion. The behaviour should not change.

Signed-off-by: J. R. Okajima <hooanon...@gmail.com>
---
 kernel/locking/lockdep.c | 39 +--
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b7a2001..7dc8f8e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3464,6 +3464,23 @@ static struct held_lock *find_held_lock(struct 
task_struct *curr,
return ret;
 }
 
+static int validate_held_lock(struct task_struct *curr, unsigned int depth,
+ int idx)
+{
+   struct held_lock *hlock;
+
+   for (hlock = curr->held_locks + idx; idx < depth; idx++, hlock++)
+   if (!__lock_acquire(hlock->instance,
+   hlock_class(hlock)->subclass,
+   hlock->trylock,
+   hlock->read, hlock->check,
+   hlock->hardirqs_off,
+   hlock->nest_lock, hlock->acquire_ip,
+   hlock->references, hlock->pin_count))
+   return 1;
+   return 0;
+}
+
 static int
 __lock_set_class(struct lockdep_map *lock, const char *name,
 struct lock_class_key *key, unsigned int subclass,
@@ -3494,15 +3511,8 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
curr->lockdep_depth = i;
curr->curr_chain_key = hlock->prev_chain_key;
 
-   for (; i < depth; i++) {
-   hlock = curr->held_locks + i;
-   if (!__lock_acquire(hlock->instance,
-   hlock_class(hlock)->subclass, hlock->trylock,
-   hlock->read, hlock->check, hlock->hardirqs_off,
-   hlock->nest_lock, hlock->acquire_ip,
-   hlock->references, hlock->pin_count))
-   return 0;
-   }
+   if (validate_held_lock(curr, depth, i))
+   return 0;
 
/*
 * I took it apart and put it back together again, except now I have
@@ -3573,15 +3583,8 @@ __lock_release(struct lockdep_map *lock, int nested, 
unsigned long ip)
curr->lockdep_depth = i;
curr->curr_chain_key = hlock->prev_chain_key;
 
-   for (i++; i < depth; i++) {
-   hlock = curr->held_locks + i;
-   if (!__lock_acquire(hlock->instance,
-   hlock_class(hlock)->subclass, hlock->trylock,
-   hlock->read, hlock->check, hlock->hardirqs_off,
-   hlock->nest_lock, hlock->acquire_ip,
-   hlock->references, hlock->pin_count))
-   return 0;
-   }
+   if (validate_held_lock(curr, depth, i + 1))
+   return 0;
 
/*
 * We had N bottles of beer on the wall, we drank one, but now
-- 
2.1.4



[PATCH 2/3] lockdep: consolidate by new validate_held_lock()

2017-02-02 Thread J. R. Okajima
A simple consolidataion. The behaviour should not change.

Signed-off-by: J. R. Okajima 
---
 kernel/locking/lockdep.c | 39 +--
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b7a2001..7dc8f8e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3464,6 +3464,23 @@ static struct held_lock *find_held_lock(struct 
task_struct *curr,
return ret;
 }
 
+static int validate_held_lock(struct task_struct *curr, unsigned int depth,
+ int idx)
+{
+   struct held_lock *hlock;
+
+   for (hlock = curr->held_locks + idx; idx < depth; idx++, hlock++)
+   if (!__lock_acquire(hlock->instance,
+   hlock_class(hlock)->subclass,
+   hlock->trylock,
+   hlock->read, hlock->check,
+   hlock->hardirqs_off,
+   hlock->nest_lock, hlock->acquire_ip,
+   hlock->references, hlock->pin_count))
+   return 1;
+   return 0;
+}
+
 static int
 __lock_set_class(struct lockdep_map *lock, const char *name,
 struct lock_class_key *key, unsigned int subclass,
@@ -3494,15 +3511,8 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
curr->lockdep_depth = i;
curr->curr_chain_key = hlock->prev_chain_key;
 
-   for (; i < depth; i++) {
-   hlock = curr->held_locks + i;
-   if (!__lock_acquire(hlock->instance,
-   hlock_class(hlock)->subclass, hlock->trylock,
-   hlock->read, hlock->check, hlock->hardirqs_off,
-   hlock->nest_lock, hlock->acquire_ip,
-   hlock->references, hlock->pin_count))
-   return 0;
-   }
+   if (validate_held_lock(curr, depth, i))
+   return 0;
 
/*
 * I took it apart and put it back together again, except now I have
@@ -3573,15 +3583,8 @@ __lock_release(struct lockdep_map *lock, int nested, 
unsigned long ip)
curr->lockdep_depth = i;
curr->curr_chain_key = hlock->prev_chain_key;
 
-   for (i++; i < depth; i++) {
-   hlock = curr->held_locks + i;
-   if (!__lock_acquire(hlock->instance,
-   hlock_class(hlock)->subclass, hlock->trylock,
-   hlock->read, hlock->check, hlock->hardirqs_off,
-   hlock->nest_lock, hlock->acquire_ip,
-   hlock->references, hlock->pin_count))
-   return 0;
-   }
+   if (validate_held_lock(curr, depth, i + 1))
+   return 0;
 
/*
 * We had N bottles of beer on the wall, we drank one, but now
-- 
2.1.4



[PATCH 3/3] lockdep: new annotation lock_downgrade()

2017-02-02 Thread J. R. Okajima
The commit
f831948 2016-11-30 locking/lockdep: Provide a type check for 
lock_is_held
didn't fully support rwsem. Here downgrade_write() supports the added type.

Originally-written-by: Peter Zijlstra <pet...@infradead.org>
See-also: http://marc.info/?l=linux-kernel=148581164003149=2
Signed-off-by: J. R. Okajima <hooanon...@gmail.com>
---
 include/linux/lockdep.h  |  2 ++
 kernel/locking/lockdep.c | 55 
 kernel/locking/rwsem.c   |  6 ++
 3 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 0345cbf..22f304c 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_map 
*lock,
lock_set_class(lock, lock->name, lock->key, subclass, ip);
 }
 
+extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip);
+
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
 extern void lockdep_trace_alloc(gfp_t mask);
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7dc8f8e..6a4a740 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3523,6 +3523,44 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
return 1;
 }
 
+static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   struct task_struct *curr = current;
+   struct held_lock *hlock;
+   unsigned int depth;
+   int i;
+
+   depth = curr->lockdep_depth;
+   /*
+* This function is about (re)setting the class of a held lock,
+* yet we're not actually holding any locks. Naughty user!
+*/
+   if (DEBUG_LOCKS_WARN_ON(!depth))
+   return 0;
+
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
+
+   curr->lockdep_depth = i;
+   curr->curr_chain_key = hlock->prev_chain_key;
+
+   WARN(hlock->read, "downgrading a read lock");
+   hlock->read = 1;
+   hlock->acquire_ip = ip;
+
+   if (validate_held_lock(curr, depth, i))
+   return 0;
+
+   /*
+* I took it apart and put it back together again, except now I have
+* these 'spare' parts.. where shall I put them.
+*/
+   if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth))
+   return 0;
+   return 1;
+}
+
 /*
  * Remove the lock to the list of currently held locks - this gets
  * called on mutex_unlock()/spin_unlock*() (or on a failed
@@ -3749,6 +3787,23 @@ void lock_set_class(struct lockdep_map *lock, const char 
*name,
 }
 EXPORT_SYMBOL_GPL(lock_set_class);
 
+void lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   unsigned long flags;
+
+   if (unlikely(current->lockdep_recursion))
+   return;
+
+   raw_local_irq_save(flags);
+   current->lockdep_recursion = 1;
+   check_flags(flags);
+   if (__lock_downgrade(lock, ip))
+   check_chain_key(current);
+   current->lockdep_recursion = 0;
+   raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_downgrade);
+
 /*
  * We are not always called with irqs disabled - do that here,
  * and also avoid lockdep recursion:
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 45ba475..31db3ef 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -123,10 +123,8 @@ EXPORT_SYMBOL(up_write);
  */
 void downgrade_write(struct rw_semaphore *sem)
 {
-   /*
-* lockdep: a downgraded write will live on as a write
-* dependency.
-*/
+   lock_downgrade(>dep_map, _RET_IP_);
+
rwsem_set_reader_owned(sem);
__downgrade_write(sem);
 }
-- 
2.1.4



[PATCH 3/3] lockdep: new annotation lock_downgrade()

2017-02-02 Thread J. R. Okajima
The commit
f831948 2016-11-30 locking/lockdep: Provide a type check for 
lock_is_held
didn't fully support rwsem. Here downgrade_write() supports the added type.

Originally-written-by: Peter Zijlstra 
See-also: http://marc.info/?l=linux-kernel=148581164003149=2
Signed-off-by: J. R. Okajima 
---
 include/linux/lockdep.h  |  2 ++
 kernel/locking/lockdep.c | 55 
 kernel/locking/rwsem.c   |  6 ++
 3 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 0345cbf..22f304c 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -361,6 +361,8 @@ static inline void lock_set_subclass(struct lockdep_map 
*lock,
lock_set_class(lock, lock->name, lock->key, subclass, ip);
 }
 
+extern void lock_downgrade(struct lockdep_map *lock, unsigned long ip);
+
 extern void lockdep_set_current_reclaim_state(gfp_t gfp_mask);
 extern void lockdep_clear_current_reclaim_state(void);
 extern void lockdep_trace_alloc(gfp_t mask);
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7dc8f8e..6a4a740 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3523,6 +3523,44 @@ __lock_set_class(struct lockdep_map *lock, const char 
*name,
return 1;
 }
 
+static int __lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   struct task_struct *curr = current;
+   struct held_lock *hlock;
+   unsigned int depth;
+   int i;
+
+   depth = curr->lockdep_depth;
+   /*
+* This function is about (re)setting the class of a held lock,
+* yet we're not actually holding any locks. Naughty user!
+*/
+   if (DEBUG_LOCKS_WARN_ON(!depth))
+   return 0;
+
+   hlock = find_held_lock(curr, lock, depth, );
+   if (!hlock)
+   return print_unlock_imbalance_bug(curr, lock, ip);
+
+   curr->lockdep_depth = i;
+   curr->curr_chain_key = hlock->prev_chain_key;
+
+   WARN(hlock->read, "downgrading a read lock");
+   hlock->read = 1;
+   hlock->acquire_ip = ip;
+
+   if (validate_held_lock(curr, depth, i))
+   return 0;
+
+   /*
+* I took it apart and put it back together again, except now I have
+* these 'spare' parts.. where shall I put them.
+*/
+   if (DEBUG_LOCKS_WARN_ON(curr->lockdep_depth != depth))
+   return 0;
+   return 1;
+}
+
 /*
  * Remove the lock to the list of currently held locks - this gets
  * called on mutex_unlock()/spin_unlock*() (or on a failed
@@ -3749,6 +3787,23 @@ void lock_set_class(struct lockdep_map *lock, const char 
*name,
 }
 EXPORT_SYMBOL_GPL(lock_set_class);
 
+void lock_downgrade(struct lockdep_map *lock, unsigned long ip)
+{
+   unsigned long flags;
+
+   if (unlikely(current->lockdep_recursion))
+   return;
+
+   raw_local_irq_save(flags);
+   current->lockdep_recursion = 1;
+   check_flags(flags);
+   if (__lock_downgrade(lock, ip))
+   check_chain_key(current);
+   current->lockdep_recursion = 0;
+   raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_downgrade);
+
 /*
  * We are not always called with irqs disabled - do that here,
  * and also avoid lockdep recursion:
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 45ba475..31db3ef 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -123,10 +123,8 @@ EXPORT_SYMBOL(up_write);
  */
 void downgrade_write(struct rw_semaphore *sem)
 {
-   /*
-* lockdep: a downgraded write will live on as a write
-* dependency.
-*/
+   lock_downgrade(>dep_map, _RET_IP_);
+
rwsem_set_reader_owned(sem);
__downgrade_write(sem);
 }
-- 
2.1.4



Re: Q: lockdep_assert_held_read() after downgrade_write()

2017-02-02 Thread J. R. Okajima
Peter Zijlstra:
> Does something like the below work better? The annotation in
> downgrade_write() would look something like:
>
> + lock_downgrade(>dep_map, 1, _RET_IP_);
>
> Not even compile tested and lacks the !LOCKDEP build bits.

Thanks for the patch.
It seems working expectedly. I began writing a similar patch locally
with minor consolidations by adding a new function or two. I will send a
patch series. Please review and merge them into v4.10. If you don't like
the patch, especially the new function name, feel free to change it.
I don't know whether !LOCKDEP build bits are necessary or not.


J. R. Okajima


Re: Q: lockdep_assert_held_read() after downgrade_write()

2017-02-02 Thread J. R. Okajima
Peter Zijlstra:
> Does something like the below work better? The annotation in
> downgrade_write() would look something like:
>
> + lock_downgrade(>dep_map, 1, _RET_IP_);
>
> Not even compile tested and lacks the !LOCKDEP build bits.

Thanks for the patch.
It seems working expectedly. I began writing a similar patch locally
with minor consolidations by adding a new function or two. I will send a
patch series. Please review and merge them into v4.10. If you don't like
the patch, especially the new function name, feel free to change it.
I don't know whether !LOCKDEP build bits are necessary or not.


J. R. Okajima


Re: Q: lockdep_assert_held_read() after downgrade_write()

2017-01-31 Thread J. R. Okajima
Jens Axboe:
> I don't think you understand how it works. downgrade_write() turns a write
> lock into read held. To make that last sequence valid, you'd need:
>
>   down_write();
>   downgrade_write();
>   lockdep_assert_held_read()
>   up_read();
>
> or just not drop up_write() from the last section.

Arg...
It is my bonehead mistake that I inserted up_write() before
downgrade_write(). Sorry about that.
Fortunately Peter Zijlstra reviewed downgrade_write() and sent a
patch. Thank you, it passed my first test.

Now allow me going on the second test (based upon Peter's patch)

- two rwsem, rwA and rwB.
- the locking order is rwA first, and then rwB.
- good case
  down_read(rwA)
  down_read(rwB)
  up_read(rwB)
  up_read(rwA)

  down_write(rwA)
  down_write(rwB)
  up_write(rwB)
  up_write(rwA)

- questionable case
  down_write(rwA)
  down_write(rwB)
  downgrade_write(rwA)
  downgrade_write(rwB)
  up_read(rwB)
  up_read(rwA)

These two downgrade_write() have their strict order? If so, what is
that?
Do the added two lines
+   rwsem_release(>dep_map, 1, _RET_IP_);
+   rwsem_acquire_read(>dep_map, 0, 0, _RET_IP_);
produce a traditional AB-BA deadlock warning, don't they?


J. R. Okajima


Re: Q: lockdep_assert_held_read() after downgrade_write()

2017-01-31 Thread J. R. Okajima
Jens Axboe:
> I don't think you understand how it works. downgrade_write() turns a write
> lock into read held. To make that last sequence valid, you'd need:
>
>   down_write();
>   downgrade_write();
>   lockdep_assert_held_read()
>   up_read();
>
> or just not drop up_write() from the last section.

Arg...
It is my bonehead mistake that I inserted up_write() before
downgrade_write(). Sorry about that.
Fortunately Peter Zijlstra reviewed downgrade_write() and sent a
patch. Thank you, it passed my first test.

Now allow me going on the second test (based upon Peter's patch)

- two rwsem, rwA and rwB.
- the locking order is rwA first, and then rwB.
- good case
  down_read(rwA)
  down_read(rwB)
  up_read(rwB)
  up_read(rwA)

  down_write(rwA)
  down_write(rwB)
  up_write(rwB)
  up_write(rwA)

- questionable case
  down_write(rwA)
  down_write(rwB)
  downgrade_write(rwA)
  downgrade_write(rwB)
  up_read(rwB)
  up_read(rwA)

These two downgrade_write() have their strict order? If so, what is
that?
Do the added two lines
+   rwsem_release(>dep_map, 1, _RET_IP_);
+   rwsem_acquire_read(>dep_map, 0, 0, _RET_IP_);
produce a traditional AB-BA deadlock warning, don't they?


J. R. Okajima


Q: lockdep_assert_held_read() after downgrade_write()

2017-01-30 Thread J. R. Okajima
Peter Zijlstra,

May I ask you a question?
v4.10-rc1 got a commit
f831948 2016-11-30 locking/lockdep: Provide a type check for 
lock_is_held
I've tested a little and lockdep splat a stack trace.

{
DECLARE_RWSEM(rw);
static struct lock_class_key key;
lockdep_set_class(, );

down_read();
lockdep_assert_held_read();
up_read();

down_write();
lockdep_assert_held_exclusive();
up_write();

downgrade_write();
lockdep_assert_held_read();  <-- here
up_read();
}

I was expecting that lockdep_assert_held_read() splat nothing after
downgrade_write(). Is this warning an intentional behaviour?

Also the final up_read() gives me a warning too. It is produced at
lockdep.c:3514:lock_release(): DEBUG_LOCKS_WARN_ON(depth <= 0)

As an additional information, I increased some lockdep constants.
Do you think this is related?

include/linux/lockdep.h
+#define MAX_LOCKDEP_SUBCLASSES (8UL + 4)
+#define MAX_LOCKDEP_KEYS_BITS  (13 + 3)

kernel/locking/lockdep_internals.h
+#define MAX_LOCKDEP_ENTRIES(32768UL << 5)
+#define MAX_LOCKDEP_CHAINS_BITS(16 + 5)
+#define MAX_STACK_TRACE_ENTRIES(524288UL << 5)


J. R. Okajima


Q: lockdep_assert_held_read() after downgrade_write()

2017-01-30 Thread J. R. Okajima
Peter Zijlstra,

May I ask you a question?
v4.10-rc1 got a commit
f831948 2016-11-30 locking/lockdep: Provide a type check for 
lock_is_held
I've tested a little and lockdep splat a stack trace.

{
DECLARE_RWSEM(rw);
static struct lock_class_key key;
lockdep_set_class(, );

down_read();
lockdep_assert_held_read();
up_read();

down_write();
lockdep_assert_held_exclusive();
up_write();

downgrade_write();
lockdep_assert_held_read();  <-- here
up_read();
}

I was expecting that lockdep_assert_held_read() splat nothing after
downgrade_write(). Is this warning an intentional behaviour?

Also the final up_read() gives me a warning too. It is produced at
lockdep.c:3514:lock_release(): DEBUG_LOCKS_WARN_ON(depth <= 0)

As an additional information, I increased some lockdep constants.
Do you think this is related?

include/linux/lockdep.h
+#define MAX_LOCKDEP_SUBCLASSES (8UL + 4)
+#define MAX_LOCKDEP_KEYS_BITS  (13 + 3)

kernel/locking/lockdep_internals.h
+#define MAX_LOCKDEP_ENTRIES(32768UL << 5)
+#define MAX_LOCKDEP_CHAINS_BITS(16 + 5)
+#define MAX_STACK_TRACE_ENTRIES(524288UL << 5)


J. R. Okajima


Re: [PATCH v4] add u64 number parser

2017-01-30 Thread J. R. Okajima
Hello James,

Your commit
a317178 2016-12-06 parser: add u64 number parser
was merged into v4.10-rc1. Good.
I have a very similar function and used for a long time. Here I'd
suggest you a tiny optimization based on my version.

If you think another value is more apropriate as the size of an internal
array, you can change it freely.


J. R. Okajima

commit 030361e78d327d9d89254dc3b320c092221c7cd0
Author: J. R. Okajima <hooanon...@gmail.com>
Date:   Tue Jan 31 00:01:16 2017 +0900

parser: match_u64, short string optimization
 =

When the given string is short enough, we can skip kmalloc/kfree.

Signed-off-by: J. R. Okajima <hooanon...@gmail.com>

diff --git a/lib/parser.c b/lib/parser.c
index 3278958..cfbc6ec 100644
--- a/lib/parser.c
+++ b/lib/parser.c
@@ -163,21 +163,25 @@ static int match_number(substring_t *s, int *result,=
 int base)
  */
 static int match_u64int(substring_t *s, u64 *result, int base)
 {
-   char *buf;
+   char *buf, a[32];
int ret;
u64 val;
size_t len =3D s->to - s->from;
 =

-   buf =3D kmalloc(len + 1, GFP_KERNEL);
-   if (!buf)
-   return -ENOMEM;
+   buf =3D a;
+   if (len + 1 > sizeof(a)) {
+   buf =3D kmalloc(len + 1, GFP_KERNEL);
+   if (unlikely(!buf))
+   return -ENOMEM;
+   }
memcpy(buf, s->from, len);
buf[len] =3D '\0';
 =

ret =3D kstrtoull(buf, base, );
if (!ret)
*result =3D val;
-   kfree(buf);
+   if (buf !=3D a)
+   kfree(buf);
return ret;
 }
 =



Re: [PATCH v4] add u64 number parser

2017-01-30 Thread J. R. Okajima
Hello James,

Your commit
a317178 2016-12-06 parser: add u64 number parser
was merged into v4.10-rc1. Good.
I have a very similar function and used for a long time. Here I'd
suggest you a tiny optimization based on my version.

If you think another value is more apropriate as the size of an internal
array, you can change it freely.


J. R. Okajima

commit 030361e78d327d9d89254dc3b320c092221c7cd0
Author: J. R. Okajima 
Date:   Tue Jan 31 00:01:16 2017 +0900

parser: match_u64, short string optimization
 =

When the given string is short enough, we can skip kmalloc/kfree.

Signed-off-by: J. R. Okajima 

diff --git a/lib/parser.c b/lib/parser.c
index 3278958..cfbc6ec 100644
--- a/lib/parser.c
+++ b/lib/parser.c
@@ -163,21 +163,25 @@ static int match_number(substring_t *s, int *result,=
 int base)
  */
 static int match_u64int(substring_t *s, u64 *result, int base)
 {
-   char *buf;
+   char *buf, a[32];
int ret;
u64 val;
size_t len =3D s->to - s->from;
 =

-   buf =3D kmalloc(len + 1, GFP_KERNEL);
-   if (!buf)
-   return -ENOMEM;
+   buf =3D a;
+   if (len + 1 > sizeof(a)) {
+   buf =3D kmalloc(len + 1, GFP_KERNEL);
+   if (unlikely(!buf))
+   return -ENOMEM;
+   }
memcpy(buf, s->from, len);
buf[len] =3D '\0';
 =

ret =3D kstrtoull(buf, base, );
if (!ret)
*result =3D val;
-   kfree(buf);
+   if (buf !=3D a)
+   kfree(buf);
return ret;
 }
 =



Re: v4.3-rc2, fault in sock_release via nfs_put_client

2015-09-27 Thread J. R. Okajima

Trond Myklebust:
> On Sun, Sep 27, 2015 at 8:47 AM, J. R. Okajima  wrote:
> > I don't know whether this is a known issue or not (since I was off from
> > development for a few months), but I've got a "general protection fault:
> > " message from linux-4.3-rc2.
:::
> There is a fix for this issue in v4.3-rc3. Please see
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=79234c3db6842a3de03817211d891e0c2878f756

Thank you.
Confirmed the problem is gone.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


v4.3-rc2, fault in sock_release via nfs_put_client

2015-09-27 Thread J. R. Okajima

Hello NFS folks,

I don't know whether this is a known issue or not (since I was off from
development for a few months), but I've got a "general protection fault:
" message from linux-4.3-rc2.
Here are the reproducible script and the log.
Would you check them please?

Notes:
- The script cannot reproduce the problem perfectly. Not always. But if
  you try several times, you will able to see. On my test system, I had
  to try a few times.
- As far as I know, if bind-mount did not happen, then the problem did
  not happen either. But I am not sure whether bind-mount is really the
  trigger of this issue.


J. R. Okajima

--
#!/bin/sh

set -eu

Stat() # path
{
stat -f --printf="%n %T\n" $1
}

uname -a
s=/dev/shm
Stat $s
c=/tmp/c
mkdir -p $c
showmount -e > /tmp/e
sudo exportfs -i -o rw,async,no_subtree_check,no_root_squash,fsid=99 
localhost:$s
showmount -e | diff /tmp/e - || :
sudo mount -t nfs localhost:$s $c
Stat $c
b=/tmp/b
mkdir -p $b
sudo mount -o bind $c $b
Stat $b
> $b/f
cat /proc/mounts > /tmp/m
sudo umount -l $b
diff /tmp/m /proc/mounts || :
sync
sleep 1
cat /proc/mounts > /tmp/m
sudo umount -l $c
diff /tmp/m /proc/mounts
sudo exportfs -u localhost:$s
showmount -e | diff /tmp/e -
--

$ sh -x ./nfs-4.3-rc2.sh
+ set -eu
+ s=/dev/shm
+ Stat /dev/shm
+ stat -f --printf=%n %T\n /dev/shm
/dev/shm tmpfs
+ c=/tmp/c
+ mkdir -p /tmp/c
+ showmount -e
+ sudo exportfs -i -o rw,async,no_subtree_check,no_root_squash,fsid=99 
localhost:/dev/shm
+ showmount -e
+ diff /tmp/e -
3a4
> /dev/shm localhost
+ :
+ sudo mount -t nfs localhost:/dev/shm /tmp/c
+ Stat /tmp/c
+ stat -f --printf=%n %T\n /tmp/c
/tmp/c nfs
+ b=/tmp/b
+ mkdir -p /tmp/b
+ sudo mount -o bind /tmp/c /tmp/b
+ Stat /tmp/b
+ stat -f --printf=%n %T\n /tmp/b
/tmp/b nfs
+
+ cat /proc/mounts
+ sudo umount -l /tmp/b
+ diff /tmp/m /proc/mounts
26d25
< localhost:/dev/shm /tmp/b nfs4 
rw,relatime,vers=4.0,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1
 0 0
+ :
+ sync
+ sleep 1
+ cat /proc/mounts
+ sudo umount -l /tmp/c
general protection fault:  [#7] PREEMPT SMP
Modules linked in: oprofile configs autofs4 nfsd [last unloaded: brd]
CPU: 0 PID: 4325 Comm: umount.nfs Tainted: G  D 4.3.0-rc2aufsD+ #67
Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009
task: 88002d69ea00 ti: 88002ca0 task.ti: 88002ca0
RIP: 0010:[]  [] sock_release+0x21/0x90
RSP: 0018:88002ca03bd8  EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 88002812f7c0 RCX: 
RDX:  RSI:  RDI: 88002812f7c0
RBP: 88002ca03be8 R08:  R09: 
R10:  R11: 0001 R12: 
R13: 88002812f7c0 R14: 88002b8eeea0 R15: 0001
FS:  7f763d4f57e0() GS:88002fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 01176db8 CR3: 2d76c000 CR4: 000406f0
Stack:
 88002b87  88002ca03c18 81715afc
 88002b87  88002b87 88002ca03d08
 88002ca03c38 81715bd6  88002b87
Call Trace:
 [] xs_reset_transport+0x18c/0x250
 [] xs_close+0x16/0x30
 [] xs_destroy+0x16/0x30
 [] xprt_destroy+0x6f/0x80
 [] xprt_put+0x14/0x20
 [] rpc_free_client+0x84/0xc0
 [] rpc_release_client+0x5a/0x90
 [] rpc_shutdown_client+0xf1/0x100
 [] nfs_free_client+0x97/0xa0
 [] nfs4_free_client+0xa4/0xc0
 [] nfs_put_client+0x295/0x430
 [] ? nfs4_alloc_client+0x380/0x380
 [] nfs_free_server+0x7c/0xd0
 [] nfs_kill_super+0x2c/0x40
 [] deactivate_locked_super+0x51/0x90
 [] deactivate_super+0x84/0x90
 [] cleanup_mnt+0x97/0xe0
 [] __cleanup_mnt+0x12/0x20
 [] task_work_run+0x72/0xa0
 [] prepare_exit_to_usermode+0x10b/0x150
 [] ? mntput_no_expire+0x5/0x2c0
 [] syscall_return_slowpath+0x96/0x2f0
 [] int_ret_from_sys_call+0x25/0x9f
Code: 5b 5d c3 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 48 83 ec 10 48 
89 5d f0 48 89 fb 4c 89 65 f8 48 8b 47 28 48 85 c0 74 17 <4c> 8b 60 08 ff 50 10 
48 c7 43 28 00 00 00 00 4c 89 e7 e8 b8 7d 
RIP  [] sock_release+0x21/0x90
 RSP 
---[ end trace 1bcdd4036690d082 ]---
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
$
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: v4.3-rc2, fault in sock_release via nfs_put_client

2015-09-27 Thread J. R. Okajima

Trond Myklebust:
> On Sun, Sep 27, 2015 at 8:47 AM, J. R. Okajima <hooanon...@gmail.com> wrote:
> > I don't know whether this is a known issue or not (since I was off from
> > development for a few months), but I've got a "general protection fault:
> > " message from linux-4.3-rc2.
:::
> There is a fix for this issue in v4.3-rc3. Please see
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=79234c3db6842a3de03817211d891e0c2878f756

Thank you.
Confirmed the problem is gone.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


v4.3-rc2, fault in sock_release via nfs_put_client

2015-09-27 Thread J. R. Okajima

Hello NFS folks,

I don't know whether this is a known issue or not (since I was off from
development for a few months), but I've got a "general protection fault:
" message from linux-4.3-rc2.
Here are the reproducible script and the log.
Would you check them please?

Notes:
- The script cannot reproduce the problem perfectly. Not always. But if
  you try several times, you will able to see. On my test system, I had
  to try a few times.
- As far as I know, if bind-mount did not happen, then the problem did
  not happen either. But I am not sure whether bind-mount is really the
  trigger of this issue.


J. R. Okajima

--
#!/bin/sh

set -eu

Stat() # path
{
stat -f --printf="%n %T\n" $1
}

uname -a
s=/dev/shm
Stat $s
c=/tmp/c
mkdir -p $c
showmount -e > /tmp/e
sudo exportfs -i -o rw,async,no_subtree_check,no_root_squash,fsid=99 
localhost:$s
showmount -e | diff /tmp/e - || :
sudo mount -t nfs localhost:$s $c
Stat $c
b=/tmp/b
mkdir -p $b
sudo mount -o bind $c $b
Stat $b
> $b/f
cat /proc/mounts > /tmp/m
sudo umount -l $b
diff /tmp/m /proc/mounts || :
sync
sleep 1
cat /proc/mounts > /tmp/m
sudo umount -l $c
diff /tmp/m /proc/mounts
sudo exportfs -u localhost:$s
showmount -e | diff /tmp/e -
--

$ sh -x ./nfs-4.3-rc2.sh
+ set -eu
+ s=/dev/shm
+ Stat /dev/shm
+ stat -f --printf=%n %T\n /dev/shm
/dev/shm tmpfs
+ c=/tmp/c
+ mkdir -p /tmp/c
+ showmount -e
+ sudo exportfs -i -o rw,async,no_subtree_check,no_root_squash,fsid=99 
localhost:/dev/shm
+ showmount -e
+ diff /tmp/e -
3a4
> /dev/shm localhost
+ :
+ sudo mount -t nfs localhost:/dev/shm /tmp/c
+ Stat /tmp/c
+ stat -f --printf=%n %T\n /tmp/c
/tmp/c nfs
+ b=/tmp/b
+ mkdir -p /tmp/b
+ sudo mount -o bind /tmp/c /tmp/b
+ Stat /tmp/b
+ stat -f --printf=%n %T\n /tmp/b
/tmp/b nfs
+
+ cat /proc/mounts
+ sudo umount -l /tmp/b
+ diff /tmp/m /proc/mounts
26d25
< localhost:/dev/shm /tmp/b nfs4 
rw,relatime,vers=4.0,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1
 0 0
+ :
+ sync
+ sleep 1
+ cat /proc/mounts
+ sudo umount -l /tmp/c
general protection fault:  [#7] PREEMPT SMP
Modules linked in: oprofile configs autofs4 nfsd [last unloaded: brd]
CPU: 0 PID: 4325 Comm: umount.nfs Tainted: G  D 4.3.0-rc2aufsD+ #67
Hardware name: Pegatron Pegatron/IPM41, BIOS 0001 02/05/2009
task: 88002d69ea00 ti: 88002ca0 task.ti: 88002ca0
RIP: 0010:[]  [] sock_release+0x21/0x90
RSP: 0018:88002ca03bd8  EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 88002812f7c0 RCX: 
RDX:  RSI:  RDI: 88002812f7c0
RBP: 88002ca03be8 R08:  R09: 
R10:  R11: 0001 R12: 
R13: 88002812f7c0 R14: 88002b8eeea0 R15: 0001
FS:  7f763d4f57e0() GS:88002fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 01176db8 CR3: 2d76c000 CR4: 000406f0
Stack:
 88002b87  88002ca03c18 81715afc
 88002b87  88002b87 88002ca03d08
 88002ca03c38 81715bd6  88002b87
Call Trace:
 [] xs_reset_transport+0x18c/0x250
 [] xs_close+0x16/0x30
 [] xs_destroy+0x16/0x30
 [] xprt_destroy+0x6f/0x80
 [] xprt_put+0x14/0x20
 [] rpc_free_client+0x84/0xc0
 [] rpc_release_client+0x5a/0x90
 [] rpc_shutdown_client+0xf1/0x100
 [] nfs_free_client+0x97/0xa0
 [] nfs4_free_client+0xa4/0xc0
 [] nfs_put_client+0x295/0x430
 [] ? nfs4_alloc_client+0x380/0x380
 [] nfs_free_server+0x7c/0xd0
 [] nfs_kill_super+0x2c/0x40
 [] deactivate_locked_super+0x51/0x90
 [] deactivate_super+0x84/0x90
 [] cleanup_mnt+0x97/0xe0
 [] __cleanup_mnt+0x12/0x20
 [] task_work_run+0x72/0xa0
 [] prepare_exit_to_usermode+0x10b/0x150
 [] ? mntput_no_expire+0x5/0x2c0
 [] syscall_return_slowpath+0x96/0x2f0
 [] int_ret_from_sys_call+0x25/0x9f
Code: 5b 5d c3 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 48 83 ec 10 48 
89 5d f0 48 89 fb 4c 89 65 f8 48 8b 47 28 48 85 c0 74 17 <4c> 8b 60 08 ff 50 10 
48 c7 43 28 00 00 00 00 4c 89 e7 e8 b8 7d 
RIP  [] sock_release+0x21/0x90
 RSP 
---[ end trace 1bcdd4036690d082 ]---
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
umount.nfs: /tmp/c: not mounted
$
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fs: create and use seq_show_option for escaping

2015-08-08 Thread J. R. Okajima

Kees Cook:
> This fixes the problem by adding new seq_show_option and seq_show_option_n
> helpers, and updating the vulnerable show_option handlers to use them as
> needed. Some, like SELinux, need to be open coded due to unusual existing
> escape mechanisms.

How about other ctrl chars such as CR or FF?
I am using the similar function for many years, and it might be more
generic because it supports all cntrl chars other than "\t\n\\" (see
below).

Many of other ctrl chars may not be necessary. But some people uses
non-ASCII chars for their pathnames which may contain ESC or other
chars. Any crazy chars can corrupt the output of /proc/mount and
others. So it might be better to consider all ctrl chars.

--
static char au_esc_chars[0x20 + 3]; /* 0x01-0x20, backslash, del, and NULL */

int au_seq_path(struct seq_file *seq, struct path *path)
{
return seq_path(seq, path, au_esc_chars);
}

module_init(void)
{
:::
p = au_esc_chars;
for (i = 1; i <= ' '; i++)
*p++ = i;
*p++ = '\\';
*p++ = '\x7f';
    *p = 0;
:::
}


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fs: create and use seq_show_option for escaping

2015-08-08 Thread J. R. Okajima

Kees Cook:
 This fixes the problem by adding new seq_show_option and seq_show_option_n
 helpers, and updating the vulnerable show_option handlers to use them as
 needed. Some, like SELinux, need to be open coded due to unusual existing
 escape mechanisms.

How about other ctrl chars such as CR or FF?
I am using the similar function for many years, and it might be more
generic because it supports all cntrl chars other than \t\n\\ (see
below).

Many of other ctrl chars may not be necessary. But some people uses
non-ASCII chars for their pathnames which may contain ESC or other
chars. Any crazy chars can corrupt the output of /proc/mount and
others. So it might be better to consider all ctrl chars.

--
static char au_esc_chars[0x20 + 3]; /* 0x01-0x20, backslash, del, and NULL */

int au_seq_path(struct seq_file *seq, struct path *path)
{
return seq_path(seq, path, au_esc_chars);
}

module_init(void)
{
:::
p = au_esc_chars;
for (i = 1; i = ' '; i++)
*p++ = i;
*p++ = '\\';
*p++ = '\x7f';
*p = 0;
:::
}


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Repercussions of overflow in get_next_ino()

2015-05-07 Thread J. R. Okajima
Hello,

Nikolay Borisov:
> My question is what are the repercussions of get_next_ino overflowing 
> and at some point having possibly multiple inodes on my system with the 
> same i_ino id? And why is it safe to have the inode id's overflow and 
> wrap around?

I am afraid some applications won't work correctly.
As far as I know, ls(1) and find(1) don't show the file whose inum is
zero.

See also
Subject: [PATCH v2] vfs: get_next_ino(), never inum=0
Date: 2014-05-28 14:06:32
http://marc.info/?l=linux-fsdevel=140128600801771=2
and their thread.

For tmpfs, I have another patch. Just FYI, here attached.


J. R. Okajima



a.patch.bz2
Description: BZip2 compressed data


Re: Repercussions of overflow in get_next_ino()

2015-05-07 Thread J. R. Okajima
Hello,

Nikolay Borisov:
 My question is what are the repercussions of get_next_ino overflowing 
 and at some point having possibly multiple inodes on my system with the 
 same i_ino id? And why is it safe to have the inode id's overflow and 
 wrap around?

I am afraid some applications won't work correctly.
As far as I know, ls(1) and find(1) don't show the file whose inum is
zero.

See also
Subject: [PATCH v2] vfs: get_next_ino(), never inum=0
Date: 2014-05-28 14:06:32
http://marc.info/?l=linux-fsdevelm=140128600801771w=2
and their thread.

For tmpfs, I have another patch. Just FYI, here attached.


J. R. Okajima



a.patch.bz2
Description: BZip2 compressed data


Re: [PULL for 3.18] overlay filesystem v24

2014-09-30 Thread J. R. Okajima

Miklos Szeredi:
> It would be good to have an inode based union as well.  I suggest (as
> I suggested many times) to try slimming aufs to a bare minimum and
> submit that.  It may be easier to just start from scratch instead
> trying to drop features from the existing codebase.   I'd be happy to
> review and generally help with such an effort.

??
In 2009, I have posted the "feature-reduced" version of aufs2 as
following your advice. Also a git-branch called "aufs2-tmp-ro" was
created whose size was a half of aufs2's. Have you ever posted your
review comments about them?

It was 2005 when I got an idea of aufs. Its history is almost a decade.
In my current local aufs3 develpment branch, I got about 350
commits. And 680 and 1120 commits for aufs2 and aufs1 individually.

aufs3-linux.git#aufs3.x-rcN/25lktr$ git log1 --no-merges fs/aufs/*.c | wc -l
348
aufs2-2.6.git#aufs2.2-stdalone-38-lktr$ git log1 --no-merges fs/aufs/*.c | wc -l
685
aufs1.git#master$ git log1 --no-merges aufs/fs/aufs/*.c | wc -l
1126

Of cource it won't be 2000 commits to create the patch series you
wrote. I don't know how large the number will be but I am sure this is
incredibly tough work both in commiting and reviewing. Is it still
helpful and really meaningful for you to review? Don't you think it will
be easier to review the resulted code?
For example, this branch was created for linux-3.9 when Al Viro wrote
about the dir mutex lock in copy-up. It has only 20 commits.

aufs3-linux.git#mainline-v3.9-rc8-20130428$ git log1 --no-merges fs/aufs/*.c | 
wc -l
20


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PULL for 3.18] overlay filesystem v24

2014-09-30 Thread J. R. Okajima

Miklos Szeredi:
 It would be good to have an inode based union as well.  I suggest (as
 I suggested many times) to try slimming aufs to a bare minimum and
 submit that.  It may be easier to just start from scratch instead
 trying to drop features from the existing codebase.   I'd be happy to
 review and generally help with such an effort.

??
In 2009, I have posted the feature-reduced version of aufs2 as
following your advice. Also a git-branch called aufs2-tmp-ro was
created whose size was a half of aufs2's. Have you ever posted your
review comments about them?

It was 2005 when I got an idea of aufs. Its history is almost a decade.
In my current local aufs3 develpment branch, I got about 350
commits. And 680 and 1120 commits for aufs2 and aufs1 individually.

aufs3-linux.git#aufs3.x-rcN/25lktr$ git log1 --no-merges fs/aufs/*.c | wc -l
348
aufs2-2.6.git#aufs2.2-stdalone-38-lktr$ git log1 --no-merges fs/aufs/*.c | wc -l
685
aufs1.git#master$ git log1 --no-merges aufs/fs/aufs/*.c | wc -l
1126

Of cource it won't be 2000 commits to create the patch series you
wrote. I don't know how large the number will be but I am sure this is
incredibly tough work both in commiting and reviewing. Is it still
helpful and really meaningful for you to review? Don't you think it will
be easier to review the resulted code?
For example, this branch was created for linux-3.9 when Al Viro wrote
about the dir mutex lock in copy-up. It has only 20 commits.

aufs3-linux.git#mainline-v3.9-rc8-20130428$ git log1 --no-merges fs/aufs/*.c | 
wc -l
20


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PULL for 3.18] overlay filesystem v24

2014-09-29 Thread J. R. Okajima

David Howells:
> Miklos Szeredi  wrote:
>
> > I'd like to propose overlayfs for inclusion into 3.18.
> > 
> > Al, would you mind giving it a review?
> > 
> > Git tree is here:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git 
> > overlayfs.current
>
> Tested-by: David Howells 

Does it mean overlayfs passed all your unionmount-testsuite? And does
the test suite contain tests for "inode-based" union? For example,
- read(2) may get the obsoleted filedata (fstat(2) for metadata too).
- fcntl(F_SETLK) may be broken by copy-up.
- inotify may not work when it refers to the file before being
  copied-up.
- unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after
  open(O_RDWR).
- exporting via NFS and fhandle systemcalls will not work.

A few releases ago, OFD file-lock was introduced to improve the
behaviour of POSIX lock. POSIX lock has made users confused and I am
afraid that the similar story will come up because of the "name-based"
union behaviour. Of course the story is not limited to the file-lock.

If I remember correctly, are you the one who consitunes the development
of UnionMount? Is the development totally stopped?
Next paragraph is what I wrote several times.
AUFS is an "inode-based" stackable filesystem and solved them many years
ago. But I have to admit that AUFS is big. Yes it is grown up.
I don't stop including overlayfs into mainline, but if the development
of UnionMount is really stopped, then I'd ask people to consider merging
aufs as well as overlayfs.

http://aufs.sf.net

J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PULL for 3.18] overlay filesystem v24

2014-09-29 Thread J. R. Okajima

David Howells:
 Miklos Szeredi mik...@szeredi.hu wrote:

  I'd like to propose overlayfs for inclusion into 3.18.
  
  Al, would you mind giving it a review?
  
  Git tree is here:
  
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git 
  overlayfs.current

 Tested-by: David Howells dhowe...@redhat.com

Does it mean overlayfs passed all your unionmount-testsuite? And does
the test suite contain tests for inode-based union? For example,
- read(2) may get the obsoleted filedata (fstat(2) for metadata too).
- fcntl(F_SETLK) may be broken by copy-up.
- inotify may not work when it refers to the file before being
  copied-up.
- unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after
  open(O_RDWR).
- exporting via NFS and fhandle systemcalls will not work.

A few releases ago, OFD file-lock was introduced to improve the
behaviour of POSIX lock. POSIX lock has made users confused and I am
afraid that the similar story will come up because of the name-based
union behaviour. Of course the story is not limited to the file-lock.

If I remember correctly, are you the one who consitunes the development
of UnionMount? Is the development totally stopped?
Next paragraph is what I wrote several times.
AUFS is an inode-based stackable filesystem and solved them many years
ago. But I have to admit that AUFS is big. Yes it is grown up.
I don't stop including overlayfs into mainline, but if the development
of UnionMount is really stopped, then I'd ask people to consider merging
aufs as well as overlayfs.

http://aufs.sf.net

J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.15 .. and continuation of merge window

2014-06-11 Thread J. R. Okajima

Al Viro:
> The former guarantees that the address we are doing trylock on would be that
> of a live dentry.  The latter makes sure that anything assigned to 
> dentry->d_parent after we drop ->d_lock will not be freed until we drop
> rcu_read_lock.

Although I don't know how to reproduce the problem for the latter case,
I think the patch is good.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.15 .. and continuation of merge window

2014-06-11 Thread J. R. Okajima

Al Viro:
> So I suspect that the right fix is a bit trickier - in addition to check
> on the fast path (i.e. when trylock gets us the lock on parent), we need
> to
>   * get rcu_read_lock() before dropping ->d_lock.
>   * check if dentry is already doomed right after taking rcu_read_lock();
> if not, any value we might see in ->d_parent afterwards will point to object
> not freed until we drop rcu_read_lock.
>
> IOW, something like the delta below.  Comments?

I will try testing later.
For now, as a comment before testing, the patch looks weird for me. It
checks d_lockref.count twice during d_lockref.lock held. It must be the
same result, isn't it? Or does it mean that denty can be handled by
lockref_mark_dead() even if d_lockref.lock is held?


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.15 .. and continuation of merge window

2014-06-11 Thread J. R. Okajima

Al Viro:
 So I suspect that the right fix is a bit trickier - in addition to check
 on the fast path (i.e. when trylock gets us the lock on parent), we need
 to
   * get rcu_read_lock() before dropping -d_lock.
   * check if dentry is already doomed right after taking rcu_read_lock();
 if not, any value we might see in -d_parent afterwards will point to object
 not freed until we drop rcu_read_lock.

 IOW, something like the delta below.  Comments?

I will try testing later.
For now, as a comment before testing, the patch looks weird for me. It
checks d_lockref.count twice during d_lockref.lock held. It must be the
same result, isn't it? Or does it mean that denty can be handled by
lockref_mark_dead() even if d_lockref.lock is held?


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.15 .. and continuation of merge window

2014-06-11 Thread J. R. Okajima

Al Viro:
 The former guarantees that the address we are doing trylock on would be that
 of a live dentry.  The latter makes sure that anything assigned to 
 dentry-d_parent after we drop -d_lock will not be freed until we drop
 rcu_read_lock.

Although I don't know how to reproduce the problem for the latter case,
I think the patch is good.


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.15 .. and continuation of merge window

2014-06-08 Thread J. R. Okajima

Linus Torvalds:
> So I ended up doing an rc8 because I was a bit worried about some
> last-minute dcache fixes, but it turns out that nobody seemed to even
> notice those. We did have other issues during the week, though, so it
:::

I am afraid there is a problem in dcache. Please read
http://marc.info/?l=linux-fsdevel=140214911608925=2


For 3.16, please consider these patches.

- for vfs
  [PATCH v2] vfs: get_next_ino(), never inum=0
  http://marc.info/?l=linux-fsdevel=140128600801771=2

- for tmpfs
  [RFC PATCH v4 0/2] tmpfs: manage the inode-number by IDR (performance measure)
  http://marc.info/?l=linux-fsdevel=140197128021025=2
  [RFC PATCH v4 1/2] tmpfs: manage the inode-number by IDR, signed int inum
  http://marc.info/?l=linux-fsdevel=140197128321029=2
  [RFC PATCH v4 2/2] tmpfs: refine a file handle for NFS-exporting
  http://marc.info/?l=linux-fsdevel=140197128121026=2


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 3.15 .. and continuation of merge window

2014-06-08 Thread J. R. Okajima

Linus Torvalds:
 So I ended up doing an rc8 because I was a bit worried about some
 last-minute dcache fixes, but it turns out that nobody seemed to even
 notice those. We did have other issues during the week, though, so it
:::

I am afraid there is a problem in dcache. Please read
http://marc.info/?l=linux-fsdevelm=140214911608925w=2


For 3.16, please consider these patches.

- for vfs
  [PATCH v2] vfs: get_next_ino(), never inum=0
  http://marc.info/?l=linux-fsdevelm=140128600801771w=2

- for tmpfs
  [RFC PATCH v4 0/2] tmpfs: manage the inode-number by IDR (performance measure)
  http://marc.info/?l=linux-fsdevelm=140197128021025w=2
  [RFC PATCH v4 1/2] tmpfs: manage the inode-number by IDR, signed int inum
  http://marc.info/?l=linux-fsdevelm=140197128321029w=2
  [RFC PATCH v4 2/2] tmpfs: refine a file handle for NFS-exporting
  http://marc.info/?l=linux-fsdevelm=140197128121026w=2


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Unionmount and overlayfs testsuite

2014-05-30 Thread J. R. Okajima

David Howells:
> >   Does readlink(2) return "/u/fileA" instead of /ro/fileA?"
>
> No.
>
> The test suite sets the lower symlink to point to the union path for its 
> target.
>
> [root@andromeda union-testsuite]# readlink /lower/a/indirect_dir_sym100 
> /mnt/a/direct_dir_sym100

Now I've found your latest commit fixed the path.
No problemo.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Unionmount and overlayfs testsuite

2014-05-30 Thread J. R. Okajima

David Howells:
Does readlink(2) return /u/fileA instead of /ro/fileA?

 No.

 The test suite sets the lower symlink to point to the union path for its 
 target.

 [root@andromeda union-testsuite]# readlink /lower/a/indirect_dir_sym100 
 /mnt/a/direct_dir_sym100

Now I've found your latest commit fixed the path.
No problemo.


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Unionmount and overlayfs testsuite

2014-05-29 Thread J. R. Okajima

"J. R. Okajima":
> - readlink.test,
>   fs_op readlink $file -R $testdir/direct_dir_sym100 ${termslash:+-E 
> EINVAL}
>   It expects "$testdir/direct_dir_sym100". Does it mean UnionMount
>   converts the target path?
>   For example,
>   - /u = /rw + /ro
>   - /rw/symlinkA doesn't exist
>   - /ro/symlinkA points to /ro/fileA
>   Does readlink(2) return "/u/fileA" instead of "/u/fileA"?

Ouch!
I meant "Does ... instead of /ro/fileA?"


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Unionmount and overlayfs testsuite

2014-05-29 Thread J. R. Okajima

David Howells:
> http://git.infradead.org/users/dhowells/unionmount-testsuite.git

I've found some interesting cases.

- impermissible.test,
open_file_as_bin -t -w $file -E EACCES
  When $termslash is "/", a '/' is appended to the expanded $file, such
  as "/path/fileA/". If fileA is a regular file, that path is obviously
  wrong. Does UnionMount expect EACCES in this case too?
  Should it be ENOTDIR?
  It might be better to change
errcode=EACCES
test ! "$termslash" = "" && errcode=ENOTDIR
open_file_as_bin -t -w $file -E $errcode

- readlink.test,
fs_op readlink $file -R $testdir/direct_dir_sym100 ${termslash:+-E 
EINVAL}
  It expects "$testdir/direct_dir_sym100". Does it mean UnionMount
  converts the target path?
  For example,
  - /u = /rw + /ro
  - /rw/symlinkA doesn't exist
  - /ro/symlinkA points to /ro/fileA
  Does readlink(2) return "/u/fileA" instead of "/u/fileA"?

And all tests should be done by a superuser?


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Unionmount and overlayfs testsuite

2014-05-29 Thread J. R. Okajima

David Howells:
 http://git.infradead.org/users/dhowells/unionmount-testsuite.git

I've found some interesting cases.

- impermissible.test,
open_file_as_bin -t -w $file -E EACCES
  When $termslash is /, a '/' is appended to the expanded $file, such
  as /path/fileA/. If fileA is a regular file, that path is obviously
  wrong. Does UnionMount expect EACCES in this case too?
  Should it be ENOTDIR?
  It might be better to change
errcode=EACCES
test ! $termslash =   errcode=ENOTDIR
open_file_as_bin -t -w $file -E $errcode

- readlink.test,
fs_op readlink $file -R $testdir/direct_dir_sym100 ${termslash:+-E 
EINVAL}
  It expects $testdir/direct_dir_sym100. Does it mean UnionMount
  converts the target path?
  For example,
  - /u = /rw + /ro
  - /rw/symlinkA doesn't exist
  - /ro/symlinkA points to /ro/fileA
  Does readlink(2) return /u/fileA instead of /u/fileA?

And all tests should be done by a superuser?


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Unionmount and overlayfs testsuite

2014-05-29 Thread J. R. Okajima

J. R. Okajima:
 - readlink.test,
   fs_op readlink $file -R $testdir/direct_dir_sym100 ${termslash:+-E 
 EINVAL}
   It expects $testdir/direct_dir_sym100. Does it mean UnionMount
   converts the target path?
   For example,
   - /u = /rw + /ro
   - /rw/symlinkA doesn't exist
   - /ro/symlinkA points to /ro/fileA
   Does readlink(2) return /u/fileA instead of /u/fileA?

Ouch!
I meant Does ... instead of /ro/fileA?


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/13] overlay filesystem v22

2014-05-25 Thread J. R. Okajima

Thanks for CC-ing me.

Here are some comments.

- I have no objection about the 0:0 char-dev whiteout, but you don't
  have to have the inode for each whiteout. The hardlink is better.
  In this version, you have  now. How about creating a "base"
  whiteout under workdir at the mount-time? Maybe it is possible by
  user-space "mount.overlayfs" or kernel-space. When the whiteout meets
  EMLINK, create a non-hardlink for that target synchronously and
  re-create the "base" asynchronously.

- Is /work always visible to users? If a user accidentally
  removes it or its children, then some operations will fail. And he
  cannot recover it anymore. I know it cannot easily happen since its
  mode is 0. But I am afraid it will be a source of troubles. For
  example, find(1) or "ls -R /overlayfs" will almost always fail.

- If I remember correctly, the length of the dir mutex is held time has
  been pointed out. This version has still a long mutex held time, a whole
  copy-up operation includeing lookup, create, copy filedata, copy
  xattr/attr, and then rename. How about unlock the dir before copying
  filedata and re-lock and confirm after copying attr?

- When two processes copy-up a similar dir hierarcy, for example
  /dirA/dirB/fileC and /dirA/dirB/dirC/fileD, may a race condition
  happen? Two processes begin copying-up dirA, first processA succeeds
  it and second processB fails and returns EIO?

- All copy-up operations will be serialized due to  lock.

- In fstat(2) for a dir, is nlink set to 1 even it is removed?

- In readdir, it lookup or getattr to determine whether the found char
  dev entry is a whiteout or not. I know a char dev is not so many, so
  this overhead won't be large. But if its name represented "I am a
  whiteout", then the extra lookup or getattr would be unnecessary.


My personal impression for overall is overlayfs starts growing.
Also several parts look like towarding aufs. For example,
- a  means an overlayfs specific work. Aufs has such special
  dir for copying-up an unlinked file and a pseudo-link. Both are
  unnecessary for overlayfs because overlayfs copies-up a file in
  open(2) time, and doesn't support the hardlink between layers.
- many small wrapper functions for VFS helpers resemble to aufs
  too. In aufs, all they have lockdep_off/on.
- the internal cache for readdir allocating extra memory. Aufs adopts
  a simple hashing, while overlayfs uses rbtree.

But of course the fundamental design differences between overlayfs and
aufs are kept, such as
- a name-based union .vs. an inode-aware union
- multiple layers
- allow users to access the layers directly
- etc...

If LKML people consider merging overlayfs, then I'd ask to consier aufs
too. The basic design is unchanged since when I posted a few years ago.
http://marc.info/?l=linux-kernel=123934927611907=2

And latest one is
http://aufs.sourceforge.net


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 00/13] overlay filesystem v22

2014-05-25 Thread J. R. Okajima

Thanks for CC-ing me.

Here are some comments.

- I have no objection about the 0:0 char-dev whiteout, but you don't
  have to have the inode for each whiteout. The hardlink is better.
  In this version, you have workdir now. How about creating a base
  whiteout under workdir at the mount-time? Maybe it is possible by
  user-space mount.overlayfs or kernel-space. When the whiteout meets
  EMLINK, create a non-hardlink for that target synchronously and
  re-create the base asynchronously.

- Is workdir/work always visible to users? If a user accidentally
  removes it or its children, then some operations will fail. And he
  cannot recover it anymore. I know it cannot easily happen since its
  mode is 0. But I am afraid it will be a source of troubles. For
  example, find(1) or ls -R /overlayfs will almost always fail.

- If I remember correctly, the length of the dir mutex is held time has
  been pointed out. This version has still a long mutex held time, a whole
  copy-up operation includeing lookup, create, copy filedata, copy
  xattr/attr, and then rename. How about unlock the dir before copying
  filedata and re-lock and confirm after copying attr?

- When two processes copy-up a similar dir hierarcy, for example
  /dirA/dirB/fileC and /dirA/dirB/dirC/fileD, may a race condition
  happen? Two processes begin copying-up dirA, first processA succeeds
  it and second processB fails and returns EIO?

- All copy-up operations will be serialized due to workdir lock.

- In fstat(2) for a dir, is nlink set to 1 even it is removed?

- In readdir, it lookup or getattr to determine whether the found char
  dev entry is a whiteout or not. I know a char dev is not so many, so
  this overhead won't be large. But if its name represented I am a
  whiteout, then the extra lookup or getattr would be unnecessary.


My personal impression for overall is overlayfs starts growing.
Also several parts look like towarding aufs. For example,
- a workdir means an overlayfs specific work. Aufs has such special
  dir for copying-up an unlinked file and a pseudo-link. Both are
  unnecessary for overlayfs because overlayfs copies-up a file in
  open(2) time, and doesn't support the hardlink between layers.
- many small wrapper functions for VFS helpers resemble to aufs
  too. In aufs, all they have lockdep_off/on.
- the internal cache for readdir allocating extra memory. Aufs adopts
  a simple hashing, while overlayfs uses rbtree.

But of course the fundamental design differences between overlayfs and
aufs are kept, such as
- a name-based union .vs. an inode-aware union
- multiple layers
- allow users to access the layers directly
- etc...

If LKML people consider merging overlayfs, then I'd ask to consier aufs
too. The basic design is unchanged since when I posted a few years ago.
http://marc.info/?l=linux-kernelm=123934927611907w=2

And latest one is
http://aufs.sourceforge.net


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] ima: audit log files opened with O_DIRECT flag

2014-05-13 Thread J. R. Okajima

Mimi Zohar:
> As a temporary fix, do not measure, appraise, or audit files
> opened with the O_DIRECT flag set.  Just audit log it.

I have no objection about the patch, but have a question.
Are you intending to put it into mainline now (and stable too)? Or is
this a local bandage for whoever have met the problem (like me)?
In other words, should I wait for another lock free solution from Dmitry
Kasatkin?

By the way, the mail is not delivered to stable-ML while there is
"Cc: stable..." line in the commit log.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] ima: audit log files opened with O_DIRECT flag

2014-05-13 Thread J. R. Okajima

Mimi Zohar:
 As a temporary fix, do not measure, appraise, or audit files
 opened with the O_DIRECT flag set.  Just audit log it.

I have no objection about the patch, but have a question.
Are you intending to put it into mainline now (and stable too)? Or is
this a local bandage for whoever have met the problem (like me)?
In other words, should I wait for another lock free solution from Dmitry
Kasatkin?

By the way, the mail is not delivered to stable-ML while there is
Cc: stable... line in the commit log.


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)

2014-05-09 Thread J. R. Okajima

Mimi Zohar:
> I assume so, as there wasn't any comment.  As a temporary fix, would it
> make sense not to measure/appraise/audit files opened with the direct-io
> flag based policy?  Define a new IMA policy option 'directio'.  A sample
> rule would look like:  
>
> dont_appraise bprm_check directio fsuuid=... 

I prefer such approach or anything addressing in IMA only, so it makes
sense.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)

2014-05-09 Thread J. R. Okajima

"J. R. Okajima":
> do_blockdev_direct_IO(), I'd suggest
> - make two new static inline functions like
>   r = ima_aware_file_inode_mutex_lock(file) and ..._unlock(r, file).
> - these new functions are complied when CONFIG_IMA is enabled, otherwise
>   they are plain mutex_lock/unlock().
> - then do_blockdev_direct_IO() can call them blindly.
> - of course, O_DIRECT_HAVELOCK should be complied only when CONFIG_IMA
>   is enabled too.

One more thing.
Since struct file is a sharable object, it might be better to put
task-id into struct file. Hmm, then should we support for multiple tasks
by list or something? Oh the code grows...


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)

2014-05-09 Thread J. R. Okajima

Mimi Zohar:
> Another approach was posted here
> http://marc.info/?l=linux-security-module=138919062430367=2 which
> also was not upstreamed.

It might be better a little than previous one which handles the flag
temporarily. But, in order to make the code cleaner particulary for
do_blockdev_direct_IO(), I'd suggest
- make two new static inline functions like
  r = ima_aware_file_inode_mutex_lock(file) and ..._unlock(r, file).
- these new functions are complied when CONFIG_IMA is enabled, otherwise
  they are plain mutex_lock/unlock().
- then do_blockdev_direct_IO() can call them blindly.
- of course, O_DIRECT_HAVELOCK should be complied only when CONFIG_IMA
  is enabled too.

I can guess that several people thinks that is still "ugly locking", but
the deadlock is much ugly in real world. And we need some workaround for
it.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)

2014-05-09 Thread J. R. Okajima

Dmitry Kasatkin:
> It is a different issue.
>
> I made patch more than a year ago which fix the problem
>
> https://lkml.org/lkml/2013/2/20/601

Yes, I know this is a different issue, but I didn't know the patch.
While I am not sure how ugly (or beautiful) is the approache the patch
took, as long as IMA needs to read the file, we should merge the merge.
Or we should wait for your another patch.


J. R. Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)

2014-05-09 Thread J. R. Okajima

Dmitry Kasatkin:
 It is a different issue.

 I made patch more than a year ago which fix the problem

 https://lkml.org/lkml/2013/2/20/601

Yes, I know this is a different issue, but I didn't know the patch.
While I am not sure how ugly (or beautiful) is the approache the patch
took, as long as IMA needs to read the file, we should merge the merge.
Or we should wait for your another patch.


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)

2014-05-09 Thread J. R. Okajima

Mimi Zohar:
 Another approach was posted here
 http://marc.info/?l=linux-security-modulem=138919062430367w=2 which
 also was not upstreamed.

It might be better a little than previous one which handles the flag
temporarily. But, in order to make the code cleaner particulary for
do_blockdev_direct_IO(), I'd suggest
- make two new static inline functions like
  r = ima_aware_file_inode_mutex_lock(file) and ..._unlock(r, file).
- these new functions are complied when CONFIG_IMA is enabled, otherwise
  they are plain mutex_lock/unlock().
- then do_blockdev_direct_IO() can call them blindly.
- of course, O_DIRECT_HAVELOCK should be complied only when CONFIG_IMA
  is enabled too.

I can guess that several people thinks that is still ugly locking, but
the deadlock is much ugly in real world. And we need some workaround for
it.


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)

2014-05-09 Thread J. R. Okajima

J. R. Okajima:
 do_blockdev_direct_IO(), I'd suggest
 - make two new static inline functions like
   r = ima_aware_file_inode_mutex_lock(file) and ..._unlock(r, file).
 - these new functions are complied when CONFIG_IMA is enabled, otherwise
   they are plain mutex_lock/unlock().
 - then do_blockdev_direct_IO() can call them blindly.
 - of course, O_DIRECT_HAVELOCK should be complied only when CONFIG_IMA
   is enabled too.

One more thing.
Since struct file is a sharable object, it might be better to put
task-id into struct file. Hmm, then should we support for multiple tasks
by list or something? Oh the code grows...


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: IMA + O_DIRECT (Re: [PATCH 0/1] fix IMA + Apparmor kernel panic)

2014-05-09 Thread J. R. Okajima

Mimi Zohar:
 I assume so, as there wasn't any comment.  As a temporary fix, would it
 make sense not to measure/appraise/audit files opened with the direct-io
 flag based policy?  Define a new IMA policy option 'directio'.  A sample
 rule would look like:  

 dont_appraise bprm_check directio fsuuid=... 

I prefer such approach or anything addressing in IMA only, so it makes
sense.


J. R. Okajima
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   >