Re: bpf: Massive skbuff_head_cache memory leak?
Hello, Alexei and Daniel. Can you show us how to run testcases you are testing? On 2018/09/22 22:25, Tetsuo Handa wrote: > Hello. > > syzbot is reporting many lockup problems on bpf.git / bpf-next.git / net.git > / net-next.git trees. > > INFO: rcu detected stall in br_multicast_port_group_expired (2) > > https://syzkaller.appspot.com/bug?id=15c7ad8cf35a07059e8a697a22527e11d294bc94 > > INFO: rcu detected stall in tun_chr_close > > https://syzkaller.appspot.com/bug?id=6c50618bde03e5a2eefdd0269cf9739c5ebb8270 > > INFO: rcu detected stall in discover_timer > > https://syzkaller.appspot.com/bug?id=55da031ddb910e58ab9c6853a5784efd94f03b54 > > INFO: rcu detected stall in ret_from_fork (2) > > https://syzkaller.appspot.com/bug?id=c83129a6683b44b39f5b8864a1325893c9218363 > > INFO: rcu detected stall in addrconf_rs_timer > > https://syzkaller.appspot.com/bug?id=21c029af65f81488edbc07a10ed20792444711b6 > > INFO: rcu detected stall in kthread (2) > > https://syzkaller.appspot.com/bug?id=6accd1ed11c31110fed1982f6ad38cc9676477d2 > > INFO: rcu detected stall in ext4_filemap_fault > > https://syzkaller.appspot.com/bug?id=817e38d20e9ee53390ac361bf0fd2007eaf188af > > INFO: rcu detected stall in run_timer_softirq (2) > > https://syzkaller.appspot.com/bug?id=f5a230a3ff7822f8d39fddf8485931bd06ae47fe > > INFO: rcu detected stall in bpf_prog_ADDR > > https://syzkaller.appspot.com/bug?id=fb4911fd0e861171cc55124e209f810a0dd68744 > > INFO: rcu detected stall in __run_timers (2) > > https://syzkaller.appspot.com/bug?id=65416569ddc8d2feb8f19066aa761f5a47f7451a > > The cause of lockup seems to be flood of printk() messages from memory > allocation > failures, and one of out_of_memory() messages indicates that skbuff_head_cache > usage is huge enough to suspect in-kernel memory leaks. > > [ 1554.547011] skbuff_head_cache1847887KB1847887KB > > Unfortunately, we cannot find from logs what syzbot is trying to do > because constant printk() messages is flooding away syzkaller messages. > Can you try running your testcases with kmemleak enabled? > On 2018/09/27 2:35, Dmitry Vyukov wrote: > I also started suspecting Apparmor. We switched to Apparmor on Aug 30: > https://groups.google.com/d/msg/syzkaller-bugs/o73lO4KGh0w/j9pcH2tSBAAJ > Now the instances that use SELinux and Smack explicitly contain that > in the name, but the rest are Apparmor. > Aug 30 roughly matches these assorted "task hung" reports. Perhaps > some Apparmor hook leaks a reference to skbs? Maybe. They have CONFIG_DEFAULT_SECURITY="apparmor". But I'm wondering why this problem is not occurring on linux-next.git when this problem is occurring on bpf.git / bpf-next.git / net.git / net-next.git trees. Is syzbot running different testcases depending on which git tree is targeted?
Re: WARNING in add_uevent_var
On 2018/04/03 21:34, Johannes Berg wrote: > On Sun, 2018-04-01 at 23:01 -0700, syzbot wrote: > >> So far this crash happened 5 times on net-next, upstream. >> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6614377067184128 >> > > Huh, fun. Looks like you're basically creating a new HWSIM radio with an > insanely long name (4k!) and nothing stops you, until we try to generate > an rfkill instance which sends a uevent and only has a 2k buffer for the > environment variables, where we put the name ... > > But yeah, we should probably limit the phy name to something sane, I'll > pick 128 and send a patch. > I think it should be NAME_MAX. #include #include #include #include #include int main(int argc, char *argv[]) { const int fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_GENERIC); static struct { struct nlmsghdr hdr; char data[4120 - sizeof(struct nlmsghdr)]; } buf = { }; struct sockaddr_nl addr = { .nl_family = AF_NETLINK }; struct iovec iov = { , sizeof(buf) }; struct msghdr msg = { .msg_name = , .msg_namelen = sizeof(addr), .msg_iov = , .msg_iovlen = 1, }; buf.hdr.nlmsg_len = 0x1018; buf.hdr.nlmsg_type = 0x22; buf.hdr.nlmsg_flags = 0x109; *(uint8_t*) buf.data = 4; *(uint8_t*) (buf.data + 1) = 0; *(uint16_t*) (buf.data + 2) = 0; *(uint16_t*) (buf.data + 4) = 0x1004; *(uint16_t*) (buf.data + 6) = 0x11; memset(buf.data + 8, 'A', 4096); /* strlen() > NAME_MAX */ sendmsg(fd, , 0); return 0; } >From 3eba6541da0c7338e3d71ea83cbc69962923d65e Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Date: Mon, 7 May 2018 15:58:37 +0900 Subject: [PATCH] net: rfkill: Add filename varidity test at rfkill_alloc(). syzbot is hitting WARN() at kobject_uevent_env() [1]. This is because the test case is requesting too long name. Since the name is used as part of pathname under /sys/devices/virtual/ , this name parameter must obey the limitations of a filename. Fix this by applying name validity test to rfkill_alloc(). [1] https://syzkaller.appspot.com/bug?id=c1e1ab1d042b637ee9e25c64b107d51630b9d9ec Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Reported-by: syzbot <syzbot+230d9e642a85d3fec...@syzkaller.appspotmail.com> Cc: Johannes Berg <johan...@sipsolutions.net> --- net/rfkill/core.c | 4 1 file changed, 4 insertions(+) diff --git a/net/rfkill/core.c b/net/rfkill/core.c index 59d0eb9..d8beebc 100644 --- a/net/rfkill/core.c +++ b/net/rfkill/core.c @@ -933,6 +933,10 @@ struct rfkill * __must_check rfkill_alloc(const char *name, if (WARN_ON(type == RFKILL_TYPE_ALL || type >= NUM_RFKILL_TYPES)) return NULL; + if (strlen(name) > NAME_MAX || strchr(name, '/') || + !strcmp(name, ".") || !strcmp(name, "..")) + return NULL; + rfkill = kzalloc(sizeof(*rfkill) + strlen(name) + 1, GFP_KERNEL); if (!rfkill) return NULL; -- 1.8.3.1
Re: WARNING in kernfs_add_one
On 2018/05/06 7:07, Greg KH wrote: >> More likely wireless territory > > Ugh, that's what I get for writing emails before coffee in the > morning... > > Yes, you are right, this looks like a wireless issue. > > Now cc: linux-wireless. > Nope, if you look at previous fault injection messages... >From 7ddcaa3d4327d4f29d11053bd2011bf77ecf72af Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Date: Mon, 7 May 2018 14:19:50 +0900 Subject: [PATCH] driver core: Don't ignore class_dir_create_and_add() failure. syzbot is hitting WARN() at kernfs_add_one() [1]. This is because kernfs_create_link() is confused by previous device_add() call which continued without setting dev->kobj.parent field when get_device_parent() failed by memory allocation fault injection. Fix this by propagating the error from class_dir_create_and_add() to the calllers of get_device_parent(). [1] https://syzkaller.appspot.com/bug?id=fae0fb607989ea744526d1c082a5b8de6529116f Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Reported-by: syzbot <syzbot+df47f81c226b31d89...@syzkaller.appspotmail.com> Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- drivers/base/core.c | 14 -- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/drivers/base/core.c b/drivers/base/core.c index b610816..d680fd0 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1467,7 +1467,7 @@ class_dir_create_and_add(struct class *class, struct kobject *parent_kobj) dir = kzalloc(sizeof(*dir), GFP_KERNEL); if (!dir) - return NULL; + return ERR_PTR(-ENOMEM); dir->class = class; kobject_init(>kobj, _dir_ktype); @@ -1477,7 +1477,7 @@ class_dir_create_and_add(struct class *class, struct kobject *parent_kobj) retval = kobject_add(>kobj, parent_kobj, "%s", class->name); if (retval < 0) { kobject_put(>kobj); - return NULL; + return ERR_PTR(retval); } return >kobj; } @@ -1784,6 +1784,10 @@ int device_add(struct device *dev) parent = get_device(dev->parent); kobj = get_device_parent(dev, parent); + if (IS_ERR(kobj)) { + error = PTR_ERR(kobj); + goto parent_error; + } if (kobj) dev->kobj.parent = kobj; @@ -1882,6 +1886,7 @@ int device_add(struct device *dev) kobject_del(>kobj); Error: cleanup_glue_dir(dev, glue_dir); +parent_error: put_device(parent); name_error: kfree(dev->p); @@ -2701,6 +2706,11 @@ int device_move(struct device *dev, struct device *new_parent, device_pm_lock(); new_parent = get_device(new_parent); new_parent_kobj = get_device_parent(dev, new_parent); + if (IS_ERR(new_parent_kobj)) { + error = PTR_ERR(new_parent_kobj); + put_device(new_parent); + goto out; + } pr_debug("device: '%s': %s: moving to '%s'\n", dev_name(dev), __func__, new_parent ? dev_name(new_parent) : ""); -- 1.8.3.1
Re: INFO: rcu detected stall in __schedule
I'm not sure whether this is a PPP bug. As of uptime = 484, RCU says that it stalled for 125 seconds. -- [ 484.407032] INFO: rcu_sched self-detected stall on CPU [ 484.412488] 0-...!: (125000 ticks this GP) idle=f3e/1/4611686018427387906 softirq=112858/112858 fqs=0 [ 484.422300] (t=125000 jiffies g=61626 c=61625 q=1534) [ 484.427663] rcu_sched kthread starved for 125000 jiffies! g61626 c61625 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0 -- 484 - 125 = 359, which was about to start SND related fuzzing in that log. -- 2033/05/18 03:36:31 executing program 1: r0 = socket(0x4a, 0x5, 0x7) setsockopt$inet_int(r0, 0x0, 0x18, &(0x7f00)=0x200, 0x4) bind$inet6(r0, &(0x7fc0)={0xa, 0x0, 0x0, @loopback={0x0, 0x1}}, 0x1c) perf_event_open(&(0x7f40)={0x2, 0x70, 0x3e5}, 0x0, 0x, 0x, 0x0) timer_create(0x0, &(0x7f0001c0)={0x0, 0x15, 0x0, @thr={&(0x7f000440), &(0x7f000540)}}, &(0x7f000200)) timer_getoverrun(0x0) perf_event_open(&(0x7f25c000)={0x2, 0x78, 0x3e3}, 0x0, 0x0, 0x, 0x0) r1 = syz_open_dev$sndctrl(&(0x7f000200)='/dev/snd/controlC#\x00', 0x2, 0x0) perf_event_open(&(0x7f001000)={0x0, 0x70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8ce, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfff8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @perf_bp={&(0x7f005000), 0x2}, 0x10c}, 0x0, 0x0, 0x, 0x0) ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7fc0)=0x1) r2 = syz_open_dev$sndpcmp(&(0x7f000100)='/dev/snd/pcmC#D#p\x00', 0x1, 0x4000) ioctl$SNDRV_SEQ_IOCTL_GET_QUEUE_CLIENT(r2, 0xc04c5349, &(0x7f000240)={0x200, 0xfcdc, 0x1}) syz_open_dev$tun(&(0x7f0003c0)='/dev/net/tun\x00', 0x0, 0x20402) ioctl$SNDRV_CTL_IOCTL_PVERSION(r1, 0xc1105517, &(0x7f001000)=""/250) ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7f00)) 2033/05/18 03:36:31 executing program 4: syz_emit_ethernet(0x3e, &(0x7fc0)={@broadcast=[0xff, 0xff, 0xff, 0xff, 0xff, 0xff], @empty=[0x0, 0x0, 0xb00], [], {@ipv4={0x800, {{0x5, 0x4, 0x0, 0x0, 0x30, 0x0, 0x0, 0x0, 0x1, 0x0, @remote={0xac, 0x14, 0x14, 0xbb}, @dev={0xac, 0x14, 0x14}}, @icmp=@parameter_prob={0x5, 0x4, 0x0, 0x0, 0x0, 0x0, {0x5, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @local={0xac, 0x223, 0x14, 0xaa}, @dev={0xac, 0x14, 0x14}}}, &(0x7f00)={0x0, 0x2, [0x0, 0x2e6]}) 2033/05/18 03:36:31 executing program 1: r0 = socket$pppoe(0x18, 0x1, 0x0) connect$pppoe(r0, &(0x7fc0)={0x18, 0x0, {0x1, @broadcast=[0xff, 0xff, 0xff, 0xff, 0xff, 0xff], 'ip6_vti0\x00'}}, 0x1e) r1 = socket(0x3, 0xb, 0x8001) setsockopt$inet_sctp6_SCTP_ADAPTATION_LAYER(r1, 0x84, 0x7, &(0x7f000100)={0x2}, 0x4) ioctl$sock_inet_SIOCGIFADDR(r0, 0x8915, &(0x7f40)={'veth1_to_bridge\x00', {0x2, 0x4e21}}) r2 = syz_open_dev$admmidi(&(0x7f00)='/dev/admmidi#\x00', 0x6, 0x8000) setsockopt$SO_VM_SOCKETS_BUFFER_MAX_SIZE(r2, 0x28, 0x2, &(0x7f80)=0xff00, 0x8) [ 359.306427] snd_virmidi snd_virmidi.0: control 112:0:0:�:0 is already present --
Re: WARNING: refcount bug in should_fail
Eric W. Biederman wrote: > Al Virowrites: > > > On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote: > > > >> FWIW, I'm going through the ->kill_sb() instances, fixing that sort > >> of bugs (most of them preexisting, but I should've checked instead > >> of assuming that everything's fine). Will push out later tonight. > > > > OK, see vfs.git#for-linus. Caught: 4 old bugs (allocation failure > > in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp. > > and double-dput in late failure exit in rpc_fill_super()) > > and 5 regressions from register_shrinker() failure recovery. > > One issue with your vfs.git#for-linus branch. > > It is missing Fixes tags and Cc: stable on those patches. > As the bug came in v4.15 those tags would really help the stable > maintainers get the recent regression fixes applied. OK. The patch was sent to linux.git as commit 8e04944f0ea8b838. #syz fix: mm,vmscan: Allow preallocating memory for register_shrinker().
Re: WARNING: refcount bug in should_fail
Dmitry Vyukov wrote: > On Sun, Apr 1, 2018 at 12:32 PM, Dmitry Vyukov <dvyu...@google.com> wrote: > > On Sun, Mar 4, 2018 at 6:57 AM, Tetsuo Handa > > <penguin-ker...@i-love.sakura.ne.jp> wrote: > >> Switching from mm to fsdevel, for this report says that put_net(net) in > >> rpc_kill_sb() made net->count < 0 when mount_ns() failed due to > >> register_shrinker() failure. > >> > >> Relevant commits will be > >> commit 9ee332d99e4d5a97 ("sget(): handle failures of register_shrinker()") > >> and > >> commit d91ee87d8d85a080 ("vfs: Pass data, ns, and ns->userns to > >> mount_ns."). > >> > >> When sget_userns() in mount_ns() failed, mount_ns() returns an error code > >> to > >> the caller without calling fill_super(). That is, get_net(sb->s_fs_info) > >> was > >> not called by rpc_fill_super() (via fill_super callback passed to > >> mount_ns()) > >> but put_net(sb->s_fs_info) is called by rpc_kill_sb() (via fs->kill_sb() > >> from > >> deactivate_locked_super()). > >> > >> -- > >> static struct dentry * > >> rpc_mount(struct file_system_type *fs_type, > >> int flags, const char *dev_name, void *data) > >> { > >> struct net *net = current->nsproxy->net_ns; > >> return mount_ns(fs_type, flags, data, net, net->user_ns, > >> rpc_fill_super); > >> } > >> -- > > > > Messed kernel output, this is definitely not in should_fail. > > > > #syz dup: WARNING: refcount bug in sk_alloc > > Please don't drop reporter (syzbot) email from CC. > > #syz dup: WARNING: refcount bug in sk_alloc > Excuse me? This "refcount bug in should_fail" is talking about sget_userns() versus rpc_fill_super(). I think we need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080. > > >> syzbot wrote: > >>> Hello, > >>> > >>> syzbot hit the following crash on bpf-next commit > >>> 6f1b5a2b58d8470e5a8b25ab29f5fdb46168 (Tue Feb 27 04:11:23 2018 +) > >>> Merge branch 'bpf-kselftest-improvements' > >>> > >>> C reproducer is attached. > >>> syzkaller reproducer is attached. > >>> Raw console output is attached. > >>> compiler: gcc (GCC) 7.1.1 20170620 > >>> .config is attached. > >>> > >>> IMPORTANT: if you fix the bug, please add the following tag to the commit: > >>> Reported-by: syzbot+84371b6062cb639d7...@syzkaller.appspotmail.com > >>> It will help syzbot understand when the bug is fixed. See footer for > >>> details. > >>> If you forward the report, please keep this part and the footer. > >>> > >>> [ cut here ] > >>> FAULT_INJECTION: forcing a failure. > >>> name failslab, interval 1, probability 0, space 0, times 0 > >>> refcount_t: underflow; use-after-free. > >>> CPU: 1 PID: 4239 Comm: syzkaller149381 Not tainted 4.16.0-rc2+ #20 > >>> WARNING: CPU: 0 PID: 4237 at lib/refcount.c:187 > >>> refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:187 > >>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > >>> Google 01/01/2011 > >>> Call Trace: > >>> Kernel panic - not syncing: panic_on_warn set ... > >>> > >>> __dump_stack lib/dump_stack.c:17 [inline] > >>> dump_stack+0x194/0x24d lib/dump_stack.c:53 > >>> fail_dump lib/fault-inject.c:51 [inline] > >>> should_fail+0x8c0/0xa40 lib/fault-inject.c:149 > >>> should_failslab+0xec/0x120 mm/failslab.c:32 > >>> slab_pre_alloc_hook mm/slab.h:422 [inline] > >>> slab_alloc mm/slab.c:3365 [inline] > >>> __do_kmalloc mm/slab.c:3703 [inline] > >>> __kmalloc+0x63/0x760 mm/slab.c:3714 > >>> kmalloc include/linux/slab.h:517 [inline] > >>> kzalloc include/linux/slab.h:701 [inline] > >>> register_shrinker+0x10e/0x2d0 mm/vmscan.c:268 > >>> sget_userns+0xbbf/0xe40 fs/super.c:520 > >>> mount_ns+0x6d/0x190 fs/super.c:1029 > >>> rpc_mount+0x9e/0xd0 net/sunrpc/rpc_pipe.c:1451 > >>> mount_fs+0x66/0x2d0 fs/super.c:1222 > >>> vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037 > >>> vfs_kern_mount fs/namespace.c:2509 [inline] > >>> do_new_mount fs/namespace.c:2512 [inline] > >>>
Re: WARNING: refcount bug in should_fail
syzbot wrote: > > On Sun, Mar 4, 2018 at 6:57 AM, Tetsuo Handa > > <penguin-ker...@i-love.sakura.ne.jp> wrote: > >> Switching from mm to fsdevel, for this report says that put_net(net) in > >> rpc_kill_sb() made net->count < 0 when mount_ns() failed due to > >> register_shrinker() failure. > > >> Relevant commits will be > >> commit 9ee332d99e4d5a97 ("sget(): handle failures of > >> register_shrinker()") and > >> commit d91ee87d8d85a080 ("vfs: Pass data, ns, and ns->userns to > >> mount_ns."). > > >> When sget_userns() in mount_ns() failed, mount_ns() returns an error > >> code to > >> the caller without calling fill_super(). That is, get_net(sb->s_fs_info) > >> was > >> not called by rpc_fill_super() (via fill_super callback passed to > >> mount_ns()) > >> but put_net(sb->s_fs_info) is called by rpc_kill_sb() (via fs->kill_sb() > >> from > >> deactivate_locked_super()). > > >> -- > >> static struct dentry * > >> rpc_mount(struct file_system_type *fs_type, > >> int flags, const char *dev_name, void *data) > >> { > >> struct net *net = current->nsproxy->net_ns; > >> return mount_ns(fs_type, flags, data, net, net->user_ns, > >> rpc_fill_super); > >> } > >> -- > > > Messed kernel output, this is definitely not in should_fail. > > > #syz dup: WARNING: refcount bug in sk_alloc > > Can't find the corresponding bug. > I don't think this is a dup of existing bug. We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080.
Re: WARNING: refcount bug in should_fail
Switching from mm to fsdevel, for this report says that put_net(net) in rpc_kill_sb() made net->count < 0 when mount_ns() failed due to register_shrinker() failure. Relevant commits will be commit 9ee332d99e4d5a97 ("sget(): handle failures of register_shrinker()") and commit d91ee87d8d85a080 ("vfs: Pass data, ns, and ns->userns to mount_ns."). When sget_userns() in mount_ns() failed, mount_ns() returns an error code to the caller without calling fill_super(). That is, get_net(sb->s_fs_info) was not called by rpc_fill_super() (via fill_super callback passed to mount_ns()) but put_net(sb->s_fs_info) is called by rpc_kill_sb() (via fs->kill_sb() from deactivate_locked_super()). -- static struct dentry * rpc_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { struct net *net = current->nsproxy->net_ns; return mount_ns(fs_type, flags, data, net, net->user_ns, rpc_fill_super); } -- syzbot wrote: > Hello, > > syzbot hit the following crash on bpf-next commit > 6f1b5a2b58d8470e5a8b25ab29f5fdb46168 (Tue Feb 27 04:11:23 2018 +) > Merge branch 'bpf-kselftest-improvements' > > C reproducer is attached. > syzkaller reproducer is attached. > Raw console output is attached. > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached. > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+84371b6062cb639d7...@syzkaller.appspotmail.com > It will help syzbot understand when the bug is fixed. See footer for > details. > If you forward the report, please keep this part and the footer. > > [ cut here ] > FAULT_INJECTION: forcing a failure. > name failslab, interval 1, probability 0, space 0, times 0 > refcount_t: underflow; use-after-free. > CPU: 1 PID: 4239 Comm: syzkaller149381 Not tainted 4.16.0-rc2+ #20 > WARNING: CPU: 0 PID: 4237 at lib/refcount.c:187 > refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:187 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > Kernel panic - not syncing: panic_on_warn set ... > > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x24d lib/dump_stack.c:53 > fail_dump lib/fault-inject.c:51 [inline] > should_fail+0x8c0/0xa40 lib/fault-inject.c:149 > should_failslab+0xec/0x120 mm/failslab.c:32 > slab_pre_alloc_hook mm/slab.h:422 [inline] > slab_alloc mm/slab.c:3365 [inline] > __do_kmalloc mm/slab.c:3703 [inline] > __kmalloc+0x63/0x760 mm/slab.c:3714 > kmalloc include/linux/slab.h:517 [inline] > kzalloc include/linux/slab.h:701 [inline] > register_shrinker+0x10e/0x2d0 mm/vmscan.c:268 > sget_userns+0xbbf/0xe40 fs/super.c:520 > mount_ns+0x6d/0x190 fs/super.c:1029 > rpc_mount+0x9e/0xd0 net/sunrpc/rpc_pipe.c:1451 > mount_fs+0x66/0x2d0 fs/super.c:1222 > vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037 > vfs_kern_mount fs/namespace.c:2509 [inline] > do_new_mount fs/namespace.c:2512 [inline] > do_mount+0xea4/0x2bb0 fs/namespace.c:2842 > SYSC_mount fs/namespace.c:3058 [inline] > SyS_mount+0xab/0x120 fs/namespace.c:3035 > do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287 > entry_SYSCALL_64_after_hwframe+0x42/0xb7 > RIP: 0033:0x4460f9 > RSP: 002b:7fbcd769ad78 EFLAGS: 0246 ORIG_RAX: 00a5 > RAX: ffda RBX: 006dcc6c RCX: 004460f9 > RDX: 2080 RSI: 2040 RDI: 2000 > RBP: 7fbcd769ad80 R08: 20c0 R09: 3131 > R10: R11: 0246 R12: 006dcc68 > R13: R14: 0037 R15: 0030656c69662f2e > CPU: 0 PID: 4237 Comm: syzkaller149381 Not tainted 4.16.0-rc2+ #20 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x24d lib/dump_stack.c:53 > panic+0x1e4/0x41c kernel/panic.c:183 > __warn+0x1dc/0x200 kernel/panic.c:547 > report_bug+0x211/0x2d0 lib/bug.c:184 > fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178 > fixup_bug arch/x86/kernel/traps.c:247 [inline] > do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296 > do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 > invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957 > RIP: 0010:refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:187 > RSP: 0018:8801b164f6d8 EFLAGS: 00010286 > RAX: dc08 RBX: RCX: 815ac30e > RDX: RSI: 1100362c9e8b RDI: 1100362c9e60 > RBP: 8801b164f768 R08: R09: > R10: 8801b164f610 R11: R12: 1100362c9edc > R13: R14: 0001 R15: 8801ae924044 > refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212 > put_net include/net/net_namespace.h:220 [inline] > rpc_kill_sb+0x253/0x3c0 net/sunrpc/rpc_pipe.c:1473 >
Re: [PATCH v2] lockdep: Fix fs_reclaim warning.
Peter, are you OK with this patch? Tetsuo Handa wrote: > From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001 > From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> > Date: Thu, 8 Feb 2018 10:35:35 +0900 > Subject: [PATCH v2] lockdep: Fix fs_reclaim warning. > > Dave Jones reported fs_reclaim lockdep warnings. > > > WARNING: possible recursive locking detected > 4.15.0-rc9-backup-debug+ #1 Not tainted > > sshd/24800 is trying to acquire lock: >(fs_reclaim){+.+.}, at: [<84f438c2>] > fs_reclaim_acquire.part.102+0x5/0x30 > > but task is already holding lock: >(fs_reclaim){+.+.}, at: [<84f438c2>] > fs_reclaim_acquire.part.102+0x5/0x30 > > other info that might help us debug this: >Possible unsafe locking scenario: > > CPU0 > > lock(fs_reclaim); > lock(fs_reclaim); > >*** DEADLOCK *** > >May be due to missing lock nesting notation > > 2 locks held by sshd/24800: >#0: (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] > tcp_sendmsg+0x19/0x40 >#1: (fs_reclaim){+.+.}, at: [<84f438c2>] > fs_reclaim_acquire.part.102+0x5/0x30 > > stack backtrace: > CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1 > Call Trace: >dump_stack+0xbc/0x13f >__lock_acquire+0xa09/0x2040 >lock_acquire+0x12e/0x350 >fs_reclaim_acquire.part.102+0x29/0x30 >kmem_cache_alloc+0x3d/0x2c0 >alloc_extent_state+0xa7/0x410 >__clear_extent_bit+0x3ea/0x570 >try_release_extent_mapping+0x21a/0x260 >__btrfs_releasepage+0xb0/0x1c0 >btrfs_releasepage+0x161/0x170 >try_to_release_page+0x162/0x1c0 >shrink_page_list+0x1d5a/0x2fb0 >shrink_inactive_list+0x451/0x940 >shrink_node_memcg.constprop.88+0x4c9/0x5e0 >shrink_node+0x12d/0x260 >try_to_free_pages+0x418/0xaf0 >__alloc_pages_slowpath+0x976/0x1790 >__alloc_pages_nodemask+0x52c/0x5c0 >new_slab+0x374/0x3f0 >___slab_alloc.constprop.81+0x47e/0x5a0 >__slab_alloc.constprop.80+0x32/0x60 >__kmalloc_track_caller+0x267/0x310 >__kmalloc_reserve.isra.40+0x29/0x80 >__alloc_skb+0xee/0x390 >sk_stream_alloc_skb+0xb8/0x340 >tcp_sendmsg_locked+0x8e6/0x1d30 >tcp_sendmsg+0x27/0x40 >inet_sendmsg+0xd0/0x310 >sock_write_iter+0x17a/0x240 >__vfs_write+0x2ab/0x380 >vfs_write+0xfb/0x260 >SyS_write+0xb6/0x140 >do_syscall_64+0x1e5/0xc05 >entry_SYSCALL64_slow_path+0x25/0x25 > > This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework > FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/ > lockdep_clear_current_reclaim_state() in __perform_reclaim() and > lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/ > fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply > propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() > is trying to grab the 'fake' lock again when __perform_reclaim() already > grabbed the 'fake' lock. > > The > > /* this guy won't enter reclaim */ > if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) > return false; > > test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock > was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context > (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't > enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa > ("page allocator: calculate the alloc_flags for allocation only once") > added the PF_MEMALLOC safeguard ( > > /* Avoid recursion of direct reclaim */ > if (p->flags & PF_MEMALLOC) > goto nopage; > > in __alloc_pages_slowpath()). > > Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow > __need_fs_reclaim() to return false. > > Reported-and-tested-by: Dave Jones <da...@codemonkey.org.uk> > Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> > Cc: Peter Zijlstra <pet...@infradead.org> > Cc: Nick Piggin <npig...@gmail.com> > --- > mm/page_alloc.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 81e18ce..19fb76b 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) > return false; > > /* this guy won't enter reclaim */ > - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) > + if (current->flags & PF_MEMALLOC) > return false; > > /* We're only interested __GFP_FS allocations for now */ > -- > 1.8.3.1 >
Re: [PATCH v2] lockdep: Fix fs_reclaim warning.
Nikolay Borisov wrote: > I think I've hit another incarnation of that one. The call stack is: > http://paste.opensuse.org/3f22d013 > > The cleaned up callstack of all the ? entries look like: > > __lock_acquire+0x2d8a/0x4b70 > lock_acquire+0x110/0x330 > kmem_cache_alloc+0x29/0x2c0 > __clear_extent_bit+0x488/0x800 > try_release_extent_mapping+0x288/0x3c0 > __btrfs_releasepage+0x6c/0x140 > shrink_page_list+0x227e/0x3110 > shrink_inactive_list+0x414/0xdb0 > shrink_node_memcg+0x7c8/0x1250 > shrink_node+0x2ae/0xb50 > do_try_to_free_pages+0x2b1/0xe20 > try_to_free_pages+0x205/0x570 > __alloc_pages_nodemask+0xb91/0x2160 > new_slab+0x27a/0x4e0 > ___slab_alloc+0x355/0x610 > __slab_alloc+0x4c/0xa0 > kmem_cache_alloc+0x22d/0x2c0 > mempool_alloc+0xe1/0x280 Yes, for mempool_alloc() is adding __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask. gfp_mask |= __GFP_NOMEMALLOC; /* don't allocate emergency reserves */ gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */ gfp_mask |= __GFP_NOWARN; /* failures are OK */ > bio_alloc_bioset+0x1d7/0x830 > ext4_mpage_readpages+0x99f/0x1000 <- > __do_page_cache_readahead+0x4be/0x840 > filemap_fault+0x8c8/0xfc0 > ext4_filemap_fault+0x7d/0xb0 > __do_fault+0x7a/0x150 > __handle_mm_fault+0x1542/0x29d0 > __do_page_fault+0x557/0xa30 > async_page_fault+0x4c/0x60
[PATCH v2] lockdep: Fix fs_reclaim warning.
>From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Date: Thu, 8 Feb 2018 10:35:35 +0900 Subject: [PATCH v2] lockdep: Fix fs_reclaim warning. Dave Jones reported fs_reclaim lockdep warnings. WARNING: possible recursive locking detected 4.15.0-rc9-backup-debug+ #1 Not tainted sshd/24800 is trying to acquire lock: (fs_reclaim){+.+.}, at: [<84f438c2>] fs_reclaim_acquire.part.102+0x5/0x30 but task is already holding lock: (fs_reclaim){+.+.}, at: [<84f438c2>] fs_reclaim_acquire.part.102+0x5/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(fs_reclaim); lock(fs_reclaim); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by sshd/24800: #0: (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] tcp_sendmsg+0x19/0x40 #1: (fs_reclaim){+.+.}, at: [<84f438c2>] fs_reclaim_acquire.part.102+0x5/0x30 stack backtrace: CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1 Call Trace: dump_stack+0xbc/0x13f __lock_acquire+0xa09/0x2040 lock_acquire+0x12e/0x350 fs_reclaim_acquire.part.102+0x29/0x30 kmem_cache_alloc+0x3d/0x2c0 alloc_extent_state+0xa7/0x410 __clear_extent_bit+0x3ea/0x570 try_release_extent_mapping+0x21a/0x260 __btrfs_releasepage+0xb0/0x1c0 btrfs_releasepage+0x161/0x170 try_to_release_page+0x162/0x1c0 shrink_page_list+0x1d5a/0x2fb0 shrink_inactive_list+0x451/0x940 shrink_node_memcg.constprop.88+0x4c9/0x5e0 shrink_node+0x12d/0x260 try_to_free_pages+0x418/0xaf0 __alloc_pages_slowpath+0x976/0x1790 __alloc_pages_nodemask+0x52c/0x5c0 new_slab+0x374/0x3f0 ___slab_alloc.constprop.81+0x47e/0x5a0 __slab_alloc.constprop.80+0x32/0x60 __kmalloc_track_caller+0x267/0x310 __kmalloc_reserve.isra.40+0x29/0x80 __alloc_skb+0xee/0x390 sk_stream_alloc_skb+0xb8/0x340 tcp_sendmsg_locked+0x8e6/0x1d30 tcp_sendmsg+0x27/0x40 inet_sendmsg+0xd0/0x310 sock_write_iter+0x17a/0x240 __vfs_write+0x2ab/0x380 vfs_write+0xfb/0x260 SyS_write+0xb6/0x140 do_syscall_64+0x1e5/0xc05 entry_SYSCALL64_slow_path+0x25/0x25 This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/ lockdep_clear_current_reclaim_state() in __perform_reclaim() and lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/ fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() is trying to grab the 'fake' lock again when __perform_reclaim() already grabbed the 'fake' lock. The /* this guy won't enter reclaim */ if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) return false; test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for allocation only once") added the PF_MEMALLOC safeguard ( /* Avoid recursion of direct reclaim */ if (p->flags & PF_MEMALLOC) goto nopage; in __alloc_pages_slowpath()). Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow __need_fs_reclaim() to return false. Reported-and-tested-by: Dave Jones <da...@codemonkey.org.uk> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Nick Piggin <npig...@gmail.com> --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 81e18ce..19fb76b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) return false; /* this guy won't enter reclaim */ - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) + if (current->flags & PF_MEMALLOC) return false; /* We're only interested __GFP_FS allocations for now */ -- 1.8.3.1
Re: [4.15-rc9] fs_reclaim lockdep trace
Peter Zijlstra wrote: > On Mon, Jan 29, 2018 at 08:47:20PM +0900, Tetsuo Handa wrote: > > Peter Zijlstra wrote: > > > On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote: > > > > This warning seems to be caused by commit d92a8cfcb37ecd13 > > > > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the > > > > location of > > > > > > > > /* this guy won't enter reclaim */ > > > > if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) > > > > return false; > > > > > > > > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim > > > > context > > > > (__GFP_NOFS)"). > > > > > > I'm not entirly sure I get what you mean here. How did I move it? It was > > > part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not > > > mark the lock as held. > > > > d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with > > fs_reclaim_acquire(), and removed current->lockdep_recursion handling. > > > > -- > > # git show d92a8cfcb37ecd13 | grep recursion > > -# define INIT_LOCKDEP .lockdep_recursion = 0, > > .lockdep_reclaim_gfp = 0, > > +# define INIT_LOCKDEP .lockdep_recursion = 0, > > unsigned intlockdep_recursion; > > - if (unlikely(current->lockdep_recursion)) > > - current->lockdep_recursion = 1; > > - current->lockdep_recursion = 0; > > -* context checking code. This tests GFP_FS recursion (a lock taken > > -- > > That should not matter at all. The only case that would matter for is if > lockdep itself would ever call into lockdep again. Not something that > happens here. > > > > The new code has it in fs_reclaim_acquire/release to the same effect, if > > > __GFP_NOMEMALLOC, we'll not acquire/release the lock. > > > > Excuse me, but I can't catch. > > We currently acquire/release __fs_reclaim_map if __GFP_NOMEMALLOC. > > Right, got the case inverted, same difference though. Before we'd do > mark_held_lock(), now we do acquire/release under the same conditions. > > > > > Since __kmalloc_reserve() from __alloc_skb() adds > > > > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is > > > > failing to return false despite PF_MEMALLOC context (and resulted in > > > > lockdep warning). > > > > > > But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC. > > > That's what the name says. > > > > __GFP_NOMEMALLOC negates PF_MEMALLOC regarding what watermark that > > allocation > > request should use. > > Right. > > > But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM. > > Ah indeed. > > > Then, how can fs_reclaim contribute to deadlock? > > Not sure it can. But if we're going to allow this, it needs to come with > a clear description on why. Not a few clues to a puzzle. > Let's decode Dave's report. -- stack backtrace: CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1 Call Trace: dump_stack+0xbc/0x13f __lock_acquire+0xa09/0x2040 lock_acquire+0x12e/0x350 fs_reclaim_acquire.part.102+0x29/0x30 kmem_cache_alloc+0x3d/0x2c0 alloc_extent_state+0xa7/0x410 __clear_extent_bit+0x3ea/0x570 try_release_extent_mapping+0x21a/0x260 __btrfs_releasepage+0xb0/0x1c0 btrfs_releasepage+0x161/0x170 try_to_release_page+0x162/0x1c0 shrink_page_list+0x1d5a/0x2fb0 shrink_inactive_list+0x451/0x940 shrink_node_memcg.constprop.88+0x4c9/0x5e0 shrink_node+0x12d/0x260 try_to_free_pages+0x418/0xaf0 __alloc_pages_slowpath+0x976/0x1790 __alloc_pages_nodemask+0x52c/0x5c0 new_slab+0x374/0x3f0 ___slab_alloc.constprop.81+0x47e/0x5a0 __slab_alloc.constprop.80+0x32/0x60 __kmalloc_track_caller+0x267/0x310 __kmalloc_reserve.isra.40+0x29/0x80 __alloc_skb+0xee/0x390 sk_stream_alloc_skb+0xb8/0x340 -- struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp, bool force_schedule) { skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp) = { // gfp == GFP_KERNEL static inline struct sk_buff *alloc_skb_fclone(unsigned int size, gfp_t priority) { // priority == GFP_KERNEL return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE) = { data = kmalloc_reserve(size, gfp_mask, node, ) = { // gfp_mask == GFP_KERNEL obj = kmalloc_node_track_caller(size, flags | __GFP_NOMEMALLOC | __GFP_NOWARN, node) = { // flags == GFP_KERNEL __kmalloc_node_track_caller(size, GF
Re: [4.15-rc9] fs_reclaim lockdep trace
Peter Zijlstra wrote: > On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote: > > This warning seems to be caused by commit d92a8cfcb37ecd13 > > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the > > location of > > > > /* this guy won't enter reclaim */ > > if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) > > return false; > > > > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context > > (__GFP_NOFS)"). > > I'm not entirly sure I get what you mean here. How did I move it? It was > part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not > mark the lock as held. d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with fs_reclaim_acquire(), and removed current->lockdep_recursion handling. -- # git show d92a8cfcb37ecd13 | grep recursion -# define INIT_LOCKDEP .lockdep_recursion = 0, .lockdep_reclaim_gfp = 0, +# define INIT_LOCKDEP .lockdep_recursion = 0, unsigned intlockdep_recursion; - if (unlikely(current->lockdep_recursion)) - current->lockdep_recursion = 1; - current->lockdep_recursion = 0; -* context checking code. This tests GFP_FS recursion (a lock taken -- > > The new code has it in fs_reclaim_acquire/release to the same effect, if > __GFP_NOMEMALLOC, we'll not acquire/release the lock. Excuse me, but I can't catch. We currently acquire/release __fs_reclaim_map if __GFP_NOMEMALLOC. -- +static bool __need_fs_reclaim(gfp_t gfp_mask) +{ (...snipped...) + /* this guy won't enter reclaim */ + if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) + return false; (...snipped...) +} -- > > > > Since __kmalloc_reserve() from __alloc_skb() adds > > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is > > failing to return false despite PF_MEMALLOC context (and resulted in > > lockdep warning). > > But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC. > That's what the name says. __GFP_NOMEMALLOC negates PF_MEMALLOC regarding what watermark that allocation request should use. -- static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask) { if (unlikely(gfp_mask & __GFP_NOMEMALLOC)) return 0; if (gfp_mask & __GFP_MEMALLOC) return ALLOC_NO_WATERMARKS; if (in_serving_softirq() && (current->flags & PF_MEMALLOC)) return ALLOC_NO_WATERMARKS; if (!in_interrupt()) { if (current->flags & PF_MEMALLOC) return ALLOC_NO_WATERMARKS; else if (oom_reserves_allowed(current)) return ALLOC_OOM; } return 0; } -- But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM. -- /* Attempt with potentially adjusted zonelist and alloc_flags */ page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac); if (page) goto got_pg; /* Caller is not willing to reclaim, we can't balance anything */ if (!can_direct_reclaim) goto nopage; /* Avoid recursion of direct reclaim */ if (current->flags & PF_MEMALLOC) goto nopage; /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac, _some_progress); if (page) goto got_pg; /* Try direct compaction and then allocating */ page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac, compact_priority, _result); if (page) goto got_pg; /* Do not loop if specifically requested */ if (gfp_mask & __GFP_NORETRY) goto nopage; -- Then, how can fs_reclaim contribute to deadlock? > > > Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking > > __GFP_NOMEMALLOC might make sense. But since this safeguard was added by > > commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for > > allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense. > > Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to > > return false. > > This does not in fact explain what's going on, it just points to > 'random' patches. > > Are you talking about this: > > + /* Avoid recursion of direct reclaim */ > + if (p->flags & PF_MEMALLOC) &
Re: kernel panic: Out of memory and no killable processes... (2)
syzbot wrote: > syzbot hit the following crash on net-next commit > 6bb46bc57c8e9ce947cc605e555b7204b44d2b10 (Fri Jan 26 16:00:23 2018 +) > Merge branch 'cxgb4-fix-dump-collection-when-firmware-crashed' > > C reproducer is attached. > syzkaller reproducer is attached. > Raw console output is attached. > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached. > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+8630e35fc7287b392...@syzkaller.appspotmail.com > It will help syzbot understand when the bug is fixed. See footer for > details. > If you forward the report, please keep this part and the footer. > > [ 3685] 0 3685178211 1843200 0 sshd > [ 3692] 0 3692 43760327680 0 > syzkaller025682 > [ 3695] 0 3695 43760368640 0 > syzkaller025682 > Kernel panic - not syncing: Out of memory and no killable processes... > This sounds like too huge vmalloc() request where size is controlled by userspace. -- [ 27.738855] syzkaller025682 invoked oom-killer: gfp_mask=0x14002c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN), nodemask=(null), order=0, oom_score_adj=0 [ 27.754960] syzkaller025682 cpuset=/ mems_allowed=0 [ 27.760386] CPU: 0 PID: 3689 Comm: syzkaller025682 Not tainted 4.15.0-rc9+ #212 [ 27.767820] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 27.777166] Call Trace: [ 27.779739] dump_stack+0x194/0x257 [ 27.793194] dump_header+0x28c/0xe1e [ 27.888997] oom_kill_process+0x8b5/0x14a0 [ 27.981648] out_of_memory+0x86d/0x1220 [ 28.003684] __alloc_pages_slowpath+0x1d1b/0x2d00 [ 28.054140] __alloc_pages_nodemask+0x9fb/0xd80 [ 28.090590] alloc_pages_current+0xb6/0x1e0 [ 28.094927] __vmalloc_node_range+0x409/0x650 [ 28.103837] __vmalloc_node_flags_caller+0x50/0x60 [ 28.113166] kvmalloc_node+0x82/0xd0 [ 28.116869] xt_alloc_table_info+0x64/0xe0 [ 28.121097] do_ip6t_set_ctl+0x29b/0x5f0 [ 28.139158] nf_setsockopt+0x67/0xc0 [ 28.142862] ipv6_setsockopt+0x115/0x150 [ 28.146912] udpv6_setsockopt+0x45/0x80 [ 28.150867] sock_common_setsockopt+0x95/0xd0 [ 28.155359] SyS_setsockopt+0x189/0x360 [ 28.177379] entry_SYSCALL_64_fastpath+0x29/0xa0 -- struct xt_table_info *xt_alloc_table_info(unsigned int size) { (...snipped...) info = kvmalloc(sz, GFP_KERNEL); (...snipped...) } static int do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len) { int ret; if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) return -EPERM; switch (cmd) { case IP6T_SO_SET_REPLACE: ret = do_replace(sock_net(sk), user, len); break; case IP6T_SO_SET_ADD_COUNTERS: ret = do_add_counters(sock_net(sk), user, len, 0); break; default: ret = -EINVAL; } return ret; } vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back off when the current task is killed") but then became unkillable by commit b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is killed""). Therefore, we can't handle this problem from MM side. Please consider adding some limit from networking side.
Re: [4.15-rc9] fs_reclaim lockdep trace
Dave, would you try below patch? >From cae2cbf389ae3cdef1b492622722b4aeb07eb284 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Date: Sun, 28 Jan 2018 14:17:14 +0900 Subject: [PATCH] lockdep: Fix fs_reclaim warning. Dave Jones reported fs_reclaim lockdep warnings. WARNING: possible recursive locking detected 4.15.0-rc9-backup-debug+ #1 Not tainted sshd/24800 is trying to acquire lock: (fs_reclaim){+.+.}, at: [<84f438c2>] fs_reclaim_acquire.part.102+0x5/0x30 but task is already holding lock: (fs_reclaim){+.+.}, at: [<84f438c2>] fs_reclaim_acquire.part.102+0x5/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(fs_reclaim); lock(fs_reclaim); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by sshd/24800: #0: (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] tcp_sendmsg+0x19/0x40 #1: (fs_reclaim){+.+.}, at: [<84f438c2>] fs_reclaim_acquire.part.102+0x5/0x30 stack backtrace: CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1 Call Trace: dump_stack+0xbc/0x13f __lock_acquire+0xa09/0x2040 lock_acquire+0x12e/0x350 fs_reclaim_acquire.part.102+0x29/0x30 kmem_cache_alloc+0x3d/0x2c0 alloc_extent_state+0xa7/0x410 __clear_extent_bit+0x3ea/0x570 try_release_extent_mapping+0x21a/0x260 __btrfs_releasepage+0xb0/0x1c0 btrfs_releasepage+0x161/0x170 try_to_release_page+0x162/0x1c0 shrink_page_list+0x1d5a/0x2fb0 shrink_inactive_list+0x451/0x940 shrink_node_memcg.constprop.88+0x4c9/0x5e0 shrink_node+0x12d/0x260 try_to_free_pages+0x418/0xaf0 __alloc_pages_slowpath+0x976/0x1790 __alloc_pages_nodemask+0x52c/0x5c0 new_slab+0x374/0x3f0 ___slab_alloc.constprop.81+0x47e/0x5a0 __slab_alloc.constprop.80+0x32/0x60 __kmalloc_track_caller+0x267/0x310 __kmalloc_reserve.isra.40+0x29/0x80 __alloc_skb+0xee/0x390 sk_stream_alloc_skb+0xb8/0x340 tcp_sendmsg_locked+0x8e6/0x1d30 tcp_sendmsg+0x27/0x40 inet_sendmsg+0xd0/0x310 sock_write_iter+0x17a/0x240 __vfs_write+0x2ab/0x380 vfs_write+0xfb/0x260 SyS_write+0xb6/0x140 do_syscall_64+0x1e5/0xc05 entry_SYSCALL64_slow_path+0x25/0x25 Since no fs locks are held, doing GFP_KERNEL allocation should be safe as long as there is PF_MEMALLOC safeguard ( /* Avoid recursion of direct reclaim */ if (p->flags & PF_MEMALLOC) goto nopage; ) which prevents infinite recursion. This warning seems to be caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the location of /* this guy won't enter reclaim */ if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) return false; check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context (__GFP_NOFS)"). Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is failing to return false despite PF_MEMALLOC context (and resulted in lockdep warning). Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking __GFP_NOMEMALLOC might make sense. But since this safeguard was added by commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense. Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to return false. Reported-by: Dave Jones <da...@codemonkey.org.uk> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Nick Piggin <npig...@gmail.com> --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 76c9688..7804b0e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3583,7 +3583,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask) return false; /* this guy won't enter reclaim */ - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) + if (current->flags & PF_MEMALLOC) return false; /* We're only interested __GFP_FS allocations for now */ -- 1.8.3.1
Re: [4.15-rc9] fs_reclaim lockdep trace
On 2018/01/28 10:16, Tetsuo Handa wrote: > Linus Torvalds wrote: >> On Sat, Jan 27, 2018 at 2:24 PM, Dave Jones <da...@codemonkey.org.uk> wrote: >>> On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote: >>> > Just triggered this on a server I was rsync'ing to. >>> >>> Actually, I can trigger this really easily, even with an rsync from one >>> disk to another. Though that also smells a little like networking in >>> the traces. Maybe netdev has ideas. >> >> Is this new to 4.15? Or is it just that you're testing something new? >> >> If it's new and easy to repro, can you just bisect it? And if it isn't >> new, can you perhaps check whether it's new to 4.14 (ie 4.13 being >> ok)? >> >> Because that fs_reclaim_acquire/release() debugging isn't new to 4.15, >> but it was rewritten for 4.14.. I'm wondering if that remodeling ended >> up triggering something. > > --- linux-4.13.16/mm/page_alloc.c > +++ linux-4.14.15/mm/page_alloc.c Oops. This output was inverted. > @@ -3527,53 +3519,12 @@ > return true; > } > return false; > } > #endif /* CONFIG_COMPACTION */ > > -#ifdef CONFIG_LOCKDEP > -struct lockdep_map __fs_reclaim_map = > - STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map); > - > -static bool __need_fs_reclaim(gfp_t gfp_mask) > -{ > - gfp_mask = current_gfp_context(gfp_mask); > - > - /* no reclaim without waiting on it */ > - if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) > - return false; > - > - /* this guy won't enter reclaim */ > - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) > - return false; Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is failing to return false here. But why checking __GFP_NOMEMALLOC here? __alloc_pages_slowpath() skips direct reclaim if !(gfp_mask & __GFP_DIRECT_RECLAIM) or (current->flags & PF_MEMALLOC), doesn't it? -- static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context *ac) { (...snipped...) /* Caller is not willing to reclaim, we can't balance anything */ if (!can_direct_reclaim) goto nopage; /* Avoid recursion of direct reclaim */ if (current->flags & PF_MEMALLOC) goto nopage; /* Try direct reclaim and then allocating */ page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac, _some_progress); if (page) goto got_pg; (...snipped...) } --
Re: [4.15-rc9] fs_reclaim lockdep trace
Linus Torvalds wrote: > On Sat, Jan 27, 2018 at 2:24 PM, Dave Joneswrote: >> On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote: >> > Just triggered this on a server I was rsync'ing to. >> >> Actually, I can trigger this really easily, even with an rsync from one >> disk to another. Though that also smells a little like networking in >> the traces. Maybe netdev has ideas. > > Is this new to 4.15? Or is it just that you're testing something new? > > If it's new and easy to repro, can you just bisect it? And if it isn't > new, can you perhaps check whether it's new to 4.14 (ie 4.13 being > ok)? > > Because that fs_reclaim_acquire/release() debugging isn't new to 4.15, > but it was rewritten for 4.14.. I'm wondering if that remodeling ended > up triggering something. --- linux-4.13.16/mm/page_alloc.c +++ linux-4.14.15/mm/page_alloc.c @@ -3527,53 +3519,12 @@ return true; } return false; } #endif /* CONFIG_COMPACTION */ -#ifdef CONFIG_LOCKDEP -struct lockdep_map __fs_reclaim_map = - STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map); - -static bool __need_fs_reclaim(gfp_t gfp_mask) -{ - gfp_mask = current_gfp_context(gfp_mask); - - /* no reclaim without waiting on it */ - if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) - return false; - - /* this guy won't enter reclaim */ - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) - return false; - - /* We're only interested __GFP_FS allocations for now */ - if (!(gfp_mask & __GFP_FS)) - return false; - - if (gfp_mask & __GFP_NOLOCKDEP) - return false; - - return true; -} - -void fs_reclaim_acquire(gfp_t gfp_mask) -{ - if (__need_fs_reclaim(gfp_mask)) - lock_map_acquire(&__fs_reclaim_map); -} -EXPORT_SYMBOL_GPL(fs_reclaim_acquire); - -void fs_reclaim_release(gfp_t gfp_mask) -{ - if (__need_fs_reclaim(gfp_mask)) - lock_map_release(&__fs_reclaim_map); -} -EXPORT_SYMBOL_GPL(fs_reclaim_release); -#endif - /* Perform direct synchronous page reclaim */ static int __perform_reclaim(gfp_t gfp_mask, unsigned int order, const struct alloc_context *ac) { struct reclaim_state reclaim_state; @@ -3582,21 +3533,21 @@ cond_resched(); /* We now go into synchronous reclaim */ cpuset_memory_pressure_bump(); noreclaim_flag = memalloc_noreclaim_save(); - fs_reclaim_acquire(gfp_mask); + lockdep_set_current_reclaim_state(gfp_mask); reclaim_state.reclaimed_slab = 0; current->reclaim_state = _state; progress = try_to_free_pages(ac->zonelist, order, gfp_mask, ac->nodemask); current->reclaim_state = NULL; - fs_reclaim_release(gfp_mask); + lockdep_clear_current_reclaim_state(); memalloc_noreclaim_restore(noreclaim_flag); cond_resched(); return progress; } > > Adding PeterZ to the participants list in case he has ideas. I'm not > seeing what would be the problem in that call chain from hell. > >Linus Dave Jones wrote: > > WARNING: possible recursive locking detected > 4.15.0-rc9-backup-debug+ #1 Not tainted > > sshd/24800 is trying to acquire lock: > (fs_reclaim){+.+.}, at: [<84f438c2>] > fs_reclaim_acquire.part.102+0x5/0x30 > > but task is already holding lock: > (fs_reclaim){+.+.}, at: [<84f438c2>] > fs_reclaim_acquire.part.102+0x5/0x30 > > other info that might help us debug this: > Possible unsafe locking scenario: > >CPU0 > > lock(fs_reclaim); > lock(fs_reclaim); > > *** DEADLOCK *** > > May be due to missing lock nesting notation > > 2 locks held by sshd/24800: > #0: (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] tcp_sendmsg+0x19/0x40 > #1: (fs_reclaim){+.+.}, at: [<84f438c2>] > fs_reclaim_acquire.part.102+0x5/0x30 > > stack backtrace: > CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1 > Call Trace: > dump_stack+0xbc/0x13f > __lock_acquire+0xa09/0x2040 > lock_acquire+0x12e/0x350 > fs_reclaim_acquire.part.102+0x29/0x30 > kmem_cache_alloc+0x3d/0x2c0 > alloc_extent_state+0xa7/0x410 > __clear_extent_bit+0x3ea/0x570 > try_release_extent_mapping+0x21a/0x260 > __btrfs_releasepage+0xb0/0x1c0 > btrfs_releasepage+0x161/0x170 > try_to_release_page+0x162/0x1c0 > shrink_page_list+0x1d5a/0x2fb0 > shrink_inactive_list+0x451/0x940 > shrink_node_memcg.constprop.88+0x4c9/0x5e0 > shrink_node+0x12d/0x260 > try_to_free_pages+0x418/0xaf0 > __alloc_pages_slowpath+0x976/0x1790 > __alloc_pages_nodemask+0x52c/0x5c0 > new_slab+0x374/0x3f0 > ___slab_alloc.constprop.81+0x47e/0x5a0 >
[linux-next] NULL pointer dereference in rt6_fill_node
Hello. I'm hitting this with linux-next-20170125. Is this known? [ OK ] Started Show Plymouth Reboot Screen. [ OK ] Stopped PostgreSQL database server. [ OK ] Stopped Dynamic System Tuning Daemon. [ OK ] Stopped target Network. Stopping LSB: Bring up/down networking... [ 828.169782] BUG: unable to handle kernel NULL pointer dereference at 02a4 [ 828.172557] IP: rt6_fill_node.isra.60+0x181/0x5c0 [ 828.174211] PGD 0 [ 828.174212] [ 828.175467] Oops: [#1] SMP DEBUG_PAGEALLOC [ 828.177068] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp vmw_balloon pcspkr vmw_vmci i2c_piix4 shpchp ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect sysimgblt mptspi fb_sys_fops scsi_transport_spi ttm ata_piix mptscsih drm i2c_core e1000 mptbase libata [ 828.200197] CPU: 3 PID: 539 Comm: NetworkManager Not tainted 4.10.0-rc5-next-20170125+ #45 [ 828.203094] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 828.206835] task: 88006f984100 task.stack: c90005afc000 [ 828.208934] RIP: 0010:rt6_fill_node.isra.60+0x181/0x5c0 [ 828.210701] RSP: 0018:c90005aff938 EFLAGS: 00010246 [ 828.212485] RAX: RBX: 88006c71ae00 RCX: [ 828.214933] RDX: RSI: c90005aff95c RDI: 880071773ee0 [ 828.217394] RBP: c90005aff9e0 R08: R09: 0004 [ 828.219945] R10: 880071773ee4 R11: R12: 880071983f00 [ 828.222381] R13: R14: R15: 880071773ec0 [ 828.224920] FS: 7ffaa9733940() GS:88007580() knlGS: [ 828.227664] CS: 0010 DS: ES: CR0: 80050033 [ 828.229670] CR2: 02a4 CR3: 71269000 CR4: 001406e0 [ 828.232149] Call Trace: [ 828.232998] ? slab_kernel_map+0xa0/0x130 [ 828.234468] rt6_dump_route+0x60/0x70 [ 828.235717] fib6_dump_node+0x2e/0x60 [ 828.237005] fib6_walk_continue+0x124/0x140 [ 828.238425] fib6_walk+0x64/0x90 [ 828.239567] ? _raw_read_lock_bh+0x6e/0x80 [ 828.240962] inet6_dump_fib+0x237/0x370 [ 828.242309] ? inet6_dump_fib+0x4b/0x370 [ 828.243646] netlink_dump+0x126/0x280 [ 828.244963] __netlink_dump_start+0x16f/0x210 [ 828.246464] ? fib6_flush_trees+0x40/0x40 [ 828.247848] rtnetlink_rcv_msg+0x13b/0x230 [ 828.249275] ? fib6_flush_trees+0x40/0x40 [ 828.250641] ? rtnl_newlink+0x880/0x880 [ 828.252021] netlink_rcv_skb+0x99/0xb0 [ 828.253399] rtnetlink_rcv+0x25/0x30 [ 828.254646] netlink_unicast+0x16d/0x200 [ 828.255982] netlink_sendmsg+0x36b/0x390 [ 828.257348] ___sys_sendmsg+0x1c9/0x2c0 [ 828.259271] ? ___cache_free+0x14a/0x5b0 [ 828.261112] ? __fget+0x117/0x210 [ 828.262777] ? expand_files+0x2e0/0x2e0 [ 828.264657] __sys_sendmsg+0x4f/0x90 [ 828.266399] SyS_sendmsg+0xd/0x20 [ 828.268042] do_syscall_64+0x67/0x1f0 [ 828.269810] entry_SYSCALL64_slow_path+0x25/0x25 [ 828.271885] RIP: 0033:0x7ffaa7f639fd [ 828.273594] RSP: 002b:7ffc0e0c13b0 EFLAGS: 0293 ORIG_RAX: 002e [ 828.276704] RAX: ffda RBX: 55f219737bd0 RCX: 7ffaa7f639fd [ 828.279712] RDX: RSI: 7ffc0e0c1460 RDI: 000c [ 828.282623] RBP: 55f219737ae0 R08: 0001 R09: 1000 [ 828.286019] R10: 0063 R11: 0293 R12: 55f21977e4f0 [ 828.288921] R13: 7ffc0e0c1460 R14: R15: 55f21977e4f0 [ 828.291766] Code: c7 47 18 00 00 00 00 48 8b 43 18 48 8b 50 48 83 e2 01 74 08 48 8b 40 48 a8 04 74 21 41 c7 47 18 10 00 00 00 48 8b 83 58 01 00 00 <8b> 88 a4 02 00 00 85 c9 74 08 41 c7 47 18 11 00 00 00 41 c6 47 [ 828.299024] RIP: rt6_fill_node.isra.60+0x181/0x5c0 RSP: c90005aff938 [ 828.301732] CR2: 02a4 2861: 48 8b 50 48 mov0x48(%rax),%rdx 2865: 83 e2 01and$0x1,%edx 2868: 74 08 je 2872286a: 48 8b 40 48 mov0x48(%rax),%rax 286e: a8 04 test $0x4,%al 2870: 74 21 je 2893 rtm->rtm_flags |= RTNH_F_LINKDOWN; 2872: 41 c7 47 18 10 00 00movl $0x10,0x18(%r15) 2879: 00 if (rt->rt6i_idev->cnf.ignore_routes_with_linkdown)
[PATCH] netfilter: x_tables: fix kmemcheck warning.
kmemcheck complains that some of struct nf_hook_ops members allocated at xt_hook_ops_alloc() are not initialized before nf_register_net_hook() is called. Add __GFP_ZERO to initialize explicitly. [ 367.411936] nf_conntrack version 0.5.0 (6144 buckets, 24576 max) [ 367.458540] ip_tables: (C) 2000-2006 Netfilter Core Team [ 367.463977] WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (88003af7f300) [ 367.465633] 303f5381 [ 367.468185] u u u u u u u u u u u u u u u u i i i i i i i i u u u u u u u u [ 367.470687] ^ [ 367.471079] RIP: 0010:[] [] nf_register_net_hook+0x2f/0x160 [ 367.472846] RSP: 0018:88003f5abcb0 EFLAGS: 00010286 [ 367.473821] RAX: 88003adb73c0 RBX: 88003af7f300 RCX: [ 367.475122] RDX: RSI: 0001 RDI: 88003adb7400 [ 367.476485] RBP: 88003f5abcc8 R08: 0067 R09: [ 367.477764] R10: 88003adb83c0 R11: R12: 81876760 [ 367.479053] R13: 88003adb73c0 R14: 0003 R15: 88003af7f300 [ 367.480351] FS: () GS:8182c000() knlGS: [ 367.482018] CS: 0010 DS: ES: CR0: 80050033 [ 367.483069] CR2: 88003f404280 CR3: 3ad3 CR4: 001406f0 [ 367.484357] [] nf_register_net_hooks+0x3c/0xa0 [ 367.485617] [] ipt_register_table+0xef/0x130 [ 367.486765] [] iptable_filter_table_init.part.1+0x51/0x70 [ 367.488125] [] iptable_filter_net_init+0x2a/0x30 [ 367.489375] [] ops_init+0x3c/0x130 [ 367.490410] [] register_pernet_operations+0x108/0x1c0 [ 367.491746] [] register_pernet_subsys+0x23/0x40 [ 367.492933] [] iptable_filter_init+0x33/0x4b [ 367.494053] [] do_one_initcall+0x4a/0x180 [ 367.495168] [] kernel_init_freeable+0x15b/0x201 [ 367.496332] [] kernel_init+0x9/0x100 [ 367.497369] [] ret_from_fork+0x1f/0x40 [ 367.498449] [] 0x [ 367.499523] Initializing XFRM netlink socket [ 367.500404] NET: Registered protocol family 10 [ 367.501792] ip6_tables: (C) 2000-2006 Netfilter Core Team [ 367.502971] NET: Registered protocol family 17 Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> --- net/netfilter/x_tables.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index e0aa7c1..fc261fe 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1513,7 +1513,7 @@ xt_hook_ops_alloc(const struct xt_table *table, nf_hookfn *fn) if (!num_hooks) return ERR_PTR(-EINVAL); - ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL); + ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL | __GFP_ZERO); if (ops == NULL) return ERR_PTR(-ENOMEM); -- 1.8.3.1
Re: [PATCH] vmxnet3: avoid calling pskb_may_pull with interrupts disabled
Neil Horman wrote: > On Mon, Mar 07, 2016 at 03:16:14PM -0500, David Miller wrote: > > From: Neil Horman> > Date: Fri, 4 Mar 2016 13:40:48 -0500 This patch is calling spin_unlock_irqrestore() without spin_lock_irqsave(). In file included from include/linux/seqlock.h:35:0, from include/linux/time.h:5, from include/linux/stat.h:18, from include/linux/module.h:10, from drivers/net/vmxnet3/vmxnet3_drv.c:27: drivers/net/vmxnet3/vmxnet3_drv.c: In function 'vmxnet3_xmit_frame': include/linux/spinlock.h:246:30: warning: 'flags' may be used uninitialized in this function [-Wmaybe-uninitialized] _raw_spin_unlock_irqrestore(lock, flags); \ ^ drivers/net/vmxnet3/vmxnet3_drv.c:977:16: note: 'flags' was declared here unsigned long flags; ^ vmxnet3_tq_xmit(struct sk_buff *skb, struct vmxnet3_tx_queue *tq, struct vmxnet3_adapter *adapter, struct net_device *netdev) { (...snipped...) goto hdr_too_big; (...snipped...) spin_lock_irqsave(>tx_lock, flags); (...snipped...) spin_unlock_irqrestore(>tx_lock, flags); (...snipped...) return NETDEV_TX_OK; (...snipped...) hdr_too_big: tq->stats.drop_oversized_hdr++; unlock_drop_pkt: spin_unlock_irqrestore(>tx_lock, flags); (...snipped...) return NETDEV_TX_OK; }
Re: 4.4.3: OOPS when running "stress-ng --sock 5"
Holger Schurig wrote: > So I did an "arm-linux-gnueabihf-objdump -Sgd linux/vmlinux", not sure > if that helps: > > c00972ec <__rmqueue>: > * Do the hard work of removing an element from the buddy allocator. > * Call me with the zone->lock already held. > */ > static struct page *__rmqueue(struct zone *zone, unsigned int order, > int migratetype, gfp_t gfp_flags) > { > c00972ec: e1a0c00dmov ip, sp > c00972f0: e92ddff0push{r4, r5, r6, r7, r8, r9, sl, fp, ip, > lr, pc} > c00972f4: e24cb004sub fp, ip, #4 > c00972f8: e24dd024sub sp, sp, #36 ; 0x24 > unsigned int current_order; > struct free_area *area; > struct page *page; > > /* Find a page of the appropriate size in the preferred list */ > for (current_order = order; current_order < MAX_ORDER; > ++current_order) { > c00972fc: e351000acmp r1, #10 > * Do the hard work of removing an element from the buddy allocator. > * Call me with the zone->lock already held. > */ > I tried on x86_64 but I could not reproduce it. Thus, we need to examine this problem using your environment. I didn't notice that c00972ec is __rmqueue+0x0. Actual line number to examine is c0097360 ("pc" register) which is __rmqueue+0x74. Please show us line number and assembly code around c0097360.
Re: [PATCH] unix: properly account for FDs passed over unix sockets
Willy Tarreau wrote: > On Wed, Dec 30, 2015 at 09:58:42AM +0100, Hannes Frederic Sowa wrote: > > The MSG_PEEK code should not be harmful and the patch is good as is. I > > first understood from the published private thread, that it is possible > > for a program to exceed the rlimit of fds. But the DoS is only by > > keeping the fds in flight and not attaching them to any program. > > Exactly. The real issue is when these FDs become very expensive such as > pipes full of data. > As you wrote how to abuse this vulnerability which exists in Linux 2.0 and later kernel, I quote a short description from private thread. "an unprivileged user consumes all file descriptors so that other unprivileged user cannot work" and "an unprivileged user consumes all kernel memory so that the OOM killer kills almost all processes before the culprit process is killed (CVE-2013-4312)". Reported-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Mitigates: CVE-2013-4312 (Linux 2.0+) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ath6kl: Use vmalloc for loading firmware using api1 method
Andy Shevchenko wrote: > On Sat, Nov 28, 2015 at 8:58 PM, Brent Taylorwrote: > > Whats the status on this patch? I don't see it on patchwork anymore > > nor is it in any of the git trees I checked. > > > > You forget to use kvfree() instead of kfree() in core.c. > In addition to that, I think you can do like below. if (>fw == fw) *fw = vmalloc(fw_entry->size); else *fw = kmalloc(fw_entry->size, GFP_KERNEL); if (*fw == NULL) ret = -ENOMEM; else memcpy(*fw, fw_entry->data, fw_entry->size); -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()
There are many locations that do if (memory_was_allocated_by_vmalloc) vfree(ptr); else kfree(ptr); but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory using is_vmalloc_addr(). Unless callers have special reasons, we can replace this branch with kvfree(). Please check and reply if you found problems. Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> Acked-by: Michal Hocko <mho...@suse.com> Cc: Russell King <li...@arm.linux.org.uk> # arm Cc: <linux-a...@vger.kernel.org> # apei Cc: <drbd-u...@lists.linbit.com> # drbd Cc: <linux-ker...@vger.kernel.org> # mspec Cc: <dri-de...@lists.freedesktop.org> # drm Cc: Oleg Drokin <oleg.dro...@intel.com> # lustre Cc: Andreas Dilger <andreas.dil...@intel.com> # lustre Cc: <codal...@coda.cs.cmu.edu> # coda Cc: <linux-...@lists.infradead.org> # jffs2 Cc: Jan Kara <j...@suse.com> # udf Cc: <linux-fsde...@vger.kernel.org> # xattr Cc: <linux...@kvack.org> # ipc + mm Cc: <netdev@vger.kernel.org> # ipv4 --- arch/arm/mm/dma-mapping.c | 11 ++-- drivers/acpi/apei/erst.c | 6 ++-- drivers/block/drbd/drbd_bitmap.c | 26 + drivers/block/drbd/drbd_int.h | 3 -- drivers/char/mspec.c | 15 ++ drivers/gpu/drm/drm_hashtab.c | 5 +--- .../lustre/include/linux/libcfs/libcfs_private.h | 8 ++ fs/coda/coda_linux.h | 3 +- fs/jffs2/build.c | 8 ++ fs/jffs2/fs.c | 5 +--- fs/jffs2/super.c | 5 +--- fs/udf/super.c | 7 + fs/xattr.c | 33 ++ ipc/sem.c | 2 +- ipc/util.c | 11 ++-- ipc/util.h | 2 +- mm/percpu.c| 18 +--- mm/vmalloc.c | 5 +--- net/ipv4/fib_trie.c| 4 +-- 19 files changed, 46 insertions(+), 131 deletions(-) diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index e62400e..492bf3e 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -1200,10 +1200,7 @@ error: while (i--) if (pages[i]) __free_pages(pages[i], 0); - if (array_size <= PAGE_SIZE) - kfree(pages); - else - vfree(pages); + kvfree(pages); return NULL; } @@ -1211,7 +1208,6 @@ static int __iommu_free_buffer(struct device *dev, struct page **pages, size_t size, struct dma_attrs *attrs) { int count = size >> PAGE_SHIFT; - int array_size = count * sizeof(struct page *); int i; if (dma_get_attr(DMA_ATTR_FORCE_CONTIGUOUS, attrs)) { @@ -1222,10 +1218,7 @@ static int __iommu_free_buffer(struct device *dev, struct page **pages, __free_pages(pages[i], 0); } - if (array_size <= PAGE_SIZE) - kfree(pages); - else - vfree(pages); + kvfree(pages); return 0; } diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c index 6682c5d..6e6bc10 100644 --- a/drivers/acpi/apei/erst.c +++ b/drivers/acpi/apei/erst.c @@ -32,6 +32,7 @@ #include #include #include +#include /* kvfree() */ #include #include "apei-internal.h" @@ -532,10 +533,7 @@ retry: return -ENOMEM; memcpy(new_entries, entries, erst_record_id_cache.len * sizeof(entries[0])); - if (erst_record_id_cache.size < PAGE_SIZE) - kfree(entries); - else - vfree(entries); + kvfree(entries); erst_record_id_cache.entries = entries = new_entries; erst_record_id_cache.size = new_size; } diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c index 9462d27..2daaafb 100644 --- a/drivers/block/drbd/drbd_bitmap.c +++ b/drivers/block/drbd/drbd_bitmap.c @@ -364,12 +364,9 @@ static void bm_free_pages(struct page **pages, unsigned long number) } } -static void bm_vk_free(void *ptr, int v) +static inline void bm_vk_free(void *ptr) { - if (v) - vfree(ptr); - else - kfree(ptr); + kvfree(ptr); } /* @@ -379,7 +376,7 @@ static struct page **bm_realloc_pages(struct drbd_bitmap *b, unsigned long want) { struct page **old_pages = b->bm_pages; struct page **new_pages, *page; - u
[PATCH] tcp: use kmalloc() than kmalloc_array().
Commit 095dc8e0c3686d58 ("tcp: fix/cleanup inet_ehash_locks_alloc()") silently changed from kmalloc() to kmalloc_array(). The latter has overflow check whereas the former doesn't have. If nblocks * locksz might overflow, we need to do like - if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz) + if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz) hashinfo->ehash_locks = vmalloc(nblocks * locksz); because kmalloc_array() detects overflow and returns NULL. But if nblocks * locksz is guaranteed not to overflow, there is no need to use kmalloc_array(). Since I assume it won't overflow, use kmalloc() than kmalloc_array(). Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp> --- net/ipv4/inet_hashtables.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index ccc5980..8f4ab27 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -648,8 +648,8 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo) /* no more locks than number of hash buckets */ nblocks = min(nblocks, hashinfo->ehash_mask + 1); - hashinfo->ehash_locks = kmalloc_array(nblocks, locksz, - GFP_KERNEL | __GFP_NOWARN); + hashinfo->ehash_locks = kmalloc(nblocks * locksz, + GFP_KERNEL | __GFP_NOWARN); if (!hashinfo->ehash_locks) hashinfo->ehash_locks = vmalloc(nblocks * locksz); -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tcp: use kmalloc() than kmalloc_array().
Tetsuo Handa wrote: > Commit 095dc8e0c3686d58 ("tcp: fix/cleanup inet_ehash_locks_alloc()") > silently changed from kmalloc() to kmalloc_array(). The latter has > overflow check whereas the former doesn't have. > > If nblocks * locksz might overflow, we need to do like > > - if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz) > + if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz) Oops, I meant - if (!hashinfo->ehash_locks) + if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz) here. >hashinfo->ehash_locks = vmalloc(nblocks * locksz); > > because kmalloc_array() detects overflow and returns NULL. > But if nblocks * locksz is guaranteed not to overflow, there is > no need to use kmalloc_array(). > > Since I assume it won't overflow, use kmalloc() than kmalloc_array(). I don't know about possible value range. Please confirm that it can't overflow. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tcp: use kmalloc() than kmalloc_array().
David Miller wrote: > From: Eric Dumazet> Date: Sat, 07 Nov 2015 10:50:07 -0800 > > > I do not feel we should go back to kmalloc() just because > > vmalloc_array() does not exist yet. > > Agreed. > Please change as you like. I was thinking to introduce a helper that does vmalloc() when kmalloc() failed because locations that do ptr = kmalloc(size, GFP_NOFS); if (!ptr) ptr = vmalloc(size); /* Wrong because GFP_KERNEL is used implicitly */ are found. I just noticed that inet_ehash_locks_alloc() is doing ptr = kmalloc_array(count, size, GFP_KERNEL); if (!ptr) ptr = vmalloc(count * size); /* Wrong because overflow is not checked */ and wanted to know your intent. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch] cxgb4: memory corruption in debugfs
Dan Carpenter wrote: You can't use kstrtoul() with an int or it causes memory corruption. Also j should be unsigned or we have underflow bugs. I considered changing j to unsigned long but everything fits in a u32. Excuse me, but kstrtouint()'s last argument is not u32 * but unsigned int *. Aren't there architectures where sizeof(unsigned int) sizeof(u32) ? Fixes: 8e3d04fd7d70 ('cxgb4: Add MPS tracing support') Signed-off-by: Dan Carpenter dan.carpen...@oracle.com diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c index 1732e29..0a87a32 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c @@ -1289,13 +1289,14 @@ static unsigned int xdigit2int(unsigned char c) static ssize_t mps_trc_write(struct file *file, const char __user *buf, size_t count, loff_t *pos) { - int i, j, enable, ret; + int i, enable, ret; u32 *data, *mask; struct trace_params tp; const struct inode *ino; unsigned int trcidx; char *s, *p, *word, *end; struct adapter *adap; + u32 j; ino = file_inode(file); trcidx = (uintptr_t)ino-i_private 3; @@ -1340,7 +1341,7 @@ static ssize_t mps_trc_write(struct file *file, const char __user *buf, if (!strncmp(word, qid=, 4)) { end = (char *)word + 4; - ret = kstrtoul(end, 10, (unsigned long *)j); + ret = kstrtouint(end, 10, j); if (ret) goto out; if (!adap-trace_rss) { -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHSET] printk, netconsole: implement reliable netconsole
Tejun Heo wrote: If we can assume that scheduler is working, adding a kernel thread that does while (1) { read messages with metadata from /dev/kmsg send them using UDP network } might be easier than modifying netconsole module. But, I mean, if we are gonna do that in kernel, we better do it properly where it belongs. What's up with easier than modifying netconsole module? Why is netconsole special? And how would the above be any less complex than a single timer function? What am I missing? User space daemon is sometimes disturbed unexpectedly due to (a) SIGKILL by OOM-killer (b) spurious ptrace() by somebody (c) spurious signals such as SIGSTOP / SIGINT (d) stalls triggered by page faults under OOM condition (e) other problems such as scheduler being not working We have built-in protection for (a) named /proc/$pid/oom_score_adj , but we need to configure access control modules for protecting (b) and (c), and we don't have protection for (d). Thinking from OOM stall discussion, (d) is fatal when trying to obtain kernel messages under problematic condition. I thought that a kernel thread that does while (1) { read messages with metadata from /dev/kmsg send them using UDP network } is automatically protected from (a), (b), (c) and (d), and it could be implemented outside of netconsole module. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHSET] printk, netconsole: implement reliable netconsole
Tejun Heo wrote: On Sat, Apr 18, 2015 at 03:03:46AM +0900, Tetsuo Handa wrote: packet will be sufficient for finding out whether the packets were lost and/or reordered in flight. printk(Hello); = netconsole sends Hello using UDP printk(netconsole); = netconsole sends 0001 netconsole using UDP printk(world\n); = netconsole sends 0002 world\n using UDP It might be nice to allow administrator to prefix a sequence number to netconsole messages for those who are using special receiver program (e.g. ncrx) which checks that sequence number. That said, this is pretty much what the first 12 patches do (except for the last printk patch, which can be taken out). We already have sequencing and established format to expose them to userland - try cat /dev/kmsg, which btw is what local loggers on modern systems use anyway. Why introduce netconsole's own version of metadata? I didn't mean to introduce netconsole's own version of metadata. I meant we don't need to implement in-kernel retry logic. If we can assume that scheduler is working, adding a kernel thread that does while (1) { read messages with metadata from /dev/kmsg send them using UDP network } might be easier than modifying netconsole module. Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHSET] printk, netconsole: implement reliable netconsole
Tejun Heo wrote: Hello, David. On Fri, Apr 17, 2015 at 01:17:12PM -0400, David Miller wrote: If userland cannot run properly, it is almost certain that neither will your complex reliability layer logic. * The bulk of patches are to pipe extended log messages to console drivers and let netconsole relay them to the receiver (and quite a bit of refactoring in the process), which, regardless of the reliability logic, is beneficial as we're currently losing structured logging (dictionary) and other metadata over consoles and regardless of where the reliability logic is implemented, it's a lot easier to have messages IDs. * The only thing necessary for reliable transmission are timer and netpoll. There sure are cases where they go down too but there's a pretty big gap between those two going down and userland getting hosed, but where to put the retransmission and reliability logic definitely is debatable. * That said, the reliability part of the patch series are just two patches - 13 and 14, both of which are actually pretty simple. I tend to agree with Tetsuo, that in-kernel netconsole should remain as simple as possible and once it starts to have any smarts and less trivial logic the job belongs in userspace. Upto patch 12, it's just the same mechanism transferring extended messages. It doesn't add any smartness to netconsole per-se except that it can now emit messages with metadata headers. What do you think about them? So, this patchset aims for obtaining kernel messages under problematic condition. You have to hold messages until ack is delivered. This means that printk buffer can become full before burst messages (e.g. SysRq-t) are acked due to packet loss in the network. printk() cannot wait for ack. Trying to wait for ack would break something. How can you transmit subsequent kernel messages which failed to enqueue due to waiting for ack for previous kernel messages? Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHSET] printk, netconsole: implement reliable netconsole
Tejun Heo wrote: printk() cannot wait for ack. Trying to wait for ack would break something. How can you transmit subsequent kernel messages which failed to enqueue due to waiting for ack for previous kernel messages? Well, if log buffer overflows and the messages aren't at the logging target yet, they're lost. It's the same as doing dmesg on localhost, isn't it? This doesn't have much to do with where the reliability logic is implemented and is exactly the same with local logging too. If you tolerate loss of kernel messages, adding sequence number to each UDP packet will be sufficient for finding out whether the packets were lost and/or reordered in flight. printk(Hello); = netconsole sends Hello using UDP printk(netconsole); = netconsole sends 0001 netconsole using UDP printk(world\n); = netconsole sends 0002 world\n using UDP It might be nice to allow administrator to prefix a sequence number to netconsole messages for those who are using special receiver program (e.g. ncrx) which checks that sequence number. Thanks. -- tejun -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHSET] printk, netconsole: implement reliable netconsole
Tejun Heo wrote: * Implement netconsole retransmission support. Matching rx socket on the source port is automatically created for extended targets and the log receiver can request retransmission by sending reponse packets. This is completely decoupled from the main write path and doesn't make netconsole less robust when things start go south. If the sender side can wait for retransmission, why can't we use userspace programs (e.g. rsyslogd)? For me, netconsole is mainly for saving kernel messages which cannot be waited for a few deciseconds (e.g. system reset by softdog's timeout) and for saving kernel messages (e.g. SysRq-t) during disk I/O hang up. I have a logger for receiving netconsole messages at http://sourceforge.jp/projects/akari/scm/svn/tree/head/branches/udplogger/ and things I expect for netconsole are shown below. (a) spool the message to send up to 1 line or adimn configured size so that total number of small UDP packets can be reduced (b) don't hesitate to send the spooled message immediately if either kernel panic or system reset is in progress (c) allow different console log level for different console drivers so that I can send kernel messages via netconsole without making local console noisy -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[2.6.25-rc1] Locks in udp_recvmsg()?
Hello. I found that udp_recvmsg() in net/ipv4/udp.c for 2.6.25-rc1 calls lock_sock() only when it releases the datagram (i.e. out_free: and csum_copy_err:). Is it correct to call __skb_recv_datagram() without calling lock_sock() when it acquires the datagram (i.e. try_again:)? Regards. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[2.6.24-mm1] TCP/IPv6 connect() oopses at twothirdsMD4Transform()
Hello. Kernel config is at http://I-love.SAKURA.ne.jp/tmp/config-2.6.24-mm1 2.6.24 works fine. Regards. -- BUG: unable to handle kernel paging request at 25476bec IP: [c0211c28] twothirdsMD4Transform+0x78/0x37c *pde = Oops: [#1] SMP DEBUG_PAGEALLOC last sysfs file: /sys/devices/pci:00/:00:10.0/host0/target0:0:1/0:0:1:0/type Modules linked in: nfsd lockd sunrpc exportfs pcnet32 Pid: 2148, comm: a.out Not tainted (2.6.24-mm1 #1) EIP: 0060:[c0211c28] EFLAGS: 00010286 CPU: 0 EIP is at twothirdsMD4Transform+0x78/0x37c EAX: 00084000 EBX: 0800 ECX: 8000 EDX: db45ddec ESI: EDI: 52806380 EBP: db45dddc ESP: db45ddc8 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process a.out (pid: 2148, ti=db45d000 task=deaf9250 task.ti=db45d000) Stack: 8000 def6ef9c 6380 c0759d60 db45de1c db45de28 c0211fd2 0040 c0759d40 0100 52806380 1f2e00ba fffa249f 5a696b37 8dbe1970 cf7579d0 3b0cc350 a54b10a8 def6e9a0 def6ef8c Call Trace: [c0211fd2] ? secure_tcpv6_sequence_number+0x58/0x7a [c0322247] ? tcp_v6_connect+0x46d/0x4e3 [c02ade07] ? lock_sock_nested+0x56/0x5e [c02ecbae] ? inet_stream_connect+0x1c/0x163 [c02ecc24] ? inet_stream_connect+0x92/0x163 [c02ab14b] ? sys_connect+0x72/0x98 [c013b945] ? lock_release_holdtime+0x4e/0x54 [c0115250] ? do_page_fault+0x1c5/0x3fc [c013e858] ? __lock_release+0x4b/0x51 [c0115250] ? do_page_fault+0x1c5/0x3fc [c02ab9f5] ? sys_socketcall+0x6f/0x15e [c0103ae7] ? restore_nocheck+0x12/0x15 [c0103a86] ? syscall_call+0x7/0xb === Code: 31 c1 03 0c ba 8b 7a 0c 01 ce 8b 4d ec c1 c6 0b 31 d9 21 f1 31 d9 03 0c ba 8b 7a 10 01 c8 8b 4d ec c1 c0 13 31 f1 21 c1 33 4d ec 03 0c ba 8b 7a 14 01 cb 89 c1 c1 c3 03 31 f1 21 d9 31 f1 03 0c EIP: [c0211c28] twothirdsMD4Transform+0x78/0x37c SS:ESP 0068:db45ddc8 ---[ end trace 160518059a282c77 ]--- -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2.6.24-mm1] TCP/IPv6 connect() oopses at twothirdsMD4Transform()
Hello. random: revert braindamage that snuck into checkpatch cleanup Signed-off-by: Matt Mackall [EMAIL PROTECTED] Yes. It solved the oops. Thank you. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.25] Add packet filtering based on process's security context.
Hello. Are there any remaining questions/problems about this patch? If none, I want this patch applied to net-2.6.25 tree. Regards. --- This patch modifies security_socket_post_accept() and introduces security_socket_post_recv_datagram() LSM hooks. Currently, security_socket_post_accept() is called *after* fd_install() at sys_accept(). This means that userland process might access accept()ed but not yet properly labeled socket. I believe this security_socket_post_accept() should be called *before* fd_install() for the three reasons. * The userland process will always access proper context labeled by security_socket_post_accept() rather than default context labeled by security_inode_alloc() (called from alloc_inode() from new_inode() from sock_alloc() from sys_accept()). This gives a security module a chance to perform access control based on proper context. * If security_socket_post_accept() failed to assign proper context (e.g. -ENOMEM), the accept()ed socket can't have proper context. Use of void for security_socket_post_accept() deprives a security module of a chance to abandon the accept()ed socket. * A security module can perform connection filtering based on process's security context. If security_socket_post_accept() is called after fd_install(), it is too late to do so because the accept()ed socket is already visible to userland processes. Currently, there is no way to directly map security context from incoming packet to user process. This is because the creator or owner of a socket is not always the receiver of an incoming packet. The userland process who receives the incoming packet is not known until a process calls sys_recvmsg(). So, I want to add a LSM hook to give a security module a chance to control after the recipient of the incoming packet is known. Signed-off-by: Kentaro Takeda [EMAIL PROTECTED] Signed-off-by: Tetsuo Handa [EMAIL PROTECTED] --- include/linux/security.h | 34 +- net/core/datagram.c | 29 - net/socket.c |7 +-- security/dummy.c | 13 ++--- security/security.c | 10 -- 5 files changed, 80 insertions(+), 13 deletions(-) --- net-2.6.25.orig/include/linux/security.h +++ net-2.6.25/include/linux/security.h @@ -781,8 +781,12 @@ struct request_sock; * @socket_post_accept: * This hook allows a security module to copy security * information into the newly created socket's inode. + * This hook also allows a security module to filter connections + * from unwanted peers based on the process accepting this connection. + * The connection will be aborted if this hook returns nonzero. * @sock contains the listening socket structure. * @newsock contains the newly created server socket for connection. + * Return 0 if permission is granted. * @socket_sendmsg: * Check permission before transmitting a message to another socket. * @sock contains the socket structure. @@ -796,6 +800,15 @@ struct request_sock; * @size contains the size of message structure. * @flags contains the operational flags. * Return 0 if permission is granted. + * @socket_post_recv_datagram: + * Check permission after receiving a datagram. + * This hook allows a security module to filter packets + * from unwanted peers based on the process receiving this datagram. + * The packet will be discarded if this hook returns nonzero. + * @sk contains the socket. + * @skb contains the socket buffer. + * @flags contains the operational flags. + * Return 0 if permission is granted. * @socket_getsockname: * Check permission before the local address (name) of the socket object * @sock is retrieved. @@ -1387,12 +1400,13 @@ struct security_operations { struct sockaddr * address, int addrlen); int (*socket_listen) (struct socket * sock, int backlog); int (*socket_accept) (struct socket * sock, struct socket * newsock); - void (*socket_post_accept) (struct socket * sock, - struct socket * newsock); + int (*socket_post_accept) (struct socket *sock, struct socket *newsock); int (*socket_sendmsg) (struct socket * sock, struct msghdr * msg, int size); int (*socket_recvmsg) (struct socket * sock, struct msghdr * msg, int size, int flags); + int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb, + unsigned int flags); int (*socket_getsockname) (struct socket * sock); int (*socket_getpeername) (struct socket * sock); int (*socket_getsockopt) (struct socket * sock, int level, int optname); @@ -2297,10 +2311,12 @@ int security_socket_bind(struct socket * int security_socket_connect(struct
[PATCH net-2.6.25] Add packet filtering based on process's security context.
This patch modifies security_socket_post_accept() and introduces security_socket_post_recv_datagram() LSM hooks. Currently, security_socket_post_accept() is called *after* fd_install() at sys_accept(). This means that userland process might access accept()ed but not yet properly labeled socket. I believe this security_socket_post_accept() should be called *before* fd_install() for the three reasons. * The userland process will always access proper context labeled by security_socket_post_accept() rather than default context labeled by security_inode_alloc() (called from alloc_inode() from new_inode() from sock_alloc() from sys_accept()). This gives a security module a chance to perform access control based on proper context. * If security_socket_post_accept() failed to assign proper context (e.g. -ENOMEM), the accept()ed socket can't have proper context. Use of void for security_socket_post_accept() deprives a security module of a chance to abandon the accept()ed socket. * A security module can perform connection filtering based on process's security context. If security_socket_post_accept() is called after fd_install(), it is too late to do so because the accept()ed socket is already visible to userland processes. Currently, there is no way to directly map security context from incoming packet to user process. This is because the creator or owner of a socket is not always the receiver of an incoming packet. The userland process who receives the incoming packet is not known until a process calls sys_recvmsg(). So, I want to add a LSM hook to give a security module a chance to control after the recipient of the incoming packet is known. Signed-off-by: Kentaro Takeda [EMAIL PROTECTED] Signed-off-by: Tetsuo Handa [EMAIL PROTECTED] --- include/linux/security.h | 34 +- net/core/datagram.c | 29 - net/socket.c |7 +-- security/dummy.c | 13 ++--- security/security.c | 10 -- 5 files changed, 80 insertions(+), 13 deletions(-) --- net-2.6.25.orig/include/linux/security.h +++ net-2.6.25/include/linux/security.h @@ -781,8 +781,12 @@ struct request_sock; * @socket_post_accept: * This hook allows a security module to copy security * information into the newly created socket's inode. + * This hook also allows a security module to filter connections + * from unwanted peers based on the process accepting this connection. + * The connection will be aborted if this hook returns nonzero. * @sock contains the listening socket structure. * @newsock contains the newly created server socket for connection. + * Return 0 if permission is granted. * @socket_sendmsg: * Check permission before transmitting a message to another socket. * @sock contains the socket structure. @@ -796,6 +800,15 @@ struct request_sock; * @size contains the size of message structure. * @flags contains the operational flags. * Return 0 if permission is granted. + * @socket_post_recv_datagram: + * Check permission after receiving a datagram. + * This hook allows a security module to filter packets + * from unwanted peers based on the process receiving this datagram. + * The packet will be discarded if this hook returns nonzero. + * @sk contains the socket. + * @skb contains the socket buffer. + * @flags contains the operational flags. + * Return 0 if permission is granted. * @socket_getsockname: * Check permission before the local address (name) of the socket object * @sock is retrieved. @@ -1387,12 +1400,13 @@ struct security_operations { struct sockaddr * address, int addrlen); int (*socket_listen) (struct socket * sock, int backlog); int (*socket_accept) (struct socket * sock, struct socket * newsock); - void (*socket_post_accept) (struct socket * sock, - struct socket * newsock); + int (*socket_post_accept) (struct socket *sock, struct socket *newsock); int (*socket_sendmsg) (struct socket * sock, struct msghdr * msg, int size); int (*socket_recvmsg) (struct socket * sock, struct msghdr * msg, int size, int flags); + int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb, + unsigned int flags); int (*socket_getsockname) (struct socket * sock); int (*socket_getpeername) (struct socket * sock); int (*socket_getsockopt) (struct socket * sock, int level, int optname); @@ -2297,10 +2311,12 @@ int security_socket_bind(struct socket * int security_socket_connect(struct socket *sock, struct sockaddr *address, int addrlen); int security_socket_listen(struct socket *sock, int backlog); int security_socket_accept(struct
Re: [PATCH net-2.6.25] Add packet filtering based on process\'s security context.
Hello. Casey Schaufler wrote: Do you have a real situation where two user processes with different security contexts share a socket? How do you get into that situation, and is it appropriate to have that situation in your security scheme? Can this occur without using privilege? I hope such situation won't occur, as I have mentioned in the previous posting. | Precautions: This approach has a side effect which unlikely occurs. | | If a socket is shared by multiple processes with differnt policy, | the process who should be able to accept this connection | will not be able to accept this connection | because socket_post_accept() aborts this connection. | But if socket_post_accept() doesn't abort this connection, | the process who must not be able to accept this connection | will repeat accept() forever, which is a worse side effect. | | Similarly, if a socket is shared by multiple processes with differnt policy, | the process who should be able to pick up this datagram | will not be able to pick up this datagram | because socket_post_recv_datagram() discards this datagram. | But if socket_post_recv_datagram() doesn't discard this datagram, | the process who must not be able to pick up this datagram | will repeat recvmsg() forever, which is a worse side effect. | | So, don't give different permissions between processes who shares one socket. | Otherwise, some connections/datagrams cannot be delivered to intended process. But it is possible to write a code like -- app3.c start -- /* gcc -Wall -O2 -o /tmp/app3 app3.c */ #include fcntl.h #include netinet/in.h #include stdio.h #include string.h #include sys/select.h #include sys/socket.h #include sys/types.h #include unistd.h int main(int argc, char *argv[]) { const int fd1 = socket(PF_INET, SOCK_DGRAM, 0), fd2 = socket(PF_INET, SOCK_DGRAM, 0); struct sockaddr_in addr; char buf[16]; memset(addr, 0, sizeof(addr)); addr.sin_family = AF_INET; addr.sin_addr.s_addr = htonl(INADDR_ANY); addr.sin_port = htons(1); fprintf(stderr, %s started.\n, argv[0]); if (bind(fd1, (struct sockaddr *) addr, sizeof(addr))) { fprintf(stderr, Can't bind()\n); return 1; } if (sendto(fd2, hello\n, 6, 0, (struct sockaddr *) addr, sizeof(addr)) != 6 || sendto(fd2, world\n, 6, 0, (struct sockaddr *) addr, sizeof(addr)) != 6) { fprintf(stderr, Can't sendto()\n); return 1; } while (1) { fd_set rfds; FD_ZERO(rfds); FD_SET(fd1, rfds); select(fd1 + 1, rfds, NULL, NULL, NULL); if (FD_ISSET(fd1, rfds)) break; fprintf(stderr, Can't select()\n); return 1; } if (fcntl(fd1, FD_CLOEXEC, 0)) { fprintf(stderr, Can't fcntl()\n); return 1; } snprintf(buf, sizeof(buf), %d, fd1); execlp(/tmp/app4, app4, buf, NULL); fprintf(stderr, Can't execve()\n); return 1; } -- app3.c end -- -- app4.c start -- /* gcc -Wall -O2 -o /tmp/app4 app4.c */ #include netinet/in.h #include stdio.h #include stdlib.h #include sys/socket.h #include sys/types.h #include unistd.h int main(int argc, char *argv[]) { int fd; if (argc != 2) { fprintf(stderr, Bad parameter.\n); return 1; } fprintf(stderr, %s started.\n, argv[0]); fd = atoi(argv[1]); while (1) { struct sockaddr_in addr; socklen_t size = sizeof(addr); char buffer[1024]; int len; len = recvfrom(fd, buffer, sizeof(buffer), 0, (struct sockaddr *) addr, size); if (len == EOF) { fprintf(stderr, Can't recvfrom()\n); return 1; } write(1, buffer, len); } return 0; } -- app4.c end -- and assign different policy to /tmp/app1 and /tmp/app2 . Therefore, I want to check at sys_recvmsg() time. (Usage: Compile app3 and app4 and run /tmp/app3 .) For TCP's case, see http://www.mail-archive.com/[EMAIL PROTECTED]/msg02531.html What I want to do is perform connection/packet filtering with the recipient of the incoming connection/packet known. My security scheme controls based on the recipient of the incoming connection/packet. In this case, not /tmp/app1 or /tmp/app3 , but /tmp/app2 or /tmp/app4. This case occurs without using privilege. Regards. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: AF_UNIX MSG_PEEK bug?
Hello. Brent Casavant wrote: However, the program would occasionally get into a situation where a call to recv(sockfd, buf, len, MSG_PEEK) returns some number of bytes less than the requested length, and persists in this state (i.e. retrying the call continues to return the same amount of data) even when more than sufficient data is known to have been successfully written to the socket. Did you try MSG_WAITALL flag? See man 2 recv. A TCP socket handles data in bytes. You cannot complain if the amount received by recv() is smaller than expected unless you use MSG_WAITALL flag. Thanks. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.
Hello, Samir. Did you receive the following messages? Since these messages were dropped at vger.kernel.org , I'm worrying that you couldn't receive the following messages. Tetsuo Handa wrote: Hello. Samir Bellabes wrote: what differences between you approach and netfilter in this case ? if it's about packet filtering, you already have all you wishes in netfilter project. Except a hook for making decision with the name of process who picks that packet up known. I think that we really don't need it, because we can catch the informations as I explained. Well, I haven't understood yet why we don't need it. How can you know the the name of process who copies that datagram to its userspace memory? A socket may be shared by multiple different executable files, so the name of the executable file is not known until one of processes who share the socket issues accept()/recvmsg() syscall. Are you saying that I should not use the name of the executable file? Regards. Tetsuo Handa wrote: Hello. I made an example. Usage: Compile app1 and app2 and run /tmp/app1 . Run something like curl http://localhost:1/ to connect to /tmp/app1. I want to know that */tmp/app2* accepts TCP connection so that the user can control whether this TCP connection from 127.0.0.1 port N should be accepted by */tmp/app2* or not. How can we do this without socket_post_accept() change? Regards. -- app1.c start -- /* gcc -Wall -O2 -o /tmp/app1 app1.c */ #include fcntl.h #include netinet/in.h #include stdio.h #include string.h #include sys/select.h #include sys/socket.h #include sys/types.h #include unistd.h int main(int argc, char *argv[]) { const int fd = socket(PF_INET, SOCK_STREAM, 0); struct sockaddr_in addr; char buf[16]; memset(addr, 0, sizeof(addr)); addr.sin_family = AF_INET; addr.sin_addr.s_addr = htonl(INADDR_ANY); addr.sin_port = htons(1); fprintf(stderr, %s started.\n, argv[0]); if (bind(fd, (struct sockaddr *) addr, sizeof(addr))) { fprintf(stderr, Can't bind()\n); return 1; } else if (listen(fd, 5)) { fprintf(stderr, Can't listen()\n); return 1; } while (1) { fd_set rfds; FD_ZERO(rfds); FD_SET(fd, rfds); select(fd + 1, rfds, NULL, NULL, NULL); if (FD_ISSET(fd, rfds)) break; fprintf(stderr, Can't select()\n); return 1; } if (fcntl(fd, FD_CLOEXEC, 0)) { fprintf(stderr, Can't fcntl()\n); return 1; } snprintf(buf, sizeof(buf), %d, fd); execlp(/tmp/app2, app2, buf, NULL); fprintf(stderr, Can't execve()\n); return 1; } -- app1.c end -- -- app2.c start -- /* gcc -Wall -O2 -o /tmp/app2 app2.c */ #include netinet/in.h #include stdio.h #include stdlib.h #include sys/socket.h #include sys/types.h #include unistd.h int main(int argc, char *argv[]) { int lfd; if (argc != 2) { fprintf(stderr, Bad parameter.\n); return 1; } fprintf(stderr, %s started.\n, argv[0]); lfd = atoi(argv[1]); while (1) { struct sockaddr_in addr; socklen_t size = sizeof(addr); int fd = accept(lfd, (struct sockaddr *) addr, size); char c; if (fd == EOF) { fprintf(stderr, Can't accept()\n); return 1; } while (read(fd, c, 1) == 1 write(fd, c, 1) == 1); close(fd); } return 0; } -- app2.c end -- -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.25] Add packet filtering based on process's securitycontext.
Hello. Patrick McHardy wrote: No news on that. I'm also a bit sceptical if adding all this complexity and overhead would really be worth it (considering only netfilter) just to use the owner match and UID/GID matching. It wouldn't even be accurate because there is not 1:1 mapping of sockets and processes. Considering only LSM, socket_post_accept()/socket_post_recv_datagram() hooks are not complicated at all. A socket may be mapped to multiple processes, but at the moment of picking up (i.e. accept()/recvmsg()), I think it is accurate 1:1 mapping. I'm more interested in Who picks this connection/datagram up? than Which socket enqueues this connection/datagram? It may be indifferent for netfilter, but it is region of interest for me. I actually like Samir Bellabes' approach, which doesn't suffer from these problems IIRC. Oh, I found him at http://nfws.inl.fr/en/?p=50 . (Sorry, I didn't know.) He is the person who was discussing with me a few days ago. From memory, one approach under discussion was to add netfilter hooks to the transport layer, which could be invoked correctly by each type of protocol when the target process is selected. We can only invoke the hooks after the socket lookup, but we don't know which process is going to call recvmsg() for that socket. Right. Thus, I'm proposing LSM hooks at accept()/recvmsg() time. Regards. -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.
Hello. Thank you for detailed explanation. Samir Bellabes wrote: By filtering, you should mean packets filtring, shouldn't you ? because this hook is able to deny the accept() syscall for a process, so it's a kind of filtring too. Yes, you are right. No, it's performed from the userspace. the goal is to don't touch the network stack at all. OK. One thing I'm worrying. Use of userspace process assumes that it shall not be killed by some reason (SIGKILL or OOM-killer or something). What happens if userspace process is dead? TOMOYO wants to continue packet filtering functionality even if userspace process is dead. So, TOMOYO loads all filtering rules inside kernel and performs filtering inside kernel. TOMOYO's goals are (1) perform simple stateless packet filtering without any assistance outside the kernel (2) provide simple approach for people (including me) who disable iptables or worry the correctness of configurations because of very complicated syntax. (Many users disable both iptables/SELinux inside local network because their very-very-flexible-configurations give them a headache.) (3) allow people describe per-an-application firewall rules in understandable syntax. Dropping the TCP SYN for a TCP connection, for example, then the process don't see the connection arrived and was dropped. That's an advantage. Making select() not to say ready to accept is what TOMOYO wants (although it is very difficult to do so because of (3)). With your tool, you will have to learn the ftp client to let incoming connection, reach this dynamic port, so does this mean, as it's dynamic, to allow TCP port 1024-65535 ? Yes. Users add lines like kernel /usr/sbin/vsftpd allow_network TCP accept xxx.xxx.xxx.xxx-yyy.yyy.yyy.yyy 1024-65535 which means /usr/sbin/vsftpd is able to accept incoming TCP connections if source IP address is between xxx.xxx.xxx.xxx-yyy.yyy.yyy.yyy and source port is between 1024-65535 and kernel ... /usr/bin/ftp allow_network TCP accept zzz.zzz.zzz.zzz 1024-65535 which means /usr/bin/ftp is able to accept incoming TCP connections if source IP address is zzz.zzz.zzz.zzz and source port is between 1024-65535. Users have to add explicitly, but it is very easy and understandable. I'm not speaking anymore about socket filtering inside sock-aps-accept(). I do filtering for sock-aps-accept() with the netfilter tool, by handling packets directly. For TOMOYO, (3) is more important than making select() not to say ready to accept. Packet filtering without knowing the program's name who picks up this connection is not a TOMOYO's goal. So, TOMOYO still wants to do filtering inside sock-aps-accept() or security_socket_post_accept(). Thank you. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.
Hello. Samir Bellabes wrote: at security_socket_accept(), the user only accept the fact that the application is able to go to sock-ops-accept(). That's the purpose of this hook. Yes. This hook can't perform filtering. After, when packet are coming, we can catch them with libnetfilter_queue, and deal with filtering packets. Is this performed inside sock-ops-accept()? here we agree. *but* in my module, the user don't judge before sock-ops-accept(). He judges when packets are coming, throught the libnetfilter_queue API, in userspace, and reinject packet if it's ok. I didn't understand what is happening. Is there a hook which can perform filtering inside sock-ops-accept()? Regards. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.
Hello. Thank you for feedback. I have some questions. (1) Your module uses struct security_operations and is registered with register_security(). TOMOYO also uses struct security_operations and must be registered with register_security(). Can your module and TOMOYO coexist? (2) Can your module handle incoming UDP's datagram too? (3) How do you handle race window between security_socket_accept() and sock-ops-accept()? Samir Bellabes wrote: My approach is to get the informations regarding the socket from socket(), bind() and and accept() syscalls hooks. Pushing this informations to userspace. Here the user can refuse or accept the sycalls. If a socket is shared by multiple processes and the user said Accept this connection at security_socket_accept(), the process does not always pick up what the user wanted to pick up because there is no mutex between security_socket_accept() and sock-ops-accept(). (4) If (3) is my misunderstanding (i.e. the user doesn't judge at security_socket_accept()), it means that the user judges it before select() says ready to accept. Then ... what happens if a socket is shared by multiple programs? The user has to judge without knowing which program will pick up this connection. TOMOYO lets the user judge whether to accept or not with knowing the program's name who picks up this connection. I think getting information at socket(), bind() and accept() is helpful for detailed judgement, but I think it doesn't solve race problem as long as the user judges *before* sock-ops-accept(). Regards. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.
Hello. James Morris wrote: From memory, one approach under discussion was to add netfilter hooks to the transport layer, which could be invoked correctly by each type of protocol when the target process is selected. If this is done for netfilter, then an LSM hook is probably not needed at all, as security modules can utilize netfilter hooks directly. Patrick McHardy says (at http://marc.info/?l=linux-netdevm=118495005800410w=2 ) Even with socket filters netfilter doesn't know the final receipient process, that is not known until it calls recvmsg and the data is read, which is too late for netfilter. Precautions: This approach has a side effect which unlikely occurs. If a socket is shared by multiple processes with different policy, the process who should be able to accept this connection will not be able to accept this connection because socket_post_accept() aborts this connection. But if socket_post_accept() doesn't abort this connection, the process who must not be able to accept this connection will repeat accept() forever, which is a worse side effect. I think this change is needed regardless of whether to use connection filtering or not. Currently, SELinux doesn't use socket_post_accept(). | * @socket_post_accept: | *This hook allows a security module to copy security | *information into the newly created socket's inode. But if one wants to *copy* security information to accept()ed socket, the location after fd_install() is too late to copy because the userland process can access accept()ed socket's fd whose security information is not copied yet. Also, if one wants to *assign* security information to accept()ed socket, it might attend memory allocation which can fail. So, use of void for socket_post_accept() deprives a security module of a chance to abort this connection if the security module failed to *assign* security information. Regards. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [README] away until Dec 3rd
Hello. I have a question. Yesterday, I posted a patch based on 2.6.24-rc3-mm1 that modifies the following files. include/linux/security.h | 34 +- net/core/datagram.c | 26 -- net/socket.c |7 +-- security/dummy.c | 13 ++--- security/security.c | 10 -- But you say that I should make patches based on the net-2.6.25 tree. Which tree (-mm or net-2.6.25) should I use for making this patch? There is no pending objection at LSM-ml so far, and I'm asking for an approval from one of the core developers at netdev-ml. Regards. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25] Add packet filtering based on process's security context.
Hello. Herbert Xu wrote: On Thu, Nov 22, 2007 at 09:57:14PM +0900, Tetsuo Handa wrote: But you say that I should make patches based on the net-2.6.25 tree. Which tree (-mm or net-2.6.25) should I use for making this patch? The net-2.6.25 tree is the one. Please use the tree at git://git.kernel.org/pub/scm/linux/kernel/git/herbert/net-2.6.25.git/ while Dave is away. Thank you. I refreshed the patch based on net-2.6.25 tree. Regards. - Subject: Add packet filtering based on process's security context. This patch allows LSM modules filter incoming connections/datagrams based on the process's security context who is attempting to pick up. There are already hooks to filter incoming connections/datagrams based on the socket's security context, but these hooks are not applicable when one wants to do TCP Wrapper-like filtering (e.g. App1 is permitted to accept TCP connections from 192.168.0.0/16). Precautions: This approach has a side effect which unlikely occurs. If a socket is shared by multiple processes with differnt policy, the process who should be able to accept this connection will not be able to accept this connection because socket_post_accept() aborts this connection. But if socket_post_accept() doesn't abort this connection, the process who must not be able to accept this connection will repeat accept() forever, which is a worse side effect. Similarly, if a socket is shared by multiple processes with differnt policy, the process who should be able to pick up this datagram will not be able to pick up this datagram because socket_post_recv_datagram() discards this datagram. But if socket_post_recv_datagram() doesn't discard this datagram, the process who must not be able to pick up this datagram will repeat recvmsg() forever, which is a worse side effect. So, don't give different permissions between processes who share one socket. Otherwise, some connections/datagrams cannot be delivered to intended process. Signed-off-by: Kentaro Takeda [EMAIL PROTECTED] Signed-off-by: Tetsuo Handa [EMAIL PROTECTED] include/linux/security.h | 34 +- net/core/datagram.c | 26 -- net/socket.c |7 +-- security/dummy.c | 13 ++--- security/security.c | 10 -- 5 files changed, 76 insertions(+), 14 deletions(-) --- net-2.6.25.orig/include/linux/security.h +++ net-2.6.25/include/linux/security.h @@ -781,8 +781,12 @@ struct request_sock; * @socket_post_accept: * This hook allows a security module to copy security * information into the newly created socket's inode. + * This hook also allows a security module to filter connections + * from unwanted peers based on the process accepting this connection. + * The connection will be aborted if this hook returns nonzero. * @sock contains the listening socket structure. * @newsock contains the newly created server socket for connection. + * Return 0 if permission is granted. * @socket_sendmsg: * Check permission before transmitting a message to another socket. * @sock contains the socket structure. @@ -796,6 +800,15 @@ struct request_sock; * @size contains the size of message structure. * @flags contains the operational flags. * Return 0 if permission is granted. + * @socket_post_recv_datagram: + * Check permission after receiving a datagram. + * This hook allows a security module to filter packets + * from unwanted peers based on the process receiving this datagram. + * The packet will be discarded if this hook returns nonzero. + * @sk contains the socket. + * @skb contains the socket buffer (may be NULL). + * @flags contains the operational flags. + * Return 0 if permission is granted. * @socket_getsockname: * Check permission before the local address (name) of the socket object * @sock is retrieved. @@ -1387,12 +1400,13 @@ struct security_operations { struct sockaddr * address, int addrlen); int (*socket_listen) (struct socket * sock, int backlog); int (*socket_accept) (struct socket * sock, struct socket * newsock); - void (*socket_post_accept) (struct socket * sock, - struct socket * newsock); + int (*socket_post_accept) (struct socket *sock, struct socket *newsock); int (*socket_sendmsg) (struct socket * sock, struct msghdr * msg, int size); int (*socket_recvmsg) (struct socket * sock, struct msghdr * msg, int size, int flags); + int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb, + unsigned int flags); int (*socket_getsockname) (struct socket * sock); int (*socket_getpeername) (struct socket * sock); int (*socket_getsockopt) (struct socket
[PATCH] Add packet filtering based on process's security context.
Hello. This patch comes from a thread at http://lkml.org/lkml/2007/11/16/155 . I want to use IP/port based access control for incoming connections/datagrams. This idea was discussed for several times, but there is no approach that satisfies both it can decide based on the recipient's process and it can sleep so that the LSM hook can query userspace for an access decision. Thus, I'd like to add one new LSM hook in net/core/datagram.c and change return type of one existing LSM hook in net/socket.c . Regards. - Subject: Add packet filtering based on process's security context. This patch allows LSM modules filter incoming connections/datagrams based on the process's security context who is attempting to pick up. There are already hooks to filter incoming connections/datagrams based on the socket's security context, but these hooks are not applicable when one wants to do TCP Wrapper-like filtering (e.g. App1 is permitted to accept TCP connections from 192.168.0.0/16). Precautions: This approach has a side effect which unlikely occurs. If a socket is shared by multiple processes with differnt policy, the process who should be able to accept this connection will not be able to accept this connection because socket_post_accept() aborts this connection. But if socket_post_accept() doesn't abort this connection, the process who must not be able to accept this connection will repeat accept() forever, which is a worse side effect. Similarly, if a socket is shared by multiple processes with differnt policy, the process who should be able to pick up this datagram will not be able to pick up this datagram because socket_post_recv_datagram() discards this datagram. But if socket_post_recv_datagram() doesn't discard this datagram, the process who must not be able to pick up this datagram will repeat recvmsg() forever, which is a worse side effect. So, don't give different permissions between processes who shares one socket. Otherwise, some connections/datagrams cannot be delivered to intended process. Signed-off-by: Kentaro Takeda [EMAIL PROTECTED] Signed-off-by: Tetsuo Handa [EMAIL PROTECTED] include/linux/security.h | 34 +- net/core/datagram.c | 26 -- net/socket.c |7 +-- security/dummy.c | 13 ++--- security/security.c | 10 -- 5 files changed, 76 insertions(+), 14 deletions(-) --- linux-2.6-mm.orig/include/linux/security.h +++ linux-2.6-mm/include/linux/security.h @@ -778,8 +778,12 @@ struct request_sock; * @socket_post_accept: * This hook allows a security module to copy security * information into the newly created socket's inode. + * This hook also allows a security module to filter connections + * from unwanted peers based on the process accepting this connection. + * The connection will be aborted if this hook returns nonzero. * @sock contains the listening socket structure. * @newsock contains the newly created server socket for connection. + * Return 0 if permission is granted. * @socket_sendmsg: * Check permission before transmitting a message to another socket. * @sock contains the socket structure. @@ -793,6 +797,15 @@ struct request_sock; * @size contains the size of message structure. * @flags contains the operational flags. * Return 0 if permission is granted. + * @socket_post_recv_datagram: + * Check permission after receiving a datagram. + * This hook allows a security module to filter packets + * from unwanted peers based on the process receiving this datagram. + * The packet will be discarded if this hook returns nonzero. + * @sk contains the socket. + * @skb contains the socket buffer (may be NULL). + * @flags contains the operational flags. + * Return 0 if permission is granted. * @socket_getsockname: * Check permission before the local address (name) of the socket object * @sock is retrieved. @@ -1389,12 +1402,13 @@ struct security_operations { struct sockaddr * address, int addrlen); int (*socket_listen) (struct socket * sock, int backlog); int (*socket_accept) (struct socket * sock, struct socket * newsock); - void (*socket_post_accept) (struct socket * sock, - struct socket * newsock); + int (*socket_post_accept) (struct socket *sock, struct socket *newsock); int (*socket_sendmsg) (struct socket * sock, struct msghdr * msg, int size); int (*socket_recvmsg) (struct socket * sock, struct msghdr * msg, int size, int flags); + int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb, + unsigned int flags); int (*socket_getsockname) (struct socket * sock); int (*socket_getpeername) (struct socket
[PATCH] Add packet filtering based on process\'s security context.
This patch allows LSM modules filter incoming connections/datagrams based on the process's security context who is attempting to pick up. There are already hooks to filter incoming connections/datagrams based on the socket's security context, but these hooks are not applicable when one wants to do TCP Wrapper-like filtering (e.g. App1 is permitted to accept TCP connections from 192.168.0.0/16). There is a side effect which unlikely occurs. If a socket is shared by multiple processes with different policy, the process who should be able to accept this connection will not be able to accept this connection because socket_post_accept() aborts this connection. But if socket_post_accept() doesn't abort this connection, the process who must not be able to accept this connection will repeat accept() forever, which is a worse side effect. Similarly, if a socket is shared by multiple processes with different policy, the process who should be able to pick up this datagram will not be able to pick up this datagram because socket_post_recv_datagram() discards this datagram. But if socket_post_recv_datagram() doesn't discard this datagram, the process who must not be able to pick up this datagram will repeat recvmsg() forever, which is a worse side effect. Signed-off-by: Kentaro Takeda [EMAIL PROTECTED] Signed-off-by: Tetsuo Handa [EMAIL PROTECTED] include/linux/security.h | 34 +- net/core/datagram.c | 26 -- net/socket.c |7 +-- security/dummy.c | 13 ++--- security/security.c | 10 -- 5 files changed, 76 insertions(+), 14 deletions(-) --- linux-2.6.24-rc2-mm1.orig/include/linux/security.h +++ linux-2.6.24-rc2-mm1/include/linux/security.h @@ -778,8 +778,12 @@ struct request_sock; * @socket_post_accept: * This hook allows a security module to copy security * information into the newly created socket's inode. + * This hook also allows a security module to filter connections + * from unwanted peers based on the process accepting this connection. + * The connection will be aborted if this hook returns nonzero. * @sock contains the listening socket structure. * @newsock contains the newly created server socket for connection. + * Return 0 if permission is granted. * @socket_sendmsg: * Check permission before transmitting a message to another socket. * @sock contains the socket structure. @@ -793,6 +797,15 @@ struct request_sock; * @size contains the size of message structure. * @flags contains the operational flags. * Return 0 if permission is granted. + * @socket_post_recv_datagram: + * Check permission after receiving a datagram. + * This hook allows a security module to filter packets + * from unwanted peers based on the process receiving this datagram. + * The packet will be discarded if this hook returns nonzero. + * @sk contains the socket. + * @skb contains the socket buffer (may be NULL). + * @flags contains the operational flags. + * Return 0 if permission is granted. * @socket_getsockname: * Check permission before the local address (name) of the socket object * @sock is retrieved. @@ -1384,12 +1397,13 @@ struct security_operations { struct sockaddr * address, int addrlen); int (*socket_listen) (struct socket * sock, int backlog); int (*socket_accept) (struct socket * sock, struct socket * newsock); - void (*socket_post_accept) (struct socket * sock, - struct socket * newsock); + int (*socket_post_accept) (struct socket *sock, struct socket *newsock); int (*socket_sendmsg) (struct socket * sock, struct msghdr * msg, int size); int (*socket_recvmsg) (struct socket * sock, struct msghdr * msg, int size, int flags); + int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb, + unsigned int flags); int (*socket_getsockname) (struct socket * sock); int (*socket_getpeername) (struct socket * sock); int (*socket_getsockopt) (struct socket * sock, int level, int optname); @@ -2294,10 +2308,12 @@ int security_socket_bind(struct socket * int security_socket_connect(struct socket *sock, struct sockaddr *address, int addrlen); int security_socket_listen(struct socket *sock, int backlog); int security_socket_accept(struct socket *sock, struct socket *newsock); -void security_socket_post_accept(struct socket *sock, struct socket *newsock); +int security_socket_post_accept(struct socket *sock, struct socket *newsock); int security_socket_sendmsg(struct socket *sock, struct msghdr *msg, int size); int security_socket_recvmsg(struct socket *sock, struct msghdr *msg, int size, int flags); +int
Re: [TOMOYO 15/15] LSM expansion for TOMOYO Linux.
Hello. Thank you very much for your time, Paul. Yes, you understood what I wanted to do. TOMOYO Linux's approach: (1) It uses userspace intervention to allow/reject connections and/or packets based on the application's domain. Since existent hooks can't be used for this purpose, I inserted a new hook post_recv_datagram() at skb_recv_datagram() and I modified socket_post_accept() to return error so that I can drop/disconnect based on the application's domain. I think skb_recv_datagram() is the only place that can remove a message picked up with MSG_PEEK flags from the receive queue. To remove a message picked up with MSG_PEEK flags, I noticed that I have to do skb_kill_datagram()-like operation so that the head message that must not be delivered to the caller won't prevent picking up of the non-head message that should be delivered to the caller when the caller repeats only recv(MSG_PEEK) requests. Since skb_recv_datagram() can be called from interrupt context, I have to use spin_lock_irqsave() instead for spin_lock_bh(), am I right? /* from net/core/datagram.c */ @@ -178,6 +179,27 @@ struct sk_buff *skb_recv_datagram(struct } else skb = skb_dequeue(sk-sk_receive_queue); + error = security_post_recv_datagram(sk, skb, flags); + if (error) { + unsigned long cpu_flags; + + if (!(flags MSG_PEEK)) + goto no_peek; + + spin_lock_irqsave(sk-sk_receive_queue.lock, + cpu_flags); + if (skb == skb_peek(sk-sk_receive_queue)) { + __skb_unlink(skb, +sk-sk_receive_queue); + atomic_dec(skb-users); + } + spin_unlock_irqrestore(sk-sk_receive_queue.lock, + cpu_flags); +no_peek: + skb_free_datagram(sk, skb); + goto no_packet; + } + if (skb) return skb; By the way, why can't socket_post_accept() fail? Someone may wish to do memory allocation at socket_post_accept(). socket_accept() is too early for memory allocation because there is no chance to free allocated memory when sock-ops-accept() failed. I think socket_post_accept() should be able to fail. (2) It allows the administrator judge interactively using a userspace agent. Thus, the new hook has to be inserted at blockable location, Since skb_recv_datagram() can be called from interrupt context, I do nothing in post_recv_datagram() if called from interrupt context. +static int tmy_post_recv_datagram(struct sock *sk, + struct sk_buff *skb, + unsigned int flags) +{ + int error = 0; + const unsigned int type = sk-sk_type; + + /* skb_recv_datagram() didn't dequeue. */ + if (!skb) + return 0; + + /* skb_recv_datagram() can be called from interrupt context. */ + if (in_interrupt()) + return 0; + /* I don't check if called by kernel process. */ + if (segment_eq(get_fs(), KERNEL_DS)) + return 0; + + if (type != SOCK_DGRAM type != SOCK_RAW) + return 0; ...(sniped)... +} Regards. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
What does -EIOCBQUEUED do?
Hello. There are several locations that handle -EIOCBQUEUED error code. According to include/linux/errno.h , it seems this code is NFS related and caller will receive completion event later. But I couldn't figure out where is the beginning point and what is happening. What functions are called before aio_complete() is called? (For example, is udp_recvmsg() called if sock_recvmsg() for UDP socket returned -EIOCBQUEUED?) What memory regions are accessed before aio_complete() is called? (For example, is struct msghdr-msg_name accessed if sock_recvmsg() for UDP socket returned -EIOCBQUEUED?) Regards. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/1] Allow LSM to use IP address/port number.
Hello. Isn't it better to hook into existing netfilter infrastructure somehow? Yes, it has been suggested several times. I want to use a process's context who issued accept()/recv()/recvfrom()/recvmsg()/read() requests. But Stephen Smalley said at http://marc.info/?l=linux-security-modulem=118459899017487w=2 that The socket can be inherited by a child task or passed via local IPC to an unrelated task, and then used by those other tasks. It isn't tied to the original allocating task's lifecycle in any way, nor is it guaranteed to only be used by that original allocating task. It can exist in zero, one or many tasks' file tables and still be receiving data, and can go from completely disassociated (i.e. not present in any tasks' file tables) to being associated. I can't use netfilter infrastructure because it is too early to know who the recipant process of the packet is. The only chance to perform ip/port based filtering using ACLs associated with the recipant process of the packet is post-accept() and post-recvmsg(). Therefore, I re-post my patch again. Regards. Signed-off-by: Tetsuo Handa [EMAIL PROTECTED] --- include/linux/security.h | 44 +--- net/socket.c | 42 -- security/dummy.c | 13 ++--- 3 files changed, 83 insertions(+), 16 deletions(-) diff -upr a/include/linux/security.h b/include/linux/security.h --- a/include/linux/security.h 2007-07-20 22:55:38.0 +0900 +++ b/include/linux/security.h 2007-07-20 23:15:58.0 +0900 @@ -749,8 +749,12 @@ struct request_sock; * @socket_post_accept: * This hook allows a security module to copy security * information into the newly created socket's inode. + * This hook also allows a security module to filter connections + * from unwanted peers. + * The connection will be aborted if this hook returns nonzero. * @sock contains the listening socket structure. * @newsock contains the newly created server socket for connection. + * Return 0 if permission is granted. * @socket_sendmsg: * Check permission before transmitting a message to another socket. * @sock contains the socket structure. @@ -764,6 +768,15 @@ struct request_sock; * @size contains the size of message structure. * @flags contains the operational flags. * Return 0 if permission is granted. + * @socket_post_recvmsg: + * Check peer's address after receiving a message from a socket. + * This hook allows a security module to filter messages + * from unwanted peers. + * @sock contains the socket structure. + * @msg contains the message structure. + * @size contains the size of message structure. + * @flags contains the operational flags. + * Return 0 if permission is granted. * @socket_getsockname: * Check permission before the local address (name) of the socket object * @sock is retrieved. @@ -1345,12 +1358,14 @@ struct security_operations { struct sockaddr * address, int addrlen); int (*socket_listen) (struct socket * sock, int backlog); int (*socket_accept) (struct socket * sock, struct socket * newsock); - void (*socket_post_accept) (struct socket * sock, - struct socket * newsock); + int (*socket_post_accept) (struct socket *sock, + struct socket *newsock); int (*socket_sendmsg) (struct socket * sock, struct msghdr * msg, int size); int (*socket_recvmsg) (struct socket * sock, struct msghdr * msg, int size, int flags); + int (*socket_post_recvmsg) (struct socket *sock, struct msghdr *msg, + int size, int flags); int (*socket_getsockname) (struct socket * sock); int (*socket_getpeername) (struct socket * sock); int (*socket_getsockopt) (struct socket * sock, int level, int optname); @@ -2860,10 +2875,10 @@ static inline int security_socket_accept return security_ops-socket_accept(sock, newsock); } -static inline void security_socket_post_accept(struct socket * sock, - struct socket * newsock) +static inline int security_socket_post_accept(struct socket *sock, + struct socket *newsock) { - security_ops-socket_post_accept(sock, newsock); + return security_ops-socket_post_accept(sock, newsock); } static inline int security_socket_sendmsg(struct socket * sock, @@ -2879,6 +2894,13 @@ static inline int security_socket_recvms return security_ops-socket_recvmsg(sock, msg, size, flags); } +static inline int security_socket_post_recvmsg(struct socket *sock, + struct msghdr *msg
Re: [PATCH 1/1] Allow LSM to use IP address/port number.
Hello. Patrick McHardy wrote: Quoting Tetsuo: So, my approach is not using security context associated with a socket but security context associated with a process. Isn't the socket context derived from the process context? Not so regarding my case. static int smack_sk_alloc_security(struct sock *sk, int family, gfp_t priority) { sk-sk_security = current-security; return 0; } will not help what I want to do. So, I'm not planning to use sk-sk_security. I'm planning to use current-security at accept()/recvmsg() time. What I want to do is to enforce subset of TCP Wrapper inside the kernel space so that IP/port based filtering for TCP and UDP is applied to ALL processes and IP/port based filtering for TCP and UDP is not bypassed by processes that are linked with TCP Wrapper library (even if accept()/recvmsg() syscalls are directly called). To receive source IP/port of a incoming packet, non-NULL msg-msg_name is needed for receiving them, and I'm proposing modifications in net/socket.c . Regards. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] Allow LSM to use IP address/port number. (was Re: [PATCH 1/1] Add post accept()/recvmsg() hooks.)
Hello. This thread is from http://marc.info/?t=11834645705r=1w=2 . I want to use tcp_wrapper-like filtering using LSM. But it seems that there are cases (recvmsg() and read()?) where __sock_recvmsg() is called with msg-name == NULL and msg-msg_namelen == 0 that makes what I want to do impossible. To make IP address and port number always available, some changes in socket layer are needed. Since I'm not getting objection from LSM-ml so far, I'm now adding netdev-ml because this patch is related to socket layer. Are there ways to receive messages other than recv()/recvfrom()/recvmsg()/read()? If recv()/recvfrom()/recvmsg()/read() are all ways to receive messages, the following patch seems to allow LSM to use IP address and port number. The following patch allocates buffer for receiving IP address and port number even if userland doesn't request them. Is this change no problem? Regards. Signed-off-by: Tetsuo Handa [EMAIL PROTECTED] --- include/linux/security.h | 38 ++ net/socket.c | 40 ++-- security/dummy.c | 11 +-- 3 files changed, 77 insertions(+), 12 deletions(-) diff -upr a/include/linux/security.h b/include/linux/security.h --- a/include/linux/security.h 2007-07-03 10:07:14.0 +0900 +++ b/include/linux/security.h 2007-07-09 10:51:04.0 +0900 @@ -748,8 +748,12 @@ struct request_sock; * @socket_post_accept: * This hook allows a security module to copy security * information into the newly created socket's inode. + * This hook also allows a security module to filter connections + * from unwanted peers. + * The connection will be aborted if this hook returns nonzero. * @sock contains the listening socket structure. * @newsock contains the newly created server socket for connection. + * Return 0 if permission is granted. * @socket_sendmsg: * Check permission before transmitting a message to another socket. * @sock contains the socket structure. @@ -763,6 +767,15 @@ struct request_sock; * @size contains the size of message structure. * @flags contains the operational flags. * Return 0 if permission is granted. + * @socket_post_recvmsg: + * Check peer's address after receiving a message from a socket. + * This hook allows a security module to filter messages + * from unwanted peers. + * @sock contains the socket structure. + * @msg contains the message structure. + * @size contains the size of message structure. + * @flags contains the operational flags. + * Return 0 if permission is granted. * @socket_getsockname: * Check permission before the local address (name) of the socket object * @sock is retrieved. @@ -1343,12 +1356,14 @@ struct security_operations { struct sockaddr * address, int addrlen); int (*socket_listen) (struct socket * sock, int backlog); int (*socket_accept) (struct socket * sock, struct socket * newsock); - void (*socket_post_accept) (struct socket * sock, + int (*socket_post_accept) (struct socket *sock, struct socket * newsock); int (*socket_sendmsg) (struct socket * sock, struct msghdr * msg, int size); int (*socket_recvmsg) (struct socket * sock, struct msghdr * msg, int size, int flags); + int (*socket_post_recvmsg) (struct socket *sock, struct msghdr *msg, + int size, int flags); int (*socket_getsockname) (struct socket * sock); int (*socket_getpeername) (struct socket * sock); int (*socket_getsockopt) (struct socket * sock, int level, int optname); @@ -2853,10 +2868,10 @@ static inline int security_socket_accept return security_ops-socket_accept(sock, newsock); } -static inline void security_socket_post_accept(struct socket * sock, +static inline int security_socket_post_accept(struct socket *sock, struct socket * newsock) { - security_ops-socket_post_accept(sock, newsock); + return security_ops-socket_post_accept(sock, newsock); } static inline int security_socket_sendmsg(struct socket * sock, @@ -2872,6 +2887,13 @@ static inline int security_socket_recvms return security_ops-socket_recvmsg(sock, msg, size, flags); } +static inline int security_socket_post_recvmsg(struct socket *sock, + struct msghdr *msg, + int size, int flags) +{ + return security_ops-socket_post_recvmsg(sock, msg, size, flags); +} + static inline int security_socket_getsockname(struct socket * sock) { return security_ops-socket_getsockname(sock); @@ -3016,9 +3038,10 @@ static inline int security_socket_accept return 0
Re: [RFC] Allow LSM to use IP address/port number.
Hello. Thank you for your comment. David Miller wrote: I don't think it's such a hot idea to return errors if the wait_on_sync_kiocb() has returned success. My patch may return errors for non-wait_on_sync_kiocb() case too. Are you saying only wait_on_sync_kiocb() case is bad? If so, could you please explain me why? The location I inserted a hook is after aio event finished. Both struct kiocb and struct sock_iocb are no longer in use. So, I think this location can safely return errors. If you are saying don't return error after receiving data, I'll explain you why I need hooks after receiving data. What I want to do is to implement a kind of anti-virus software's personal firewall feature. It drops messages from unwanted IP address/ports. (To be exact, it doesn't drop, it just tells userland process not to use received messages by returning errors.) security_socket_recvmsg() is called before retrieving a message. I want a hook that is called after retrieving a message so that I can use IP address and port number for judgement. I also want to allow users judge interactively using a popup dialog. To judge interactively, the hook has to be block-able. If non-interactive, iptables is the appropriate location for filtering. This is why I want to insert security_socket_post_recvmsg() hook here. Regards. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Allow LSM to use IP address/port number.
Thank you for your comment. I have a question regarding netfilter infrastructure. I want to filter messages using task_struct-security. Can the netfilter's queuing to userspace feature get a list of struct task_struct who shares a socket that is going to receive incoming messages? My approach is not is this socket allowed to receive from xxx.xxx.xxx.xxx port yy but is this process allowed to receive from xxx.xxx.xxx.xxx port yy. So, my approach is not using security context associated with a socket but security context associated with a process. If I can't use netfilter, there is no chance to filter before enqueuing messages. So, I think propagating errors to the local user after dequeuing messages is the only possible way. Regards. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html