from:"Tetsuo Handa"

Re: bpf: Massive skbuff_head_cache memory leak?

2018-09-26 Thread Tetsuo Handa

Hello, Alexei and Daniel.

Can you show us how to run testcases you are testing?

On 2018/09/22 22:25, Tetsuo Handa wrote:
> Hello.
> 
> syzbot is reporting many lockup problems on bpf.git / bpf-next.git / net.git 
> / net-next.git trees.
> 
>   INFO: rcu detected stall in br_multicast_port_group_expired (2)
>   
> https://syzkaller.appspot.com/bug?id=15c7ad8cf35a07059e8a697a22527e11d294bc94
> 
>   INFO: rcu detected stall in tun_chr_close
>   
> https://syzkaller.appspot.com/bug?id=6c50618bde03e5a2eefdd0269cf9739c5ebb8270
> 
>   INFO: rcu detected stall in discover_timer
>   
> https://syzkaller.appspot.com/bug?id=55da031ddb910e58ab9c6853a5784efd94f03b54
> 
>   INFO: rcu detected stall in ret_from_fork (2)
>   
> https://syzkaller.appspot.com/bug?id=c83129a6683b44b39f5b8864a1325893c9218363
> 
>   INFO: rcu detected stall in addrconf_rs_timer
>   
> https://syzkaller.appspot.com/bug?id=21c029af65f81488edbc07a10ed20792444711b6
> 
>   INFO: rcu detected stall in kthread (2)
>   
> https://syzkaller.appspot.com/bug?id=6accd1ed11c31110fed1982f6ad38cc9676477d2
> 
>   INFO: rcu detected stall in ext4_filemap_fault
>   
> https://syzkaller.appspot.com/bug?id=817e38d20e9ee53390ac361bf0fd2007eaf188af
> 
>   INFO: rcu detected stall in run_timer_softirq (2)
>   
> https://syzkaller.appspot.com/bug?id=f5a230a3ff7822f8d39fddf8485931bd06ae47fe
> 
>   INFO: rcu detected stall in bpf_prog_ADDR
>   
> https://syzkaller.appspot.com/bug?id=fb4911fd0e861171cc55124e209f810a0dd68744
> 
>   INFO: rcu detected stall in __run_timers (2)
>   
> https://syzkaller.appspot.com/bug?id=65416569ddc8d2feb8f19066aa761f5a47f7451a
> 
> The cause of lockup seems to be flood of printk() messages from memory 
> allocation
> failures, and one of out_of_memory() messages indicates that skbuff_head_cache
> usage is huge enough to suspect in-kernel memory leaks.
> 
>   [ 1554.547011] skbuff_head_cache1847887KB1847887KB
> 
> Unfortunately, we cannot find from logs what syzbot is trying to do
> because constant printk() messages is flooding away syzkaller messages.
> Can you try running your testcases with kmemleak enabled?
> 

On 2018/09/27 2:35, Dmitry Vyukov wrote:
> I also started suspecting Apparmor. We switched to Apparmor on Aug 30:
> https://groups.google.com/d/msg/syzkaller-bugs/o73lO4KGh0w/j9pcH2tSBAAJ
> Now the instances that use SELinux and Smack explicitly contain that
> in the name, but the rest are Apparmor.
> Aug 30 roughly matches these assorted "task hung" reports. Perhaps
> some Apparmor hook leaks a reference to skbs?

Maybe. They have CONFIG_DEFAULT_SECURITY="apparmor". But I'm wondering why
this problem is not occurring on linux-next.git when this problem is occurring
on bpf.git / bpf-next.git / net.git / net-next.git trees. Is syzbot running
different testcases depending on which git tree is targeted?

Re: WARNING in add_uevent_var

2018-05-07 Thread Tetsuo Handa

On 2018/04/03 21:34, Johannes Berg wrote:
> On Sun, 2018-04-01 at 23:01 -0700, syzbot wrote:
> 
>> So far this crash happened 5 times on net-next, upstream.
>> C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6614377067184128
>>
> 
> Huh, fun. Looks like you're basically creating a new HWSIM radio with an
> insanely long name (4k!) and nothing stops you, until we try to generate
> an rfkill instance which sends a uevent and only has a 2k buffer for the
> environment variables, where we put the name ...
> 
> But yeah, we should probably limit the phy name to something sane, I'll
> pick 128 and send a patch.
> 

I think it should be NAME_MAX.


#include 
#include 
#include 
#include 
#include 

int main(int argc, char *argv[])
{
const int fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
static struct {
struct nlmsghdr hdr;
char data[4120 - sizeof(struct nlmsghdr)];
} buf = { };
struct sockaddr_nl addr = { .nl_family = AF_NETLINK };
struct iovec iov = { , sizeof(buf) };
struct msghdr msg = {
.msg_name = ,
.msg_namelen = sizeof(addr),
.msg_iov = ,
.msg_iovlen = 1,
};
buf.hdr.nlmsg_len = 0x1018;
buf.hdr.nlmsg_type = 0x22;
buf.hdr.nlmsg_flags = 0x109;
*(uint8_t*) buf.data = 4;
*(uint8_t*) (buf.data + 1) = 0;
*(uint16_t*) (buf.data + 2) = 0;
*(uint16_t*) (buf.data + 4) = 0x1004;
*(uint16_t*) (buf.data + 6) = 0x11;
memset(buf.data + 8, 'A', 4096); /* strlen() > NAME_MAX */
sendmsg(fd, , 0);
return 0;
}




>From 3eba6541da0c7338e3d71ea83cbc69962923d65e Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Mon, 7 May 2018 15:58:37 +0900
Subject: [PATCH] net: rfkill: Add filename varidity test at rfkill_alloc().

syzbot is hitting WARN() at kobject_uevent_env() [1].
This is because the test case is requesting too long name.
Since the name is used as part of pathname under /sys/devices/virtual/ ,
this name parameter must obey the limitations of a filename.
Fix this by applying name validity test to rfkill_alloc().

[1] 
https://syzkaller.appspot.com/bug?id=c1e1ab1d042b637ee9e25c64b107d51630b9d9ec

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+230d9e642a85d3fec...@syzkaller.appspotmail.com>
Cc: Johannes Berg <johan...@sipsolutions.net>
---
 net/rfkill/core.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/rfkill/core.c b/net/rfkill/core.c
index 59d0eb9..d8beebc 100644
--- a/net/rfkill/core.c
+++ b/net/rfkill/core.c
@@ -933,6 +933,10 @@ struct rfkill * __must_check rfkill_alloc(const char *name,
if (WARN_ON(type == RFKILL_TYPE_ALL || type >= NUM_RFKILL_TYPES))
return NULL;
 
+   if (strlen(name) > NAME_MAX || strchr(name, '/') ||
+   !strcmp(name, ".") || !strcmp(name, ".."))
+   return NULL;
+
rfkill = kzalloc(sizeof(*rfkill) + strlen(name) + 1, GFP_KERNEL);
if (!rfkill)
return NULL;
-- 
1.8.3.1

Re: WARNING in kernfs_add_one

2018-05-07 Thread Tetsuo Handa

On 2018/05/06 7:07, Greg KH wrote:
>> More likely wireless territory
> 
> Ugh, that's what I get for writing emails before coffee in the
> morning...
> 
> Yes, you are right, this looks like a wireless issue.
> 
> Now cc: linux-wireless.
> 
Nope, if you look at previous fault injection messages...



>From 7ddcaa3d4327d4f29d11053bd2011bf77ecf72af Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Mon, 7 May 2018 14:19:50 +0900
Subject: [PATCH] driver core: Don't ignore class_dir_create_and_add() failure.

syzbot is hitting WARN() at kernfs_add_one() [1].
This is because kernfs_create_link() is confused by previous device_add()
call which continued without setting dev->kobj.parent field when
get_device_parent() failed by memory allocation fault injection.
Fix this by propagating the error from class_dir_create_and_add() to
the calllers of get_device_parent().

[1] 
https://syzkaller.appspot.com/bug?id=fae0fb607989ea744526d1c082a5b8de6529116f

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Reported-by: syzbot <syzbot+df47f81c226b31d89...@syzkaller.appspotmail.com>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
---
 drivers/base/core.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816..d680fd0 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1467,7 +1467,7 @@ class_dir_create_and_add(struct class *class, struct 
kobject *parent_kobj)
 
dir = kzalloc(sizeof(*dir), GFP_KERNEL);
if (!dir)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
 
dir->class = class;
kobject_init(>kobj, _dir_ktype);
@@ -1477,7 +1477,7 @@ class_dir_create_and_add(struct class *class, struct 
kobject *parent_kobj)
retval = kobject_add(>kobj, parent_kobj, "%s", class->name);
if (retval < 0) {
kobject_put(>kobj);
-   return NULL;
+   return ERR_PTR(retval);
}
return >kobj;
 }
@@ -1784,6 +1784,10 @@ int device_add(struct device *dev)
 
parent = get_device(dev->parent);
kobj = get_device_parent(dev, parent);
+   if (IS_ERR(kobj)) {
+   error = PTR_ERR(kobj);
+   goto parent_error;
+   }
if (kobj)
dev->kobj.parent = kobj;
 
@@ -1882,6 +1886,7 @@ int device_add(struct device *dev)
kobject_del(>kobj);
  Error:
cleanup_glue_dir(dev, glue_dir);
+parent_error:
put_device(parent);
 name_error:
kfree(dev->p);
@@ -2701,6 +2706,11 @@ int device_move(struct device *dev, struct device 
*new_parent,
device_pm_lock();
new_parent = get_device(new_parent);
new_parent_kobj = get_device_parent(dev, new_parent);
+   if (IS_ERR(new_parent_kobj)) {
+   error = PTR_ERR(new_parent_kobj);
+   put_device(new_parent);
+   goto out;
+   }
 
pr_debug("device: '%s': %s: moving to '%s'\n", dev_name(dev),
 __func__, new_parent ? dev_name(new_parent) : "");
-- 
1.8.3.1

Re: INFO: rcu detected stall in __schedule

2018-05-02 Thread Tetsuo Handa

I'm not sure whether this is a PPP bug.

As of uptime = 484, RCU says that it stalled for 125 seconds.

--
[  484.407032] INFO: rcu_sched self-detected stall on CPU
[  484.412488]  0-...!: (125000 ticks this GP) idle=f3e/1/4611686018427387906 
softirq=112858/112858 fqs=0 
[  484.422300]   (t=125000 jiffies g=61626 c=61625 q=1534)
[  484.427663] rcu_sched kthread starved for 125000 jiffies! g61626 c61625 f0x0 
RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
--

484 - 125 = 359, which was about to start SND related fuzzing in that log.

--
2033/05/18 03:36:31 executing program 1:
r0 = socket(0x4a, 0x5, 0x7)
setsockopt$inet_int(r0, 0x0, 0x18, &(0x7f00)=0x200, 0x4)
bind$inet6(r0, &(0x7fc0)={0xa, 0x0, 0x0, @loopback={0x0, 0x1}}, 0x1c)
perf_event_open(&(0x7f40)={0x2, 0x70, 0x3e5}, 0x0, 0x, 
0x, 0x0)
timer_create(0x0, &(0x7f0001c0)={0x0, 0x15, 0x0, @thr={&(0x7f000440), 
&(0x7f000540)}}, &(0x7f000200))
timer_getoverrun(0x0)
perf_event_open(&(0x7f25c000)={0x2, 0x78, 0x3e3}, 0x0, 0x0, 
0x, 0x0)
r1 = syz_open_dev$sndctrl(&(0x7f000200)='/dev/snd/controlC#\x00', 0x2, 0x0)
perf_event_open(&(0x7f001000)={0x0, 0x70, 0x0, 0x0, 0x0, 0x0, 0x0, 0x8ce, 
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0, 
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xfff8, 0x0, 0x0, 0x0, 0x0, 
0x0, 0x0, 0x0, 0x0, @perf_bp={&(0x7f005000), 0x2}, 0x10c}, 0x0, 
0x0, 0x, 0x0)
ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7fc0)=0x1)
r2 = syz_open_dev$sndpcmp(&(0x7f000100)='/dev/snd/pcmC#D#p\x00', 0x1, 
0x4000)
ioctl$SNDRV_SEQ_IOCTL_GET_QUEUE_CLIENT(r2, 0xc04c5349, 
&(0x7f000240)={0x200, 0xfcdc, 0x1})
syz_open_dev$tun(&(0x7f0003c0)='/dev/net/tun\x00', 0x0, 0x20402)
ioctl$SNDRV_CTL_IOCTL_PVERSION(r1, 0xc1105517, &(0x7f001000)=""/250)
ioctl$SNDRV_CTL_IOCTL_SUBSCRIBE_EVENTS(r1, 0xc0045516, &(0x7f00))

2033/05/18 03:36:31 executing program 4:
syz_emit_ethernet(0x3e, &(0x7fc0)={@broadcast=[0xff, 0xff, 0xff, 0xff, 
0xff, 0xff], @empty=[0x0, 0x0, 0xb00], [], {@ipv4={0x800, {{0x5, 
0x4, 0x0, 0x0, 0x30, 0x0, 0x0, 0x0, 0x1, 0x0, @remote={0xac, 0x14, 0x14, 0xbb}, 
@dev={0xac, 0x14, 0x14}}, @icmp=@parameter_prob={0x5, 0x4, 0x0, 0x0, 0x0, 0x0, 
{0x5, 0x4, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @local={0xac, 0x223, 0x14, 
0xaa}, @dev={0xac, 0x14, 0x14}}}, &(0x7f00)={0x0, 0x2, [0x0, 
0x2e6]})

2033/05/18 03:36:31 executing program 1:
r0 = socket$pppoe(0x18, 0x1, 0x0)
connect$pppoe(r0, &(0x7fc0)={0x18, 0x0, {0x1, @broadcast=[0xff, 0xff, 
0xff, 0xff, 0xff, 0xff], 'ip6_vti0\x00'}}, 0x1e)
r1 = socket(0x3, 0xb, 0x8001)
setsockopt$inet_sctp6_SCTP_ADAPTATION_LAYER(r1, 0x84, 0x7, 
&(0x7f000100)={0x2}, 0x4)
ioctl$sock_inet_SIOCGIFADDR(r0, 0x8915, 
&(0x7f40)={'veth1_to_bridge\x00', {0x2, 0x4e21}})
r2 = syz_open_dev$admmidi(&(0x7f00)='/dev/admmidi#\x00', 0x6, 0x8000)
setsockopt$SO_VM_SOCKETS_BUFFER_MAX_SIZE(r2, 0x28, 0x2, 
&(0x7f80)=0xff00, 0x8)

[  359.306427] snd_virmidi snd_virmidi.0: control 112:0:0:�:0 is already 
present
--

Re: WARNING: refcount bug in should_fail

2018-04-21 Thread Tetsuo Handa

Eric W. Biederman wrote:
> Al Viro  writes:
> 
> > On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote:
> >
> >> FWIW, I'm going through the ->kill_sb() instances, fixing that sort
> >> of bugs (most of them preexisting, but I should've checked instead
> >> of assuming that everything's fine).  Will push out later tonight.
> >
> > OK, see vfs.git#for-linus.  Caught: 4 old bugs (allocation failure
> > in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp.
> > and double-dput in late failure exit in rpc_fill_super())
> > and 5 regressions from register_shrinker() failure recovery.
> 
> One issue with your vfs.git#for-linus branch.
> 
> It is missing Fixes tags and  Cc: stable on those patches.
> As the bug came in v4.15 those tags would really help the stable
> maintainers get the recent regression fixes applied.

OK. The patch was sent to linux.git as commit 8e04944f0ea8b838.

#syz fix: mm,vmscan: Allow preallocating memory for register_shrinker().

Re: WARNING: refcount bug in should_fail

2018-04-01 Thread Tetsuo Handa

Dmitry Vyukov wrote:
> On Sun, Apr 1, 2018 at 12:32 PM, Dmitry Vyukov <dvyu...@google.com> wrote:
> > On Sun, Mar 4, 2018 at 6:57 AM, Tetsuo Handa
> > <penguin-ker...@i-love.sakura.ne.jp> wrote:
> >> Switching from mm to fsdevel, for this report says that put_net(net) in
> >> rpc_kill_sb() made net->count < 0 when mount_ns() failed due to
> >> register_shrinker() failure.
> >>
> >> Relevant commits will be
> >> commit 9ee332d99e4d5a97 ("sget(): handle failures of register_shrinker()") 
> >> and
> >> commit d91ee87d8d85a080 ("vfs: Pass data, ns, and ns->userns to 
> >> mount_ns.").
> >>
> >> When sget_userns() in mount_ns() failed, mount_ns() returns an error code 
> >> to
> >> the caller without calling fill_super(). That is, get_net(sb->s_fs_info) 
> >> was
> >> not called by rpc_fill_super() (via fill_super callback passed to 
> >> mount_ns())
> >> but put_net(sb->s_fs_info) is called by rpc_kill_sb() (via fs->kill_sb() 
> >> from
> >> deactivate_locked_super()).
> >>
> >> --
> >> static struct dentry *
> >> rpc_mount(struct file_system_type *fs_type,
> >> int flags, const char *dev_name, void *data)
> >> {
> >> struct net *net = current->nsproxy->net_ns;
> >> return mount_ns(fs_type, flags, data, net, net->user_ns, 
> >> rpc_fill_super);
> >> }
> >> --
> >
> > Messed kernel output, this is definitely not in should_fail.
> >
> > #syz dup: WARNING: refcount bug in sk_alloc
> 
> Please don't drop reporter (syzbot) email from CC.
> 
> #syz dup: WARNING: refcount bug in sk_alloc
> 

Excuse me? This "refcount bug in should_fail" is talking about sget_userns() 
versus rpc_fill_super().
I think we need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080.

> 
> >> syzbot wrote:
> >>> Hello,
> >>>
> >>> syzbot hit the following crash on bpf-next commit
> >>> 6f1b5a2b58d8470e5a8b25ab29f5fdb46168 (Tue Feb 27 04:11:23 2018 +)
> >>> Merge branch 'bpf-kselftest-improvements'
> >>>
> >>> C reproducer is attached.
> >>> syzkaller reproducer is attached.
> >>> Raw console output is attached.
> >>> compiler: gcc (GCC) 7.1.1 20170620
> >>> .config is attached.
> >>>
> >>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >>> Reported-by: syzbot+84371b6062cb639d7...@syzkaller.appspotmail.com
> >>> It will help syzbot understand when the bug is fixed. See footer for
> >>> details.
> >>> If you forward the report, please keep this part and the footer.
> >>>
> >>> [ cut here ]
> >>> FAULT_INJECTION: forcing a failure.
> >>> name failslab, interval 1, probability 0, space 0, times 0
> >>> refcount_t: underflow; use-after-free.
> >>> CPU: 1 PID: 4239 Comm: syzkaller149381 Not tainted 4.16.0-rc2+ #20
> >>> WARNING: CPU: 0 PID: 4237 at lib/refcount.c:187
> >>> refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:187
> >>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> >>> Google 01/01/2011
> >>> Call Trace:
> >>> Kernel panic - not syncing: panic_on_warn set ...
> >>>
> >>>   __dump_stack lib/dump_stack.c:17 [inline]
> >>>   dump_stack+0x194/0x24d lib/dump_stack.c:53
> >>>   fail_dump lib/fault-inject.c:51 [inline]
> >>>   should_fail+0x8c0/0xa40 lib/fault-inject.c:149
> >>>   should_failslab+0xec/0x120 mm/failslab.c:32
> >>>   slab_pre_alloc_hook mm/slab.h:422 [inline]
> >>>   slab_alloc mm/slab.c:3365 [inline]
> >>>   __do_kmalloc mm/slab.c:3703 [inline]
> >>>   __kmalloc+0x63/0x760 mm/slab.c:3714
> >>>   kmalloc include/linux/slab.h:517 [inline]
> >>>   kzalloc include/linux/slab.h:701 [inline]
> >>>   register_shrinker+0x10e/0x2d0 mm/vmscan.c:268
> >>>   sget_userns+0xbbf/0xe40 fs/super.c:520
> >>>   mount_ns+0x6d/0x190 fs/super.c:1029
> >>>   rpc_mount+0x9e/0xd0 net/sunrpc/rpc_pipe.c:1451
> >>>   mount_fs+0x66/0x2d0 fs/super.c:1222
> >>>   vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037
> >>>   vfs_kern_mount fs/namespace.c:2509 [inline]
> >>>   do_new_mount fs/namespace.c:2512 [inline]
> >>>

Re: WARNING: refcount bug in should_fail

2018-04-01 Thread Tetsuo Handa

syzbot wrote:
> > On Sun, Mar 4, 2018 at 6:57 AM, Tetsuo Handa
> > <penguin-ker...@i-love.sakura.ne.jp> wrote:
> >> Switching from mm to fsdevel, for this report says that put_net(net) in
> >> rpc_kill_sb() made net->count < 0 when mount_ns() failed due to
> >> register_shrinker() failure.
> 
> >> Relevant commits will be
> >> commit 9ee332d99e4d5a97 ("sget(): handle failures of  
> >> register_shrinker()") and
> >> commit d91ee87d8d85a080 ("vfs: Pass data, ns, and ns->userns to  
> >> mount_ns.").
> 
> >> When sget_userns() in mount_ns() failed, mount_ns() returns an error  
> >> code to
> >> the caller without calling fill_super(). That is, get_net(sb->s_fs_info)  
> >> was
> >> not called by rpc_fill_super() (via fill_super callback passed to  
> >> mount_ns())
> >> but put_net(sb->s_fs_info) is called by rpc_kill_sb() (via fs->kill_sb()  
> >> from
> >> deactivate_locked_super()).
> 
> >> --
> >> static struct dentry *
> >> rpc_mount(struct file_system_type *fs_type,
> >>  int flags, const char *dev_name, void *data)
> >> {
> >>  struct net *net = current->nsproxy->net_ns;
> >>  return mount_ns(fs_type, flags, data, net, net->user_ns,  
> >> rpc_fill_super);
> >> }
> >> --
> 
> > Messed kernel output, this is definitely not in should_fail.
> 
> > #syz dup: WARNING: refcount bug in sk_alloc
> 
> Can't find the corresponding bug.
> 
I don't think this is a dup of existing bug.
We need to fix either 9ee332d99e4d5a97 or d91ee87d8d85a080.

Re: WARNING: refcount bug in should_fail

2018-03-03 Thread Tetsuo Handa

Switching from mm to fsdevel, for this report says that put_net(net) in
rpc_kill_sb() made net->count < 0 when mount_ns() failed due to
register_shrinker() failure.

Relevant commits will be
commit 9ee332d99e4d5a97 ("sget(): handle failures of register_shrinker()") and
commit d91ee87d8d85a080 ("vfs: Pass data, ns, and ns->userns to mount_ns.").

When sget_userns() in mount_ns() failed, mount_ns() returns an error code to
the caller without calling fill_super(). That is, get_net(sb->s_fs_info) was
not called by rpc_fill_super() (via fill_super callback passed to mount_ns())
but put_net(sb->s_fs_info) is called by rpc_kill_sb() (via fs->kill_sb() from
deactivate_locked_super()).

--
static struct dentry *
rpc_mount(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
struct net *net = current->nsproxy->net_ns;
return mount_ns(fs_type, flags, data, net, net->user_ns, 
rpc_fill_super);
}
--

syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on bpf-next commit
> 6f1b5a2b58d8470e5a8b25ab29f5fdb46168 (Tue Feb 27 04:11:23 2018 +)
> Merge branch 'bpf-kselftest-improvements'
> 
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+84371b6062cb639d7...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for  
> details.
> If you forward the report, please keep this part and the footer.
> 
> [ cut here ]
> FAULT_INJECTION: forcing a failure.
> name failslab, interval 1, probability 0, space 0, times 0
> refcount_t: underflow; use-after-free.
> CPU: 1 PID: 4239 Comm: syzkaller149381 Not tainted 4.16.0-rc2+ #20
> WARNING: CPU: 0 PID: 4237 at lib/refcount.c:187  
> refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:187
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> Call Trace:
> Kernel panic - not syncing: panic_on_warn set ...
> 
>   __dump_stack lib/dump_stack.c:17 [inline]
>   dump_stack+0x194/0x24d lib/dump_stack.c:53
>   fail_dump lib/fault-inject.c:51 [inline]
>   should_fail+0x8c0/0xa40 lib/fault-inject.c:149
>   should_failslab+0xec/0x120 mm/failslab.c:32
>   slab_pre_alloc_hook mm/slab.h:422 [inline]
>   slab_alloc mm/slab.c:3365 [inline]
>   __do_kmalloc mm/slab.c:3703 [inline]
>   __kmalloc+0x63/0x760 mm/slab.c:3714
>   kmalloc include/linux/slab.h:517 [inline]
>   kzalloc include/linux/slab.h:701 [inline]
>   register_shrinker+0x10e/0x2d0 mm/vmscan.c:268
>   sget_userns+0xbbf/0xe40 fs/super.c:520
>   mount_ns+0x6d/0x190 fs/super.c:1029
>   rpc_mount+0x9e/0xd0 net/sunrpc/rpc_pipe.c:1451
>   mount_fs+0x66/0x2d0 fs/super.c:1222
>   vfs_kern_mount.part.26+0xc6/0x4a0 fs/namespace.c:1037
>   vfs_kern_mount fs/namespace.c:2509 [inline]
>   do_new_mount fs/namespace.c:2512 [inline]
>   do_mount+0xea4/0x2bb0 fs/namespace.c:2842
>   SYSC_mount fs/namespace.c:3058 [inline]
>   SyS_mount+0xab/0x120 fs/namespace.c:3035
>   do_syscall_64+0x280/0x940 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x42/0xb7
> RIP: 0033:0x4460f9
> RSP: 002b:7fbcd769ad78 EFLAGS: 0246 ORIG_RAX: 00a5
> RAX: ffda RBX: 006dcc6c RCX: 004460f9
> RDX: 2080 RSI: 2040 RDI: 2000
> RBP: 7fbcd769ad80 R08: 20c0 R09: 3131
> R10:  R11: 0246 R12: 006dcc68
> R13:  R14: 0037 R15: 0030656c69662f2e
> CPU: 0 PID: 4237 Comm: syzkaller149381 Not tainted 4.16.0-rc2+ #20
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:17 [inline]
>   dump_stack+0x194/0x24d lib/dump_stack.c:53
>   panic+0x1e4/0x41c kernel/panic.c:183
>   __warn+0x1dc/0x200 kernel/panic.c:547
>   report_bug+0x211/0x2d0 lib/bug.c:184
>   fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>   fixup_bug arch/x86/kernel/traps.c:247 [inline]
>   do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>   do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>   invalid_op+0x58/0x80 arch/x86/entry/entry_64.S:957
> RIP: 0010:refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:187
> RSP: 0018:8801b164f6d8 EFLAGS: 00010286
> RAX: dc08 RBX:  RCX: 815ac30e
> RDX:  RSI: 1100362c9e8b RDI: 1100362c9e60
> RBP: 8801b164f768 R08:  R09: 
> R10: 8801b164f610 R11:  R12: 1100362c9edc
> R13:  R14: 0001 R15: 8801ae924044
>   refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
>   put_net include/net/net_namespace.h:220 [inline]
>   rpc_kill_sb+0x253/0x3c0 net/sunrpc/rpc_pipe.c:1473
>

Re: [PATCH v2] lockdep: Fix fs_reclaim warning.

2018-02-19 Thread Tetsuo Handa

Peter, are you OK with this patch?

Tetsuo Handa wrote:
> From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Date: Thu, 8 Feb 2018 10:35:35 +0900
> Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.
> 
> Dave Jones reported fs_reclaim lockdep warnings.
> 
>   
>   WARNING: possible recursive locking detected
>   4.15.0-rc9-backup-debug+ #1 Not tainted
>   
>   sshd/24800 is trying to acquire lock:
>(fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
>   but task is already holding lock:
>(fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
>   other info that might help us debug this:
>Possible unsafe locking scenario:
> 
>  CPU0
>  
> lock(fs_reclaim);
> lock(fs_reclaim);
> 
>*** DEADLOCK ***
> 
>May be due to missing lock nesting notation
> 
>   2 locks held by sshd/24800:
>#0:  (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] 
> tcp_sendmsg+0x19/0x40
>#1:  (fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
>   stack backtrace:
>   CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
>   Call Trace:
>dump_stack+0xbc/0x13f
>__lock_acquire+0xa09/0x2040
>lock_acquire+0x12e/0x350
>fs_reclaim_acquire.part.102+0x29/0x30
>kmem_cache_alloc+0x3d/0x2c0
>alloc_extent_state+0xa7/0x410
>__clear_extent_bit+0x3ea/0x570
>try_release_extent_mapping+0x21a/0x260
>__btrfs_releasepage+0xb0/0x1c0
>btrfs_releasepage+0x161/0x170
>try_to_release_page+0x162/0x1c0
>shrink_page_list+0x1d5a/0x2fb0
>shrink_inactive_list+0x451/0x940
>shrink_node_memcg.constprop.88+0x4c9/0x5e0
>shrink_node+0x12d/0x260
>try_to_free_pages+0x418/0xaf0
>__alloc_pages_slowpath+0x976/0x1790
>__alloc_pages_nodemask+0x52c/0x5c0
>new_slab+0x374/0x3f0
>___slab_alloc.constprop.81+0x47e/0x5a0
>__slab_alloc.constprop.80+0x32/0x60
>__kmalloc_track_caller+0x267/0x310
>__kmalloc_reserve.isra.40+0x29/0x80
>__alloc_skb+0xee/0x390
>sk_stream_alloc_skb+0xb8/0x340
>tcp_sendmsg_locked+0x8e6/0x1d30
>tcp_sendmsg+0x27/0x40
>inet_sendmsg+0xd0/0x310
>sock_write_iter+0x17a/0x240
>__vfs_write+0x2ab/0x380
>vfs_write+0xfb/0x260
>SyS_write+0xb6/0x140
>do_syscall_64+0x1e5/0xc05
>entry_SYSCALL64_slow_path+0x25/0x25
> 
> This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
> FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
> lockdep_clear_current_reclaim_state() in __perform_reclaim() and
> lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
> fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
> propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
> is trying to grab the 'fake' lock again when __perform_reclaim() already
> grabbed the 'fake' lock.
> 
> The
> 
>   /* this guy won't enter reclaim */
>   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>   return false;
> 
> test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
> was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't
> enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
> ("page allocator: calculate the alloc_flags for allocation only once")
> added the PF_MEMALLOC safeguard (
> 
>   /* Avoid recursion of direct reclaim */
>   if (p->flags & PF_MEMALLOC)
>   goto nopage;
> 
> in __alloc_pages_slowpath()).
> 
> Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow
> __need_fs_reclaim() to return false.
> 
> Reported-and-tested-by: Dave Jones <da...@codemonkey.org.uk>
> Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Nick Piggin <npig...@gmail.com>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 81e18ce..19fb76b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
>   return false;
>  
>   /* this guy won't enter reclaim */
> - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> + if (current->flags & PF_MEMALLOC)
>   return false;
>  
>   /* We're only interested __GFP_FS allocations for now */
> -- 
> 1.8.3.1
>

Re: [PATCH v2] lockdep: Fix fs_reclaim warning.

2018-02-12 Thread Tetsuo Handa

Nikolay Borisov wrote:
> I think I've hit another incarnation of that one. The call stack is:
> http://paste.opensuse.org/3f22d013
> 
> The cleaned up callstack of all the ? entries look like:
> 
> __lock_acquire+0x2d8a/0x4b70
> lock_acquire+0x110/0x330
> kmem_cache_alloc+0x29/0x2c0
> __clear_extent_bit+0x488/0x800
> try_release_extent_mapping+0x288/0x3c0
> __btrfs_releasepage+0x6c/0x140
> shrink_page_list+0x227e/0x3110
> shrink_inactive_list+0x414/0xdb0
> shrink_node_memcg+0x7c8/0x1250
> shrink_node+0x2ae/0xb50
> do_try_to_free_pages+0x2b1/0xe20
> try_to_free_pages+0x205/0x570
>  __alloc_pages_nodemask+0xb91/0x2160
> new_slab+0x27a/0x4e0
> ___slab_alloc+0x355/0x610
>  __slab_alloc+0x4c/0xa0
> kmem_cache_alloc+0x22d/0x2c0
> mempool_alloc+0xe1/0x280

Yes, for mempool_alloc() is adding __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask.

gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
gfp_mask |= __GFP_NORETRY;  /* don't loop in __alloc_pages */
gfp_mask |= __GFP_NOWARN;   /* failures are OK */

> bio_alloc_bioset+0x1d7/0x830
> ext4_mpage_readpages+0x99f/0x1000 <-
> __do_page_cache_readahead+0x4be/0x840
> filemap_fault+0x8c8/0xfc0
> ext4_filemap_fault+0x7d/0xb0
> __do_fault+0x7a/0x150
> __handle_mm_fault+0x1542/0x29d0
> __do_page_fault+0x557/0xa30
> async_page_fault+0x4c/0x60

[PATCH v2] lockdep: Fix fs_reclaim warning.

2018-02-08 Thread Tetsuo Handa

>From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Thu, 8 Feb 2018 10:35:35 +0900
Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.

Dave Jones reported fs_reclaim lockdep warnings.

  
  WARNING: possible recursive locking detected
  4.15.0-rc9-backup-debug+ #1 Not tainted
  
  sshd/24800 is trying to acquire lock:
   (fs_reclaim){+.+.}, at: [<84f438c2>] 
fs_reclaim_acquire.part.102+0x5/0x30

  but task is already holding lock:
   (fs_reclaim){+.+.}, at: [<84f438c2>] 
fs_reclaim_acquire.part.102+0x5/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

 CPU0
 
lock(fs_reclaim);
lock(fs_reclaim);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  2 locks held by sshd/24800:
   #0:  (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] tcp_sendmsg+0x19/0x40
   #1:  (fs_reclaim){+.+.}, at: [<84f438c2>] 
fs_reclaim_acquire.part.102+0x5/0x30

  stack backtrace:
  CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
  Call Trace:
   dump_stack+0xbc/0x13f
   __lock_acquire+0xa09/0x2040
   lock_acquire+0x12e/0x350
   fs_reclaim_acquire.part.102+0x29/0x30
   kmem_cache_alloc+0x3d/0x2c0
   alloc_extent_state+0xa7/0x410
   __clear_extent_bit+0x3ea/0x570
   try_release_extent_mapping+0x21a/0x260
   __btrfs_releasepage+0xb0/0x1c0
   btrfs_releasepage+0x161/0x170
   try_to_release_page+0x162/0x1c0
   shrink_page_list+0x1d5a/0x2fb0
   shrink_inactive_list+0x451/0x940
   shrink_node_memcg.constprop.88+0x4c9/0x5e0
   shrink_node+0x12d/0x260
   try_to_free_pages+0x418/0xaf0
   __alloc_pages_slowpath+0x976/0x1790
   __alloc_pages_nodemask+0x52c/0x5c0
   new_slab+0x374/0x3f0
   ___slab_alloc.constprop.81+0x47e/0x5a0
   __slab_alloc.constprop.80+0x32/0x60
   __kmalloc_track_caller+0x267/0x310
   __kmalloc_reserve.isra.40+0x29/0x80
   __alloc_skb+0xee/0x390
   sk_stream_alloc_skb+0xb8/0x340
   tcp_sendmsg_locked+0x8e6/0x1d30
   tcp_sendmsg+0x27/0x40
   inet_sendmsg+0xd0/0x310
   sock_write_iter+0x17a/0x240
   __vfs_write+0x2ab/0x380
   vfs_write+0xfb/0x260
   SyS_write+0xb6/0x140
   do_syscall_64+0x1e5/0xc05
   entry_SYSCALL64_slow_path+0x25/0x25

This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
lockdep_clear_current_reclaim_state() in __perform_reclaim() and
lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
__GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
is trying to grab the 'fake' lock again when __perform_reclaim() already
grabbed the 'fake' lock.

The

  /* this guy won't enter reclaim */
  if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
  return false;

test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
(__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't
enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
("page allocator: calculate the alloc_flags for allocation only once")
added the PF_MEMALLOC safeguard (

  /* Avoid recursion of direct reclaim */
  if (p->flags & PF_MEMALLOC)
  goto nopage;

in __alloc_pages_slowpath()).

Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow
__need_fs_reclaim() to return false.

Reported-and-tested-by: Dave Jones <da...@codemonkey.org.uk>
Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Nick Piggin <npig...@gmail.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 81e18ce..19fb76b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
return false;
 
/* this guy won't enter reclaim */
-   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+   if (current->flags & PF_MEMALLOC)
return false;
 
/* We're only interested __GFP_FS allocations for now */
-- 
1.8.3.1

Re: [4.15-rc9] fs_reclaim lockdep trace

2018-02-01 Thread Tetsuo Handa

Peter Zijlstra wrote:
> On Mon, Jan 29, 2018 at 08:47:20PM +0900, Tetsuo Handa wrote:
> > Peter Zijlstra wrote:
> > > On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> > > > This warning seems to be caused by commit d92a8cfcb37ecd13
> > > > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> > > > location of
> > > > 
> > > >   /* this guy won't enter reclaim */
> > > >   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> > > >   return false;
> > > > 
> > > > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim 
> > > > context
> > > > (__GFP_NOFS)").
> > > 
> > > I'm not entirly sure I get what you mean here. How did I move it? It was
> > > part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not
> > > mark the lock as held.
> > 
> > d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with
> > fs_reclaim_acquire(), and removed current->lockdep_recursion handling.
> > 
> > --
> > # git show d92a8cfcb37ecd13 | grep recursion
> > -# define INIT_LOCKDEP  .lockdep_recursion = 0, 
> > .lockdep_reclaim_gfp = 0,
> > +# define INIT_LOCKDEP  .lockdep_recursion = 0,
> > unsigned intlockdep_recursion;
> > -   if (unlikely(current->lockdep_recursion))
> > -   current->lockdep_recursion = 1;
> > -   current->lockdep_recursion = 0;
> > -* context checking code. This tests GFP_FS recursion (a lock taken
> > --
> 
> That should not matter at all. The only case that would matter for is if
> lockdep itself would ever call into lockdep again. Not something that
> happens here.
> 
> > > The new code has it in fs_reclaim_acquire/release to the same effect, if
> > > __GFP_NOMEMALLOC, we'll not acquire/release the lock.
> > 
> > Excuse me, but I can't catch.
> > We currently acquire/release __fs_reclaim_map if __GFP_NOMEMALLOC.
> 
> Right, got the case inverted, same difference though. Before we'd do
> mark_held_lock(), now we do acquire/release under the same conditions.
> 
> > > > Since __kmalloc_reserve() from __alloc_skb() adds
> > > > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> > > > failing to return false despite PF_MEMALLOC context (and resulted in
> > > > lockdep warning).
> > > 
> > > But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC.
> > > That's what the name says.
> > 
> > __GFP_NOMEMALLOC negates PF_MEMALLOC regarding what watermark that 
> > allocation
> > request should use.
> 
> Right.
> 
> > But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM.
> 
> Ah indeed.
> 
> > Then, how can fs_reclaim contribute to deadlock?
> 
> Not sure it can. But if we're going to allow this, it needs to come with
> a clear description on why. Not a few clues to a puzzle.
> 

Let's decode Dave's report.

--
stack backtrace:
CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
Call Trace:
 dump_stack+0xbc/0x13f
 __lock_acquire+0xa09/0x2040
 lock_acquire+0x12e/0x350
 fs_reclaim_acquire.part.102+0x29/0x30
 kmem_cache_alloc+0x3d/0x2c0
 alloc_extent_state+0xa7/0x410
 __clear_extent_bit+0x3ea/0x570
 try_release_extent_mapping+0x21a/0x260
 __btrfs_releasepage+0xb0/0x1c0
 btrfs_releasepage+0x161/0x170
 try_to_release_page+0x162/0x1c0
 shrink_page_list+0x1d5a/0x2fb0
 shrink_inactive_list+0x451/0x940
 shrink_node_memcg.constprop.88+0x4c9/0x5e0
 shrink_node+0x12d/0x260
 try_to_free_pages+0x418/0xaf0
 __alloc_pages_slowpath+0x976/0x1790
 __alloc_pages_nodemask+0x52c/0x5c0
 new_slab+0x374/0x3f0
 ___slab_alloc.constprop.81+0x47e/0x5a0
 __slab_alloc.constprop.80+0x32/0x60
 __kmalloc_track_caller+0x267/0x310
 __kmalloc_reserve.isra.40+0x29/0x80
 __alloc_skb+0xee/0x390
 sk_stream_alloc_skb+0xb8/0x340
--

struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp, bool 
force_schedule) {
  skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp) = { // gfp == 
GFP_KERNEL
static inline struct sk_buff *alloc_skb_fclone(unsigned int size, gfp_t 
priority) { // priority == GFP_KERNEL
  return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE) = {
data = kmalloc_reserve(size, gfp_mask, node, ) = { // 
gfp_mask == GFP_KERNEL
  obj = kmalloc_node_track_caller(size, flags | __GFP_NOMEMALLOC | 
__GFP_NOWARN, node) = { // flags == GFP_KERNEL
__kmalloc_node_track_caller(size, GF

Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-29 Thread Tetsuo Handa

Peter Zijlstra wrote:
> On Sun, Jan 28, 2018 at 02:55:28PM +0900, Tetsuo Handa wrote:
> > This warning seems to be caused by commit d92a8cfcb37ecd13
> > ("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
> > location of
> > 
> >   /* this guy won't enter reclaim */
> >   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> >   return false;
> > 
> > check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> > (__GFP_NOFS)").
> 
> I'm not entirly sure I get what you mean here. How did I move it? It was
> part of lockdep_trace_alloc(), if __GFP_NOMEMALLOC was set, it would not
> mark the lock as held.

d92a8cfcb37ecd13 replaced lockdep_set_current_reclaim_state() with
fs_reclaim_acquire(), and removed current->lockdep_recursion handling.

--
# git show d92a8cfcb37ecd13 | grep recursion
-# define INIT_LOCKDEP  .lockdep_recursion = 0, 
.lockdep_reclaim_gfp = 0,
+# define INIT_LOCKDEP  .lockdep_recursion = 0,
unsigned intlockdep_recursion;
-   if (unlikely(current->lockdep_recursion))
-   current->lockdep_recursion = 1;
-   current->lockdep_recursion = 0;
-* context checking code. This tests GFP_FS recursion (a lock taken
--

> 
> The new code has it in fs_reclaim_acquire/release to the same effect, if
> __GFP_NOMEMALLOC, we'll not acquire/release the lock.

Excuse me, but I can't catch.
We currently acquire/release __fs_reclaim_map if __GFP_NOMEMALLOC.

--
+static bool __need_fs_reclaim(gfp_t gfp_mask)
+{
(...snipped...)
+   /* this guy won't enter reclaim */
+   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+   return false;
(...snipped...)
+}
--

> 
> 
> > Since __kmalloc_reserve() from __alloc_skb() adds
> > __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
> > failing to return false despite PF_MEMALLOC context (and resulted in
> > lockdep warning).
> 
> But that's correct right, __GFP_NOMEMALLOC should negate PF_MEMALLOC.
> That's what the name says.

__GFP_NOMEMALLOC negates PF_MEMALLOC regarding what watermark that allocation
request should use.

--
static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
{
if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
return 0;
if (gfp_mask & __GFP_MEMALLOC)
return ALLOC_NO_WATERMARKS;
if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
return ALLOC_NO_WATERMARKS;
if (!in_interrupt()) {
if (current->flags & PF_MEMALLOC)
return ALLOC_NO_WATERMARKS;
else if (oom_reserves_allowed(current))
return ALLOC_OOM;
}

return 0;
}
--

But at the same time, PF_MEMALLOC negates __GFP_DIRECT_RECLAIM.

--
/* Attempt with potentially adjusted zonelist and alloc_flags */
page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
if (page)
goto got_pg;

/* Caller is not willing to reclaim, we can't balance anything */
if (!can_direct_reclaim)
goto nopage;

/* Avoid recursion of direct reclaim */
if (current->flags & PF_MEMALLOC)
goto nopage;

/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
_some_progress);
if (page)
goto got_pg;

/* Try direct compaction and then allocating */
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
compact_priority, _result);
if (page)
goto got_pg;

/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
goto nopage;
--

Then, how can fs_reclaim contribute to deadlock?

> 
> > Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking
> > __GFP_NOMEMALLOC might make sense. But since this safeguard was added by
> > commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for
> > allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense.
> > Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to
> > return false.
> 
> This does not in fact explain what's going on, it just points to
> 'random' patches.
> 
> Are you talking about this:
> 
> +   /* Avoid recursion of direct reclaim */
> +   if (p->flags & PF_MEMALLOC)
&

Re: kernel panic: Out of memory and no killable processes... (2)

2018-01-28 Thread Tetsuo Handa

syzbot wrote:
> syzbot hit the following crash on net-next commit
> 6bb46bc57c8e9ce947cc605e555b7204b44d2b10 (Fri Jan 26 16:00:23 2018 +)
> Merge branch 'cxgb4-fix-dump-collection-when-firmware-crashed'
> 
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+8630e35fc7287b392...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for  
> details.
> If you forward the report, please keep this part and the footer.
> 
> [ 3685] 0  3685178211   1843200 0 sshd
> [ 3692] 0  3692 43760327680 0  
> syzkaller025682
> [ 3695] 0  3695 43760368640 0  
> syzkaller025682
> Kernel panic - not syncing: Out of memory and no killable processes...
> 

This sounds like too huge vmalloc() request where size is controlled by 
userspace.

--
[   27.738855] syzkaller025682 invoked oom-killer: 
gfp_mask=0x14002c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN), nodemask=(null), 
order=0, oom_score_adj=0
[   27.754960] syzkaller025682 cpuset=/ mems_allowed=0
[   27.760386] CPU: 0 PID: 3689 Comm: syzkaller025682 Not tainted 4.15.0-rc9+ 
#212
[   27.767820] Hardware name: Google Google Compute Engine/Google Compute 
Engine, BIOS Google 01/01/2011
[   27.777166] Call Trace:
[   27.779739]  dump_stack+0x194/0x257
[   27.793194]  dump_header+0x28c/0xe1e
[   27.888997]  oom_kill_process+0x8b5/0x14a0
[   27.981648]  out_of_memory+0x86d/0x1220
[   28.003684]  __alloc_pages_slowpath+0x1d1b/0x2d00
[   28.054140]  __alloc_pages_nodemask+0x9fb/0xd80
[   28.090590]  alloc_pages_current+0xb6/0x1e0
[   28.094927]  __vmalloc_node_range+0x409/0x650
[   28.103837]  __vmalloc_node_flags_caller+0x50/0x60
[   28.113166]  kvmalloc_node+0x82/0xd0
[   28.116869]  xt_alloc_table_info+0x64/0xe0
[   28.121097]  do_ip6t_set_ctl+0x29b/0x5f0
[   28.139158]  nf_setsockopt+0x67/0xc0
[   28.142862]  ipv6_setsockopt+0x115/0x150
[   28.146912]  udpv6_setsockopt+0x45/0x80
[   28.150867]  sock_common_setsockopt+0x95/0xd0
[   28.155359]  SyS_setsockopt+0x189/0x360
[   28.177379]  entry_SYSCALL_64_fastpath+0x29/0xa0
--

struct xt_table_info *xt_alloc_table_info(unsigned int size)
{
(...snipped...)
info = kvmalloc(sz, GFP_KERNEL);
(...snipped...)
}

static int
do_ip6t_set_ctl(struct sock *sk, int cmd, void __user *user, unsigned int len)
{
int ret;

if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
return -EPERM;

switch (cmd) {
case IP6T_SO_SET_REPLACE:
ret = do_replace(sock_net(sk), user, len);
break;

case IP6T_SO_SET_ADD_COUNTERS:
ret = do_add_counters(sock_net(sk), user, len, 0);
break;

default:
ret = -EINVAL;
}

return ret;
}

vmalloc() once became killable by commit 5d17a73a2ebeb8d1 ("vmalloc: back
off when the current task is killed") but then became unkillable by commit
b8c8a338f75e052d ("Revert "vmalloc: back off when the current task is
killed""). Therefore, we can't handle this problem from MM side.
Please consider adding some limit from networking side.

Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-27 Thread Tetsuo Handa

Dave, would you try below patch?



>From cae2cbf389ae3cdef1b492622722b4aeb07eb284 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Date: Sun, 28 Jan 2018 14:17:14 +0900
Subject: [PATCH] lockdep: Fix fs_reclaim warning.

Dave Jones reported fs_reclaim lockdep warnings.

  
  WARNING: possible recursive locking detected
  4.15.0-rc9-backup-debug+ #1 Not tainted
  
  sshd/24800 is trying to acquire lock:
   (fs_reclaim){+.+.}, at: [<84f438c2>] 
fs_reclaim_acquire.part.102+0x5/0x30

  but task is already holding lock:
   (fs_reclaim){+.+.}, at: [<84f438c2>] 
fs_reclaim_acquire.part.102+0x5/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

 CPU0
 
lock(fs_reclaim);
lock(fs_reclaim);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  2 locks held by sshd/24800:
   #0:  (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] tcp_sendmsg+0x19/0x40
   #1:  (fs_reclaim){+.+.}, at: [<84f438c2>] 
fs_reclaim_acquire.part.102+0x5/0x30

  stack backtrace:
  CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
  Call Trace:
   dump_stack+0xbc/0x13f
   __lock_acquire+0xa09/0x2040
   lock_acquire+0x12e/0x350
   fs_reclaim_acquire.part.102+0x29/0x30
   kmem_cache_alloc+0x3d/0x2c0
   alloc_extent_state+0xa7/0x410
   __clear_extent_bit+0x3ea/0x570
   try_release_extent_mapping+0x21a/0x260
   __btrfs_releasepage+0xb0/0x1c0
   btrfs_releasepage+0x161/0x170
   try_to_release_page+0x162/0x1c0
   shrink_page_list+0x1d5a/0x2fb0
   shrink_inactive_list+0x451/0x940
   shrink_node_memcg.constprop.88+0x4c9/0x5e0
   shrink_node+0x12d/0x260
   try_to_free_pages+0x418/0xaf0
   __alloc_pages_slowpath+0x976/0x1790
   __alloc_pages_nodemask+0x52c/0x5c0
   new_slab+0x374/0x3f0
   ___slab_alloc.constprop.81+0x47e/0x5a0
   __slab_alloc.constprop.80+0x32/0x60
   __kmalloc_track_caller+0x267/0x310
   __kmalloc_reserve.isra.40+0x29/0x80
   __alloc_skb+0xee/0x390
   sk_stream_alloc_skb+0xb8/0x340
   tcp_sendmsg_locked+0x8e6/0x1d30
   tcp_sendmsg+0x27/0x40
   inet_sendmsg+0xd0/0x310
   sock_write_iter+0x17a/0x240
   __vfs_write+0x2ab/0x380
   vfs_write+0xfb/0x260
   SyS_write+0xb6/0x140
   do_syscall_64+0x1e5/0xc05
   entry_SYSCALL64_slow_path+0x25/0x25

Since no fs locks are held, doing GFP_KERNEL allocation should be safe
as long as there is PF_MEMALLOC safeguard (

  /* Avoid recursion of direct reclaim */
  if (p->flags & PF_MEMALLOC)
  goto nopage;

) which prevents infinite recursion.

This warning seems to be caused by commit d92a8cfcb37ecd13
("locking/lockdep: Rework FS_RECLAIM annotation") which moved the
location of

  /* this guy won't enter reclaim */
  if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
  return false;

check added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
(__GFP_NOFS)"). Since __kmalloc_reserve() from __alloc_skb() adds
__GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, __need_fs_reclaim() is
failing to return false despite PF_MEMALLOC context (and resulted in
lockdep warning).

Since there was no PF_MEMALLOC safeguard as of cf40bd16fdad42c0, checking
__GFP_NOMEMALLOC might make sense. But since this safeguard was added by
commit 341ce06f69abfafa ("page allocator: calculate the alloc_flags for
allocation only once"), checking __GFP_NOMEMALLOC no longer makes sense.
Thus, let's remove __GFP_NOMEMALLOC check and allow __need_fs_reclaim() to
return false.

Reported-by: Dave Jones <da...@codemonkey.org.uk>
Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Nick Piggin <npig...@gmail.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c9688..7804b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3583,7 +3583,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
return false;
 
/* this guy won't enter reclaim */
-   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+   if (current->flags & PF_MEMALLOC)
return false;
 
/* We're only interested __GFP_FS allocations for now */
-- 
1.8.3.1

Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-27 Thread Tetsuo Handa

On 2018/01/28 10:16, Tetsuo Handa wrote:
> Linus Torvalds wrote:
>> On Sat, Jan 27, 2018 at 2:24 PM, Dave Jones <da...@codemonkey.org.uk> wrote:
>>> On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote:
>>>  > Just triggered this on a server I was rsync'ing to.
>>>
>>> Actually, I can trigger this really easily, even with an rsync from one
>>> disk to another.  Though that also smells a little like networking in
>>> the traces. Maybe netdev has ideas.
>>
>> Is this new to 4.15? Or is it just that you're testing something new?
>>
>> If it's new and easy to repro, can you just bisect it? And if it isn't
>> new, can you perhaps check whether it's new to 4.14 (ie 4.13 being
>> ok)?
>>
>> Because that fs_reclaim_acquire/release() debugging isn't new to 4.15,
>> but it was rewritten for 4.14.. I'm wondering if that remodeling ended
>> up triggering something.
> 
> --- linux-4.13.16/mm/page_alloc.c
> +++ linux-4.14.15/mm/page_alloc.c

Oops. This output was inverted.

> @@ -3527,53 +3519,12 @@
>   return true;
>   }
>   return false;
>  }
>  #endif /* CONFIG_COMPACTION */
>  
> -#ifdef CONFIG_LOCKDEP
> -struct lockdep_map __fs_reclaim_map =
> - STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
> -
> -static bool __need_fs_reclaim(gfp_t gfp_mask)
> -{
> - gfp_mask = current_gfp_context(gfp_mask);
> -
> - /* no reclaim without waiting on it */
> - if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
> - return false;
> -
> - /* this guy won't enter reclaim */
> - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> - return false;

Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC | 
__GFP_NOWARN
to gfp_mask, __need_fs_reclaim() is failing to return false here.

But why checking __GFP_NOMEMALLOC here? __alloc_pages_slowpath() skips direct
reclaim if !(gfp_mask & __GFP_DIRECT_RECLAIM) or (current->flags & PF_MEMALLOC),
doesn't it?

--
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
{
(...snipped...)
/* Caller is not willing to reclaim, we can't balance anything */
if (!can_direct_reclaim)
goto nopage;

/* Avoid recursion of direct reclaim */
if (current->flags & PF_MEMALLOC)
goto nopage;

/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
_some_progress);
if (page)
goto got_pg;
(...snipped...)
}
--

Re: [4.15-rc9] fs_reclaim lockdep trace

2018-01-27 Thread Tetsuo Handa

Linus Torvalds wrote:
> On Sat, Jan 27, 2018 at 2:24 PM, Dave Jones  wrote:
>> On Tue, Jan 23, 2018 at 08:36:51PM -0500, Dave Jones wrote:
>>  > Just triggered this on a server I was rsync'ing to.
>>
>> Actually, I can trigger this really easily, even with an rsync from one
>> disk to another.  Though that also smells a little like networking in
>> the traces. Maybe netdev has ideas.
> 
> Is this new to 4.15? Or is it just that you're testing something new?
> 
> If it's new and easy to repro, can you just bisect it? And if it isn't
> new, can you perhaps check whether it's new to 4.14 (ie 4.13 being
> ok)?
> 
> Because that fs_reclaim_acquire/release() debugging isn't new to 4.15,
> but it was rewritten for 4.14.. I'm wondering if that remodeling ended
> up triggering something.

--- linux-4.13.16/mm/page_alloc.c
+++ linux-4.14.15/mm/page_alloc.c
@@ -3527,53 +3519,12 @@
return true;
}
return false;
 }
 #endif /* CONFIG_COMPACTION */
 
-#ifdef CONFIG_LOCKDEP
-struct lockdep_map __fs_reclaim_map =
-   STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);
-
-static bool __need_fs_reclaim(gfp_t gfp_mask)
-{
-   gfp_mask = current_gfp_context(gfp_mask);
-
-   /* no reclaim without waiting on it */
-   if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
-   return false;
-
-   /* this guy won't enter reclaim */
-   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
-   return false;
-
-   /* We're only interested __GFP_FS allocations for now */
-   if (!(gfp_mask & __GFP_FS))
-   return false;
-
-   if (gfp_mask & __GFP_NOLOCKDEP)
-   return false;
-
-   return true;
-}
-
-void fs_reclaim_acquire(gfp_t gfp_mask)
-{
-   if (__need_fs_reclaim(gfp_mask))
-   lock_map_acquire(&__fs_reclaim_map);
-}
-EXPORT_SYMBOL_GPL(fs_reclaim_acquire);
-
-void fs_reclaim_release(gfp_t gfp_mask)
-{
-   if (__need_fs_reclaim(gfp_mask))
-   lock_map_release(&__fs_reclaim_map);
-}
-EXPORT_SYMBOL_GPL(fs_reclaim_release);
-#endif
-
 /* Perform direct synchronous page reclaim */
 static int
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
const struct alloc_context *ac)
 {
struct reclaim_state reclaim_state;
@@ -3582,21 +3533,21 @@
 
cond_resched();
 
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
noreclaim_flag = memalloc_noreclaim_save();
-   fs_reclaim_acquire(gfp_mask);
+   lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
current->reclaim_state = _state;
 
progress = try_to_free_pages(ac->zonelist, order, gfp_mask,
ac->nodemask);
 
current->reclaim_state = NULL;
-   fs_reclaim_release(gfp_mask);
+   lockdep_clear_current_reclaim_state();
memalloc_noreclaim_restore(noreclaim_flag);
 
cond_resched();
 
return progress;
 }

> 
> Adding PeterZ to the participants list in case he has ideas. I'm not
> seeing what would be the problem in that call chain from hell.
> 
>Linus

Dave Jones wrote:
> 
> WARNING: possible recursive locking detected
> 4.15.0-rc9-backup-debug+ #1 Not tainted
> 
> sshd/24800 is trying to acquire lock:
>  (fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
> but task is already holding lock:
>  (fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
> other info that might help us debug this:
>  Possible unsafe locking scenario:
> 
>CPU0
>
>   lock(fs_reclaim);
>   lock(fs_reclaim);
> 
>  *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 2 locks held by sshd/24800:
>  #0:  (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] tcp_sendmsg+0x19/0x40
>  #1:  (fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
> stack backtrace:
> CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
> Call Trace:
>  dump_stack+0xbc/0x13f
>  __lock_acquire+0xa09/0x2040
>  lock_acquire+0x12e/0x350
>  fs_reclaim_acquire.part.102+0x29/0x30
>  kmem_cache_alloc+0x3d/0x2c0
>  alloc_extent_state+0xa7/0x410
>  __clear_extent_bit+0x3ea/0x570
>  try_release_extent_mapping+0x21a/0x260
>  __btrfs_releasepage+0xb0/0x1c0
>  btrfs_releasepage+0x161/0x170
>  try_to_release_page+0x162/0x1c0
>  shrink_page_list+0x1d5a/0x2fb0
>  shrink_inactive_list+0x451/0x940
>  shrink_node_memcg.constprop.88+0x4c9/0x5e0
>  shrink_node+0x12d/0x260
>  try_to_free_pages+0x418/0xaf0
>  __alloc_pages_slowpath+0x976/0x1790
>  __alloc_pages_nodemask+0x52c/0x5c0
>  new_slab+0x374/0x3f0
>  ___slab_alloc.constprop.81+0x47e/0x5a0
>

[linux-next] NULL pointer dereference in rt6_fill_node

2017-01-30 Thread Tetsuo Handa

Hello.

I'm hitting this with linux-next-20170125. Is this known?

[  OK  ] Started Show Plymouth Reboot Screen.

[  OK  ] Stopped PostgreSQL database server.

[  OK  ] Stopped Dynamic System Tuning Daemon.

[  OK  ] Stopped target Network.

 Stopping LSB: Bring up/down networking...

[  828.169782] BUG: unable to handle kernel NULL pointer dereference at 
02a4
[  828.172557] IP: rt6_fill_node.isra.60+0x181/0x5c0
[  828.174211] PGD 0
[  828.174212]
[  828.175467] Oops:  [#1] SMP DEBUG_PAGEALLOC
[  828.177068] Modules linked in: nf_conntrack_netbios_ns 
nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT 
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge 
stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter 
ebtables ip6table_filter ip6_tables iptable_filter coretemp vmw_balloon pcspkr 
vmw_vmci i2c_piix4 shpchp ip_tables xfs libcrc32c sr_mod cdrom ata_generic 
sd_mod pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect 
sysimgblt mptspi fb_sys_fops scsi_transport_spi ttm ata_piix mptscsih drm 
i2c_core e1000 mptbase libata
[  828.200197] CPU: 3 PID: 539 Comm: NetworkManager Not tainted 
4.10.0-rc5-next-20170125+ #45
[  828.203094] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/31/2013
[  828.206835] task: 88006f984100 task.stack: c90005afc000
[  828.208934] RIP: 0010:rt6_fill_node.isra.60+0x181/0x5c0
[  828.210701] RSP: 0018:c90005aff938 EFLAGS: 00010246
[  828.212485] RAX:  RBX: 88006c71ae00 RCX: 
[  828.214933] RDX:  RSI: c90005aff95c RDI: 880071773ee0
[  828.217394] RBP: c90005aff9e0 R08:  R09: 0004
[  828.219945] R10: 880071773ee4 R11:  R12: 880071983f00
[  828.222381] R13:  R14:  R15: 880071773ec0
[  828.224920] FS:  7ffaa9733940() GS:88007580() 
knlGS:
[  828.227664] CS:  0010 DS:  ES:  CR0: 80050033
[  828.229670] CR2: 02a4 CR3: 71269000 CR4: 001406e0
[  828.232149] Call Trace:
[  828.232998]  ? slab_kernel_map+0xa0/0x130
[  828.234468]  rt6_dump_route+0x60/0x70
[  828.235717]  fib6_dump_node+0x2e/0x60
[  828.237005]  fib6_walk_continue+0x124/0x140
[  828.238425]  fib6_walk+0x64/0x90
[  828.239567]  ? _raw_read_lock_bh+0x6e/0x80
[  828.240962]  inet6_dump_fib+0x237/0x370
[  828.242309]  ? inet6_dump_fib+0x4b/0x370
[  828.243646]  netlink_dump+0x126/0x280
[  828.244963]  __netlink_dump_start+0x16f/0x210
[  828.246464]  ? fib6_flush_trees+0x40/0x40
[  828.247848]  rtnetlink_rcv_msg+0x13b/0x230
[  828.249275]  ? fib6_flush_trees+0x40/0x40
[  828.250641]  ? rtnl_newlink+0x880/0x880
[  828.252021]  netlink_rcv_skb+0x99/0xb0
[  828.253399]  rtnetlink_rcv+0x25/0x30
[  828.254646]  netlink_unicast+0x16d/0x200
[  828.255982]  netlink_sendmsg+0x36b/0x390
[  828.257348]  ___sys_sendmsg+0x1c9/0x2c0
[  828.259271]  ? ___cache_free+0x14a/0x5b0
[  828.261112]  ? __fget+0x117/0x210
[  828.262777]  ? expand_files+0x2e0/0x2e0
[  828.264657]  __sys_sendmsg+0x4f/0x90
[  828.266399]  SyS_sendmsg+0xd/0x20
[  828.268042]  do_syscall_64+0x67/0x1f0
[  828.269810]  entry_SYSCALL64_slow_path+0x25/0x25
[  828.271885] RIP: 0033:0x7ffaa7f639fd
[  828.273594] RSP: 002b:7ffc0e0c13b0 EFLAGS: 0293 ORIG_RAX: 
002e
[  828.276704] RAX: ffda RBX: 55f219737bd0 RCX: 7ffaa7f639fd
[  828.279712] RDX:  RSI: 7ffc0e0c1460 RDI: 000c
[  828.282623] RBP: 55f219737ae0 R08: 0001 R09: 1000
[  828.286019] R10: 0063 R11: 0293 R12: 55f21977e4f0
[  828.288921] R13: 7ffc0e0c1460 R14:  R15: 55f21977e4f0
[  828.291766] Code: c7 47 18 00 00 00 00 48 8b 43 18 48 8b 50 48 83 e2 01 74 
08 48 8b 40 48 a8 04 74 21 41 c7 47 18 10 00 00 00 48 8b 83 58 01 00 00 <8b> 88 
a4 02 00 00 85 c9 74 08 41 c7 47 18 11 00 00 00 41 c6 47
[  828.299024] RIP: rt6_fill_node.isra.60+0x181/0x5c0 RSP: c90005aff938
[  828.301732] CR2: 02a4


2861:   48 8b 50 48 mov0x48(%rax),%rdx
2865:   83 e2 01and$0x1,%edx
2868:   74 08   je 2872 

286a:   48 8b 40 48 mov0x48(%rax),%rax
286e:   a8 04   test   $0x4,%al
2870:   74 21   je 2893 

rtm->rtm_flags |= RTNH_F_LINKDOWN;
2872:   41 c7 47 18 10 00 00movl   $0x10,0x18(%r15)
2879:   00
if (rt->rt6i_idev->cnf.ignore_routes_with_linkdown)

[PATCH] netfilter: x_tables: fix kmemcheck warning.

2016-07-23 Thread Tetsuo Handa

kmemcheck complains that some of struct nf_hook_ops members allocated at
xt_hook_ops_alloc() are not initialized before nf_register_net_hook() is
called. Add __GFP_ZERO to initialize explicitly.

[  367.411936] nf_conntrack version 0.5.0 (6144 buckets, 24576 max)
[  367.458540] ip_tables: (C) 2000-2006 Netfilter Core Team
[  367.463977] WARNING: kmemcheck: Caught 64-bit read from uninitialized memory 
(88003af7f300)
[  367.465633] 303f5381
[  367.468185]  u u u u u u u u u u u u u u u u i i i i i i i i u u u u u u u u
[  367.470687]  ^
[  367.471079] RIP: 0010:[]  [] 
nf_register_net_hook+0x2f/0x160
[  367.472846] RSP: 0018:88003f5abcb0  EFLAGS: 00010286
[  367.473821] RAX: 88003adb73c0 RBX: 88003af7f300 RCX: 
[  367.475122] RDX:  RSI: 0001 RDI: 88003adb7400
[  367.476485] RBP: 88003f5abcc8 R08: 0067 R09: 
[  367.477764] R10: 88003adb83c0 R11:  R12: 81876760
[  367.479053] R13: 88003adb73c0 R14: 0003 R15: 88003af7f300
[  367.480351] FS:  () GS:8182c000() 
knlGS:
[  367.482018] CS:  0010 DS:  ES:  CR0: 80050033
[  367.483069] CR2: 88003f404280 CR3: 3ad3 CR4: 001406f0
[  367.484357]  [] nf_register_net_hooks+0x3c/0xa0
[  367.485617]  [] ipt_register_table+0xef/0x130
[  367.486765]  [] iptable_filter_table_init.part.1+0x51/0x70
[  367.488125]  [] iptable_filter_net_init+0x2a/0x30
[  367.489375]  [] ops_init+0x3c/0x130
[  367.490410]  [] register_pernet_operations+0x108/0x1c0
[  367.491746]  [] register_pernet_subsys+0x23/0x40
[  367.492933]  [] iptable_filter_init+0x33/0x4b
[  367.494053]  [] do_one_initcall+0x4a/0x180
[  367.495168]  [] kernel_init_freeable+0x15b/0x201
[  367.496332]  [] kernel_init+0x9/0x100
[  367.497369]  [] ret_from_fork+0x1f/0x40
[  367.498449]  [] 0x
[  367.499523] Initializing XFRM netlink socket
[  367.500404] NET: Registered protocol family 10
[  367.501792] ip6_tables: (C) 2000-2006 Netfilter Core Team
[  367.502971] NET: Registered protocol family 17

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 net/netfilter/x_tables.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index e0aa7c1..fc261fe 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1513,7 +1513,7 @@ xt_hook_ops_alloc(const struct xt_table *table, nf_hookfn 
*fn)
if (!num_hooks)
return ERR_PTR(-EINVAL);
 
-   ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL);
+   ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL | __GFP_ZERO);
if (ops == NULL)
return ERR_PTR(-ENOMEM);
 
-- 
1.8.3.1

Re: [PATCH] vmxnet3: avoid calling pskb_may_pull with interrupts disabled

2016-03-11 Thread Tetsuo Handa

Neil Horman wrote:
> On Mon, Mar 07, 2016 at 03:16:14PM -0500, David Miller wrote:
> > From: Neil Horman 
> > Date: Fri,  4 Mar 2016 13:40:48 -0500

This patch is calling spin_unlock_irqrestore() without spin_lock_irqsave().

In file included from include/linux/seqlock.h:35:0,
 from include/linux/time.h:5,
 from include/linux/stat.h:18,
 from include/linux/module.h:10,
 from drivers/net/vmxnet3/vmxnet3_drv.c:27:
drivers/net/vmxnet3/vmxnet3_drv.c: In function 'vmxnet3_xmit_frame':
include/linux/spinlock.h:246:30: warning: 'flags' may be used uninitialized in 
this function [-Wmaybe-uninitialized]
   _raw_spin_unlock_irqrestore(lock, flags); \
  ^
drivers/net/vmxnet3/vmxnet3_drv.c:977:16: note: 'flags' was declared here
  unsigned long flags;
^

vmxnet3_tq_xmit(struct sk_buff *skb, struct vmxnet3_tx_queue *tq,
struct vmxnet3_adapter *adapter, struct net_device *netdev)
{
(...snipped...)
goto hdr_too_big;
(...snipped...)
spin_lock_irqsave(>tx_lock, flags);
(...snipped...)
spin_unlock_irqrestore(>tx_lock, flags);
(...snipped...)
return NETDEV_TX_OK;
(...snipped...)
hdr_too_big:
tq->stats.drop_oversized_hdr++;
unlock_drop_pkt:
spin_unlock_irqrestore(>tx_lock, flags);
(...snipped...)
return NETDEV_TX_OK;
}

Re: 4.4.3: OOPS when running "stress-ng --sock 5"

2016-03-07 Thread Tetsuo Handa

Holger Schurig wrote:
> So I did an "arm-linux-gnueabihf-objdump -Sgd linux/vmlinux", not sure
> if that helps:
> 
> c00972ec <__rmqueue>:
>  * Do the hard work of removing an element from the buddy allocator.
>  * Call me with the zone->lock already held.
>  */
> static struct page *__rmqueue(struct zone *zone, unsigned int order,
> int migratetype, gfp_t gfp_flags)
> {
> c00972ec:   e1a0c00dmov ip, sp
> c00972f0:   e92ddff0push{r4, r5, r6, r7, r8, r9, sl, fp, ip, 
> lr, pc}
> c00972f4:   e24cb004sub fp, ip, #4
> c00972f8:   e24dd024sub sp, sp, #36 ; 0x24
> unsigned int current_order;
> struct free_area *area;
> struct page *page;
> 
> /* Find a page of the appropriate size in the preferred list */
> for (current_order = order; current_order < MAX_ORDER; 
> ++current_order) {
> c00972fc:   e351000acmp r1, #10
>  * Do the hard work of removing an element from the buddy allocator.
>  * Call me with the zone->lock already held.
>  */
> 
I tried on x86_64 but I could not reproduce it.
Thus, we need to examine this problem using your environment.

I didn't notice that c00972ec is __rmqueue+0x0.
Actual line number to examine is c0097360 ("pc" register) which is 
__rmqueue+0x74.
Please show us line number and assembly code around c0097360.

Re: [PATCH] unix: properly account for FDs passed over unix sockets

2015-12-30 Thread Tetsuo Handa

Willy Tarreau wrote:
> On Wed, Dec 30, 2015 at 09:58:42AM +0100, Hannes Frederic Sowa wrote:
> > The MSG_PEEK code should not be harmful and the patch is good as is. I 
> > first understood from the published private thread, that it is possible 
> > for a program to exceed the rlimit of fds. But the DoS is only by 
> > keeping the fds in flight and not attaching them to any program.
> 
> Exactly. The real issue is when these FDs become very expensive such as
> pipes full of data.
> 

As you wrote how to abuse this vulnerability which exists in Linux 2.0
and later kernel, I quote a short description from private thread.

  "an unprivileged user consumes all file descriptors so that other
  unprivileged user cannot work" and "an unprivileged user consumes all
  kernel memory so that the OOM killer kills almost all processes before
  the culprit process is killed (CVE-2013-4312)".

Reported-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ath6kl: Use vmalloc for loading firmware using api1 method

2015-11-28 Thread Tetsuo Handa

Andy Shevchenko wrote:
> On Sat, Nov 28, 2015 at 8:58 PM, Brent Taylor  wrote:
> > Whats the status on this patch?  I don't see it on patchwork anymore
> > nor is it in any of the git trees I checked.
> >
> 
> You forget to use kvfree() instead of kfree() in core.c.
> 

In addition to that, I think you can do like below.

if (>fw == fw)
*fw = vmalloc(fw_entry->size);
else
*fw = kmalloc(fw_entry->size, GFP_KERNEL);
if (*fw == NULL)
ret = -ENOMEM;
else
memcpy(*fw, fw_entry->data, fw_entry->size);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] tree wide: Use kvfree() than conditional kfree()/vfree()

2015-11-09 Thread Tetsuo Handa

There are many locations that do

  if (memory_was_allocated_by_vmalloc)
vfree(ptr);
  else
kfree(ptr);

but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory
using is_vmalloc_addr(). Unless callers have special reasons, we can
replace this branch with kvfree(). Please check and reply if you found
problems.

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mho...@suse.com>
Cc: Russell King <li...@arm.linux.org.uk> # arm
Cc: <linux-a...@vger.kernel.org> # apei
Cc: <drbd-u...@lists.linbit.com> # drbd
Cc: <linux-ker...@vger.kernel.org> # mspec
Cc: <dri-de...@lists.freedesktop.org> # drm
Cc: Oleg Drokin <oleg.dro...@intel.com> # lustre
Cc: Andreas Dilger <andreas.dil...@intel.com> # lustre
Cc: <codal...@coda.cs.cmu.edu> # coda
Cc: <linux-...@lists.infradead.org> # jffs2
Cc: Jan Kara <j...@suse.com> # udf
Cc: <linux-fsde...@vger.kernel.org> # xattr
Cc: <linux...@kvack.org> # ipc + mm
Cc: <netdev@vger.kernel.org> # ipv4
---
 arch/arm/mm/dma-mapping.c  | 11 ++--
 drivers/acpi/apei/erst.c   |  6 ++--
 drivers/block/drbd/drbd_bitmap.c   | 26 +
 drivers/block/drbd/drbd_int.h  |  3 --
 drivers/char/mspec.c   | 15 ++
 drivers/gpu/drm/drm_hashtab.c  |  5 +---
 .../lustre/include/linux/libcfs/libcfs_private.h   |  8 ++
 fs/coda/coda_linux.h   |  3 +-
 fs/jffs2/build.c   |  8 ++
 fs/jffs2/fs.c  |  5 +---
 fs/jffs2/super.c   |  5 +---
 fs/udf/super.c |  7 +
 fs/xattr.c | 33 ++
 ipc/sem.c  |  2 +-
 ipc/util.c | 11 ++--
 ipc/util.h |  2 +-
 mm/percpu.c| 18 +---
 mm/vmalloc.c   |  5 +---
 net/ipv4/fib_trie.c|  4 +--
 19 files changed, 46 insertions(+), 131 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index e62400e..492bf3e 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1200,10 +1200,7 @@ error:
while (i--)
if (pages[i])
__free_pages(pages[i], 0);
-   if (array_size <= PAGE_SIZE)
-   kfree(pages);
-   else
-   vfree(pages);
+   kvfree(pages);
return NULL;
 }
 
@@ -1211,7 +1208,6 @@ static int __iommu_free_buffer(struct device *dev, struct 
page **pages,
   size_t size, struct dma_attrs *attrs)
 {
int count = size >> PAGE_SHIFT;
-   int array_size = count * sizeof(struct page *);
int i;
 
if (dma_get_attr(DMA_ATTR_FORCE_CONTIGUOUS, attrs)) {
@@ -1222,10 +1218,7 @@ static int __iommu_free_buffer(struct device *dev, 
struct page **pages,
__free_pages(pages[i], 0);
}
 
-   if (array_size <= PAGE_SIZE)
-   kfree(pages);
-   else
-   vfree(pages);
+   kvfree(pages);
return 0;
 }
 
diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
index 6682c5d..6e6bc10 100644
--- a/drivers/acpi/apei/erst.c
+++ b/drivers/acpi/apei/erst.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include  /* kvfree() */
 #include 
 
 #include "apei-internal.h"
@@ -532,10 +533,7 @@ retry:
return -ENOMEM;
memcpy(new_entries, entries,
   erst_record_id_cache.len * sizeof(entries[0]));
-   if (erst_record_id_cache.size < PAGE_SIZE)
-   kfree(entries);
-   else
-   vfree(entries);
+   kvfree(entries);
erst_record_id_cache.entries = entries = new_entries;
erst_record_id_cache.size = new_size;
}
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 9462d27..2daaafb 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -364,12 +364,9 @@ static void bm_free_pages(struct page **pages, unsigned 
long number)
}
 }
 
-static void bm_vk_free(void *ptr, int v)
+static inline void bm_vk_free(void *ptr)
 {
-   if (v)
-   vfree(ptr);
-   else
-   kfree(ptr);
+   kvfree(ptr);
 }
 
 /*
@@ -379,7 +376,7 @@ static struct page **bm_realloc_pages(struct drbd_bitmap 
*b, unsigned long want)
 {
struct page **old_pages = b->bm_pages;
struct page **new_pages, *page;
-   u

[PATCH] tcp: use kmalloc() than kmalloc_array().

2015-11-07 Thread Tetsuo Handa

Commit 095dc8e0c3686d58 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
silently changed from kmalloc() to kmalloc_array(). The latter has
overflow check whereas the former doesn't have.

If nblocks * locksz might overflow, we need to do like

  -  if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz)
  +  if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz)
   hashinfo->ehash_locks = vmalloc(nblocks * locksz);

because kmalloc_array() detects overflow and returns NULL.
But if nblocks * locksz is guaranteed not to overflow, there is
no need to use kmalloc_array().

Since I assume it won't overflow, use kmalloc() than kmalloc_array().

Signed-off-by: Tetsuo Handa <penguin-ker...@i-love.sakura.ne.jp>
---
 net/ipv4/inet_hashtables.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ccc5980..8f4ab27 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -648,8 +648,8 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
/* no more locks than number of hash buckets */
nblocks = min(nblocks, hashinfo->ehash_mask + 1);
 
-   hashinfo->ehash_locks = kmalloc_array(nblocks, locksz,
- GFP_KERNEL | 
__GFP_NOWARN);
+   hashinfo->ehash_locks = kmalloc(nblocks * locksz,
+   GFP_KERNEL | __GFP_NOWARN);
if (!hashinfo->ehash_locks)
hashinfo->ehash_locks = vmalloc(nblocks * locksz);
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] tcp: use kmalloc() than kmalloc_array().

2015-11-07 Thread Tetsuo Handa

Tetsuo Handa wrote:
> Commit 095dc8e0c3686d58 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
> silently changed from kmalloc() to kmalloc_array(). The latter has
> overflow check whereas the former doesn't have.
> 
> If nblocks * locksz might overflow, we need to do like
> 
>   -  if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz)
>   +  if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz)

Oops, I meant

   -  if (!hashinfo->ehash_locks)
   +  if (!hashinfo->ehash_locks && nblocks > SIZE_MAX / locksz)

here.

>hashinfo->ehash_locks = vmalloc(nblocks * locksz);
> 
> because kmalloc_array() detects overflow and returns NULL.
> But if nblocks * locksz is guaranteed not to overflow, there is
> no need to use kmalloc_array().
> 
> Since I assume it won't overflow, use kmalloc() than kmalloc_array().

I don't know about possible value range.
Please confirm that it can't overflow.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] tcp: use kmalloc() than kmalloc_array().

2015-11-07 Thread Tetsuo Handa

David Miller wrote:
> From: Eric Dumazet 
> Date: Sat, 07 Nov 2015 10:50:07 -0800
> 
> > I do not feel we should go back to kmalloc() just because
> > vmalloc_array() does not exist yet.
> 
> Agreed.
> 

Please change as you like.

I was thinking to introduce a helper that does vmalloc() when
kmalloc() failed because locations that do

  ptr = kmalloc(size, GFP_NOFS);
  if (!ptr)
  ptr = vmalloc(size); /* Wrong because GFP_KERNEL is used implicitly */

are found. I just noticed that inet_ehash_locks_alloc() is doing

  ptr = kmalloc_array(count, size, GFP_KERNEL);
  if (!ptr)
  ptr = vmalloc(count * size); /* Wrong because overflow is not checked */

and wanted to know your intent.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] cxgb4: memory corruption in debugfs

2015-08-18 Thread Tetsuo Handa

Dan Carpenter wrote:
 You can't use kstrtoul() with an int or it causes memory corruption.
 Also j should be unsigned or we have underflow bugs.
 
 I considered changing j to unsigned long but everything fits in a u32.

Excuse me, but kstrtouint()'s last argument is not u32 * but unsigned int *.
Aren't there architectures where sizeof(unsigned int)  sizeof(u32) ?

 
 Fixes: 8e3d04fd7d70 ('cxgb4: Add MPS tracing support')
 Signed-off-by: Dan Carpenter dan.carpen...@oracle.com
 
 diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c 
 b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
 index 1732e29..0a87a32 100644
 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
 +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
 @@ -1289,13 +1289,14 @@ static unsigned int xdigit2int(unsigned char c)
  static ssize_t mps_trc_write(struct file *file, const char __user *buf,
size_t count, loff_t *pos)
  {
 - int i, j, enable, ret;
 + int i, enable, ret;
   u32 *data, *mask;
   struct trace_params tp;
   const struct inode *ino;
   unsigned int trcidx;
   char *s, *p, *word, *end;
   struct adapter *adap;
 + u32 j;
  
   ino = file_inode(file);
   trcidx = (uintptr_t)ino-i_private  3;
 @@ -1340,7 +1341,7 @@ static ssize_t mps_trc_write(struct file *file, const 
 char __user *buf,
  
   if (!strncmp(word, qid=, 4)) {
   end = (char *)word + 4;
 - ret = kstrtoul(end, 10, (unsigned long *)j);
 + ret = kstrtouint(end, 10, j);
   if (ret)
   goto out;
   if (!adap-trace_rss) {
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHSET] printk, netconsole: implement reliable netconsole

2015-04-18 Thread Tetsuo Handa

Tejun Heo wrote:
  If we can assume that scheduler is working, adding a kernel thread that
  does
  
while (1) {
read messages with metadata from /dev/kmsg
send them using UDP network
}
  
  might be easier than modifying netconsole module.
 
 But, I mean, if we are gonna do that in kernel, we better do it
 properly where it belongs.  What's up with easier than modifying
 netconsole module?  Why is netconsole special?  And how would the
 above be any less complex than a single timer function?  What am I
 missing?

User space daemon is sometimes disturbed unexpectedly due to

  (a) SIGKILL by OOM-killer
  (b) spurious ptrace() by somebody
  (c) spurious signals such as SIGSTOP / SIGINT
  (d) stalls triggered by page faults under OOM condition
  (e) other problems such as scheduler being not working

We have built-in protection for (a) named /proc/$pid/oom_score_adj , but
we need to configure access control modules for protecting (b) and (c),
and we don't have protection for (d). Thinking from OOM stall discussion,
(d) is fatal when trying to obtain kernel messages under problematic
condition. I thought that a kernel thread that does

  while (1) {
  read messages with metadata from /dev/kmsg
  send them using UDP network
  }

is automatically protected from (a), (b), (c) and (d), and it could be
implemented outside of netconsole module.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHSET] printk, netconsole: implement reliable netconsole

2015-04-17 Thread Tetsuo Handa

Tejun Heo wrote:
 On Sat, Apr 18, 2015 at 03:03:46AM +0900, Tetsuo Handa wrote:
  packet will be sufficient for finding out whether the packets were lost 
  and/or
  reordered in flight.
  
printk(Hello);
 = netconsole sends  Hello using UDP
printk(netconsole);
 = netconsole sends 0001 netconsole using UDP
printk(world\n);
 = netconsole sends 0002 world\n using UDP
  
  It might be nice to allow administrator to prefix a sequence number
  to netconsole messages for those who are using special receiver
  program (e.g. ncrx) which checks that sequence number.
 
 That said, this is pretty much what the first 12 patches do (except
 for the last printk patch, which can be taken out).  We already have
 sequencing and established format to expose them to userland - try cat
 /dev/kmsg, which btw is what local loggers on modern systems use
 anyway.  Why introduce netconsole's own version of metadata?

I didn't mean to introduce netconsole's own version of metadata.
I meant we don't need to implement in-kernel retry logic.

If we can assume that scheduler is working, adding a kernel thread that
does

  while (1) {
  read messages with metadata from /dev/kmsg
  send them using UDP network
  }

might be easier than modifying netconsole module.

 
 Thanks.
 
 -- 
 tejun
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHSET] printk, netconsole: implement reliable netconsole

2015-04-17 Thread Tetsuo Handa

Tejun Heo wrote:
 Hello, David.
 
 On Fri, Apr 17, 2015 at 01:17:12PM -0400, David Miller wrote:
  If userland cannot run properly, it is almost certain that neither will
  your complex reliability layer logic.
 
 * The bulk of patches are to pipe extended log messages to console
   drivers and let netconsole relay them to the receiver (and quite a
   bit of refactoring in the process), which, regardless of the
   reliability logic, is beneficial as we're currently losing
   structured logging (dictionary) and other metadata over consoles and
   regardless of where the reliability logic is implemented, it's a lot
   easier to have messages IDs.
 
 * The only thing necessary for reliable transmission are timer and
   netpoll.  There sure are cases where they go down too but there's a
   pretty big gap between those two going down and userland getting
   hosed, but where to put the retransmission and reliability logic
   definitely is debatable.
 
 * That said, the reliability part of the patch series are just two
   patches - 13 and 14, both of which are actually pretty simple.
 
  I tend to agree with Tetsuo, that in-kernel netconsole should remain
  as simple as possible and once it starts to have any smarts and less
  trivial logic the job belongs in userspace.
 
 Upto patch 12, it's just the same mechanism transferring extended
 messages.  It doesn't add any smartness to netconsole per-se except
 that it can now emit messages with metadata headers.  What do you
 think about them?

So, this patchset aims for obtaining kernel messages under problematic
condition. You have to hold messages until ack is delivered. This means
that printk buffer can become full before burst messages (e.g. SysRq-t)
are acked due to packet loss in the network.

printk() cannot wait for ack. Trying to wait for ack would break something.
How can you transmit subsequent kernel messages which failed to enqueue
due to waiting for ack for previous kernel messages?

 
 Thanks.
 
 -- 
 tejun
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHSET] printk, netconsole: implement reliable netconsole

2015-04-17 Thread Tetsuo Handa

Tejun Heo wrote:
  printk() cannot wait for ack. Trying to wait for ack would break something.
  How can you transmit subsequent kernel messages which failed to enqueue
  due to waiting for ack for previous kernel messages?
 
 Well, if log buffer overflows and the messages aren't at the logging
 target yet, they're lost.  It's the same as doing dmesg on localhost,
 isn't it?  This doesn't have much to do with where the reliability
 logic is implemented and is exactly the same with local logging too.

If you tolerate loss of kernel messages, adding sequence number to each UDP
packet will be sufficient for finding out whether the packets were lost and/or
reordered in flight.

  printk(Hello);
   = netconsole sends  Hello using UDP
  printk(netconsole);
   = netconsole sends 0001 netconsole using UDP
  printk(world\n);
   = netconsole sends 0002 world\n using UDP

It might be nice to allow administrator to prefix a sequence number
to netconsole messages for those who are using special receiver
program (e.g. ncrx) which checks that sequence number.

 
 Thanks.
 
 -- 
 tejun
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHSET] printk, netconsole: implement reliable netconsole

2015-04-17 Thread Tetsuo Handa

Tejun Heo wrote:
 * Implement netconsole retransmission support.  Matching rx socket on
   the source port is automatically created for extended targets and
   the log receiver can request retransmission by sending reponse
   packets.  This is completely decoupled from the main write path and
   doesn't make netconsole less robust when things start go south.

If the sender side can wait for retransmission, why can't we use
userspace programs (e.g. rsyslogd)?

For me, netconsole is mainly for saving kernel messages which cannot be
waited for a few deciseconds (e.g. system reset by softdog's timeout) and
for saving kernel messages (e.g. SysRq-t) during disk I/O hang up.
I have a logger for receiving netconsole messages at
http://sourceforge.jp/projects/akari/scm/svn/tree/head/branches/udplogger/
and things I expect for netconsole are shown below.

  (a) spool the message to send up to 1 line or adimn configured size
  so that total number of small UDP packets can be reduced

  (b) don't hesitate to send the spooled message immediately if either
  kernel panic or system reset is in progress

  (c) allow different console log level for different console drivers
  so that I can send kernel messages via netconsole without making
  local console noisy
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[2.6.25-rc1] Locks in udp_recvmsg()?

2008-02-10 Thread Tetsuo Handa

Hello.

I found that udp_recvmsg() in net/ipv4/udp.c for 2.6.25-rc1 calls
lock_sock() only when it releases the datagram (i.e. out_free: and 
csum_copy_err:).
Is it correct to call __skb_recv_datagram() without calling lock_sock()
when it acquires the datagram (i.e. try_again:)?

Regards.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[2.6.24-mm1] TCP/IPv6 connect() oopses at twothirdsMD4Transform()

2008-02-04 Thread Tetsuo Handa

Hello.

Kernel config is at http://I-love.SAKURA.ne.jp/tmp/config-2.6.24-mm1

2.6.24 works fine.

Regards.
--
BUG: unable to handle kernel paging request at 25476bec
IP: [c0211c28] twothirdsMD4Transform+0x78/0x37c
*pde =  
Oops:  [#1] SMP DEBUG_PAGEALLOC
last sysfs file: 
/sys/devices/pci:00/:00:10.0/host0/target0:0:1/0:0:1:0/type
Modules linked in: nfsd lockd sunrpc exportfs pcnet32

Pid: 2148, comm: a.out Not tainted (2.6.24-mm1 #1)
EIP: 0060:[c0211c28] EFLAGS: 00010286 CPU: 0
EIP is at twothirdsMD4Transform+0x78/0x37c
EAX: 00084000 EBX: 0800 ECX: 8000 EDX: db45ddec
ESI:  EDI: 52806380 EBP: db45dddc ESP: db45ddc8
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process a.out (pid: 2148, ti=db45d000 task=deaf9250 task.ti=db45d000)
Stack: 8000 def6ef9c 6380 c0759d60 db45de1c db45de28 c0211fd2 0040 
   c0759d40    0100 52806380 1f2e00ba fffa249f 
   5a696b37 8dbe1970 cf7579d0 3b0cc350 a54b10a8 def6e9a0  def6ef8c 
Call Trace:
 [c0211fd2] ? secure_tcpv6_sequence_number+0x58/0x7a
 [c0322247] ? tcp_v6_connect+0x46d/0x4e3
 [c02ade07] ? lock_sock_nested+0x56/0x5e
 [c02ecbae] ? inet_stream_connect+0x1c/0x163
 [c02ecc24] ? inet_stream_connect+0x92/0x163
 [c02ab14b] ? sys_connect+0x72/0x98
 [c013b945] ? lock_release_holdtime+0x4e/0x54
 [c0115250] ? do_page_fault+0x1c5/0x3fc
 [c013e858] ? __lock_release+0x4b/0x51
 [c0115250] ? do_page_fault+0x1c5/0x3fc
 [c02ab9f5] ? sys_socketcall+0x6f/0x15e
 [c0103ae7] ? restore_nocheck+0x12/0x15
 [c0103a86] ? syscall_call+0x7/0xb
 ===
Code: 31 c1 03 0c ba 8b 7a 0c 01 ce 8b 4d ec c1 c6 0b 31 d9 21 f1 31 d9 03 0c 
ba 8b 7a 10 01 c8 8b 4d ec c1 c0 13 31 f1 21 c1 33 4d ec 03 0c ba 8b 7a 14 01 
cb 89 c1 c1 c3 03 31 f1 21 d9 31 f1 03 0c 
EIP: [c0211c28] twothirdsMD4Transform+0x78/0x37c SS:ESP 0068:db45ddc8
---[ end trace 160518059a282c77 ]---
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.24-mm1] TCP/IPv6 connect() oopses at twothirdsMD4Transform()

2008-02-04 Thread Tetsuo Handa

Hello.

 random: revert braindamage that snuck into checkpatch cleanup
 
 Signed-off-by: Matt Mackall [EMAIL PROTECTED]

Yes. It solved the oops.

Thank you.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-2.6.25] Add packet filtering based on process's security context.

2008-01-24 Thread Tetsuo Handa

Hello.

Are there any remaining questions/problems about this patch?
If none, I want this patch applied to net-2.6.25 tree.

Regards.

---
This patch modifies security_socket_post_accept() and introduces
security_socket_post_recv_datagram() LSM hooks.

Currently, security_socket_post_accept() is called *after* fd_install()
at sys_accept(). This means that userland process might access accept()ed
but not yet properly labeled socket.
I believe this security_socket_post_accept() should be called *before*
fd_install() for the three reasons.

 * The userland process will always access proper context labeled by
   security_socket_post_accept() rather than default context labeled by
   security_inode_alloc() (called from alloc_inode() from new_inode() from
   sock_alloc() from sys_accept()). This gives a security module
   a chance to perform access control based on proper context.

 * If security_socket_post_accept() failed to assign proper context
   (e.g. -ENOMEM), the accept()ed socket can't have proper context.
   Use of void for security_socket_post_accept() deprives a security module
   of a chance to abandon the accept()ed socket.

 * A security module can perform connection filtering based on
   process's security context.
   If security_socket_post_accept() is called after fd_install(),
   it is too late to do so because the accept()ed socket is already visible
   to userland processes.


Currently, there is no way to directly map security context from incoming
packet to user process. This is because the creator or owner of a socket is
not always the receiver of an incoming packet. The userland process who
receives the incoming packet is not known until a process calls sys_recvmsg().
So, I want to add a LSM hook to give a security module a chance to control
after the recipient of the incoming packet is known.

Signed-off-by: Kentaro Takeda [EMAIL PROTECTED]
Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]
---
 include/linux/security.h |   34 +-
 net/core/datagram.c  |   29 -
 net/socket.c |7 +--
 security/dummy.c |   13 ++---
 security/security.c  |   10 --
 5 files changed, 80 insertions(+), 13 deletions(-)

--- net-2.6.25.orig/include/linux/security.h
+++ net-2.6.25/include/linux/security.h
@@ -781,8 +781,12 @@ struct request_sock;
  * @socket_post_accept:
  * This hook allows a security module to copy security
  * information into the newly created socket's inode.
+ * This hook also allows a security module to filter connections
+ * from unwanted peers based on the process accepting this connection.
+ * The connection will be aborted if this hook returns nonzero.
  * @sock contains the listening socket structure.
  * @newsock contains the newly created server socket for connection.
+ * Return 0 if permission is granted.
  * @socket_sendmsg:
  * Check permission before transmitting a message to another socket.
  * @sock contains the socket structure.
@@ -796,6 +800,15 @@ struct request_sock;
  * @size contains the size of message structure.
  * @flags contains the operational flags.
  * Return 0 if permission is granted.  
+ * @socket_post_recv_datagram:
+ * Check permission after receiving a datagram.
+ * This hook allows a security module to filter packets
+ * from unwanted peers based on the process receiving this datagram.
+ * The packet will be discarded if this hook returns nonzero.
+ * @sk contains the socket.
+ * @skb contains the socket buffer.
+ * @flags contains the operational flags.
+ * Return 0 if permission is granted.
  * @socket_getsockname:
  * Check permission before the local address (name) of the socket object
  * @sock is retrieved.
@@ -1387,12 +1400,13 @@ struct security_operations {
   struct sockaddr * address, int addrlen);
int (*socket_listen) (struct socket * sock, int backlog);
int (*socket_accept) (struct socket * sock, struct socket * newsock);
-   void (*socket_post_accept) (struct socket * sock,
-   struct socket * newsock);
+   int (*socket_post_accept) (struct socket *sock, struct socket *newsock);
int (*socket_sendmsg) (struct socket * sock,
   struct msghdr * msg, int size);
int (*socket_recvmsg) (struct socket * sock,
   struct msghdr * msg, int size, int flags);
+   int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb,
+ unsigned int flags);
int (*socket_getsockname) (struct socket * sock);
int (*socket_getpeername) (struct socket * sock);
int (*socket_getsockopt) (struct socket * sock, int level, int optname);
@@ -2297,10 +2311,12 @@ int security_socket_bind(struct socket *
 int security_socket_connect(struct

[PATCH net-2.6.25] Add packet filtering based on process's security context.

2008-01-22 Thread Tetsuo Handa

This patch modifies security_socket_post_accept() and introduces
security_socket_post_recv_datagram() LSM hooks.

Currently, security_socket_post_accept() is called *after* fd_install()
at sys_accept(). This means that userland process might access accept()ed
but not yet properly labeled socket.
I believe this security_socket_post_accept() should be called *before*
fd_install() for the three reasons.

 * The userland process will always access proper context labeled by
   security_socket_post_accept() rather than default context labeled by
   security_inode_alloc() (called from alloc_inode() from new_inode() from
   sock_alloc() from sys_accept()). This gives a security module
   a chance to perform access control based on proper context.

 * If security_socket_post_accept() failed to assign proper context
   (e.g. -ENOMEM), the accept()ed socket can't have proper context.
   Use of void for security_socket_post_accept() deprives a security module
   of a chance to abandon the accept()ed socket.

 * A security module can perform connection filtering based on
   process's security context.
   If security_socket_post_accept() is called after fd_install(),
   it is too late to do so because the accept()ed socket is already visible
   to userland processes.


Currently, there is no way to directly map security context from incoming
packet to user process. This is because the creator or owner of a socket is
not always the receiver of an incoming packet. The userland process who
receives the incoming packet is not known until a process calls sys_recvmsg().
So, I want to add a LSM hook to give a security module a chance to control
after the recipient of the incoming packet is known.

Signed-off-by: Kentaro Takeda [EMAIL PROTECTED]
Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]
---
 include/linux/security.h |   34 +-
 net/core/datagram.c  |   29 -
 net/socket.c |7 +--
 security/dummy.c |   13 ++---
 security/security.c  |   10 --
 5 files changed, 80 insertions(+), 13 deletions(-)

--- net-2.6.25.orig/include/linux/security.h
+++ net-2.6.25/include/linux/security.h
@@ -781,8 +781,12 @@ struct request_sock;
  * @socket_post_accept:
  * This hook allows a security module to copy security
  * information into the newly created socket's inode.
+ * This hook also allows a security module to filter connections
+ * from unwanted peers based on the process accepting this connection.
+ * The connection will be aborted if this hook returns nonzero.
  * @sock contains the listening socket structure.
  * @newsock contains the newly created server socket for connection.
+ * Return 0 if permission is granted.
  * @socket_sendmsg:
  * Check permission before transmitting a message to another socket.
  * @sock contains the socket structure.
@@ -796,6 +800,15 @@ struct request_sock;
  * @size contains the size of message structure.
  * @flags contains the operational flags.
  * Return 0 if permission is granted.  
+ * @socket_post_recv_datagram:
+ * Check permission after receiving a datagram.
+ * This hook allows a security module to filter packets
+ * from unwanted peers based on the process receiving this datagram.
+ * The packet will be discarded if this hook returns nonzero.
+ * @sk contains the socket.
+ * @skb contains the socket buffer.
+ * @flags contains the operational flags.
+ * Return 0 if permission is granted.
  * @socket_getsockname:
  * Check permission before the local address (name) of the socket object
  * @sock is retrieved.
@@ -1387,12 +1400,13 @@ struct security_operations {
   struct sockaddr * address, int addrlen);
int (*socket_listen) (struct socket * sock, int backlog);
int (*socket_accept) (struct socket * sock, struct socket * newsock);
-   void (*socket_post_accept) (struct socket * sock,
-   struct socket * newsock);
+   int (*socket_post_accept) (struct socket *sock, struct socket *newsock);
int (*socket_sendmsg) (struct socket * sock,
   struct msghdr * msg, int size);
int (*socket_recvmsg) (struct socket * sock,
   struct msghdr * msg, int size, int flags);
+   int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb,
+ unsigned int flags);
int (*socket_getsockname) (struct socket * sock);
int (*socket_getpeername) (struct socket * sock);
int (*socket_getsockopt) (struct socket * sock, int level, int optname);
@@ -2297,10 +2311,12 @@ int security_socket_bind(struct socket *
 int security_socket_connect(struct socket *sock, struct sockaddr *address, int 
addrlen);
 int security_socket_listen(struct socket *sock, int backlog);
 int security_socket_accept(struct

Re: [PATCH net-2.6.25] Add packet filtering based on process\'s security context.

2008-01-22 Thread Tetsuo Handa

Hello.

Casey Schaufler wrote:
 Do you have a real situation where two user processes with different
 security contexts share a socket? How do you get into that situation,
 and is it appropriate to have that situation in your security scheme?
 Can this occur without using privilege?

I hope such situation won't occur, as I have mentioned in the previous
posting.

| Precautions: This approach has a side effect which unlikely occurs.
| 
| If a socket is shared by multiple processes with differnt policy,
| the process who should be able to accept this connection
| will not be able to accept this connection
| because socket_post_accept() aborts this connection.
| But if socket_post_accept() doesn't abort this connection,
| the process who must not be able to accept this connection
| will repeat accept() forever, which is a worse side effect.
| 
| Similarly, if a socket is shared by multiple processes with differnt policy,
| the process who should be able to pick up this datagram
| will not be able to pick up this datagram
| because socket_post_recv_datagram() discards this datagram.
| But if socket_post_recv_datagram() doesn't discard this datagram,
| the process who must not be able to pick up this datagram
| will repeat recvmsg() forever, which is a worse side effect.
| 
| So, don't give different permissions between processes who shares one socket.
| Otherwise, some connections/datagrams cannot be delivered to intended process.

But it is possible to write a code like

-- app3.c start --
/* gcc -Wall -O2 -o /tmp/app3 app3.c */
#include fcntl.h
#include netinet/in.h
#include stdio.h
#include string.h
#include sys/select.h
#include sys/socket.h
#include sys/types.h
#include unistd.h

int main(int argc, char *argv[]) {
const int fd1 = socket(PF_INET, SOCK_DGRAM, 0), fd2 = socket(PF_INET, 
SOCK_DGRAM, 0);
struct sockaddr_in addr;
char buf[16];
memset(addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = htonl(INADDR_ANY);
addr.sin_port = htons(1);
fprintf(stderr, %s started.\n, argv[0]);
if (bind(fd1, (struct sockaddr *) addr, sizeof(addr))) {
fprintf(stderr, Can't bind()\n);
return 1;
}
if (sendto(fd2, hello\n, 6, 0, (struct sockaddr *) addr, 
sizeof(addr)) != 6 ||
sendto(fd2, world\n, 6, 0, (struct sockaddr *) addr, 
sizeof(addr)) != 6) {
fprintf(stderr, Can't sendto()\n);
return 1;
}
while (1) {
fd_set rfds;
FD_ZERO(rfds);
FD_SET(fd1, rfds);
select(fd1 + 1, rfds, NULL, NULL, NULL);
if (FD_ISSET(fd1, rfds)) break;
fprintf(stderr, Can't select()\n);
return 1;
}
if (fcntl(fd1, FD_CLOEXEC, 0)) {
fprintf(stderr, Can't fcntl()\n);
return 1;
}
snprintf(buf, sizeof(buf), %d, fd1);
execlp(/tmp/app4, app4, buf, NULL);
fprintf(stderr, Can't execve()\n);
return 1;
}
-- app3.c end --

-- app4.c start --
/* gcc -Wall -O2 -o /tmp/app4 app4.c */
#include netinet/in.h
#include stdio.h
#include stdlib.h
#include sys/socket.h
#include sys/types.h
#include unistd.h

int main(int argc, char *argv[]) {
int fd;
if (argc != 2) {
fprintf(stderr, Bad parameter.\n);
return 1;
}
fprintf(stderr, %s started.\n, argv[0]);
fd = atoi(argv[1]);
while (1) {
struct sockaddr_in addr;
socklen_t size = sizeof(addr);
char buffer[1024];
int len;
len = recvfrom(fd, buffer, sizeof(buffer), 0, (struct sockaddr 
*) addr, size);
if (len == EOF) {
fprintf(stderr, Can't recvfrom()\n);
return 1;
}
write(1, buffer, len);
}
return 0;
}
-- app4.c end --

and assign different policy to /tmp/app1 and /tmp/app2 .
Therefore, I want to check at sys_recvmsg() time.

(Usage: Compile app3 and app4 and run /tmp/app3 .)
For TCP's case, see http://www.mail-archive.com/[EMAIL PROTECTED]/msg02531.html

What I want to do is perform connection/packet filtering
with the recipient of the incoming connection/packet known.
My security scheme controls based on the recipient of the incoming 
connection/packet.
In this case, not /tmp/app1 or /tmp/app3 , but /tmp/app2 or /tmp/app4.
This case occurs without using privilege.

Regards.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: AF_UNIX MSG_PEEK bug?

2008-01-09 Thread Tetsuo Handa

Hello.

Brent Casavant wrote:
 However, the program would occasionally get into a situation where
 a call to recv(sockfd, buf, len, MSG_PEEK) returns some number
 of bytes less than the requested length, and persists in this state
 (i.e. retrying the call continues to return the same amount of data)
 even when more than sufficient data is known to have been successfully
 written to the socket.
Did you try MSG_WAITALL flag? See man 2 recv.
A TCP socket handles data in bytes.
You cannot complain if the amount received by recv() is smaller than expected
unless you use MSG_WAITALL flag.

Thanks.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.

2007-12-09 Thread Tetsuo Handa


Hello, Samir.

Did you receive the following messages?
Since these messages were dropped at vger.kernel.org ,
I'm worrying that you couldn't receive the following messages.

Tetsuo Handa wrote:
 Hello.
 
 Samir Bellabes wrote:
   what differences between you approach and netfilter in this case ? if
   it's about packet filtering, you already have all you wishes in
   netfilter project.
   Except a hook for making decision with the name of process who picks that 
   packet up known.
  
  I think that we really don't need it, because we can catch the
  informations as I explained.
 
 Well, I haven't understood yet why we don't need it.
 
 How can you know the the name of process who copies that datagram to its 
 userspace memory?
 A socket may be shared by multiple different executable files,
 so the name of the executable file is not known until
 one of processes who share the socket issues accept()/recvmsg() syscall.
 
 Are you saying that I should not use the name of the executable file?
 
 Regards.
 

Tetsuo Handa wrote:
 Hello.
 
 I made an example.
 
 Usage:
 
   Compile app1 and app2 and run /tmp/app1 .
 
   Run something like
curl http://localhost:1/
   to connect to /tmp/app1.
 
 I want to know that */tmp/app2* accepts TCP connection
 so that the user can control
 whether this TCP connection from 127.0.0.1 port N
 should be accepted by */tmp/app2* or not.
 
 How can we do this without socket_post_accept() change?
 
 Regards.
 
 -- app1.c start --
 /* gcc -Wall -O2 -o /tmp/app1 app1.c */
 #include fcntl.h
 #include netinet/in.h
 #include stdio.h
 #include string.h
 #include sys/select.h
 #include sys/socket.h
 #include sys/types.h
 #include unistd.h
 
 int main(int argc, char *argv[]) {
   const int fd = socket(PF_INET, SOCK_STREAM, 0);
   struct sockaddr_in addr;
   char buf[16];
   memset(addr, 0, sizeof(addr));
   addr.sin_family = AF_INET;
   addr.sin_addr.s_addr = htonl(INADDR_ANY);
   addr.sin_port = htons(1);
   fprintf(stderr, %s started.\n, argv[0]);
   if (bind(fd, (struct sockaddr *) addr, sizeof(addr))) {
   fprintf(stderr, Can't bind()\n);
   return 1;
   } else if (listen(fd, 5)) {
   fprintf(stderr, Can't listen()\n);
   return 1;
   }
   while (1) {
   fd_set rfds;
   FD_ZERO(rfds);
   FD_SET(fd, rfds);
   select(fd + 1, rfds, NULL, NULL, NULL);
   if (FD_ISSET(fd, rfds)) break;
   fprintf(stderr, Can't select()\n);
   return 1;
   }
   if (fcntl(fd, FD_CLOEXEC, 0)) {
   fprintf(stderr, Can't fcntl()\n);
   return 1;
   }
   snprintf(buf, sizeof(buf), %d, fd);
   execlp(/tmp/app2, app2, buf, NULL);
   fprintf(stderr, Can't execve()\n);
   return 1;
 }
 -- app1.c end --
 
 -- app2.c start --
 /* gcc -Wall -O2 -o /tmp/app2 app2.c */
 #include netinet/in.h
 #include stdio.h
 #include stdlib.h
 #include sys/socket.h
 #include sys/types.h
 #include unistd.h
 
 int main(int argc, char *argv[]) {
   int lfd;
   if (argc != 2) {
   fprintf(stderr, Bad parameter.\n);
   return 1;
   }
   fprintf(stderr, %s started.\n, argv[0]);
   lfd = atoi(argv[1]);
   while (1) {
   struct sockaddr_in addr;
   socklen_t size = sizeof(addr);
   int fd = accept(lfd, (struct sockaddr *) addr, size);
   char c;
   if (fd == EOF) {
   fprintf(stderr, Can't accept()\n);
   return 1;
   }
   while (read(fd, c, 1) == 1  write(fd, c, 1) == 1);
   close(fd);
   }
   return 0;
 }
 -- app2.c end --
 
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-2.6.25] Add packet filtering based on process's securitycontext.

2007-12-03 Thread Tetsuo Handa

Hello.

Patrick McHardy wrote:
 No news on that. I'm also a bit sceptical if adding all this complexity
 and overhead would really be worth it (considering only netfilter) just
 to use the owner match and UID/GID matching. It wouldn't even be
 accurate because there is not 1:1 mapping of sockets and processes.
Considering only LSM, socket_post_accept()/socket_post_recv_datagram() hooks are
not complicated at all.
A socket may be mapped to multiple processes, but at the moment of picking up
(i.e. accept()/recvmsg()), I think it is accurate 1:1 mapping.
I'm more interested in Who picks this connection/datagram up? than
Which socket enqueues this connection/datagram?
It may be indifferent for netfilter, but it is region of interest for me.

 I actually like Samir Bellabes' approach, which doesn't suffer from
 these problems IIRC.
Oh, I found him at http://nfws.inl.fr/en/?p=50 . (Sorry, I didn't know.)
He is the person who was discussing with me a few days ago.

 From memory, one approach under discussion was to add netfilter hooks to 
  the transport layer, which could be invoked correctly by each type of 
  protocol when the target process is selected.
 
 We can only invoke the hooks after the socket lookup, but we don't
 know which process is going to call recvmsg() for that socket.

Right. Thus, I'm proposing LSM hooks at accept()/recvmsg() time.

Regards.
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.

2007-11-30 Thread Tetsuo Handa

Hello.

Thank you for detailed explanation.

Samir Bellabes wrote:
 By filtering, you should mean packets filtring, shouldn't you ?
 because this hook is able to deny the accept() syscall for a process, so
 it's a kind of filtring too.
Yes, you are right.

 No, it's performed from the userspace. the goal is to don't touch the
 network stack at all.
OK. One thing I'm worrying.
Use of userspace process assumes that it shall not be killed by some reason
(SIGKILL or OOM-killer or something).
What happens if userspace process is dead?
TOMOYO wants to continue packet filtering functionality even if userspace 
process is dead.
So, TOMOYO loads all filtering rules inside kernel and performs filtering 
inside kernel.

TOMOYO's goals are
(1) perform simple stateless packet filtering without any assistance outside 
the kernel
(2) provide simple approach for people (including me) who disable iptables
or worry the correctness of configurations because of very complicated 
syntax.
(Many users disable both iptables/SELinux inside local network because
their very-very-flexible-configurations give them a headache.)
(3) allow people describe per-an-application firewall rules in understandable 
syntax.

 Dropping the TCP SYN for a TCP connection, for example, then the
 process don't see the connection arrived and was dropped.
That's an advantage.
Making select() not to say ready to accept is what TOMOYO wants
(although it is very difficult to do so because of (3)).

 With your tool, you will have to learn the ftp client to let incoming
 connection, reach this dynamic port, so does this mean, as it's dynamic,
 to allow TCP port 1024-65535 ?
Yes. Users add lines like

 kernel /usr/sbin/vsftpd
 allow_network TCP accept xxx.xxx.xxx.xxx-yyy.yyy.yyy.yyy 1024-65535

which means /usr/sbin/vsftpd is able to accept incoming TCP connections
if source IP address is between xxx.xxx.xxx.xxx-yyy.yyy.yyy.yyy and
source port is between 1024-65535 and

 kernel ... /usr/bin/ftp
 allow_network TCP accept zzz.zzz.zzz.zzz 1024-65535

which means /usr/bin/ftp is able to accept incoming TCP connections if
source IP address is zzz.zzz.zzz.zzz and source port is between 1024-65535.
Users have to add explicitly, but it is very easy and understandable.

 I'm not speaking anymore about socket filtering inside
 sock-aps-accept(). I do filtering for sock-aps-accept() with the
 netfilter tool, by handling packets directly.
For TOMOYO, (3) is more important than making select() not to say ready to 
accept.
Packet filtering without knowing the program's name who picks up this 
connection is not a TOMOYO's goal.
So, TOMOYO still wants to do filtering inside sock-aps-accept() or 
security_socket_post_accept().

Thank you.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.

2007-11-30 Thread Tetsuo Handa

Hello.

Samir Bellabes wrote:
 at security_socket_accept(), the user only accept the fact that the
 application is able to go to sock-ops-accept(). That's the purpose of
 this hook.
Yes. This hook can't perform filtering.

 After, when packet are coming, we can catch them with
 libnetfilter_queue, and deal with filtering packets.
Is this performed inside sock-ops-accept()?

 here we agree. *but* in my module, the user don't judge before
 sock-ops-accept(). He judges when packets are coming, throught the
 libnetfilter_queue API, in userspace, and reinject packet if it's ok.
I didn't understand what is happening.
Is there a hook which can perform filtering inside sock-ops-accept()?

Regards.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.

2007-11-30 Thread Tetsuo Handa

Hello.

Thank you for feedback.

I have some questions.

(1) Your module uses struct security_operations and
is registered with register_security().

TOMOYO also uses struct security_operations and
must be registered with register_security().

Can your module and TOMOYO coexist?

(2) Can your module handle incoming UDP's datagram too?

(3) How do you handle race window between security_socket_accept()
and sock-ops-accept()?

Samir Bellabes wrote:
 My approach is to get the informations regarding the socket from
 socket(), bind() and and accept() syscalls hooks.
 Pushing this informations to userspace. Here the user can refuse or
 accept the sycalls.

If a socket is shared by multiple processes and
the user said Accept this connection at security_socket_accept(),
the process does not always pick up what the user wanted to pick up
because there is no mutex between security_socket_accept() and
sock-ops-accept().

(4) If (3) is my misunderstanding (i.e. the user doesn't judge at 
security_socket_accept()),
it means that the user judges it before select() says ready to accept.
Then ... what happens if a socket is shared by multiple programs?
The user has to judge without knowing which program will pick up this 
connection.
TOMOYO lets the user judge whether to accept or not
with knowing the program's name who picks up this connection.

I think getting information at socket(), bind() and accept() is helpful for 
detailed judgement,
but I think it doesn't solve race problem as long as the user judges *before* 
sock-ops-accept().

Regards.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-2.6.25] Add packet filtering based on process'ssecurity context.

2007-11-23 Thread Tetsuo Handa

Hello.

James Morris wrote:
 From memory, one approach under discussion was to add netfilter hooks to 
 the transport layer, which could be invoked correctly by each type of 
 protocol when the target process is selected.
 
 If this is done for netfilter, then an LSM hook is probably not needed at 
 all, as security modules can utilize netfilter hooks directly.

Patrick McHardy says (at http://marc.info/?l=linux-netdevm=118495005800410w=2 
)
Even with socket filters netfilter doesn't know the final receipient
 process, that is not known until it calls recvmsg and the data is read,
 which is too late for netfilter.



  Precautions: This approach has a side effect which unlikely occurs.
  
  If a socket is shared by multiple processes with different policy,
  the process who should be able to accept this connection
  will not be able to accept this connection
  because socket_post_accept() aborts this connection.
  But if socket_post_accept() doesn't abort this connection,
  the process who must not be able to accept this connection
  will repeat accept() forever, which is a worse side effect.
I think this change is needed regardless of whether to use connection filtering 
or not.
Currently, SELinux doesn't use socket_post_accept().

|  * @socket_post_accept:
|  *This hook allows a security module to copy security
|  *information into the newly created socket's inode.

But if one wants to *copy* security information to accept()ed socket,
the location after fd_install() is too late to copy
because the userland process can access accept()ed socket's fd
whose security information is not copied yet.

Also, if one wants to *assign* security information to accept()ed socket,
it might attend memory allocation which can fail.
So, use of void for socket_post_accept() deprives a security module of a chance 
to
abort this connection if the security module failed to *assign* security 
information.

Regards.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [README] away until Dec 3rd

2007-11-22 Thread Tetsuo Handa

Hello.

I have a question.

Yesterday, I posted a patch based on 2.6.24-rc3-mm1 that modifies
the following files.

 include/linux/security.h |   34 +-
 net/core/datagram.c  |   26 --
 net/socket.c |7 +--
 security/dummy.c |   13 ++---
 security/security.c  |   10 --

But you say that I should make patches based on the net-2.6.25 tree.
Which tree (-mm or net-2.6.25) should I use for making this patch?

There is no pending objection at LSM-ml so far, and
I'm asking for an approval from one of the core developers at netdev-ml.

Regards.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net-2.6.25] Add packet filtering based on process's security context.

2007-11-22 Thread Tetsuo Handa

Hello.

Herbert Xu wrote:
 On Thu, Nov 22, 2007 at 09:57:14PM +0900, Tetsuo Handa wrote:
 
  But you say that I should make patches based on the net-2.6.25 tree.
  Which tree (-mm or net-2.6.25) should I use for making this patch?
 
 The net-2.6.25 tree is the one.  Please use the tree at
 
   git://git.kernel.org/pub/scm/linux/kernel/git/herbert/net-2.6.25.git/
 
 while Dave is away.

Thank you.

I refreshed the patch based on net-2.6.25 tree.

Regards.

-
Subject: Add packet filtering based on process's security context.

This patch allows LSM modules filter incoming connections/datagrams
based on the process's security context who is attempting to pick up.

There are already hooks to filter incoming connections/datagrams
based on the socket's security context, but these hooks are not
applicable when one wants to do TCP Wrapper-like filtering
(e.g. App1 is permitted to accept TCP connections from 192.168.0.0/16).

Precautions: This approach has a side effect which unlikely occurs.

If a socket is shared by multiple processes with differnt policy,
the process who should be able to accept this connection
will not be able to accept this connection
because socket_post_accept() aborts this connection.
But if socket_post_accept() doesn't abort this connection,
the process who must not be able to accept this connection
will repeat accept() forever, which is a worse side effect.

Similarly, if a socket is shared by multiple processes with differnt policy,
the process who should be able to pick up this datagram
will not be able to pick up this datagram
because socket_post_recv_datagram() discards this datagram.
But if socket_post_recv_datagram() doesn't discard this datagram,
the process who must not be able to pick up this datagram
will repeat recvmsg() forever, which is a worse side effect.

So, don't give different permissions between processes who share one socket.
Otherwise, some connections/datagrams cannot be delivered to intended process.

Signed-off-by: Kentaro Takeda [EMAIL PROTECTED]
Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]

 include/linux/security.h |   34 +-
 net/core/datagram.c  |   26 --
 net/socket.c |7 +--
 security/dummy.c |   13 ++---
 security/security.c  |   10 --
 5 files changed, 76 insertions(+), 14 deletions(-)

--- net-2.6.25.orig/include/linux/security.h
+++ net-2.6.25/include/linux/security.h
@@ -781,8 +781,12 @@ struct request_sock;
  * @socket_post_accept:
  * This hook allows a security module to copy security
  * information into the newly created socket's inode.
+ * This hook also allows a security module to filter connections
+ * from unwanted peers based on the process accepting this connection.
+ * The connection will be aborted if this hook returns nonzero.
  * @sock contains the listening socket structure.
  * @newsock contains the newly created server socket for connection.
+ * Return 0 if permission is granted.
  * @socket_sendmsg:
  * Check permission before transmitting a message to another socket.
  * @sock contains the socket structure.
@@ -796,6 +800,15 @@ struct request_sock;
  * @size contains the size of message structure.
  * @flags contains the operational flags.
  * Return 0 if permission is granted.  
+ * @socket_post_recv_datagram:
+ * Check permission after receiving a datagram.
+ * This hook allows a security module to filter packets
+ * from unwanted peers based on the process receiving this datagram.
+ * The packet will be discarded if this hook returns nonzero.
+ * @sk contains the socket.
+ * @skb contains the socket buffer (may be NULL).
+ * @flags contains the operational flags.
+ * Return 0 if permission is granted.
  * @socket_getsockname:
  * Check permission before the local address (name) of the socket object
  * @sock is retrieved.
@@ -1387,12 +1400,13 @@ struct security_operations {
   struct sockaddr * address, int addrlen);
int (*socket_listen) (struct socket * sock, int backlog);
int (*socket_accept) (struct socket * sock, struct socket * newsock);
-   void (*socket_post_accept) (struct socket * sock,
-   struct socket * newsock);
+   int (*socket_post_accept) (struct socket *sock, struct socket *newsock);
int (*socket_sendmsg) (struct socket * sock,
   struct msghdr * msg, int size);
int (*socket_recvmsg) (struct socket * sock,
   struct msghdr * msg, int size, int flags);
+   int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb,
+ unsigned int flags);
int (*socket_getsockname) (struct socket * sock);
int (*socket_getpeername) (struct socket * sock);
int (*socket_getsockopt) (struct socket

[PATCH] Add packet filtering based on process's security context.

2007-11-21 Thread Tetsuo Handa

Hello.

This patch comes from a thread at http://lkml.org/lkml/2007/11/16/155 .

I want to use IP/port based access control for incoming connections/datagrams.

This idea was discussed for several times, but there is no approach that 
satisfies
both it can decide based on the recipient's process and
it can sleep so that the LSM hook can query userspace for an access decision.
Thus, I'd like to add one new LSM hook in net/core/datagram.c and
change return type of one existing LSM hook in net/socket.c .

Regards.

-
Subject: Add packet filtering based on process's security context.

This patch allows LSM modules filter incoming connections/datagrams
based on the process's security context who is attempting to pick up.

There are already hooks to filter incoming connections/datagrams
based on the socket's security context, but these hooks are not
applicable when one wants to do TCP Wrapper-like filtering
(e.g. App1 is permitted to accept TCP connections from 192.168.0.0/16).

Precautions: This approach has a side effect which unlikely occurs.

If a socket is shared by multiple processes with differnt policy,
the process who should be able to accept this connection
will not be able to accept this connection
because socket_post_accept() aborts this connection.
But if socket_post_accept() doesn't abort this connection,
the process who must not be able to accept this connection
will repeat accept() forever, which is a worse side effect.

Similarly, if a socket is shared by multiple processes with differnt policy,
the process who should be able to pick up this datagram
will not be able to pick up this datagram
because socket_post_recv_datagram() discards this datagram.
But if socket_post_recv_datagram() doesn't discard this datagram,
the process who must not be able to pick up this datagram
will repeat recvmsg() forever, which is a worse side effect.

So, don't give different permissions between processes who shares one socket.
Otherwise, some connections/datagrams cannot be delivered to intended process.

Signed-off-by: Kentaro Takeda [EMAIL PROTECTED]
Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]

 include/linux/security.h |   34 +-
 net/core/datagram.c  |   26 --
 net/socket.c |7 +--
 security/dummy.c |   13 ++---
 security/security.c  |   10 --
 5 files changed, 76 insertions(+), 14 deletions(-)

--- linux-2.6-mm.orig/include/linux/security.h
+++ linux-2.6-mm/include/linux/security.h
@@ -778,8 +778,12 @@ struct request_sock;
  * @socket_post_accept:
  * This hook allows a security module to copy security
  * information into the newly created socket's inode.
+ * This hook also allows a security module to filter connections
+ * from unwanted peers based on the process accepting this connection.
+ * The connection will be aborted if this hook returns nonzero.
  * @sock contains the listening socket structure.
  * @newsock contains the newly created server socket for connection.
+ * Return 0 if permission is granted.
  * @socket_sendmsg:
  * Check permission before transmitting a message to another socket.
  * @sock contains the socket structure.
@@ -793,6 +797,15 @@ struct request_sock;
  * @size contains the size of message structure.
  * @flags contains the operational flags.
  * Return 0 if permission is granted.  
+ * @socket_post_recv_datagram:
+ * Check permission after receiving a datagram.
+ * This hook allows a security module to filter packets
+ * from unwanted peers based on the process receiving this datagram.
+ * The packet will be discarded if this hook returns nonzero.
+ * @sk contains the socket.
+ * @skb contains the socket buffer (may be NULL).
+ * @flags contains the operational flags.
+ * Return 0 if permission is granted.
  * @socket_getsockname:
  * Check permission before the local address (name) of the socket object
  * @sock is retrieved.
@@ -1389,12 +1402,13 @@ struct security_operations {
   struct sockaddr * address, int addrlen);
int (*socket_listen) (struct socket * sock, int backlog);
int (*socket_accept) (struct socket * sock, struct socket * newsock);
-   void (*socket_post_accept) (struct socket * sock,
-   struct socket * newsock);
+   int (*socket_post_accept) (struct socket *sock, struct socket *newsock);
int (*socket_sendmsg) (struct socket * sock,
   struct msghdr * msg, int size);
int (*socket_recvmsg) (struct socket * sock,
   struct msghdr * msg, int size, int flags);
+   int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb,
+ unsigned int flags);
int (*socket_getsockname) (struct socket * sock);
int (*socket_getpeername) (struct socket

[PATCH] Add packet filtering based on process\'s security context.

2007-11-19 Thread Tetsuo Handa

This patch allows LSM modules filter incoming connections/datagrams
based on the process's security context who is attempting to pick up.

There are already hooks to filter incoming connections/datagrams
based on the socket's security context, but these hooks are not
applicable when one wants to do TCP Wrapper-like filtering
(e.g. App1 is permitted to accept TCP connections from 192.168.0.0/16).



There is a side effect which unlikely occurs.

If a socket is shared by multiple processes with different policy,
the process who should be able to accept this connection
will not be able to accept this connection
because socket_post_accept() aborts this connection.
But if socket_post_accept() doesn't abort this connection,
the process who must not be able to accept this connection
will repeat accept() forever, which is a worse side effect.

Similarly, if a socket is shared by multiple processes with different policy,
the process who should be able to pick up this datagram
will not be able to pick up this datagram
because socket_post_recv_datagram() discards this datagram.
But if socket_post_recv_datagram() doesn't discard this datagram,
the process who must not be able to pick up this datagram
will repeat recvmsg() forever, which is a worse side effect.

Signed-off-by: Kentaro Takeda [EMAIL PROTECTED]
Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]

 include/linux/security.h |   34 +-
 net/core/datagram.c  |   26 --
 net/socket.c |7 +--
 security/dummy.c |   13 ++---
 security/security.c  |   10 --
 5 files changed, 76 insertions(+), 14 deletions(-)

--- linux-2.6.24-rc2-mm1.orig/include/linux/security.h
+++ linux-2.6.24-rc2-mm1/include/linux/security.h
@@ -778,8 +778,12 @@ struct request_sock;
  * @socket_post_accept:
  * This hook allows a security module to copy security
  * information into the newly created socket's inode.
+ * This hook also allows a security module to filter connections
+ * from unwanted peers based on the process accepting this connection.
+ * The connection will be aborted if this hook returns nonzero.
  * @sock contains the listening socket structure.
  * @newsock contains the newly created server socket for connection.
+ * Return 0 if permission is granted.
  * @socket_sendmsg:
  * Check permission before transmitting a message to another socket.
  * @sock contains the socket structure.
@@ -793,6 +797,15 @@ struct request_sock;
  * @size contains the size of message structure.
  * @flags contains the operational flags.
  * Return 0 if permission is granted.  
+ * @socket_post_recv_datagram:
+ * Check permission after receiving a datagram.
+ * This hook allows a security module to filter packets
+ * from unwanted peers based on the process receiving this datagram.
+ * The packet will be discarded if this hook returns nonzero.
+ * @sk contains the socket.
+ * @skb contains the socket buffer (may be NULL).
+ * @flags contains the operational flags.
+ * Return 0 if permission is granted.
  * @socket_getsockname:
  * Check permission before the local address (name) of the socket object
  * @sock is retrieved.
@@ -1384,12 +1397,13 @@ struct security_operations {
   struct sockaddr * address, int addrlen);
int (*socket_listen) (struct socket * sock, int backlog);
int (*socket_accept) (struct socket * sock, struct socket * newsock);
-   void (*socket_post_accept) (struct socket * sock,
-   struct socket * newsock);
+   int (*socket_post_accept) (struct socket *sock, struct socket *newsock);
int (*socket_sendmsg) (struct socket * sock,
   struct msghdr * msg, int size);
int (*socket_recvmsg) (struct socket * sock,
   struct msghdr * msg, int size, int flags);
+   int (*socket_post_recv_datagram) (struct sock *sk, struct sk_buff *skb,
+ unsigned int flags);
int (*socket_getsockname) (struct socket * sock);
int (*socket_getpeername) (struct socket * sock);
int (*socket_getsockopt) (struct socket * sock, int level, int optname);
@@ -2294,10 +2308,12 @@ int security_socket_bind(struct socket *
 int security_socket_connect(struct socket *sock, struct sockaddr *address, int 
addrlen);
 int security_socket_listen(struct socket *sock, int backlog);
 int security_socket_accept(struct socket *sock, struct socket *newsock);
-void security_socket_post_accept(struct socket *sock, struct socket *newsock);
+int security_socket_post_accept(struct socket *sock, struct socket *newsock);
 int security_socket_sendmsg(struct socket *sock, struct msghdr *msg, int size);
 int security_socket_recvmsg(struct socket *sock, struct msghdr *msg,
int size, int flags);
+int

Re: [TOMOYO 15/15] LSM expansion for TOMOYO Linux.

2007-09-06 Thread Tetsuo Handa

Hello.

Thank you very much for your time, Paul.
Yes, you understood what I wanted to do.

TOMOYO Linux's approach:

(1) It uses userspace intervention to allow/reject
connections and/or packets based on the application's domain.
Since existent hooks can't be used for this purpose,
I inserted a new hook post_recv_datagram() at skb_recv_datagram()
and I modified socket_post_accept() to return error so that
I can drop/disconnect based on the application's domain.

I think skb_recv_datagram() is the only place that can remove
a message picked up with MSG_PEEK flags from the receive queue.
To remove a message picked up with MSG_PEEK flags, I noticed that
I have to do skb_kill_datagram()-like operation so that
the head message that must not be delivered to the caller won't prevent
picking up of the non-head message that should be delivered to the caller
when the caller repeats only recv(MSG_PEEK) requests.
Since skb_recv_datagram() can be called from interrupt context,
I have to use spin_lock_irqsave() instead for spin_lock_bh(), am I right?

/* from net/core/datagram.c */
@@ -178,6 +179,27 @@ struct sk_buff *skb_recv_datagram(struct
} else
skb = skb_dequeue(sk-sk_receive_queue);
 
+   error = security_post_recv_datagram(sk, skb, flags);
+   if (error) {
+   unsigned long cpu_flags;
+
+   if (!(flags  MSG_PEEK))
+   goto no_peek;
+
+   spin_lock_irqsave(sk-sk_receive_queue.lock,
+ cpu_flags);
+   if (skb == skb_peek(sk-sk_receive_queue)) {
+   __skb_unlink(skb,
+sk-sk_receive_queue);
+   atomic_dec(skb-users);
+   }
+   spin_unlock_irqrestore(sk-sk_receive_queue.lock,
+  cpu_flags);
+no_peek:
+   skb_free_datagram(sk, skb);
+   goto no_packet;
+   }
+
if (skb)
return skb;



By the way, why can't socket_post_accept() fail?
Someone may wish to do memory allocation at socket_post_accept().
socket_accept() is too early for memory allocation because
there is no chance to free allocated memory
when sock-ops-accept() failed.
I think socket_post_accept() should be able to fail.

(2) It allows the administrator judge interactively
using a userspace agent.
Thus, the new hook has to be inserted at blockable location,
Since skb_recv_datagram() can be called from interrupt context,
I do nothing in post_recv_datagram() if called from interrupt context.

+static int tmy_post_recv_datagram(struct sock *sk,
+ struct sk_buff *skb,
+ unsigned int flags)
+{
+   int error = 0;
+   const unsigned int type = sk-sk_type;
+
+   /* skb_recv_datagram() didn't dequeue. */
+   if (!skb)
+   return 0;
+
+   /* skb_recv_datagram() can be called from interrupt context. */
+   if (in_interrupt())
+   return 0;
+   /* I don't check if called by kernel process. */
+   if (segment_eq(get_fs(), KERNEL_DS))
+   return 0;
+
+   if (type != SOCK_DGRAM  type != SOCK_RAW)
+   return 0;
...(sniped)...
+}



Regards.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

What does -EIOCBQUEUED do?

2007-08-13 Thread Tetsuo Handa

Hello.

There are several locations that handle
-EIOCBQUEUED error code.
According to include/linux/errno.h ,
it seems this code is NFS related and
caller will receive completion event later.
But I couldn't figure out where is the beginning point
and what is happening.

What functions are called before aio_complete() is called?
(For example, is udp_recvmsg() called if sock_recvmsg() for
UDP socket returned -EIOCBQUEUED?)

What memory regions are accessed before aio_complete() is called?
(For example, is struct msghdr-msg_name accessed
if sock_recvmsg() for UDP socket returned -EIOCBQUEUED?)

Regards.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/1] Allow LSM to use IP address/port number.

2007-07-20 Thread Tetsuo Handa

Hello.

  Isn't it better to hook into existing netfilter infrastructure somehow?
 Yes, it has been suggested several times.
I want to use a process's context who issued
accept()/recv()/recvfrom()/recvmsg()/read() requests.

But Stephen Smalley said at 
http://marc.info/?l=linux-security-modulem=118459899017487w=2 that
 The socket can be inherited by a child task or passed via local IPC to
 an unrelated task, and then used by those other tasks.  It isn't tied to
 the original allocating task's lifecycle in any way, nor is it
 guaranteed to only be used by that original allocating task.  It can
 exist in zero, one or many tasks' file tables and still be receiving
 data, and can go from completely disassociated (i.e. not present in any
 tasks' file tables) to being associated.
I can't use netfilter infrastructure because
it is too early to know who the recipant process of the packet is.
The only chance to perform ip/port based filtering
using ACLs associated with the recipant process of the packet
is post-accept() and post-recvmsg().

Therefore, I re-post my patch again.

Regards.

Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]
---
 include/linux/security.h |   44 +---
 net/socket.c |   42 --
 security/dummy.c |   13 ++---
 3 files changed, 83 insertions(+), 16 deletions(-)

diff -upr a/include/linux/security.h b/include/linux/security.h
--- a/include/linux/security.h  2007-07-20 22:55:38.0 +0900
+++ b/include/linux/security.h  2007-07-20 23:15:58.0 +0900
@@ -749,8 +749,12 @@ struct request_sock;
  * @socket_post_accept:
  * This hook allows a security module to copy security
  * information into the newly created socket's inode.
+ * This hook also allows a security module to filter connections
+ * from unwanted peers.
+ * The connection will be aborted if this hook returns nonzero.
  * @sock contains the listening socket structure.
  * @newsock contains the newly created server socket for connection.
+ * Return 0 if permission is granted.
  * @socket_sendmsg:
  * Check permission before transmitting a message to another socket.
  * @sock contains the socket structure.
@@ -764,6 +768,15 @@ struct request_sock;
  * @size contains the size of message structure.
  * @flags contains the operational flags.
  * Return 0 if permission is granted.  
+ * @socket_post_recvmsg:
+ * Check peer's address after receiving a message from a socket.
+ * This hook allows a security module to filter messages
+ * from unwanted peers.
+ * @sock contains the socket structure.
+ * @msg contains the message structure.
+ * @size contains the size of message structure.
+ * @flags contains the operational flags.
+ * Return 0 if permission is granted.
  * @socket_getsockname:
  * Check permission before the local address (name) of the socket object
  * @sock is retrieved.
@@ -1345,12 +1358,14 @@ struct security_operations {
   struct sockaddr * address, int addrlen);
int (*socket_listen) (struct socket * sock, int backlog);
int (*socket_accept) (struct socket * sock, struct socket * newsock);
-   void (*socket_post_accept) (struct socket * sock,
-   struct socket * newsock);
+   int (*socket_post_accept) (struct socket *sock,
+  struct socket *newsock);
int (*socket_sendmsg) (struct socket * sock,
   struct msghdr * msg, int size);
int (*socket_recvmsg) (struct socket * sock,
   struct msghdr * msg, int size, int flags);
+   int (*socket_post_recvmsg) (struct socket *sock, struct msghdr *msg,
+   int size, int flags);
int (*socket_getsockname) (struct socket * sock);
int (*socket_getpeername) (struct socket * sock);
int (*socket_getsockopt) (struct socket * sock, int level, int optname);
@@ -2860,10 +2875,10 @@ static inline int security_socket_accept
return security_ops-socket_accept(sock, newsock);
 }
 
-static inline void security_socket_post_accept(struct socket * sock, 
-  struct socket * newsock)
+static inline int security_socket_post_accept(struct socket *sock,
+ struct socket *newsock)
 {
-   security_ops-socket_post_accept(sock, newsock);
+   return security_ops-socket_post_accept(sock, newsock);
 }
 
 static inline int security_socket_sendmsg(struct socket * sock, 
@@ -2879,6 +2894,13 @@ static inline int security_socket_recvms
return security_ops-socket_recvmsg(sock, msg, size, flags);
 }
 
+static inline int security_socket_post_recvmsg(struct socket *sock,
+  struct msghdr *msg

Re: [PATCH 1/1] Allow LSM to use IP address/port number.

2007-07-20 Thread Tetsuo Handa


Hello.

Patrick McHardy wrote:
 Quoting Tetsuo:
   So, my approach is not using security context associated with a socket
   but security context associated with a process.
 Isn't the socket context derived from the process context?
Not so regarding my case.

static int smack_sk_alloc_security(struct sock *sk, int family, gfp_t priority)
{
sk-sk_security = current-security;
return 0;
}

will not help what I want to do.
So, I'm not planning to use sk-sk_security.
I'm planning to use current-security at accept()/recvmsg() time.

What I want to do is to enforce subset of TCP Wrapper inside the kernel space
so that IP/port based filtering for TCP and UDP is applied to ALL processes
and IP/port based filtering for TCP and UDP is not bypassed by processes that 
are
linked with TCP Wrapper library (even if accept()/recvmsg() syscalls are 
directly called).

To receive source IP/port of a incoming packet, non-NULL msg-msg_name is 
needed for receiving them,
and I'm proposing modifications in net/socket.c .

Regards.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC] Allow LSM to use IP address/port number. (was Re: [PATCH 1/1] Add post accept()/recvmsg() hooks.)

2007-07-09 Thread Tetsuo Handa

Hello.

This thread is from http://marc.info/?t=11834645705r=1w=2 .

I want to use tcp_wrapper-like filtering using LSM.
But it seems that there are cases (recvmsg() and read()?) where
__sock_recvmsg() is called with msg-name == NULL and msg-msg_namelen == 0
that makes what I want to do impossible.
To make IP address and port number always available,
some changes in socket layer are needed.

Since I'm not getting objection from LSM-ml so far,
I'm now adding netdev-ml because this patch is related to socket layer.

Are there ways to receive messages other than 
recv()/recvfrom()/recvmsg()/read()?
If recv()/recvfrom()/recvmsg()/read() are all ways to receive messages,
the following patch seems to allow LSM to use IP address and port number.

The following patch allocates buffer for receiving IP address and port number
even if userland doesn't request them.
Is this change no problem?

Regards.

Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]
---
 include/linux/security.h |   38 ++
 net/socket.c |   40 ++--
 security/dummy.c |   11 +--
 3 files changed, 77 insertions(+), 12 deletions(-)

diff -upr a/include/linux/security.h b/include/linux/security.h
--- a/include/linux/security.h  2007-07-03 10:07:14.0 +0900
+++ b/include/linux/security.h  2007-07-09 10:51:04.0 +0900
@@ -748,8 +748,12 @@ struct request_sock;
  * @socket_post_accept:
  * This hook allows a security module to copy security
  * information into the newly created socket's inode.
+ * This hook also allows a security module to filter connections
+ * from unwanted peers.
+ * The connection will be aborted if this hook returns nonzero.
  * @sock contains the listening socket structure.
  * @newsock contains the newly created server socket for connection.
+ * Return 0 if permission is granted.
  * @socket_sendmsg:
  * Check permission before transmitting a message to another socket.
  * @sock contains the socket structure.
@@ -763,6 +767,15 @@ struct request_sock;
  * @size contains the size of message structure.
  * @flags contains the operational flags.
  * Return 0 if permission is granted.  
+ * @socket_post_recvmsg:
+ * Check peer's address after receiving a message from a socket.
+ * This hook allows a security module to filter messages
+ * from unwanted peers.
+ * @sock contains the socket structure.
+ * @msg contains the message structure.
+ * @size contains the size of message structure.
+ * @flags contains the operational flags.
+ * Return 0 if permission is granted.
  * @socket_getsockname:
  * Check permission before the local address (name) of the socket object
  * @sock is retrieved.
@@ -1343,12 +1356,14 @@ struct security_operations {
   struct sockaddr * address, int addrlen);
int (*socket_listen) (struct socket * sock, int backlog);
int (*socket_accept) (struct socket * sock, struct socket * newsock);
-   void (*socket_post_accept) (struct socket * sock,
+   int (*socket_post_accept) (struct socket *sock,
struct socket * newsock);
int (*socket_sendmsg) (struct socket * sock,
   struct msghdr * msg, int size);
int (*socket_recvmsg) (struct socket * sock,
   struct msghdr * msg, int size, int flags);
+   int (*socket_post_recvmsg) (struct socket *sock, struct msghdr *msg,
+   int size, int flags);
int (*socket_getsockname) (struct socket * sock);
int (*socket_getpeername) (struct socket * sock);
int (*socket_getsockopt) (struct socket * sock, int level, int optname);
@@ -2853,10 +2868,10 @@ static inline int security_socket_accept
return security_ops-socket_accept(sock, newsock);
 }
 
-static inline void security_socket_post_accept(struct socket * sock, 
+static inline int security_socket_post_accept(struct socket *sock,
   struct socket * newsock)
 {
-   security_ops-socket_post_accept(sock, newsock);
+   return security_ops-socket_post_accept(sock, newsock);
 }
 
 static inline int security_socket_sendmsg(struct socket * sock, 
@@ -2872,6 +2887,13 @@ static inline int security_socket_recvms
return security_ops-socket_recvmsg(sock, msg, size, flags);
 }
 
+static inline int security_socket_post_recvmsg(struct socket *sock,
+  struct msghdr *msg,
+  int size, int flags)
+{
+   return security_ops-socket_post_recvmsg(sock, msg, size, flags);
+}
+
 static inline int security_socket_getsockname(struct socket * sock)
 {
return security_ops-socket_getsockname(sock);
@@ -3016,9 +3038,10 @@ static inline int security_socket_accept
return 0

Re: [RFC] Allow LSM to use IP address/port number.

2007-07-09 Thread Tetsuo Handa

Hello.

Thank you for your comment.


David Miller wrote:
 I don't think it's such a hot idea to return errors if the
 wait_on_sync_kiocb() has returned success.
My patch may return errors for non-wait_on_sync_kiocb() case too.
Are you saying only wait_on_sync_kiocb() case is bad?
If so, could you please explain me why?
The location I inserted a hook is after aio event finished.
Both struct kiocb and struct sock_iocb are no longer in use.
So, I think this location can safely return errors.


If you are saying don't return error after receiving data,
I'll explain you why I need hooks after receiving data.

What I want to do is to implement a kind of anti-virus software's
personal firewall feature.
It drops messages from unwanted IP address/ports.
(To be exact, it doesn't drop, it just tells userland process
not to use received messages by returning errors.)

security_socket_recvmsg() is called before retrieving a message.
I want a hook that is called after retrieving a message
so that I can use IP address and port number for judgement.

I also want to allow users judge interactively using a popup dialog.
To judge interactively, the hook has to be block-able.
If non-interactive, iptables is the appropriate location for filtering.
This is why I want to insert security_socket_post_recvmsg() hook here.


Regards.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Allow LSM to use IP address/port number.

2007-07-09 Thread Tetsuo Handa

Thank you for your comment.

I have a question regarding netfilter infrastructure.

I want to filter messages using task_struct-security.
Can the netfilter's queuing to userspace feature
get a list of struct task_struct who shares a socket
that is going to receive incoming messages?

My approach is not is this socket allowed to receive from xxx.xxx.xxx.xxx port 
yy
but is this process allowed to receive from xxx.xxx.xxx.xxx port yy.
So, my approach is not using security context associated with a socket
but security context associated with a process.

If I can't use netfilter, there is no chance to filter before enqueuing
messages. So, I think propagating errors to the local user
after dequeuing messages is the only possible way.

Regards.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

57 matches

Mail list logo