Re: fcntl(F_GETPATH) support

2019-09-17 Thread Jason Thorpe


> On Sep 17, 2019, at 9:08 PM, Takashi YAMAMOTO  wrote:
> 
> but it would be a super expensive operation for the majority of filesystems, 
> right?

But it will most likely be extremely rare, so *shrug*?

-- thorpej



Re: fcntl(F_GETPATH) support

2019-09-17 Thread Takashi YAMAMOTO
Mon, Sep 16, 2019, 3:43 Jason Thorpe :

>
>
> > On Sep 15, 2019, at 11:03 AM, Christos Zoulas wrote:
> >
> > I think it is quite reliable because all the file descriptors would be
> > recently opened and therefore be in the cache.  One would need to DoS the
> > cache to cause eviction. If that turns out to be false, we can make the
> > namecache reliable, or withdraw it.
>
> If we had a way to ask the file system for the / a name+parent in the
> event we don't find an entry in the name cache, then we don't have to make
> the name cache "reliable".
>

but it would be a super expensive operation for the majority of
filesystems, right?


>
> FWIW, I don't think making the name cache "reliable" is necessarily a good
> idea -- it could be abused to consume kernel memory by creating lots of
> long-named files / directories.
>
> In the event that the name really can't be determined (perhaps it's an
> open-unlinked file?), then I think it's perfectly reasonable to return
> ENOENT for F_GETPATH; callers need to be capable of handling that.
>
> -- thorpej
>
>
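A caller simply needs to tolerate that failure. Below is a minimal sketch,
assuming only the F_GETPATH semantics described above (the path is copied into
a buffer of at least MAXPATHLEN bytes, and ENOENT is returned when no name can
be determined); the helper name is made up for illustration:

#include <sys/param.h>  /* MAXPATHLEN */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Print the path backing fd, tolerating the ENOENT case. */
static void
print_fd_path(int fd)
{
        char path[MAXPATHLEN];

        if (fcntl(fd, F_GETPATH, path) == -1) {
                if (errno == ENOENT) {
                        /* e.g. an open-unlinked file: no name to report */
                        printf("fd %d: no name could be determined\n", fd);
                        return;
                }
                printf("fd %d: fcntl(F_GETPATH): %s\n", fd, strerror(errno));
                return;
        }
        printf("fd %d: %s\n", fd, path);
}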


Re: fcntl(F_GETPATH) support

2019-09-17 Thread Takashi YAMAMOTO
Sun, Sep 15, 2019, 9:06 Christos Zoulas :

> In article <2f29ca9a-0ae1-48d3-b3f4-1556912d4...@me.com>,
> Jason Thorpe   wrote:
> >
> >
> >> On Sep 14, 2019, at 2:52 PM, Kamil Rytarowski  wrote:
> >>
> >> On 14.09.2019 23:34, Christos Zoulas wrote:
> >>> Comments?
> >>>
> >>
> >> Question. How does it handle symbolic/hardlinks?
> >
> >Symbolic links, of course, are simply continuations of the lookup, so
> >there's "nothing to see here" with a symbolic link.
>
> If you can get a file descriptor to a symlink, it will work; I don't think
> that we have a way to do this now.
>
> >I looked at cache_revlookup() and it looks to me like it simply picks
> >the first hit it finds in the hash table for the given vnode.
>
> That is correct.
>
> >At least one platform that supports this API uses "the last name the file
> >was looked up with", and will fall back on "whatever the underlying file
> >system considers to be the canonical name for the file", which is file
> >system-defined, if for some reason there is no entry in the name cache
> >(which could, for example, happen if an application is doing an
> >"open-by-file-id" type operation and the file has not yet been opened by
> >path name).
>
> It will also fail if the entry has been evicted from the cache, but that
> rarely happens with recently opened files.
>

If we want to provide this API, it should be done in a reliable way.
"rarely fails" is not reliable enough, IMO.


>
> christos
>
>
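Whichever way the name is obtained - from the name cache or from a
file-system-provided name+parent operation - the reconstruction itself is just
a walk from the vnode towards the root, one component at a time. A rough
sketch, where reverse_lookup_once() is a hypothetical stand-in for
cache_revlookup() or a per-filesystem fallback (it is not a real kernel API,
and locking/reference counting are omitted):

/*
 * Sketch of F_GETPATH-style path reconstruction: repeatedly ask for the
 * parent vnode and the child's name within it, building the path
 * right-to-left in the caller's buffer.
 */
static int
build_path(struct vnode *vp, char *buf, size_t buflen)
{
        char *bp = buf + buflen;        /* fill the buffer backwards */
        struct vnode *dvp;
        int error;

        *--bp = '\0';
        while (vp != rootvnode) {
                /* Prepends vp's name at *bp and returns vp's parent in dvp. */
                error = reverse_lookup_once(vp, &dvp, &bp, buf);
                if (error != 0)
                        return error;   /* e.g. ENOENT: no name is known */
                *--bp = '/';
                vp = dvp;
        }
        if (*bp == '\0')
                *--bp = '/';            /* vp was the root itself */
        memmove(buf, bp, (size_t)(buf + buflen - bp));
        return 0;
}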


Re: fcntl(F_GETPATH) support

2019-09-17 Thread Christos Zoulas
In article ,
Jason Thorpe   wrote:
>
>
>> On Sep 17, 2019, at 1:39 PM, Christos Zoulas  wrote:
>> 
>> I am not sure, though, whether we should change the current behavior just
>> to make F_GETPATH better. Opinions?
>
>It seems completely logical that we SHOULD fix this.

I concur, especially because some filesystems already do it. These
are the only ones that need to be changed:

christos

Index: fs/msdosfs/msdosfs_vnops.c
===
RCS file: /cvsroot/src/sys/fs/msdosfs/msdosfs_vnops.c,v
retrieving revision 1.98
diff -u -u -r1.98 msdosfs_vnops.c
--- fs/msdosfs/msdosfs_vnops.c  26 Apr 2017 03:02:48 -  1.98
+++ fs/msdosfs/msdosfs_vnops.c  17 Sep 2019 22:40:01 -
@@ -153,6 +153,8 @@
goto bad;
VN_KNOTE(ap->a_dvp, NOTE_WRITE);
*ap->a_vpp = DETOV(dep);
+   cache_enter(ap->a_dvp, *ap->a_vpp, cnp->cn_nameptr, cnp->cn_namelen,
+   cnp->cn_flags);
return (0);
 
 bad:
Index: fs/tmpfs/tmpfs_subr.c
===
RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs_subr.c,v
retrieving revision 1.104
diff -u -u -r1.104 tmpfs_subr.c
--- fs/tmpfs/tmpfs_subr.c   1 Jan 2019 10:06:54 -   1.104
+++ fs/tmpfs/tmpfs_subr.c   17 Sep 2019 22:40:01 -
@@ -434,6 +434,7 @@
 
VOP_UNLOCK(*vpp);
 
+   cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
return 0;
 }
 
Index: fs/udf/udf_subr.c
===
RCS file: /cvsroot/src/sys/fs/udf/udf_subr.c,v
retrieving revision 1.146
diff -u -u -r1.146 udf_subr.c
--- fs/udf/udf_subr.c   3 Jun 2019 06:04:20 -   1.146
+++ fs/udf/udf_subr.c   17 Sep 2019 22:40:01 -
@@ -5963,6 +5963,7 @@
/* adjust file count */
udf_adjust_filecount(udf_node, 1);
 
+   cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
return 0;
 }
 
Index: ufs/chfs/chfs_vnode.c
===
RCS file: /cvsroot/src/sys/ufs/chfs/chfs_vnode.c,v
retrieving revision 1.15
diff -u -u -r1.15 chfs_vnode.c
--- ufs/chfs/chfs_vnode.c   1 Apr 2017 19:35:57 -   1.15
+++ ufs/chfs/chfs_vnode.c   17 Sep 2019 22:40:01 -
@@ -310,6 +310,8 @@
 
VOP_UNLOCK(vp);
*vpp = vp;
+   cache_enter(pdir, *vpp, cnp->cn_nameptr, cnp->cn_namelen,
+   cnp->cn_flags);
return (0);
 }
 
Index: ufs/ext2fs/ext2fs_vnops.c
===
RCS file: /cvsroot/src/sys/ufs/ext2fs/ext2fs_vnops.c,v
retrieving revision 1.129
diff -u -u -r1.129 ext2fs_vnops.c
--- ufs/ext2fs/ext2fs_vnops.c   1 Jan 2019 10:06:55 -   1.129
+++ ufs/ext2fs/ext2fs_vnops.c   17 Sep 2019 22:40:01 -
@@ -1045,6 +1045,7 @@
}
 
*vpp = tvp;
+   cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
return 0;
 
 bad:
Index: ufs/lfs/lfs_vnops.c
===
RCS file: /cvsroot/src/sys/ufs/lfs/lfs_vnops.c,v
retrieving revision 1.324
diff -u -u -r1.324 lfs_vnops.c
--- ufs/lfs/lfs_vnops.c 20 Jun 2019 00:49:11 -  1.324
+++ ufs/lfs/lfs_vnops.c 17 Sep 2019 22:40:01 -
@@ -405,6 +405,7 @@
if (error)
goto bad;
*vpp = tvp;
+   cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
KASSERT(VOP_ISLOCKED(*vpp) == LK_EXCLUSIVE);
return (0);
 
Index: ufs/ufs/ufs_vnops.c
===
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.247
diff -u -u -r1.247 ufs_vnops.c
--- ufs/ufs/ufs_vnops.c 1 Jul 2019 00:57:06 -   1.247
+++ ufs/ufs/ufs_vnops.c 17 Sep 2019 22:40:01 -
@@ -1909,6 +1909,7 @@
if (error)
goto bad;
*vpp = tvp;
+   cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
return (0);
 
  bad:



Re: fcntl(F_GETPATH) support

2019-09-17 Thread Jason Thorpe



> On Sep 17, 2019, at 1:39 PM, Christos Zoulas  wrote:
> 
> I am not sure, though, whether we should change the current behavior just to
> make F_GETPATH better. Opinions?

It seems completely logical that we SHOULD fix this.

-- thorpej



Re: fcntl(F_GETPATH) support

2019-09-17 Thread Christos Zoulas
In article <4e7e49e0-9c71-1f21-22fc-8ed54590a...@gmx.com>,
Kamil Rytarowski   wrote:
>
>On 15.09.2019 20:03, Christos Zoulas wrote:
>> I think it is quite reliable because all the file descriptors would be
>> recently opened and therefore be in the cache.  One would need to DoS the
>> cache to cause eviction. If that turns out to be false, we can make the
>> namecache reliable, or withdraw it.
>
>As discussed with Mateusz Guzik offlist, the dentry approach looks
>reliable and as a way to go.
>
>It changes fd -> vnode -> inode to fd -> dentry -> vnode.
>
>We could switch from catching the program name on exec to proper pathname
>resolution with KERN_PROC_PATHNAME.
>
>Certain programs always require the correct path to be resolved for
>hardlinks, not just the last one from the cache. This used to affect LLVM.
>
>There is also a hole in the current namecache implementation, as it
>misses entries for newly created files (as pointed out by Mateusz). Example:
>
>#include <sys/types.h>
>#include <fcntl.h>
>#include <err.h>
>#include <stdio.h>
>#include <stdlib.h>
>
>int
>main(void)
>{
>char buf[1024];
>int fd;
>
>fd = open("/tmp/test", O_RDWR|O_CREAT|O_EXCL,0600);
>if (fd == -1)
>err(1, "open");
>if (fcntl(fd, F_GETPATH, buf) < 0)
>err(1, "fcntl");
>printf("[%s]\n", buf);
>}
>
>For the time being, the current code is still an improvement over the
>past state.

Well, I did some experiments by adding a cache_enter call to ufs inode
creation. This fixes the above case, and makes the cache more logically
consistent, i.e. we usually create files we are going to access, so we
might as well cache them (consider a build scenario and object files,
or even binaries that will get installed). This is a one-line change
per filesystem:

Index: ufs_vnops.c
===
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.247
diff -u -p -u -r1.247 ufs_vnops.c
--- ufs_vnops.c 1 Jul 2019 00:57:06 -   1.247
+++ ufs_vnops.c 17 Sep 2019 20:32:48 -
@@ -1909,6 +1909,7 @@ ufs_makeinode(struct vattr *vap, struct 
if (error)
goto bad;
*vpp = tvp;
+   cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
return (0);
 
  bad:

The above change does not seem to affect build times...

no cache
build.sh started:Mon Sep 16 13:11:04 EDT 2019
build.sh ended:  Mon Sep 16 15:27:41 EDT 2019
2:16:37

cache
build.sh started:Mon Sep 16 16:02:18 EDT 2019
build.sh ended:  Mon Sep 16 18:18:23 EDT 2019
2:16:15

I don't know what made the builds slower the second day... We should
investigate why builds on freshly booted machines are faster!

cache
build.sh started:Tue Sep 17 11:05:35 EDT 2019
build.sh ended:  Tue Sep 17 13:37:33 EDT 2019
2:31:58

no cache
build.sh started:Tue Sep 17 14:01:14 EDT 2019
build.sh ended:  Tue Sep 17 16:29:17 EDT 2019
2:28:03

I am not sure, though, whether we should change the current behavior just to
make F_GETPATH better. Opinions?

christos



Re: pool guard

2019-09-17 Thread Maxime Villard

On 12/09/2019 at 08:21, Maxime Villard wrote:

On 06/09/2019 at 15:09, Maxime Villard wrote:

An idea for a feature similar to KMEM_GUARD - which I recently removed because
it was too weak and useless - but this time at the pool layer, covering certain
specific pools, without memory consumption or performance cost, and enabled by
default at least on amd64. Note that this is hardening and exploit mitigation,
not bug detection, so it will be of little interest in the context of fuzzing.
Note also that it targets 64-bit arches, because they have nearly unlimited VA.

The idea is that we can use special guard allocators on certain pools to prevent
important kernel data from being close to untrusted data in the VA space.

Suppose the kernel is parsing a packet received from the network, and there
is a buffer overflow which causes it to write beyond the mbuf. The data is in
an mbuf cluster of size 2K (on amd64). This mbuf cluster sits on a 4K page
allocated using the default pool allocator. Right after that 4K page in memory,
critical kernel data could be sitting, which an attacker could overwrite.

                     overflow
                     ----------->
 +------------+------------+----------------------+
 | 2K Cluster | 2K Cluster | Critical Kernel Data |
 +------------+------------+----------------------+
  <- usual 4K pool page --> <- another 4K page -->

This is a scenario that I already encountered when working on NetBSD's network
stack.

Now, we switch the mcl pool to use the new uvm_km_guard API (simple wrappers
to allocate buffers with unmapped pages at the beginning and the end). The pool
layer sees pages of size 128K, and packs 64 2K clusters in them.

                        overflow
                        ~~~~~~~~~~~~~~~~~~~~~~~>
 +------------+-------+-------+-------+-------+-------+------------+
 |  Unmapped  | 2K C. | 2K C. | [...] | 2K C. | 2K C. |  Unmapped  |
 +------------+-------+-------+-------+-------+-------+------------+
  <-- 64K ---> <-- 128K pool page with 64 clusters --> <-- 64K --->

The pool page header is off-page, and bitmapped. Therefore, there is strictly
no kernel data in the 128K pool page.

The overflow still occurs, but this time the critical kernel data is far from
here, after the unmapped pages at the end. At worst only other clusters get
overwritten; at best we are close to the end and hit a page fault which stops
the overflow. 64K is chosen as the maximum of uint16_t.

No performance cost, because these guarded buffers are allocated only when the
pools grow, which is a rare operation that occurs almost only at boot time. No
actual memory consumption either, because unmapped areas don't consume physical
memory, only virtual, and on 64-bit arches we have plenty of that - e.g. 32TB
on amd64, far beyond what we will ever need - so consuming VA is not a problem.

The code for mcl is here [1]; it is simple and works fine. It is not perfect,
but it can already prevent a lot of trouble. The principle could be applied to
other pools.

[1] https://m00nbsd.net/garbage/pool/guard.diff
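For illustration only, here is a rough sketch of what allocating one such
guarded 128K pool page could look like using stock UVM primitives. This is not
the code from [1]; error handling is omitted, and it ignores the
interrupt-context constraint discussed further down in the thread:

#include <sys/param.h>
#include <uvm/uvm.h>

#define GUARD_SIZE      (64 * 1024)     /* unmapped region on each side */
#define PAYLOAD_SIZE    (128 * 1024)    /* holds 64 x 2K mbuf clusters */

/*
 * Reserve 64K + 128K + 64K of kernel VA, but back only the middle 128K
 * with physical pages.  The two 64K ends stay unmapped, so an overflow
 * running past the pool page faults instead of reaching kernel data.
 */
static vaddr_t
guarded_pool_page_alloc(void)
{
        vaddr_t va, off;
        struct vm_page *pg;

        /* Reserve VA only, no physical backing yet. */
        va = uvm_km_alloc(kernel_map, GUARD_SIZE + PAYLOAD_SIZE + GUARD_SIZE,
            0, UVM_KMF_VAONLY);
        if (va == 0)
                return 0;

        /* Wire physical pages into the middle 128K only. */
        for (off = 0; off < PAYLOAD_SIZE; off += PAGE_SIZE) {
                pg = uvm_pagealloc(NULL, 0, NULL, 0);
                /* (allocation failure handling omitted for brevity) */
                pmap_kenter_pa(va + GUARD_SIZE + off, VM_PAGE_TO_PHYS(pg),
                    VM_PROT_READ | VM_PROT_WRITE, 0);
        }
        pmap_update(pmap_kernel());

        /* The pool layer only ever sees the mapped middle region. */
        return va + GUARD_SIZE;
}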


If there are no further comments, I will commit it within a week.


Actually, I realized there is a problem. uvm_km_guard must use uvm_map and not
the kmem arena, because the kmem arena is sized in proportion to the PA. The
thing is, using uvm_map is illegal here, because we're sometimes in interrupt
context. This needs to be revisited.