Re: fcntl(F_GETPATH) support
> On Sep 17, 2019, at 9:08 PM, Takashi YAMAMOTO wrote:
>
> but it would be a super expensive operation for the majority of
> filesystems, right?

But it will most likely be extremely rare, so *shrug*?

-- thorpej
Re: fcntl(F_GETPATH) support
On Mon, Sep 16, 2019 at 3:43, Jason Thorpe wrote:
>
>> On Sep 15, 2019, at 11:03 AM, Christos Zoulas wrote:
>>
>> I think it is quite reliable because all the file descriptors would be
>> recently opened and therefore be in the cache. One would need to DoS the
>> cache to cause eviction. If that turns out to be false, we can make the
>> namecache reliable, or withdraw it.
>
> If we had a way to ask the file system for the / a name+parent in the
> event we don't find an entry in the name cache, then we don't have to
> make the name cache "reliable".

but it would be a super expensive operation for the majority of
filesystems, right?

> FWIW, I don't think making the name cache "reliable" is necessarily a
> good idea -- it could be abused to consume kernel memory by creating
> lots of long-named files / directories.
>
> In the event that the name really can't be determined (perhaps it's an
> open-unlinked file?), then I think it's perfectly reasonable to return
> ENOENT for F_GETPATH; callers need to be capable of handling that.
>
> -- thorpej
Re: fcntl(F_GETPATH) support
On Sun, Sep 15, 2019 at 9:06, Christos Zoulas wrote:
> In article <2f29ca9a-0ae1-48d3-b3f4-1556912d4...@me.com>,
> Jason Thorpe wrote:
>>
>>> On Sep 14, 2019, at 2:52 PM, Kamil Rytarowski wrote:
>>>
>>> On 14.09.2019 23:34, Christos Zoulas wrote:
>>>> Comments?
>>>
>>> Question. How does it handle symbolic/hardlinks?
>>
>> Symbolic links, of course, are simply continuations of the lookup, so
>> there's "nothing to see here" with a symbolic link.
>
> If you can get a file descriptor to a symlink, it will work; I don't
> think that we have a way to do this now.
>
>> I looked at cache_revlookup() and it looks to me like it simply picks
>> the first hit it finds in the hash table for the given vnode.
>
> That is correct.
>
>> At least one platform that supports this API uses "the last name the
>> file was looked up with", and will fall back on "whatever the
>> underlying file system considers to be the canonical name for the
>> file", which is file system-defined, if for some reason there is no
>> entry in the name cache (which could, for example, happen if an
>> application is doing an "open-by-file-id" type operation and the file
>> has not yet been opened by path name).
>
> It will also fail if the entry has been evicted from the cache, but that
> rarely happens with recently opened files.

if we want to provide this API, it should be done in a reliable way.
"rarely fails" is not reliable enough, IMO.

> christos
Re: fcntl(F_GETPATH) support
In article , Jason Thorpe wrote:
>
>> On Sep 17, 2019, at 1:39 PM, Christos Zoulas wrote:
>>
>> I am not sure though if we should change the current behavior just to
>> make F_GETPATH better? Opinions?
>
> It seems completely logical that we SHOULD fix this.

I concur, especially because some filesystems already do it. These are the
only ones that need to be changed:

christos

Index: fs/msdosfs/msdosfs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/fs/msdosfs/msdosfs_vnops.c,v
retrieving revision 1.98
diff -u -u -r1.98 msdosfs_vnops.c
--- fs/msdosfs/msdosfs_vnops.c	26 Apr 2017 03:02:48 -0000	1.98
+++ fs/msdosfs/msdosfs_vnops.c	17 Sep 2019 22:40:01 -0000
@@ -153,6 +153,8 @@
 		goto bad;
 	VN_KNOTE(ap->a_dvp, NOTE_WRITE);
 	*ap->a_vpp = DETOV(dep);
+	cache_enter(ap->a_dvp, *ap->a_vpp, cnp->cn_nameptr, cnp->cn_namelen,
+	    cnp->cn_flags);
 	return (0);
 
 bad:
Index: fs/tmpfs/tmpfs_subr.c
===================================================================
RCS file: /cvsroot/src/sys/fs/tmpfs/tmpfs_subr.c,v
retrieving revision 1.104
diff -u -u -r1.104 tmpfs_subr.c
--- fs/tmpfs/tmpfs_subr.c	1 Jan 2019 10:06:54 -0000	1.104
+++ fs/tmpfs/tmpfs_subr.c	17 Sep 2019 22:40:01 -0000
@@ -434,6 +434,7 @@
 
 	VOP_UNLOCK(*vpp);
 
+	cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
 	return 0;
 }
Index: fs/udf/udf_subr.c
===================================================================
RCS file: /cvsroot/src/sys/fs/udf/udf_subr.c,v
retrieving revision 1.146
diff -u -u -r1.146 udf_subr.c
--- fs/udf/udf_subr.c	3 Jun 2019 06:04:20 -0000	1.146
+++ fs/udf/udf_subr.c	17 Sep 2019 22:40:01 -0000
@@ -5963,6 +5963,7 @@
 	/* adjust file count */
 	udf_adjust_filecount(udf_node, 1);
 
+	cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
 	return 0;
 }
Index: ufs/chfs/chfs_vnode.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/chfs/chfs_vnode.c,v
retrieving revision 1.15
diff -u -u -r1.15 chfs_vnode.c
--- ufs/chfs/chfs_vnode.c	1 Apr 2017 19:35:57 -0000	1.15
+++ ufs/chfs/chfs_vnode.c	17 Sep 2019 22:40:01 -0000
@@ -310,6 +310,8 @@
 	VOP_UNLOCK(vp);
 	*vpp = vp;
+	cache_enter(pdir, *vpp, cnp->cn_nameptr, cnp->cn_namelen,
+	    cnp->cn_flags);
 	return (0);
 }
Index: ufs/ext2fs/ext2fs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ext2fs/ext2fs_vnops.c,v
retrieving revision 1.129
diff -u -u -r1.129 ext2fs_vnops.c
--- ufs/ext2fs/ext2fs_vnops.c	1 Jan 2019 10:06:55 -0000	1.129
+++ ufs/ext2fs/ext2fs_vnops.c	17 Sep 2019 22:40:01 -0000
@@ -1045,6 +1045,7 @@
 	}
 	*vpp = tvp;
+	cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
 	return 0;
 
 bad:
Index: ufs/lfs/lfs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/lfs/lfs_vnops.c,v
retrieving revision 1.324
diff -u -u -r1.324 lfs_vnops.c
--- ufs/lfs/lfs_vnops.c	20 Jun 2019 00:49:11 -0000	1.324
+++ ufs/lfs/lfs_vnops.c	17 Sep 2019 22:40:01 -0000
@@ -405,6 +405,7 @@
 	if (error)
 		goto bad;
 	*vpp = tvp;
+	cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
 	KASSERT(VOP_ISLOCKED(*vpp) == LK_EXCLUSIVE);
 	return (0);
Index: ufs/ufs/ufs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.247
diff -u -u -r1.247 ufs_vnops.c
--- ufs/ufs/ufs_vnops.c	1 Jul 2019 00:57:06 -0000	1.247
+++ ufs/ufs/ufs_vnops.c	17 Sep 2019 22:40:01 -0000
@@ -1909,6 +1909,7 @@
 	if (error)
 		goto bad;
 	*vpp = tvp;
+	cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
 	return (0);
 
 bad:
Re: fcntl(F_GETPATH) support
> On Sep 17, 2019, at 1:39 PM, Christos Zoulas wrote:
>
> I am not sure though if we should change the current behavior just to
> make F_GETPATH better? Opinions?

It seems completely logical that we SHOULD fix this.

-- thorpej
Re: fcntl(F_GETPATH) support
In article <4e7e49e0-9c71-1f21-22fc-8ed54590a...@gmx.com>,
Kamil Rytarowski wrote:
>
> On 15.09.2019 20:03, Christos Zoulas wrote:
>> I think it is quite reliable because all the file descriptors would be
>> recently opened and therefore be in the cache. One would need to DoS the
>> cache to cause eviction. If that turns out to be false, we can make the
>> namecache reliable, or withdraw it.
>
> As discussed with Mateusz Guzik offlist, the dentry approach looks
> reliable and is the way to go.
>
> It changes fd -> vnode -> inode to fd -> dentry -> vnode.
>
> We could switch from catching the program name on exec to proper
> pathname resolution with KERN_PROC_PATHNAME.
>
> Certain programs require the correct path to always be resolved from
> hardlinks, not the last one from the cache. This used to affect LLVM.
>
> There is also a hole in the current namecache implementation, as it
> misses the entry for newly created files (as reported by Mateusz).
> Example (header names restored; they were garbled in the archive):
>
> #include <sys/types.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <err.h>
>
> int
> main(void)
> {
> 	char buf[1024];
> 	int fd;
>
> 	fd = open("/tmp/test", O_RDWR|O_CREAT|O_EXCL, 0600);
> 	if (fd == -1)
> 		err(1, "open");
> 	if (fcntl(fd, F_GETPATH, buf) < 0)
> 		err(1, "fcntl");
> 	printf("[%s]\n", buf);
> }
>
> For the time being, the current code is still an improvement over the
> past state.

Well, I did some experiments by adding a cache_enter() call in the ufs
inode-creation path. This fixes the above case, and makes the cache more
logically consistent: we usually create files we are going to access, so
we might as well cache them (consider a build scenario and object files,
or even binaries that will get installed).
This is a one-line change per filesystem:

Index: ufs_vnops.c
===================================================================
RCS file: /cvsroot/src/sys/ufs/ufs/ufs_vnops.c,v
retrieving revision 1.247
diff -u -p -u -r1.247 ufs_vnops.c
--- ufs_vnops.c	1 Jul 2019 00:57:06 -0000	1.247
+++ ufs_vnops.c	17 Sep 2019 20:32:48 -0000
@@ -1909,6 +1909,7 @@ ufs_makeinode(struct vattr *vap, struct
 	if (error)
 		goto bad;
 	*vpp = tvp;
+	cache_enter(dvp, *vpp, cnp->cn_nameptr, cnp->cn_namelen, cnp->cn_flags);
 	return (0);
 
 bad:

The above change does not seem to affect build times...

no cache
	build.sh started:	Mon Sep 16 13:11:04 EDT 2019
	build.sh ended:		Mon Sep 16 15:27:41 EDT 2019
	2:16:37
cache
	build.sh started:	Mon Sep 16 16:02:18 EDT 2019
	build.sh ended:		Mon Sep 16 18:18:23 EDT 2019
	2:16:15

I don't know what made the builds slower the second day... We should
investigate why builds on freshly booted machines are faster!

cache
	build.sh started:	Tue Sep 17 11:05:35 EDT 2019
	build.sh ended:		Tue Sep 17 13:37:33 EDT 2019
	2:31:58
no cache
	build.sh started:	Tue Sep 17 14:01:14 EDT 2019
	build.sh ended:		Tue Sep 17 16:29:17 EDT 2019
	2:28:03

I am not sure, though, whether we should change the current behavior just
to make F_GETPATH better. Opinions?

christos
Re: pool guard
On 12/09/2019 at 08:21, Maxime Villard wrote:
> On 06/09/2019 at 15:09, Maxime Villard wrote:
>> An idea for a feature similar to KMEM_GUARD -- which I recently removed
>> because it was too weak and useless -- but this time at the pool layer,
>> covering certain specific pools, without memory consumption or
>> performance cost, and enabled by default at least on amd64.
>>
>> Note that this is hardening and exploit mitigation, not bug detection,
>> so it will be of little interest in the context of fuzzing. Note also
>> that it targets 64-bit arches, because they have nearly unlimited VA.
>>
>> The idea is that we can use special guard allocators on certain pools
>> to prevent important kernel data from being close to untrusted data in
>> the VA space.
>>
>> Suppose the kernel is parsing a received packet from the network, and
>> there is a buffer overflow which causes it to write beyond the mbuf.
>> The data is in an mbuf cluster of size 2K (on amd64). This mbuf cluster
>> sits on a 4K page allocated using the default pool allocator. After the
>> 4K page in memory, there could be critical kernel data sitting, which
>> an attacker could overwrite.
>>
>>                  overflow --->
>>   +------------+------------+----------------------+
>>   | 2K Cluster | 2K Cluster | Critical Kernel Data |
>>   +------------+------------+----------------------+
>>   <-- usual 4K pool page --> <-- another 4K page -->
>>
>> This is a scenario that I already encountered when working on NetBSD's
>> network stack.
>>
>> Now, we switch the mcl pool to use the new uvm_km_guard API (simple
>> wrappers to allocate buffers with unmapped pages at the beginning and
>> the end). The pool layer sees pages of size 128K, and packs 64 2K
>> clusters in them.
>>
>>    overflow ~>
>>   +----------+-------+-------+-------+-------+-------+----------+
>>   | Unmapped | 2K C. | 2K C. | [...] | 2K C. | 2K C. | Unmapped |
>>   +----------+-------+-------+-------+-------+-------+----------+
>>   <-- 64K --> <- 128K pool page with 64 clusters -> <-- 64K -->
>>
>> The pool page header is off-page and bitmapped; therefore, there is
>> strictly no kernel data in the 128K pool page. The overflow still
>> occurs, but this time the critical kernel data is far away, past the
>> unmapped pages at the end. At worst only other clusters get
>> overwritten; at best we are close to the end and hit a page fault which
>> stops the overflow. 64K is chosen as the maximum of uint16_t.
>>
>> No performance cost, because these guarded buffers are allocated only
>> when the pools grow, which is a rare operation that occurs almost only
>> at boot time. No actual memory consumption either, because unmapped
>> areas don't consume physical memory, only virtual, and on 64-bit
>> arches we have plenty of that -- eg 32TB on amd64, far beyond what we
>> will ever need -- so no problem with consuming VA.
>>
>> The code is here [1] for mcl; it is simple and works fine. It is not
>> perfect, but can already prevent a lot of trouble. The principle could
>> be applied to other pools.
>>
>> [1] https://m00nbsd.net/garbage/pool/guard.diff
>
> If there are no further comments, I will commit it within a week.

Actually, I realized there is a problem: uvm_km_guard must use uvm_map
and not the kmem arena, because the kmem arena is sized proportionally to
physical memory. The thing is, using uvm_map here is illegal, because we
are sometimes in interrupt context. Needs to be revisited.