Re: adding linux syscall fallocate

2019-11-04 Thread Maciej
Hi r0ller,

A couple of weeks ago I also run to the issue when I found lack of fallocate or 
POSIX_FALLOCATE(2) (to be precise) a little bit sad.
From the very first view typical usage of POSIX_FALLOCATE(2) seems straight 
forward, comparing to the Linux fallocate(2) where different modes have to be 
supported. However, things can go a little bit more complicated if you are 
dealing with an existing file with a more complex structure.
Because of that, I found difficult to provide good quality implementation 
without a proper way to test it.
Before EuroBSD 2019 as a part of work about fuzzing the FFS, I decided to bring 
some well known FS test (namely speaking "XFS tests') suit to make sure that 
bugs that we fix did not introduce a regression.
The same thing applies to the new features of FS, is relatively easy to port 
implementation from other OS/FS, but without a proper way to test them, I would 
be very careful to introduce such things too quickly to the end-users.

One thing that I was missing for XFS tests, and going to publish part of it 
soon, is a way to view the internal structure of inodes and other metadata of 
Filesystem. My primary use case was to debug mount issues, in the example the 
issue that I showed during my presentation about the fuzzing. But also same 
apply to the code that manipulates inode blocks.

Hopefully, we can elaborate on that, and as I pointed earlier I would be 
careful with merging new FS features especially such wide used as 
POSIX_FALLOCATE(2) without a proper FS test suit or extensive enough testing 
that would require proper too like i.e. FSDB.

Thanks
Maciej

‐‐‐ Original Message ‐‐‐
On Sunday, November 3, 2019 6:06 PM, r0ller  wrote:

> Hi Jaromir,
>
> Indeed. That's bad news but thanks for your answer! I've even found this: 
> https://wiki.netbsd.org/projects/project/ffs-fallocate/
> Are there any details for this project besides that page? I don't know 
> anything about NetBSD internals though if it's not meant for gurus, I'd have 
> a look at it and give it a try.
>
> Best regards,
> r0ller
>
>  Eredeti levél 
> Feladó: Jaromír Doleček < jaromir.dole...@gmail.com (Link -> 
> mailto:jaromir.dole...@gmail.com) >
> Dátum: 2019 november 3 15:16:34
> Tárgy: Re: adding linux syscall fallocate
> Címzett: r0ller < r0l...@freemail.hu (Link -> mailto:r0l...@freemail.hu) >
> Le dim. 3 nov. 2019 à 08:57, r0ller  a écrit :
>
>> As you can see on the attached screenshot, "line 4741" gets printed out. So 
>> I went on to check what happens in VOP_FALLOCATE but it gets really internal 
>> there.
>>
>> Does anyone have any hint?
>
> fallocate VOP is not implemented for FFS:
>
>> grep fallocate *
> ffs_vnops.c: { &vop_fallocate_desc, genfs_eopnotsupp }, /* fallocate */
> ffs_vnops.c: { &vop_fallocate_desc, spec_fallocate }, /* fallocate */
> ffs_vnops.c: { &vop_fallocate_desc, vn_fifo_bypass }, /* fallocate */
>
> Jaromir

Re: adding linux syscall fallocate

2019-11-09 Thread Maciej
Hi r0ller,

"I do know that writing a file by calling fallocate can be tricked and 
redirected into a named pipe of which the data can simply be pulled into a 
normal file."

> My understanding of both `fallocate` (which can be emulated on not Linux 
> FSes/OSes) and `posix_fallocate` is that this call is made to ensures that 
> the space in the range offset to offset+len is allocated on storage medium.
Based on my experience (and I may not know about all possible use cases as my 
main background is more Archive/Databases/device drivers), two main use-cases 
that people use `fallocate`:
- Stability to reserve a place on the disk because bad things can happen in the 
pipeline of any process if some file is not completed because of the lack of 
disk space or other issues.
- Various performance tricks reserve contiguous region, mmap it, do large 
sequential writes or asynchronous operations. Also the simplest performance use 
case relay on the fact that if you do not have allocated blocks in order to 
flush write to the disk you need to do read to find blocks and fill metadata.

btw: Also people from device drivers used to like fallocation as it makes many 
driver error scenarios much easier to handle.

Now if your cross-compilation don't perform complicated performance tricks or 
other fallocation operation that I am not aware of (but will be more than happy 
to learn), we can try to fix that properly.
The `posix_fallocate` implementation that I started a couple of weeks ago 
should work for the use case where FileSystem mainly focuses to reserve the 
blocks upfront before placing large assets like binary files, logs or just 
regular non-sparse files.
If you are interested we can try that, but first of all, could you run the 
process under the "ktruss -i" and send me the output (you can just grep for 
`posix_fallocate`, `fdiscard`, `open`, `close` and eventually other FS 
operations)?

Coming back to your question: if I understand you correctly you just redirect 
operations to the FIFO in buffering fashion and then just flush them. I cannot 
see how this solves fallocation issue, except the fact that fallocate probably 
does not make too much sense for the cross-compilation process because when 
someone will try to open the file or do other operation it will be blocked 
waiting for other processes to fill it.
If we consider a single process which is going to create a file do `falloc` and 
later try to write to it. At the very beginning, it will either try to remove 
the previous file if exist or it may do open on the path and hangs waiting for 
other processes to fill it. I worry that you may face a couple of such 
scenarios in a more complicated version.
I have some doubts if in that way it will be easy to work around on such a 
large project as a compilation.
Maybe just mocking fallocate to simply return success without doing anything in 
FS would be an easier hack, but you will need to make sure that you won't run 
out of space on your working storage medium...

Also before any experiments always do the backup and be sure that you can lose 
the data that you are working on! ;)
Let me know what do you think
Cheers
Maciej

‐‐‐ Original Message ‐‐‐
On Wednesday, November 6, 2019 9:12 AM, r0ller  wrote:

> Hi Maciej,
>
> Thanks for the detailed answer! Unfortunately, I don't think that I could 
> accomplish this task in my spare time:(
>
> Please, don't take this as an offence but as a fix for my case I thought of 
> an utter hack:
>
> I do know that writing a file by calling fallocate can be tricked and 
> redirected into a named pipe of which the data can simply be pulled into a 
> normal file. This is what I'm already doing in my project as a workaround 
> when building it as 32bit arm lib:
>
> mkfifo mypipe
> cat mypipe > myfile
> 
>
> The problem with this is that it cannot be used when crosscompiling autoconf 
> projects where a configure script starts creating many files as I'd need to 
> edit the script at too many places to implement this trick.
>
> However, if I could carry out this trick with the pipe when intercepting the 
> linux fallocate call, it could work. Do you think it feasible?
>
> Best regards,
> r0ller
>
>  Eredeti levél 
> Feladó: Maciej < maciej.grochow...@protonmail.com (Link -> 
> mailto:maciej.grochow...@protonmail.com) >
> Dátum: 2019 november 4 23:32:56
> Tárgy: Re: adding linux syscall fallocate
> Címzett: r0ller < r0l...@freemail.hu (Link -> mailto:r0l...@freemail.hu) >
>
> Hi r0ller,
>
> A couple of weeks ago I also run to the issue when I found lack of fallocate 
> or POSIX_FALLOCATE(2) (to be precise) a little bit sad.
>>From the very first view typical usage of POSIX_FALLOCATE(2) seems straight 
>>forward, comparing to the Linux fallocate(2) where

Proposal: validate FFS root inode during the mount.

2019-11-18 Thread Maciej
Hi all,

During the fuzzing of FFS filesystem, we had a couple of issues caused by 
corrupted inode fields.
Especially there is a lot of things that can go wrong when `di_mode`, `di_size` 
and `di_nlink` are empty.
Example of corrupted root inode, ufs1 but ufs2 will be similar.

```
fsdb (inum: 2)> pf inode number=2 format=ufs1
command `pf inode number=2 format=ufs1
'
Disk format ufs1inode 2 block: 512

di_mode: 0x0  <<di_nlink: 0x0  <<
di_size: 0x0  <<di_atime: 0x0
di_atimensec: 0x0   di_mtime: 0x0
di_mtimensec: 0x0   di_ctime: 0x0
di_ctimensec: 0x0   di_flags: 0x0
di_blocks: 0x0  di_gen: 0x6c3122e2
di_uid: 0x0 di_gid: 0x0
di_modrev: 0x0
--- inode.di_oldids ---
...
```

Below dump of inode from freshly created FS:
```
fsdb (inum: 2)> pf inode number=2
command `pf inode number=2
'
Disk format ufs1inode 2 block: 512

di_mode: 0x41ed di_nlink: 0x2
di_size: 0x200  di_atime: 0x0
di_atimensec: 0x0   di_mtime: 0x0
di_mtimensec: 0x0   di_ctime: 0x0
di_ctimensec: 0x0   di_flags: 0x0
di_blocks: 0x1  di_gen: 0x68881d2c
di_uid: 0x0 di_gid: 0x0
di_modrev: 0x0
--- inode.di_oldids ---
...
```

Main issues with such corrupted root inode are:
- failing all operation where `namei` is involved, like `lstat` or `umount` (so 
after mounting the corrupted image user cannot umount it).
- system crashes when the block size is equal to zero on lstat or other lookup 
ops (i.e. `ls -alh /mnt` will cause panic).

There is couple more but I didn't study them in details.

To make sure that corrupted mount won't cause harm to the user,
I want to add function to validate root inode on mount step (after superblock 
validation)
Initial version that I prepared read inode from the disk and check these fields,
however, it does not add root vnode to the vcache.
Downsize is obviously additional read from the disk however I think is worth it.

I did some walk through vnode life-cycle and adding it to the vcache on mount 
step does not seem to impact lifecycle.
Although for now, I decided to not make things complicated,
but I will be more than happy to add this functionality if people consider that 
is worth it.

Any thoughts about such validation?
I attached the patch below.
Is not the final version as I need to check a couple of remount scenarios.

Thanks
Maciej

ps. More details about Fuzzing FFS:
https://blog.netbsd.org/tnf/entry/write_your_own_fuzzer_for
https://blog.netbsd.org/tnf/entry/fuzzing_netbsd_filesystems_via_afldiff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index 138b8781b60f..71e71f1a830e 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -92,6 +92,7 @@ __KERNEL_RCSID(0, "$NetBSD: ffs_vfsops.c,v 1.361 2019/01/01 10:06:55 hannken Exp
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -115,6 +116,7 @@ MODULE(MODULE_CLASS_VFS, ffs, NULL);
 
 static int ffs_vfs_fsync(vnode_t *, int);
 static int ffs_superblock_validate(struct fs *);
+static int ffs_rootinode_validate(struct vnode *, struct fs *, struct ufsmount *);
 static int ffs_is_appleufs(struct vnode *, struct fs *);
 
 static int ffs_init_vnode(struct ufsmount *, struct vnode *, ino_t);
@@ -798,6 +800,13 @@ ffs_reload(struct mount *mp, kauth_cred_t cred, struct lwp *l)
 		return (EINVAL);
 	}
 
+	if (!ffs_rootinode_validate(devvp, newfs, ump)) {
+		DPRINTF("Corrupted root inode cannot mount Filesystem.");
+		DPRINTF("Please run fsck first!");
+		kmem_free(newfs, fs_sbsize);
+		return (EINVAL);
+	}
+
 	/*
 	 * The current implementation doesn't handle the possibility that
 	 * these values may have changed.
@@ -943,6 +952,58 @@ ffs_reload(struct mount *mp, kauth_cred_t cred, struct lwp *l)
  */
 static const int sblock_try[] = SBLOCKSEARCH;
 
+int ffs_rootinode_validate(struct vnode *devvp, struct fs *fs, struct ufsmount *ump)
+{
+	struct ufs1_dinode *dp1;
+	struct ufs2_dinode *dp2;
+	struct inode ip;
+	struct buf *bp;
+	int error;
+	int result;
+	ino_t root_ino;
+
+	result = 1;
+	root_ino = UFS_ROOTINO;
+
+	/* Read the disk contents for the inode. */
+	error = bread(devvp, FFS_FSBTODB(fs, ino_to_fsba(fs, root_ino)),
+  (int)fs->fs_bsize, 0, &bp);
+	if (error) {
+		printf("CANNOT bread!!");
+		return error;
+	}
+
+	/* Initialize temporary root inode. */
+	memset(&ip, 0, sizeof(struct inode));
+	ip.i_ump = ump;
+	ip.i_fs = fs;
+	ip.i_dev = ump->um_dev;
+	ip.i_number = root_ino;
+
+	ffs_load_inode(bp, &ip, fs, root_ino);
+
+	if (fs->fs_magic == FS_UFS1_MAGIC) {
+		dp1 = ip.i_din.ffs1_din;
+
+		if (!S_ISDIR(dp1->di_mode) || !dp1->di_size || !dp1->di_blocks
+|| !dp1->di_nlink) {
+			result = 0;
+		}
+	} else { /* fs->fs_m

Re: adding linux syscall fallocate

2019-11-19 Thread Maciej
>You don't have to know where to write. Strictly speaking you only need to know 
>that the space is reserved. We do
>have enough magic block numbers that could serve as a marker. E.g. anything up 
>to the end of the first super block
>isn't allocatable. And there are enough spare fields in the fs, cg and csum 
>structs.

One thing that I am missing here is: if you would keep info about reservation 
as number of magic block i.e. M inside blocks pointers, and you would reserve 
lets say 1000 blocks for file from 1200 available on storage medium, you will 
have enough block for this reservation but later another one will came call for 
another file which will require another 1000 blocks and this one should fail 
because we do not have enough space, so knowledge about history of allocations 
is require.

If you request block-allocator from fallocate to reserve these block, to make 
sure that they will be available as a reservation, you decrements number of 
available blk on FS, but I think info about this allocation should be stored 
somewhere? If we need to keep that, it will need some additional list or pool 
inside Cylinder Group to store these allocated but not assigned blocks? This 
info needs to be permanently written somewhere I think it is variable length 
plus we can have multiple requests/free that would be not contiguous in blocks 
so I have some doubts if we can just put them as a range with start-end block 
and call fallocated-blocks.

I remember when ext4 come with Linux fallocate, and they support only extend 
based files because it is obvious how to do that on extends, however for some 
reason they didn't support block-based old files, and I was thinking that was 
because of that reason that would require changes on-disk layout, and 
historically posix_fallocate was slow because it zeroed blocks.

Thanks
Maciej

‐‐‐ Original Message ‐‐‐
On Tuesday, November 19, 2019 4:00 PM, Christoph Badura  wrote:

> On Mon, Nov 18, 2019 at 02:26:17PM -0800, Jason Thorpe wrote:
>
> > > On Nov 18, 2019, at 1:13 PM, Mouse mo...@rodents-montreal.org wrote:
> > > All you need is a second magic block number. Block number zero is
> > > already reserved for holes. Making, say, block number 1, or -1, or
> > > some such, reserved to represent "block-of-zeros semantics for which
> > > all backing data has been accounted as allocated", so that writing such
> > > a block is guaranteed to have space available?
> >
> > If you’re using a single magic block number for “allocated,
> > but uninitialized”, how are you supposed to know where to write
>
> You don't have to know where to write. Strictly speaking you only need
> to know that the space is reserved. We do have enough magic block
> numbers that could serve as a marker. E.g. anything up to the end of
> the first super block isn't allocatable. And there are enough spare
> fields in the fs, cg and csum structs.
>
> However, we should look at how FreeBSD has implemented posix_fallocate
> and see if we can incorporate that code. Or at least come up with a
> compatable implementation so that there is a possibility of moving disks
> between these systems.
>
> --chris




Re: Proposal: validate FFS root inode during the mount.

2019-11-20 Thread Maciej
Apologizes for bit vague description, I will try to describe it better
We touch a couple of different topics so I will address then in order.

1. My biggest concern, and reason for putting this validation in place,
is the case where we allow mount the Filesystem that later cannot be umounted 
from Userland.

Both umount(2) and umount(8) (with or without -R) will fail as they relay on 
kernel vfs_lookup in the path resolution step.
When root inode does not have i_mode set we will end up with ENOENT due to the 
failure inside ffs_loadvnode
```
ffs_loadvnode()

if (ip->i_mode == 0) {
ffs_deinit_vnode(ump, vp);

return ENOENT;
}
```
All lookup operations end up in vcache -> VFS_LOADVNODE, so is clear that we 
won't be able to resolve the given path without loading inode from disk.

>From a user perspective, this can lead to the situations where after 
>corruption user cannot umount FS and the only way is to power off system in 
>order to umount it and repair,
which IMO is bad/unexpected behaviour.
Although i_mode should not be zero, the strange value inside i_mode can lead to 
different execution paths, because of that I want to check if the i_mode is a 
directory.

Hope people will share a similar opinion about this part.


2. other validation for root inode,

` !S_ISDIR(dp1->di_mode) || !dp1->di_size || !dp1->di_blocks || !dp1->di_nlink) 
`

Couple other cases that I found probably worth to consider for FFS root inode 
and were placed inside the patch,
Note: I DID NOT prove that incorrect values inside them case the panic during 
the mount(2) system call step, but only when another system call is issued.

di_blocks - if is equal to zero lstat operations will cause the panic, that may 
happen due to the tools on the startup. But is only MIGHT so someone may argue 
that this can happen to any other file and we clearly do not want to do a scan 
on every mount.

di_nlink - if equal to the zero and we have di_blocks or di_size non-equal to 
the zero, vput operation will cause panic inside ufs_inactive. That can happen 
due to i.e. stat systemcall, so most likely during the boot proces.

That is my justification about rest of validation which can be considered as 
pragmatic/unnecessary but as I already checked DIR mode on the root inode, 
checking these fields does not cost anything.
Also I did not check all cases and is possible that some strange combination 
can cause panic on the mount.
Another argument about this validation is the fact that is done inside 
ffs_mount which is also called by the reload which can cause panic due to the 
usage of vput.
As I said unfortunately at this moment I don't have particular crash example, 
so I understand people concerns.

3. "Kirk has added a bunch of checksum and other integrity checks to FreeBSD$"

 Thank you for the info I will spend some time and study what is validated 
there.

4. Validation on the mount time: I don't have a strong opinion about the 
preferred behaviour and also do not want to check entire FS on the mount 
because that will make mount process very unuseful.

To summarize:
I consider point 1 (umount issue) as a logical bug fix, and 2 as nice to have 
(or ew possible future fix).

‐‐‐ Original Message ‐‐‐
On Wednesday, November 20, 2019 8:27 PM,  wrote:

> mo...@rodents-montreal.org (Mouse) writes:
>
> > But for the machine's own filesystems? Corruption should panic:
> > mount -o onerror=panic /dev/wd2a /builds
>
> There are 3 popular behaviours and all can be useful.
>
> -   return errors
> -   return errors + reject further writes
> -   panic
>
> The second is Linux default behaviour and some people hate it,
> maybe because the read-only mode is rather silent. If you do
> react to it, it can be more graceful than a panic.
>
> --
> --
> Michael van Elst
> Internet: mlel...@serpens.de
> "A potential Snark may lurk in every tree."
>




Re: Trying to write a kernel module for (un)mounting tmpfs

2020-09-13 Thread Maciej
Hi Bruno,

The reason why it is hard to map VFS operations inside the kernel is due to the 
way how POSIX filesystems are designed.
Most of the operations are made to run in the context of some process where we 
have information about things such mount point,
some file structures or path. This information is later translated to concrete 
kernel structures.

From the other side kernel module code when executed would need to get somehow 
such information like mount point and mount options.
If you take a look on the man-page of mount(2) you will notice that there are 
couple arguments like flags
(Filesystem independent - or at least meant to be independent), and mount data 
which is filesystem-specific.

The first design question that should ask yourself is how this information 
would be provided to the kernel module?
Probably they can be hardcoded in the module; however, if you decide to trigger 
this operation from userspace
(by using any US to kernel interface), you can provide some structure that 
contains more information about current filesystem to make things easier.

A good starting point to understand how things are working would be 
implementation in VFS layer of mount system call:

`do_sys_mount()` inside: kern/vfs_syscalls.c

You can notice that in order to get vnode (structure needed for mount: the 
mount point)
you need to translate file path to find a vnode using `namei` operation

Next thing is mount operation itself, as you realized each filesystem has its 
own `struct vfsops` that contains filesystem operations.
Kernel contains the global list `struct vfs_list_head vfs_list ` build inside 
`kern/vfs_init.c`
this list contains all filesystems availavble on NetBSD with their operations.
To get filesystem specific structure you can use `mount_get_vfsops()`.

When finished with the preparation (probably you will need to prepare arguments 
according to man page mentioned before),
you can call your favourite FS mount operation by `mount_domount()`.

Hope that quick walkthrough will help you understand how mount operation works 
inside the kernel.
Maciej

‐‐‐ Original Message ‐‐‐
On Thursday, September 10, 2020 12:51 PM, Bruno Melo  
wrote:

> Hi,
>
> I'm trying to write a kernel module as example for mounting and unmounting 
> tmpfs. For that, I read the mount_tmpfs code and in line 247 I found the call 
> for mount() syscall here:
> https://nxr.netbsd.org/xref/src/sbin/mount_tmpfs/mount_tmpfs.c#247
>
> Then I looked for how the kernel mounts file systems and found there are 
> _mount() function for every file system:
> https://nxr.netbsd.org/xref/src/sys/sys/mount.h#254
>
> Now, I just needed to know how to use tmpfs_mount() and find it here:
> https://nxr.netbsd.org/xref/src/sys/fs/tmpfs/tmpfs_vfsops.c#86
> but I still don't know how to use it. I created the struct 
> [tmpfs_args](https://nxr.netbsd.org/source/s?defs=tmpfs_args&project=src) 
> [args](https://nxr.netbsd.org/source/s?refs=args&project=src) and size_t 
> args_len = sizeof args variables to pass to data and data_len arguments in 
> tmpfs_mount(). And the directory I'm mounting is "/tmp". So, I have 3 of 4 
> arguments needed to tmpfs_mount(), the last one is the struct mount *mp.
>
> I know the mount() syscall takes 
> [MOUNT_TMPFS](https://nxr.netbsd.org/source/s?defs=MOUNT_TMPFS&project=src) 
> as first argument, but I'm not understanding the magic behind struct mount mp 
> and what I need to assign to the struct mount variable to pass it to that 
> argument. Any hint?
>
> Thanks for the help,
>
> Bruno Melo.