Re: [PATCH] fcntl: Add 32bit filesystem mode

2020-04-20 Thread Andreas Dilger

> From 73471e01733dd1d998ff3cd41edebb4c78793193 Mon Sep 17 00:00:00 2001
> From: Peter Maydell 
> Date: Mon, 20 Apr 2020 11:54:22 +0100
> Subject: [RFC] linux-user: Use new F_SET_FILE_32BIT_FS fcntl for 32-bit guests
> 
> If the guest is 32 bit then there is a potential problem if the
> host gives us back a 64-bit sized value that we can't fit into
> the ABI the guest requires. This is a theoretical issue for many
> syscalls, but a real issue for directory reads where the host
> is using ext3 or ext4. There the 'offset' values retured via
> the getdents syscall are hashes, and on a 64-bit system they
> will always fill the full 64 bits.
> 
> Use the F_SET_FILE_32BIT_FS fcntl to tell the kernel to stick
> to 32-bit sized hashes for fds used by the guest.
> 
> Signed-off-by: Peter Maydell 

Another question I had here is whether the filesystem needs to provide
32-bit values for other syscalls, such as stat() and statfs()?  For
ext4, stat() is not going to return a 64-bit inode number, but other
filesystems might (e.g. Lustre has a mode to do this).  Similarly,
should statfs() scale up f_bsize until it can return a 32-bit f_blocks
value?  We also had to do this ages ago for Lustre when 32-bit clients
couldn't handle > 16TB filesystems, but that is a single disk today.

Should that be added into F_SET_FILE_32BIT_FS also?

Cheers, Andreas

> ---
> RFC patch because it depends on the kernel patch to provide
> F_SET_FILE_32BIT_FS, which is still under discussion. All this
> patch does is call the fcntl for every fd the guest opens.
> 
> linux-user/syscall.c | 27 +++
> 1 file changed, 27 insertions(+)
> 
> diff --git a/linux-user/syscall.c b/linux-user/syscall.c
> index 674f70e70a5..8966d4881bd 100644
> --- a/linux-user/syscall.c
> +++ b/linux-user/syscall.c
> @@ -884,6 +884,28 @@ static inline int host_to_target_sock_type(int host_type)
> return target_type;
> }
> 
> +/*
> + * If the guest is using a 32 bit ABI then we should try to ask the kernel
> + * to provide 32-bit offsets in getdents syscalls, as otherwise some
> + * filesystems will return 64-bit hash values which we can't fit into
> + * the field sizes the guest ABI mandates.
> + */
> +#ifndef F_SET_FILE_32BIT_FS
> +#define F_SET_FILE_32BIT_FS (1024 + 15)
> +#endif
> +
> +static inline void request_32bit_fs(int fd)
> +{
> +#if HOST_LONG_BITS > TARGET_ABI_BITS
> +/*
> + * Ignore errors, which are likely due to the host kernel being too
> + * old to support this fcntl. We'll try anyway, which might or might
> + * not work, depending on the guest code and on the host filesystem.
> + */
> +fcntl(fd, F_SET_FILE_32BIT_FS);
> +#endif
> +}
> +
> static abi_ulong target_brk;
> static abi_ulong target_original_brk;
> static abi_ulong brk_page;
> @@ -7704,6 +7726,7 @@ static abi_long do_syscall1(void *cpu_env, int
> num, abi_long arg1,
>   target_to_host_bitmask(arg2,
> fcntl_flags_tbl),
>   arg3));
> fd_trans_unregister(ret);
> +request_32bit_fs(ret);
> unlock_user(p, arg1, 0);
> return ret;
> #endif
> @@ -7714,6 +7737,7 @@ static abi_long do_syscall1(void *cpu_env, int
> num, abi_long arg1,
>   target_to_host_bitmask(arg3,
> fcntl_flags_tbl),
>   arg4));
> fd_trans_unregister(ret);
> +request_32bit_fs(ret);
> unlock_user(p, arg2, 0);
> return ret;
> #if defined(TARGET_NR_name_to_handle_at) && defined(CONFIG_OPEN_BY_HANDLE)
> @@ -7725,6 +7749,7 @@ static abi_long do_syscall1(void *cpu_env, int
> num, abi_long arg1,
> case TARGET_NR_open_by_handle_at:
> ret = do_open_by_handle_at(arg1, arg2, arg3);
> fd_trans_unregister(ret);
> +request_32bit_fs(ret);
> return ret;
> #endif
> case TARGET_NR_close:
> @@ -7769,6 +7794,7 @@ static abi_long do_syscall1(void *cpu_env, int
> num, abi_long arg1,
> return -TARGET_EFAULT;
> ret = get_errno(creat(p, arg2));
> fd_trans_unregister(ret);
> +request_32bit_fs(ret);
> unlock_user(p, arg1, 0);
> return ret;
> #endif
> @@ -12393,6 +12419,7 @@ static abi_long do_syscall1(void *cpu_env, int
> num, abi_long arg1,
> }
> ret = get_errno(memfd_create(p, arg2));
> fd_trans_unregister(ret);
> +request_32bit_fs(ret);
> unlock_user(p, arg1, 0);
> return ret;
> #endif
> --
> 2.20.1


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP


Re: [PATCH] ext4: Give 32bit personalities 32bit hashes

2020-03-17 Thread Andreas Dilger
On Mar 17, 2020, at 5:31 AM, Linus Walleij  wrote:
> 
> It was brought to my attention that this bug from 2018 was
> still unresolved: 32 bit emulators like QEMU were given
> 64 bit hashes when running 32 bit emulation on 64 bit systems.
> 
> The personality(2) system call supports to let processes
> indicate that they are 32 bit Linux to the kernel. This
> was suggested by Teo in the original thread, so I just wired
> it up and it solves the problem.
> 
> Programs that need the 32 bit hash only need to issue the
> personality(PER_LINUX32) call and things start working.

I'm generally with with this from the ext4 point of view.

That said, I'd think it would be preferable for ease of use and
compatibility that applications didn't have to be modified
(e.g. have QEMU or glibc internally set PER_LINUX32 for this
process before the 32-bit syscall is called, given that it knows
whether it is emulating a 32-bit runtime or not).

The other way to handle this would be for ARM32 to check the
PER_LINUX32 flag via is_compat_task() so that there wouldn't
need to be any changes to the ext4 code at all?

Cheers, Andreas


> I made a test program like this:
> 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> 
>  int main(int argc, char** argv) {
>DIR* dir;
>personality(PER_LINUX32);
>dir = opendir("/boot");
>printf("dir=%p\n", dir);
>printf("readdir(dir)=%p\n", readdir(dir));
>printf("errno=%d: %s\n", errno, strerror(errno));
>return 0;
>  }
> 
> This was compiled with an ARM32 toolchain from Bootlin using
> glibc 2.28 and thus suffering from the bug.
> 
> Before the patch:
> 
>  $ ./readdir-bug
>  dir=0x86000
>  readdir(dir)=(nil)
>  errno=75: Value too large for defined data type
> 
> After the patch:
> 
>  $ ./readdir-bug
>  dir=0x86000
>  readdir(dir)=0x86020
>  errno=0: Success
> 
> Problem solved.
> 
> Cc: Florian Weimer 
> Cc: Peter Maydell 
> Cc: Andy Lutomirski 
> Cc: sta...@vger.kernel.org
> Suggested-by: Theodore Ts'o 
> Link: https://bugs.launchpad.net/qemu/+bug/1805913
> Link: https://lore.kernel.org/lkml/87bm56vqg4@mid.deneb.enyo.de/
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205957
> Signed-off-by: Linus Walleij 
> ---
> fs/ext4/dir.c | 9 +
> 1 file changed, 9 insertions(+)
> 
> diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> index 9aa1f75409b0..3faf9edf3e92 100644
> --- a/fs/ext4/dir.c
> +++ b/fs/ext4/dir.c
> @@ -27,6 +27,7 @@
> #include 
> #include 
> #include 
> +#include 
> #include "ext4.h"
> #include "xattr.h"
> 
> @@ -618,6 +619,14 @@ static int ext4_dx_readdir(struct file *file, struct 
> dir_context *ctx)
> 
> static int ext4_dir_open(struct inode * inode, struct file * filp)
> {
> + /*
> +  * If we are currently running e.g. a 32 bit emulator on
> +  * a 64 bit machine, the emulator will indicate that it needs
> +  * a 32 bit personality and thus 32 bit hashes from the file
> +  * system.
> +  */
> + if (personality(current->personality) == PER_LINUX32)
> + filp->f_mode |= FMODE_32BITHASH;
>   if (IS_ENCRYPTED(inode))
>   return fscrypt_get_encryption_info(inode) ? -EACCES : 0;
>   return 0;
> --
> 2.24.1
> 


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP


Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-28 Thread Andreas Dilger
On Dec 28, 2018, at 4:18 AM, Peter Maydell  wrote:
> 
> On Fri, 28 Dec 2018 at 00:23, Andreas Dilger  wrote:
>> On Dec 27, 2018, at 10:41 AM, Peter Maydell  wrote:
>>> As you note, this causes breakage for userspace programs which
>>> need to implement an API/ABI with 32-bit offset but which only
>>> have access to the kernel's 64-bit offset API/ABI.
>> 
>> This is (IMHO) a bit of an oxymoron, isn't it?  Applications using
>> the 64-bit API, but storing the value in a 32-bit field?
> 
> I didn't say "which choose to store the value in a 32-bit field",
> I said "which have to implement an API/ABI which has 32-bit fields".
> In QEMU's case, we use the host kernel's ABI, which has 64-bit
> offset fields. We implement a syscall ABI for the guest binary
> we are running under emulation, which may have 32-bit offset fields
> (for instance if we are running a 32-bit Arm binary.) Both of
> these ABIs are fixed -- QEMU doesn't have a choice here, it
> just has to make the best effort it can with what the host kernel
> provides it, to provide the semantics the guest binary needs.
> My suggestion in this thread is that the host kernel provides
> a wider range of facilities so that QEMU can do the job it's
> trying to do.
> 
>> The same
>> problem would exist for filesystems with 64-bit inodes or 64-bit
>> file offsets trying to store these values in 32-bit variables.
>> It might work most of the time, but it can also break randomly.
> 
> In general inodes and offsets start from 0 and work up --
> so almost all of the time they don't actually overflow.
> The problem with ext4 directory hash "offsets" is that they
> overflow all the time and immediately, so instead of "works
> unless you have a weird edge case" like all the other filesystems,
> it's "never works".
> 
>>> I think the best fix for this would be for the kernel to either
>>> (a) consistently use a 32-bit hash or (b) to provide an API
>>> so that userspace can use the FMODE_32BITHASH flag the way
>>> that kernel-internal users already can.
>> 
>> It would be relatively straight forward to add a "32bitapi" mount
>> option to return a 32-bit directory hash to userspace for operations
>> on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
>> However, I can't think of an easy way to do this on a per-process
>> basis without just having it call the 32-bit API directly.
> 
> The problem is that there is no 32-bit API in some cases
> (unless I have misunderstood the kernel code) -- not all
> host architectures implement compat syscalls or allow them
> to be called from 64-bit processes or implement all the older
> syscall variants that had smaller offets. If there was a guaranteed
> "this syscall always exists and always gives me 32-bit offsets"
> we could use it.

The "32bitapi" mount option would use 32-bit hash for seekdir
and telldir, regardless of what kernel API was used.  That would
just set the FMODE_32BITHASH flag in the file->f_mode for all files.

Using 32-bit directory hash values is not necessarily harmful, but
it returns the possibility to hit the problem with hash collisions
that previously existed before the move to 64-bit hash values.
This becomes more of a problem as directory sizes increase.

>> For ext4 at least, you could just shift the high 32-bit part of
>> the 64-bit hash down into a 32-bit value in telldir(), and
>> shift it back up when seekdir() is called.
> 
> Yes, that has been suggested, but it seemed a bit dubious
> to bake in knowledge of ext4's internal implementation details.
> Can we rely on this as an ABI promise that will always work
> for all versions of all file systems going forwards?

Well, the directory cookies need to be relatively stable over
time because they are exported to applications and possibly
remote nodes via NFS, so it can't be changed very much.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP


Re: [Qemu-devel] d_off field in struct dirent and 32-on-64 emulation

2018-12-27 Thread Andreas Dilger
On Dec 27, 2018, at 10:41 AM, Peter Maydell  wrote:
> 
> On Thu, 27 Dec 2018 at 17:19, Florian Weimer  wrote:
>> We have a bit of an interesting problem with respect to the d_off
>> field in struct dirent.
>> 
>> When running a 64-bit kernel on certain file systems, notably ext4,
>> this field uses the full 63 bits even for small directories (strace -v
>> output, wrapped here for readability):
>> 
>> getdents(3, [
>>  {d_ino=1494304, d_off=3901177228673045825, d_reclen=40, 
>> d_name="authorized_keys", d_type=DT_REG},
>>  {d_ino=1494277, d_off=7491915799041650922, d_reclen=24, d_name=".", 
>> d_type=DT_DIR},
>>  {d_ino=1314655, d_off=9223372036854775807, d_reclen=24, d_name="..", 
>> d_type=DT_DIR}
>> ], 32768) = 88
>> 
>> When running in 32-bit compat mode, this value is somehow truncated to
>> 31 bits, for both the getdents and the getdents64 (!) system call (at
>> least on i386).
> 
> Yes -- look for hash2pos() and friends in fs/ext4/dir.c.
> The ext4 code in the kernel uses a 32 bit hash if (a) the kernel
> is 32 bit (b) this is a compat syscall (b) some other bit of
> the kernel asked it to via the FMODE_32BITHASH flag (currently only
> NFS does that I think).
> 
> As you note, this causes breakage for userspace programs which
> need to implement an API/ABI with 32-bit offset but which only
> have access to the kernel's 64-bit offset API/ABI.

This is (IMHO) a bit of an oxymoron, isn't it?  Applications using
the 64-bit API, but storing the value in a 32-bit field?  The same
problem would exist for filesystems with 64-bit inodes or 64-bit
file offsets trying to store these values in 32-bit variables.
It might work most of the time, but it can also break randomly.

> I think the best fix for this would be for the kernel to either
> (a) consistently use a 32-bit hash or (b) to provide an API
> so that userspace can use the FMODE_32BITHASH flag the way
> that kernel-internal users already can.

It would be relatively straight forward to add a "32bitapi" mount
option to return a 32-bit directory hash to userspace for operations
on that mountpoint (ext4 doesn't have 64-bit inode numbers yet).
However, I can't think of an easy way to do this on a per-process
basis without just having it call the 32-bit API directly.

> I couldn't think of or find any existing way for userspace
> to get the right results here, which is why
> 32-bit-guest-on-64-bit-host QEMU doesn't work on these filesystems
> (depending on what exactly the guest's libc etc do).
> 
>> the 32-bit getdents system call emulation in a 64-bit qemu-user
>> process would just silently truncate the d_off field as part of
>> the translation, not reporting an error.
>> [...]
>> This truncation has always been a bug; it breaks telldir/seekdir
>> at least in some cases.
> 
> Yes; you can't fit a quart into a pint pot, so if the guest
> only handles 32-bit offsets then truncation is about all we
> can do. This works fine if offsets are offsets, assuming the
> directory isn't so enormous it would have broken the guest
> anyway. I'm not aware of any issues with this other than the
> oddball ext4 offsets-are-hashes situation -- could you expand
> on the telldir/seekdir issue? (I suppose we should probably
> make QEMU's syscall emulation layer return "no more entries"
> rather than entries with truncated hashes.)

For ext4 at least, you could just shift the high 32-bit part of
the 64-bit hash down into a 32-bit value in telldir(), and
shift it back up when seekdir() is called.

Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP