from:"Theodore Y. Ts'o"

Re: [DISCUSSION PATCH 00/41] random: possible ways towards NIST SP800-90B compliance

2020-10-02 Thread Theodore Y. Ts'o

On Fri, Oct 02, 2020 at 03:39:35PM +, Van Leeuwen, Pascal wrote:
> > Then your company can not contribute in Linux kernel development, as
> > this is obviously not allowed by such a footer.
> >
> Interesting, this has never been raised as a problem until today ...
> Going back through my mail archive, it looks like they started
> automatically adding that some 3 months ago. Not that they informed
> anyone about that, it just silently happened.

So use a private e-mail address (e.g., at fastmail.fm if you don't
want to run your mail server) and then tunnel out SMTP requests using
ssh.  It's not hard.  :-)

I've worked at multiple $BIG_COMPANY's, and I've been doing this for
decades.  It's also helpful when I need to send e-mails from
conference networks from my laptop.

- Ted

Re: [PATCH] ext4: flag as supporting buffered async reads

2020-10-02 Thread Theodore Y. Ts'o

On Mon, Aug 03, 2020 at 05:02:11PM -0600, Jens Axboe wrote:
> ext4 uses generic_file_read_iter(), which already supports this.
> 
> Cc: Theodore Ts'o 
> Signed-off-by: Jens Axboe 

Applied, thanks.   (And apologies for the delay.)

- Ted

Re: [PATCHv2 1/3] ext4: Refactor ext4_overwrite_io() to take ext4_map_blocks as argument

2020-10-02 Thread Theodore Y. Ts'o

On Sat, Aug 22, 2020 at 05:04:35PM +0530, Ritesh Harjani wrote:
> Refactor ext4_overwrite_io() to take struct ext4_map_blocks
> as its function argument with m_lblk and m_len filled
> from caller
> 
> There should be no functionality change in this patch.
> 
> Signed-off-by: Ritesh Harjani 
> ---
>  fs/ext4/file.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2a01e31a032c..84f73ed91af2 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -188,26 +188,22 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
>  }
>  
>  /* Is IO overwriting allocated and initialized blocks? */
> -static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
> +static bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks *map)
>  {
> - struct ext4_map_blocks map;
>   unsigned int blkbits = inode->i_blkbits;
> - int err, blklen;
> + loff_t end = (map->m_lblk + map->m_len) << blkbits;

As Dan Carpenter has pointed out, we need to cast map->m_lblk to
loff_t, since m_lblk is 32 bits, and when this gets shifted left by
blkbits, we could end up losing bits.
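
The truncation can be reproduced outside the kernel.  What follows is
a minimal sketch, not the ext4 code: the typedefs lblk32_t and off64_t
and both helper names are hypothetical stand-ins for the 32-bit
m_lblk/m_len fields and for loff_t.

```c
#include <stdint.h>

/* Hypothetical stand-ins: m_lblk and m_len are 32 bits wide in the
 * patch under review; loff_t is a signed 64-bit byte offset. */
typedef uint32_t lblk32_t;
typedef int64_t  off64_t;

/* The addition and the shift are both done in 32 bits, so any bits
 * shifted past bit 31 are gone before the result is widened. */
static off64_t end_without_cast(lblk32_t lblk, uint32_t len, unsigned blkbits)
{
    return (lblk + len) << blkbits;
}

/* Widening first makes the addition and the shift 64-bit operations. */
static off64_t end_with_cast(lblk32_t lblk, uint32_t len, unsigned blkbits)
{
    return ((off64_t)lblk + len) << blkbits;
}
```

With 4k blocks (blkbits = 12), logical block 1048576 plus a length of
one block ends at byte 4294971392, which needs 33 bits; the uncast
version wraps around to 4096.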

> - if (pos + len > i_size_read(inode))
> + if (end > i_size_read(inode))
>   return false;

This transformation is not functionally identical.

The problem is that pos is not necessarily a multiple of the file
system blocksize.  From below:

> + map.m_lblk = offset >> inode->i_blkbits;
> + map.m_len = EXT4_MAX_BLOCKS(count, offset, inode->i_blkbits);

So what previously was the starting offset of the overwrite, is now
offset shifted right by blkbits, and then shifted left back by blkbits.

So unless I'm missing something, this looks not quite right?
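
To make the concern concrete, here is a small sketch comparing the
byte-exact end (pos + len) against an end rebuilt from block numbers.
It assumes the block count is obtained by rounding the end of the
range up to a block boundary, which is one reading of
EXT4_MAX_BLOCKS(); the helper names are hypothetical and this is not
the ext4 code.

```c
#include <stdint.h>

/* Old check: byte-exact end of the write. */
static int64_t end_exact(int64_t pos, int64_t len)
{
    return pos + len;
}

/* What the start becomes after offset >> blkbits followed by << blkbits. */
static int64_t start_round_trip(int64_t pos, unsigned blkbits)
{
    return (pos >> blkbits) << blkbits;
}

/* New check: end rebuilt from (first block + block count).  The block
 * count is assumed to cover the range rounded up to a full block. */
static int64_t end_from_blocks(int64_t pos, int64_t len, unsigned blkbits)
{
    int64_t blksz = (int64_t)1 << blkbits;
    int64_t first = pos >> blkbits;
    int64_t past_last = (pos + len + blksz - 1) >> blkbits;

    return (first + (past_last - first)) << blkbits;
}

/* The test applied against i_size in both versions of the function. */
static int within_i_size(int64_t end, int64_t i_size)
{
    return end <= i_size;
}
```

For a 5000-byte file with 4k blocks, a 100-byte write at offset 4096
ends at 4196 and stays within i_size, but the block-based end is 8192
and fails the same test, so the two checks can disagree.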

- Ted

Re: [PATCH] [v2] ext4: Fix error handling code in add_new_gdb

2020-10-02 Thread Theodore Y. Ts'o

On Sat, Aug 29, 2020 at 10:54:02AM +0800, Dinghao Liu wrote:
> When ext4_journal_get_write_access() fails, we should
> terminate the execution flow and release n_group_desc,
> iloc.bh, dind and gdb_bh.
> 
> Signed-off-by: Dinghao Liu 

Thanks, applied.

- Ted

Re: [PATCHv3 1/1] ext4: Optimize file overwrites

2020-10-02 Thread Theodore Y. Ts'o

On Fri, Sep 18, 2020 at 10:36:35AM +0530, Ritesh Harjani wrote:
> In case if the file already has underlying blocks/extents allocated
> then we don't need to start a journal txn and can directly return
> the underlying mapping. Currently ext4_iomap_begin() is used by
> both DAX & DIO path. We can check if the write request is an
> overwrite & then directly return the mapping information.
> 
> This could give a significant perf boost for multi-threaded writes
> specially random overwrites.
> On PPC64 VM with simulated pmem(DAX) device, ~10x perf improvement
> could be seen in random writes (overwrite). Also bcoz this optimizes
> away the spinlock contention during jbd2 slab cache allocation
> (jbd2_journal_handle). On x86 VM, ~2x perf improvement was observed.
> 
> Reported-by: Dan Williams 
> Suggested-by: Jan Kara 
> Signed-off-by: Ritesh Harjani 

Thanks, applied.

- Ted

Re: [PATCH] ext4: fix leaking sysfs kobject after failed mount

2020-10-02 Thread Theodore Y. Ts'o

On Thu, Sep 24, 2020 at 11:08:59AM +0200, Jan Kara wrote:
> On Tue 22-09-20 09:24:56, Eric Biggers wrote:
> > From: Eric Biggers 
> > 
> > ext4_unregister_sysfs() only deletes the kobject.  The reference to it
> > needs to be put separately, like ext4_put_super() does.
> > 
> > This addresses the syzbot report
> > "memory leak in kobject_set_name_vargs (3)"
> > (https://syzkaller.appspot.com/bug?extid=9f864abad79fae7c17e1).
> > 
> > Reported-by: syzbot+9f864abad79fae7c1...@syzkaller.appspotmail.com
> > Fixes: 72ba74508b28 ("ext4: release sysfs kobject when failing to enable quotas on mount")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Eric Biggers 
> 
> Looks good. You can add:
> 
> Reviewed-by: Jan Kara 

Thanks, applied.

- Ted

Re: [PATCH] FIX the comment of struct jbd2_journal_handle

2020-10-02 Thread Theodore Y. Ts'o

On Wed, Sep 23, 2020 at 01:12:31AM +0800, Hui Su wrote:
> the struct name was modified long ago, but the comment still
> use struct handle_s.
> 
> Signed-off-by: Hui Su 

Thanks, applied.  I updated the commit summary to be:

jbd2: fix the comment of struct jbd2_journal_handle

- Ted

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-07 Thread Theodore Y. Ts'o

On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote:
> 
> That sounds like a conversation that would have been a lot more
> interesting and enjoyable if it hadn't started with "can we shoot it in
> the head", and continued with the notion that anything other than
> e2fsprogs making something to be mounted by mount(2) and handled by
> fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> "how do we make it go away so that only e2fsprogs and the kernel ever
> touch ext4". I started this thread because I'd written some userspace
> code, a new version of the kernel made that userspace code stop working,
> so I wanted to report that the moment I'd discovered that, along with a
> potential way to address it with as little disruption to ext4 as
> possible.

What is really getting my dander up is your attempt to claim that the
on-disk file system format is like the userspace/kernel interface,
where if we break any file system that was "previously accepted by an
older kernel", this is a bug that must be reverted or otherwise fixed
to allow file systems that had previously worked to continue to work.
And this is true even if the file system is ***invalid***.

And the problem with this is that there have been any number of
commits where a file system which was invalid, but which was
previously accepted and could be made to trigger a syzbot whine, was
dealt with by tightening up the validity tests in the kernel.  In some
cases, I also had to fix up e2fsck to detect the invalid file system which
was generated by the file system fuzzer.  Yes, it's unfortunate that
we didn't have these checks earlier, but a file system has a huge
amount of state.

The principle you've articulated would make it impossible for me to
fix these bugs, unless I can prove that the failure to check a
particular invalid file system corruption could lead to a security
vulnerability.  (Would it be OK for me to make the kernel more strict
and reject an invalid file system if it triggers a WARN_ON, so I get
the syzbot complaint, but it doesn't actually cause a security issue?)

So this conversation would have been a lot more pleasant for *me* if
you hadn't tried to elevate your request to a general principle, where
if someone is deliberately generating an invalid file system, I'm not
allowed to make the kernel more strict to detect said invalidity and
to reject the invalid / corrupted / fuzzed file system.

And note that sometimes the security problem happens when there are
multiple file system corruptions that are chained together.  So
enabling block validity *can* sometimes prevent the fuzzed file system
from proceeding further.  Granted, this is less likely in the case of
a read-only file system, but it really worries me when there are
proprietary programs (maybe your library isn't proprietary, but I note
you haven't sent me a link to your git repo, but instead have offered
sending sample file systems) which insist on generating their own file
systems, which might or might not be valid, and then expecting them to
receive first class support as part of an iron-bound contract where
I'm not even allowed to add stronger sanity checks which might reject
said invalid file system in the future.

> The short version is that I needed a library to rapidly turn
> dynamically-obtained data into a set of disk blocks to be served
> on-the-fly as a software-defined disk, and then mounted on the other
> side of that interface by the Linux kernel. Turns out that's *many
> orders of magnitude* faster than any kind of network filesystem like
> NFS. It's slightly similar to a vvfat for ext4. The less blocks it can
> generate and account for and cache, the faster it can run, and
> microseconds matter.

So are you actually trying to dedup data blocks, or are you just
trying to avoid needing to track the block allocation bitmaps?  And
are you just writing a single file, or multiple files?  Do you know
what the maximum size of the file or files will be?  Do you need a
complex directory structure, or just a single root directory?  Can the
file system be sparse?

So for example, you can do something like this, which puts all of the
metadata at beginning of the file system, and then you could write to
contiguous data blocks.  Add the following in mke2fs.conf:

[fs_types]
    hugefile = {
        features = extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2
        cluster_size = 32768
        hash_alg = half_md4
        reserved_ratio = 0.0
        num_backup_sb = 0
        packed_meta_blocks = 1
        make_hugefiles = 1
        inode_ratio = 4194304
        hugefiles_dir = /storage
        hugefiles_name = huge-file
        hugefiles_digits = 0
        hugefiles_size = 0
        hugefiles_align = 256M
        hugefiles_align_disk = true
        num_hugefiles = 1
        zero_hugefiles = false
        inode_size = 128
    }

   hugefiles = {
features =

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-09 Thread Theodore Y. Ts'o

On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote:
> 
> I wasn't trying to make a *new* general principle or policy. I was under
> the impression that this *was* the policy, because it never occurred to
> me that it could be otherwise. It seemed like a natural aspect of the
> kernel/userspace boundary, to the point that the idea of it *not* being
> part of the kernel's stability guarantees didn't cross my mind. 

From our perspective (and Darrick and I discussed this on this week's
ext4 video conference, so it represents the ext4 and xfs maintainers'
position), the file system format is different.  First, the
on-disk format is not an ABI, and it is several orders of magnitude
more complex than a system call interface.  Second, we make no guarantees about
what the file system created by malicious tools will do.  For example,
XFS developers reject bug reports from file system fuzzers, because
the v5 format has CRC checks, so randomly corrupted file systems won't
crash the kernel.  Yes, this doesn't protect against maliciously
created file systems where the attacker makes sure the checksums are
valid, but only crazy people who think containers are just as secure
as VM's and that unprivileged users should be allowed to make the
kernel mount potentially maliciously created file systems would be
exposing the kernel to such maliciously created images.
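
The distinction between random corruption and a maliciously crafted
image can be illustrated with a toy checksummed block.  This is a
sketch only: struct meta_block, seal() and verify() are hypothetical
and do not reflect the XFS v5 on-disk layout; the checksum is a plain
bitwise CRC-32C.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical metadata block: payload followed by its checksum. */
struct meta_block {
    uint8_t  payload[32];
    uint32_t csum;
};

/* Whoever writes the block (honest or not) can compute a valid checksum. */
static void seal(struct meta_block *b)
{
    b->csum = crc32c(b->payload, sizeof(b->payload));
}

/* The reader can only tell whether payload and checksum agree. */
static int verify(const struct meta_block *b)
{
    return b->csum == crc32c(b->payload, sizeof(b->payload));
}
```

A fuzzer that flips a payload bit without calling seal() again is
caught by verify(); an attacker who edits the payload and then
re-seals the block is not, which is the case the checksum does not
protect against.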

> Finally, I think there's also some clarification needed in the role of
> what some of the incompat and ro_compat flags mean. For instance,
> "RO_COMPAT_READONLY" is documented as:
> > - Read-only filesystem image; the kernel will not mount this image
> >   read-write and most tools will refuse to write to the image.
> Is it reasonable to interpret this as "this can never, ever become
> writable", such that no kernel should ever "understand" that flag in
> ro_compat?

Yes.  However...

> I'd assumed so, but this discussion is definitely leading me
> to want to confirm any such assumptions. Is this a flag that e2fsck
> could potentially use to determine that it shouldn't check
> read-write-specific data structures, or should that be a different flag?

Just because it won't be modifiable, shouldn't mean that e2fsck won't
check to make sure that such structures are valid.  "Won't be changed"
and "valid" are two different concepts.  And certainly, today we *do*
check to make sure the bitmaps are valid and don't overlap, and we
can't go back in time to change that.
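
As an illustration of what "valid and don't overlap" means, here is a
deliberately simplified sketch.  struct group_meta and both functions
are hypothetical; the real checks in e2fsck and in the kernel work on
the actual group descriptors and cover more cases than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of where one block group keeps its
 * metadata; all values are block numbers. */
struct group_meta {
    uint64_t block_bitmap;
    uint64_t inode_bitmap;
    uint64_t inode_table;      /* first block of the inode table */
    uint64_t inode_table_len;  /* length of the inode table in blocks */
};

static bool in_range(uint64_t blk, uint64_t start, uint64_t len)
{
    return blk >= start && blk - start < len;
}

/* True if the metadata locations inside a single group do not collide. */
static bool group_meta_disjoint(const struct group_meta *g)
{
    if (g->block_bitmap == g->inode_bitmap)
        return false;
    if (in_range(g->block_bitmap, g->inode_table, g->inode_table_len))
        return false;
    if (in_range(g->inode_bitmap, g->inode_table, g->inode_table_len))
        return false;
    return true;
}

/* True if two different groups point at the same bitmap block. */
static bool groups_share_bitmap(const struct group_meta *a,
                                const struct group_meta *b)
{
    return a->block_bitmap == b->block_bitmap ||
           a->inode_bitmap == b->inode_bitmap;
}
```

An image that never changes can still fail either test, which is the
difference between "won't be changed" and "valid".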

That being said, on the ext4 weekly video chat, we did discuss other
uses of an incompat feature flag that would allow the allocation
bitmap blocks and inode table block fields in the superblock to be
zero, which would mean that they are unallocated.  This would allow us
to dynamically grow the inode table by adding an extra block group
descriptor.  In fact, I'd probably use this as an opportunity to make
some other changes, such as using inodes to store locations of the block
group descriptors, inode tables, and allocation bitmaps at the same
time.  Those details can be discussed later, but the point is that
this is why it's good to discuss format changes from a requirements
perspective, so that if we do need to make an incompat change, we can
kill multiple birds with a single stone.

> It's an arbitrary filesystem hierarchy, including directories, files of
> various sizes (hence using inline_data), and permissions. The problem
> isn't to get data from point A to point B; the problem is (in part) to
> turn a representation of a filesystem into an actual mounted filesystem
> as efficiently as possible, live-serving individual blocks on demand
> rather than generating the whole image in advance.

Ah, so you want to be able to let the other side "look at" the file
system in parallel with it being generated on demand?  The cache
coherency problems would seem to be... huge.  For example, how can you
add a file to directory after the reader has looked at the directory
inode and directory blocks?  Or for that matter, looked at a portion
of the inode table block?  Or are you using 4k inodes so there is only
one inode per block?  What about the fact that we sometimes do
readahead of inode table blocks?

I can think of all sorts of implementation level changes in terms of
caching, readahead behavior, etc., that we might make in the future
that might break you if you are doing something quite as outré as that.
Again, the fact that you're being cagey about what you are doing, and
potentially complaining about changes we might make that would break
you, is ***really*** scaring me now.

Can you go into more details here?  I'm sorry if you're working for
some startup who might want to patent these ideas, but if you want to
guarantee future compatibility, I'm really going to have to insist.

   - Ted

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-16 Thread Theodore Y. Ts'o

On Wed, Sep 16, 2020 at 07:09:41AM +0800, Ming Lei wrote:
> > The problem is it's a bit tricky to revert 568f27006577, since there
> > is a merge conflict in blk_kick_flush().  I attempted to do the bisect
> > manually here, but it's clearly not right since the kernel is not
> > booting after the revert:
> > 
> > https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870
> > 
> > branch:
> > 
> > https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577
> > 
> > Can you send me a patch which correctly reverts 568f27006577?  I can
> > try it against -rc1 .. -rc4, whichever is most convenient.
> 
> Please test the following revert patch against -rc4.

Unfortunately the results of the revert are... weird.

With -rc4, *all* of the VM's are failing --- reliably.  With rc4 with
the revert, *some* of the VM's are able to complete the tests, but
over half are still locking up or failing with some kind of oops.  So
that seems to imply that there is some kind of timing issue going on,
or maybe there are multiple bugs in play?

So let's review the bidding.   We're going to review four commits:

7bf137298cb7: (Parent of 568f27006577)  Completely clean, all VM's
    complete the tests

568f27006577: Fails reliably.  In 9 of the 11 VM's there is nothing on
    the console; the I/O is just stopped.  If I've been able to
    get to the VM before it gets killed from the timeout, ssh
    works, but any attempt to do any I/O will hang, which presumably
    explains why the tests are hanging.  In the other two VM's
    there are hung task timeouts, with stack traces that look
    like this...

v5.9-rc4: More than half of the VM's are failing --- but at least some
    are succeeding, which is more than can be said for 568f27006577.
    There is a *variety* of different sorts of failures.  So the fact
    that we're not seeing the silent hangs in -rc4 is... interesting.

v5.9-rc4 with the revert of 568f27006577: we're seeing a similar
    number of VM failures, but the failure signature is different.
    The most common failure is...

(More details below, with the stack traces.)

I really don't know what to make of this.  It looks like there's
something going on in the block layer, based on the fact that
568f27006577 fails reliably, but its predecessor is completely clean.
But then things have changed significantly by the time we get to -rc4.
I'll do a more in-depth analysis of -rc1 to see if the failure
patterns are more similar to 568f27006577 than -rc4.  But hopefully
you can see something that I'm missing?

Thanks,

- Ted

---

7bf137298cb7: (Parent of 568f27006577)  Completely clean, all VM's
    complete the tests

---

568f27006577: Fails reliably.  In 9 of the 11 VM's there is nothing on
    the console; the I/O is just stopped.  If I've been able to
    get to the VM before it gets killed from the timeout, ssh
    works, but any attempt to do any I/O will hang, which presumably
    explains why the tests are hanging.  In the other two VM's
    there are hung task timeouts, with stack traces that look
    like this:

[14375.634282] INFO: task jbd2/sda1-8:116 blocked for more than 122 seconds.
[14375.641679]  Not tainted 5.8.0-rc2-xfstests-30545-g568f27006577 #6
[14375.648517] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[14375.656523] jbd2/sda1-8 D0   116  2 0x4000
[14375.656530] Call Trace:
[14375.656548]  __schedule+0x2cc/0x6e0
[14375.656695]  ? sched_clock_cpu+0xc/0xb0
[14375.656699]  schedule+0x55/0xd0
[14375.656702]  io_schedule+0x12/0x40
[14375.656708]  blk_mq_get_tag+0x11e/0x280
[14375.656715]  ? __wake_up_common_lock+0xc0/0xc0
[14375.656719]  __blk_mq_alloc_request+0xb6/0x100
[14375.656722]  blk_mq_submit_bio+0x13f/0x7d0
[14375.656727]  ? blk_queue_enter+0x15c/0x510
[14375.656731]  submit_bio_noacct+0x48d/0x500
[14375.656737]  ? kvm_sched_clock_read+0x14/0x30
[14375.656740]  ? submit_bio+0x42/0x150
[14375.656744]  submit_bio+0x42/0x150
[14375.656748]  ? guard_bio_eod+0x90/0x140
[14375.656754]  submit_bh_wbc+0x16d/0x190
[14375.656761]  jbd2_journal_commit_transaction+0x70d/0x1f23
[14375.656767]  ? kjournald2+0x128/0x3b0
[14375.656771]  kjournald2+0x128/0x3b0
[14375.656777]  ? trace_hardirqs_on+0x1c/0xf0
[14375.656781]  ? __wake_up_common_lock+0xc0/0xc0
[14375.656785]  ? __jbd2_debug+0x50/0x50
[14375.656788]  kthread+0x136/0x150
[14375.656792]  ? __kthread_queue_delayed_work+0x90/0x90
[14375.656796]  ret_from_fork+0x22/0x30

---

v5.9-rc4: More than half of the VM's are failing --- but at least some are 
succeeding,
which

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-17 Thread Theodore Y. Ts'o

On Thu, Sep 17, 2020 at 10:20:51AM +0800, Ming Lei wrote:
> 
> Obviously there is other more serious issue, since 568f27006577 is
> completely reverted in your test, and you still see list corruption
> issue.
> 
> So I'd suggest to find the big issue first. Once it is fixed, maybe
> everything becomes fine.
> ...
> Looks it is more like a memory corruption issue, is there any helpful log
> dumped when running kernel with kasan?

Last night, I ran six VM's using -rc4 with and without KASAN; without
KASAN, half of them hung.  With KASAN enabled, all of the test VM's
ran to completion.

This strongly suggests whatever the problem is, it's timing related.
I'll run a larger set of test runs to see if this pattern is confirmed
today.

> BTW, I have kvm/qemu auto test which runs blktest/xfstest over
> virtio-blk/virtio-scsi/loop/nvme with xfs/ext4 every two days, and
> not see such failure recently, but the kernel config is based
> rhel8's config.

Here are the configs I'm using, with and without KASAN.  (The config
with KASAN enabled is sent as a diff to avoid running into LKML's
e-mail size restrictions.)

 - Ted
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 5.9.0-rc4 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (Debian 10.2.0-7) 10.2.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=100200
CONFIG_LD_VERSION=23500
CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION="-xfstests"
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_HAVE_SCHED_AVG_IRQ=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y

Re: [PATCH] ext4: flag as supporting buffered async reads

2020-08-22 Thread Theodore Y. Ts'o

On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
> >>> Resending this one, as I've been carrying it privately since May. The
> >>> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
> >>> please consider this one for 5.9. Thanks!
> >>
> >> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
> >> missing something?  It's on my queue to send to Linus once I get my
> >> (late) ext4 primary pull request for 5.9.
> > 
> > Right, it went in at the start of the merge window for 5.9. Thanks Ted!
> 
> Didn't see it in the queue that just sent in, is it still queued up?

It wasn't in the queue which I queued up because that was based on
5.8-rc4.  Linus was a bit grumpy (fairly so) because it was late, and
that's totally on me.

He has said that he's going to start ignoring pull requests that
aren't fixes only if this becomes a pattern, so while I can send him
another pull request which will just have that one change, there are
no guarantees he's going to take it at this late date.

Sorry, when you sent me the commit saying that the changes that were
needed were already upstream on August 3rd, I thought that meant that
they were already in Linus's tree.  I should have checked and noticed
that in fact "ext4: flag as supporting buffered async reads"
wasn't compiling against Linus's upstream tree, so I didn't realize
this needed to be handled as a special case during the merge window.

Cheers,

- Ted

REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-14 Thread Theodore Y. Ts'o

On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote:
> Worse, right now, -rc1 and -rc2 is causing random crashes in my
> gce-xfstests framework.  Sometimes it happens before we've run even a
> single xfstests; sometimes it happens after we have successfully
> completed all of the tests, and we're doing a shutdown of the VM under
> test.  Other times it happens in the middle of a test run.  Given that
> I'm seeing this at -rc1, which is before my late ext4 pull request to
> Linus, it's probably not an ext4 related bug.  But it also means that
> I'm partially blind in terms of my kernel testing at the moment.  So I
> can't even tell Linus that I've run lots of tests and I'm 100%
> confident your one-line change is 100% safe.

I was finally able to bisect it down to the commit:

37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

(See below for [1] Bisect log.)

The previous commit allows the tests to run to completion.  With
commit 37f4a24c2469 and later, in all 11 test scenarios (4k blocks, 1k
blocks, ext3 compat, ext4 w/ fscrypt, nojournal mode, data=journal,
bigalloc, etc.) the VM will get stuck.

The symptom is that while running xfstests in a Google Compute Engine
(GCE) VM, the tests just hang.  There are a number of tests where this
is more likely, but it's not unique to a single test.

In most cases, there is no output at all; the test just stops running until the
test framework times out after an hour (tests usually complete in
seconds or at most a few tens of minutes or so in the worst case) and
kills the VM.  In one case, I did get a report like this.  (See below
for [2] stack trace from 37f4a24c2469.)

I attempted to revert the commit in question against -rc1 and -rc4;
that result can be found at the branches manual-revert-of-blk-mq-patch
and manual-revert-of-blk-mq-patch-rc4 at https://github.com/tytso/ext4.

I don't think I got the revert quite right; with the revert, most of
the test VM's successfully complete, but 2 out of the 11 fail, with a
different stack trace.  (See below for [3] stack trace from my
attempted manual revert of 37f4a24c2469).  But it does seem to confirm
that the primary cause of the test VM hangs is commit
37f4a24c2469.

Does this make any sense as to what might be going on?  I hope it does
for you, since I'm pretty confused about what might be going on.

Thanks,

   - Ted

[1] Bisect log

git bisect start
# bad: [9123e3a74ec7b934a4a099e98af6a61c2f80bbf5] Linux 5.9-rc1
git bisect bad 9123e3a74ec7b934a4a099e98af6a61c2f80bbf5
# good: [bcf876870b95592b52519ed4aafcf9d95999bc9c] Linux 5.8
git bisect good bcf876870b95592b52519ed4aafcf9d95999bc9c
# bad: [8186749621ed6b8fc42644c399e8c755a2b6f630] Merge tag 'drm-next-2020-08-06' of git://anongit.freedesktop.org/drm/drm
git bisect bad 8186749621ed6b8fc42644c399e8c755a2b6f630
# bad: [2324d50d051ec0f14a548e78554fb02513d6dcef] Merge tag 'docs-5.9' of git://git.lwn.net/linux
git bisect bad 2324d50d051ec0f14a548e78554fb02513d6dcef
# bad: [92c59e126b21fd212195358a0d296e787e444087] Merge tag 'arm-defconfig-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad 92c59e126b21fd212195358a0d296e787e444087
# bad: [cdc8fcb49905c0b67e355e027cb462ee168ffaa3] Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
git bisect bad cdc8fcb49905c0b67e355e027cb462ee168ffaa3
# good: [ab5c60b79ab6cc50b39bbb21b2f9fb55af900b84] Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good ab5c60b79ab6cc50b39bbb21b2f9fb55af900b84
# bad: [d958e343bdc3de2643ce25225bed082dc222858d] block: blk-timeout: delete duplicated word
git bisect bad d958e343bdc3de2643ce25225bed082dc222858d
# bad: [53042f3cc411adc79811ba3cfbca5d7a42a7b806] ps3vram: stop using ->queuedata
git bisect bad 53042f3cc411adc79811ba3cfbca5d7a42a7b806
# good: [621c1f42945e76015c3a585e7a9fe6e71665eba0] block: move struct block_device to blk_types.h
git bisect good 621c1f42945e76015c3a585e7a9fe6e71665eba0
# good: [36a3df5a4574d5ddf59804fcd0c4e9654c514d9a] blk-mq: put driver tag when this request is completed
git bisect good 36a3df5a4574d5ddf59804fcd0c4e9654c514d9a
# good: [570e9b73b0af2e5381ca5343759779b8c1ed20e3] blk-mq: move blk_mq_get_driver_tag into blk-mq.c
git bisect good 570e9b73b0af2e5381ca5343759779b8c1ed20e3
# bad: [b5fc1e8bedf8ad2c6381e0df6331ad5686aca425] blk-mq: remove pointless call of list_entry_rq() in hctx_show_busy_rq()
git bisect bad b5fc1e8bedf8ad2c6381e0df6331ad5686aca425
# bad: [37f4a24c2469a10a4c16c641671bd766e276cf9f] blk-mq: centralise related handling into blk_mq_get_driver_tag
git bisect bad 37f4a24c2469a10a4c16c641671bd766e276cf9f
# good: [723bf178f158abd1ce6069cb049581b3cb003aab] blk-mq: move blk_mq_put_driver_tag() into blk-mq.c
git bisect good 723bf178f158abd1ce6069cb049581b3cb003aab
# first bad commit: [37f4a24c2469a10a4c16c641671bd766e276cf9f] blk
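
The log above is, in effect, a binary search for the first bad commit.
The following sketch shows the idea on a simplified model: it assumes
a linear history indexed from a known-good commit to a known-bad one
and a test verdict that never flips back to good, neither of which git
itself requires.  first_bad(), is_bad_fn and bad_from() are
hypothetical names, not part of git.

```c
#include <stddef.h>

/* Hypothetical test hook: returns nonzero if the commit at this index
 * fails the test run. */
typedef int (*is_bad_fn)(size_t commit_index, void *ctx);

/* Example verdict function: every commit at or after *ctx is bad. */
static int bad_from(size_t commit_index, void *ctx)
{
    return commit_index >= *(const size_t *)ctx;
}

/* Narrow (good, bad] until the two indexes are adjacent; 'bad' is then
 * the first bad commit.  'steps' counts how many commits were tested. */
static size_t first_bad(size_t good, size_t bad, is_bad_fn is_bad,
                        void *ctx, unsigned *steps)
{
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(mid, ctx))
            bad = mid;
        else
            good = mid;
        if (steps)
            ++*steps;
    }
    return bad;
}
```

Each tested commit halves the remaining range, so the number of test
runs grows with the logarithm of the number of commits between the
good and bad endpoints.  A flaky or timing-dependent failure breaks
the assumption that verdicts are reliable and can steer the search to
the wrong commit.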

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-15 Thread Theodore Y. Ts'o

On Tue, Sep 15, 2020 at 03:33:03PM +0800, Ming Lei wrote:
> Hi Theodore,
> 
> On Tue, Sep 15, 2020 at 12:45:19AM -0400, Theodore Y. Ts'o wrote:
> > On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote:
> > > Worse, right now, -rc1 and -rc2 are causing random crashes in my
> > > gce-xfstests framework.  Sometimes it happens before we've run even a
> > > single xfstests; sometimes it happens after we have successfully
> > > completed all of the tests, and we're doing a shutdown of the VM under
> > > test.  Other times it happens in the middle of a test run.  Given that
> > > I'm seeing this at -rc1, which is before my late ext4 pull request to
> > > Linus, it's probably not an ext4 related bug.  But it also means that
> > > I'm partially blind in terms of my kernel testing at the moment.  So I
> > > can't even tell Linus that I've run lots of tests and I'm 100%
> > > confident your one-line change is 100% safe.
> > 
> > I was finally able to bisect it down to the commit:
> > 
> > 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
> 
> 37f4a24c2469 has been reverted in:
> 
>   4e2f62e566b5 Revert "blk-mq: put driver tag when this request is completed"
> 
> And later the patch is committed as the following after being fixed:
> 
>   568f27006577 blk-mq: centralise related handling into blk_mq_get_driver_tag
> 
> So can you reproduce the issue by running kernel of commit 568f27006577?

Yes.  And things work fine if I try 4e2f62e566b5.

> If yes, can the issue be fixed by reverting 568f27006577?

The problem is it's a bit tricky to revert 568f27006577, since there
is a merge conflict in blk_kick_flush().  I attempted to do the revert
manually here, but it's clearly not right since the kernel is not
booting after the revert:

https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870

branch:

https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577

Can you send me a patch which correctly reverts 568f27006577?  I can
try it against -rc1 .. -rc4, whichever is most convenient.

> Can you share the exact mount command line for setup the environment?
> and the exact xfstest item?

It's a variety of mount command lines, since I'm using gce-xfstests[1][2]
with a variety of file system scenarios --- but the basic one, which
is ext4 using the default 4k block size, is failing (they all are failing).

[1] https://thunk.org/gce-xfstests
[2] https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md

It's also not one consistent xfstests which is failing, but it does
tend to be tests which are loading up the storage stack with a lot of
small random read/writes, especially involving metadata blocks/writes.
(For example, tests which run fsstress.)

Since this reliably triggers for me, and other people who are running
kvm-xfstests or running xfstests on their own test environments
aren't seeing it, I'm assuming it must be some kind of interesting
interaction involving virtio-scsi, perhaps with how Google Persistent
Disk is behaving (maybe timing related?  who knows?).  Darrick Wong
did say he saw something like it once using Oracle's Cloud
infrastructure, but as far as I know it hasn't reproduced since.  On
Google Compute Engine VMs, it reproduces *extremely* reliably.

I expect that if you were to set up gce-xfstests, get a free GCE
account with the initial $300 free credits, you could run
"gce-xfstests -c ext4/4k -g auto" and it would reproduce within an
hour or so.  (So under a dollar's worth of VM credits, so long as you
notice that it's hung and shut down the VM after gathering debugging
data.)

The instructions are at [2], and the image xfstests-202008311554 in
the xfstests-cloud project is a public copy of the VM test appliance I
was using.

% gcloud compute images describe --project xfstests-cloud xfstests-202008311554
archiveSizeBytes: '1720022528'
creationTimestamp: '2020-09-15T15:09:30.544-07:00'
description: Linux Kernel File System Test Appliance
diskSizeGb: '10'
family: xfstests
guestOsFeatures:
- type: VIRTIO_SCSI_MULTIQUEUE
- type: UEFI_COMPATIBLE
id: '1558420969906537845'
kind: compute#image
labelFingerprint: V-2Qgcxt2uw=
labels:
  blktests: g8a75bed
  e2fsprogs: v1_45_6
  fio: fio-3_22
  fsverity: v1_2
  ima-evm-utils: v1_3_1
  nvme-cli: v1_12
  quota: g13bb8c2
  util-linux: v2_36
  xfsprogs: v5_8_0-rc1
  xfstests: linux-v3_8-2838-geb439bf2
  xfstests-bld: gb5085ab
licenseCodes:
- '5543610867827062957'
licenses:
- https://www.googleapis.com/compute/v1/projects/debian-cloud/global/licenses/debian-10-buster
name: xfstests-202008311554
selfLink: https://www.googleapis.com/compute/v1/projects/xfstests-cloud/global/images/xfstests-202008311554
sourceDisk: https://www.googleapis.com/compute/v1/projects/xfstests-cloud/zones

Re: PROBLEM: Reiser4 hard lockup

2020-10-27 Thread Theodore Y. Ts'o

On Tue, Oct 27, 2020 at 01:53:31AM +0100, Edward Shishkin wrote:
> > > reiser4progs 1.1.x Software Framework Release Number (SFRN) 4.0.1 file
> > > system utilities should not be used to check/fix media formatted 'a
> > > priori' in SFRN 4.0.2 and vice-versa.
> > 
> > Honestly, this is the first time I've heard about a Linux FS having
> > versioning other than a major one
> 
> This is because, unlike other Linux file systems, reiser4 is a
> framework.
> 
> In vanilla kernel having a filesystem-as-framework is discouraged for
> ideological reasons. As they explained: "nobody's interested in
> plugins". A huge monolithic mess without any internal structure -
> welcome :)

I wouldn't call it an ideological problem, but more about wanting to
avoid interoperability issues and wanting to reduce confusion on the
part of users, especially if images get moved between systems.  There
are also plenty of ways of introducing internal structure and code
cleanliness without going completely undisciplined with respect to
on-disk format extensions.  :-)

Finally, I'll note that ext 2/3/4 does have a rather fine-grained set
of feature flags, with specific rules about what the kernel --- and
e2fsck --- should do if it finds a feature bit it doesn't understand
in the incompat, ro_compat, and compat feature flag sets.  This is
especially helpful since we have multiple implementations of ext 2/3/4
out there (in FreeBSD, the GRUB bootloader, GNU HURD, Fuchsia, etc.)
and so using feature bits allows for safe and reliable interoperability,
with the user being warned if they can safely only mount the file
system read-only, or not at all, if the file system has some new
feature that their current OS version does not support.  We can also
give appropriate warnings if they are using an insufficiently recent
version of the userspace tools.
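
As a rough illustration of the kernel-side rule described above, here is a
minimal sketch.  The bit values and the function name are hypothetical and
chosen only for the example; the real feature constants live in
fs/ext4/ext4.h, and e2fsck applies its own rules that are not modelled here.

```python
# Hypothetical "known feature" masks, for illustration only.  The real
# constants are defined in fs/ext4/ext4.h and are not reproduced here.
KNOWN_COMPAT = 0x0001 | 0x0002
KNOWN_RO_COMPAT = 0x0001 | 0x0002
KNOWN_INCOMPAT = 0x0001 | 0x0002


def mount_decision(compat, ro_compat, incompat):
    """Return 'refuse', 'read-only', or 'read-write' for a superblock.

    compat is accepted but never inspected: an unknown compat bit
    does not restrict mounting.
    """
    # An unknown incompat bit means the on-disk format cannot be
    # interpreted safely at all, so the mount is refused.
    if incompat & ~KNOWN_INCOMPAT:
        return "refuse"
    # An unknown ro_compat bit means reading is safe but writing is not.
    if ro_compat & ~KNOWN_RO_COMPAT:
        return "read-only"
    # Unknown compat bits do not restrict mounting.
    return "read-write"
```

With these made-up masks, a file system carrying only an unrecognised
ro_compat bit can still be mounted, but only read-only.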

Cheers,

- Ted

Re: [PATCH v3 23/32] jbd2: fix a kernel-doc markup

2020-10-27 Thread Theodore Y. Ts'o

On Tue, Oct 27, 2020 at 10:51:27AM +0100, Mauro Carvalho Chehab wrote:
> The kernel-doc markup that documents _fc_replay_callback is
> missing an asterisk, causing this warning:
> 
>   ../include/linux/jbd2.h:1271: warning: Function parameter or member 'j_fc_replay_callback' not described in 'journal_s'
> 
> When building the docs.
> 
> Fixes: 609f928af48f ("jbd2: fast commit recovery path")
> Signed-off-by: Mauro Carvalho Chehab 

Thanks, I'm accumulating some bug fix patches to push to Linus, so
I'll grab this for the ext4 git tree.

- Ted

Re: [PATCH] ext4: properly check for dirty state in ext4_inode_datasync_dirty()

2020-10-28 Thread Theodore Y. Ts'o

On Wed, Oct 28, 2020 at 08:57:03AM +0530, Ritesh Harjani wrote:
> 
> Well, I too noticed this yesterday while I was testing xfstests -g swap.
> Those tests were returning _notrun, hence that could be the reason why
> it didn't get noticed in XFSTESTing from Ted.

Yeah, one of the things I discussed with Harshad is we really need a
test that looks like generic/472, but which is in shared/NNN, and
which unconditionally tries to use swapon for those file systems where
swapfiles are expected to work.  This is actually the second
regression caused by our breaking swapfile support (the other being
the iomap bmap change), which escaped our testing because we didn't
notice that generic/472 was skipped.

(Mental note; perhaps we should have a way of flagging tests that are
skipped when previously they would run in the {kvm,gce}-xfstests
framework.)
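
One way such a check might look is sketched below.  It assumes per-test
results are available as simple name-to-status mappings; that format and the
function name are hypothetical, not something {kvm,gce}-xfstests provides
today.

```python
def newly_skipped(previous, current):
    """List tests that ran in the previous run but are skipped now.

    Both arguments map a test name to one of 'passed', 'failed' or
    'skipped'.  This result format is an assumption for the example.
    """
    return sorted(
        name
        for name, status in current.items()
        if status == "skipped" and previous.get(name) in ("passed", "failed")
    )
```

Tests that were already skipped in the previous run, or that are new in the
current run, are deliberately not reported.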

- Ted

Re: UBSAN: shift-out-of-bounds in ext4_fill_super

2020-12-10 Thread Theodore Y. Ts'o

On Thu, Dec 10, 2020 at 09:09:51AM +0100, Dmitry Vyukov wrote:
> >  * [new tag]   ext4-for-linus-5.8-rc1-2 -> ext4-for-linus-5.8-rc1-2
> >  ! [rejected]  ext4_for_linus   -> ext4_for_linus  (would clobber existing tag)
> 
> Interesting. First time I see this. Should syzkaller use "git fetch
> --tags --force"?...
> StackOverflow suggests it should help:
> https://stackoverflow.com/questions/58031165/how-to-get-rid-of-would-clobber-existing-tag

Yeah, sorry, ext4_for_linus is a signed tag which is only used to
authenticate my pull request to Linus.  After Linus accepts the pull,
the digital signature is going to be upstream, and so I end up
deleting and then reusing that tag for the next merge window.

I guess I could just start always using ext4_for_linus- and
just delete the tags once they have been accepted, to keep my list of
tags clean. 

It's going to make everyone else's tags who pull from ext4.git messy,
though, with gobs of tags that probably won't be of use to them.  It
does avoid the need to use git fetch --tags --force, and I guess
people are used to the need to GC tags with the linux-repo.  So maybe
that's the right thing to do going forward.

- Ted

Re: UBSAN: shift-out-of-bounds in ext4_fill_super

2020-12-09 Thread Theodore Y. Ts'o

On Tue, Dec 08, 2020 at 11:33:11PM -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    15ac8fdb Add linux-next specific files for 20201207
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1125c92350
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3696b8138207d24d
> dashboard link: https://syzkaller.appspot.com/bug?extid=345b75652b1d24227443
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=151bf86b50
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=139212cb50

#syz test git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git e360ba58d067a30a4e3e7d55ebdd919885a058d6

From 3d3bc303a8a8f7123cf486f49fa9060116fa1465 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o 
Date: Wed, 9 Dec 2020 15:59:11 -0500
Subject: [PATCH] ext4: check for invalid block size early when mounting a file
 system

Check for valid block size directly by validating s_log_block_size; we
were doing this in two places.  First, by calculating blocksize via
BLOCK_SIZE << s_log_block_size, and then checking that the blocksize
was valid.  And then secondly, by checking s_log_block_size directly.

The first check is not reliable, and can trigger an UBSAN warning if
s_log_block_size on a maliciously corrupted superblock is greater than
22.  This is harmless, since the second test will correctly reject the
maliciously fuzzed file system, but to make syzbot shut up, and
because the two checks are duplicative in any case, delete the
blocksize check, and move the s_log_block_size check earlier in
ext4_fill_super().

Signed-off-by: Theodore Ts'o 
Reported-by: syzbot+345b75652b1d24227...@syzkaller.appspotmail.com
---
 fs/ext4/super.c | 40 
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f86220a8df50..4a16bbf0432c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4202,18 +4202,25 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 */
sbi->s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT;
 
-   blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
-
-   if (blocksize == PAGE_SIZE)
-   set_opt(sb, DIOREAD_NOLOCK);
-
-   if (blocksize < EXT4_MIN_BLOCK_SIZE ||
-   blocksize > EXT4_MAX_BLOCK_SIZE) {
+   if (le32_to_cpu(es->s_log_block_size) >
+   (EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
ext4_msg(sb, KERN_ERR,
-  "Unsupported filesystem blocksize %d (%d log_block_size)",
-blocksize, le32_to_cpu(es->s_log_block_size));
+"Invalid log block size: %u",
+le32_to_cpu(es->s_log_block_size));
goto failed_mount;
}
+   if (le32_to_cpu(es->s_log_cluster_size) >
+   (EXT4_MAX_CLUSTER_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
+   ext4_msg(sb, KERN_ERR,
+"Invalid log cluster size: %u",
+le32_to_cpu(es->s_log_cluster_size));
+   goto failed_mount;
+   }
+
+   blocksize = EXT4_MIN_BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
+
+   if (blocksize == PAGE_SIZE)
+   set_opt(sb, DIOREAD_NOLOCK);
 
if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV) {
sbi->s_inode_size = EXT4_GOOD_OLD_INODE_SIZE;
@@ -4432,21 +4439,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
if (!ext4_feature_set_ok(sb, (sb_rdonly(sb))))
goto failed_mount;
 
-   if (le32_to_cpu(es->s_log_block_size) >
-   (EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
-   ext4_msg(sb, KERN_ERR,
-"Invalid log block size: %u",
-le32_to_cpu(es->s_log_block_size));
-   goto failed_mount;
-   }
-   if (le32_to_cpu(es->s_log_cluster_size) >
-   (EXT4_MAX_CLUSTER_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
-   ext4_msg(sb, KERN_ERR,
-"Invalid log cluster size: %u",
-le32_to_cpu(es->s_log_cluster_size));
-   goto failed_mount;
-   }
-
if (le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks) > (blocksize / 4)) {
ext4_msg(sb, KERN_ERR,
 "Number of reserved GDT blocks insanely large: %d",
-- 
2.28.0
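
The ordering that the patch above establishes (validate the log value first,
shift second) can be modelled with a short sketch.  The numeric constants are
an assumption based on my understanding of fs/ext4/ext4.h and do not appear in
the patch itself; the function name is hypothetical.

```python
# Assumed values; believed to match fs/ext4/ext4.h but not taken from
# the patch above.
EXT4_MIN_BLOCK_SIZE = 1024
EXT4_MIN_BLOCK_LOG_SIZE = 10
EXT4_MAX_BLOCK_LOG_SIZE = 16


def block_size_from_log(s_log_block_size):
    """Validate the on-disk log value before using it as a shift count."""
    limit = EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE
    if s_log_block_size > limit:
        # Mirrors the "Invalid log block size" rejection in the patch.
        raise ValueError(f"Invalid log block size: {s_log_block_size}")
    # Only reached with a small, validated shift count.
    return EXT4_MIN_BLOCK_SIZE << s_log_block_size
```

Python integers do not overflow, so this only shows the control flow; in C the
point of checking first is that the shift is never evaluated with an
out-of-range count.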

Re: general protection fault in ext4_commit_super

2020-12-22 Thread Theodore Y. Ts'o

On Tue, Dec 22, 2020 at 12:28:53PM +0100, Jan Kara wrote:
> > Fix e810c942a325 ("ext4: save error info to sb through journal if available")
> > by flushing work as part of rollback.
> 
> Thanks for having a look. I don't think the fix is quite correct though. The
> flush_work() should be at failed_mount3: label. So something like attached
> fixup. Ted, can you please fold it into the buggy commit?

Done.  I folded it into "ext4: defer saving error info from atomic
context" since this is the commit where we introduced the s_error_work
workqueue.

Thanks!!

- Ted

[GIT PULL] ext4 updates for v5.11-rc1

2020-12-22 Thread Theodore Y. Ts'o

The following changes since commit 418baf2c28f3473039f2f7377760bd8f6897ae18:

  Linux 5.10-rc5 (2020-11-22 15:36:08 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

for you to fetch changes up to be993933d2e997fdb72b8b1418d2a84df79b8962:

  ext4: remove unnecessary wbc parameter from ext4_bio_write_page (2020-12-22 13:08:45 -0500)

NOTE: The reason why the branch had recently changed was to add a
one-line fix which added a flush_work() call to an error/cleanup path,
to address a syzbot reported failure.  See the thread at:

http://lore.kernel.org/r/1faff305b709b...@google.com

There were also some commit description updates to add some Cc:
sta...@kernel.org tags.

This branch was tested and passes xfstests regression tests, and in
any case, it's all bug fixes and cleanups:

TESTRUNID: tytso-20201222152130
KERNEL:    5.10.0-rc5-xfstests-00029-gbe993933d2e9 #2064 SMP Tue Dec 22 15:19:12 EST 2020 x86_64
CMDLINE:   -c ext4/4k -g auto
CPUS:      2
MEM:       7680

ext4/4k: 520 tests, 43 skipped, 6608 seconds
Totals: 477 tests, 43 skipped, 0 failures, 0 errors, 6554s


Various bug fixes and cleanups for ext4; no new features this cycle.



Alexander Lochmann (1):
  Updated locking documentation for transaction_t

Chunguang Xu (7):
  ext4: use ASSERT() to replace J_ASSERT()
  ext4: remove redundant mb_regenerate_buddy()
  ext4: simplify the code of mb_find_order_for_block
  ext4: update ext4_data_block_valid related comments
  ext4: delete nonsensical (commented-out) code inside ext4_xattr_block_set()
  ext4: fix a memory leak of ext4_free_data
  ext4: avoid s_mb_prefetch to be zero in individual scenarios

Colin Ian King (1):
  ext4: remove redundant assignment of variable ex

Dan Carpenter (1):
  ext4: fix an IS_ERR() vs NULL check

Gustavo A. R. Silva (1):
  ext4: fix fall-through warnings for Clang

Harshad Shirwadkar (3):
  ext4: add docs about fast commit idempotence
  ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.h
  jbd2: add a helper to find out number of fast commit blocks

Jan Kara (8):
  ext4: fix deadlock with fs freezing and EA inodes
  ext4: don't remount read-only with errors=continue on reboot
  ext4: remove redundant sb checksum recomputation
  ext4: standardize error message in ext4_protect_reserved_inode()
  ext4: make ext4_abort() use __ext4_error()
  ext4: move functions in super.c
  ext4: simplify ext4 error translation
  ext4: defer saving error info from atomic context

Kaixu Xia (2):
  ext4: remove redundant operation that set bh to NULL
  ext4: remove the unused EXT4_CURRENT_REV macro

Lei Chen (1):
  ext4: remove unnecessary wbc parameter from ext4_bio_write_page

Roman Anufriev (2):
  ext4: add helpers for checking whether quota can be enabled/is journalled
  ext4: print quota journalling mode on (re-)mount

Theodore Ts'o (1):
  ext4: check for invalid block size early when mounting a file system

Xianting Tian (1):
  ext4: remove the null check of bio_vec page

 Documentation/filesystems/ext4/journal.rst |  50
 fs/ext4/balloc.c                           |   2
 fs/ext4/block_validity.c                   |  16
 fs/ext4/ext4.h                             |  77
 fs/ext4/ext4_jbd2.c                        |   4
 fs/ext4/ext4_jbd2.h                        |   9
 fs/ext4/extents.c                          |   5
 fs/ext4/fast_commit.c                      |  99
 fs/ext4/fast_commit.h                      |  78
 fs/ext4/fsync.c                            |   2
 fs/ext4/indirect.c                         |   4
 fs/ext4/inode.c                            |  35
 fs/ext4/mballoc.c                          |  39
 fs/ext4/namei.c                            |  12
 fs/ext4/page-io.c                          |   5
 fs/ext4/super.c                            | 422
 fs/ext4/xattr.c                            |   1
 fs/jbd2/journal.c                          |   8
 include/linux/jbd2.h                       |  14
 19 files changed, 504 insertions(+), 378 deletions(-)

Re: [PATCH -next] ext4: use DEFINE_MUTEX (and mutex_init() had been too late)

2020-12-23 Thread Theodore Y. Ts'o

On Wed, Dec 23, 2020 at 10:12:54PM +0800, Zheng Yongjun wrote:
> Signed-off-by: Zheng Yongjun 

Why is mutex_init() too late?  We only take the mutex when we are
mounting an ext4 file system, and that can't happen until ext4_init_fs
is called.

- Ted

>  fs/ext4/super.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 94472044f4c1..8776f06a639d 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -59,7 +59,7 @@
>  #include 
>  
>  static struct ext4_lazy_init *ext4_li_info;
> -static struct mutex ext4_li_mtx;
> +static DEFINE_MUTEX(ext4_li_mtx);
>  static struct ratelimit_state ext4_mount_msg_ratelimit;
>  
>  static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
> @@ -6640,7 +6640,6 @@ static int __init ext4_init_fs(void)
>  
>   ratelimit_state_init(&ext4_mount_msg_ratelimit, 30 * HZ, 64);
>   ext4_li_info = NULL;
> - mutex_init(&ext4_li_mtx);
>  
>   /* Build-time check for flags consistency */
>   ext4_check_flag_values();
> -- 
> 2.22.0
>

Re: [PATCH v2 0/3] add support for metadata encryption to F2FS

2020-12-17 Thread Theodore Y. Ts'o

On Thu, Dec 17, 2020 at 03:04:32PM +, Satya Tangirala wrote:
> This patch series adds support for metadata encryption to F2FS using
> blk-crypto.

Is there a companion patch series needed so that f2fstools can
check/repair a file system with metadata encryption enabled?

- Ted

Re: [PATCH] ext4: Don't leak old mountpoint samples

2020-12-17 Thread Theodore Y. Ts'o

On Tue, Dec 01, 2020 at 04:13:01PM +0100, Richard Weinberger wrote:
> As soon as the first file is opened, ext4 samples the mountpoint
> of the filesystem in 64 bytes of the super block.
> It does so using strlcpy(), this means that the remaining bytes
> in the super block string buffer are untouched.
> If the mount point before had a longer path than the current one,
> it can be reconstructed.
> 
> Consider the case where the fs was mounted to "/media/johnjdeveloper"
> and later to "/".
> The super block buffer then contains "/\x00edia/johnjdeveloper".
> 
> This case was seen in the wild and caused confusion about how the name
> of a developer ends up on the super block of a filesystem used
> in production...
> 
> Fix this by clearing the string buffer before writing to it.
> 
> Signed-off-by: Richard Weinberger 

Thanks for reporting this issue.  In fact, the better fix is to use
strncpy().  See my revised patch for an explanation of why:

commit cdc9ad7d3f201a77749432878fb4caa490862de6
Author: Theodore Ts'o 
Date:   Thu Dec 17 13:24:15 2020 -0500

ext4: don't leak old mountpoint samples

When the first file is opened, ext4 samples the mountpoint of the
filesystem in 64 bytes of the super block.  It does so using
strlcpy(), which means that the remaining bytes in the super block
string buffer are untouched.  If the mount point before had a longer
path than the current one, it can be reconstructed.

Consider the case where the fs was mounted to "/media/johnjdeveloper"
and later to "/".  The super block buffer then contains
"/\x00edia/johnjdeveloper".

This case was seen in the wild and caused confusion about how the name
of a developer ends up on the super block of a filesystem used
in production...

Fix this by using strncpy() instead of strlcpy().  Unlike strlcpy(),
strncpy() pads the remainder of the destination with NUL bytes, so
nothing from the old path survives.  The superblock
field is defined to be a fixed-size char array, and it is already
marked using __nonstring in fs/ext4/ext4.h.  The consumer of the field
in e2fsprogs already assumes that in the case of a 64+ byte mount
path, s_last_mounted will not be NUL terminated.

Reported-by: Richard Weinberger 
Signed-off-by: Theodore Ts'o 

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1cd3d26e3217..349b27f0dda0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -810,7 +810,7 @@ static int ext4_sample_last_mounted(struct super_block *sb,
if (err)
goto out_journal;
lock_buffer(sbi->s_sbh);
-   strlcpy(sbi->s_es->s_last_mounted, cp,
+   strncpy(sbi->s_es->s_last_mounted, cp,
sizeof(sbi->s_es->s_last_mounted));
ext4_superblock_csum_set(sb);
unlock_buffer(sbi->s_sbh);
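
To make the difference concrete, here is a small model of the two copy
behaviours acting on a 64-byte buffer.  It is a Python emulation written for
illustration, with hypothetical function names, not the kernel
implementations; it only models what each call writes into the destination.

```python
def strlcpy_model(dst, src):
    """Copy at most len(dst) - 1 bytes, add one NUL, leave the rest of dst alone."""
    n = min(len(src), len(dst) - 1)
    dst[:n] = src[:n]
    dst[n] = 0


def strncpy_model(dst, src):
    """Copy at most len(dst) bytes and zero-fill whatever is left of dst.

    If src is at least as long as dst, no terminating NUL is written.
    """
    n = min(len(src), len(dst))
    dst[:n] = src[:n]
    dst[n:] = bytes(len(dst) - n)


old_path = b"/media/johnjdeveloper"

leaky = bytearray(64)
leaky[:len(old_path)] = old_path
strlcpy_model(leaky, b"/")      # stale bytes of the old path remain

clean = bytearray(64)
clean[:len(old_path)] = old_path
strncpy_model(clean, b"/")      # everything after "/" is zeroed
```

After these calls, `leaky` still begins with "/\x00edia/johnjdeveloper" while
`clean` holds "/" followed by 63 NUL bytes.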

Re: [PATCH v2 0/3] add support for metadata encryption to F2FS

2020-12-17 Thread Theodore Y. Ts'o

On Thu, Dec 17, 2020 at 08:51:14PM +, Satya Tangirala wrote:
> On Thu, Dec 17, 2020 at 01:08:49PM -0500, Theodore Y. Ts'o wrote:
> > On Thu, Dec 17, 2020 at 03:04:32PM +, Satya Tangirala wrote:
> > > This patch series adds support for metadata encryption to F2FS using
> > > blk-crypto.
> > 
> > Is there a companion patch series needed so that f2fstools can
> > check/repair a file system with metadata encryption enabled?
> > 
> > - Ted
> Yes! It's at
> https://lore.kernel.org/linux-f2fs-devel/20201217151013.1513045-1-sat...@google.com/

Cool, I've been meaning to update f2fs-tools in Debian, and including
these patches will allow us to generate {kvm,gce,android}-xfstests
images with this support.  I'm hoping to get to it sometime between
Christmas and New Year's.

I guess there will be some additional work needed to create
the f2fs image with fixed keys for a particular file system in
xfstests-bld, and then mounting and checking said image with the
appropriate keys as well.   Is that something you've put together?

Cheers,

- Ted

Re: UBSAN: shift-out-of-bounds in ext4_fill_super

2020-12-14 Thread Theodore Y. Ts'o

(Dropping off-topic lists)

On Mon, Dec 14, 2020 at 03:37:37PM +0100, Dmitry Vyukov wrote:
> > It's going to make everyone else's tags who pull from ext4.git messy,
> > though, with gobs of tags that probably won't be of use to them.  It
> > does avoid the need to use git fetch --tags --force, and I guess
> > people are used to the need to GC tags with the linux-repo.

(I had meant to say linux-next repo above.)

> syzbot is now prepared and won't fail next time, nor on other similar
> trees. Which is good.
> So it's really up to you.

I'm curious --- are you having to do anything special in terms of
deleting old tags to keep the size of the repo under control?  Git
will keep a tag around indefinitely, so if you have huge numbers of
next-YYYYMMDD tags in your repo, the size will grow without bound.
Are you doing anything to automatically garbage collect tags to prevent
this from being a problem?

(I am not pulling linux-next every day; only when I need to debug a
bug reported against the -next tree, so I just manually delete the
tags as necessary.  So I'm curious what folks who are following
linux-next are doing, and whether they have something specific for
linux-next tags, or whether they have a more general solution.)

Cheers,

- Ted

Re: [PATCH] ext4: fix -Wstringop-truncation warnings

2020-12-15 Thread Theodore Y. Ts'o

On Thu, Nov 12, 2020 at 05:33:24PM +0800, Kang Wenlin wrote:
> From: Wenlin Kang 
> 
> The strncpy() function may create an unterminated string,
> use strscpy_pad() instead.
> 
> This fixes the following warning:
> 
> fs/ext4/super.c: In function '__save_error_info':
> fs/ext4/super.c:349:2: warning: 'strncpy' specified bound 32 equals destination size [-Wstringop-truncation]
>   strncpy(es->s_last_error_func, func, sizeof(es->s_last_error_func));
>   ^~~
> fs/ext4/super.c:353:3: warning: 'strncpy' specified bound 32 equals destination size [-Wstringop-truncation]
>    strncpy(es->s_first_error_func, func,
>    ^
> sizeof(es->s_first_error_func));
> ~~~

What compiler are you using?  s_last_error_func is defined to not
necessarily be NUL terminated.  So strscpy_pad() is not a proper
replacement for strncpy() in this use case.

From Documentation/process/deprecated:

   If a caller is using non-NUL-terminated strings, strncpy() can
   still be used, but destinations should be marked with the __nonstring
   attribute to avoid future compiler warnings.

s_{first,last}_error_func is properly annotated with __nonstring in
fs/ext4/ext4.h.

- Ted

Re: [PATCH 031/141] ext4: Fix fall-through warnings for Clang

2020-12-16 Thread Theodore Y. Ts'o

On Fri, Nov 20, 2020 at 12:28:32PM -0600, Gustavo A. R. Silva wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
> by explicitly adding a break statement instead of just letting the code
> fall through to the next case.
> 
> Link: https://github.com/KSPP/linux/issues/115
> Signed-off-by: Gustavo A. R. Silva 

Thanks, applied.

- Ted

Re: [PATCH] fs: ext4: remove unnecessary wbc parameter from ext4_bio_write_page

2020-12-16 Thread Theodore Y. Ts'o

On Fri, Dec 11, 2020 at 02:54:24PM +0800, chenle...@gmail.com wrote:
> From: Lei Chen 
> 
> ext4_bio_write_page does not need wbc parameter, since its parameter
> io contains the io_wbc field. The io::io_wbc is initialized by
> ext4_io_submit_init which is called in ext4_writepages and
> ext4_writepage functions prior to ext4_bio_write_page.
> Therefore, when ext4_bio_write_page is called, wbc info
> has already been included in io parameter.
> 
> Signed-off-by: Lei Chen 

Thanks, applied.

- Ted

Re: [PATCH] ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it

2020-10-23 Thread Theodore Y. Ts'o

On Thu, Oct 22, 2020 at 04:52:52PM -0700, Brendan Higgins wrote:
> So you, me, Luis, David, and a whole bunch of other people have been
> thinking about this problem for a while. What if we just put
> kunitconfig fragments in directories along side the test files they
> enable?
> 
> For example, we could add a file to fs/ext4/kunitconfig which contains:
> 
> CONFIG_EXT4_FS=y
> CONFIG_EXT4_KUNIT_TESTS=y
> 
> We could do something similar in fs/jbd2, etc.
> 
> Obviously some logically separate KUnit tests (different maintainers,
> different Kconfig symbols, etc) reside in the same directory, for
> these we could name the kunitconfig file something like
> lib/list-test.kunitconfig (not a great example because lists are
> always built into Linux), but you get the idea.
> 
> Then like Ted suggested, if you call kunit.py run foo/bar, then
> 
> if bar is a directory, then kunit.py will look for foo/bar/kunitconfig
> 
> if bar is a file ending with .kunitconfig like foo/bar.kunitconfig,
> then it will use that kunitconfig
> 
> if bar is '...' (foo/...) then kunit.py will look for all kunitconfigs
> underneath foo.
> 
> Once all the kunitconfigs have been resolved, they will be merged into
> the .kunitconfig. If they can be successfully merged together, the new
> .kunitconfig will then continue to function as it currently does.

I was thinking along a similar set of lines this morning.  One thing
I'd add to your suggestion is to change how
.kunitconfig is interpreted such that

CONFIG_KUNIT=y

is always implied, so it doesn't have to be specified explicitly, and
that if a line like:

fs/ext4

or

mm

etc. occurs, that will cause an include of the Kunitconfig (I'd use a
capitalized version of the filename like Kconfig, so that it's easier
to see in a directory listing) in the named directory.

That way, .kunitconfig is backwards compatible, but it also allows
people to put a one-liner into .kunitconfig to enable the unit tests
for that particular directory.
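
A minimal sketch of how such an interpretation might work follows.  This is
only a model of the proposal, not how kunit.py behaves; the function name and
the choice to ignore non-CONFIG_ lines inside an included Kunitconfig are
assumptions made for the example.

```python
import os


def expand_kunitconfig(lines, tree_root):
    """Expand .kunitconfig lines into a flat list of CONFIG_ options.

    Hypothetical model of the proposal above: CONFIG_KUNIT=y is always
    implied, lines starting with CONFIG_ are kept as they are, and any
    other non-empty line names a directory whose Kunitconfig is included.
    """
    options = ["CONFIG_KUNIT=y"]
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("CONFIG_"):
            entries = [line]
        else:
            path = os.path.join(tree_root, line, "Kunitconfig")
            with open(path) as f:
                entries = [l.strip() for l in f if l.strip().startswith("CONFIG_")]
        for entry in entries:
            if entry not in options:      # drop duplicates, keep order
                options.append(entry)
    return options
```

Under this model an existing .kunitconfig containing only CONFIG_ lines
expands to itself, and a one-line file containing "fs/ext4" pulls in
whatever fs/ext4/Kunitconfig lists.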

What do folks think?

Cheers,

- Ted

Re: [RFC] Removing b_end_io

2020-10-25 Thread Theodore Y. Ts'o

On Sun, Oct 25, 2020 at 04:44:38AM +, Matthew Wilcox wrote:
> @@ -3068,6 +3069,12 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
>   }
>  
>   submit_bio(bio);
> +}
> +
> +static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> +  enum rw_hint write_hint, struct writeback_control *wbc)
> +{
> + __bh_submit(bh, op | op_flags, write_hint, wbc, end_bio_bh_io_sync);
>   return 0;
>  }
>

I believe this will break use cases where the file system sets
bh->b_end_io and then calls submit_bh(), which then calls
submit_bh_wbc().  That's because with this change, calls to
submit_bh_wbc() --- including those via submit_bh() --- ignore
bh->b_end_io and result in end_bio_bh_io_sync getting used.

Filesystems that do this include fs/ntfs and fs/reiserfs.

In this case, that can probably be fixed by changing submit_bh() to
pass in bh->b_end_io, or switching those users to use the new
bh_submit() function to prevent these breakages.

- Ted

Re: [PATCH v4 12/27] jbd2: fix kernel-doc markups

2020-11-19 Thread Theodore Y. Ts'o

On Mon, Nov 16, 2020 at 11:18:08AM +0100, Mauro Carvalho Chehab wrote:
> Kernel-doc markup should use this format:
> identifier - description
> 
> They should not have any type before that, as otherwise
> the parser won't do the right thing.
> 
> Also, some identifiers have different names between their
> prototypes and the kernel-doc markup.
> 
> Reviewed-by: Jan Kara 
> Signed-off-by: Mauro Carvalho Chehab 

Applied to the ext4 tree, thanks!

- Ted

[GIT PULL] more ext4 fixes for v5.10-rc4

2020-11-12 Thread Theodore Y. Ts'o

The following changes since commit 52d1998d09af92d44ffce7454637dd3fd1afdc7d:

  Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt (2020-11-10 10:05:37 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus_bugfixes

for you to fetch changes up to d196e229a80c39254f4adbc312f55f5198e98941:

  Revert "ext4: fix superblock checksum calculation race" (2020-11-11 14:24:18 -0500)


Two ext4 bug fixes, one via a revert of a commit sent during the merge window.


Harshad Shirwadkar (1):
  ext4: handle dax mount option collision

Theodore Ts'o (1):
  Revert "ext4: fix superblock checksum calculation race"

 fs/ext4/ext4.h  |  6 +++---
 fs/ext4/super.c | 11 ---
 2 files changed, 3 insertions(+), 14 deletions(-)

Re: [PATCH] ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it

2020-10-21 Thread Theodore Y. Ts'o

On Wed, Oct 21, 2020 at 02:16:56PM -0700, Randy Dunlap wrote:
> On 10/21/20 2:15 PM, Brendan Higgins wrote:
> > On Tue, Oct 20, 2020 at 12:37 AM Geert Uytterhoeven
> >  wrote:
> >>
> >> EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
> >> user may not want to enable.  Fix this by making the test depend on
> >> EXT4_FS instead.
> >>
> >> Fixes: 1cbeab1b242d16fd ("ext4: add kunit test for decoding extended 
> >> timestamps")
> >> Signed-off-by: Geert Uytterhoeven 
> > 
> > If I remember correctly, having EXT4_KUNIT_TESTS select EXT4_FS was
> > something that Ted specifically requested, but I don't have any strong
> > feelings on it either way.
> 
> omg, please No. depends on is the right fix here.

So my requirement which led to that particular request is to keep what
needs to be placed in .kunitconfig to a small and reasonable set.

Per Documentation/dev-tools/kunit, we start by:

cd $PATH_TO_LINUX_REPO
cp arch/um/configs/kunit_defconfig .kunitconfig

we're then supposed to add whatever Kunit tests we want to enable, to wit:

CONFIG_EXT4_KUNIT_TESTS=y

so that .kunitconfig would look like this:

CONFIG_KUNIT=y
CONFIG_KUNIT_TEST=y
CONFIG_KUNIT_EXAMPLE_TEST=y
CONFIG_EXT4_KUNIT_TESTS=y

... and then you should be able to run:

./tools/testing/kunit/kunit.py run

... and have the kunit tests run.  I would *not* like to have to put a
huge long list of CONFIG_* dependencies into the .kunitconfig file.

I don't particularly care how this gets achieved, but please think
about how to make it easy for a kernel developer to run a specific set
of subsystem unit tests.  (In fact, being able to do something like
"kunit.py run fs/ext4 fs/jbd2" or maybe "kunit.py run fs/..." would be
*great*.  No need to fuss with hand editing the .kunitconfig file at
all would be **wonderful**.)

Cheers,

- Ted

Re: [PATCH] ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it

2020-10-21 Thread Theodore Y. Ts'o

On Wed, Oct 21, 2020 at 04:07:15PM -0700, Randy Dunlap wrote:
> > I'm don't particularly care how this gets achieved, but please think
> > about how to make it easy for a kernel developer to run a specific set
> > of subsystem unit tests.  (In fact, being able to do something like
> > "kunit.py run fs/ext4 fs/jbd2" or maybe "kunit.py run fs/..." would be
> > *great*.  No need to fuss with hand editing the .kunitconfig file at
> > all would be **wonderful**.
> 
> I understand the wish for ease of use, but this is still the tail
> wagging the dog.
> 
> The primary documentation for 'select' is
> Documentation/kbuild/kconfig-language.rst, which says:
> 
>   Note:
>   select should be used with care. select will force
>   a symbol to a value without visiting the dependencies.
>   By abusing select you are able to select a symbol FOO even
>   if FOO depends on BAR that is not set.
>   In general use select only for non-visible symbols
>   (no prompts anywhere) and for symbols with no dependencies.
>   That will limit the usefulness but on the other hand avoid
>   the illegal configurations all over.
> 

Well, the KUNIT configs are kind of a special case, since normally
they don't have a huge number of dependencies, since unit tests
in general are not integration tests.  So ideally, dependencies will
mostly be replaced with mocking functions.  And if there are *real*
dependencies that the Kunit Unit tests need, they can be explicitly
pulled in with selects.

That being said, as I said, I'm not picky about *how* this gets
achieved.  But ease of use is a key part of making people more likely
to run the unit tests.  So another way of solving the problem might be
to put some kind of automated dependency solver into kunit.py, or some
way of manually adding the necessary dependencies in some kind of
Kunitconfig files that are in directories where there are unit tests,
or maybe some kind of extension to the Kconfig file.  My main
requirement is that the only thing that should be necessary for
enabling the ext4 Kunit tests should be adding a single line to the
.kunitconfig file.  It's not fair to make the human developer manually
have to figure out the dependency chains.

As far as I'm concerned, ease of use is important enough to justify
special casing and/or bending the rules as far as "select" is concerned
for Kunit-related CONFIG items.  But if someone else wants to suggest a
better approach, I'm all ears.

Cheers,

- Ted

[GIT PULL] ext4 changes for 5.10

2020-10-22 Thread Theodore Y. Ts'o

The following changes since commit a1b8638ba1320e6684aa98233c15255eb803fac7:

  Linux 5.9-rc7 (2020-09-27 14:38:10 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

for you to fetch changes up to 1322181170bb01bce3c228b82ae3d5c6b793164f:

  ext4: fix invalid inode checksum (2020-10-21 23:22:38 -0400)


The significant new ext4 feature this time around is Harshad's new
fast_commit mode.  In addition, thanks to Mauricio for fixing a race
where mmap'ed pages that are being changed in parallel with a
data=journal transaction commit could result in bad checksums in the
journal that could cause journal replays to fail.  Also notable is
Ritesh's buffered write optimization which can result in significant
improvements on parallel write workloads.  (The kernel test robot
reported a 330.6% improvement on fio.write_iops on a 96 core system
using DAX[1].)

Besides that, we have the usual miscellaneous cleanups and bug fixes.

[1] https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian


Chunguang Xu (4):
  ext4: rename journal_dev to s_journal_dev inside ext4_sb_info
  ext4: rename system_blks to s_system_blks inside ext4_sb_info
  ext4: delete invalid comments near mb_buddy_adjust_border
  ext4: make mb_check_counter per group

Constantine Sapuntzakis (1):
  ext4: fix superblock checksum calculation race

Darrick J. Wong (1):
  ext4: limit entries returned when counting fsmap records

Dinghao Liu (1):
  ext4: fix error handling code in add_new_gdb

Eric Biggers (1):
  ext4: fix leaking sysfs kobject after failed mount

Harshad Shirwadkar (9):
  doc: update ext4 and journalling docs to include fast commit feature
  ext4: add fast_commit feature and handling for extended mount options
  ext4 / jbd2: add fast commit initialization
  jbd2: add fast commit machinery
  ext4: main fast-commit commit path
  jbd2: fast commit recovery path
  ext4: fast commit recovery path
  ext4: add a mount opt to forcefully turn fast commits on
  ext4: add fast commit stats in procfs

Hui Su (1):
  jbd2: fix the comment of struct jbd2_journal_handle

Jan Kara (2):
  ext4: discard preallocations before releasing group lock
  ext4: Detect already used quota file early

Jens Axboe (1):
  ext4: flag as supporting buffered async reads

Kaixu Xia (1):
  ext4: use the normal helper to get the actual inode

Luo Meng (1):
  ext4: fix invalid inode checksum

Mauricio Faria de Oliveira (4):
  jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  ext4: data=journal: fixes for ext4_page_mkwrite()
  ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()

Nikolay Borisov (1):
  ext4: remove unused argument from ext4_(inc|dec)_count

Petr Malat (1):
  ext4: do not interpret high bytes if 64bit feature is disabled

Randy Dunlap (1):
  ext4: delete duplicated words + other fixes

Ritesh Harjani (3):
  ext4: implement swap_activate aops using iomap
  ext4: optimize file overwrites
  ext4: fix bs < ps issue reported with dioread_nolock mount opt

Tian Tao (1):
  ext4: remove unused including 

Xiao Yang (1):
  ext4: disallow modifying DAX inode flag if inline_data has been set

Ye Bin (1):
  ext4: fix dead loop in ext4_mb_new_blocks

Zhang Qilong (1):
  ext4: add trace exit in exception path.

Zhang Xiaoxu (1):
  ext4: fix bdev write error check failed when mount fs with ro

changfengnan (1):
  jbd2: avoid transaction reuse after reformatting

zhangyi (F) (7):
  ext4: clear buffer verified flag if read meta block from disk
  ext4: introduce new metadata buffer read helpers
  ext4: use common helpers in all places reading metadata buffers
  ext4: use ext4_buffer_uptodate() in __ext4_get_inode_loc()
  ext4: introduce ext4_sb_breadahead_unmovable() to replace sb_breadahead_unmovable()
  ext4: use ext4_sb_bread() instead of sb_bread()
  ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()

 Documentation/filesystems/ext4/journal.rst |   66 ++
 Documentation/filesystems/journalling.rst  |   33 +
 fs/ext4/Makefile   |2 +-
 fs/ext4/acl.c  |2 +
 fs/ext4/balloc.c   |   14 +-
 fs/ext4/block_validity.c   |   10 +-
 fs/ext4/dir.c  |4 +-
 fs/ext4/ext4.h |  136 +++-
 fs/ext4/ext4_jbd2.c|2 +-
 fs/ext4/extents.c  |  315 +++-
 fs/ext4/extents_status.c   |   24 +
 fs/ext4/fast_commit.c

[GIT PULL] ext4 bug fixes for 5.10-rc

2020-11-22 Thread Theodore Y. Ts'o

The following changes since commit 09162bc32c880a791c6c0668ce0745cf7958f576:

  Linux 5.10-rc4 (2020-11-15 16:44:31 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus_fixes2

for you to fetch changes up to f902b216501094495ff75834035656e8119c537f:

  ext4: fix bogus warning in ext4_update_dx_flag() (2020-11-19 22:41:10 -0500)


A final set of miscellaneous bug fixes for ext4


Jan Kara (1):
  ext4: fix bogus warning in ext4_update_dx_flag()

Mauro Carvalho Chehab (1):
  jbd2: fix kernel-doc markups

Theodore Ts'o (1):
  ext4: drop fast_commit from /proc/mounts

 fs/ext4/ext4.h|  3 ++-
 fs/ext4/super.c   |  4 
 fs/jbd2/journal.c | 34 ++
 fs/jbd2/transaction.c | 31 ---
 include/linux/jbd2.h  |  2 +-
 5 files changed, 37 insertions(+), 37 deletions(-)

Re: [PATCH v2 2/3] fscrypt: Have filesystems handle their d_ops

2020-11-17 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 04:03:14AM +, Daniel Rosenberg wrote:
> This shifts the responsibility of setting up dentry operations from
> fscrypt to the individual filesystems, allowing them to have their own
> operations while still setting fscrypt's d_revalidate as appropriate.
> 
> Most filesystems can just use generic_set_encrypted_ci_d_ops, unless
> they have their own specific dentry operations as well. That operation
> will set the minimal d_ops required under the circumstances.
> 
> Since the fscrypt d_ops are set later on, we must set all d_ops there,
> since we cannot adjust those later on. This should not result in any
> change in behavior.
> 
> Signed-off-by: Daniel Rosenberg 

Acked-by: Theodore Ts'o

Re: [PATCH v7 0/8] add support for direct I/O with fscrypt using blk-crypto

2020-11-17 Thread Theodore Y. Ts'o

What is the expected use case for Direct I/O using fscrypt?  This
isn't a problem which is unique to fscrypt, but one of the really
unfortunate aspects of the DIO interface is the silent fallback to
buffered I/O.  We've lived with this because DIO goes back decades,
and the original use case was to keep enterprise databases happy, and
the rules around what is necessary for DIO to work were relatively well
understood.

But with fscrypt, there are going to be some additional requirements
(e.g., using inline crypto), or else DIO will silently fall back to
buffered I/O for encrypted files.  Depending on the intended use case
of DIO with fscrypt, this caveat might or might not be unfortunately
surprising for applications.

I wonder if we should have some kind of interface so we can more
explicitly allow applications to query exactly what the requirements
might be for a particular file vis-a-vis Direct I/O.  What are the
memory alignment requirements, what are the file offset alignment
requirements, what are the write size requirements, for a particular
file.

- Ted

Re: [PATCH v2 1/3] libfs: Add generic function for setting dentry_ops

2020-11-17 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 04:03:13AM +, Daniel Rosenberg wrote:
> This adds a function to set dentry operations at lookup time that will
> work for both encrypted filenames and casefolded filenames.
> 
> A filesystem that supports both features simultaneously can use this
> function during lookup preparations to set up its dentry operations once
> fscrypt no longer does that itself.
> 
> Currently the casefolding dentry operation are always set if the
> filesystem defines an encoding because the features is toggleable on
> empty directories. Since we don't know what set of functions we'll
> eventually need, and cannot change them later, we add just add them.
> 
> Signed-off-by: Daniel Rosenberg 

Reviewed-by: Theodore Ts'o 

- Ted

Re: [PATCH v2 2/3] fscrypt: Have filesystems handle their d_ops

2020-11-17 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 09:04:11AM -0800, Jaegeuk Kim wrote:
> 
> I'd like to pick this patch series in f2fs/dev for -next, so please let me 
> know
> if you have any concern.

No concern for me as far as ext4 is concerned, thanks!

 - Ted

Re: [RESEND][PATCH] ima: Set and clear FMODE_CAN_READ in ima_calc_file_hash()

2020-11-17 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 10:23:58AM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2020 at 10:35 AM Mimi Zohar  wrote:
> >
> > We need to differentiate between signed files, which by definition are
> > immutable, and those that are mutable.  Appending to a mutable file,
> > for example, would result in the file hash not being updated.
> > Subsequent reads would fail.
> 
> Why would that require any reading of the file at all AT WRITE TIME?
> 
> Don't do it. Really.
> 
> When opening the file write-only, you just invalidate the hash. It
> doesn't matter anyway - you're only writing.
> 
> Later on, when reading, only at that point does the hash matter, and
> then you can do the verification.
> 
> Although honestly, I don't even see the point. You know the hash won't
> match, if you wrote to the file.

I think the use case the IMA folks might be thinking about is where
they want to validate the file at open time, *before* the userspace
application starts writing to the file, since there might be some
subtle attacks where Boris changes the first part of the file before
Alice appends "I agree" to said file.

Of course, Boris will be able to modify the file after Alice has
modified it, so it's a bit of a moot point, but one could imagine a
scenario where the file is modified while the system is not running
(via an evil hotel maid) and then after Alice modifies the file, of
*course* the hash will be invalid, so no one would notice.  A sane
application would have read the file to make sure it contained the
proper contents before appending "I agree" to said file, so it's a bit
of an esoteric point.

The other case I could imagine is if the file is marked execute-only,
without read access, and IMA wanted to be able to read the file to
check the hash.  But we already make an exception for allowing the
file to be read during page faults, so that's probably less
controversial.

- Ted

Re: [PATCH 0/2] Tristate moount option comatibility fixup

2020-11-10 Thread Theodore Y. Ts'o

On Mon, Nov 09, 2020 at 08:10:07PM +0100, Michal Suchanek wrote:
> Hello,
> 
> after the tristate dax option change some applications fail to detect
> pmem devices because the dax option no longer shows in mtab when device
> is mounted with -o dax.

Which applications?  Name them.

We *really* don't want to encourage applications to make decisions
only based on the mount options.  For example, it could be that the
application's files will have the S_DAX flag set.

It would be a real shame if we actively encouraged applications to
use a broken configuration mechanism which was only used as a hack
while DAX was in experimental status.

- Ted

Re: How to enable auto-suspend by default

2020-11-10 Thread Theodore Y. Ts'o

One note...  I'll double check, but on my XPS 13 9380, as I recall, I
have to manually disable autosuspend on all of the XHCI controllers
and internal hubs after running "powertop --auto-tune", or else any
external mouse attached to said USB device will be dead to the world
for 2-3 seconds if the autosuspend timeout has kicked in, which was
***super*** annoying.

- Ted

Re: drivers/char/random.c needs a (new) maintainer

2020-11-30 Thread Theodore Y. Ts'o

On Mon, Nov 30, 2020 at 04:15:23PM +0100, Jason A. Donenfeld wrote:
> I am willing to maintain random.c and have intentions to have a
> formally verified RNG. I've mentioned this to Ted before.
> 
> But I think Ted's reluctance to not accept the recent patches sent to
> this list is mostly justified, and I have no desire to see us rush
> into replacing random.c with something suboptimal or FIPSy.

Being a maintainer is not about *accepting* patches, it's about
*reviewing* them.  I do plan to make time to catch up on reviewing
patches this cycle.  One thing that would help me is if folks
(especially Jason, if you would) could start with a detailed review of
Nicolai's patches.  His incremental approach is I believe the best one
from a review perspective, and certainly his cleanup patches are ones
which I would expect are no-brainers.

- Ted

Re: [Ksummit-discuss] crediting bug reports and fixes folded into original patch

2020-12-03 Thread Theodore Y. Ts'o

On Thu, Dec 03, 2020 at 12:43:52AM +0100, Vlastimil Babka wrote:
> 
> there was a bit of debate on Twitter about this, so I thought I would bring it
> here. Imagine a scenario where patch sits as a commit in -next and there's a 
> bug
> report or fix, possibly by a bot or with some static analysis. The maintainer
> decides to fold it into the original patch, which makes sense for e.g.
> bisectability. But there seem to be no clear rules about attribution in this
> case, which looks like there should be, probably in
> Documentation/maintainer/modifying-patches.rst

I don't think there should be any kind of fixed, inflexible rules
about this.  

1) Sometimes there will be a *huge* number of comments and
suggestions.  Do we really want to require links to dozens of mail
message id's, and/or dozens or more e-mail addresses?

2) Sometimes a fixup is pretty trivial, even if it is expressed in the
form of a one-line patch.  Compare that with someone who does a
detailed review of a patch, but doesn't actually end up appending an
explicit Reviewed-by, perhaps because he or she didn't completely agree
with the final version of the patch.

3) I think this very much should be up to the maintainer's discretion,
as opposed to making rules that may result in some ridiculous amount
of bloat in the git log.

4) It's really unhealthy, in my opinion, for people to be fixated on
counting attributions.  If we create fixed rules, this can turn into
people trying to game the system.  It's the same reason why I'm not
terribly enthusiastic about people trying to game Signed-off-by counts
by sending gazillions of white space or spelling fixes.

If the fix is large enough that for copyright reasons we need to
acknowledge the work, then folding in the SoB for DCO reasons makes
perfect sense.  But if it's a trivial patch (the kind where projects
that require copyright assignment wouldn't require executed legal
agreements), then perhaps attribution is not always a requirement.
Again, there are times when people who spend a lot of work discussing
a patch may not get attribution, even if they didn't actually create
the one-line whitespace fix and sent it in as a patch with a
signed-off-by with a demand that the attribution be preserved.

Common sense really needs to prevail here, and I'm concerned that
people who like to create rules don't realize what a mess this can
create when contributors approach their participation with a sense of
entitlement.

Cheers,

- Ted

Re: [PATCH V2] uapi: fix statx attribute value overlap for DAX & MOUNT_ROOT

2020-12-04 Thread Theodore Y. Ts'o

On Thu, Dec 03, 2020 at 08:18:23AM +0200, Amir Goldstein wrote:
> Here is a recent example, where during patch review, I requested NOT to 
> include
> any stable backport triggers [1]:
> "...We should consider sending this to stable, but maybe let's merge
> first and let it
>  run in master for a while before because it is not a clear and
> immediate danger..."
>
> As a developer and as a reviewer, I wish (as Dave implied) that I had a way to
> communicate to AUTOSEL that auto backport of this patch has more risk than
> the risk of not backporting.

My suggestion is that we could put something in the MAINTAINERS file
which indicates what the preferred delay time should be for (a)
patches explicitly cc'ed to stable, and (b) patches which are
AUTOSEL'ed for stable for that subsystem.  That
time might be either in days/weeks, or "after N -rc releases", "after
the next full release", or, "never" (which would be a way for a
subsystem to opt out of the AUTOSEL process).

It should also be possible to specify the delay in the trailer, e.g.:

Stable-Defer: 
Auto-Stable-Defer: 

Specifying the delay relative to when the patches show up in
Linus's tree helps deal with the case where the submaintainer might
not be sure when their patches will get pushed to Linus by the
maintainer.

Cheers,

- Ted

Re: Why the auxiliary cipher in gss_krb5_crypto.c?

2020-12-04 Thread Theodore Y. Ts'o

On Fri, Dec 04, 2020 at 02:59:35PM +, David Howells wrote:
> Hi Chuck, Bruce,
> 
> Why is gss_krb5_crypto.c using an auxiliary cipher?  For reference, the
> gss_krb5_aes_encrypt() code looks like the attached.
> 
> From what I can tell, in AES mode, the difference between the main cipher and
> the auxiliary cipher is that the latter is "cbc(aes)" whereas the former is
> "cts(cbc(aes))" - but they have the same key.
> 
> Reading up on CTS, I'm guessing the reason it's like this is that CTS is the
> same as the non-CTS, except for the last two blocks, but the non-CTS one is
> more efficient.

The reason to use CTS is if you don't want to expand the size of the
cipher text to the cipher block size.  e.g., if you have a 53 byte
plaintext, and you can't afford to let the ciphertext be 56 bytes, the
cryptographic engineer will reach for CTS instead of CBC.

So that probably explains the decision to use CTS (and it's
required by the spec in any case).  As far as why CBC is being used
instead of CTS, the only reason I can think of is the one you posted.
Perhaps there was some hardware or software configuration where
cbc(aes) was hardware accelerated, and cts(cbc(aes)) would not be?

In any case, using cbc(aes) for all but the last two blocks, and using
cts(cbc(aes)) for the last two blocks, is identical to using
cts(cbc(aes)) for the whole encryption.  So the only reason to do this
in the more complex way would be for performance reasons.

 - Ted

Re: [PATCH v9 1/2] kunit: Support for Parameterized Testing

2020-12-02 Thread Theodore Y. Ts'o

On Mon, Nov 30, 2020 at 02:22:22PM -0800, 'Brendan Higgins' via KUnit Development wrote:
> 
> Looks good to me. I would definitely like to pick this up. But yeah,
> in order to pick up 2/2 we will need an ack from either Ted or Iurii.
> 
> Ted seems to be busy right now, so I think I will just ask Shuah to go
> ahead and pick this patch up by itself and we or Ted can pick up patch
> 2/2 later.

I have been paying attention to this patch series, but I had presumed
that this was much more of a kunit change than an ext4 change, and the
critical bit was a review of the kunit infrastructure.  I certainly
have no objection to changing the ext4 test to use the new
parameterized testing, and if you'd like me to give a quick review,
I'll take a quick look.  I assume, Brendan, that you've already tried
doing a compile and run test of the patch series, so I'm not going to
do that?

- Ted

Re: [PATCH v9 2/2] fs: ext4: Modify inode-test.c to use KUnit parameterized testing feature

2020-12-02 Thread Theodore Y. Ts'o

On Mon, Nov 16, 2020 at 11:11:50AM +0530, Arpitha Raghunandan wrote:
> Modify fs/ext4/inode-test.c to use the parameterized testing
> feature of KUnit.
> 
> Signed-off-by: Arpitha Raghunandan <98.a...@gmail.com>
> Signed-off-by: Marco Elver 

Acked-by: Theodore Ts'o

Re: [PATCH v7 6/8] ext4: support direct I/O with fscrypt using blk-crypto

2020-12-03 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 02:07:06PM +, Satya Tangirala wrote:
> From: Eric Biggers 
> 
> Wire up ext4 with fscrypt direct I/O support. Direct I/O with fscrypt is
> only supported through blk-crypto (i.e. CONFIG_BLK_INLINE_ENCRYPTION must
> have been enabled, the 'inlinecrypt' mount option must have been specified,
> and either hardware inline encryption support must be present or
> CONFIG_BLK_INLINE_ENCYRPTION_FALLBACK must have been enabled). Further,
> direct I/O on encrypted files is only supported when the *length* of the
> I/O is aligned to the filesystem block size (which is *not* necessarily the
> same as the block device's block size).
> 
> fscrypt_limit_io_blocks() is called before setting up the iomap to ensure
> that the blocks of each bio that iomap will submit will have contiguous
> DUNs. Note that fscrypt_limit_io_blocks() is normally a no-op, as normally
> the DUNs simply increment along with the logical blocks. But it's needed
> to handle an edge case in one of the fscrypt IV generation methods.
> 
> Signed-off-by: Eric Biggers 
> Co-developed-by: Satya Tangirala 
> Signed-off-by: Satya Tangirala 
> Reviewed-by: Jaegeuk Kim 

Acked-by: Theodore Ts'o

Re: [PATCH v3] Updated locking documentation for transaction_t

2020-12-03 Thread Theodore Y. Ts'o

On Thu, Oct 15, 2020 at 03:26:28PM +0200, Alexander Lochmann wrote:
> Hi folks,
> 
> I've updated the lock documentation according to our finding for
> transaction_t.
> Does this patch look good to you?

I updated the annotations to match with the local usage, e.g:

 * When commit was requested [journal_t.j_state_lock]

became:

 * When commit was requested [j_state_lock]

Otherwise, looks good.  Thanks for the patch!

   - Ted

Re: [PATCH] ext4: remove the null check of bio_vec page

2020-12-03 Thread Theodore Y. Ts'o

On Wed, Oct 21, 2020 at 12:25:03PM +0200, Jan Kara wrote:
> On Tue 20-10-20 16:22:01, Xianting Tian wrote:
> > bv_page can't be NULL in a valid bio_vec, so we can remove the NULL check,
> > as we did in other places when calling bio_for_each_segment_all() to go
> > through all bio_vec of a bio.
> > 
> > Signed-off-by: Xianting Tian 
> 
> Thanks for the patch. It looks good to me. You can add:
> 
> Reviewed-by: Jan Kara 

Applied, thanks.

- Ted

Re: [PATCH][next] ext4: remove redundant assignment of variable ex

2020-12-03 Thread Theodore Y. Ts'o

On Wed, Oct 21, 2020 at 02:23:26PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Variable ex is assigned a variable that is not being read, the assignment
> is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Thanks, applied.

- Ted

Re: [PATCH v3] Updated locking documentation for transaction_t

2020-12-03 Thread Theodore Y. Ts'o

On Thu, Dec 03, 2020 at 03:38:40PM +0100, Alexander Lochmann wrote:
> 
> 
> On 03.12.20 15:04, Theodore Y. Ts'o wrote:
> > On Thu, Oct 15, 2020 at 03:26:28PM +0200, Alexander Lochmann wrote:
> > > Hi folks,
> > > 
> > > I've updated the lock documentation according to our finding for
> > > transaction_t.
> > > Does this patch look good to you?
> > 
> > I updated the annotations to match with the local usage, e.g:
> > 
> >  * When commit was requested [journal_t.j_state_lock]
> > 
> > became:
> > 
> >  * When commit was requested [j_state_lock]
> What do you mean by local usage?
> The annotations of other members of transaction_t?

Yes, I'd like the annotations of the other objects to be consistent,
and just use j_state_lock, j_list_lock, etc., for the other annotations.

> Shouldn't the annotation look like this?
> [t_journal->j_state_lock]
> It would be more precise.

It's more precise, but it's also unnecessary in this case, since all
of the elements of the journal have a j_ prefix, elements of a
transaction_t have a t_ prefix, etc.  There is also no other structure
element which has a j_state_lock name *other* than in journal_t.

Cheers,

- Ted

Re: [PATCH] MAINTAINERS: add missing file in ext4 entry

2020-11-06 Thread Theodore Y. Ts'o

On Fri, Oct 30, 2020 at 10:24:35AM +0800, Chao Yu wrote:
> include/trace/events/ext4.h belongs to ext4 module, add the file path into
> ext4 entry in MAINTAINERS.
> 
> Signed-off-by: Chao Yu 

Thanks, applied.

- Ted

[GIT PULL] ext4 cleanups for 5.10-rc4

2020-11-09 Thread Theodore Y. Ts'o

(Resent with missing cc's, sorry.)

The following changes since commit 3cea11cd5e3b00d91caf0b4730194039b45c5891:

  Linux 5.10-rc2 (2020-11-01 14:43:51 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus_cleanups

for you to fetch changes up to 05d5233df85e9621597c5838e95235107eb624a2:

  jbd2: fix up sparse warnings in checkpoint code (2020-11-07 00:09:08 -0500)


More fixes and cleanups for the new fast_commit features, but also a
few other miscellaneous bug fixes and a cleanup for the MAINTAINERS
file.


Chao Yu (1):
  MAINTAINERS: add missing file in ext4 entry

Dan Carpenter (1):
  ext4: silence an uninitialized variable warning

Harshad Shirwadkar (22):
  ext4: describe fast_commit feature flags
  ext4: mark fc ineligible if inode gets evictied due to mem pressure
  ext4: drop redundant calls ext4_fc_track_range
  ext4: fixup ext4_fc_track_* functions' signature
  jbd2: rename j_maxlen to j_total_len and add jbd2_journal_max_txn_bufs
  ext4: clean up the JBD2 API that initializes fast commits
  jbd2: drop jbd2_fc_init documentation
  jbd2: don't use state lock during commit path
  jbd2: don't pass tid to jbd2_fc_end_commit_fallback()
  jbd2: add todo for a fast commit performance optimization
  jbd2: don't touch buffer state until it is filled
  jbd2: don't read journal->j_commit_sequence without taking a lock
  ext4: dedpulicate the code to wait on inode that's being committed
  ext4: fix code documentatioon
  ext4: mark buf dirty before submitting fast commit buffer
  ext4: remove unnecessary fast commit calls from ext4_file_mmap
  ext4: fix inode dirty check in case of fast commits
  ext4: disable fast commit with data journalling
  ext4: issue fsdev cache flush before starting fast commit
  ext4: make s_mount_flags modifications atomic
  jbd2: don't start fast commit on aborted journal
  ext4: cleanup fast commit mount options

Joseph Qi (1):
  ext4: unlock xattr_sem properly in ext4_inline_data_truncate()

Kaixu Xia (1):
  ext4: correctly report "not supported" for {usr,grp}jquota when !CONFIG_QUOTA

Theodore Ts'o (2):
  ext4: fix sparse warnings in fast_commit code
  jbd2: fix up sparse warnings in checkpoint code

 Documentation/filesystems/ext4/journal.rst |   6 ++
 Documentation/filesystems/ext4/super.rst   |   7 +++
 Documentation/filesystems/journalling.rst  |   6 +-
 MAINTAINERS|   1 +
 fs/ext4/ext4.h |  66 ++--
 fs/ext4/extents.c  |   7 +--
 fs/ext4/fast_commit.c  | 174 
+++--
 fs/ext4/fast_commit.h  |   6 +-
 fs/ext4/file.c |   6 +-
 fs/ext4/fsmap.c|   2 +-
 fs/ext4/fsync.c|   2 +-
 fs/ext4/inline.c   |   1 +
 fs/ext4/inode.c|  19 +++---
 fs/ext4/mballoc.c  |   6 +-
 fs/ext4/namei.c|  61 +--
 fs/ext4/super.c|  47 ---
 fs/jbd2/checkpoint.c   |   2 +
 fs/jbd2/commit.c   |  11 +++-
 fs/jbd2/journal.c  | 138 
+++---
 fs/jbd2/recovery.c |   6 +-
 fs/jbd2/transaction.c  |   4 +-
 fs/ocfs2/journal.c |   2 +-
 include/linux/jbd2.h   |  23 ---
 include/trace/events/ext4.h|  10 +--
 24 files changed, 342 insertions(+), 271 deletions(-)

Re: "beyond 2038" warnings from loopback mount is noisy

2019-09-03 Thread Theodore Y. Ts'o

On Tue, Sep 03, 2019 at 09:18:44AM -0700, Deepa Dinamani wrote:
> 
> This prints a warning for each inode that doesn't extend limits beyond
> 2038. It is rate limited by the ext4_warning_inode().
> Looks like your filesystem has inodes that cannot be extended.
> We could use a different rate limit or ignore this corner case. Do the
> maintainers have a preference?

We need to drop this commit (ext4: Initialize timestamps limits), or
at least the portion which adds the call to the EXT4_INODE_SET_XTIME
macro in ext4.h.  

I know of a truly vast number of servers in production all over the
world which are using 128 byte inodes, and spamming warnings at the
maximum rate limit is a really bad idea.  This includes some major
cloud data centers where the life of individual servers is well
understood (they're not going to last until 2038) and nothing stored
on the local Linux file systems is long-lived --- that's all stored
in the cluster file systems.  The 128 byte inode size was deliberately
chosen to maximize storage TCO, and so spamming a warning at high
rates is going to be extremely unfriendly.

In cases where the inode size is such that there is no chance at all
to support timestamps beyond 2038, a single warning at mount time, or
maybe a warning at mkfs time might be acceptable.  But there's no
point printing a warning each time we set a timestamp on such a
file system.  It's not going to change, and past a certain point, we
need to trust that people who are using 128 byte inodes did so knowing
what the tradeoffs might be.  After all, it is *not* the default.

 - Ted

Re: "fs/namei.c: keep track of nd->root refcount status" causes boot panic

2019-09-03 Thread Theodore Y. Ts'o

On Tue, Sep 03, 2019 at 06:50:24AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 03, 2019 at 02:48:32PM +0100, Al Viro wrote:
> > Not sure what would be the best way to do it...  I don't mind breaking
> > the out-of-tree modules, whatever their license is; what I would rather
> > avoid is _quiet_ breaking of such.
> 
> Any out of tree module running against an upstream kernel will need
> a recompile for a new version anyway.  So I would not worry about it
> at all.

I'm really confused.  What out-of-tree module are people needing to
use when doing linux-next testing?   That seems like a recipe for disaster...

- Ted

Re: "beyond 2038" warnings from loopback mount is noisy

2019-09-03 Thread Theodore Y. Ts'o

On Tue, Sep 03, 2019 at 02:31:06PM -0700, Deepa Dinamani wrote:
> > We need to drop this commit (ext4: Initialize timestamps limits), or
> > at least the portion which adds the call to the EXT4_INODE_SET_XTIME
> > macro in ext4.h.
> 
> As Arnd said, I think this can be fixed by warning only when the inode
> size is not uniformly 128 bytes in ext4.h. Is this an acceptable
> solution or we want to drop this warning altogether?

If we have a mount-time warning, I really don't think a per-inode
warning in the kernel is going to be helpful.  It's only going to
catch the most extreme cases --- specifically, a file system
originally created and written using ext3 (real ext3; even before we
dropped ext3 from the upstream kernel, most distributions were using
ext4 to provide ext3 support) and which included enough extended
attributes that there is no way to make space, in either the inode or
the external xattr block, for the extra timestamp.  Those are
extremely rare edge cases, and I don't think it's worth trying to
catch them in the kernel.

The right place to catch this is rather in e2fsck, I think.

> We have a single mount time warning already in place here. I did not
> realize some people actually chose to use 128 byte inodes on purpose.

Yes, there are definitely some people who are still doing this.  The
other case, as noted on this thread, is that file systems smaller than
512 MiB are treated as type "small" (and file systems smaller than
4MiB are treated as type "floppy"), and today, we are still using 128
byte inodes to minimize the overhead of the inode table.  It's
probably time to reconsider these defaults, but that's an e2fsprogs
level change.  And that's not going to change the fact that there are
people who are deliberately choosing to use 128 byte inodes.

Changes that we could consider:

1)  Change the default for types "small" and "floppy" to be 256 byte inodes.

2)  Have mke2fs print a warning when creating a file system with 128
byte inodes.

3)  Add code to e2fsck to automatically make room for the timestamp if
possible.

4)  Add code to e2fsck so that at some pre-determined point in the
future (maybe 5 years before 2038?) have it print warnings for file
systems using 128 byte inodes, and for file systems with 256+ byte
inodes and where there isn't enough space in the inode for expanded
timestamps.

Cheers,

- Ted

Re: "beyond 2038" warnings from loopback mount is noisy

2019-09-03 Thread Theodore Y. Ts'o

On Tue, Sep 03, 2019 at 11:48:14PM +0200, Arnd Bergmann wrote:
> I think the warning as it was intended makes sense, the idea
> was never to warn on every inode update for file systems that
> cannot handle future dates, only to warn when we
> 
> a) try to set a future date
> b) fail to do that because the space cannot be made available.

What do you mean by "try to set a future date"?  Do you mean trying
to set a date after 2038 (when it can no longer fit in a signed 32-bit
value)?  Because that's not what the commit is currently doing.

> I would prefer to fix it on top of the patches I already merged.
> 
> Maybe something like:
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9e3ae3be3de9..5a971d1b6d5e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -835,7 +835,9 @@ do { \
>         } \
>         else { \
>                 (raw_inode)->xtime = cpu_to_le32(clamp_t(int32_t, (inode)->xtime.tv_sec, S32_MIN, S32_MAX)); \
> -               ext4_warning_inode(inode, "inode does not support timestamps beyond 2038"); \
> +               if (((inode)->xtime.tv_sec != (raw_inode)->xtime) && \
> +                   ((inode)->i_sb->s_time_max > S32_MAX)) \
> +                       ext4_warning_inode(inode, "inode does not support timestamps beyond 2038"); \
>         } \
>  } while (0)

Sure, that's much less objectionable.
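
For readers following the thread, the idea in the quoted hunk can be sketched outside the kernel: clamp the seconds value into the signed 32-bit range and only flag a warning when clamping actually changed the value.  This is a minimal userspace sketch; store_time32() is a hypothetical helper, not an ext4 function, and it ignores the on-disk endianness conversion.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Clamp a 64-bit seconds value into the signed 32-bit range that a
 * 128 byte inode can store.  Returns true only when the value had to
 * be clamped, i.e. when the date really is outside the 1901..2038 range.
 * store_time32() is a hypothetical name used for illustration.
 */
static bool store_time32(int64_t tv_sec, int32_t *on_disk)
{
	int64_t clamped = tv_sec;

	if (clamped > INT32_MAX)
		clamped = INT32_MAX;
	else if (clamped < INT32_MIN)
		clamped = INT32_MIN;

	*on_disk = (int32_t)clamped;
	return clamped != tv_sec;
}
```

With this shape, an ordinary timestamp update on a 128 byte inode never reaches the warning path; only a date past January 2038 (or before December 1901) does.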

> However, I did expect that people might have legacy ext3 file system
> images that they mount, and printing a warning for each write would
> also be wrong for those.

I guess I'm much less convinced that 10-15 years from now, there will
be many legacy ext3 file systems left.  Storage media doesn't last
that long, and if file systems get moved around, e2fsck will be run at
least once, and so adding some e2fsck-time warnings seems to be a
better approach IMHO.

- Ted

Re: "beyond 2038" warnings from loopback mount is noisy

2019-09-03 Thread Theodore Y. Ts'o

On Tue, Sep 03, 2019 at 03:47:54PM -0700, Deepa Dinamani wrote:
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 9e3ae3be3de9..5a971d1b6d5e 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -835,7 +835,9 @@ do { \
> > >         } \
> > >         else { \
> > >                 (raw_inode)->xtime = cpu_to_le32(clamp_t(int32_t, (inode)->xtime.tv_sec, S32_MIN, S32_MAX)); \
> > > -               ext4_warning_inode(inode, "inode does not support timestamps beyond 2038"); \
> > > +               if (((inode)->xtime.tv_sec != (raw_inode)->xtime) && \
> > > +                   ((inode)->i_sb->s_time_max > S32_MAX)) \
> > > +                       ext4_warning_inode(inode, "inode does not support timestamps beyond 2038"); \
> > >         } \
> > >  } while (0)
> >
> > Sure, that's much less objectionable.
> 
> The reason it was warning for every update was because of the
> ratelimiting. I think ratelimiting is not working well here. I will
> check that part.

If you are calling ext4_warning_inode() on every single update, you
really can't depend on rate limiting to prevent log spam.  The problem
is sometimes we *do* need more than say, one ext4 warning every hour.
Rate limiting is a last-ditch prevention against an unintentional
denial of service attack against the system, but we can't depend on it
as license to call ext4_warning() every time we set a timestamp.  That
happens essentially constantly on a running system.  So if you set the
limits aggressively enough that it's not seriously annoying, it will
suppress all other potential uses of ext4_warning() --- essentially,
it will make ext4_warning useless.

The other concern I would have if that warning message is being
constantly called, post 2038, is that even *with* rate limiting, it
will turn into a massive scalability bottleneck --- remember, the
ratelimit structure has a spinlock, so even if you are suppressing
things so that we're only logging one message an hour, if it's being
called hundreds of times a second from multiple CPUs, the cache line
thrashing will make this a performance *nightmare*.
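
One way to picture the alternative to a rate-limited warning on a hot path is a warn-once flag that is only read, never written, after the first hit.  The following is a userspace sketch using C11 atomics, not kernel code; struct fs_state and warn_2038_once() are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-filesystem state; one flag per mounted file system. */
struct fs_state {
	atomic_bool warned_2038;
};

/*
 * Returns true exactly once per file system: the caller that gets
 * "true" is the one that should log the warning.
 */
static bool warn_2038_once(struct fs_state *fs)
{
	/* Fast path: a plain read, so the cache line can stay shared. */
	if (atomic_load_explicit(&fs->warned_2038, memory_order_relaxed))
		return false;

	/* Slow path: only the first caller to flip the flag wins. */
	return !atomic_exchange(&fs->warned_2038, true);
}
```

After the first warning, every later call is a read of an unchanging flag, so there is no lock and no cache line bouncing between CPUs.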

   - Ted

Re: "beyond 2038" warnings from loopback mount is noisy

2019-09-04 Thread Theodore Y. Ts'o

On Tue, Sep 03, 2019 at 09:50:09PM -0700, Deepa Dinamani wrote:
> If we don't care to warn about the timestamps that are clamped in
> memory, maybe we could just warn when they are being written out.
> Would something like this be more acceptable? I would also remove the
> warning in ext4.h. I think we don't have to check if the inode is 128
> bytes here (Please correct me if I am wrong). If this looks ok, I can
> post this.

That's better, but it's going to be misleading in many cases.  If the
inode's extra size field is 16 or larger, there will be enough space
for the timestamps, so talking about "timestamps on this inode beyond
2038" when ext4 is unable to expand it from say, 24 to 32, won't be
true.  Certain features won't be available, yes --- such as
project-id-based quotas, since there won't be room to store the
project ID.  However, it's not going to impact the ability to store
timestamps beyond 2038.  The i_extra_isize field is not just about
timestamps!
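
To make the 16 / 24 / 32 numbers concrete, here is a small sketch of the fields that follow the first 128 bytes of an ext4 inode.  The struct is a simplified stand-in for the tail of struct ext4_inode (host-endian types, hypothetical helper names), assuming the field order used by the on-disk format.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GOOD_OLD_INODE_SIZE 128

/* Simplified view of what comes after the first 128 bytes. */
struct inode_extra {
	uint16_t i_extra_isize;
	uint16_t i_checksum_hi;
	uint32_t i_ctime_extra;
	uint32_t i_mtime_extra;
	uint32_t i_atime_extra;
	uint32_t i_crtime;
	uint32_t i_crtime_extra;
	uint32_t i_version_hi;
	uint32_t i_projid;
};

/* Offset of the first byte past field f, relative to byte 128. */
#define FIELD_END(f) \
	(offsetof(struct inode_extra, f) + sizeof(((struct inode_extra *)0)->f))

/* Room for the ctime/mtime/atime epoch bits? */
static bool has_extended_timestamps(unsigned inode_size, unsigned extra_isize)
{
	return inode_size > GOOD_OLD_INODE_SIZE &&
	       extra_isize >= FIELD_END(i_atime_extra);
}

/* Room for the project ID used by project quotas? */
static bool has_project_id(unsigned inode_size, unsigned extra_isize)
{
	return inode_size > GOOD_OLD_INODE_SIZE &&
	       extra_isize >= FIELD_END(i_projid);
}
```

Under these assumptions an extra size of 16 already covers the three extended timestamps, while the project ID only fits once the extra size reaches 32.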

Again, the likelihood that there will be file systems that have this
problem in 2038 is... extremely low in my judgement.  Storage media
just doesn't last that long; and distributions such as Red Hat and
SuSE very strongly encourage people to reformat file systems and do
*not* support upgrades from ext3 to ext4 by using tune2fs.  If you do
this, their help desk will laugh at you and refuse to help you.

Companies like Google will do this kind of upgrade[1], sure.  But
that's because backing up and reformatting vast numbers of file
systems is not practical at scale.  (And even Google doesn't maintain
the file system image when the servers are old enough to be TCO
negative and it's time to replace them.)

In contrast, most companies / users don't do this sort of thing at
all.  It's not an issue for cell phones, for example, or most consumer
devices, which are lucky if they last more than 3 years before they get
desupported and stop getting security updates, and then the lithium
ion battery dies and the device ends up in a landfill.  Those that
might live 20 years (although good luck with that for something like,
say, a smart thermostat) aren't going to have a console and no one
will be paying attention to the kernel messages anyway.  So is it
really worth it?  For whom are these messages meant?

[1] https://www.youtube.com/watch?v=Wp5Ehw7ByuU

Cheers,

- Ted

Re: [PATCH] fs:ext4:remove unused including

2019-09-04 Thread Theodore Y. Ts'o

On Wed, Sep 04, 2019 at 03:36:28PM +0800, zhao.ha...@zte.com.cn wrote:
> fix compiler error in ext4.hfs/ext4/ext4.h:30:27: fatal error: 
> linux/version.h: No such file or directorySigned-off-by: Zhao Hang 
>  --- fs/ext4/ext4.h | 1 - 1 file changed, 1 
> deletion(-)diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.hindex 1cb6785..9baa4cf 
> 100644--- a/fs/ext4/ext4.h+++ b/fs/ext4/ext4.h@@ -27,7 +27,6 @@ #include 
>   #include   #include  
> -#include   #include   #include 
>   #include  --  2.15.2

First of all, this patch is completely white space damaged.

Secondly, this is a problem in how you are building your kernel (or
ext4 as a module, if you are trying to build ext4 as some kind of out
of tree module or some such).  When you do a kernel build, the file
include/linux/version.h is automatically generated.  It will look
something like this.

% cat /build/ext4-64/usr/include/linux/version.h
#define LINUX_VERSION_CODE 327936
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))
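
As an aside for readers, the version code shown above can be decoded by reversing the macro.  This is a small sketch, assuming the minor and patch numbers each fit in 8 bits; struct kver and decode_version_code() are hypothetical names.

```c
#include <stdint.h>

#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

struct kver {
	unsigned major, minor, patch;
};

/* Split a LINUX_VERSION_CODE-style value back into its three parts. */
static struct kver decode_version_code(uint32_t code)
{
	struct kver v = {
		.major = code >> 16,
		.minor = (code >> 8) & 0xff,
		.patch = code & 0xff,
	};
	return v;
}
```

Applied to 327936, this gives major 5, minor 1, patch 0, which matches KERNEL_VERSION(5,1,0).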

 - Ted

Re: [PATCH v2] Ext4 documentation fixes.

2019-08-23 Thread Theodore Y. Ts'o

On Fri, Aug 23, 2019 at 04:56:42AM +, Ayush Ranjan wrote:
> Hey Ted!
> Thanks for reviewing! The comment in 
> fs/ext4/ext4.h:ext4_group_desc:bg_checksum
> says that the crc16 checksum formula should be crc16(sb_uuid+group+desc). I
> think group over here denotes group number.
> 
> Briefly looking through fs/ext4/super.c:ext4_group_desc_csum() suggests that:
> - For the new metadata_csum algorithm, only the group number and the block
> descriptor are included in the checksum. So the formula should be
> crc32c(group+desc) & 0xFFFF (this looks like a bug as this should also
> include sb UUID?)
> - For the old crc16 algorithm, the sb UUID, group number and the block
> descriptor are included in the checksum. So the formula should be
> crc16(sb\_uuid+group+desc). (should remain unchanged)

Thanks for the research and explanation.  I think I'm going to change
that to be:

crc{16,32c}(sb_uuid + group_num + bg_desc)

That should make it clearer what is meant.
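
For illustration, the crc16 form of that formula can be sketched in userspace as a chained CRC over the three inputs.  This assumes the CRC-16 variant with the reflected polynomial 0xA001 and an all-ones seed, feeds the group number in little-endian byte order, and leaves it to the caller to exclude the checksum field itself from bg_desc; group_desc_csum16() is a hypothetical name, not the kernel function.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16, reflected polynomial 0xA001, continuing from crc. */
static uint16_t crc16_update(uint16_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1;
	}
	return crc;
}

/* crc16(sb_uuid + group_num + bg_desc), computed piece by piece. */
static uint16_t group_desc_csum16(const uint8_t uuid[16], uint32_t group_num,
				  const void *bg_desc, size_t desc_len)
{
	uint8_t le_group[4] = {
		group_num & 0xff, (group_num >> 8) & 0xff,
		(group_num >> 16) & 0xff, (group_num >> 24) & 0xff,
	};
	uint16_t crc = crc16_update(0xFFFF, uuid, 16);

	crc = crc16_update(crc, le_group, sizeof(le_group));
	return crc16_update(crc, bg_desc, desc_len);
}
```

Chaining the three calls gives the same result as one CRC over the concatenated bytes, which is what the "+" in the formula is meant to convey.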

 - Ted

> 
> Ayush Ranjan
> University of Illinois - Urbana Champaign | May 2020
> Bachelors of Science in Computer Science and Mathematics
> Business Minor | Gies College of Business
> 
> 
> On Fri, Aug 23, 2019 at 8:48 AM Theodore Y. Ts'o  wrote:
> >
> > On Thu, Aug 15, 2019 at 09:11:51AM -0700, Ayush Ranjan wrote:
> > > This commit aims to fix the following issues in ext4 documentation:
> > > - Flexible block group docs said that the aim was to group block
> > >   metadata together instead of block group metadata.
> > > - The documentation consistently uses "location" instead of "block number".
> > >   It is easy to confuse location to be an absolute offset on disk. Added
> > >   a line to clarify all location values are in terms of block numbers.
> > > - Dirent2 docs said that the rec_len field is shortened instead of the
> > >   name_len field.
> > > - Typo in bg_checksum description.
> > > - Inode size is 160 bytes now, and hence i_extra_isize is now 32.
> > > - Cluster size formula was incorrect, it did not include the +10 to
> > >   s_log_cluster_size value.
> > > - Typo: there were two s_wtime_hi in the superblock struct.
> > > - Superblock struct was outdated, added the new fields which were part
> > >   of s_reserved earlier.
> > > - Multiple mount protection seems to be implemented in fs/ext4/mmp.c.
> > >
> > > Signed-off-by: Ayush Ranjan 
> >
> > Fixed with one minor typo fix:
> >
> > > diff --git a/Documentation/filesystems/ext4/group_descr.rst b/Documentation/filesystems/ext4/group_descr.rst
> > > index 0f783ed88..feb5c613d 100644
> > > --- a/Documentation/filesystems/ext4/group_descr.rst
> > > +++ b/Documentation/filesystems/ext4/group_descr.rst
> > > @@ -100,7 +100,7 @@ The block group descriptor is laid out in ``struct ext4_group_desc``.
> > >       - \_\_le16
> > >       - bg\_checksum
> > >       - Group descriptor checksum; crc16(sb\_uuid+group+desc) if the
> > > -       RO\_COMPAT\_GDT\_CSUM feature is set, or crc32c(sb\_uuid+group\_desc) &
> > > +       RO\_COMPAT\_GDT\_CSUM feature is set, or crc32c(sb\_uuid+group+desc) &
> > >         0xFFFF if the RO\_COMPAT\_METADATA\_CSUM feature is set.
> >
> > The correct checksum should be "crc16(sb\_uuid+group\_desc)" or
> > "crc32c(sb\_uuid+group\_desc)".  That is, it's the previous line which
> > needed modification.
> >
> >                                         - Ted

Re: [PATCH v10 2/3] fdt: add support for rng-seed

2019-08-23 Thread Theodore Y. Ts'o

On Fri, Aug 23, 2019 at 04:41:59PM +0100, Will Deacon wrote:
> 
> Given that these aren't functional changes, I've kept Ted's ack from v9
> and I'll queue these via arm64 assuming they pass testing.
> 
> Ted -- please shout if you're not happy about that, and I'll drop the
> series.

That's fine, thanks.  I'm thinking about making some changes to
add_hwgenerator_randomness(), but it's not going to be in the next
merge window, and it's more important that we get the interfaces (the
Kconfig options and add_bootloader_randomness() function prototype)
right for ARM.

Now to shanghai some volunteers to get this functionality working for
x86 (at least for the UEFI and NERF bootloaders).  :-)

Thanks!!

- Ted

Re: [PATCH] ext4: change the type of ext4 cache stats to percpu_counter to improve performance

2019-08-24 Thread Theodore Y. Ts'o

On Fri, Aug 23, 2019 at 10:47:34AM +0800, Shaokun Zhang wrote:
> From: Yang Guo 
> 
> @es_stats_cache_hits and @es_stats_cache_misses are accessed frequently in
> ext4_es_lookup_extent function, it would influence the ext4 read/write
> performance in NUMA system.
> Let's optimize it using percpu_counter, it is profitable for the
> performance.
> 
> The test command is as below:
> fio -name=randwrite -numjobs=8 -filename=/mnt/test1 -rw=randwrite
> -ioengine=libaio -direct=1 -iodepth=64 -sync=0 -norandommap -group_reporting
> -runtime=120 -time_based -bs=4k -size=5G
> 
> And the result is 10% better than the initial implementation:
> without the patch, IOPS=197k, BW=770MiB/s (808MB/s)(90.3GiB/120002msec)
> with the patch,    IOPS=218k, BW=852MiB/s (894MB/s)(99.9GiB/120002msec)
> 
> Cc: "Theodore Ts'o" 
> Cc: Andreas Dilger 
> Signed-off-by: Yang Guo 
> Signed-off-by: Shaokun Zhang 

Applied with some adjustments so it would apply.  I also changed the
patch summary to:

ext4: use percpu_counters for extent_status cache hits/misses

- Ted

Re: [PATCH] ext4: change the type of ext4 cache stats to percpu_counter to improve performance

2019-08-25 Thread Theodore Y. Ts'o

On Sun, Aug 25, 2019 at 10:28:03AM -0700, Eric Biggers wrote:
> This patch is causing the following.  Probably because there's no calls to
> percpu_counter_destroy() for the new counters?

Yeah, I noticed this from my test runs last night as well.  It looks
like the original patch was never tested with CONFIG_HOTPLUG_CPU.

The other problem with this patch is that it initializes
es_stats_cache_hits and es_stats_cache_misses too late.  They will
get used when the journal inode is loaded.  This is mostly harmless,
but it's also wrong.
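
The two problems above (no teardown, and initialization ordering) are the usual hazards of any sharded counter.  Here is a userspace analogue, not the kernel's percpu_counter API: one padded slot per CPU, with an explicit init/destroy pair.  All names are hypothetical, and the sketch assumes each thread only ever touches its own slot.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64

/* One counter per CPU, padded so slots never share a cache line. */
struct slot {
	uint64_t count;
	char pad[CACHE_LINE - sizeof(uint64_t)];
};

struct sharded_counter {
	struct slot *slots;
	unsigned nslots;
};

/* Must run before the first increment; returns 0 on success. */
static int sharded_counter_init(struct sharded_counter *c, unsigned nslots)
{
	c->slots = calloc(nslots, sizeof(*c->slots));
	if (!c->slots)
		return -1;
	c->nslots = nslots;
	return 0;
}

/* Every successful init needs a matching destroy on teardown. */
static void sharded_counter_destroy(struct sharded_counter *c)
{
	free(c->slots);
	c->slots = NULL;
	c->nslots = 0;
}

static void sharded_counter_inc(struct sharded_counter *c, unsigned cpu)
{
	c->slots[cpu % c->nslots].count++;
}

/* Reads are the slow path: walk every slot and add them up. */
static uint64_t sharded_counter_sum(const struct sharded_counter *c)
{
	uint64_t sum = 0;

	for (unsigned i = 0; i < c->nslots; i++)
		sum += c->slots[i].count;
	return sum;
}
```

The trade-off is the same as in the patch under discussion: increments become cheap and local, while reading the total and managing the counter's lifetime become the caller's responsibility.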

I've dropped this patch from the ext4 git tree.

  - Ted

Re: [PATCH] ext4: change the type of ext4 cache stats to percpu_counter to improve performance

2019-08-26 Thread Theodore Y. Ts'o

On Mon, Aug 26, 2019 at 04:24:20PM +0800, Shaokun Zhang wrote:
> > The other problem with this patch is that it initializes
> > es_stats_cache_hits and es_stats_cache_misses too late.  They will
> > get used when the journal inode is loaded.  This is mostly harmless,
> 
> I have checked it again, @es_stats_cache_hits and @es_stats_cache_misses
> have been initialized before the journal inode is loaded. Maybe I missed
> something else?

No, sorry, that was my mistake.  I misread things when I was looking
over your patch last night.

Please resubmit your patch once you've fixed things up and tested it.

I would recommend that you at least try running your patch using the
kvm-xfstests smoke test[1] before submitting it.  It will save you
and me time.

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

Thanks,

- Ted

Re: [PATCH v8 2/3] fdt: add support for rng-seed

2019-08-19 Thread Theodore Y. Ts'o

On Mon, Aug 19, 2019 at 03:16:04PM +0800, Hsin-Yi Wang wrote:
> Introducing a chosen node, rng-seed, which is an entropy that can be
> passed to kernel called very early to increase initial device
> randomness. Bootloader should provide this entropy and the value is
> read from /chosen/rng-seed in DT.

So it's really cool that you've sent out this patch set.  I've been
wanting this for all platforms / architectures for quite a while.
Question --- are you willing to guarantee that the bootloader can be
trusted enough that you *know* the entropy being provided by the
bootloader to be secure?

If so, we could let fdt.c use a different interface, perhaps
add_hwgenerator_randomness(), which allows the bootloader to transfer
trusted entropy for the purposes of initializing the crng and entropy
accounting for /dev/random.

One of the questions is how do we make sure the boot loader is
actually secure, but given that we have to trust the boot loader for
various trusted boot use cases, it seems reasonable to do that.
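
The distinction being drawn here, mixing a seed into the pool versus also crediting it as entropy, can be shown with a toy model.  This is not the kernel's random.c: the mixing function is a non-cryptographic FNV-style hash used purely for illustration, and struct toy_pool, toy_add_seed() and the 256-bit cap are all assumptions of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_POOL_MAX_BITS 256

struct toy_pool {
	uint32_t state;		/* stand-in for the real pool contents */
	unsigned entropy_bits;	/* how much entropy we claim to hold */
};

/* Mixing is always safe: it can only add unpredictability. */
static void toy_mix(struct toy_pool *p, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		p->state = (p->state ^ buf[i]) * 16777619u;
}

/*
 * Untrusted seeds are mixed in with no credit; trusted seeds are mixed
 * in and also credited, which is what lets the pool count as initialized.
 */
static void toy_add_seed(struct toy_pool *p, const uint8_t *seed, size_t len,
			 bool trusted)
{
	toy_mix(p, seed, len);
	if (!trusted)
		return;

	p->entropy_bits += (unsigned)(len * 8);
	if (p->entropy_bits > TOY_POOL_MAX_BITS)
		p->entropy_bits = TOY_POOL_MAX_BITS;
}
```

Whether a bootloader-provided seed deserves the "trusted" path is exactly the policy question raised above; the code only shows that the two paths differ in accounting, not in mixing.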

What do you think?

- Ted

Re: Status of Subsystems

2019-08-20 Thread Theodore Y. Ts'o

On Tue, Aug 20, 2019 at 03:56:24PM +0200, Sebastian Duda wrote:
> 
> so the status of the files is inherited from the subsystem `INPUT MULTITOUCH
> (MT) PROTOCOL`?
> 
> Is it the same with the subsystem `NOKIA N900 POWER SUPPLY DRIVERS`
> (respectively `POWER SUPPLY CLASS/SUBSYSTEM and DRIVERS`)?

Note that the definition of "subsystems" is not necessarily precise.
So assuming there is a strict subclassing and inheritance might not be
a perfect assumption.  There are some files which have no official
owner, and there are also some files which may be modified by more
than one subsystem.

We certainly don't talk about "inheritance" when we talk about
maintainers and sub-maintainers.  Furthermore, the relationships,
processes, and workflows between a particular maintainer and their
submaintainers can be unique to a particular maintainer.

We define these terms to be convenient for Linux development, and like
many human institutions, they can be flexible and messy.  The goal was
*not* to define things so it would be convenient for academics writing
papers --- like insects under glass.

Cheers,

- Ted

Re: New kernel interface for sys_tz and timewarp?

2019-08-15 Thread Theodore Y. Ts'o

On Thu, Aug 15, 2019 at 03:22:45PM +0200, Arnd Bergmann wrote:
> If 64-bit Windows relies on a working EFI RTC implementation, we could
> decide to leave the driver enabled on 64-bit and only disable it for
> 32-bit EFI. That way, future distros would no longer have to worry about
> the localtime hack, at least the ones that have dropped support for
> 32-bit x86 kernels.

... and who have also dropped support for legacy (non-UEFI) 64-bit
boot.  Keep in mind that even for distributions which may install with
UEFI by default, if people have been upgrading from (for example)
Debian Jessie to Stretch to Buster, they may still be using non-UEFI
boot.  This might be especially true on small Linode servers (such as,
for example, the one which is currently running my mail smarthost).

   - Ted

Re: [PATCH] ext4: flag as supporting buffered async reads

2020-09-03 Thread Theodore Y. Ts'o

99dedf RBX: 0001 RCX: 
[ 3326.428656] RDX:  RSI:  RDI: ad5e433b
[ 3326.435910] RBP: 949dd7a29800 R08: 0001 R09: 0001
[ 3326.443169] R10: 949dd91ea724 R11: 08ee R12: 949dd7a29864
[ 3326.450420] R13: 0001 R14: 0001 R15: 
[ 3326.457690]  ? acpi_safe_halt+0x1b/0x30
[ 3326.461647]  acpi_idle_enter+0x1d0/0x260
[ 3326.465704]  cpuidle_enter_state+0x6e/0x3a0
[ 3326.470019]  cpuidle_enter+0x29/0x40
[ 3326.473726]  cpuidle_idle_call+0xf8/0x160
[ 3326.477854]  do_idle+0x72/0xc0
[ 3326.481036]  cpu_startup_entry+0x19/0x20
[ 3326.485079]  start_kernel+0x433/0x452
[ 3326.488899]  secondary_startup_64+0xb6/0xc0
[ 3326.493204] CR2: 949e7c958e3f
[ 3326.496757] ---[ end trace 464b5b002bebb81c ]---



On Thu, Sep 03, 2020 at 06:10:19PM -0600, Jens Axboe wrote:
> On 8/26/20 7:54 PM, Jens Axboe wrote:
> > On 8/25/20 8:18 AM, Jens Axboe wrote:
> >> On 8/24/20 4:56 AM, Jens Axboe wrote:
> >>> On 8/22/20 9:48 AM, Jens Axboe wrote:
> >>>> On 8/22/20 8:33 AM, Theodore Y. Ts'o wrote:
> >>>>> On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
> >>>>>>>>> Resending this one, as I've been carrying it privately since May.
> >>>>>>>>> The necessary bits are now upstream (and XFS/btrfs equiv changes as
> >>>>>>>>> well), please consider this one for 5.9. Thanks!
> >>>>>>>>
> >>>>>>>> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
> >>>>>>>> missing something?  It's on my queue to send to Linus once I get my
> >>>>>>>> (late) ext4 primary pull request for 5.9.
> >>>>>>>
> >>>>>>> Right, it went in at the start of the merge window for 5.9. Thanks
> >>>>>>> Ted!
> >>>>>>
> >>>>>> Didn't see it in the queue that just sent in, is it still queued up?
> >>>>>
> >>>>> It wasn't in the queue which I queued up because that was based on
> >>>>> 5.8-rc4.  Linus was a bit grumpy (fairly so) because it was late, and
> >>>>> that's totally on me.
> >>>>>
> >>>>> He has said that he's going to start ignoring pull requests that
> >>>>> aren't fixes only if this becomes a pattern, so while I can send him
> >>>>> another pull request which will just have that one change, there are
> >>>>> no guarantees he's going to take it at this late date.
> >>>>>
> >>>>> Sorry, when you sent me the commit saying that the changes that were
> >>>>> needed were already upstream on August 3rd, I thought that meant that
> >>>>> they were already in Linus's tree.  I should have checked and noticed
> >>>>> that that in fact "ext4: flag as supporting buffered async reads"
> >>>>> wasn't compiling against Linus's upstream tree, so I didn't realize
> >>>>> this needed to be handled as a special case during the merge window.
> >>>>
> >>>> Well to be honest, this kind of sucks. I've been posting it since May,
> >>>> and the ideal approach would have been to just ack it and I could have
> >>>> carried it in my tree. That's what we did for btrfs and XFS, both of
> >>>> which have it.
> >>>>
> >>>> The required patches *were* upstreamed on August 3rd, which is why I
> >>>> mentioned that. But yes, not in 5.8 or earlier, of course.
> >>>>
> >>>> So I suggest that you either include it for the next pull request for
> >>>> Linus, or that I put it in with your ack. Either is fine with me. I'd
> >>>> consider this a "dropping the ball" kind of thing, it's not like the
> >>>> patch hasn't been in linux-next or hasn't been ready for months. This
> >>>> isn't some "oh I wrote this feature after the merge window" event. It'd
> >>>> be a real shame to ship 5.9 and ext4 not have support for the more
> >>>> efficient async buffered reads, imho, especially since the two other
> >>>> major local file systems already have it.
> >>>>
> >>>> Let me know what you think.
> >>>
> >>> Ted, can you make a call on this, please? It's now post -rc2. Let's
> >>> get this settled and included, one way or another.
> >>
> >> Daily ping on this one...
> > 
> > And again. Ted, not sure how to make any progress with this, to be
> > honest, it's like pounding sand.
> 
> And 8 days later...
> 
> -- 
> Jens Axboe
>

Re: [PATCH] ext4: Fix comment typo "the the".

2020-08-18 Thread Theodore Y. Ts'o

On Sat, Apr 25, 2020 at 02:16:24AM +0900, kyoungho koo wrote:
> I have found double typed comments "the the". So i modified it to
> one "the"
> 
> Signed-off-by: kyoungho koo 

Thanks, applied; apologies for this falling through the cracks!

- Ted

Re: [LKP] Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression

2020-08-19 Thread Theodore Y. Ts'o

I looked at what the stress-ng fiemap workload is doing, and
it's interesting.

It is running 4 processes which are calling FIEMAP on a particular
file in a loop, with a 25ms sleep every 64 times.  And then there is a
fifth process which is randomly writing to the file and calling
punch_hole to random offsets in the file.

So this is quite different from what Ritesh has been benchmarking
which is fiemap in isolation, as opposed to fiemap racing against 3
other fiemap processes plus a process which is actively modifying the
file.

In the original code, if I remember correctly, we were using a shared
reader/writer lock to look at the extent tree blocks directly, but we
held the i_data_sem rw_semaphore for the duration of the fiemap call.

In the new code, we're going through the extent_status cache, which is
grabbing the rw_spinlock each time we do a lookup in the extents
status tree.  So this is much finer-grained locking and that is
probably the explanation for the increased time for running fiemap in
the contended case.

If this theory is correct, we would probably get back the performance
by wrapping the calls to iomap_fiemap() with
{up,down}_read(&EXT4_I(inode)->i_data_sem) in ext4_fiemap().
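
The difference between the two locking patterns can be made visible with a toy lock that just counts how often it is taken.  This is a single-threaded sketch, not ext4 code: struct counted_lock and the walk_* helpers are hypothetical, and the "lock" does no real synchronization.

```c
/* Toy lock: counts acquisitions instead of synchronizing anything. */
struct counted_lock {
	unsigned long acquisitions;
};

static void toy_read_lock(struct counted_lock *l)
{
	l->acquisitions++;
}

static void toy_read_unlock(struct counted_lock *l)
{
	(void)l;
}

/* Coarse-grained: hold one lock across the whole extent walk. */
static unsigned long walk_coarse(struct counted_lock *l, unsigned long nextents)
{
	unsigned long seen = 0;

	toy_read_lock(l);
	for (unsigned long i = 0; i < nextents; i++)
		seen++;			/* stand-in for one extent lookup */
	toy_read_unlock(l);
	return seen;
}

/* Fine-grained: take and drop the lock around every single lookup. */
static unsigned long walk_fine(struct counted_lock *l, unsigned long nextents)
{
	unsigned long seen = 0;

	for (unsigned long i = 0; i < nextents; i++) {
		toy_read_lock(l);
		seen++;			/* stand-in for one extent lookup */
		toy_read_unlock(l);
	}
	return seen;
}
```

For a file with 1000 extents the coarse walk touches the lock once and the fine walk touches it 1000 times; under contention each of those extra acquisitions is a chance to bounce a cache line or wait for a writer.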

That being said, however, it's not clear what real-life workload cares
about FIEMAP performance, especially with multiple threads all calling
FIEMAP racing against a file which is being actively modified.  Having
stress-ng do this to find potential kernel bugs is a great thing, so I
understand why stress-ng might be doing this as a QA tool.  Why we
should care about stress-ng as a performance benchmark, at least in
this case, is much less clear to me.

Cheers,

- Ted

Re: [PATCH] ext4: flag as supporting buffered async reads

2020-08-18 Thread Theodore Y. Ts'o

On Mon, Aug 03, 2020 at 05:02:11PM -0600, Jens Axboe wrote:
> ext4 uses generic_file_read_iter(), which already supports this.
> 
> Cc: Theodore Ts'o 
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> Resending this one, as I've been carrying it privately since May. The
> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
> please consider this one for 5.9. Thanks!

The necessary commit only hit upstream as of 5.9-rc1, unless I'm
missing something?  It's on my queue to send to Linus once I get my
(late) ext4 primary pull request for 5.9.

- Ted

Re: [PATCH] mballoc: Replace seq_printf with seq_puts

2020-08-18 Thread Theodore Y. Ts'o

On Mon, Aug 10, 2020 at 02:21:58AM +, Xu Wang wrote:
> seq_puts is a lot cheaper than seq_printf, so use that to print
> literal strings.
> 
> Signed-off-by: Xu Wang 

Applied, thanks.

- Ted

Re: [PATCH AUTOSEL 4.19 059/206] ext4: make dioread_nolock the default

2020-09-18 Thread Theodore Y. Ts'o

On Thu, Sep 17, 2020 at 07:58:59PM -0700, Eric Biggers wrote:
> On Thu, Sep 17, 2020 at 10:05:35PM -0400, Sasha Levin wrote:
> > From: Theodore Ts'o 
> > 
> > [ Upstream commit 244adf6426ee31a83f397b700d964cff12a247d3 ]
> > 
> > This fixes the direct I/O versus writeback race which can reveal stale
> > data, and it improves the tail latency of commits on slow devices.
> > 
> > Link: https://lore.kernel.org/r/20200125022254.1101588-1-ty...@mit.edu
> > Signed-off-by: Theodore Ts'o 
> > Signed-off-by: Sasha Levin I
> 
> Any particular reason to be backporting this?  I thought I saw some fixes for
> dioread_nolock go by, after it was made the default.  Are you getting all of
> those fixes too?

Agreed, making dioread_nolock the default has enough issues that it's
not something that I'd suggest backporting at this point.  It's a
fundamental behavioral change, and not something we should change
in a stable kernel.

- Ted

Re: [PATCH] random: initialize ChaCha20 constants with correct endianness

2020-09-18 Thread Theodore Y. Ts'o

On Tue, Sep 15, 2020 at 09:50:13PM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> On big endian CPUs, the ChaCha20-based CRNG is using the wrong
> endianness for the ChaCha20 constants.
> 
> This doesn't matter cryptographically, but technically it means it's not
> ChaCha20 anymore.  Fix it to always use the standard constants.

I'll note that we're not technically ChaCha20 in terms of how we
handle the IV.  ChaCha20 is defined as having a 96 bit IV and a 32-bit
counter.  The counter is "usually initialized to be zero or one" (per
RFC 7539) and the counter is defined to be Little Endian.

We're currently not bothering to deal with Endian conversions with the
counter, and we're using a 64-bit counter, instead of a 32-bit
counter.  We also twiddle 32 bits of the state (crng->state[14]) by
XOR'ing it with RDRAND if available at each round, which is also a
spec violation.

We also initialize the counter to be a random value, using the
input_pool or the primary crng state (if we are initializing the
secondary state), but given that the specification says _usually_ zero
or one, that's not an out-and-out spec violation.

As far as the other deviations / "spec violations" from ChaCha-20 are
concerned...  I'm "sorry not sorry".  :-)

I have no objections to changing things so that the first 4 words of
the crng state are more ChaCha-20-like, on the theory that most of the
cryptoanlysis work (both positive and negative) have been done with
the little-endian version of "expand 32-byte k".  I don't think it
really makes a difference, either positively or negatively.  But
technically we'd *still* not be using ChaCha20.  We could say that
we're using the ChaCha20 block function, regardless.
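
For reference, the "little-endian version" of the constants can be
sketched as below.  This is an illustration only, not the patch under
discussion; the function names are hypothetical.  It loads the 16-byte
string as four little-endian 32-bit words, so the result is the same on
big endian and little endian CPUs.

```c
#include <stddef.h>
#include <stdint.h>

/* Load 4 bytes as a little-endian 32-bit word, independent of the
 * host's byte order. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Fill out[0..3] with the first four words of the ChaCha state. */
void chacha_constants(uint32_t out[4])
{
    static const unsigned char sigma[] = "expand 32-byte k";

    for (size_t i = 0; i < 4; i++)
        out[i] = load_le32(sigma + 4 * i);
}
```

The resulting words are 0x61707865, 0x3320646e, 0x79622d32 and
0x6b206574.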

Cheers,

- Ted

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-24 Thread Theodore Y. Ts'o

On Thu, Sep 24, 2020 at 08:59:01AM +0800, Ming Lei wrote:
> 
> The list corruption issue can be reproduced on kvm/qemu guest too when
> running xfstests(ext4) generic/038.
> 
> However, the issue may become not reproduced when adding or removing memory
> debug options, such as adding KASAN.

Can you reliably reproduce this crash?  And if so, with what config and
what kernel version?

One of the reasons why I had gone silent on this bug is that I've been
trying many, many configurations, and configurations which reliably
reproduced on, say, -rc4 would not reproduce on -rc5, and then I would
get a completely different set of results on -rc6.  So I've been
trying to run a lot of different experiments to try to understand what
might be going on, since it seems pretty clear this must be a very
timing-sensitive failure.

I also found that the recurrence rate went down significantly if I enabled
KASAN, and while it didn't go away, I wasn't able to get a KASAN
failure to trigger, either.  Turning off CONFIG_PROVE_LOCKING and a
*lot* of other debugging configs made the problem vanish in -rc4, but
that trick didn't work with -rc5 or -rc6.

Each time I discovered one of these things, I was about to post to the
e-mail thread, only to have a confirmation test run on a different
kernel version make the problem go away.  In particular, your revert
helped with -rc4 and -rc6 IIRC, but it didn't help in -rc5.

HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
that one of the stack traces had virtio balloon in the trace, I
realized that when I switched the GCE VM type from e1-standard-2 to
n1-standard-2 (where e1 VM's are cheaper because they use
virtio-balloon to better manage host OS memory utilization), the problem
has become much, *much* rarer (and possibly has gone away, although
I'm going to want to run a lot more tests before I say that
conclusively) on my test setup.  At the very least, using an n1 VM
(which doesn't have virtio-balloon enabled in the hypervisor) is
enough to unblock ext4 development.

Any chance your kvm/qemu configuration might have been using
virtio-balloon?  Because other ext4 developers who have been using
kvm-xfstests have not had any problems.

> When I enable PAGE_POISONING, double free on kmalloc(192) is captured:
> 
> [ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp 89ada7584300
> [ 1198.326651] [ cut here ]
> [ 1198.327969] kernel BUG at mm/slab.c:2535!
> [ 1198.329129] invalid opcode:  [#1] SMP PTI
> [ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc4_quiesce_srcu-xfstests #102
> [ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
> [ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15
> [ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 f8 e9 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb ff <0f> 0b 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19
> [ 1198.347331] RSP: 0018:982e40710be8 EFLAGS: 00010046
> [ 1198.349091] RAX: 0048 RBX: 89adb6441400 RCX:
> [ 1198.351839] RDX:  RSI: 89adbaa97800 RDI: 89adbaa97800
> [ 1198.354572] RBP: 89ada7584300 R08: 0417 R09: 0057
> [ 1198.357150] R10: 0001 R11: 982e40710aa5 R12: 89adbaaae598
> [ 1198.359067] R13: e7bc819d6108 R14: e7bc819d6100 R15: 89adb6442280
> [ 1198.360975] FS:  () GS:89adbaa8() knlGS:
> [ 1198.363202] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1198.365986] CR2: 55f6a3811318 CR3: 00017adca005 CR4: 00770ee0
> [ 1198.368679] DR0:  DR1:  DR2:
> [ 1198.371386] DR3:  DR6: fffe0ff0 DR7: 0400
> [ 1198.374203] PKRU: 5554
> [ 1198.375174] Call Trace:
> [ 1198.376165]
> [ 1198.376908]  ___cache_free+0x56d/0x770
> [ 1198.378355]  ? kmem_freepages+0xa0/0xf0
> [ 1198.379814]  kfree+0x91/0x120
> [ 1198.382121]  kmem_freepages+0xa0/0xf0
> [ 1198.383474]  slab_destroy+0x9f/0x120
> [ 1198.384779]  slabs_destroy+0x6d/0x90
> [ 1198.386110]  ___cache_free+0x632/0x770
> [ 1198.387547]  ? kmem_freepages+0xa0/0xf0
> [ 1198.389016]  kfree+0x91/0x120
> [ 1198.390160]  kmem_freepages+0xa0/0xf0
> [ 1198.391551]  slab_destroy+0x9f/0x120
> [ 1198.392964]  slabs_destroy+0x6d/0x90
> [ 1198.394439]  ___cache_free+0x632/0x770
> [ 1198.395896]  kmem_cache_free.part.75+0x19/0x70
> [ 1198.397791]  rcu_core+0x1eb/0x6b0
> [ 1198.399829]  ? ktime_get+0x37/0xa0
> [ 1198.401343]  __do_softirq+0xdf/0x2c5
> [ 1198.403010]  asm_call_on_stack+0x12/0x20
> [ 1198.404847]
> [ 1198.405799]  do_softirq_own_stack+0x39/0x50
> [ 1198.407621]  irq_exit_rcu+0x97/0xa0
> [

Re: [PATCH] ext4: Implement swap_activate aops using iomap

2020-09-24 Thread Theodore Y. Ts'o

On Fri, Sep 04, 2020 at 02:46:53PM +0530, Ritesh Harjani wrote:
> After moving ext4's bmap to iomap interface, swapon functionality
> on files created using fallocate (which creates unwritten extents) are
> failing. This is since iomap_bmap interface returns 0 for unwritten
> extents and thus generic_swapfile_activate considers this as holes
> and hence bail out with below kernel msg :-
> 
> [340.915835] swapon: swapfile has holes
> 
> To fix this we need to implement ->swap_activate aops in ext4
> which will use ext4_iomap_report_ops. Since we only need to return
> the list of extents so ext4_iomap_report_ops should be enough.
> 
> Reported-by: Yuxuan Shui 
> Fixes: ac58e4fb03f ("ext4: move ext4 bmap to use iomap infrastructure")
> Signed-off-by: Ritesh Harjani 

Thanks, applied.

- Ted

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-24 Thread Theodore Y. Ts'o

On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> that one of the stack traces had virtio balloon in the trace, I
> realized that when I switched the GCE VM type from e1-standard-2 to
> n1-standard-2 (where e1 VM's are cheaper because they use
> virtio-balloon to better manage host OS memory utilization), problem
> has become, much, *much* rarer (and possibly has gone away, although
> I'm going to want to run a lot more tests before I say that
> conclusively) on my test setup.  At the very least, using an n1 VM
> (which doesn't have virtio-balloon enabled in the hypervisor) is
> enough to unblock ext4 development.

...and I spoke too soon.  A number of runs using -rc6 are now
failing even with the n1-standard-2 VM, so virtio-balloon may not be an
indicator.

This is why debugging this is frustrating; it is very much a heisenbug
--- although 5.8 seems to work completely reliably, as do commits
before 37f4a24c2469.  Anything after that point will show random
failures.  :-(

- Ted

Re: [PATCH] ext4: Use generic casefolding support

2020-10-29 Thread Theodore Y. Ts'o

On Wed, Oct 28, 2020 at 05:08:20AM +, Daniel Rosenberg wrote:
> This switches ext4 over to the generic support provided in libfs.
> 
> Since casefolded dentries behave the same in ext4 and f2fs, we decrease
> the maintenance burden by unifying them, and any optimizations will
> immediately apply to both.
> 
> Signed-off-by: Daniel Rosenberg 
> Reviewed-by: Eric Biggers 

Applied, thanks.

- Ted

[GIT PULL] ext4 fixes for 5.10-rc2

2020-10-29 Thread Theodore Y. Ts'o

The following changes since commit 96485e4462604744d66bf4301557d996d80b85eb:

  Merge tag 'ext4_for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 (2020-10-22 10:31:08 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_fixes

for you to fetch changes up to 6694875ef8045cdb1e6712ee9b68fe08763507d8:

  ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/... 
(2020-10-28 13:43:22 -0400)


Bug fixes for the new ext4 fast commit feature, plus a fix for the
data=journal bug fix.  Also use the generic casefolding support which
has now landed in fs/libfs.c for 5.10.


Andrea Righi (1):
  ext4: properly check for dirty state in ext4_inode_datasync_dirty()

Daniel Rosenberg (1):
  ext4: use generic casefolding support

Harshad Shirwadkar (4):
  ext4: fix double locking in ext4_fc_commit_dentry_updates()
  ext4: make num of fast commit blocks configurable
  ext4: use s_mount_flags instead of s_mount_state for fast commit state
  ext4: use IS_ERR() for error checking of path

Jan Kara (1):
  ext4: fix mmap write protection for data=journal mode

Mauro Carvalho Chehab (1):
  jbd2: fix a kernel-doc markup

Theodore Ts'o (1):
  ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/...

yangerkun (1):
  ext4: do not use extent after put_bh

 fs/ext4/dir.c         | 64 ++--
 fs/ext4/ext4.h        | 20
 fs/ext4/extents.c     | 30 +++---
 fs/ext4/fast_commit.c | 37 -
 fs/ext4/hash.c        |  2 +-
 fs/ext4/inode.c       | 15 +--
 fs/ext4/namei.c       | 20
 fs/ext4/super.c       | 16
 fs/ext4/sysfs.c       |  2 ++
 include/linux/jbd2.h  |  7 +--
 10 files changed, 78 insertions(+), 135 deletions(-)

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-27 Thread Theodore Y. Ts'o

On Fri, Sep 25, 2020 at 02:18:48PM -0700, Shakeel Butt wrote:
> 
> Yes, you are right. Let's first get this patch tested and after
> confirmation we can update the commit message.

Thanks Shakeel!  I've tested your patch, as well as reverting the
three commits that Linus had suggested, and both seem to address the
problem for me as well.  I did see a small number of failures
as soon as the VM had booted when testing with the "revert the three
commits" approach, but this appears to be a different failure, which I
had also been seeing periodically during the bisect, and which was no
doubt introducing noise in my testing:

[   28.545018] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
[   28.545018] Modules linked in:
[   28.545018] irq event stamp: 4517759
[   28.545018] hardirqs last  enabled at (4517758): [] 
asm_common_interrupt+0x1e/0x40
[   28.545018] hardirqs last disabled at (4517759): [] 
sysvec_apic_timer_interrupt+0xb/0x90
[   28.545018] softirqs last  enabled at (10634): [] 
irq_enter_rcu+0x6d/0x70
[   28.545018] softirqs last disabled at (10635): [] 
asm_call_on_stack+0x12/0x20
[   28.545018] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
5.9.0-rc6-xfstests-7-g3f3cb48a7d90 #1916
[   28.545018] Hardware name: Google Google Compute Engine/Google Compute 
Engine, BIOS Google 01/01/2011
[   28.545018] RIP: 0010:__do_softirq+0xa3/0x435
[   28.545018] Code: 00 83 80 ac 07 00 00 01 48 89 44 24 08 c7 44 24 1c 0a 00 
00 00 65 66 c7 05 a8 ae 9e 55 00 00 e8 d3 92 3b ff fb b8 ff ff ff ff <48> c7 c3 
40 51 00 ab 41 0f bc c7 89 c6 83 c6 01 89 74 24 04 75 6a
[   28.545018] RSP: :b89f000e0f98 EFLAGS: 0202
[   28.545018] RAX:  RBX:  RCX: 298a
[   28.545018] RDX:  RSI:  RDI: aa80009d
[   28.545018] RBP: b89f000abda0 R08: 0001 R09: 
[   28.545018] R10: 0001 R11: 0046 R12: 0001
[   28.545018] R13:  R14:  R15: 0080
[   28.545018] FS:  () GS:998e5920() 
knlGS:
[   28.545018] CS:  0010 DS:  ES:  CR0: 80050033
[   28.545018] CR2:  CR3: 00023e012001 CR4: 001706e0
[   28.545018] DR0:  DR1:  DR2: 
[   28.545018] DR3:  DR6: fffe0ff0 DR7: 0400
[   28.545018] Call Trace:
[   28.545018]  
[   28.545018]  asm_call_on_stack+0x12/0x20
[   28.545018]  
[   28.545018]  do_softirq_own_stack+0x4e/0x60
[   28.545018]  irq_exit_rcu+0x9f/0xe0
[   28.545018]  sysvec_call_function_single+0x43/0x90
[   28.545018]  asm_sysvec_call_function_single+0x12/0x20
[   28.545018] RIP: 0010:acpi_idle_do_entry+0x54/0x70
[   28.545018] Code: ed c3 e9 cf fe ff ff 65 48 8b 04 25 00 6e 01 00 48 8b 00 
a8 08 75 ea e8 ba c0 5b ff e9 07 00 00 00 0f 00 2d f8 3d 4e 00 fb f4 <9c> 58 fa 
f6 c4 02 74 cf e9 5f c2 5b ff cc cc cc cc cc cc cc cc cc
[   28.545018] RSP: :b89f000abe88 EFLAGS: 0202
[   28.545018] RAX: 293b RBX: 998e5564 RCX: 1a12
[   28.545018] RDX:  RSI:  RDI: aa5fd2b6
[   28.545018] RBP: ab163760 R08: 0001 R09: 000e003c
[   28.545018] R10: 998e582e2340 R11: 0046 R12: 0001
[   28.545018] R13: 0001 R14: ab1637e0 R15: 
[   28.545018]  ? acpi_idle_do_entry+0x46/0x70
[   28.545018]  ? acpi_idle_do_entry+0x46/0x70
[   28.545018]  acpi_idle_enter+0x7d/0xb0
[   28.545018]  cpuidle_enter_state+0x84/0x2c0
[   28.545018]  cpuidle_enter+0x29/0x40
[   28.545018]  cpuidle_idle_call+0x111/0x180
[   28.545018]  do_idle+0x7b/0xd0
[   28.545018]  cpu_startup_entry+0x19/0x20
[   28.545018]  secondary_startup_64+0xb6/0xc0

I think this was an issue relating to acpi_idle that others have
reported, but I thought this was fixed before -rc6 was released?  In
any case, this is post -rc6, so apparently there is something else
going on here, and this is probably unrelated to the regression which
Shakeel's patch is addressing.

- Ted

Re: [PATCH 4/4] debian: add generic rule file

2019-07-15 Thread Theodore Y. Ts'o

On Mon, Jul 15, 2019 at 08:56:25PM +0200, Enrico Weigelt, metux IT consult 
wrote:
> On 15.07.19 14:28, Masahiro Yamada wrote:
> 
> >> The rule file contains a rule for creating debian/control and
> >> other metadata - this is done similar to the 'deb-pkg' make rule,
> >> scripts/packaging/mkdebian.
> > 
> > I saw a similar patch submission before, and negative feedback about it.
> 
> Do you recall what negative feedback exactly ?

It's possible I'm not remembering some of the feedback, but the only
thing I recall was the comment I made that I'd really like this use
case:

make O=/build/linux-build bindeb-pkg

to not break.  And as far as I can tell from the proposed patch series
(I haven't had a chance to experimentally verify it yet), I don't
think it should break anything --- I'm assuming that we will still
have a way of creating the debian/rules file in
/build/linux-build/debian/rules when doing a O= build, and that the
intdeb-pkg rule remains the same.  At least, it appears to be the case
from my doing a quick look at the patches.

> > Debian maintains its own debian/rules, and it is fine.
> 
> Not for me, I don't use it - given up trying to make anything useful
> out of it. It's extremly complex, practically undebuggable and doesn't
> even work w/o lots of external preparations.

Yeah, the official Debian debian/rules is optimized for doing a
distribution release, and in addition to the issues Enrico has raised,
last time I tried it, it was S-L-O-W since it was building a fully
generic kernel.  It's not at all useable for general developer use.

It sounds like what Enrico is trying to do is to enable running
"dpkg-buildpackage -us -uc -b" from the top-level kernel package
as being easier than running "make bindeb-pkg".  I suspect this might
be because his goal is to integrate individual kernel builds with
Debian's hermetic build / chroot systems (e.g., sbuild, pbuilder)?

  - Ted

Re: [PATCH 4/4] debian: add generic rule file

2019-07-16 Thread Theodore Y. Ts'o

On Tue, Jul 16, 2019 at 05:58:49PM +0900, Masahiro Yamada wrote:
> I want debian/ to be kept as a drop-in directory
> for packagers, without replacing the upstream debian/rules.
> 
> If a check-in source file is modified in anyway,
> scripts/setlocalversion would set -dirty flag,
> which I want to avoid.

In practice, that's not going to be a problem for most distributions.
The traditional way Debian-derived systems have done builds is
completely outside of git.  So there will be a linux_5.2.orig.tar.gz
and a linux_5.2-1.debian.tar.xz.  dpkg-source -x will first unpack
the orig.tar.gz, and then the debian.tar.xz, and if the second
overwrites the first, it's no big deal.

More modern Debian package maintainer workflows may be using git, but
in that case, all of the "Debianizations" are reflected in a separate
branch.  So it's not going to set the -dirty flag.

There will be potential merge conflicts between Enrico's proposed
"upstream default debian/rules" file and the Debian/Ubuntu
debian/rules file on their distro branch.  However, I don't think
that's a big issue, for two reasons.

First, once it's checked in, I expect changes to the default
debian/rules file will be relatively rare.  Secondly, it's easy enough
to use gitattributes and define a custom merge driver so that a
distribution can configure things so that they always use the version
of debian/rules from their branch, so the merge conflict resolution
can be set up to always do the right thing.

There are certainly other upstreams which ship their own debian/
directories.  E2fsprogs is one such example, but in that case I'm
cheating because I'm both the Debian package maintainer as well as the
upstream maintainer.  :-)   However, it's never been an issue for Ubuntu
when they choose to ship their own customized debian/rules file.

> debian/rules is a hook for packagers to do their jobs in downstream.
> "We kindly committed a generic one for you" sounds weird to me.

It is weird, and it's not common for upstream packages (which are not
native Debian packages) to ship their own debian directory.  But it
certainly does happen, and it won't cause any problems in actual
practice.

Regards,

- Ted

Re: kbuild: Fail if gold linker is detected

2019-07-16 Thread Theodore Y. Ts'o

On Tue, Jul 16, 2019 at 08:13:24PM +0200, Ingo Molnar wrote:
> 
> * Thomas Gleixner  wrote:
> 
> > On Tue, 16 Jul 2019, Ingo Molnar wrote:
> > 
> > > 
> > > * Thomas Gleixner  wrote:
> > > 
> > > > The gold linker has known issues of failing the build in random and
> > > > predictible ways. H.J. stated:
> > > 
> > > s/predictable/unpredictable?
> > 
> > No. It fails randomly, but also predictable. Enable X32 support on 64bit
> > and it fails the VDSO build. That's been the case for years.
> 
> Then please make this a bit more apparent, such as:
> 
>  "The gold linker has known issues of failing the build in random
>   but also in more predictible ways."


How about:

The gold linker has known issues of failing for certain configurations.



- Ted

Re: [PATCH v2] kbuild: Fail if gold linker is detected

2019-07-16 Thread Theodore Y. Ts'o

On Wed, Jul 17, 2019 at 12:25:14AM +0200, Thomas Gleixner wrote:
> > It's been my default system linker for years and I've had very few issues
> > with it and it's a big improvement when linking with LTO
> 
> I understand, but the fact that you need to turn off config options in
> order to build a kernel and the clear statement that it's not recommended
> makes it truly unsuitable and unmaintainable for us.

Or if you work for a cloud company who is willing to make the gold
linker work for your specific use case and configuration (and ideally,
have gold toolchain experts on staff who will work with you), then it
might be OK, but just for that particular use case.  (Just as Android
kernels worked with Clang when Clang was still miscompiling the kernel on
different architectures and configurations.)  In those cases, you can
just carry a patch to force the gold linker to work.

The point, though, is that the teams that were using alternative,
not-always-reliable toolchains were big boys and girls, and they
weren't asking the upstream kernel devs for support.  And they only
cared about a few specific configurations, and not something that
would work for all or even most configurations and hardware platforms.

- Ted

Re: [PATCH 4/4] debian: add generic rule file

2019-07-17 Thread Theodore Y. Ts'o

On Wed, Jul 17, 2019 at 04:16:39PM +0200, Enrico Weigelt, metux IT consult 
wrote:
> 
> > In practice, that's not going to be a problem for most distributions.
> > The traditional way Debian-derived systems have done builds is
> > completely outside of git.  So there will be a linux_5.2.orig.tar.gz
> > and a linux_5.2-1.debian.tar.xz.  dpkg-source -x will first unpack
> > the orig.tar.gz, and then the debian.tar.xz, and if the second
> > overwrites the first, it's no big deal.
> 
> ACK. IIRC they already filter out debian/ directories when generating
> upstream tarballs - other upstreams already provide their debian/
> stuff, too.

Well, no, actually they don't.  That's because as much as possible
they want the upstream tarball to be bit-for-bit identical to the one
published on the official upstream distribution site.  That allows
them to include the detached PGP signature from the upstream
maintainer, if one is provided.

If there are files in the upstream debian/ directory that they don't
need, they can delete them in the distro's debian/rules file.  Ideally,
though, we shouldn't include files in the Linux kernel's debian/
directory willy-nilly.  But the debian/rules file will *always* be
present, and
so it will be overwritten by the _.debian.tar.xz file,
and so it's no big deal.

- Ted

Re: Linux 5.3-rc8

2019-09-17 Thread Theodore Y. Ts'o

On Tue, Sep 17, 2019 at 09:33:40AM +0200, Martin Steigerwald wrote:
> Willy Tarreau - 17.09.19, 07:24:38 CEST:
> > On Mon, Sep 16, 2019 at 06:46:07PM -0700, Matthew Garrett wrote:
> > > >Well, the patch actually made getrandom() return an error too, but
> > > >you seem more interested in the hypotheticals than in arguing
> > > >actualities.
> > > If you want to be safe, terminate the process.
> > 
> > This is an interesting approach. At least it will cause bug reports in
> > application using getrandom() in an unreliable way and they will
> > check for other options. Because one of the issues with systems that
> > do not finish to boot is that usually the user doesn't know what
> > process is hanging.
> 

I would be happy with a change which changes getrandom(0) to send a
kill -9 to the process if it is called too early, with a new flag,
getrandom(GRND_BLOCK) which blocks until entropy is available.  That
leaves it up to the application developer to decide what behavior they
want.

Userspace applications which want to do something more sophisticated
could set a timer which will cause getrandom(GRND_BLOCK) to return
with EINTR (or the signal handler could use longjmp; whatever) to
abort and do something else, like calling random_r if it's for some
pathetic use of random numbers like MIT-MAGIC-COOKIE.

> A userspace process could just poll on the kernel by forking a process 
> to use getrandom() and waiting until it does not get terminated anymore. 
> And then it would still hang.

So I'm not too worried about that, because if a process is
determined to do something stupid, they can always do something
stupid.

This could potentially be a problem, as would GRND_BLOCK, in that if
an application author decides to use it to wait for real
randomness, because in the good judgement of the application author,
it d*mned well needs real security because otherwise an attacker could,
say, force a launch of nuclear weapons and cause world war III, and
then some small 3rd-tier distro decides to repurpose that application
for some other use, and puts it in early boot, it's possible that a
user will report it as a "regression", and we'll be back to the
question of whether we revert a performance optimization patch.

There are only two ways out of this mess.  The first option is we take
functionality away from a userspace author who Really Wants A Secure
Random Number Generator.  And there are an awful lot of programs which
really want secure crypto, because this is not a hypothetical.  The
results in "Mining your P's and Q's" did happen before.  If we forget
the history, we are doomed to repeat it.

The only other way is that we need to try to get the CRNG initialized
securely in early boot, before we let userspace start.  If we do it
early enough, we can also make the kernel facilities like KASLR and
Stack Canaries more secure.  And this is *doable*, at least for most
common platforms.  We can leverage UEFI; we can try to use the TPM's
random number generator, etc.  It won't help so much for certain
brain-dead architectures, like MIPS and ARM, but if they are used for
embedded use cases, it will be caught before the product is released
for consumer use.  And this is where blocking is *way* better than a
big fat warning, or sleeping for 15 seconds, both of which can easily
get missed in the embedded case.  If we can fix this for traditional
servers/desktops/laptops, then users won't be complaining to Linus,
and I think we can all be happy.

Regards,

- Ted

Re: [PATCH RFC v4 1/1] random: WARN on large getrandom() waits and introduce getrandom2()

2019-09-19 Thread Theodore Y. Ts'o

(Adding linux-api since this patch proposes an API change; both by
changing the existing behavior, and adding new flags and possibly a
new system call.)

On Wed, Sep 18, 2019 at 04:57:58PM -0700, Linus Torvalds wrote:
> On Wed, Sep 18, 2019 at 2:17 PM Ahmed S. Darwish  wrote:
> >
> > Since Linux v3.17, getrandom(2) has been created as a new and more
> > secure interface for pseudorandom data requests.  It attempted to
> > solve three problems, as compared to /dev/urandom:
> 
> I don't think your patch is really _wrong_, but I think it's silly to
> introduce a new system call, when we have 30 bits left in the flags of
> the old one, and the old system call checked them.

The only reason to introduce a new system call is if we were going to
keep the existing behavior of getrandom.  Given that the patch changes
what getrandom(0) does, I agree there's no point to adding a new system
call.

> There is *one* other small semantic change: The old code did
> urandom_read() which added warnings, but each warning also _reset_ the
> crng_init_cnt. Until it decided not to warn any more, at which point
> it also stops that resetting of crng_init_cnt.
> 
> And that reset of crng_init_cnt, btw, is some cray cray.
> 
> It's basically a "we used up entropy" thing, which is very
> questionable to begin with as the whole discussion has shown, but
> since it stops doing it after 10 cases, it's not even good security
> assuming the "use up entropy" case makes sense in the first place.

It was a bug that it stopped doing it after 10 tries, and there's a
really good reason for it.  Yes, the "using up entropy" thing doesn't
make much sense in the general case.  But we still need some threshold
for deciding whether or not it's been sufficiently initialized such
that we consider the CRNG initialized.

The reason for zeroing it after we expose state is because otherwise
if the pool starts in a known state (the attacker knows the starting
configuration, knows the DMI table that we're mixing into the pool
since that's a constant, etc.), then after we've injected a small
amount of uncertainty in the pool --- say, we started with a single
known state of the pool, and after injecting some randomness, there
are 64 possible states of the pool.  If the attacker can read from
/dev/urandom, the attacker can know which of the 64 possible states of
the pool it's in.  Now suppose we inject more uncertainty, so that
there's another 64 unknown states, and the attacker is able to
constantly read from /dev/urandom in a tight loop; it'll be able to
keep up with the injection of entropy, and so even though
we've injected 256 "bits" of uncertainty, the attacker will still know
the state of the pool.  That's why when we read from the pool, we need
to clear the entropy bits.
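
The argument can be illustrated with a toy simulation.  Everything in
it is an assumption made for illustration: the 64-bit "pool", the
splitmix64-style mixing function and the helper names have nothing to
do with the kernel's real input pool.  Each step injects one of 64
unknown values, and the attacker, who knows the starting state, reads
the output after every step and brute-forces only that step.

```c
#include <stddef.h>
#include <stdint.h>

/* splitmix64 finalizer: a bijection on 64-bit values.  It stands in
 * for pool mixing here and is NOT the kernel's mixing function. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Inject one of 64 possible unknown values (6 bits) into the pool. */
static uint64_t inject(uint64_t state, unsigned int e)
{
    return mix64(state + (e & 63u));
}

/* What a reader of the toy "/dev/urandom" gets to see. */
static uint64_t output(uint64_t state)
{
    return mix64(state ^ 0x9e3779b97f4a7c15ULL);
}

/*
 * Simulate an attacker who reads the output after every injection.
 * Returns the number of guesses made; *in_sync is set to 1 if the
 * attacker still knows the pool state at the end.
 */
uint64_t track_pool(uint64_t known_start, const unsigned int *secret,
                    size_t steps, int *in_sync)
{
    uint64_t real = known_start;
    uint64_t guess = known_start;
    uint64_t trials = 0;

    for (size_t i = 0; i < steps; i++) {
        real = inject(real, secret[i]);
        uint64_t seen = output(real);   /* attacker's read */

        for (unsigned int e = 0; e < 64; e++) {
            uint64_t cand = inject(guess, e);

            trials++;
            if (output(cand) == seen) {
                guess = cand;
                break;
            }
        }
    }
    *in_sync = (guess == real);
    return trials;
}
```

With 8 steps the attacker needs at most 8 * 64 = 512 guesses to stay
in sync.  If the same 8 injections had accumulated without any
intermediate read, the attacker would be facing 64^8 = 2^48
possibilities instead.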

This is sometimes called a "state extension attack", and there have
been attacks that have been carried out against RNGs that don't
protect against it.  What happened is when I added the rate-limiting
to the uninitialized /dev/urandom warning, I accidentally wiped out
the protection.  But it was there for a reason.

> And the new cases are defined to *not* warn. In particular,
> GRND_INSECURE very much does *not* warn about early urandom access
> when crng isn't ready. Because the whole point of that new mode is
> that the user knows it isn't secure.
> 
> So that should make getrandom(GRND_INSECURE) palatable to the systemd
> kind of use that wanted to avoid the pointless kernel warning.

Yes, that's clearly the right thing to do.  I do think we need to
restore the state extension attack protections, though.

> + /*
> +  * People are really confused about whether
> +  * this is secure or insecure. Traditional
> +  * behavior is secure, but there are users
> +  * who clearly didn't want that, and just
> +  * never thought about it.
> +  */
> + case 0:
>   ret = wait_for_random_bytes();
> - if (unlikely(ret))
> + if (ret)
>   return ret;
> + break;

I'm happy this proposal is not changing the behavior of getrandom(0).
Why not just remap 0 to GRND_EXPLICIT | GRND_WAIT_ENTROPY, though?  It
will have the same effect, and it'd make it clear what we're doing.

Later on, when we rip out the /dev/random pool code (and make reading from
/dev/random the equivalent of getrandom(GRND_SECURE)), we'll need to
similarly map the legacy combination of flags for GRND_RANDOM and
GRND_RANDOM | GRND_NONBLOCK.

- Ted

Re: [PATCH] Fixed most indent issues in tty_io.c

2019-09-08 Thread Theodore Y. Ts'o

Hi Sandro,

Something that's not mentioned in the process documentation (but maybe
we should add this) is that it's up to individual maintainers about
whether or not whitespace cleanups are accepted outside of the staging
tree.

That's because whitespace cleanups are a great "training wheel" for
newbies who are learning the ropes, but they do have some costs.  For
example, for actively developed portions of the kernel whitespace
cleanups can often break other pending changes.  Also, trivial cleanups
(e.g., spelling and whitespace cleanups) make it more likely that
future bug fixes in that portion of the kernel will fail to be
automatically backported to the stable kernel, thus requiring a manual
backport effort.  

As a result, some maintainers will reject trivial cleanups unless they
are part of a patch series that is making some kind of substantive
improvement to the kernel (beyond trivial cleanups).

There are some good aspects of fixing whitespace issues, of course,
which is why they are encouraged in the staging tree, but there is not
consensus amongst maintainers about whether it is a net benefit to do
clean up patches just for the sake of doing cleanup patches.

(And of course, sometimes the checkpatch rules change over time --- at
one point, checkpatch would warn if *any* line was longer than 80
characters, and so there were tons and tons of trivial cleanups to
"fix" this, including breaking up strings.  When enough people
complained that this actually made it harder to find kernel messages
that got split, checkpatch changed to complain when strings were split
across lines, and more trivial patches got sent out undoing previous
trivial patches.  And this caused all of the same downsides of
breaking automated stable backports, *twice*.  As such, newbies are
strongly encouraged to restrict their checkpatch cleanups to the
staging tree, since when such cleanup patches are considered welcome
very much depends on the kernel subsystem and the maintainers
involved.)

Cheers,

- Ted

Re: Linux 5.3-rc8

2019-09-10 Thread Theodore Y. Ts'o

On Tue, Sep 10, 2019 at 06:21:07AM +0200, Ahmed S. Darwish wrote:
> 
> The commit b03755ad6f33 (ext4: make __ext4_get_inode_loc plug), [1]
> which was merged in v5.3-rc1, *always* leads to a blocked boot on my
> system due to low entropy.
> 
> The hardware is not a VM: it's a Thinkpad E480 (i5-8250U CPU), with
> a standard Arch user-space.

Hmm, I'm not seeing this on a Dell XPS 13 (model 9380) using Debian
Bullseye (Testing) running an rc4+ kernel.

This could be because Debian is simply doing more I/O; or it could be
because I don't have some package installed which is trying to read
from /dev/random or calling getrandom(2).  Previously, Fedora ran into
blocking issues because of some FIPS compliance patches to some
userspace daemons.  So it's going to be very user space dependent and
package dependent.

> It seems that batching the directory lookup I/O requests (which are
> possibly a lot during boot) is minimizing sources of disk-activity-
> induced entropy? [2] [3]
> 
> Can this even be considered a user-space breakage? I'm honestly not
> sure. On my modern RDRAND-capable x86, just running rng-tools rngd(8)
> early-on fixes the problem. I'm not sure about the status of older
> CPUs though.

You can probably also fix this problem by adding random.trust_cpu=true
to the boot command line, or by enabling CONFIG_RANDOM_TRUST_CPU.
This obviously assumes that you trust Intel's implementation of
RDRAND, but that's true regardless of whether you use rngd
or the kernel config option.

As far as whether it's considered user-space breakage; that's tough.
File system performance improvements can cause a reduced amount of
I/O, and that can cause less entropy to be collected, and depending on
a complex combination of kernel config options, distribution-specific
patches, and what packages are loaded, that could potentially cause
boot hangs waiting for entropy.  Does that mean we can't make any
file system performance improvements?  Surely that doesn't seem like
the right answer.

It would be useful to figure out what process is blocking waiting on
entropy, since in general, trying to rely on cryptographic entropy in
early boot, especially if it is to generate cryptographic keys, is
going to be more dangerous compared to a "just in time" approach to
generating crypto keys.  So this could also be considered a userspace
bug, depending on your point of view...
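
A minimal sketch of the "just in time" approach, assuming a userspace
daemon written in C against glibc's getrandom(2) wrapper.  The names
get_session_key(), fill_random() and KEY_LEN are hypothetical, and the
sketch is not thread-safe.  The point is only that the blocking call
happens on first use of the key rather than at startup:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

#define KEY_LEN 32	/* hypothetical key size, in bytes */

static unsigned char session_key[KEY_LEN];
static int key_ready;

/* Fill buf completely.  With flags == 0, getrandom(2) blocks until the
 * CRNG has been initialized, so call this as late as possible. */
static int fill_random(unsigned char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = getrandom(buf + done, len - done, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal, retry */
			return -1;
		}
		done += (size_t)n;
	}
	return 0;
}

/* Generate the key on first use instead of at daemon startup.
 * Returns NULL if the kernel could not supply random bytes. */
const unsigned char *get_session_key(void)
{
	if (!key_ready) {
		if (fill_random(session_key, sizeof(session_key)) != 0)
			return NULL;
		key_ready = 1;
	}
	return session_key;
}
```

A daemon structured this way can finish starting up without touching
the random pool, and only waits for entropy if a key is actually
needed before the CRNG is ready.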

- Ted

Re: Linux 5.3-rc8

2019-09-11 Thread Theodore Y. Ts'o

On Tue, Sep 10, 2019 at 07:21:54PM +0100, Linus Torvalds wrote:
> On Tue, Sep 10, 2019 at 6:33 PM Ahmed S. Darwish  wrote:
> >
> > While gnome-session is obviously at fault here by requiring
> > *blocking* randomness at the boot path, it's still not requesting
> > much, just (5 * 16) bytes to be exact.

It doesn't matter how much randomness it's requesting.  With the new
cryptographic random number generator, the CRNG is either
initialized or it's not.
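
For what it's worth, userspace can ask that yes/no question directly.
This is a sketch, assuming glibc 2.25 or later for the getrandom(2)
wrapper; crng_ready() is a hypothetical helper name.  The request size
is one byte precisely because the size doesn't change the answer:

```c
#include <errno.h>
#include <sys/random.h>
#include <sys/types.h>

/* Returns 1 if the CRNG is initialized, 0 if getrandom(2) would have
 * blocked, and -1 on any other error (e.g. ENOSYS on an old kernel). */
int crng_ready(void)
{
	unsigned char byte;
	ssize_t n;

	do {
		n = getrandom(&byte, sizeof(byte), GRND_NONBLOCK);
	} while (n < 0 && errno == EINTR);

	if (n == (ssize_t)sizeof(byte))
		return 1;
	if (errno == EAGAIN)
		return 0;	/* CRNG not yet initialized */
	return -1;
}
```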

> Just out of curiosity, what happens if you apply a patch like this
> (intentionally whitespace-damaged, I don't want anybody to pick it up
> without thinking about it) thing...

> Which I think is what the code really wants - it's only using jiffies
> because that is the only thing _guaranteed_ to change at all. But with
> the sum, you get the best of both worlds, and should basically make
> the entropy estimation use the "better of two counters".
> 
> Ted, comments? I'd hate to revert the ext4 thing just because it
> happens to expose a bad thing in user space.

Unfortunately, I very much doubt this is going to work.  That's
because the add_disk_randomness() path is only used for legacy
/dev/random (which actually only still exists because of some insane
PCI compliance issues which a number of end users really care about
--- or they care about because it makes the insane PCI compliance labs
go away).

Also, the vast majority of disks have
/sys/block/XXX/queue/add_random set to zero by default.

So the way we get entropy these days for initializing the CRNG is
via the add_interrupt_randomness() path, where we do something really
fast, and we assume that we get enough uncertainty from 8 interrupts
to give us one bit of entropy (64 interrupts to give us a byte of
entropy), and that we need 512 bits of entropy to consider the CRNG
fully initialized.  (Yeah, there's a lot of conservatism in those
estimates, and so what we could do is decide to say, cut down the
number of bits needed to initialize the CRNG to be 256 bits, since
that's the size of the CHACHA20 cipher.)
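
To put numbers on that estimate, here is a small sketch of the
arithmetic.  It is not the kernel's code; interrupts_needed() and the
constant name are hypothetical, and the only input is the
8-interrupts-per-bit figure quoted above:

```c
/* Assumption taken from the estimate above: 8 interrupts are credited
 * as one bit of entropy. */
#define INTERRUPTS_PER_ENTROPY_BIT 8u

/* Number of interrupts that must be observed before entropy_bits bits
 * have been credited under that estimate. */
unsigned int interrupts_needed(unsigned int entropy_bits)
{
	return entropy_bits * INTERRUPTS_PER_ENTROPY_BIT;
}
```

With the current 512-bit threshold, interrupts_needed(512) works out to
4096 interrupts; dropping the threshold to 256 bits would halve that to
2048.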

Ultimately, though, we need to find *some* way to fix userspace's
assumptions that they can always get high quality entropy in early
boot, or we need to get over people's distrust of Intel and RDRAND.
Otherwise, future performance improvements in any part of the system
which reduce the number of interrupts are always going to potentially
cause somebody's misconfigured system or badly written applications
to fail to boot.  :-(

- Ted

Re: Linux 5.3-rc8

2019-09-11 Thread Theodore Y. Ts'o

On Wed, Sep 11, 2019 at 06:00:19PM +0100, Linus Torvalds wrote:
> [0.231255] random: get_random_bytes called from
> start_kernel+0x323/0x4f5 with crng_init=0
> 
> and that's this code:
> 
> add_latent_entropy();
> add_device_randomness(command_line, strlen(command_line));
> boot_init_stack_canary();
> 
> in particular, it's the boot_init_stack_canary() thing that asks for a
> random number for the canary.
> 
> I don't actually see the 'crng init done' until much much later:
> 
> [   21.741125] random: crng init done

Yes, that's super early in the boot sequence.  IIRC the stack canary
gets reinitialized later (or maybe it was only for the other CPU's in
SMP mode; I don't recall the details off the top of my head).

I think this one always fails, and perhaps we should have a way of
suppressing it --- but it's correct that the in-kernel interface doesn't
block.

The /dev/urandom device doesn't block either, despite security
eggheads continually asking me to change it to block a la getrandom(2),
but I have always pushed back because I *know* changing
/dev/urandom to block would be asking for userspace regressions.

The compromise we came up with was that since getrandom(2) is a new
interface, we could make this have the behavior that the security
heads wanted, which is to make blocking unconditional, since the
theory was that *this* interface would be sane, and that userspace
applications which used it too early were buggy, and we could make it
*their* problem.

People have suggested adding a new getrandom flag, GRND_I_KNOW_THIS_IS_INSECURE,
or some such, which wouldn't block and would return "best efforts"
randomness.  I haven't been super enthusiastic about such a flag
because I *know* it would be insecure.  However, the next time a massive
security bug shows up on the front pages of the Wall Street Journal,
or on some web site such as https://factorable.net, it won't be the
kernel's fault since the flag will be GRND_INSECURE_BROKEN_APPLICATION,
or some such.
It doesn't really solve the problem, though.

> But this does show that
> 
>  (a) we have the same issue in the kernel, and we don't block there

Ultimately, I think the only right answer is to make it the
bootloader's responsibility to get us some decent entropy at boot
time.  There are patches to allow ARM systems to pass in entropy via
the device tree.  And in theory (assuming you trust the UEFI BIOS ---
stop laughing in the back!) we can use that to get entropy which will
solve the problem for UEFI boot systems.  I've been talking to Ron
Minnich about trying to get this support into the NERF bootloader, at
which point new servers from the Open Compute Project will have a
solution as well.  (We can probably also get solutions for Chrome OS
devices, since those have TPM-like chips which are trusted to have a
competently engineered hardware RNG --- I'm not sure I would trust all
TPM devices in commodity hardware, but again, at least we can shift
blame off of the kernel.  :-P)

Still, these are all point solutions, and don't really solve the
problem on older systems, or non-x86 systems.

>  (b) initializing the crng really can be a timing problem
> 
> The interrupt thing is only going to get worse as disks turn into
> ssd's and some of them end up using polling rather than interrupts..
> So we're likely to see _fewer_ interrupts in the future, not more.

Yeah, agreed.  Maybe we should have an "insecure_randomness" boot
option which blindly forces the CRNG to be initialized at boot, so
that at least people can get to a command line, if insecurely?  I
don't have any good ideas about how to solve this problem in general.
:-( :-( :-(

- Ted

Re: [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case

2020-05-28 Thread Theodore Y. Ts'o



Thanks, I've applied this patch series.

- Ted

Re: [PATCH V5 0/9] Enable ext4 support for per-file/directory DAX operations

2020-05-28 Thread Theodore Y. Ts'o

On Thu, May 28, 2020 at 07:59:54AM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> Changes from V4:
>   Fix up DAX mutual exclusion with other flags.
>   Add clean up patch (remove jflags)
> 
> Changes from V3:
>   Change EXT4_DAX_FL to bit24
>   Cache device DAX support in the super block and use that is
>   ext4_should_use_dax()
> 
> Changes from V2:
>   Rework DAX exclusivity with verity and encryption based on feedback
>   from Eric
> 
> Enable the same per file DAX support in ext4 as was done for xfs.  This series
> builds and depends on the V11 series for xfs.[1]
> 
> This passes the same xfstests test as XFS.
> 
> The only issue is that this modifies the old mount option parsing code rather
> than waiting for the new parsing code to be finalized.
> 
> This series starts with 3 fixes which include making Verity and Encrypt truly
> mutually exclusive from DAX.  I think these first 3 patches should be picked up
> for 5.8 regardless of what is decided regarding the mount parsing.
> 
> [1] https://lore.kernel.org/lkml/20200428002142.404144-1-ira.we...@intel.com/
> 
> To: linux-e...@vger.kernel.org
> To: Andreas Dilger 
> To: "Theodore Y. Ts'o" 
> To: Jan Kara 
> To: Eric Biggers 

Thanks, applied to the ext4-dax branch.

- Ted

Re: [PATCH v2] ext4: support xattr gnu.* namespace for the Hurd

2020-05-28 Thread Theodore Y. Ts'o

On Mon, May 25, 2020 at 09:39:40PM +0200, Jan (janneke) Nieuwenhuizen wrote:
> The Hurd gained[0] support for moving the translator and author
> fields out of the inode and into the "gnu.*" xattr namespace.
> 
> In anticipation of that, an xattr INDEX was reserved[1].  The Hurd has
> now been brought into compliance[2] with that.
> 
> This patch adds support for reading and writing such attributes from
> Linux; you can now do something like
> 
> mkdir -p hurd-root/servers/socket
> touch hurd-root/servers/socket/1
> setfattr --name=gnu.translator --value='"/hurd/pflocal\0"' \
> hurd-root/servers/socket/1
> getfattr --name=gnu.translator hurd-root/servers/socket/1
> # file: 1
> gnu.translator="/hurd/pflocal"
> 
> to setup a pipe translator, which is being used to create[3] a
> vm-image for the Hurd from GNU Guix.
> 
> [0] https://summerofcode.withgoogle.com/projects/#5869799859027968
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3980bd3b406addb327d858aebd19e229ea340b9a
> [2] https://git.savannah.gnu.org/cgit/hurd/hurd.git/commit/?id=a04c7bf83172faa7cb080fbe3b6c04a8415ca645
> [3] https://git.savannah.gnu.org/cgit/guix.git/log/?h=wip-hurd-vm

This patch is missing a Signed-off-by.  If you don't understand why
this is really important, please read: 

https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin

Can you resubmit with the DCO or confirm that it's OK for me to add your 
Signed-off-by?

  - Ted

Re: [PATCH V5 0/9] Enable ext4 support for per-file/directory DAX operations

2020-05-28 Thread Theodore Y. Ts'o

On Thu, May 28, 2020 at 10:54:41PM -0400, Theodore Y. Ts'o wrote:
> 
> Thanks, applied to the ext4-dax branch.
> 

I spoke too soon.  When I tried merging with the ext4.git dev branch,
a merge conflict made me look closer and I realized I needed to make
the following changes (see diff between your patch set and what is
currently in ext4-dax).

Essentially, I needed to rework the branch to take into account commit
e0198aff3ae3 ("ext4: reject mount options not supported when
remounting in handle_mount_opt()").

The problem is that if you allow handle_mount_opt() to apply the
changes to the dax settings, and then later on, ext4_remount() realizes
that we're remounting, and we need to reject the change, there's a
race if we restore the mount options to the original configuration.
Specifically, as Syzkaller pointed out, between when we change the dax
settings and then reset them, it's possible for some file to be opened
with the "wrong" dax setting, and then when they are reset, *boom*.

The correct way to deal with this is to reject the mount option change
much earlier, in handle_mount_opt(), *before* we mess with the dax
settings.
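
As a standalone illustration of that ordering (a simplified userspace
sketch, not the ext4 code; opt_table, apply_opt(), struct fs_state and
OPT_NO_REMOUNT are hypothetical names modeled loosely on the diff
below), the remount check runs before any state is touched, so a
rejected option can never leave a half-applied setting visible:

```c
#include <stddef.h>
#include <string.h>

#define OPT_NO_REMOUNT 0x1u	/* hypothetical, modeled on MOPT_NO_REMOUNT */

struct opt_desc {
	const char *name;
	unsigned int flags;
};

struct fs_state {
	int dax_enabled;
	int stripe;
};

static const struct opt_desc opt_table[] = {
	{ "dax",    OPT_NO_REMOUNT },
	{ "stripe", 0 },
};

/* Returns 0 on success, -1 if the option is unknown or not allowed on
 * remount.  On rejection, *st is guaranteed to be unmodified. */
int apply_opt(struct fs_state *st, const char *name, int is_remount)
{
	const struct opt_desc *d = NULL;
	size_t i;

	for (i = 0; i < sizeof(opt_table) / sizeof(opt_table[0]); i++) {
		if (strcmp(opt_table[i].name, name) == 0) {
			d = &opt_table[i];
			break;
		}
	}
	if (d == NULL)
		return -1;

	/* Reject first, before any state is changed. */
	if ((d->flags & OPT_NO_REMOUNT) && is_remount)
		return -1;

	if (strcmp(name, "dax") == 0)
		st->dax_enabled = 1;
	else if (strcmp(name, "stripe") == 0)
		st->stripe = 1;
	return 0;
}
```

Because nothing is written before the check, there is nothing to roll
back afterwards, and so no window in which another thread could see
the transient setting.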

Please take a look at the ext4-dax branch for the actual changes which I
made.

Cheers,

- Ted


diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3658e3016999..9a37d70394b2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1733,7 +1733,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 #define MOPT_NO_EXT3   0x0200
 #define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING    0x0400
-#define MOPT_SKIP  0x0800
+#define MOPT_NO_REMOUNT    0x0800
 
 static const struct mount_opts {
int token;
@@ -1783,18 +1783,15 @@ static const struct mount_opts {
{Opt_min_batch_time, 0, MOPT_GTE0},
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
{Opt_init_itable, 0, MOPT_GTE0},
-   {Opt_dax, EXT4_MOUNT_DAX_ALWAYS, MOPT_SET | MOPT_SKIP},
-   {Opt_dax_always, EXT4_MOUNT_DAX_ALWAYS,
-   MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
-   {Opt_dax_inode, EXT4_MOUNT2_DAX_INODE,
-   MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
-   {Opt_dax_never, EXT4_MOUNT2_DAX_NEVER,
-   MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
+   {Opt_dax, 0, MOPT_NO_REMOUNT},
+   {Opt_dax_always, 0, MOPT_NO_REMOUNT},
+   {Opt_dax_inode, 0, MOPT_NO_REMOUNT},
+   {Opt_dax_never, 0, MOPT_NO_REMOUNT},
{Opt_stripe, 0, MOPT_GTE0},
{Opt_resuid, 0, MOPT_GTE0},
{Opt_resgid, 0, MOPT_GTE0},
-   {Opt_journal_dev, 0, MOPT_NO_EXT2 | MOPT_GTE0},
-   {Opt_journal_path, 0, MOPT_NO_EXT2 | MOPT_STRING},
+   {Opt_journal_dev, 0, MOPT_NO_EXT2 | MOPT_GTE0 | MOPT_NO_REMOUNT},
+   {Opt_journal_path, 0, MOPT_NO_EXT2 | MOPT_STRING | MOPT_NO_REMOUNT},
{Opt_journal_ioprio, 0, MOPT_NO_EXT2 | MOPT_GTE0},
{Opt_data_journal, EXT4_MOUNT_JOURNAL_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
{Opt_data_ordered, EXT4_MOUNT_ORDERED_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
@@ -1831,7 +1828,7 @@ static const struct mount_opts {
{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
{Opt_test_dummy_encryption, 0, MOPT_GTE0},
-   {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
+   {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET | MOPT_NO_REMOUNT},
{Opt_err, 0, 0}
 };
 
@@ -1929,6 +1926,12 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 "Mount option \"%s\" incompatible with ext3", opt);
return -1;
}
+   if ((m->flags & MOPT_NO_REMOUNT) && is_remount) {
+   ext4_msg(sb, KERN_ERR,
+"Mount option \"%s\" not supported when remounting",
+opt);
+   return -1;
+   }
 
if (args->from && !(m->flags & MOPT_STRING) && match_int(args, &arg))
return -1;
@@ -2008,11 +2011,6 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}
sbi->s_resgid = gid;
} else if (token == Opt_journal_dev) {
-   if (is_remount) {
-   ext4_msg(sb, KERN_ERR,
-"Cannot specify journal on remount");
-   return -1;
-   }
*journal_devnum = arg;
} else if (token == Opt_journal_path) {
char *journal_path;
@@ -2020,11 +2018,6 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
struct path path;
int error;
 
-   if (is_remount) {
-   ext4_msg(sb, KERN_ERR,
-"Cannot specify journal on remount");
-
