from:"Theodore Ts'o"

Re: REGRESSION: lockdep warning triggered by 15b9ce7ecd: virtio_balloon: stay awake while adjusting balloon

2024-01-10 Thread Theodore Ts'o

On Wed, Jan 10, 2024 at 03:11:01AM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 08, 2024 at 04:50:15PM -0500, Theodore Ts'o wrote:
> > Hi, while doing final testing before sending a pull request, I merged
> > in linux-next, and commit 5b9ce7ecd7: virtio_balloon: stay awake while
> > adjusting balloon seems to be causing a lockdep warning (see attached)
> > when running gce-xfstests on a Google Compute Engine e2 VM.  I was not
> > able to trigger it using kvm-xfstests, but the following command:
> > "gce-xfstests -C 10 ext4/4k generic/476) was sufficient to triger the
> > problem.   For more information please see [1] and [2].
> > 
> > [1] 
> > https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md
> > [2] https://thunk.org/gce-xfstests
> > 
> > I found it by looking at the git logs, and this commit aroused my
> > suspicions, and I further testing showed that the lockdep warning was
> > reproducible with this commit, but not when testing with the
> > immediately preceeding commit (15b9ce7ecd^).
> > 
> > Cheers,
> 
> 
> Thanks a lot for the report!
> I pushed a fixed patch out (tree rebased).
> Would be great if you can confirm it's allright now.

I manually fixed up the white-space issues with the patch last night,
and verified that it fixed it for me with an overnight test run.  (My
patch was versus next-20240109, and then I tested with ext4/dev merged
in.  Previously I had noted the problem with next-20240107 with
ext4/dev merged in.)

Thanks,

- Ted


>From 98097bbd4fe2e15db8fa357aa6e29435cb62e450 Mon Sep 17 00:00:00 2001
From: David Stevens 
Date: Tue, 9 Jan 2024 14:41:21 +0900
Subject: [PATCH] virtio_balloon: Fix interrupt context deadlock

Use _irq spinlock functions with the adjustment_lock, since
start_update_balloon_size needs to acquire it in an interrupt context.

Fixes: 5b9ce7ecd715 ("virtio_balloon: stay awake while adjusting balloon")
Reported-by: Theodore Ts'o 
Tested-by: Theodore Ts'o 
Signed-off-by: David Stevens 
Signed-off-by: Theodore Ts'o 
---
 drivers/virtio/virtio_balloon.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index aa6a1a649ad6..1f5b3dd31fcf 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -459,12 +459,12 @@ static void start_update_balloon_size(struct 
virtio_balloon *vb)
 
 static void end_update_balloon_size(struct virtio_balloon *vb)
 {
-   spin_lock(>adjustment_lock);
+   spin_lock_irq(>adjustment_lock);
if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
vb->adjustment_in_progress = false;
pm_relax(vb->vdev->dev.parent);
}
-   spin_unlock(>adjustment_lock);
+   spin_unlock_irq(>adjustment_lock);
 }
 
 static void virtballoon_changed(struct virtio_device *vdev)
@@ -506,9 +506,9 @@ static void update_balloon_size_func(struct work_struct 
*work)
vb = container_of(work, struct virtio_balloon,
  update_balloon_size_work);
 
-   spin_lock(>adjustment_lock);
+   spin_lock_irq(>adjustment_lock);
vb->adjustment_signal_pending = false;
-   spin_unlock(>adjustment_lock);
+   spin_unlock_irq(>adjustment_lock);
 
diff = towards_target(vb);
 
-- 
2.43.0

REGRESSION: lockdep warning triggered by 15b9ce7ecd: virtio_balloon: stay awake while adjusting balloon

2024-01-08 Thread Theodore Ts'o

Hi, while doing final testing before sending a pull request, I merged
in linux-next, and commit 5b9ce7ecd7: virtio_balloon: stay awake while
adjusting balloon seems to be causing a lockdep warning (see attached)
when running gce-xfstests on a Google Compute Engine e2 VM.  I was not
able to trigger it using kvm-xfstests, but the following command:
"gce-xfstests -C 10 ext4/4k generic/476) was sufficient to triger the
problem.   For more information please see [1] and [2].

[1] 
https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md
[2] https://thunk.org/gce-xfstests

I found it by looking at the git logs, and this commit aroused my
suspicions, and I further testing showed that the lockdep warning was
reproducible with this commit, but not when testing with the
immediately preceeding commit (15b9ce7ecd^).

Cheers,

- Ted


root: ext4/4k run xfstest generic/476
systemd[1]: Started fstests-generic-476.scope - /usr/bin/bash -c test -w 
/proc/self/oom_score_adj && echo 250 > /proc/self/oom_score_adj; exec 
./tests/generic/476.
kernel: [  399.361181] EXT4-fs (dm-1): mounted filesystem 
840e25bd-f650-4819-8562-7eded85ef370 r/w with ordered data mode. Quota mode: 
none.
systemd[1]: fstests-generic-476.scope: Deactivated successfully.
systemd[1]: fstests-generic-476.scope: Consumed 3min 1.966s CPU time.
systemd[1]: xt\x2dvdb.mount: Deactivated successfully.
kernel: [  537.085404] EXT4-fs (dm-0): unmounting filesystem 
d3d7a675-f7b6-4384-abec-2e60d885b6da.
systemd[1]: xt\x2dvdc.mount: Deactivated successfully.
kernel: [  540.565870] 
kernel: [  540.567523] 
kernel: [  540.572007] WARNING: inconsistent lock state
kernel: [  540.576407] 6.7.0-rc3-xfstests-lockdep-00012-g5b9ce7ecd715 #318 Not 
tainted
kernel: [  540.583532] 
kernel: [  540.587928] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
kernel: [  540.594326] kworker/0:3/329 [HC0[0]:SC0[0]:HE1:SE1] takes:
kernel: [  540.599955] 90b280a548c0 (>adjustment_lock){?...}-{2:2}, at: 
update_balloon_size_func+0x33/0x190
kernel: [  540.609926] {IN-HARDIRQ-W} state was registered at:
kernel: [  540.614935]   __lock_acquire+0x3f2/0xb30
kernel: [  540.618992]   lock_acquire+0xbf/0x2b0
kernel: [  540.622786]   _raw_spin_lock_irqsave+0x43/0x90
kernel: [  540.627366]   virtballoon_changed+0x51/0xd0
kernel: [  540.631947]   virtio_config_changed+0x5a/0x70
kernel: [  540.636437]   vp_config_changed+0x11/0x20
kernel: [  540.640576]   __handle_irq_event_percpu+0x88/0x230
kernel: [  540.645500]   handle_irq_event+0x38/0x80
kernel: [  540.649558]   handle_edge_irq+0x8f/0x1f0
kernel: [  540.653791]   __common_interrupt+0x47/0xf0
kernel: [  540.658106]   common_interrupt+0x79/0xa0
kernel: [  540.661672] EXT4-fs (dm-1): unmounting filesystem 
840e25bd-f650-4819-8562-7eded85ef370.
kernel: [  540.663183]   asm_common_interrupt+0x26/0x40
kernel: [  540.663190]   acpi_safe_halt+0x1b/0x30
kernel: [  540.663196]   acpi_idle_enter+0x7b/0xd0
kernel: [  540.663199]   cpuidle_enter_state+0x90/0x4f0
kernel: [  540.688723]   cpuidle_enter+0x2d/0x40
kernel: [  540.692516]   cpuidle_idle_call+0xe4/0x120
kernel: [  540.697036]   do_idle+0x84/0xd0
kernel: [  540.700393]   cpu_startup_entry+0x2a/0x30
kernel: [  540.704588]   rest_init+0xe9/0x180
kernel: [  540.708118]   arch_call_rest_init+0xe/0x30
kernel: [  540.712426]   start_kernel+0x41c/0x4b0
kernel: [  540.716310]   x86_64_start_reservations+0x18/0x30
kernel: [  540.721164]   x86_64_start_kernel+0x8c/0x90
kernel: [  540.725737]   secondary_startup_64_no_verify+0x178/0x17b
kernel: [  540.731432] irq event stamp: 22681
kernel: [  540.734956] hardirqs last  enabled at (22681): [] 
_raw_spin_unlock_irq+0x28/0x50
kernel: [  540.744564] hardirqs last disabled at (22680): [] 
_raw_spin_lock_irq+0x5d/0x90
kernel: [  540.753475] softirqs last  enabled at (22076): [] 
srcu_invoke_callbacks+0x101/0x1c0
kernel: [  540.762904] softirqs last disabled at (22072): [] 
srcu_invoke_callbacks+0x101/0x1c0
kernel: [  540.773298] 
kernel: [  540.773298] other info that might help us debug this:
kernel: [  540.780207]  Possible unsafe locking scenario:
kernel: [  540.780207] 
kernel: [  540.786438]CPU0
kernel: [  540.789007]
kernel: [  540.791766]   lock(>adjustment_lock);
kernel: [  540.796014]   
kernel: [  540.798778] lock(>adjustment_lock);
kernel: [  540.803605] 
kernel: [  540.803605]  *** DEADLOCK ***
kernel: [  540.803605] 
kernel: [  540.809840] 2 locks held by kworker/0:3/329:
kernel: [  540.814259]  #0: 90b280079148 
((wq_completion)events_freezable){+.+.}-{0:0}, at: process_one_work+0x1a6/0x500
kernel: [  540.824952]  #1: 9e3b40d1fe50 
((work_completion)(>update_balloon_size_work)){+.+.}-{0:0}, at: 
process_one_work+0x1a6/0x500
kernel: [  540.837088] 
kernel: [  540.837088] stack backtrace:
kernel: [  540.841584] CPU: 0 PID: 329 Comm: kworker/0:3 Not tainted

Maintainers / Kernel Summit 2021 planning kick-off

2021-04-19 Thread Theodore Ts'o

[ Feel free to forward this to other Linux kernel mailing lists as
  appropriate -- Ted ]

This year, the Maintainers and Kernel Summit is currently planned to
be held in Dublin, Ireland, September 27 -- 29th.  Of course, this is
subject to change depending on how much progress the world makes
towards vaccinating the population against the COVID-19 virus, and
whether employers are approving conference travel.  At this point,
there's a fairly good chance that we will need to move to a virtual
conference format, either for one or both of the summits.

As in previous years, the Maintainers Summit is invite-only, where the
primary focus will be process issues around Linux Kernel Development.
It will be limited to 30 invitees and a handful of sponsored
attendees.

The Kernel Summit is organized as a track which is run in parallel
with the other tracks at the Linux Plumbers Conference (LPC), and is
open to all registered attendees of LPC.

Linus has generated a core list of people to be invited to the
Maintainers Summit, and the program committee will be using that list
a starting point of people to be considered.  People who suggest
topics that should be discussed at the Maintainers Summit will also be
added to the list for consideration.  To make topic suggestions for
the Maintainers Summit, please send e-mail to the
ksum...@lists.linux.dev with a subject prefix of [MAINTAINERS SUMMIT].

(Note: The older ksummit-disc...@lists.linuxfoundation.org list has
been migrated to lists.linux.dev, with the subscriber list and
archives preserved.)

The other job of the program committee will be to organize the program
for the Kernel Summit.  The goal of the Kernel Summit track will be to
provide a forum to discuss specific technical issues that would be
easier to resolve in person than over e-mail.  The program committee
will also consider "information sharing" topics if they are clearly of
interest to the wider development community (i.e., advanced training
in topics that would be useful to kernel developers).

To suggest a topic for the Kernel Summit, please do two things.
First, please tag your e-mail with [TECH TOPIC].  As before, please
use a separate e-mail for each topic, and send the topic suggestions
to the ksummit-discuss list.

Secondly, please create a topic at the Linux Plumbers Conference
proposal submission site and target it to the Kernel Summit track.
For your convenience you can use:

https://bit.ly/lpc21-summit

Please do both steps.  I'll try to notice if someone forgets one or
the other, but your chances of making sure your proposal gets the
necessary attention and consideration are maximized by submitting both
to the mailing list and the web site.

People who submit topic suggestions before June 12th and which are
accepted, will be given free admission to the Linux Plumbers
Conference.

We will be reserving roughly half of the Kernel Summit slots for
last-minute discussions that will be scheduled during the week of
Plumbers, in an "unconference style".  This allows last-minute ideas
that come up to be given given slots for discussion.

If you were not subscribed on to the kernel@lists.linux-dev mailing
list from last year (or if you had removed yourself from the
ksummit-disc...@lists.linux-foundation.org mailing list after the
previous year's kernel and maintainers' summit summit), you can
subscribe sending an e-mail to:

ksummit+subscr...@lists.linux.dev

The mailing list archive is available at:

https://lore.kernel.org/ksummit

The program committee this year is composed of the following people:

Jens Axboe
Arnd Bergmann
Jon Corbet
Greg Kroah-Hartman
Ted Ts'o

Re: [PATCH v2 3/6] kunit: ext4: adhear to KUNIT formatting standard

2021-04-18 Thread Theodore Ts'o

On Wed, Apr 14, 2021 at 04:58:06AM -0400, Nico Pache wrote:
> Drop 'S' from end of CONFIG_EXT4_KUNIT_TESTS inorder to adhear to the KUNIT
> *_KUNIT_TEST config name format.

Another spelling nit, that should be "in order".

This will break people who have existing .kunitconfig files, but if we
are going to make this a standard (but please document it as a
standard first!), might as well do it sooner rather than later.

Cheers,

- Ted

Re: [PATCH v2 0/6] kunit: Fix formatting of KUNIT tests to meet the standard

2021-04-18 Thread Theodore Ts'o

On Wed, Apr 14, 2021 at 04:58:03AM -0400, Nico Pache wrote:
> There are few instances of KUNIT tests that are not properly defined.
> This commit focuses on correcting these issues to match the standard
> defined in the Documentation.

The word "standard" seems to be over-stating things.  The
documentation currently states, "they _usually_ have config options
ending in ``_KUNIT_TEST'' (emphasis mine).  I can imagine that there
might be some useful things we can do from a tooling perspective if we
do standardize things, but if you really want to make it a "standard",
we should first update the manpage to say so, and explain why (e.g.,
so that we can easily extract out all of the kunit test modules, and
perhaps paint a vision of what tools might be able to do with such a
standard).

Alternatively, the word "standard" could perhaps be changed to
"convention", which I think more accurately defines how things work at
the moment.

> Nico Pache (6):
>   kunit: ASoC: topology: adhear to KUNIT formatting standard
>   kunit: software node: adhear to KUNIT formatting standard
>   kunit: ext4: adhear to KUNIT formatting standard
>   kunit: lib: adhear to KUNIT formatting standard
>   kunit: mptcp: adhear to KUNIT formatting standard
>   m68k: update configs to match the proper KUNIT syntax

Also, "adhear" is not the correct spelling; the correct spelling is
"adhere" (from the Latin verb "adhaerere", "to stick", as in "to hold
fast or stick by as if by gluing", which then became "to bind oneself
to the observance of a set of rules or standards or practices").

 - Ted

Re: [PATCH] fs/ext4: prevent the CPU from being 100% occupied in ext4_mb_discard_group_preallocations

2021-04-18 Thread Theodore Ts'o

On Sun, Apr 18, 2021 at 06:28:34PM +0800, Wen Yang wrote:
> The kworker has occupied 100% of the CPU for several days:
> PID USER  PR  NI VIRT RES SHR S  %CPU  %MEM TIME+  COMMAND
> 68086 root 20 0  00   0   R  100.0 0.0  9718:18 kworker/u64:11
> 
> 
> 
> The thread that references this pa has been waiting for IO to return:
> PID: 15140  TASK: 88004d6dc300  CPU: 16  COMMAND: "kworker/u64:1"
> [c900273e7518] __schedule at 8173ca3b
> [c900273e75a0] schedule at 8173cfb6
> [c900273e75b8] io_schedule at 810bb75a
> [c900273e75e0] bit_wait_io at 8173d8d1
> [c900273e75f8] __wait_on_bit_lock at 8173d4e9
> [c900273e7638] out_of_line_wait_on_bit_lock at 8173d742
> [c900273e76b0] __lock_buffer at 81288c32
> [c900273e76c8] do_get_write_access at a00dd177 [jbd2]
> [c900273e7728] jbd2_journal_get_write_access at a00dd3a3 [jbd2]
> [c900273e7750] __ext4_journal_get_write_access at a023b37b [ext4]
> [c900273e7788] ext4_mb_mark_diskspace_used at a0242a0b [ext4]
> [c900273e77f0] ext4_mb_new_blocks at a0244100 [ext4]
> [c900273e7860] ext4_ext_map_blocks at a02389ae [ext4]
> [c900273e7950] ext4_map_blocks at a0204b52 [ext4]
> [c900273e79d0] ext4_writepages at a0208675 [ext4]
> [c900273e7b30] do_writepages at 811c487e
> [c900273e7b40] __writeback_single_inode at 81280265
> [c900273e7b90] writeback_sb_inodes at 81280ab2
> [c900273e7c90] __writeback_inodes_wb at 81280ed2
> [c900273e7cd8] wb_writeback at 81281238
> [c900273e7d80] wb_workfn at 812819f4
> [c900273e7e18] process_one_work at 810a5dc9
> [c900273e7e60] worker_thread at 810a60ae
> [c900273e7ec0] kthread at 810ac696
> [c900273e7f50] ret_from_fork at 81741dd9
> 
> On the bare metal server, we will use multiple hard disks, the Linux
> kernel will run on the system disk, and business programs will run on
> several hard disks virtualized by the BM hypervisor. The reason why IO
> has not returned here is that the process handling IO in the BM hypervisor
> has failed.

So if the I/O not returning for every days, such that this thread had
been hanging for that long, it also follows that since it was calling
do_get_write_access(), that a handle was open.  And if a handle is
open, then the current jbd2 transaction would never close --- which
means none of the file system operations executed over the past few
days would never commit, and would be undone on the next reboot.
Furthermore, sooner or later the journal would run out of space, at
which point the *entire* system would be locked up waiting for the
transaction to close.

I'm guessing that if the server hadn't come to a full livelock
earlier, it's because there aren't that many metadata operations that
are happening in the server's stable state operation.  But in any
case, this particular server was/is(?) doomed, and all of the patches
that you proposed are not going to help in the long run.  The correct
fix is to fix the hypervisor, which is the root cause of the problem.

I could imagine some kind of retry counter, where we start sleeping
after some number of retries, and give up after some larger number of
retries (at which point the allocation would fail with ENOSPC).  We'd
need to do some testing against our current tests which test how we
handle running close to ENOSPC, and I'm not at all convinced it's
worth the effort in the end.  We're trying to (slightly) improve the
case where (a) the file system is running close to full, (b) the
hypervisor is critically flawed and is the real problem, and (c) the
VM is eventually doomed to fail anyway due to a transaction never
closing due to an I/O never getting acknowledged for days(!).

If you really want to fix things in the guest OS, I perhaps the
virtio_scsi driver (or whatever I/O driver you are using), should
notice when an I/O request hasn't gotten acknowledged after minutes or
hours, and do something such as force a SCSI reset (which will result
in the file system needing to be unmounted and recovered, but due to
the hypervisor bug, that was an inevitable end result anyway).

Cheers,

- Ted

Re: [PATCH 00/13] [RFC] Rust support

2021-04-16 Thread Theodore Ts'o

On Fri, Apr 16, 2021 at 02:07:49PM +0100, Wedson Almeida Filho wrote:
> On Fri, Apr 16, 2021 at 01:24:23PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 14, 2021 at 08:45:51PM +0200, oj...@kernel.org wrote:
> > >   - Featureful language: sum types, pattern matching, generics,
> > > RAII, lifetimes, shared & exclusive references, modules &
> > > visibility, powerful hygienic and procedural macros...
> > 
> > IMO RAII is over-valued, but just in case you care, the below seems to
> > work just fine. No fancy new language needed, works today. Similarly you
> > can create refcount_t guards, or with a little more work full blown
> > smart_ptr crud.
> 
> Peter, we do care, thank you for posting this. It's a great example for us to
> discuss some of the minutiae of what we think Rust brings to the table in
> addition to what's already possible in C.

Another fairly common use case is a lockless, racy test of a
particular field, as an optimization before we take the lock before we
test it for realsies.  In this particular case, we can't allocate
memory while holding a spinlock, so we check to see without taking the
spinlock to see whether we should allocate memory (which is expensive,
and unnecessasry most of the time):

alloc_transaction:
/*
 * This check is racy but it is just an optimization of allocating new
 * transaction early if there are high chances we'll need it. If we
 * guess wrong, we'll retry or free the unused transaction.
 */
if (!data_race(journal->j_running_transaction)) {
/*
 * If __GFP_FS is not present, then we may be being called from
 * inside the fs writeback layer, so we MUST NOT fail.
 */
if ((gfp_mask & __GFP_FS) == 0)
gfp_mask |= __GFP_NOFAIL;
new_transaction = kmem_cache_zalloc(transaction_cache,
gfp_mask);
if (!new_transaction)
return -ENOMEM;
}
...
repeat:
read_lock(>j_state_lock);
...
if (!journal->j_running_transaction) {
read_unlock(>j_state_lock);
if (!new_transaction)
goto alloc_transaction;
write_lock(>j_state_lock);
if (!journal->j_running_transaction &&
(handle->h_reserved || !journal->j_barrier_count)) {
jbd2_get_transaction(journal, new_transaction);
new_transaction = NULL;
}
write_unlock(>j_state_lock);
goto repeat;
}
...

The other thing that I'll note is that diferent elements in thet
journal structure are protected by different spinlocks; we don't have
a global lock protecting the entire structure, which is critical for
scalability on systems with a large number of CPU's with a lot of
threads all wanting to perform file system operations.

So having a guard structure which can't be bypassed on the entire
structure would result in a pretty massive performance penalty for the
ext4 file system.  I know that initially the use of Rust in the kernel
is targetted for less performance critical modules, such as device
drivers, but I thought I would mention some of the advantages of more
advanced locking techniques.

Cheers,

- Ted

Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p

2021-04-10 Thread Theodore Ts'o

On Sun, Apr 11, 2021 at 03:45:01AM +0800, Wen Yang wrote:
> At this time, some logs are lost. It is suspected that the hard disk itself
> is faulty.

If you have a kernel crash dump, that means you can extract out the
dmesg buffer, correct?  Is there any I/O error messages in the kernel
log?

What is the basis of the suspicion that the hard drive is faulty?
Kernel dmesg output?  Error reporting from smartctl?

> There are many hard disks on our server. Maybe we should not occupy 100% CPU
> for a long time just because one hard disk fails.

It depends on the nature of the hard drive failure.  How is it
failing?

One thing which we do need to be careful about is when focusing on how
to prevent a failure caused by some particular (potentially extreme)
scenarios, that we don't cause problems on more common scenarios (for
example a heavily loaded server, and/or a case where the file system
is almost full where we have multiple files "fighting" over a small
number of free blocks).

In general, my attitude is that the best way to protect against hard
drive failures is to have processes which are monitoring the health of
the system, and if there is evidence of a failed drive, that we
immediately kill all jobs which are relying on that drive (which we
call "draining" a particular drive), and/or if a sufficiently large
percentage of the drives have failed, or the machine can no longer do
its job, to automatically move all of those jobs to other servers
(e.g., "drain" the server), and then send the machine to some kind of
data center repair service, where the failed hard drives can be
replaced.

I'm skeptical of attempts to try to make the file system to somehow
continue to be able to "work" in the face of hard drive failures,
since failures can be highly atypical, and what might work well in one
failure scenario might be catastrophic in another.  It's especially
problematic if the HDD is not explcitly signalling an error condition,
but rather being slow (because it's doing a huge number of retries),
or the HDD is returning data which is simply different from what was
previously written.  The best we can do in that case is to detect that
something is wrong (this is where metadata checksums would be very
helpful), and then either remount the file system r/o, or panic the
machine, and/or signal to userspace that some particular file system
should be drained.

Cheers,

- Ted

Re: [PATCH 0/8] EXT4: Trivial typo fixes

2021-04-09 Thread Theodore Ts'o

On Sat, Mar 27, 2021 at 04:00:04PM +0530, Bhaskar Chowdhury wrote:
> This patch series fixed few mundane typos in the below specific files.
> 
> Bhaskar Chowdhury (8):
>   EXT4: migrate.c: Fixed few typos
>   EXT4: namei.c: Fixed a typo
>   EXT4: inline.c: A typo fixed
>   Fix a typo
>   EXT4: indirect.c: A trivial typo fix
>   EXT4: xattr.c: Mundane typo fix
>   EXT4: fast_commit.c: A mere typo fix
>   EXT4: mballoc.h: Single typo fix
>
>  fs/ext4/fast_commit.c | 2 +-
>  fs/ext4/indirect.c| 2 +-
>  fs/ext4/inline.c  | 2 +-
>  fs/ext4/inode.c   | 2 +-
>  fs/ext4/mballoc.h | 2 +-
>  fs/ext4/migrate.c | 6 +++---
>  fs/ext4/namei.c   | 8 
>  fs/ext4/xattr.c   | 2 +-
>  8 files changed, 13 insertions(+), 13 deletions(-)

I've applied this patch series folded together into a single patch.

  - Ted

Re: [PATCH -next] ext4: fix error return code in ext4_fc_perform_commit()

2021-04-09 Thread Theodore Ts'o

On Thu, Apr 08, 2021 at 03:00:33PM +0800, Xu Yihang wrote:
> In case of if not ext4_fc_add_tlv branch, an error return code is missing.
> 
> Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
> Reported-by: Hulk Robot 
> Signed-off-by: Xu Yihang 

Thanks, applied.

- Ted

Re: [PATCH v3] ext4: Fix ext4_error_err save negative errno into superblock

2021-04-09 Thread Theodore Ts'o

On Tue, Apr 06, 2021 at 10:53:31AM +0800, Ye Bin wrote:
> As write_mmp_block return 1 when buffer isn't uptodate, return -EIO is
> more appropriate.
> 
> Fixes: 54d3adbc29f0 ("ext4: save all error info in save_error_info() and drop 
> ext4_set_errno()")
> Reported-by: Liu Zhi Qiang 
> Signed-off-by: Ye Bin 
> Reviewed-by: Andreas Dilger 

Thanks, applied.

- Ted

Re: [RFC] ext4: Fix fs can't panic when abort by user

2021-04-09 Thread Theodore Ts'o

On Fri, Apr 09, 2021 at 05:28:54PM -0400, Theodore Ts'o wrote:
> I'll apply the patch with a modified commit description to warn of
> this particular change in behavior.

Applied with the following commit description:

ext4: always panic when errors=panic is specified

Before commit 014c9caa29d3 ("ext4: make ext4_abort() use
__ext4_error()"), the following series of commands would trigger a
panic:

1. mount /dev/sda -o ro,errors=panic test
2. mount /dev/sda -o remount,abort test

After commit 014c9caa29d3, remounting a file system using the test
mount option "abort" will no longer trigger a panic.  This commit will
restore the behaviour immediately before commit 014c9caa29d3.
(However, note that the Linux kernel's behavior has not been
consistent; some previous kernel versions, including 5.4 and 4.19
similarly did not panic after using the mount option "abort".)

This also makes a change to long-standing behaviour; namely, the
following series commands will now cause a panic, when previously it
did not:

1. mount /dev/sda -o ro,errors=panic test
2. echo test > /sys/fs/ext4/sda/trigger_fs_error

However, this makes ext4's behaviour much more consistent, so this is
a good thing.

Fixes: 014c9caa29d3 ("ext4: make ext4_abort() use __ext4_error()")
Signed-off-by: Ye Bin 
Link: https://lore.kernel.org/r/20210401081903.3421208-1-yebi...@huawei.com
Signed-off-by: Theodore Ts'o

Re: [RFC] ext4: Fix fs can't panic when abort by user

2021-04-09 Thread Theodore Ts'o

On Thu, Apr 01, 2021 at 04:19:03PM +0800, Ye Bin wrote:
> Test steps:
> 1. mount /dev/sda -o errors=panic test
> 2. mount /dev/sda -o remount,ro test
> 3. mount /dev/sda -o remount,abort test
> 
> Before 014c9caa29d3 not been merged there will trigger panic. But
> 014c9caa29d3 change this behavior.

This makes sense, but I'll note this behavior has changed over time.

root@kvm-xfstests:~# mount -o errors=panic /dev/vdc /vdc
[   20.252713] EXT4-fs (vdc): mounted filesystem with ordered data mode. Opts: 
errors=panic
root@kvm-xfstests:~# mount -o remount,ro /dev/vdc
[   24.832448] EXT4-fs (vdc): re-mounted. Opts: (null)
root@kvm-xfstests:~# mount -o remount,abort /dev/vdc
[   30.833543] EXT4-fs error (device vdc): ext4_remount:5340: Abort forced by 
user
mount: /vdc: cannot remount /dev/vdc read-write, is write-protected.
root@kvm-xfstests:~# mount -o remount,abort,ro /dev/vdc
[   34.545549] EXT4-fs error (device vdc): ext4_remount:5340: Abort forced by 
user
[   34.555475] EXT4-fs (vdc): re-mounted. Opts: abort
root@kvm-xfstests:~# uname -a
Linux kvm-xfstests 4.19.163-xfstests #1 SMP Sat Dec 19 23:55:11 EST 2020 x86_64 
GNU/Linux
root@kvm-xfstests:~#

The same is true for the 5.4 kernel.

I do agree that it *should* force a panic, and the fact that
superblock is read-only shouldn't make a difference as to how
errors=panic is handled.

So I think the patch is correct, but I'll note that it also changes
this case:

1)  mount -o /dev/sda -o ro,errors=panic test
2)  echo test > /sys/fs/ext4/sda/trigger_fs_error

With this patch, this will also now panic, whereas before, an
ext4_error() would not trigger a panic.  I think that's better because
it makes things more consistent --- but it is a change in behavior
which could potentially surprise some people.  But since they can
easily get the previous behavior with an explicit "mount -o
ro,errors=continue", I think that's acceptable.

I'll apply the patch with a modified commit description to warn of
this particular change in behavior.

- Ted

Re: [PATCH] ext4: Delete redundant uptodate check for buffer

2021-04-09 Thread Theodore Ts'o

On Thu, Apr 01, 2021 at 04:50:01PM +0530, Ritesh Harjani wrote:
> On 21/04/01 03:03PM, Shaokun Zhang wrote:
> > From: Yang Guo 
> >
> > The buffer uptodate state has been checked in function set_buffer_uptodate,
> > there is no need use buffer_uptodate before calling set_buffer_uptodate and
> > delete it.
> >
> > Cc: "Theodore Ts'o" 
> > Cc: Andreas Dilger 
> > Signed-off-by: Yang Guo 
> > Signed-off-by: Shaokun Zhang 
> > ---
> >  fs/ext4/inode.c | 6 ++
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> Thanks for the patch. Changes looks good and trivial.
> 
> Feel free to add -
> Reviewed-by: Ritesh Harjani 

Thanks, applied.

- Ted

Re: [PATCH 03/17] bit_spinlock: Prepare for split_locks

2021-04-09 Thread Theodore Ts'o

On Fri, Apr 09, 2021 at 03:35:55PM +0100, Matthew Wilcox wrote:
> > This changes the function signature for bit_spin_lock(), if I'm
> > reading this correctly.  Hence, this is going to break git
> > bisectability; was this patch series separated out for easy of review,
> > and you were planning on collapsing things into a single patch to
> > preserve bisectability?
> 
> It's perfectly bisectable.
> 
> Before: bit_spin_lock takes two arguments
> During: bit_spin_lock takes at least two arguments, ignores all but the first 
> two
> After: bit_spin_lock takes three arguments

Ah, got it, thanks for the clarification!

- Ted

Re: [PATCH 03/17] bit_spinlock: Prepare for split_locks

2021-04-09 Thread Theodore Ts'o

On Fri, Apr 09, 2021 at 03:51:17AM +0100, Matthew Wilcox (Oracle) wrote:
> Make bit_spin_lock() and variants variadic to help with the transition.
> The split_lock parameter will become mandatory at the end of the series.
> Also add bit_spin_lock_nested() and bit_spin_unlock_assign() which will
> both be used by the rhashtable code later.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

This changes the function signature for bit_spin_lock(), if I'm
reading this correctly.  Hence, this is going to break git
bisectability; was this patch series separated out for easy of review,
and you were planning on collapsing things into a single patch to
preserve bisectability?

- Ted

Re: [PATCH] ext4: Fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed

2021-04-08 Thread Theodore Ts'o

On Wed, Apr 07, 2021 at 09:41:57AM +0800, yebin wrote:
> > > If call ext4_ext_insert_extent failed but new extent already inserted, we 
> > > just
> > > update "ex->ee_len = orig_ex.ee_len", this will lead to extent overlap, 
> > > then
> > > cause bug on when cache extent.
> > How did this happen in the first place?  It sounds like if the extent
> > was already inserted, that would be casue there was an on-disk file
> > system corruption, no?
> > 
> > In that case, shouldn't we call ext4_error() to declare the file
> > system has an inconsistency, so it can be fixed by fsck?
> We inject IO fault when runing  fsstress,  JBD detect IO error then trigger
> JBD abort.  At the same time,
> if ext4_ext_insert_extent already insert new exntent then call
> ext4_ext_dirty to dirty metadata , but
> JBD already aborted ,  ext4_ext_dirty will return error.
> In ext4_ext_dirty function call  ext4_ext_check_inode check extent if ok, if
> not, trigger BUG_ON and
> also print extent detail information.

In this particular case, skipping the "ex->ee_len = orig_ex.ee_len"
may avoid the BUG_ON.  But it's not clear that this is always the
right thing to do.  The fundamental question is what should we do we
run into an error while we are in the middle of making changes to
on-disk and in-memory data structures?

In the ideal world, we should undo the changes that we were in the
middle of making before we return an error.  That way, the semantics
are very clear; on success, the function has made the requested change
to the file system.  If the function returns an error, then no changes
should be made.

That was the reasoning behind resetting ex->ee_len to orig_ex.ee_len
in the fix_extent_len inside ext4_split_extent_at().  Unofrtunately,
ext4_ext_insert_extent() does *not* always follow this convention, and
that's because it would be extremely difficult for it to do so --- the
mutations that it makes can be quite complex, including potentially
increasing the height of the extent tree.

However, I don't think your fix is by any means the ideal one, because
the much more common way that ext4_ext_insert_extent() is when it
needs to insert a new leaf node, or need to increase the height of the
extent tree --- and in it returns an ENOSPC failure.  In that case, it
won't have made any changes changes in the extent tree, and so having
ext4_split_extent_at() undo the change to ex->ee_len is the right
thing to do.

Having blocks get leaked when there is an ENOSPC failure, requiring
fsck to be run --- and without giving the user any warning that this
has happened is *not* a good way to fail.  So I don't think the
proposed patch is the right way to go.

A better way to go would be to teach ext4_ext_insert_extent() so if
there is a late failure, that it unwinds the leaf node back to its
original state (at least from a semantic value).  Since the extent
leaf node could have been split, and/or adjacent extent entries may
have been merged, what it would need to do is to remember the starting
block number and length, and make whatever changes are necessaries to
the extent entries in that leaf node corresponding to that starting
block number and length.

If you don't want to do that, then a "do no harm" fix would be
something like this:

...
} else if (err == -EROFS) {
return err;
} else if (err)
goto fix_extent_len;

So in the journal abort case, when err is set to EROFS, we don't try
to reset the length, since in theory the file system is read-only
already anyway.  However, in the ENOSPC case, we won't end up silently
leaking blocks that will be lost until the user somehow decides to run
fsck.

There are still times when this doesn't get things completely right
(e.g., what if we get a late ENOMEM error versus an early ENOMEM
failure), where the only real fix is to make ext4_ext_insert_extent()
obey the convention that if it returns an error, it must not result in
any user-visible state change.

Cheers,

- Ted

Re: KCSAN: data-race in __jbd2_journal_file_buffer / jbd2_journal_dirty_metadata

2021-04-06 Thread Theodore Ts'o

On Tue, Apr 06, 2021 at 02:32:33PM +0200, Jan Kara wrote:
> And the comment explains, why we do this unreliable check. Again, if we
> wanted to silence KCSAN, we could use data_race() macro but AFAIU Ted isn't
> very fond of that annotation.

I'm not fond of the data_race macro, but I like bogus KCSAN reports
even less.  My main complaint is if we're going to have to put the
data_race() macro in place, we're going to need to annotate each
location with an explanation of why it's there (suppress a KCSAN false
positive), and why's it's safe.  If it's only one or two places, it'll
probably be fine.  If it's dozens, then I would say that KCSAN is
becoming a net negative in terms of making the Linux kernel code
maintainable.

- Ted

Re: [PATCH] ext4: Fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed

2021-04-05 Thread Theodore Ts'o

On Thu, Mar 25, 2021 at 10:29:25AM +0800, Ye Bin wrote:
> We got follow bug_on:
> [130747.323114] kernel BUG at fs/ext4/extents_status.c:762!
> [130747.323117] Internal error: Oops - BUG: 0 [#1] SMP
> ..
> [130747.334329] Call trace:
> [130747.334553]  ext4_es_cache_extent+0x150/0x168 [ext4]
> [130747.334975]  ext4_cache_extents+0x64/0xe8 [ext4]
> [130747.335368]  ext4_find_extent+0x300/0x330 [ext4]
> [130747.335759]  ext4_ext_map_blocks+0x74/0x1178 [ext4]
> [130747.336179]  ext4_map_blocks+0x2f4/0x5f0 [ext4]
> [130747.336567]  ext4_mpage_readpages+0x4a8/0x7a8 [ext4]
> [130747.336995]  ext4_readpage+0x54/0x100 [ext4]
> [130747.337359]  generic_file_buffered_read+0x410/0xae8
> [130747.337767]  generic_file_read_iter+0x114/0x190
> [130747.338152]  ext4_file_read_iter+0x5c/0x140 [ext4]
> [130747.338556]  __vfs_read+0x11c/0x188
> [130747.338851]  vfs_read+0x94/0x150
> [130747.339110]  ksys_read+0x74/0xf0
> 
> If call ext4_ext_insert_extent failed but new extent already inserted, we just
> update "ex->ee_len = orig_ex.ee_len", this will lead to extent overlap, then
> cause bug on when cache extent.

How did this happen in the first place?  It sounds like if the extent
was already inserted, that would be casue there was an on-disk file
system corruption, no?

In that case, shouldn't we call ext4_error() to declare the file
system has an inconsistency, so it can be fixed by fsck?

> If call ext4_ext_insert_extent failed don't update ex->ee_len with old value.
> Maybe there will lead to block leak, but it can be fixed by fsck later.

   - Ted

Re: [PATCH] jbd2: avoid -Wempty-body warnings

2021-04-05 Thread Theodore Ts'o

On Tue, Mar 30, 2021 at 05:15:33PM +0200, Jan Kara wrote:
> On Mon 22-03-21 11:21:38, Arnd Bergmann wrote:
> > From: Arnd Bergmann 
> > 
> > Building with 'make W=1' shows a harmless -Wempty-body warning:
> > 
> > fs/jbd2/recovery.c: In function 'fc_do_one_pass':
> > fs/jbd2/recovery.c:267:75: error: suggest braces around empty body in an 
> > 'if' statement [-Werror=empty-body]
> >   267 | jbd_debug(3, "Fast commit replay failed, err = 
> > %d\n", err);
> >   | 
> >   ^
> > 
> > Change the empty dprintk() macros to no_printk(), which avoids this
> > warning and adds format string checking.
> > 
> > Signed-off-by: Arnd Bergmann 
> 
> Sure. Feel free to add:
> 
> Reviewed-by: Jan Kara 

Applied, thanks.

- Ted

Re: [PATCH v2 2/2] ext4: Optimize match for casefolded encrypted dirs

2021-04-05 Thread Theodore Ts'o

On Fri, Mar 19, 2021 at 07:34:14AM +, Daniel Rosenberg wrote:
> Matching names with casefolded encrypting directories requires
> decrypting entries to confirm case since we are case preserving. We can
> avoid needing to decrypt if our hash values don't match.
> 
> Signed-off-by: Daniel Rosenberg 

Thanks, applied.

- Ted

Re: [PATCH v2 1/2] ext4: Handle casefolding with encryption

2021-04-05 Thread Theodore Ts'o

On Fri, Mar 19, 2021 at 07:34:13AM +, Daniel Rosenberg wrote:
> This adds support for encryption with casefolding.
> 
> Since the name on disk is case preserving, and also encrypted, we can no
> longer just recompute the hash on the fly. Additionally, to avoid
> leaking extra information from the hash of the unencrypted name, we use
> siphash via an fscrypt v2 policy.
> 
> The hash is stored at the end of the directory entry for all entries
> inside of an encrypted and casefolded directory apart from those that
> deal with '.' and '..'. This way, the change is backwards compatible
> with existing ext4 filesystems.
> 
> Signed-off-by: Daniel Rosenberg 

Applied, thanks with the following addition so that tests, e2fsprogs,
etc., can determine whether or not the currently running kernel has
this feature enabled:

diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index a3d08276d441..7367ba406e01 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -313,6 +313,7 @@ EXT4_ATTR_FEATURE(verity);
 #endif
 EXT4_ATTR_FEATURE(metadata_csum_seed);
 EXT4_ATTR_FEATURE(fast_commit);
+EXT4_ATTR_FEATURE(encrypted_casefold);
 
 static struct attribute *ext4_feat_attrs[] = {
ATTR_LIST(lazy_itable_init),
@@ -330,6 +331,7 @@ static struct attribute *ext4_feat_attrs[] = {
 #endif
ATTR_LIST(metadata_csum_seed),
ATTR_LIST(fast_commit),
+   ATTR_LIST(encrypted_casefold),
NULL,
 };
 ATTRIBUTE_GROUPS(ext4_feat);


Future versions of e2fsprogs may issue a warning if tune2fs or mke2fs
tries to modify or create a file system such that both the encryption
and casefold feature is enabled if it appears that the kernel won't
support this combination.  Daniel, if you could try to get this change
into the Android kernels that are using encrypted casefold, that would
be a good thing.

- Ted

Re: [PATCH v3] Updated locking documentation for transaction_t

2021-04-02 Thread Theodore Ts'o

On Thu, Feb 11, 2021 at 06:14:10PM +0100, Alexander Lochmann wrote:
> Some members of transaction_t are allowed to be read without
> any lock being held if accessed from the correct context.
> We used LockDoc's findings to determine those members.
> Each member of them is marked with a short comment:
> "no lock needed for jbd2 thread".
> 
> Signed-off-by: Alexander Lochmann 
> Signed-off-by: Horst Schirmeier 
> Reviewed-by: Jan Kara 

Thanks, applied.  I reflowed the comments before applying the patch.

  - Ted

Re: [PATCH v2] Updated locking documentation for journal_t

2021-04-02 Thread Theodore Ts'o

On Thu, Feb 11, 2021 at 10:51:55AM +0100, Alexander Lochmann wrote:
> Some members of transaction_t are allowed to be read without
> any lock being held if consistency doesn't matter.
> Based on LockDoc's findings, we extended the locking
> documentation of those members.
> Each one of them is marked with a short comment:
> "no lock for quick racy checks".
> 
> Signed-off-by: Alexander Lochmann 
> Signed-off-by: Horst Schirmeier 
> Reviewed-by: Jan Kara 

Thanks, applied.  I had to fix up the patch, which was mailer-damaged,
and I also reflowed the comments.

- Ted

Re: [PATCH v8 0/6] Optionally randomize kernel stack offset each syscall

2021-04-01 Thread Theodore Ts'o

On Thu, Apr 01, 2021 at 01:49:17PM -0700, Roy Yang wrote:
> Thanks Ted, Casey and Al Viro. I am sorry for the inconvenience.
> 
> I tried to follow the instructions listed under
> https://lore.kernel.org/lkml/20210330205750.428816-1-keesc...@chromium.org/
> using git-send-email
> 
> Thought that will reply to the original thread with the original
> subject . Let me know what I can do to correct this to avoid
> confusion.

It did chain to the original thread; that's how I was able to figure
out the context.  However, it looks like in your reply, you set the
subject to be:

[PATCH] Where we are for this patch?

And the problem is if someone had deleted the original e-mail chain
--- for example, optimizing the kernel stack is not one of the
subjects that I normally closely track, I had already deleted those
e-mail chains.  So all I saw (and I suspect all Al saw) was a message
with the above subject line, and no context.

If you had kept the original subject line, then those of us who mostly
focus on file system stuff would have known that it's an area which is
covered by other maintainers, and we wouldn't have deleted your query
and let someone else respond.

Cheers,

- Ted

P.S.  I personally find the use "git-send-email" to reply to be so
much of a pain that I just use a non-google.com address to which I can
use a traditional threaded mail-reader (such as mutt) and I can send
e-mail without having to worry about the DMARC nonsense.  :-)

Re: [PATCH] Where we are for this patch?

2021-04-01 Thread Theodore Ts'o

On Thu, Apr 01, 2021 at 07:48:30PM +, Al Viro wrote:
> On Thu, Apr 01, 2021 at 12:17:44PM -0700, Roy Yang wrote:
> > Both Android and Chrome OS really want this feature; For 
> > Container-Optimized OS, we have customers
> > interested in the defense too.
> > 
> > Thank you very much.
> > 
> > Change-Id: I1eb1b726007aa8f9c374b934cc1c690fb4924aa3
> 
>   You forgot to tell what patch you are refering to.  Your
> Change-Id (whatever the hell that is) doesn't help at all.  Don't
> assume that keys in your internal database make sense for the
> rest of the world, especially when they appear to contain a hash
> of something...

The Change-Id fails to have any direct search hits at lore.kernel.org.
However, it turn up Roy's original patch, and clicking on the
message-Id in the "In-Reply-Field", it apperas Roy was replying to
this message:

https://lore.kernel.org/lkml/20210330205750.428816-1-keesc...@chromium.org/

which is the head of this patch series:

Subject: [PATCH v8 0/6] Optionally randomize kernel stack offset each syscall

That being said, it would have been better if the original subject
line had been preserved, and it's yet another example of how the
lore.kernel.org URL is infinitely better than the Change-Id.  :-)

  - Ted

Re: [PATCH v5 4/4] fs: unicode: Add utf8 module and a unicode layer

2021-03-30 Thread Theodore Ts'o

On Mon, Mar 29, 2021 at 10:47:52PM -0700, Eric Biggers wrote:
> > Isn't this a user problem?  If the modules required to boot are on the
> > filesystem itself, you are in trouble.  But, if that is the case, your
> > rootfs is case-insensitive and you gotta have utf8 as built-in or have
> > it in an early userspace.
> 
> We could make it the user's problem, but that seems rather unfriendly.
> Especially because the utf8 module would be needed if the filesystem has the
> casefold feature at all, regardless of whether any casefolded directories are
> needed at boot time or not.  (Unless there is a plan to change that?)

I guess I'm not that worried, since the vast majority of desktop
distribution are using initial ramdisks these days.  And if someone
did build a monolithic kernel that couldn't mount the root file
system, they would figure that out pretty quickly.

The biggest problem they would have with trying to enable encryption
or casefolding on the root file system is that if they are using Grub,
older versions of Grub would see an unknown incompat feature, and
immediately have heartburn, and refuse to touch whatever file system
/boot is located on.  If the distribution has /boot as a stand-alone
partition, that won't be a problem, but if you have a single file
system which includes the location of kernels and initrds' are
located, the moment you try set the encryption or casefold on the file
system, you're immediately hosed --- and if you do this on a laptop
while you are on an airplane, without thinking things through, and
without access to a rescue USB thumb drive, life can
get... interesting.  (Why, yes, I'm speaking from direct experience;
why do you ask?  :-)

So in comparison to making such a mistake, building a kernel that was
missing casefold, and needing to fall back to an older kernel is not
really that bad of a user experience.  You just have to fall back the
distro kernel, which most kernel developers who are dogfooding
bleeding kernels are probably smart enough keep one around.

We *could* teach ext4 to support mounting file systems that have
casefold, without having the unicode module loaded, which would make
things a bit better, but I'm not sure it's worth the effort.  We could
even make the argument that letting the system boot, and then having
access to some directories return ENOTSUPP would actually be a more
confusing user experience than a simple hard failure when we try
mounting the file system.

Cheers,

- Ted

Re: [Ksummit-discuss] RFC: create mailing list "linux-issues" focussed on issues/bugs and regressions

2021-03-23 Thread Theodore Ts'o

On Tue, Mar 23, 2021 at 09:57:57AM +0100, Thorsten Leemhuis wrote:
> On 22.03.21 22:56, Theodore Ts'o wrote:
> > On Mon, Mar 22, 2021 at 08:25:15PM +0100, Thorsten Leemhuis wrote:
> >> I agree to the last point and yeah, maybe regressions are the more
> >> important problem we should work on – at least from the perspective of
> >> kernel development.  But from the users perspective (and
> >> reporting-issues.rst is written for that perspective) it feel a bit
> >> unsatisfying to not have a solution to query for existing report,
> >> regressions or not. H...
> > First of all, thanks for working on reporting-issues.rst.
> 
> Thx, very glad to hear that. I didn't get much feedback on it, which
> made me wonder if anybody besides docs folks actually looked at it...

I'll admit that I had missed your initial submission, but having
looked at it, while I could imagine some nits where it could be
improved, in my opinion, it's strictly better than the older
reporting-bugs doc.

> Hmmm, yeah, I like that idea. I'll keep it in mind for later: I would
> prefer to get reporting-issues.rst officially blessed and
> reporting-bugs.rst gone before working on further enhancements.

Is there anyone following this thread who believes that there is
anything we should change *before* oficially blessing
reporting-issues.rst?  Given that Konstantin has already linked to
reporting-issues from the front page of kernel.bugzilla.org, I think
we we should just go ahead and officially bless it and be done with
it.   :-)

Once it is blessed, I'd also suggest putting a link to
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
as an "other resources" at https://www.kernel.org.

- Ted

Re: RFC: create mailing list "linux-issues" focussed on issues/bugs and regressions

2021-03-23 Thread Theodore Ts'o

On Tue, Mar 23, 2021 at 12:20:25PM -0400, Steven Rostedt wrote:
> On Mon, 22 Mar 2021 20:25:15 +0100
> Thorsten Leemhuis  wrote:
> 
> > I agree to the last point and yeah, maybe regressions are the more
> > important problem we should work on – at least from the perspective of
> > kernel development.  But from the users perspective (and
> > reporting-issues.rst is written for that perspective) it feel a bit
> > unsatisfying to not have a solution to query for existing report,
> > regressions or not. H...
> 
> I think the bulk of user issues are going to be regressions. Although you
> may be in a better position to know for sure, but at least for me, wearing
> my "user" hat, the thing that gets me the most is upgrading to a new kernel
> and suddenly something that use to work no longer does. And that is the
> definition of a regression. My test boxes still run old distros (one is
> running fedora 13). These are the boxes that catch the most issues, and if
> they do, they are pretty much guaranteed to be a regression.

I think it depends on the user and the subsystem.  You're a
sophisticated user, but I've fielded a goodly number of ext4 "bug
reports" which were coming from a Ubuntu 16.04 kernel, or a user who
is seeing a block device issue (either a driver bug or a hardware
failure), or in some cases both.

A lot of these "bug reports" would be headed off at the pass if we
advertised:

https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

much more heavily; assuming we can get the users to actually read it,
first.

- Ted

Re: [Ksummit-discuss] RFC: create mailing list "linux-issues" focussed on issues/bugs and regressions

2021-03-22 Thread Theodore Ts'o

On Mon, Mar 22, 2021 at 08:25:15PM +0100, Thorsten Leemhuis wrote:
> I agree to the last point and yeah, maybe regressions are the more
> important problem we should work on – at least from the perspective of
> kernel development.  But from the users perspective (and
> reporting-issues.rst is written for that perspective) it feel a bit
> unsatisfying to not have a solution to query for existing report,
> regressions or not. H...

First of all, thanks for working on reporting-issues.rst.  If we can
actually get users to *read* it, I think it's going to save kernel
developers a huge amount of time and frustration.

I wonder if it might be useful to have a form which users could be
encouraged to fill out so that (a) the information is available in a
structured format so it's easier for developers to find the relevant
information, (b) so it is easier for programs to parse, for easier
reporting or indexing, and (c) as a nudge so that users remember to
report critical bits of information such as the hardware
configuration, the exact kernel version, which distribution userspace
was in use, etc.

There could also be something in the text form which would make it
easier for lore.kernel.org searches to identify bug reports.  (e.g.,
"LINUX KERNEL BUG REPORTER TEMPLATE")

- Ted

Re: [GIT PULL] ext4 fixes for v5.12

2021-03-22 Thread Theodore Ts'o

On Mon, Mar 22, 2021 at 11:10:52PM +1100, Herbert Xu wrote:
> Theodore Ts'o  wrote:
> > 
> > From: 曹子德(Theodore Y Ts'o) 
> 
> "Yue" doesn't seem to match your second character which is usually
> romanised as "Tze" in Cantonese, could it be
> 
>   曹予德

Quite right.  I hadn't noticed that I had the wrong character when I
cut and pasted.  Thanks for pointing that out!

 - Ted

Re: [GIT PULL] ext4 fixes for v5.12

2021-03-21 Thread Theodore Ts'o

On Mon, Mar 22, 2021 at 11:05:13AM +0800, Gao Xiang wrote:
> I think the legel name would be "Zhang Yi" (family name goes first [1])
> according to
> The Chinese phonetic alphabet spelling rules for Chinese names [2].
> 
> Indeed, that is also what the legel name is written in alphabet on our
> passport or credit/debit cards.
> 
> Also, many official English-written materials use it in that way, for
> example, a somewhat famous bastetball player Yao Ming [3][4][5].
> 
> That is what I wrote my own name as this but I also noticed the western
> ordering of names is quite common for Chinese people in Linux kernel.
> Anyway, it's just my preliminary personal thought (might be just my
> own perference) according to (I think, maybe) formal occasions.

Yeah, there doesn't seem to be a lot of consistency with the ordering
of Chinese names when they are written in Roman characters.  Some
people put the family name first, and other people will put the
personal (first) name first.  In some cases it may be because the
developer in question is living in America, and so they've decided to
use the American naming convention.  (Two example of this are former
ext4 developers Mingming Cao and Jiaying Zhang, who live in Portland
and Los Angelos, and their family names are Cao and Zhang,
respectively.)

My personal opinion is people should use whatever name they are
comfortable with, using whatever characters they prefer.  The one
thing that would be helpful for me is for people to give a hint about
how they would prefer to be called --- for example, would you prefer
that start an e-mail with the salutation, "Hi Gao", "Hi Xiang", or "Hi
Gao Xiang"?

And if I don't know, and I guess wrong, please feel free to correct
me, either privately, or publically on the e-mail list, if you think
it would be helpful for more people to understand how you'd prefer to
be called.

Cheers,

- Ted

Re: [GIT PULL] ext4 fixes for v5.12

2021-03-21 Thread Theodore Ts'o

On Mon, Mar 22, 2021 at 09:33:54AM +0800, zhangyi (F) wrote:
> 
> I will use my real name "Yi Zhang" next time.
>

Hi Yi,

I think what Linus was suggsting was that if people wanted, they could
do something like this in their git commits:

From: 曹子德(Theodore Y Ts'o) 

I don't do this because my legal name is actually Theodore Yue Tak
Ts'o (where Ts'o Yue Tak is the standard romanization of my Chinese
name in Cantonese --- my parents were from Hong Kong), and even though
Cantonese is technically the first langauge I learned as a child, at
this point I'm probably more fluent in Spanish (my third language)
than Cantonese.  :-)

In any case, git and e-mail should be able to handle non-Roman
characters so if you want to insert your name in Chinese in your Git
authorship, please feel free to do so.  Or not --- it's totally up to
you.

Cheers,

- Ted

[GIT PULL] ext4 fixes for v5.12

2021-03-21 Thread Theodore Ts'o

The following changes since commit a38fd8748464831584a19438cbb3082b5a2dab15:

  Linux 5.12-rc2 (2021-03-05 17:33:41 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_stable

for you to fetch changes up to 64395d950bc476106b39341e42ebfd4d2eb71d2c:

  ext4: initialize ret to suppress smatch warning (2021-03-21 00:45:37 -0400)


Miscellaneous ext4 bug fixes for v5.12.


Eric Biggers (1):
  ext4: fix error handling in ext4_end_enable_verity()

Eric Whitney (1):
  ext4: shrink race window in ext4_should_retry_alloc()

Harshad Shirwadkar (1):
  ext4: fix rename whiteout with fast commit

Jan Kara (2):
  ext4: add reclaim checks to xattr code
  ext4: fix timer use-after-free on failed mount

Pan Bian (1):
  ext4: stop inode update before return

Sabyrzhan Tasbolatov (1):
  fs/ext4: fix integer overflow in s_log_groups_per_flex

Shijie Luo (1):
  ext4: fix potential error in ext4_do_update_inode

Theodore Ts'o (1):
  ext4: initialize ret to suppress smatch warning

Zhaolong Zhang (1):
  ext4: fix bh ref count on error paths

zhangyi (F) (3):
  ext4: find old entry again if failed to rename whiteout
  ext4: do not iput inode under running transaction in ext4_rename()
  ext4: do not try to set xattr into ea_inode if value is empty

 fs/ext4/balloc.c  | 38 +
 fs/ext4/ext4.h|  3 ++
 fs/ext4/extents.c |  2 +-
 fs/ext4/fast_commit.c |  9 --
 fs/ext4/inode.c   | 18 ++--
 fs/ext4/mballoc.c | 11 ++--
 fs/ext4/namei.c   | 50 +
 fs/ext4/super.c   |  7 -
 fs/ext4/sysfs.c   |  7 +
 fs/ext4/verity.c  | 89 
---
 fs/ext4/xattr.c   |  6 +++-
 11 files changed, 168 insertions(+), 72 deletions(-)

Re: ext4: stop inode update before return

2021-03-20 Thread Theodore Ts'o

On Sun, Jan 17, 2021 at 12:57:32AM -0800, Pan Bian wrote:
> The inode update should be stopped before returing the error code.
> 
> Signed-off-by: Pan Bian 

Thanks, applied.

- Ted

Re: [syzbot] KCSAN: data-race in start_this_handle / start_this_handle

2021-03-11 Thread Theodore Ts'o

On Thu, Mar 11, 2021 at 04:08:30PM +0100, Marco Elver wrote:
> If the outcome of the check does not affect correctness and the code is
> entirely fault tolerant to the precise value being read, then a
> data_race(!journal->j_running_transaction) marking here would be fine.

So a very common coding pattern is to check a value w/o the lock, and
if it looks like we might need to check *with* a lock, we'll grab the
lock and recheck.  Does KCSAN understand that this sort of thing is
safe automatically?

In thie particular case, it's a bit more complicated than that; we're
checking a value, and then allocating memory, grabbing the spin lock,
and then re-checking the value, so we don't have to drop the spinlock,
allocate the memory, grab the lock again, and then rechecking the
value.  So even if KCSAN catches the simpler case as described above,
we still might need to explicitly mark the data_race explicitly.

But the more we could have the compiler automatically figure out
things without needing an explicit tag, it would seem to me that this
would be better, since manual tagging is going to be more error-prone.

Cheers,

   - Ted

Re: [RFC] inode.i_opflags - Usage of two different locking schemes

2021-03-05 Thread Theodore Ts'o

On Fri, Mar 05, 2021 at 04:35:47PM +0100, Alexander Lochmann wrote:
> 
> 
> On 05.03.21 16:18, Theodore Ts'o wrote:
> > 1)  I don't see where i_opflags is being read in ipc/mqueue.c at all,
> > either with or without i_rwsem.
> > 
> It is read in fs/dcache.c

So why is this unique to the mqueue inode then?  It might be helpful
to have explicit call stacks in the e-mail, in text form, when you
resend to LKML.

That's because the HTML file is ***huge*** (1.7Meg), and I'm having
trouble with my browser properly rendering it.  In any case, the html
claims to be showing the counter examples and I'm still stuck on the
*example*?

Cheers,

- Ted

Re: [RFC] inode.i_opflags - Usage of two different locking schemes

2021-03-05 Thread Theodore Ts'o

On Fri, Mar 05, 2021 at 02:10:09PM +0100, Alexander Lochmann wrote:
> Hi folks,
> 
> I've stumbled across an interesting locking scheme. It's related to struct
> inode, more precisely it is an mqueue inode.
> Our results show that inode:mqueue.i_opflags is read with i_rwsem being
> hold.
> In d_flags_for_inode, and do_inode_permission the i_lock is used to read and
> write i_opflags.
> Is this a real locking scheme? Is a lock needed to access i_opflags at all?
> What is the magic behind this contradiction?
> 
> I've put the report of the counterexamples on our webserver:
> https://ess.cs.tu-dortmund.de/lockdoc-bugs/cex-inode-mqueue.html.
> It contains the stacktraces leading to those accesses, and the locks that
> were actually held.

1)  I don't see where i_opflags is being read in ipc/mqueue.c at all,
either with or without i_rwsem.

2)  I'm not sure what this has to do with ext4?

  - Ted

Re: [PATCH] ext4: fix bh ref count on error paths

2021-03-05 Thread Theodore Ts'o

On Tue, Mar 02, 2021 at 05:42:31PM +0800, Zhaolong Zhang wrote:
> __ext4_journalled_writepage should drop bhs' ref count on error paths
> 
> Signed-off-by: Zhaolong Zhang 

Thanks, applied.

- Ted

Re: [PATCH v2] fs/ext4: fix integer overflow in s_log_groups_per_flex

2021-03-05 Thread Theodore Ts'o

On Wed, Feb 24, 2021 at 03:58:00PM +0600, Sabyrzhan Tasbolatov wrote:
> syzbot found UBSAN: shift-out-of-bounds in ext4_mb_init [1], when
> 1 << sbi->s_es->s_log_groups_per_flex is bigger than UINT_MAX,
> where sbi->s_mb_prefetch is unsigned integer type.
> 
> 32 is the maximum allowed power of s_log_groups_per_flex. Following if
> check will also trigger UBSAN shift-out-of-bound:
> 
> if (1 << sbi->s_es->s_log_groups_per_flex >= UINT_MAX) {
> 
> So I'm checking it against the raw number, perhaps there is another way
> to calculate UINT_MAX max power. Also use min_t as to make sure it's
> uint type.
> 
> [1] UBSAN: shift-out-of-bounds in fs/ext4/mballoc.c:2713:24
> shift exponent 60 is too large for 32-bit type 'int'
> Call Trace:
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x137/0x1be lib/dump_stack.c:120
>  ubsan_epilogue lib/ubsan.c:148 [inline]
>  __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
>  ext4_mb_init_backend fs/ext4/mballoc.c:2713 [inline]
>  ext4_mb_init+0x19bc/0x19f0 fs/ext4/mballoc.c:2898
>  ext4_fill_super+0xc2ec/0xfbe0 fs/ext4/super.c:4983
> 
> Reported-by: syzbot+a8b4b0c60155e87e9...@syzkaller.appspotmail.com
> Signed-off-by: Sabyrzhan Tasbolatov 

Applied, thanks.

- Ted

Re: [RFC PATCH v2 00/13] Add futex2 syscall

2021-03-04 Thread Theodore Ts'o

On Wed, Mar 03, 2021 at 09:42:06PM -0300, André Almeida wrote:
>  ** Performance
> 
>  - For comparing futex() and futex2() performance, I used the artificial
>benchmarks implemented at perf (wake, wake-parallel, hash and
>requeue). The setup was 200 runs for each test and using 8, 80, 800,
>8000 for the number of threads, Note that for this test, I'm not using
>patch 14 ("kernel: Enable waitpid() for futex2") , for reasons explained
>at "The patchset" section.

How heavily contended where the benchmarks?  One of the benefits of
the original futex was that no system call was necessary in the happy
path when the lock is uncontended.  Especially on a non-NUMA system
(which are the far more common case), since that's where relying on a
single memory access was a huge win for the original futex.  I would
expect that futex2 will fare worse in this particular case, since it
requires a system call entry for all operations --- the question is
how large is the delta in this worst case (for futex2) and best case
(for futex) scenario.

Cheers,

- Ted

[GIT PULL] ext4 changes for v5.12

2021-02-25 Thread Theodore Ts'o

The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:

  Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus

for you to fetch changes up to 0a76945fd1ba2ab44da7b578b311efdfedf92e6c:

  ext4: add .kunitconfig fragment to enable ext4-specific tests (2021-02-11 
23:16:30 -0500)


Miscellaneous ext4 cleanups and bug fixes.  Pretty boring this
cycle...


Daejun Park (1):
  ext4: Change list_for_each* to list_for_each_entry*

Daniel Latypov (1):
  ext4: add .kunitconfig fragment to enable ext4-specific tests

Eric Whitney (1):
  ext4: reset retry counter when ext4_alloc_file_blocks() makes progress

Geert Uytterhoeven (1):
  ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it

Theodore Ts'o (2):
  ext4: don't try to processed freed blocks until mballoc is initialized
  ext4: fix potential htree index checksum corruption

Vinicius Tinti (1):
  ext4: factor out htree rep invariant check

Zheng Yongjun (1):
  ext4: use DEFINE_MUTEX() for mutex lock

 fs/ext4/.kunitconfig  |  3 +++
 fs/ext4/Kconfig   |  3 +--
 fs/ext4/extents.c | 16 
 fs/ext4/fast_commit.c | 29 +++--
 fs/ext4/namei.c   | 45 +++--
 fs/ext4/super.c   | 12 +---
 6 files changed, 59 insertions(+), 49 deletions(-)
 create mode 100644 fs/ext4/.kunitconfig

Re: 5.10 LTS Kernel: 2 or 6 years?

2021-02-19 Thread Theodore Ts'o

On Fri, Feb 19, 2021 at 12:16:12PM +0100, Greg Kroah-Hartman wrote:
> 
> Great!  Can you run 'git bisect' on the 4.14.y stable tree to find the
> offending change?

To be fair, especially with WiFi bugs, you may need to run for hours
or days before you are absolutely sure that a particular bisection
point will result in the the locking/kernel crash bug to manifest
itself.  Worse, you have to be actively using the Wifi in order to see
the problem, and in some cases, it only triggers when you switch
between AP's, so you have to be actively using it in the work office,
and taking it between conference rooms, only to see your machine crash
taking your unsaved work, email drafts, etc. with it.

That being said, users should at least report the bug.  (That's what I
did, when I ran into this a bunch of years ago, with an explanation of
"I'm trying to do a bisect, but it may take a few weeks for me to
figure out what the !@#!? is going on.  In my case, I was trying to
use upstream -rcX kernels to dogfood on my work laptop, but the
principle is the same.)

Ultiumately, I solved the problem, by switching laptops to one that
didn't use an NVidia GPU (which sometimes forced me to stay 1-2
upstream versions behind, making life even more difficult when
debugging these issues, until the out-of-tree video driver got updated
to work with newer upstream), and which also had WiFi hardware which
was less subject to these issues.

It's unfortunate, but not all hardware is as well supported on Linux.
And in my case, because $WORK was using Enterprise WiFi systems with
AP's that don't get as much testing by developers, very few other
people could repro the bugs.  That's life, and sometimes the only
solution is to switch hardware.  And/or you just use a Chromebook in
those sorts of situations, separating your work/enterprise and
upstream development hardware, and be done with it.  :-)

Cheers,

- Ted

Re: [PATCH 1/2] ext4: Handle casefolding with encryption

2021-02-19 Thread Theodore Ts'o

On Wed, Feb 17, 2021 at 03:48:39PM -0700, Andreas Dilger wrote:
> It would be possible to detect if the encrypted/casefold+dirdata
> variant is in use, because the dirdata variant would have the 0x40
> bit set in the file_type byte.  It isn't possible to positively
> identify the "raw" non-dirdata variant, but the assumption would be
> if (rec_len >= round_up(name_len, 4) + 8) in an encrypted+casefold
> directory that the "raw" hash must be present in the dirent.

Consider a 4k directory directory block which has only three entries,
".", "..", and "a".  The directory entry for "a" will have a rec_len
substantially larger than name_len.

Fortunatelly, the "raw" non-dirdata variant case easily can be
detected.  If the directory has the encryption and casefold set, and
the 0x40 bit is not set, then raw must be present, assuming that the
directory block has not been corrupted (but if it's corrupted, all
bets are off).

   - Ted

Re: [PATCH 1/2] ext4: Handle casefolding with encryption

2021-02-17 Thread Theodore Ts'o

On Tue, Feb 16, 2021 at 08:01:11PM -0800, Daniel Rosenberg wrote:
> I'm not sure what the conflict is, at least format-wise. Naturally,
> there would need to be some work to reconcile the two patches, but my
> patch only alters the format for directories which are encrypted and
> casefolded, which always must have the additional hash field. In the
> case of dirdata along with encryption and casefolding, couldn't we
> have the dirdata simply follow after the existing data? Since we
> always already know the length, it'd be unambiguous where that would
> start. Casefolding can only be altered on an empty directory, and you
> can only enable encryption for an empty directory, so I'm not too
> concerned there. I feel like having it swapping between the different
> methods makes it more prone to bugs, although it would be doable. I've
> started rebasing the dirdata patch on my end to see how easy it is to
> mix the two. At a glance, they touch a lot of the same areas in
> similar ways, so it shouldn't be too hard. It's more of a question of
> which way we want to resolve that, and which patch goes first.
> 
> I've been trying to figure out how many devices in the field are using
> casefolded encryption, but haven't found out yet. The code is
> definitely available though, so I would not be surprised if it's being
> used, or is about to be.

The problem is in how the space after the filename in a directory is
encoded.  The dirdata format is (mildly) expandable, supporting up to
4 different metadata chunks after the filename, using a very
compatctly encoded TLV (or moral equivalent) scheme.  For directory
inodes that have both the encyption and compression flags set, we have
a single blob which gets used as the IV for the crypto.

So it's the difference between a simple blob that is only used for one
thing in this particular case, and something which is the moral
equivalent of simple ASN.1 or protobuf encoding.

Currently, datadata has defined uses for 2 of the 4 "chunks", which is
used in Lustre servers.  The proposal which Andreas has suggested is
if the dirdata feature is supported, then the 3rd dirdata chunk would
be used for the case where we currently used by the
encrypted-casefolded extension, and the 4th would get reserved for a
to-be-defined extension mechanism.

If there ext4 encrypted/casefold is not yet in use, and we can get the
changes out to all potential users before they release products out
into the field, then one approach would be to only support
encrypted/casefold when dirdata is also enabled.

If ext4 encrypted/casefold is in use, my suggestion is that we support
both encrypted/casefold && !dirdata as you have currently implemented
it, and encrypted/casefold && dirdata as Andreas has proposed.

IIRC, supporting that Andreas's scheme essentially means that we use
the top four bits in the rec_len field to indicate which chunks are
present, and then for each chunk which is present, there is a 1 byte
length followed by payload.  So that means in the case where it's
encrypted/casefold && dirdata, the required storage of the directory
entry would take one additional byte, plus setting a bit indicating
that the encrypted/casefold dirdata chunk was present.

So, no, they aren't incompatible ultimatly, but it might require a
tiny bit more work to integrate the combined support for dirdata plus
encrypted/casefold.  One way we can do this, if we have to support the
current encrypted/casefold format because it's out there in deployed
implementations already, is to integrate encrypted/casefold &&
!dirdata first upstream, and then when we integrate dirdata into
upstream, we'll have to add support for the encrypted/casefold &&
dirdata case.  This means that we'll have two variants of the on-disk
format to test and support, but I don't think it's the going to be
that difficult.

Andreas, anything you'd like to correct or add in this summary?

- Ted

Re: possible deadlock in dquot_commit

2021-02-12 Thread Theodore Ts'o

>From: Theodore Ts'o 

On Fri, Feb 12, 2021 at 12:01:51PM +0100, Dmitry Vyukov wrote:
> > >
> > > There is a reproducer for 4.19 available on the dashboard. Maybe it will 
> > > help.
> > > I don't why it did not pop up on upstream yet, there lots of potential
> > > reasons for this.
> >
> > The 4.19 version of the syzbot report has a very different stack
> > trace.  Instead of it being related to an apparent write to the quota
> > file, it is apparently caused by a call to rmdir:
> >
>
> The 4.19 reproducer may reproducer something else, you know better. I
> just want to answer points re syzkaller reproducers. FTR the 4.19
> reproducer/reproducer is here:
> https://syzkaller.appspot.com/bug?id=b6cacc9fa48fea07154b8797236727de981c1e02

Yes, I know.  That was my point.  I don't think it's useful for
debugging the upstream dquot_commit syzbot report (for which we don't
have a reproducer yet).

> > there is never any attempt to run rmdir() on the corrupted file system that 
> > is mounted.
> 
> Recursive rmdir happens as part of test cleanup implicitly, you can
> see rmdir call in remove_dir function in the C reproducer:
> https://syzkaller.appspot.com/text?tag=ReproC=12caea3790

That rmdir() removes the mountpoint, which is *not* the fuzzed file
system which has the quota feature enabled.

> > procid never gets incremented, so all of the threads only operate on 
> > /dev/loop0
> 
> This is intentional. procid is supposed to "isolate" parallel test
> processes (if any). This reproducer does not use parallel test
> processes, thus procid has constant value.

Um... yes it does:

int main(void)
{
  syscall(__NR_mmap, 0x1000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x2000ul, 0x100ul, 7ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x2100ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  use_temporary_dir();
  loop();
  return 0;
}

and what is loop?

static void loop(void)
{
  int iter = 0;
  for (;; iter++) {
...
reset_loop();
int pid = fork();
if (pid < 0)
  exit(1);
if (pid == 0) {
  if (chdir(cwdbuf))
exit(1);
  setup_test();
  execute_one();
  exit(0);
}
... 
remove_dir(cwdbuf);
  }
}

> > Am I correct in understanding that when syzbot is running, it uses the 
> > syzbot repro, and not the C repro?
> 
> It tries both. If first tries to interpret "syzkaller program" as it
> was done when the bug was triggered during fuzzing. But then it tries
> to convert it to a corresponding stand-alone C program and confirms
> that it still triggers the bug. If it provides a C reproducer, it
> means that it did trigger the bug using this exact C program on a
> freshly booted kernel (and the provided kernel oops is the
> corresponding oops obtained on this exact program).
> If it fails to reproduce the bug with a C reproducer, then it provides
> only the "syzkaller program" to not mislead developers.

Well, looking at the C reproducer, it doesn't reproduce on upstream,
and the stack trace makes no sense to me.  The rmdir() executes at the
end of the test, as part of the cleanup, and looking at the syzkaller
console, the stack trace involving rmdir happens *early* while test
threads are still trying to mount the file system.

  - Ted

Re: [PATCH] ext4: add .kunitconfig fragment to enable ext4-specific tests

2021-02-11 Thread Theodore Ts'o

On Tue, Feb 09, 2021 at 05:32:06PM -0800, Daniel Latypov wrote:
> As of [1], we no longer want EXT4_KUNIT_TESTS and others to `select`
> their deps. This means it can get harder to get all the right things
> selected as we gain more tests w/ more deps over time.
> 
> This patch (and [2]) proposes we store kunitconfig fragments in-tree to
> represent sets of tests. (N.B. right now we only have one ext4 test).
> 
> There's still a discussion to be had about how to have a hierarchy of
> these files (e.g. if one wanted to test all of fs/, not just fs/ext4).
> 
> But this fragment would likely be a leaf node and isn't blocked on
> deciding if we want `import` statements and the like.
> 
> Usage
> =
> 
> Before [2] (on its way to being merged):
>   $ cp fs/ext4/.kunitconfig .kunit/
>   $ ./tools/testing/kunit.py run
> 
> After [2]:
>   $ ./tools/testing/kunit.py run --kunitconfig=fs/ext4/.kunitconfig

Thanks, applied, with one minor fixup.  The path to kunit.py is

./tools/testing/kunit/kunit.py

- Ted

Re: [PATCH resend] ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it

2021-02-11 Thread Theodore Ts'o

On Fri, Jan 22, 2021 at 12:02:34PM +0100, Geert Uytterhoeven wrote:
> EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
> user may not want to enable.  Fix this by making the test depend on
> EXT4_FS instead.
> 
> Fixes: 1cbeab1b242d16fd ("ext4: add kunit test for decoding extended 
> timestamps")
> Signed-off-by: Geert Uytterhoeven 

Applied, thanks.

- Ted

Re: possible deadlock in dquot_commit

2021-02-11 Thread Theodore Ts'o

On Thu, Feb 11, 2021 at 12:47:18PM +0100, Dmitry Vyukov wrote:
> > This actually looks problematic: We acquired >i_data_sem/2 (i.e.,
> > I_DATA_SEM_QUOTA subclass) in ext4_map_blocks() called from
> > ext4_block_write_begin(). This suggests that the write has been happening
> > directly to the quota file (or that lockdep annotation of the inode went
> > wrong somewhere). Now we normally protect quota files with IMMUTABLE flag
> > so writing it should not be possible. We also don't allow clearing this
> > flag on used quota file. Finally I'd checked lockdep annotation and
> > everything looks correct. So at this point the best theory I have is that a
> > filesystem has been suitably corrupted and quota file supposed to be
> > inaccessible from userspace got exposed but I'd expect other problems to
> > hit first in that case. Anyway without a reproducer I have no more ideas...
> 
> There is a reproducer for 4.19 available on the dashboard. Maybe it will help.
> I don't why it did not pop up on upstream yet, there lots of potential
> reasons for this.

The 4.19 version of the syzbot report has a very different stack
trace.  Instead of it being related to an apparent write to the quota
file, it is apparently caused by a call to rmdir:

 dump_stack+0x22c/0x33e lib/dump_stack.c:118
 print_circular_bug.constprop.0.cold+0x2d7/0x41e kernel/locking/lockdep.c:1221
   ...
 __mutex_lock+0xd7/0x13f0 kernel/locking/mutex.c:1072
 dquot_commit+0x4d/0x400 fs/quota/dquot.c:469
 ext4_write_dquot+0x1f2/0x2a0 fs/ext4/super.c:5644
   ...
 ext4_evict_inode+0x933/0x1830 fs/ext4/inode.c:298
 evict+0x2ed/0x780 fs/inode.c:559
 iput_final fs/inode.c:1555 [inline]
   ...
 vfs_rmdir fs/namei.c:3865 [inline]
 do_rmdir+0x3af/0x420 fs/namei.c:3943
 __do_sys_unlinkat fs/namei.c:4105 [inline]
 __se_sys_unlinkat fs/namei.c:4099 [inline]
 __x64_sys_unlinkat+0xdf/0x120 fs/namei.c:4099
 do_syscall_64+0xf9/0x670 arch/x86/entry/common.c:293
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Which leads me to another apparent contradiction.  Looking at the C
reproducer source code, and running the C reproducer under "strace
-ff", there is never any attempt to run rmdir() on the corrupted file
system that is mounted.  Neither as observed by my running the C
reproducer, or by looking at the C reproducer source code.

Looking at the code, I did see a number of things which seemed to be
bugs; procid never gets incremented, so all of the threads only
operate on /dev/loop0, and each call to the execute() function tries
to setup two file systems on /dev/loop0.  So the each thread to run
creates a temp file, binds it to /dev/loop0, and then creates another
temp file, tries to bind it to /dev/loop0 (which will fail), tries to
mount /dev/loop0 (again) on the samee mount point (which will
succeed).

I'm not sure if this is just some insanity that was consed up by the
fuzzer... or I'm wondering if this was an unfaithful translation of
the syzbot repro to C.  Am I correct in understanding that when syzbot
is running, it uses the syzbot repro, and not the C repro?

  - Ted

Re: [PATCH 1/2] ext4: Handle casefolding with encryption

2021-02-09 Thread Theodore Ts'o

On Tue, Feb 09, 2021 at 08:03:10PM -0700, Andreas Dilger wrote:
> Depending on the size of the "escape", it probably makes sense to move
> toward having e2fsck migrate from the current mechanism to using dirdata
> for all deployments.  In the current implementation, tools don't really
> know for sure if there is data beyond the filename in the dirent or not.

It's actually quite well defined.  If dirdata is enabled, then we
follow the dirdata rules.  If dirdata is *not* enabled, then if a
directory inode has the case folding and encryption flags set, then
there will be cryptographic data immediately following the filename.
Otherwise, there is no valid data after the filename.

> For example, what if casefold is enabled on an existing filesystem that
> already has an encrypted directory?  Does the code _assume_ that there is
> a hash beyond the name if the rec_len is long enough for this?

No, we will only expect there to be a hash beyond the name if
EXT4_CASEFOLD_FL and EXT4_ENCRYPT_FL flags are set on the inode.  (And
if the rec_len is not large enough, then that's a corrupted directory
entry.)

> I guess it is implicit with the casefold+encryption case for dirents in
> directories that have the encryption flag set in a filesystem that also
> has casefold enabled, but it's definitely not friendly to these features
> being enabled on an existing filesystem.

No, it's fine.  That's because the EXT4_CASEFOLD_FL inode flag can
only be set if the EXT4_FEATURE_INCOMPAT_CASEFOLD is set in the
superblock, and EXT4_ENCRYPT_FL inode flag can only be set if
EXT4_FEATURE_INCOMPAT_ENCRYPT is set in the superblock, this is why it
will be safe to enable of these features, since merely enabling the
file system features only allows new directories to be created with
both CASEFOLD_FL and ENCRYPT_FL set.

The only restriction we would have is a file system has both the case
folding and encryption features, it will *not* be safe to set the
dirdata feature flag without first scanning all of the directories to
see if there are any directories that have both the casefold and
encrypt flags set on that inode, and if so, to convert all of the
directory entries to use dirdata.  I don't think this is going to be a
significant restriction in practice, though.

- Ted

Re: [PATCH] ext4: add .kunitconfig fragment to enable ext4-specific tests

2021-02-09 Thread Theodore Ts'o

On Tue, Feb 09, 2021 at 05:32:06PM -0800, Daniel Latypov wrote:
> 
> After [2]:
>   $ ./tools/testing/kunit.py run --kunitconfig=fs/ext4/.kunitconfig

Any chance that in the future this might become:

$ ./tools/testing/kunit.py run --kunitconfig=fs/ext4

Or better yet, syntactic sugar like:

$ ./tools/testing/kunit.py test fs/ext4

would be really nice.

- Ted

Re: [PATCH 1/2] ext4: Handle casefolding with encryption

2021-02-09 Thread Theodore Ts'o

On Wed, Feb 03, 2021 at 11:31:28AM -0500, Theodore Ts'o wrote:
> On Wed, Feb 03, 2021 at 03:55:06AM -0700, Andreas Dilger wrote:
> > 
> > It looks like this change will break the dirdata feature, which is similarly
> > storing a data field beyond the end of the dirent. However, that feature 
> > also
> > provides for flags stored in the high bits of the type field to indicate
> > which of the fields are in use there.
> > The first byte of each field stores
> > the length, so it can be skipped even if the content is not understood.
> 
> Daniel, for context, the dirdata field is an out-of-tree feature which
> is used by Lustre, and so has fairly large deployed base.  So if there
> is a way that we can accomodate not breaking dirdata, that would be
> good.
> 
> Did the ext4 casefold+encryption implementation escape out to any
> Android handsets?

So from an OOB chat with Daniel, it appears that the ext4
casefold+encryption implementation did in fact escape out to Android
handsets.  So I think what we will need to do, ultiumately, is support
one way of supporting the casefold IV in the case where "encryption &&
casefold", and another way when "encryption && casefold && dirdata".

That's going to be a bit sucky, but I don't think it should be that
complex.  Daniel, Andreas, does that make sense to you?

   - Ted

Re: [PATCH 1/2] ext4: Handle casefolding with encryption

2021-02-03 Thread Theodore Ts'o

On Wed, Feb 03, 2021 at 03:55:06AM -0700, Andreas Dilger wrote:
> 
> It looks like this change will break the dirdata feature, which is similarly
> storing a data field beyond the end of the dirent. However, that feature also
> provides for flags stored in the high bits of the type field to indicate
> which of the fields are in use there.
> The first byte of each field stores
> the length, so it can be skipped even if the content is not understood.

Daniel, for context, the dirdata field is an out-of-tree feature which
is used by Lustre, and so has fairly large deployed base.  So if there
is a way that we can accomodate not breaking dirdata, that would be
good.

Did the ext4 casefold+encryption implementation escape out to any
Android handsets?

Thanks,

- Ted

Re: [PATCH v3] ext4: Enable code path when DX_DEBUG is set

2021-02-02 Thread Theodore Ts'o

On Tue, Feb 02, 2021 at 04:28:37PM +, Vinicius Tinti wrote:
> Clang with -Wunreachable-code-aggressive is being used to try to find
> unreachable code that could cause potential bugs. There is no plan to
> enable it by default.
> 
> The following code was detected as unreachable:
> 
> fs/ext4/namei.c:831:17: warning: code will never be executed
> [-Wunreachable-code]
> unsigned n = count - 1;
>  ^
> fs/ext4/namei.c:830:7: note: silence by adding parentheses to mark code as
> explicitly dead
> if (0) { // linear search cross check
> ^
> /* DISABLES CODE */ ( )
> 
> This has been present since commit ac27a0ec112a ("[PATCH] ext4: initial
> copy of files from ext3") and fs/ext3 had it present at the beginning of
> git history. It has not been changed since.
> 
> This patch moves the code to a new function `htree_rep_invariant_check`
> which only performs the check when DX_DEBUG is set. This allows the
> function to be used in other parts of the code.
> 
> Suggestions from: Andreas, Christoph, Nathan, Nick and Ted.
> 
> Signed-off-by: Vinicius Tinti 

Thanks, applied, although I rewrote the commit description:

ext4: factor out htree rep invariant check

This patch moves some debugging code which is used to validate the
hash tree node when doing a binary search of an htree node into a
separate function, which is disabled by default (since it is only used
by developers when they are modifying the htree code paths).

In addition to cleaning up the code to make it more maintainable, it
silences a Clang compiler warning when -Wunreachable-code-aggressive
is enabled.  (There is no plan to enable this warning by default, since
there it has far too many false positives; nevertheless, this commit
reduces one of the many false positives by one.)

The original description buried the lede, in terms of the primary
reason why I think the change was worthwhile (although I know you have
different priorities than mine :-).

Thanks for working to find a way to improve the code in a way that
makes both of us happy!

- Ted

Re: [PATCH v2] ext4: Change list_for_each* to list_for_each_entry*

2021-02-02 Thread Theodore Ts'o

On Mon, Jan 11, 2021 at 10:37:26AM +0900, Daejun Park wrote:
> In the fast_commit.c, list_for_each* + list_entry can be changed to
> list_for_each_entry*. It reduces number of variables and lines.
> 
> Signed-off-by: Daejun Park 

Thanks, applied.

- Ted

Re: [PATCH v2 -next] ext4: use DEFINE_MUTEX() for mutex lock

2021-02-02 Thread Theodore Ts'o

On Thu, Dec 24, 2020 at 09:22:44PM +0800, Zheng Yongjun wrote:
> mutex lock can be initialized automatically with DEFINE_MUTEX()
> rather than explicitly calling mutex_init().
> 
> Signed-off-by: Zheng Yongjun 

Thanks, applied.

- Ted

Re: [PATCH v2] ext4: Enable code path when DX_DEBUG is set

2021-02-01 Thread Theodore Ts'o

On Mon, Feb 01, 2021 at 07:05:11PM -0300, Vinicius Tinti wrote:
> 
> The goal is to try to detect real bugs. In this instance specifically I
> suggested to remove the "if (0) {...}" because it sounded like an
> unused code.
> 
> If it is useful it is fine to keep.

The trick was that it was unused code, but it was pretty obviously
deliberate, which should have implied that at some point, it was
considered useful.   :-)

It was the fact that you were so determined to find a way to suppress
the warning, suggesting multiple tactics, which made me wonder why
were you going through so much effort to silence the warning if the
goal was *not* to turn it on unconditionally everywhere?

I suspect the much more useful thing to consider is how can we suggest
hueristics to the Clang folks to make the warning more helpful.  For
example, Coverity will warn about the following:

void test_func(unsigned int arg)
{
if (arg < 0) {
printf("Hello, world\n")
}
}

This is an example of dead code that is pretty clearly unintended ---
and it's something that "clang -Wall" or "gcc -Wall" doesn't pick up
on, but Coverity does.

So in cases where the code is explicitly doing "if (0)" or "if
(IS_ENABLED(xxx))" where IS_ENABLED resolves down to zero due to
preprocessor magic, arguably, that's not a useful compiler warning
because it almost *certainly* is intentional.  But in the case of an
unsigned int being compared to see if it is less than, or greater than
or equal to 0, that's almost certainly a bug --- and yes, Coverity has
found a real bug (tm) in my code due to that kind of static code
analysis.  So it would actually be quite nice if there was a compiler
warning (either gcc or clang, I don't really care which) which would
reliably call that out without having the maintainer submit the code
to Coverity for analysis.

Cheers,

- Ted

P.S.  If anyone wants to file a feature request bug with the Clang
developers, feel free.  :-)

Re: [PATCH v2] ext4: Enable code path when DX_DEBUG is set

2021-02-01 Thread Theodore Ts'o

On Mon, Feb 01, 2021 at 01:16:19PM -0800, Nick Desaulniers wrote:
> I agree; Vinicius, my recommendation for -Wunreachable-* with Clang
> was to see whether dead code identified by this more aggressive
> diagnostic (than -Wunused-function) was to ask maintainers whether
> code identified by it was intentionally dead and if they would
> consider removing it.  If they say "no," that's fine, and doesn't need
> to be pushed.  It's not clear to maintainers that:
> 1. this warning is not on by default
> 2. we're not looking to pursue turning this on by default
> 
> If maintainers want to keep the dead code, that's fine, let them and
> move on to the next instance to see if that's interesting (or not).

It should be noted that in Documenting/process/coding-style.rst, there
is an expicit recommendation to code in a way that will result in dead
code warnings:

   Within code, where possible, use the IS_ENABLED macro to convert a Kconfig
   symbol into a C boolean expression, and use it in a normal C conditional:

   .. code-block:: c

if (IS_ENABLED(CONFIG_SOMETHING)) {
...
}

   The compiler will constant-fold the conditional away, and include or exclude
   the block of code just as with an #ifdef, so this will not add any runtime
   overhead.  However, this approach still allows the C compiler to see the code
   inside the block, and check it for correctness (syntax, types, symbol
   references, etc).  Thus, you still have to use an #ifdef if the code inside 
the
   block references symbols that will not exist if the condition is not met.

So our process documentation *explicitly* recommends against using
#ifdef CONFIG_XXX ... #endif, and instead use something that will
-Wunreachable-code-aggressive to cause the compiler to complain.  

Hence, this is not a warning that we will *ever* be able to enable
unconditionally --- so why work hard to remove such warnings from the
code?  If the goal is to see if we can detect real bugs using this
technique, well and good.  If the data shows that this warning
actually is useful in finding bugs, then manybe we can figure out a
way that we can explicitly hint to the compiler that in *this* case,
the maintainer actually knew what they were doing.

But if an examination of the warnings shows that
-Wunreachable-code-aggressive isn't actually finding any real bugs,
then perhaps it's not worth it.

Cheers,

- Ted

Re: [PATCH v2] ext4: Enable code path when DX_DEBUG is set

2021-02-01 Thread Theodore Ts'o

On Mon, Feb 01, 2021 at 03:41:50PM -0300, Vinicius Tinti wrote:
> 
> My goal is to avoid having a dead code. Three options come to mind.
> 
> The first would be to add another #ifdef SOMETHING (suggest a name).
> But this doesn't remove the code and someone could enable it by accident.

I *really* don't see the point of having the compiler whine about
"dead code", so I'm not terribly fond of
-Wunreachable-code-aggressive.  There may be times where depending on
how things are compiled, we *want* the compiler to remove code block,
and it makes the code less ugly than having #ifdef ... #endif breaking
up the code.

If turning that on requires uglifying many places in the kernel code,
maybe the right answer is... don't.

That being said, I have no problem of replacing

if (0) {
...
}

with

#ifdef DX_DEBUG
...
#endif

In this particular place.

But before we go there, I want to register my extreme skepticsm about
-Wunreachable-code-aggressive.  How much other places where it is
***obvious*** that the maintainer really knew what they are doing, and
it's just the compiler whining about a false positive?

> > However, if there *is* a bug, having an early detection that the
> > representation invariant of the data structure has been violated can
> > be useful in root causing a bug.  This would probably be clearer if
> > the code was pulled out into a separate function with comments
> > explaining that this is a rep invariant check.
> 
> Good idea. I will do that too.

If you want to do that, and do something like

#ifdef DX_DEBUG
static inline htree_rep_invariant_Check(...)
{
...
}
#else
static inline htree_rep_invariant_check(...) { }
#endif

I'm not going to complain.  That's actually a better way to go, since
there may be other places in the code where a developer might want to
introduce a rep invariant check.  So that's actually improving the
code, as opposed to making a pointless change just to suppress a
compiler warning.

Of course, then someone will try enabling a -W flag which causes the
compiler to whine about empty function bodies

- Ted

Re: [Linux-kernel-mentees] Patches from the future - can checkpatch help?

2021-02-01 Thread Theodore Ts'o

On Mon, Feb 01, 2021 at 05:50:45PM +0100, Lukas Bulwahn wrote:
> 
> Dwaipayan, there are two ways:
> - We build a bot listening to mailing lists and check. I like that
> implementation idea for various other checks.
> - Stephen Rothwell could include this as a check on linux-next and
> inform the git author and committer.
> 
> I am wondering though if that is worth the effort, three instances of
> a wrong date among 1M commits seems to be very seldom and the harm of
> that mistake is quite small as well.

Another solution might be to ask the git developers if they would be
willing to have "git am" print a warning if the date is sufficiently
insane (say, more than 3 months in the past or present).

One could also imagine a request that "git log" would have a new
format where normally the author time is printed, but if it's
sufficiently different from the commit time, the commit time is also
printed in parenthesis.

Or you could set up your git config so that "git log" uses
--pretty=fuller by default, which prints both the author date and
commit date.

Like Lukas, I'm not really sure it's worth the effort, however.

 - Ted

Re: [PATCH v2] ext4: Enable code path when DX_DEBUG is set

2021-02-01 Thread Theodore Ts'o

On Mon, Feb 01, 2021 at 01:15:29PM -0300, Vinicius Tinti wrote:
> On Mon, Feb 1, 2021 at 9:49 AM Christoph Hellwig  wrote:
> >
> > DX_DEBUG is completely dead code, so either kill it off or make it an
> > actual CONFIG_* symbol through Kconfig if it seems useful.
> 
> About the unreachable code in "if (0)" I think it could be removed.
> It seems to be doing an extra check.
> 
> About the DX_DEBUG I think I can do another patch adding it to Kconfig
> as you and Nathan suggested.

Yes, it's doing another check which is useful in terms of early
detection of bugs when a developer has the code open for
modifications.  It slows down performance under normal circumstances,
and assuming the code is bug-free(tm), it's entirely unnecessary ---
which is why it's under an "if (0)".

However, if there *is* a bug, having an early detection that the
representation invariant of the data structure has been violated can
be useful in root causing a bug.  This would probably be clearer if
the code was pulled out into a separate function with comments
explaining that this is a rep invariant check.

The main thing about DX_DEBUG right now is that it is **super**
verbose.  Unwary users who enable it will be sorry.  If we want to
make it to be a first-class feature enabled via CONFIG_EXT4_DEBUG, we
should convert all of the dx_trace calls to use pr_debug so they are
enabled only if dynamic debug enables those pr_debug() statements.
And this should absolutely be a separate patch.

Cheers,

- Ted

Re: [PATCH v2] ext4: Enable code path when DX_DEBUG is set

2021-02-01 Thread Theodore Ts'o

On Mon, Feb 01, 2021 at 12:49:24PM +, Christoph Hellwig wrote:
> DX_DEBUG is completely dead code, so either kill it off or make it an
> actual CONFIG_* symbol through Kconfig if it seems useful.

I wouldn't call it completely dead code.  If you manually add "#define
DX_DEBUG" fs/ext4/namei.c compiles without any problems.  I believe it
was most recently used when we added large htree support.

It's true that it can only be enabled via manually enabled via manual
editing of the .c file, but it's *really* something that only
developers who are actively involved in modifying the code would want
to use.  Sure, we could add work to add debug levels to all of the
dxtrace() statements, and/or switch it all to dyndebug.  We'd also
have to figure out how to get rid of all of the KERN_CONT printk's in
the ideal world.  The question is whether doing all of this is
worth it if the goal is to shut up some clang warnings.

   - Ted

Re: [RFC PATCH v2 0/4] make jbd2 debug switch per device

2021-01-27 Thread Theodore Ts'o

On Tue, Jan 26, 2021 at 08:50:02AM +0800, brookxu wrote:
> 
> trace point, eBPF and other hook technologies are better for production
> environments. But for pure debugging work, adding hook points feels a bit
> heavy. However, your suggestion is very valuable, thank you very much.

What feels heavy?  The act of adding a new jbd_debug() statement to
the sources, versus adding a new tracepoint?  Or how to enable a set
of tracepoints versus setting a jbd_debug level (either globally, or
per mount point)?  Or something else?

If it's the latter (which is what I think it is), how often are you
needing to add a new jbd_debug() statement *and* needing to run in a
test environment where you have multiple disks?  How often is it
useful to have multiple disks when doing your debugging?

I'm trying to understand why this has been useful to you, since that
generally doesn't match with my development, testing, or debugging
experience.  In general I try to test with one file system at a time,
since I'm trying to find something reproducible.  Do you have cases
where you need multiple file systems in your test environment in order
to do your testing?  Why is that?  Is it because you're trying to use
your production server code as your test reproducers?  And if so, I
would have thought adding the jbd_debug() statements and sending lots
of console print messages would distort the timing enough to make it
hard to reproduce a problem in found in your production environment.

It sounds like you have a very different set of test practices than
what I'm used to, and I'm trying to understand it better.

Cheers,

- Ted

Re: Getting a new fs in the kernel

2021-01-26 Thread Theodore Ts'o

On Tue, Jan 26, 2021 at 07:06:55PM +, Chaitanya Kulkarni wrote:
> From what I've seen you can post the long patch-series as an RFC and get the
> 
> discussion started.
> 
> The priority should be ease of review and not the total patch-count.

File systems are also complicated enough that it's useful to make the
patches available via a git repo, and it's highly recommended that you
are rebasing it against the latest kernel on a regular basis.

I also strongly recommend that once you get something that mostly
works, that you start doing regression testing of the file system.
Most of the major file systems in Linux use xfstests for their
testing.  One of the things that I've done is to package up xfstests
as a test appliance, suitable for running under KVM or using Google
Compute Engine, as a VM, to make it super easy for people to run
regression tests.  (One of my original goals for packaging it up was
to make it easy for graduate students who were creating research file
systems to try running regression tests so they could find potential
problems --- and understand how hard it is to make a robust,
production-ready file system, by giving them a realtively well
documented, turn-key system for running file system regression tests.)

For more information, see:

https://thunk.org/gce-xfstests

https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md

https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md

The final thing I'll point out is that file system development is a
team sport.  Industry estimates are that it takes between 50 and 200
person-years to create a production-ready, general purpose enterprise
file system.  For example, ZFS took seven years to develop, starting
with a core team of 4, and growing to over 14 developers by the time
it was announced.  And that didn't include all of the QA, release
engineering, testers, performance engineers, to get it integrated into
the Solaris product.  Even after it was announced, it was a good four
years before customers trusted it for production workloads.

If you look at the major file systems in Linux: ext4, xfs, btrfs,
f2fs, etc., you'll find that none of them are solo endeavors, and all
of them have multiple companies who are employing the developers who
work on them.  Figuring out how to convince companies that there are
good business reasons for them to support the developers of your file
system is important, since in order to keep things going for the long
haul, it really needs to be more than a single person's hobby.

Good luck!

- Ted

Re: [RFC PATCH v2 0/4] make jbd2 debug switch per device

2021-01-25 Thread Theodore Ts'o

On Sat, Jan 23, 2021 at 08:00:42PM +0800, Chunguang Xu wrote:
> On a multi-disk machine, because jbd2 debugging switch is global, this
> confuses the logs of multiple disks. It is not easy to distinguish the
> logs of each disk and the amount of generated logs is very large. Maybe
> a separate debugging switch for each disk would be better, so that we
> can easily distinguish the logs of a certain disk. 
> 
> We can enable jbd2 debugging of a device in the following ways:
> echo X > /proc/fs/jbd2/sdX/jbd2_debug
> 
> But there is a small disadvantage here. Because the debugging switch is
> placed in the journal_t object, the log before the object is initialized
> will be lost. However, usually this will not have much impact on
> debugging.

The jbd debugging infrastructure dates back to the very beginnings of
ext3, when Stephen Tweedie added them while he was first implementing
the jbd layer.  So this dates back to a time before we had other
schemes like dynamic debug or tracepoints or eBPF.

I wonder if instead of trying to enhance our own bespoke debugging
system, instead we set up something like tracepoints where they would
be useful.  I'm not proposing that we try to replace all jbd_debug()
statements with tracepoints but I think it would be useful to look at
what sort of information would actually be *useful* on a production
server, and add those tracepoints to the jbd2 layer.  What I like
about tracepoints is you can enable them on a much more fine-grained
fashion; information is sent to userspace in a much more efficient
manner than printk; you can filter tracepoint events in the kernel,
before sending them to userspace; and if you want more sophisticated
filtering or aggregation, you can use eBPF.

What was the original use case which inspired this?  Were you indeed
trying to debug some kind of problem on a production system?  (Why did
you have multiple disks active at the same time?)  Was there a
specific problem you were trying to debug?  What debug level were you
using?  Which jbd_debug statements were most useful to you?  Which
just got in the way (but which had to be enabled given the log level
you needed to get the debug messages that you needed)?

  - Ted

[GIT PULL] ext4 bug fixes for v5.11-rc4

2021-01-15 Thread Theodore Ts'o

Note: there is a fairly simple merge conflict, which can be resolved
by taking EXT4_SB(sb) and replacing it with sbi using the version in
your tree.  My merge resolution (which I used to run regression tests)
is attached below.

- Ted


The following changes since commit be993933d2e997fdb72b8b1418d2a84df79b8962:

  ext4: remove unnecessary wbc parameter from ext4_bio_write_page (2020-12-22 
13:08:45 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_stable

for you to fetch changes up to e9f53353e166a67dfe4f8295100f8ac39d6cf10b:

  ext4: remove expensive flush on fast commit (2021-01-15 14:41:31 -0500)


A number of bug fixes for ext4:

   * For the new fast_commit feature
   * Fix some error handling codepaths in whiteout handling and
 mountpoint sampling
   * Fix how we write ext4_error information so it goes through the journal
 when journalling is active, to avoid races that can lead to lost
 error information, superblock checksum failures, or DIF/DIX features.


Daejun Park (2):
  ext4: fix wrong list_splice in ext4_fc_cleanup
  ext4: remove expensive flush on fast commit

Jan Kara (7):
  ext4: combine ext4_handle_error() and save_error_info()
  ext4: drop sync argument of ext4_commit_super()
  ext4: protect superblock modifications with a buffer lock
  ext4: save error info to sb through journal if available
  ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super()
  ext4: fix superblock checksum failure when setting password salt
  ext4: drop ext4_handle_dirty_super()

Theodore Ts'o (1):
  ext4: don't leak old mountpoint samples

Yi Li (1):
  ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR

yangerkun (1):
  ext4: fix bug for rename with RENAME_WHITEOUT

 fs/ext4/ext4_jbd2.c   |  17 --
 fs/ext4/ext4_jbd2.h   |   5 --
 fs/ext4/fast_commit.c |  35 +--
 fs/ext4/file.c|   7 ++-
 fs/ext4/inode.c   |   6 +-
 fs/ext4/ioctl.c   |   3 +
 fs/ext4/namei.c   |  27 +---
 fs/ext4/resize.c  |  20 --
 fs/ext4/super.c   | 193 
--
 fs/ext4/xattr.c   |   5 +-
 10 files changed, 187 insertions(+), 131 deletions(-)

-

commit c3d123034d72849414f4b88bcd78843cec16caa5
Merge: 146620506274 8f4949dacec8
Author: Theodore Ts'o 
Date:   Thu Jan 14 22:09:51 2021 -0500

Merge branch 'dev' into test

diff --cc fs/ext4/super.c
index 21121787c874,0f0db49031dc..9a6f9875aa34
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@@ -5450,18 -5494,22 +5488,18 @@@ static void ext4_update_super(struct su
 */
if (!(sb->s_flags & SB_RDONLY))
ext4_update_tstamp(es, s_wtime);
 -  if (sb->s_bdev->bd_part)
 -  es->s_kbytes_written =
 -  cpu_to_le64(sbi->s_kbytes_written +
 -  ((part_stat_read(sb->s_bdev->bd_part,
 -   sectors[STAT_WRITE]) -
 -sbi->s_sectors_written_start) >> 1));
 -  else
 -  es->s_kbytes_written = cpu_to_le64(sbi->s_kbytes_written);
 +  es->s_kbytes_written =
-   cpu_to_le64(EXT4_SB(sb)->s_kbytes_written +
++  cpu_to_le64(sbi->s_kbytes_written +
 +  ((part_stat_read(sb->s_bdev, sectors[STAT_WRITE]) -
- EXT4_SB(sb)->s_sectors_written_start) >> 1));
-   if (percpu_counter_initialized(_SB(sb)->s_freeclusters_counter))
++sbi->s_sectors_written_start) >> 1));
+   if (percpu_counter_initialized(>s_freeclusters_counter))
ext4_free_blocks_count_set(es,
-   EXT4_C2B(EXT4_SB(sb), percpu_counter_sum_positive(
-   _SB(sb)->s_freeclusters_counter)));
-   if (percpu_counter_initialized(_SB(sb)->s_freeinodes_counter))
+   EXT4_C2B(sbi, percpu_counter_sum_positive(
+   >s_freeclusters_counter)));
+   if (percpu_counter_initialized(>s_freeinodes_counter))
es->s_free_inodes_count =
cpu_to_le32(percpu_counter_sum_positive(
-   _SB(sb)->s_freeinodes_counter));
+   >s_freeinodes_counter));
/* Copy error information to the on-disk superblock */
spin_lock(>s_error_lock);
if (sbi->s_add_error_count > 0) {

Re: [PATCH] ext4: Remove expensive flush on fast commit

2021-01-13 Thread Theodore Ts'o

On Wed, Jan 06, 2021 at 10:32:42AM +0900, Daejun Park wrote:
> In the fast commit, it adds REQ_FUA and REQ_PREFLUSH on each fast commit
> block when barrier is enabled. However, in recovery phase, ext4 compares
> CRC value in the tail. So it is sufficient adds REQ_FUA and REQ_PREFLUSH
> on the block that has tail.
> 
> Signed-off-by: Daejun Park 

Thanks, applied.

- Ted

Re: [PATCH] ext4: Fix wrong list_splice in ext4_fc_cleanup

2021-01-13 Thread Theodore Ts'o

On Wed, Dec 30, 2020 at 06:48:51PM +0900, Daejun Park wrote:
> After full/fast commit, entries in staging queue are promoted to main
> queue. In ext4_fs_cleanup function, it splice to staging queue to
> staging queue.
> 
> Signed-off-by: Daejun Park 

Thanks, applied.

- Ted

Re: [PATCH] Use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR.

2021-01-13 Thread Theodore Ts'o

On Wed, Jan 06, 2021 at 02:02:11PM +0100, Jan Kara wrote:
> On Wed 30-12-20 11:38:27, Yi Li wrote:
> > 1: ext4_iget/ext4_find_extent never returns NULL, use IS_ERR
> > instead of IS_ERR_OR_NULL to fix this.
> > 
> > 2: ext4_fc_replay_inode should set the inode to NULL when IS_ERR.
> > and go to call iput properly.
> > 
> > Signed-off-by: Yi Li 
> 
> Thanks for the patch! It looks good to me. You can add:
> 
> Reviewed-by: Jan Kara 

Thanks, applied.

- Ted

Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2021-01-12 Thread Theodore Ts'o

On Tue, Jan 12, 2021 at 11:28:40PM +0100, Pavel Machek wrote:
> 
> This thread suggested that kernel is _not_ supposed to be robust
> against corrupt filesystems (because fsck is not integrated in
> kernel). Which was news to me (and I'm not the person that needs
> warning in execve documentation).

Define "supposed to be".  In the ideal world, the kernel should be
robust against corrupt file systems.  In the ideal world, hard drives
would never die, and memory bits would never get flipped due to cosmic
rays, and so Intel would be correct that consumers don't need ECC
memory.  In the ideal world, drivers would never make mistakes, and so
seat belts would be completely unnecessasry.

Unfortunately, we live in the real world.

And so there is risk associated with using various technologies, and
that risk is not a binary 0% vs 100%.  In my mind, assuming that file
systems are robust against maliciously created images is much like
driving around without a seat belt.  There are plenty of people who
drive without seat belts for years, and they haven't been killed yet.
But it's not something *I* would do.  But hey, I also spent extra
money to buy a personal desktop computer with ECC memory, and I don't
go bicycling without a helment, either.

You're free to decide what *you* want to do.  But I wouldn't take a
random file system image from the Net and mount it without running
fsck on the darned thing first.

As far as secure boot is concerned, do you know how many zero days are
out there with Windows machines?  I'm pretty sure there are vast
numbers.  There are even more systems where the system adminsitrators
haven't bothered to install latest updates, as well.

> And if we have filesystems where corrupt image is known to allow
> arbitrary code execution, we need to

Define *known*.  If you look at the syzbot dashboard, there are
hundreds of reproducers where root is able to crash a Linux server.
Some number of them will almost certainly be exploitable.  The problem
is it takes a huge amount of time to analyze them, and Syzbot's file
system fuzzers are positively developer-hostile to root cause.  So
usually I find and fix ext4 file system fuzzing reports via reports
from other fuzzers, and then eventually syzbot realizes that the
reproducer no longer works, and it gets closed out.

I'd certainly be willing to bet a beer or two that there are
vulnerabilities in VFAT, but do I *know* that to be the case?  No,
because I don't have the time to take a syzbot report relating to a
file system and root cause it.  The time to extract out the image, and
then figure out how to get a simple reproducer (as opposed to the mess
that a syzbot reproducer that might be randomly toggling a network
bridge interface on one thread while messing with a file system image
on another) is easily 10-20 times the effort to actually *fix* the bug
once we're able to come up with a trivial reproducer with a file
system image which is a separate file so we can analyze it using file
system debugging tools.

I'm also *quite* confident that the NSA, KGB, and other state
intelligence agencies have dozens of zero days for Windows, Linux,
etc.  We just don't know in what subsystem they are in, so we can't
just "disable them when secure boot is enabled".

- Ted

Re: Malicious fs images was Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2021-01-12 Thread Theodore Ts'o

On Sun, Jan 10, 2021 at 07:41:02PM +0100, Pavel Machek wrote:
> > >From our perspective (and Darrick and I discussed this on this week's
> > ext4 video conference, so it represents the ext4 and xfs maintainer's
> > position) is that the file system format is different.  First, the
> > on-disk format is not an ABI, and it is several orders more complex
> > than a system call interface.  Second, we make no guarantees about
> > what the file system created by malicious tools will do.  For example,
> > XFS developers reject bug reports from file system fuzzers, because
> > the v5 format has CRC checks, so randomly corrupted file systems won't
> > crash the kernel.  Yes, this doesn't protect against maliciously
> > created file systems where the attacker makes sure the checksums are
> > valid, but only crazy people who think containers are just as secure
> 
> Well, it is not just containers. It is also USB sticks. And people who
> believe secure boot is good idea and try to protect kernel against
> root. And crazy people who encrypt pointers in dmesg. And...
> 
> People want to use USB sticks from time to time. And while I
> understand XFS is so complex it is unsuitable for such use, I'd still
> expect bugs to be fixed there.
> 
> I hope VFAT to be safe to mount, because that is very common on USB.
> 
> I also hope ext2/3/4 is safe in that regard.

Ext4 will fix file system fuzzing attack bugs on a best efforts basis.
That is, when I have time, I've been known to stay up late to bugs
reported by fuzzers.  I hope ext4 is safe, but I'm not going to make
any guarantees that it is Bug-Free(tm).  If you want to trust it in
that way, you do so at your risk.

As far as VFS is concerned, I'm not aware of anyone who has been
working on fuzz-proofing VFAT, and looking at the Vault 2016 for
"American Fuzzy Lop"[1] while VFAT wasn't specifically tested, for the
vast majority of file systems, the "time to first bug" typically
ranged from seconds to minutes, with the exception of XFS and ext4
(where it was roughly 2 hours).  The specific bugs which triggered in
the 2016 AFL presentation have been fixed, at least for the file
systems which get regular maintainer attention, but this is why I try
to caution people not to count on file systems being proof against
maliciously formatted images.

[1] 
https://events.static.linuxfound.org/sites/events/files/slides/AFL%20filesystem%20fuzzing,%20Vault%202016_0.pdf

> Anyway it would be nice to have documentation explaining this. If I'm
> wrong about VFAT being safe, it would be good to know, and I guess
> many will be surprised that XFS is using different rules.

Using USB sticks is fine, so long as you trust the provenance of the
drive.  If you take a random USB stick that is handed to you by
someone whom you don't trust implicitly, or worse, that you picked up
abandoned on the sidewalk, there have been plenty of articles which
describe why this is a REALLY BAD IDEA, and even if you ignore
OS-level vuleranbilities, there are also firwmare and hardware based
vulerabilities that would put your computer at risk.  See [2] and [3]
for more details; there's a reason why I've visited at least one
financial institution where they put epoxy in USB ports to prevent
clueless workers from potentially compromising the bank's computers.

[2] 
https://www.redteamsecure.com/blog/usb-drop-attacks-the-danger-of-lost-and-found-thumb-drives/
[3] 
https://www.bleepingcomputer.com/news/security/heres-a-list-of-29-different-types-of-usb-attacks/

As far as documentation is concerned, how far should we go?  Should
there be a warning in the execve(2) system call man page that you
shouldn't download random binaries from the network and execute them?  :-)

Cheers,

- Ted

Re: Aarch64 EXT4FS inode checksum failures - seems to be weak memory ordering issues

2021-01-07 Thread Theodore Ts'o

On Thu, Jan 07, 2021 at 01:37:47PM +, Russell King - ARM Linux admin wrote:
> > The gcc bugzilla mentions backports into gcc-linaro, but I do not see
> > them in my git history.
> 
> So, do we raise the minimum gcc version for the kernel as a whole to 5.1
> or just for aarch64?

Russell, Arnd, thanks so much for tracking down the root cause of the
bug!

I will note that RHEL 7 uses gcc 4.8.  I personally don't have an
objections to requiring developers using RHEL 7 to have to install a
more modern gcc (since I use Debian Testing and gcc 10.2.1, myself,
and gcc 5.1 is so five years ago :-), but I could imagine that being
considered inconvenient for some.

- Ted

Re: Generate the config file for kernel compilation non-interactively in script.

2021-01-02 Thread Theodore Ts'o

On Fri, Jan 01, 2021 at 12:51:13PM +0800, Hongyi Zhao wrote:
> 
> I want to build the realtime Linux for ROS 2 according to the
> guidelines here:
> .
> 
> For this purpose, I must enable the rt_preempt relative options in the
> kernel withe the following method interactively:
> 
> But this is very inconvenient for doing the above job in script. Is
> there an alternative method to generate the above configurations for
> kernel compilation  non-interactively in script.

What I do for a slightly different use case is to use defconfigs.
After setting up a minimum kernel that has what I need for my use case
(in my case, to build a kernel that works for kvm-xfstests and
gce-xfststs), I save a defconfig: "make savedefconfig", and then stash
in a git repository:

https://github.com/tytso/xfstests-bld/tree/master/kernel-configs

Then when I need to build a kernel, I just copy a defconfig to
.config, and then run the command "make olddefconfig" to expand it
out.  The reason why I use defconfigs is that very often, at least for
my use cases, a defconfig generated for kernel version X.Y tends to
work for X.Y+1, X.Y+2, etc.  That's not always true, of course, which
is why there are a few extra lines added to:

https://github.com/tytso/xfstests-bld/blob/master/kernel-configs/x86_64-config-4.19

This was needed so that the this defconfig will work for 4.19.y
through 5.3.y.  The special cases were needed for 5.1 and 5.2, but I
haven't needed it for any other kernel versions in terms of making a
kernel that would correctly boot on KVM and GCE and correctly run
xfstests for ext4, xfs, btrfs, f2fs, etc.  So I just create a
defconfig for each LTS kernel version, and for the most part, it will
work for future kernel versions until the next LTS kerenl version gets
released, at which point I create a new defconfig for my test
appliance use case.

If your goal is to build newer kernel versions for RT_PREMPT kernels
for your use case, perhaps this technique will be helpful for you.

Hope this helps,

- Ted

Re: C/C++ Code Reviewer Available

2020-12-31 Thread Theodore Ts'o

On Wed, Dec 30, 2020 at 11:45:58PM -0800, Robin Rowe wrote:
> "Linux kernel's Kroah-Hartman: We're not struggling to get new coders, it's
> code review that's the bottleneck", says article at The Register.
> 
> Ok, I've used C++ for 20 years, taught C/C++ at two universities, and
> developed real-time safety-critical systems. Need me? Contact me off-list.

Hi Robin,

It's great that you expressed an interest in contributing to the Linux
kernel!

How much experience do you have with Linux kernel coding?  Most code
reviewers specialize in a particular kernel subsystem, and often get
their start submitting patches as part of coming up to speed with that
particular subsystem so that a subsystem maintainer knows how much
they can trust a particular code review.

The real challenge is that we have plenty of people who are
enthusiastic about submitting whitespace patches and drive-by
checkpatch fixes, and/or submitting new drives or other features for
their particular employer, but we need to get them to graduate from
being entry-level coders, to being people who become senior developers
who can perform trusted code reviews --- and for their employers to
allow them to spend company time contributing back to the community.

Or for people who are willing to spend their own time on nights and
weekend doing this sort of code review work and/or bug fixes and/or
reviewing issues reported by fuzz testers or static code analyizers,
instead of just focusing on features that aren't part of drivers or
features that their companies need for their own short-term business
interest.  This does happen of course, for people who have become
super-passionate about contributing to the subsystem and the kernel
community at large, but there are many people who just submit device
drivers and/or specific kernel features because it's part of their 9-5
job.

This is most of what maintainers and other senior developers do; it's
the less glamorous "scut work" that fewer people are interested in ---
including implementing regression test frameworks, improving subsystem
documentation, investigating fuzz reports and hard-to-reproduce flaky
tests, helping to mentor and train newer engineers, and yes, code
review.  It's what's called servant-leadership.  :-)

Cheers,

- Ted

Re: ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()

2020-08-06 Thread Theodore Ts'o

On Mon, Jul 27, 2020 at 09:54:14AM +0800, brookxu wrote:
> Delete the invalid BUGON in ext4_mb_load_buddy_gfp(), the previous
> code has already judged whether page is NULL.
> 
> Signed-off-by: Chunguang Xu 

Applied, but I had to manually apply your patch since it was mangled
by your mailer.

It looks like the problem may have been caused by your using gmail;
please take a look at the file Documentation/process/email-clients.rst for some 
hints.

- Ted

Re: INFO: rcu detected stall in ext4_write_checks

2019-07-14 Thread Theodore Ts'o

On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> But short term I don't see any other solution than stop testing
> sched_setattr because it does not check arguments enough to prevent
> system misbehavior. Which is a pity because syzkaller has found some
> bad misconfigurations that were oversight on checking side.
> Any other suggestions?

Or maybe syzkaller can put its own limitations on what parameters are
sent to sched_setattr?  In practice, there are any number of ways a
root user can shoot themselves in the foot when using sched_setattr or
sched_setaffinity, for that matter.  I imagine there must be some such
constraints already --- or else syzkaller might have set a kernel
thread to run with priority SCHED_BATCH, with similar catastrophic
effects --- or do similar configurations to make system threads
completely unschedulable.

Real time administrators who know what they are doing --- and who know
that their real-time threads are well behaved --- will always want to
be able to do things that will be catastrophic if the real-time thread
is *not* well behaved.  I don't it is possible to add safety checks
which would allow the kernel to automatically detect and reject unsafe
configurations.

An apt analogy might be civilian versus military aircraft.  Most
airplanes are designed to be "inherently stable"; that way, modulo
buggy/insane control systems like on the 737 Max, the airplane will
automatically return to straight and level flight.  On the other hand,
some military planes (for example, the F-16, F-22, F-36, the
Eurofighter, etc.) are sometimes designed to be unstable, since that
way they can be more maneuverable.

There are use cases for real-time Linux where this flexibility/power
vs. stability tradeoff is going to argue for giving root the
flexibility to crash the system.  Some of these systems might
literally involve using real-time Linux in military applications,
something for which Paul and I have had some experience.  :-)

Speaking of sched_setaffinity, one thing which we can do is have
syzkaller move all of the system threads to they run on the "system
CPU's", and then move the syzkaller processes which are testing the
kernel to be on the "system under test CPU's".  Then regardless of
what priority the syzkaller test programs try to run themselves at,
they can't crash the system.

Some real-time systems do actually run this way, and it's a
recommended configuration which is much safer than letting the
real-time threads take over the whole system:

http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application

- Ted

[GIT PULL] ext4 updates for 5.3-rc1

2019-07-09 Thread Theodore Ts'o

The following changes since commit cd6c84d8f0cdc911df435bb075ba22ce3c605b07:

  Linux 5.2-rc2 (2019-05-26 16:49:19 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus

for you to fetch changes up to 96fcaf86c3cb9340015fb475d79ef0a6fcf858ed:

  ext4: fix coverity warning on error path of filename setup (2019-07-02 
17:56:12 -0400)


Many bug fixes and cleanups, and an optimization for case-insensitive
lookups.


Colin Ian King (1):
  ext4: remove redundant assignment to node

Darrick J. Wong (1):
  ext4: don't allow any modifications to an immutable file

Gabriel Krisman Bertazi (2):
  ext4: optimize case-insensitive lookups
  ext4: fix coverity warning on error path of filename setup

Gaowei Pu (1):
  jbd2: fix some print format mistakes

Jan Kara (1):
  ext4: gracefully handle ext4_break_layouts() failure during truncate

Kimberly Brown (1):
  ext4: replace ktype default_attrs with default_groups

Liu Song (1):
  jbd2: fix typo in comment of journal_submit_inode_data_buffers

Ross Zwisler (3):
  mm: add filemap_fdatawait_range_keep_errors()
  jbd2: introduce jbd2_inode dirty range scoping
  ext4: use jbd2_inode dirty range scoping

Theodore Ts'o (7):
  ext4: enforce the immutable flag on open files
  ext4: clean up kerneldoc warnigns when building with W=1
  jbd2: drop declaration of journal_sync_buffer()
  ext4: allow directory holes
  ext4: rename "dirent_csum" functions to use "dirblock"
  ext4: refactor initialize_dirent_tail()
  ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()

Wang Shilong (1):
  ext4: only set project inherit bit for directory

zhangjs (1):
  ext4: make __ext4_get_inode_loc plug

 fs/ext4/balloc.c |   4 +-
 fs/ext4/dir.c|  27 +-
 fs/ext4/ext4.h   |  65 ++--
 fs/ext4/ext4_jbd2.h  |  12 ++---
 fs/ext4/extents.c|   4 +-
 fs/ext4/extents_status.c |   1 -
 fs/ext4/file.c   |   4 ++
 fs/ext4/indirect.c   |  22 +++--
 fs/ext4/inline.c |  21 
 fs/ext4/inode.c  |  93 ++-
 fs/ext4/ioctl.c  |  48 +-
 fs/ext4/mballoc.c|   5 +-
 fs/ext4/move_extent.c|  15 +++---
 fs/ext4/namei.c  | 213 
+--
 fs/ext4/sysfs.c  |   6 ++-
 fs/jbd2/commit.c |  25 +++---
 fs/jbd2/journal.c|  25 +-
 fs/jbd2/transaction.c|  49 ++
 fs/unicode/utf8-core.c   |  28 +++
 include/linux/fs.h   |   2 +
 include/linux/jbd2.h |  23 -
 include/linux/unicode.h  |   3 ++
 mm/filemap.c |  22 +
 23 files changed, 483 insertions(+), 234 deletions(-)

Re: exfat filesystem

2019-07-09 Thread Theodore Ts'o

On Tue, Jul 09, 2019 at 08:48:34AM -0700, Matthew Wilcox wrote:
> 
> Interesting analysis.  It seems to me that the correct forms would be
> observed if someone suitably senior at Microsoft accepted the work from
> Valdis and submitted it with their sign-off.  KY, how about it?

It might be that the simplest way to do this is just to have someone
from Microsoft send the pull request (with a signed tag) to Linus.
There are any number ways to arrange this but the PGP-signed tag might
be sufficient.  Alternatively, some kind of declaration from a
Microsoft lawyer to OIN might be sufficient.  This is where asking the
LF if they can bring together a meeting of the minds of LF, OIN, and
Microsoft lawyers might make things much easier.

- Ted

Re: Procedure questions - new filesystem driver..

2019-07-09 Thread Theodore Ts'o

On Tue, Jul 09, 2019 at 04:21:36AM -0700, Matthew Wilcox wrote:
> How does
> https://www.zdnet.com/article/microsoft-open-sources-its-entire-patent-portfolio/
> change your personal opinion?

According to SFC's legal analysis, Microsoft joining the OIN doesn't
mean that the eXFAT patents are covered, unless *Microsoft*
contributes the code to the Linux usptream kernel.  That's because the
OIN is governed by the Linux System Definition, and until MS
contributes code which covered by the exFAT patents, it doesn't count.

For more details:

https://sfconservancy.org/blog/2018/oct/10/microsoft-oin-exfat/

(This is not legal advice, and I am not a lawyer.)

- Ted

Re: Procedure questions - new filesystem driver..

2019-07-08 Thread Theodore Ts'o

On Mon, Jul 08, 2019 at 08:37:42PM -0400, Valdis Klētnieks wrote:
> I have an out-of-tree driver for the exfat file system that I beaten into 
> shape
> for upstreaming. The driver works, and passes sparse and checkpatch (except
> for a number of line-too-long complaints).
> 
> Do you want this taken straight to the fs/ tree, or through drivers/staging?

How have you dealt with the patent claims which Microsoft has
asserted[1] on the exFAT file system design?

[1] 
https://www.microsoft.com/en-us/legal/intellectualproperty/mtl/exfat-licensing.aspx

I am not making any claims about the validity of Microsoft's patent
assertions on exFAT, one way or another.  But it might be a good idea
for some laywers from the Linux Foundation to render some legal advice
to their employees (namely Greg K-H and Linus Torvalds) regarding the
advisability of taking exFAT into the official Linux tree.

Personally, if Microsoft is going to be unfriendly about not wanting
others to use their file system technology by making patent claims,
why should we reward them by making their file system better by
improvings its interoperability?  (My personal opinion only.)

Cheers,

- Ted

Re: INFO: rcu detected stall in ext4_write_checks

2019-07-06 Thread Theodore Ts'o

On Fri, Jul 05, 2019 at 11:16:31PM -0700, Paul E. McKenney wrote:
> I suppose RCU could take the dueling-banjos approach and use increasingly
> aggressive scheduler policies itself, up to and including SCHED_DEADLINE,
> until it started getting decent forward progress.  However, that
> sounds like the something that just might have unintended consequences,
> particularly if other kernel subsystems were to also play similar
> games of dueling banjos.

So long as the RCU threads are well-behaved, using SCHED_DEADLINE
shouldn't have much of an impact on the system --- and the scheduling
parameters that you can specify on SCHED_DEADLINE allows you to
specify the worst-case impact on the system while also guaranteeing
that the SCHED_DEADLINE tasks will urn in the first place.  After all,
that's the whole point of SCHED_DEADLINE.

So I wonder if the right approach is during the the first userspace
system call to shced_setattr to enable a (any) real-time priority
scheduler (SCHED_DEADLINE, SCHED_FIFO or SCHED_RR) on a userspace
thread, before that's allowed to proceed, the RCU kernel threads are
promoted to be SCHED_DEADLINE with appropriately set deadline
parameters.  That way, a root user won't be able to shoot the system
in the foot, and since the vast majority of the time, there shouldn't
be any processes running with real-time priorities, we won't be
changing the behavior of a normal server system.

(I suspect there might be some audio applications that might try to
set real-time priorities, but for desktop systems, it's probably more
important that the system not tie its self into knots since the
average desktop user isn't going to be well equipped to debug the
problem.)

> Alternatively, is it possible to provide stricter admission control?

I think that's an orthogonal issue; better admission control would be
nice, but it looks to me that it's going to be fundamentally an issue
of tweaking hueristics, and a fool-proof solution that will protect
against all malicious userspace applications (including syzkaller) is
going to require solving the halting problem.  So while it would be
nice to improve the admission control, I don't think that's a going to
be a general solution.

- Ted

Re: INFO: rcu detected stall in ext4_write_checks

2019-07-05 Thread Theodore Ts'o

On Fri, Jul 05, 2019 at 12:10:55PM -0700, Paul E. McKenney wrote:
> 
> Exactly, so although my patch might help for CONFIG_PREEMPT=n, it won't
> help in your scenario.  But looking at the dmesg from your URL above,
> I see the following:

I just tested with CONFIG_PREEMPT=n

% grep CONFIG_PREEMPT /build/ext4-64/.config
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y
# CONFIG_PREEMPTIRQ_EVENTS is not set

And with your patch, it's still not helping.

I think that's because SCHED_DEADLINE is a real-time style scheduler:

   In  order  to fulfill the guarantees that are made when a thread is ad‐
   mitted to the SCHED_DEADLINE policy,  SCHED_DEADLINE  threads  are  the
   highest  priority  (user  controllable)  threads  in the system; if any
   SCHED_DEADLINE thread is runnable, it will preempt any thread scheduled
   under one of the other policies.

So a SCHED_DEADLINE process is not going yield control of the CPU,
even if it calls cond_resched() until the thread has run for more than
the sched_runtime parameter --- which for the syzkaller repro, was set
at 26 days.

There are some safety checks when using SCHED_DEADLINE:

   The kernel requires that:

   sched_runtime <= sched_deadline <= sched_period

   In  addition,  under  the  current implementation, all of the parameter
   values must be at least 1024 (i.e., just over one microsecond, which is
   the  resolution  of the implementation), and less than 2^63.  If any of
   these checks fails, sched_setattr(2) fails with the error EINVAL.

   The  CBS  guarantees  non-interference  between  tasks,  by  throttling
   threads that attempt to over-run their specified Runtime.

   To ensure deadline scheduling guarantees, the kernel must prevent situ‐
   ations where the set of SCHED_DEADLINE threads is not feasible (schedu‐
   lable)  within  the given constraints.  The kernel thus performs an ad‐
   mittance test when setting or changing SCHED_DEADLINE  policy  and  at‐
   tributes.   This admission test calculates whether the change is feasi‐
   ble; if it is not, sched_setattr(2) fails with the error EBUSY.

The problem is that SCHED_DEADLINE is designed for sporadic tasks:

   A  sporadic  task is one that has a sequence of jobs, where each job is
   activated at most once per period.  Each job also has a relative  dead‐
   line,  before which it should finish execution, and a computation time,
   which is the CPU time necessary for executing the job.  The moment when
   a  task wakes up because a new job has to be executed is called the ar‐
   rival time (also referred to as the request time or release time).  The
   start time is the time at which a task starts its execution.  The abso‐
   lute deadline is thus obtained by adding the relative deadline  to  the
   arrival time.

It appears that kernel's admission control before allowing
SCHED_DEADLINE to be set on a thread was designed for sane
applications, and not abusive ones.  Given that process started doing
abusive things *after* SCHED_DEADLINE policy was set, in order kernel
to figure out that in fact SCHED_DEADLINE should be denied for any
arbitrary kernel thread would require either (a) solving the halting
problem, or (b) being able to anticipate the future (in which case,
we should be using that kernel algorithm to play the stock market  :-)

   - Ted

Re: [PATCH] ext4: replace ktype default_attrs with default_groups

2019-07-02 Thread Theodore Ts'o

On Wed, May 08, 2019 at 04:07:48PM -0400, Kimberly Brown wrote:
> The kobj_type default_attrs field is being replaced by the
> default_groups field. Replace the default_attrs field in ext4_sb_ktype
> and ext4_feat_ktype with default_groups. Use the ATTRIBUTE_GROUPS macro
> to create ext4_groups and ext4_feat_groups.
> 
> Signed-off-by: Kimberly Brown 
> ---
> 
> This patch depends on a patch in the driver-core tree:
> https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core.git/commit/?h=driver-core-next=aa30f47cf666111f6bbfd15f290a27e8a7b9d854
> 
> Greg KH can take this patch through the driver-core tree, or this patch
> can wait a release cycle and go through the subsystem's tree, whichever
> the subsystem maintainer is more comfortable with.

Thanks, since that commit is in 5.2-rc2, I've applied this patch.

- Ted

Re: ext3/ext4 filesystem corruption under post 5.1.0 kernels

2019-07-01 Thread Theodore Ts'o

On Mon, Jul 01, 2019 at 02:43:14PM +0200, Geert Uytterhoeven wrote:
> Hi Ted,
> 
> Despite this fix having been applied upstream,  the kernel prints from
> time to time:
> 
> EXT4-fs (sda1): error count since last fsck: 5
> EXT4-fs (sda1): initial error at time 1557931133:
> ext4_get_branch:171: inode 1980: block 27550
> EXT4-fs (sda1): last error at time 1558114349:
> ext4_get_branch:171: inode 1980: block 27550
> 
> This happens even after a manual run of "e2fsck -f" (while it's mounted
> RO), which reports a clean file system.

What's happening is this.  When the kernel detects a corruption, newer
kernels will set these superblock fields:

__le32  s_error_count;  /* number of fs errors */
__le32  s_first_error_time; /* first time an error happened */
__le32  s_first_error_ino;  /* inode involved in first error */
__le64  s_first_error_block;/* block involved of first error */
__u8s_first_error_func[32] __nonstring; /* function where the 
error happened */
__le32  s_first_error_line; /* line number where error happened */
__le32  s_last_error_time;  /* most recent time of an error */
__le32  s_last_error_ino;   /* inode involved in last error */
__le32  s_last_error_line;  /* line number where error happened */
__le64  s_last_error_block; /* block involved of last error */
__u8s_last_error_func[32] __nonstring;  /* function where the 
error happened */

When newer versions of e2fsck *fix* the corruption, it will clear
these fields.  It's basically a safety check because *way* too many
ext4 users run with errors=continue (aka, "don't worry, be happy"
mode), and so this is a poke in the system logs that the file system
is corrupted, and they, really, *REALLY* should fix it before they
lose (more) data.

> The inode and block numbers match the numbers printed due to the
> previous bug.

You can also see when the last file system error was detected via:

% date -d @1558114349
Fri 17 May 2019 01:32:29 PM EDT

> Do you have an idea what's wrong?
> Note that I run a very old version of e2fsck (from a decade ago).

... and that's the problem.  If you're going to be using newer
versions of the kernel, you really should be using newer versions of
e2fsprogs.

There have been a lot of bug fixes in the last 10 years, and some of
them can be data corruption bugs

- Ted

Re: INFO: rcu detected stall in ext4_write_checks

2019-06-26 Thread Theodore Ts'o

More details about what is going on.  First, it requires root, because
one of that is required is using sched_setattr (which is enough to
shoot yourself in the foot):

sched_setattr(0, {size=0, sched_policy=0x6 /* SCHED_??? */, sched_flags=0, 
sched_nice=0, sched_priority=0, sched_runtime=2251799813724439, 
sched_deadline=4611686018427453437, sched_period=0}, 0) = 0

This is setting the scheduler policy to be SCHED_DEADLINE, with a
runtime parameter of 2251799.813724439 seconds (or 26 days) and a
deadline of 4611686018.427453437 seconds (or 146 *years*).  This means
a particular kernel thread can run for up to 26 **days** before it is
scheduled away, and if a kernel reads gets woken up or sent a signal,
no worries, it will wake up roughly seven times the interval that Rip
Van Winkle spent snoozing in a cave in the Catskill Mountains (in
Washington Irving's short story).

We then kick off a half-dozen threads all running:

   sendfile(fd, fd, , 0x8080fffe);

(and since count is a ridiculously large number, this gets cut down to):

   sendfile(fd, fd, , 2147479552);

Is it any wonder that we are seeing RCU stalls?   :-)

- Ted

Re: INFO: rcu detected stall in ext4_write_checks

2019-06-26 Thread Theodore Ts'o

The reproducer causes similar rcu stalls when using xfs:

RSP: 0018:aae8c0953c58 EFLAGS: 0246 ORIG_RAX: ff13
RAX: 0288 RBX: b05a RCX: aae8c0953d50
RDX: 001c RSI: 001c RDI: dcec41772800
RBP: dcec41772800 R08: 0015143ceae3 R09: 0001
R10:  R11: 96863980 R12: 88d179ac7d80
R13: 88d174837ca0 R14: 0288 R15: 001c
 generic_file_buffered_read+0x2c1/0x8b0
 xfs_file_buffered_aio_read+0x5f/0x140
 xfs_file_read_iter+0x6e/0xd0
 generic_file_splice_read+0x110/0x1d0
 splice_direct_to_actor+0xd5/0x230
 ? pipe_to_sendpage+0x90/0x90
 do_splice_direct+0x9f/0xd0
 do_sendfile+0x1d4/0x3a0
 __se_sys_sendfile64+0x58/0xc0
 do_syscall_64+0x50/0x1b0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f8038387469

and btrfs

[   42.671321] RSP: 0018:960dc0963b90 EFLAGS: 0246 ORIG_RAX: 
ff13
[   42.673592] RAX: 0001 RBX: 89eb3628ab30 RCX: 
[   42.675775] RDX: 89eb3628a380 RSI:  RDI: 9ec63980
[   42.677851] RBP: 89eb3628aaf8 R08: 0001 R09: 0001
[   42.680028] R10:  R11: 9ec63980 R12: 89eb3628a380
[   42.682213] R13: 0246 R14: 0001 R15: 9ec63980
[   42.684509]  xas_descend+0xed/0x120
[   42.685682]  xas_load+0x39/0x50
[   42.686691]  find_get_entry+0xa0/0x330
[   42.687885]  pagecache_get_page+0x30/0x2d0
[   42.689190]  generic_file_buffered_read+0xee/0x8b0
[   42.690708]  generic_file_splice_read+0x110/0x1d0
[   42.692374]  splice_direct_to_actor+0xd5/0x230
[   42.693868]  ? pipe_to_sendpage+0x90/0x90
[   42.695180]  do_splice_direct+0x9f/0xd0
[   42.696407]  do_sendfile+0x1d4/0x3a0
[   42.697551]  __se_sys_sendfile64+0x58/0xc0
[   42.698854]  do_syscall_64+0x50/0x1b0
[   42.700021]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   42.701619] RIP: 0033:0x7f21e8d5f469

So this is probably a generic vfs issue (probably in sendfile).  Next
steps is probably for someone to do a bisection search covering
changes to the fs/ and mm/ directories.  This should exclude bogus
changes in the networking layer, although it does seem that there is
something very badly wrong which is breaking bisectability, if you
can't even scp to the system under test for certain commit ranges.  :-( :-( :-(

- Ted

Re: INFO: rcu detected stall in ext4_write_checks

2019-06-26 Thread Theodore Ts'o

On Wed, Jun 26, 2019 at 10:27:08AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:abf02e29 Merge tag 'pm-5.2-rc6' of git://git.kernel.org/pu..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1435aaf6a0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=e5c77f8090a3b96b
> dashboard link: https://syzkaller.appspot.com/bug?extid=4bfbbf28a2e50ab07368
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=11234c41a0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15d7f026a0
> 
> The bug was bisected to:
> 
> commit 0c81ea5db25986fb2a704105db454a790c59709c
> Author: Elad Raz 
> Date:   Fri Oct 28 19:35:58 2016 +
> 
> mlxsw: core: Add port type (Eth/IB) set API

Um, so this doesn't pass the laugh test.

> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10393a89a0

It looks like the automated bisection machinery got confused by two
failures getting triggered by the same repro; the symptoms changed
over time.  Initially, the failure was:

crashed: INFO: rcu detected stall in {sys_sendfile64,ext4_file_write_iter}

Later, the failure changed to something completely different, and much
earlier (before the test was even started):

run #5: basic kernel testing failed: failed to copy test binary to VM: failed 
to run ["scp" "-P" "22" "-F" "/dev/null" "-o" "UserKnownHostsFile=/dev/null" 
"-o" "BatchMode=yes" "-o" "IdentitiesOnly=yes" "-o" "StrictHostKeyChecking=no" 
"-o" "ConnectTimeout=10" "-i" "/syzkaller/jobs/linux/workdir/image/key" 
"/tmp/syz-executor216456474" "root@10.128.15.205:./syz-executor216456474"]: 
exit status 1
Connection timed out during banner exchange
lost connection

Looks like an opportunity to improve the bisection engine?

- Ted

Re: [PATCH v2 4/7] linux/signal.h: Ignore SIGINFO by default in new tasks

2019-06-25 Thread Theodore Ts'o

On Tue, Jun 25, 2019 at 07:11:50PM +0300, Arseny Maslennikov wrote:
> This matches the behaviour of other Unix-like systems that have SIGINFO
> and causes less harm to processes that do not install handlers for this
> signal, making the keyboard status character non-fatal for them.
> 
> This is implemented with the assumption that SIGINFO is defined
> to be equivalent to SIGPWR; still, there is no reason for PWR to
> result in termination of the signal recipient anyway — it does not
> indicate there is a fatal problem with the recipient's execution
> context (like e.g. FPE/ILL do), and we have TERM/KILL for explicit
> termination requests.

So this is a consequence of trying to overload SIGINFO with SIGPWR.
At least on some legacy Unix systems, the way SIGPWR worked was that
init would broadcast SIGPWR to all userspace process (with system
daemons on an exception list so they could get shutdown last).
Applications which didn't have a SIGPWR handler registered, would just
exit.  Those that did, were expected to clean up in preparation with
the impending shutdown.  After some period of time, then processes
would get hard killed and then system daemons would get shut down and
the system would cleanly shut itself down.

So SIGPWR acted much like SIGHUP, and that's why the default behavior
was "terminate".  Now, as far as I know, we've not actually used in
that way, at least for most common distributions, but there is a sane
reason why things are they way there are, and in there are people who
have been using SIGPWR the way it had originally been intended, there
is risk in overloading SIGINFO with SIGPWR --- in particular, typing
^T might cause some enterprise database to start shutting itself down.  :-)

(In particular it might be worth checking Linux ports of Oracle and
DB2.)

- Ted

Re: [PATCH v5 00/18] kunit: introduce KUnit, the Linux kernel unit testing framework

2019-06-21 Thread Theodore Ts'o

On Fri, Jun 21, 2019 at 08:59:48AM -0600, shuah wrote:
> > > ### But wait! Doesn't kselftest support in kernel testing?!
> > >
> > > 
> 
> I think I commented on this before. I agree with the statement that
> there is no overlap between Kselftest and KUnit. I would like see this
> removed. Kselftest module support supports use-cases KUnit won't be able
> to. I can build an kernel with Kselftest test modules and use it in the
> filed to load and run tests if I need to debug a problem and get data
> from a system. I can't do that with KUnit.
> 
> In my mind, I am not viewing this as which is better. Kselftest and
> KUnit both have their place in the kernel development process. It isn't
> productive and/or necessary to comparing Kselftest and KUnit without a
> good understanding of the problem spaces for each of these.
>
> I would strongly recommend not making reference to Kselftest and talk
> about what KUnit offers.

Shuah,

Just to recall the history, this section of the FAQ was added to rebut
the ***very*** strong statements that Frank made that there was
overlap between Kselftest and Kunit, and that having too many ways for
kernel developers to do the identical thing was harmful (he said it
was too much of a burden on a kernel developer) --- and this was an
argument for not including Kunit in the upstream kernel.

If we're past that objection, then perhaps this section can be
dropped, but there's a very good reason why it was there.  I wouldn't
Brendan to be accused of ignoring feedback from those who reviewed his
patches.   :-)

- Ted

Re: [PATCH 1/6] mm/fs: don't allow writes to immutable files

2019-06-20 Thread Theodore Ts'o

On Mon, Jun 10, 2019 at 09:46:17PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> The chattr manpage has this to say about immutable files:
> 
> "A file with the 'i' attribute cannot be modified: it cannot be deleted
> or renamed, no link can be created to this file, most of the file's
> metadata can not be modified, and the file can not be opened in write
> mode."
> 
> Once the flag is set, it is enforced for quite a few file operations,
> such as fallocate, fpunch, fzero, rm, touch, open, etc.  However, we
> don't check for immutability when doing a write(), a PROT_WRITE mmap(),
> a truncate(), or a write to a previously established mmap.
> 
> If a program has an open write fd to a file that the administrator
> subsequently marks immutable, the program still can change the file
> contents.  Weird!
> 
> The ability to write to an immutable file does not follow the manpage
> promise that immutable files cannot be modified.  Worse yet it's
> inconsistent with the behavior of other syscalls which don't allow
> modifications of immutable files.
> 
> Therefore, add the necessary checks to make the write, mmap, and
> truncate behavior consistent with what the manpage says and consistent
> with other syscalls on filesystems which support IMMUTABLE.
> 
> Signed-off-by: Darrick J. Wong 

I note that this patch doesn't allow writes to swap files.  So Amir's
generic/554 test will still fail for those file systems that don't use
copy_file_range.

I'm indifferent as to whether you add a new patch, or include that
change in this patch, but perhaps we should fix this while we're
making changes in these code paths?

- Ted

Re: [PATCH v2 3/3] ext4: use jbd2_inode dirty range scoping

2019-06-20 Thread Theodore Ts'o

On Thu, Jun 20, 2019 at 09:18:39AM -0600, Ross Zwisler wrote:
> Use the newly introduced jbd2_inode dirty range scoping to prevent us
> from waiting forever when trying to complete a journal transaction.
> 
> Signed-off-by: Ross Zwisler 
> Reviewed-by: Jan Kara 
> Cc: sta...@vger.kernel.org

Applied, thanks.

- Ted

Re: [PATCH v2 2/3] jbd2: introduce jbd2_inode dirty range scoping

2019-06-20 Thread Theodore Ts'o

On Thu, Jun 20, 2019 at 09:18:38AM -0600, Ross Zwisler wrote:
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 5c04181b7c6d8..0e0393e7f41a4 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1397,6 +1413,12 @@ extern intjbd2_journal_force_commit(journal_t 
> *);
>  extern int  jbd2_journal_force_commit_nested(journal_t *);
>  extern int  jbd2_journal_inode_add_write(handle_t *handle, struct 
> jbd2_inode *inode);
>  extern int  jbd2_journal_inode_add_wait(handle_t *handle, struct 
> jbd2_inode *inode);
> +extern int  jbd2_journal_inode_ranged_write(handle_t *handle,
> + struct jbd2_inode *inode, loff_t start_byte,
> + loff_t length);
> +extern int  jbd2_journal_inode_ranged_wait(handle_t *handle,
> + struct jbd2_inode *inode, loff_t start_byte,
> + loff_t length);
>  extern int  jbd2_journal_begin_ordered_truncate(journal_t *journal,
>   struct jbd2_inode *inode, loff_t new_size);
>  extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, 
> struct inode *inode);

You're adding two new functions that are called from outside the jbd2
subsystem.  To support compiling jbd2 as a module, we also need to add
EXPORT_SYMBOL declarations for these two functions.

I'll take care of this when applying this change.

Thanks, applied.

   - Ted

Re: [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping

2019-06-20 Thread Theodore Ts'o

On Thu, Jun 20, 2019 at 09:09:11AM -0600, Ross Zwisler wrote:
> We could definitely keep separate dirty ranges for each of the current and
> next transaction.  I think the case where you would see a difference would be
> if you had multiple transactions in a row which grew the dirty range for a
> given jbd2_inode, and then had a random I/O workload which kept dirtying pages
> inside that enlarged dirty range.
> 
> I'm not sure how often this type of workload would be a problem.  For the
> workloads I've been testing which purely append to the inode, having a single
> dirty range per jbd2_inode is sufficient.

My inclination would be to keep things simple for now, unless we have
a real workload that tickles this.  In the long run I'm hoping to
remove the need to do writebacks from the journal thread altogether,
by always updating the metadata blocks *after* the I/O completes,
instead of before we submit the I/O.

- Ted

Re: [PATCH] ext4: remove redundant assignment to node

2019-06-19 Thread Theodore Ts'o

On Wed, Jun 19, 2019 at 10:00:06AM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Pointer 'node' is assigned a value that is never read, node is
> later overwritten when it re-assigned a different value inside
> the while-loop.  The assignment is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Applied, thanks.

- Ted

Re: [PATCH] ext4: make __ext4_get_inode_loc plug

2019-06-19 Thread Theodore Ts'o

On Mon, Jun 17, 2019 at 11:57:12PM +0800, jinshui zhang wrote:
> From: zhangjs 
> 
> If the task is unplugged when called, the inode_readahead_blks may not be 
> merged, 
> these will cause small pieces of io, It should be plugged.
> 
> Signed-off-by: zhangjs 

Thanks, applied.

I cleaned up the commit description a little, and I removed some of
the extra empty lines added by the patch.

Cheers,

- Ted

Re: [PATCH] ext4: make __ext4_get_inode_loc plug

2019-06-17 Thread Theodore Ts'o

On Mon, Jun 17, 2019 at 03:11:21PM +0800, Zhangjs Jinshui wrote:
> If the task is unplugged when called, the inode_readahead_blks may not be
> merged,
> these will cause small pieces of io, It should be plugged.
> 
> Signed-off-by: zhangjs 

This patch is white space damaged.   Please see:

https://www.kernel.org/doc/html/latest/process/email-clients.html

for more information.  Could you resend using an appropriate mail
client?

Many thanks!

- Ted

Re: PC speaker

2019-06-13 Thread Theodore Ts'o

On Thu, Jun 13, 2019 at 12:16:37PM -0400, R.F. Burns wrote:
> Is it possible to write a kernel module which, when loaded, will blow
> the PC speaker?

Yes; in fact, it's already been done.  See sound/drivers/pcsp/ in the
Linux kernel sources.

- Ted

Re: [RFC]: Convention for naming syscall revisions

2019-06-06 Thread Theodore Ts'o

On Thu, Jun 06, 2019 at 05:42:25PM +0200, Christian Brauner wrote:
> Hey everyone,
> 
> I hope this is not going to start a trash fire.
> 
> While working on a new clone version I tried to find out what the
> current naming conventions for syscall revisions is. I was told and
> seemed to be able to confirm through the syscall list that revisions of
> syscalls are for the most part (for examples see [1]) named after the
> number of arguments and not for the number of revisions. But some also
> seem to escape that logic (e.g. clone2).

There are also examples which show that it's a revision number:

  preadv2, pwritev2, mlock2, sync_file_range2

immediately come to mind.  It's also important to note that in some
cases, we do something very different (look aht the stat and fstat
variants), and that in some cases the number of parameters for a
system call vary between architectures (because of system call
argument passing limitations), and this gets papered over by glibc.

So we can define what the historical pattern, but there might be a big
difference between what might make sense as an internal naming
convention, and the names that we want to expose to userspace
application programmers --- especially if the number of arguments at
the syscall level might be different (on some architectures) than at
the C library level.

- Ted

Re: [PATCH] ext4: remove unnecessary gotos in ext4_xattr_set_entry

2019-06-01 Thread Theodore Ts'o

On Fri, May 31, 2019 at 03:46:54PM -0600, Andreas Dilger wrote:
> On May 31, 2019, at 6:10 AM, Pavel Tikhomirov  
> wrote:
> > 
> > In the "out" label we only iput old/new_ea_inode-s, in all these places
> > these variables are always NULL so there is no point in goto to "out".
> > 
> > Signed-off-by: Pavel Tikhomirov 
> 
> I'm not a fan of changes like this, since it adds potential complexity/bugs
> if the error handling path is changed in the future.  That is one of the major
> benefits of the "goto out_*" model of error handling is that you only need to
> add one new label to the end of the function when some new state is added that
> needs to be cleaned up, compared to having to check each individual error to
> see if something needs to be cleaned up.

I'm not a fan either, for the reasons Andreas stated; if you ever move
code around, it's much more hazardous because you now have to check if
what had previously been a "return ret" now has to change into "goto
outl".  In some case, it's really obvious, if the code is at the very
beginning of the function, but when you're 35 lines down, well over
the size of many of an editor window, it's no longer quite so obvious
whether or not "goto out" is necessary.

- Ted

Re: [PATCH] jbd2: fix typo in comment of journal_submit_inode_data_buffers

2019-05-30 Thread Theodore Ts'o

On Mon, May 27, 2019 at 10:24:57AM +0200, Jan Kara wrote:
> On Sat 25-05-19 17:12:51, Liu Song wrote:
> > From: Liu Song 
> > 
> > delayed/dealyed
> > 
> > Signed-off-by: Liu Song 
> 
> Thanks. You can add:
> 
> Reviewed-by: Jan Kara 

Applied, thanks.

- Ted

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 3948 matches

Mail list logo