Re: About the try to remove cross-release feature entirely by Ingo

2017-12-29 Thread Amir Goldstein
On Fri, Dec 29, 2017 at 3:47 AM, Byungchul Park  wrote:
> On Wed, Dec 13, 2017 at 03:24:29PM +0900, Byungchul Park wrote:
>> Lockdep works, based on the following:
>>
>>(1) Classifying locks properly
>>(2) Checking relationship between the classes
>>
>> If either (1) or (2) is deficient, we may
>> get false positives.
>>
>> For (1), we don't have to classify locks 100%
>> correctly, but well enough for lockdep to work.
>>
>> For (2), we need a mechanism free of
>> logical defects.
>>
>> Cross-release added an additional capability to
>> (2), and it requires (1) to be classified more precisely.
>>
>> Since the current classification level is too low for
>> cross-release to work, false positives are reported
>> frequently when cross-release is enabled.
>> Yes, it's an obvious problem. It needs to be off by
>> default until classification reaches the level
>> that cross-release requires.
>>
>> But the logic of (2) is valid and logically sound. Please
>> keep the code, mechanism, and logic.
>
> I admit the cross-release feature has introduced several false positives,
> maybe about four so far. And I suggested roughly three ways to solve them. I
> should have explained each in more detail; the lack of detail may have led
> some to misunderstand.
>
>(1) The best way: To classify all waiters correctly.
>
>   Ultimately the problems should be solved this way, but that
>   takes a lot of time, so it's not a way we can use right away.
>   And I need help from the experts of other subsystems.
>
>   While talking about this way, I caused some trouble. I still believe
>   that each subsystem's experts know best how to solve their dependency
>   problems, since each subsystem has its own dependency rules, but this
>   was never about responsibility. I never wanted to put the burden on
>   anyone but myself.
>
>(2) The 2nd way: To make cross-release off by default.
>
>   At the beginning, I proposed making cross-release off by default.
>   Honestly, I was happy to enable it when Ingo suggested turning it on
>   whenever lockdep is on, but I should have kept it off by default.
>   Cross-release can make some happy and others
>   unhappy until the problems go away through (1) or (3).
>
>(3) The 3rd way: To invalidate waiters making trouble.
>
>   Of course, this is not the best way. Given that you have already spent
>   a lot of time fixing the original lockdep's problems since lockdep was
>   introduced in 2006, we don't need this way for typical
>   locks, except in a few special cases. Lockdep is fairly robust by now.
>
>   And I understand you don't want to spend more time fixing
>   additional problems again. Since the situation is different
>   from 2006, it's not too bad to use this way to handle
>   the issues.
>

Purely logically, aren't you missing a 4th option:

(4) The 4th way: To validate specific waiters.

Is it not an option for a subsystem to opt in to cross-release validation
of specific locks/waiters? This may be a much preferred route for cross-
release. I remember seeing a post from a graphics driver developer who
found cross-release useful for finding bugs in his code.

For example, many waiters in the kernel can be waiting for userspace code,
so does that mean cross-release is going to free the world from
userspace deadlocks as well?? Possibly I am missing something.

In any case, it seems logical to me that some waiters should participate
in lock chain dependencies, while other waiters should break the chain,
both to avoid false positives and to avoid protecting against user-configurable
deadlocks (like a loop mount over a file inside the loop-mounted fs).
And if you agree that this claim is correct, then surely an opt-in
approach is the best way forward.

Cheers,
Amir.


Re: false positive lockdep splat with loop device

2017-09-26 Thread Amir Goldstein
On Tue, Sep 26, 2017 at 7:24 AM, Dave Chinner <da...@fromorbit.com> wrote:
> On Thu, Sep 21, 2017 at 09:43:41AM +0300, Amir Goldstein wrote:
>> On Thu, Sep 21, 2017 at 1:22 AM, Dave Chinner <da...@fromorbit.com> wrote:
>> > [cc lkml, PeterZ and Byungchul]
>> ...
>> > The thing is, this IO completion has nothing to do with the lower
>> > filesystem - it's the IO completion for the filesystem on the loop
>> > device (the upper filesystem) and is not in any way related to the
>> > IO completion from the dax device the lower filesystem is waiting
>> > on.
>> >
>> > IOWs, this is a false positive.
>> >
>> > Peter, this is the sort of false positive I mentioned was likely to
>> > occur without some serious work to annotate the IO stack to prevent
>> > them.  We can nest multiple layers of IO completions and locking in
>> > the IO stack via things like loop and RAID devices.  They can be
>> > nested to arbitrary depths, too (e.g. loop on fs on loop on fs on
>> > dm-raid on n * (loop on fs) on bdev) so this new completion lockdep
>> > checking is going to be a source of false positives until there is
>> > an effective (and simple!) way of providing context based completion
>> > annotations to avoid them...
>> >
>>
>> IMO, the way to handle this is to add 'nesting_depth' information
>> on blockdev (or bdi?). 'nesting' in the sense of blockdev->fs->blockdev->fs.
>> AFAIK, the only blockdev drivers that need to bump nesting_depth
>> are loop and nbd??
>
> You're assuming that this sort of "completion inversion" can only
> happen with bdev->fs->bdev, and that submit_bio_wait() is the only
> place where completions are used in stackable block devices.
>
> AFAICT, this could happen with any block device that can be
> stacked multiple times that uses completions. e.g. MD has a function
> sync_page_io() that calls submit_bio_wait(), and that is called from
> places in the raid 5, raid 10, raid 1 and bitmap layers (plus others
> in DM). These can get stacked anywhere - even on top of loop devices
> - and so I think the issue has a much wider scope than just loop and
> nbd devices.

True. We can probably duplicate the struct file_system_type pattern,
something like:

static struct block_device_type loop_blkdev_type = {
        .owner = THIS_MODULE,
        .name  = "loop",
        .devt  = MKDEV(LOOP_MAJOR, 0),
        .probe = loop_probe,
        .lock  = NULL,
};

...
blk_register_region(&loop_blkdev_type, range, NULL);

And then we have a static placeholder for a lock_class_key address
to be used when annotating bio completions.
I realize the same block device types can be nested via raid/dm/loop,
but as we can see in the splat, so can the same file system types via
loop/nbd, so we can think of a similar solution for both cases.

>
>> Not sure if the kernel should limit loop blockdev nesting depth??
>
> There's no way we should do that just because new lockdep
> functionality is unable to express such constructs.
>

Of course not. I was just wondering if there should be a limit to
blockdev nesting regardless (e.g. too much queuing in the io stack).

But even if there is no limit to blockdev nesting, we can define
CONFIG_LOCKDEP_MAX_NESTING and:

struct lock_class_nested_key {
        struct lock_class_key keys[CONFIG_LOCKDEP_MAX_NESTING];
};

Then struct file_system_type and struct block_device_type keys
could be of type struct lock_class_nested_key.

nested_depth should be carried in struct gendisk, and it is the
responsibility of the blockdev driver to bump nested_depth if it
is a nested blockdev.

Callers of lockdep_set_class(), like lockdep_annotate_inode_mutex_key(),
should use a variant lockdep_set_class_nested(lock, key, nested_depth)
where appropriate.

For any depth greater than CONFIG_LOCKDEP_MAX_NESTING,
lockdep_set_class_nested() will overload the last key in the
array, so we don't solve false positives for unbounded nesting depth,
but we do solve them for a defined max nesting depth.

Alas, even if this is workable, it will not be anytime soon that I can
back up this scribble with tested patches.

Amir.


Re: [PATCH v2] xfs: fix incorrect log_flushed on fsync

2017-09-06 Thread Amir Goldstein
[-stable]

On Tue, Sep 5, 2017 at 5:40 PM, Brian Foster <bfos...@redhat.com> wrote:
> On Sat, Sep 02, 2017 at 06:47:03PM +0300, Amir Goldstein wrote:
>> On Sat, Sep 2, 2017 at 4:19 PM, Brian Foster <bfos...@redhat.com> wrote:
>> > On Fri, Sep 01, 2017 at 06:39:25PM +0300, Amir Goldstein wrote:
>> >> When calling into _xfs_log_force{,_lsn}() with a pointer
>> >> to log_flushed variable, log_flushed will be set to 1 if:
>> >> 1. xlog_sync() is called to flush the active log buffer
>> >> AND/OR
>> >> 2. xlog_wait() is called to wait on a syncing log buffer
>> >>
>> >> xfs_file_fsync() checks the value of log_flushed after
>> >> _xfs_log_force_lsn() call to optimize away an explicit
>> >> PREFLUSH request to the data block device after writing
>> >> out all the file's pages to disk.
>> >>
>> >> This optimization is incorrect in the following sequence of events:
>> >>
>> >>  Task A                       Task B
>> >>  -------------------------    -------------------------
>> >>  xfs_file_fsync()
>> >>    _xfs_log_force_lsn()
>> >>      xlog_sync()
>> >>         [submit PREFLUSH]
>> >>                               xfs_file_fsync()
>> >>                                 file_write_and_wait_range()
>> >>                                   [submit WRITE X]
>> >>                                   [endio  WRITE X]
>> >>                                 _xfs_log_force_lsn()
>> >>                                   xlog_wait()
>> >>         [endio  PREFLUSH]
>> >>
>> >> Write X is not guaranteed to be on persistent storage
>> >> when the PREFLUSH request is completed, because write X was submitted
>> >> after the PREFLUSH request, but xfs_file_fsync() of task A will
>> >> be notified of log_flushed=1 and will skip the explicit flush.
>> >>
>> >> If the system crashes after fsync of task A, write X may not be
>> >> present on disk after reboot.
>> >>
>> >> This bug was discovered and demonstrated using Josef Bacik's
>> >> dm-log-writes target, which can be used to record block io operations
>> >> and then replay a subset of these operations onto the target device.
>> >> The test goes something like this:
>> >> - Use fsx to execute ops of a file and record ops on log device
>> >> - Every now and then fsync the file, store md5 of file and mark
>> >>   the log
>> >> - Replay the log up to each mark, mount the fs, and compare the
>> >>   md5 of file to stored value
>> >>
>> >> Cc: Christoph Hellwig <h...@lst.de>
>> >> Cc: Josef Bacik <jba...@fb.com>
>> >> Cc: <sta...@vger.kernel.org>
>> >> Signed-off-by: Amir Goldstein <amir7...@gmail.com>
>> >> ---
>> >>
>> >> Christoph,
>> >>
>> >> Here is another, more subtle, version of the fix.
>> >> It keeps a lot more use cases optimized (e.g. many threads doing fsync
>> >> and setting WANT_SYNC may still be optimized).
>> >> It also addresses your concern that xlog_state_release_iclog()
>> >> may not actually start the buffer sync.
>> >>
>> >> I tried to keep the logic as simple as possible:
>> >> If we see a buffer that is not yet SYNCING and we later
>> >> see that l_last_sync_lsn is >= the lsn of that buffer,
>> >> we can safely say log_flushed.
>> >>
>> >> I only tested this patch through a few crash test runs, not even full
>> >> xfstests cycle, so I am not sure it is correct, just posting to get
>> >> your feedback.
>> >>
>> >> Is it worth something over the simpler v1 patch?
>> >> I can't really say.
>> >>
>> >
>> > This looks like it has a couple potential problems on a very quick look
>> > (so I could definitely be mistaken). E.g., the lsn could be zero at the
>> > time we set log_flushed in _xfs_log_force(). It also looks like waiting
>> > on an iclog that is waiting to run callbacks due to out of order
>> > completion could be interpreted as a log flush having occurred, but I
>> > haven't stared at this long enough to say whether that is actually
>> > possible.
>> >
>> > Stepping back from the details.. this seems like it could be done
>> > correctly in general. IIUC what you want to know is whether any iclog
>> > went from a pre-SYNCING state to a post-SYNCING state during the log

Re: [RFC] failure atomic writes for file systems and block devices

2017-03-01 Thread Amir Goldstein
On Tue, Feb 28, 2017 at 4:57 PM, Christoph Hellwig  wrote:
> Hi all,
>
> this series implements a new O_ATOMIC flag for failure-atomic writes
> to files.  It is based on, and tries to unify, two earlier proposals:
> the first one for block devices by Chris Mason:
>
> https://lwn.net/Articles/573092/
>
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
>
> 
> https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
>
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is, either the whole write makes it to persistent
> storage or none of it, even in the case of power or other failures.
>
> There are two implementation variants of this: on block devices, O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.
>
> The second case is for file systems, where we simply write new blocks
> out of place and then remap them into the file atomically on either
> completion of an O_(D)SYNC write or when fsync is called explicitly.
>
> The semantics of the latter case are explained in detail at the Usenix
> paper above.
>
> Last but not least a new fcntl is implemented to provide information
> about I/O restrictions such as alignment requirements and the maximum
> atomic write size.
>
> The implementation is simple and clean, but I'm rather unhappy about
> the interface, as it has too many failure modes to bulletproof.  For
> one, old kernels silently ignore unknown open flags, so applications
> have to check the F_IOINFO fcntl first, which is a bit of a killer.
> Because of that I've also not implemented any other validity checks
> yet, as they might make things even worse when an open on an unsupported
> file system or device fails, but not on an old kernel.  Maybe we need
> a new open version that checks arguments properly first?
>

[CC += linux-...@vger.kernel.org] for that question and for the new API

> Also I'm really worried about the NVMe failure modes - devices simply
> advertise an atomic write size, with no way for the device to know
> that the host requested a given write to be atomic, and thus no
> error reporting.  This is made worse by NVMe 1.2 adding per-namespace
> atomic I/O parameters that devices can use to introduce additional
> odd alignment quirks - while there is some language in the spec
> requiring them not to weaken the per-controller guarantees it all
> looks rather weak and I'm not too confident in all implementations
> getting everything right.
>
> Last but not least this depends on a few XFS patches, so to actually
> apply / run the patches please use this git tree:
>
> git://git.infradead.org/users/hch/vfs.git O_ATOMIC
>
> Gitweb:
>
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC