't work either. No device created in /dev (dax or pmem).
I think you need to do some ndctl magic to get the memory to be
namespaced correctly for the correct devices to appear.
https://docs.pmem.io/ndctl-user-guide/managing-namespaces
IIRC, you need to set the type to pmem and the mode to fsdax, devdax or
raw to get the relevant device nodes to be created for the range...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Wed, Jan 31, 2024 at 09:58:21AM -0500, Mathieu Desnoyers wrote:
> On 2024-01-30 21:48, Dave Chinner wrote:
> > On Tue, Jan 30, 2024 at 11:52:54AM -0500, Mathieu Desnoyers wrote:
> > > Introduce a generic way to query whether the dcache is virtually aliased
> > >
one-liner should go into
fs_dax_get_by_bdev(), similar to the blk_queue_dax() check at the
start of the function.
I also noticed that device mapper uses fs_dax_get_by_bdev() to
determine if it can support DAX, but this patch set does not address
that case. Hence it really seems to me like fs_dax_get_by_bdev() is
the right place to put this check.
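Roughly, something like this at the top of fs_dax_get_by_bdev() (a
sketch only - cpu_dcache_is_aliased() is just a placeholder name for
whatever the generic aliasing query ends up being called, and the
rest of the function is unchanged):

	if (!blk_queue_dax(bdev->bd_disk->queue))
		return NULL;

	/* DAX mappings can't be kept coherent on an aliased dcache. */
	if (cpu_dcache_is_aliased())
		return NULL;

	/* ... existing dax_device lookup for this bdev ... */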
-Dave.
--
Dave Chinner
da...@fromorbit.com
configurations with
the VFS dentry cache aliasing when we read this code? Something like
cpu_dcache_is_aliased(), perhaps?
-Dave.
--
Dave Chinner
da...@fromorbit.com
h
currently returns NULL if CONFIG_FS_DAX=n and so should be changed
to return NULL if any of these platform configs is enabled.
Then I don't think you need to change a single line of filesystem
code - they'll all just do what they do now if the block device
doesn't support DAX
-Dave.
--
Dave Chinner
da...@fromorbit.com
is
instantiated in cache - if the inode has a flag that says "use DAX"
and dax is supportable by the hardware, then we turn on DAX for
that inode. Otherwise we just use the normal non-dax IO paths.
Again, we don't error out the filesystem if DAX is not supported,
we just don't turn it on. This check is done in
xfs_inode_should_enable_dax() and I think all you need to do is
replace the IS_ENABLED(CONFIG_FS_DAX) with a dax_is_supported()
call...
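i.e. roughly this (a sketch only - dax_is_supported() being the
helper this patch set would provide, with all the existing per-mount
and per-inode checks left exactly as they are):

	static bool
	xfs_inode_should_enable_dax(
		struct xfs_inode	*ip)
	{
		if (!dax_is_supported())	/* was IS_ENABLED(CONFIG_FS_DAX) */
			return false;
		/* ... existing mount option and inode flag checks ... */
	}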
-Dave.
--
Dave Chinner
da...@fromorbit.com
be write() IO dirtying new data or other
transactions running dirtying the journal/metadata. Both
sync_filesystem() and super_drop_pagecache() operate on current
state - they don't prevent future dax mapping instantiation or
dirtying from happening on the device, so they don't prevent this...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, Sep 27, 2022 at 09:02:48AM -0700, Darrick J. Wong wrote:
> On Tue, Sep 27, 2022 at 02:53:14PM +0800, Shiyang Ruan wrote:
> >
> >
> > 在 2022/9/20 5:15, Dave Chinner 写道:
> > > On Mon, Sep 19, 2022 at 02:50:03PM +1000, Dave Chinner wrote:
> > > >
down, then everything will fail before removal finally
triggers, and the act of unmounting the filesystem post device
removal will clean up the page cache and all the other caches.
IOWs, I don't understand why the page cache is considered special
here (as opposed to, say, the inode or dentry caches), nor why we
aren't shutting down the filesystem directly after syncing it to
disk to ensure that we don't end up with applications losing data as
a result of racing with the removal
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, Sep 20, 2022 at 09:17:07AM +0800, Shiyang Ruan wrote:
> Hi Dave,
>
> 在 2022/9/20 5:15, Dave Chinner 写道:
> > On Mon, Sep 19, 2022 at 02:50:03PM +1000, Dave Chinner wrote:
> > > On Thu, Sep 15, 2022 at 09:26:42AM +, Shiyang Ruan wrote:
> > > > Since
On Mon, Sep 19, 2022 at 02:50:03PM +1000, Dave Chinner wrote:
> On Thu, Sep 15, 2022 at 09:26:42AM +, Shiyang Ruan wrote:
> > Since reflink&fsdax can work together now, the last obstacle has been
> > resolved. It's time to remove restrictions and drop this warning
0x3bed bytes)
6( 6 mod 256): TRUNCATE DOWN from 0x4 to 0x28b68 **
7( 7 mod 256): COLLAPSE 0x14000 thru 0x14fff (0x1000 bytes)
8( 8 mod 256): TRUNCATE UP from 0x27b68 to 0x3a9c4 **
9( 9 mod 256): READ 0x9cb7 thru 0x19799(0xfae3 bytes)
10( 10 mod 256): PUNCH 0x1b3a8 thru 0x1dff8 (0x2c51 bytes)
--
Dave Chinner
da...@fromorbit.com
>
> While we're at it, add the usual "xfs_" prefix to struct failure_info,
> and actually initialize mf_flags.
>
> Signed-off-by: Darrick J. Wong
Looks fine.
Reviewed-by: Dave Chinner
--
Dave Chinner
da...@fromorbit.com
:
pwritev2(RWF_NOWAIT) can return -EOPNOTSUPP on buffered writes.
Documented in the man page.
FICLONERANGE on a filesystem that doesn't support reflink will
return -EOPNOTSUPP. Documented in the man page.
mmap(MAP_SYNC) returns -EOPNOTSUPP if the underlying filesystem
and/or storage doesn't support DAX. Documented in the man page.
I could go on, but I think I've made the point already...
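For example, a userspace caller is expected to notice the error and
fall back - a minimal sketch (illustrative only, assuming glibc's
pwritev2() wrapper):

	#define _GNU_SOURCE
	#include <sys/uio.h>
	#include <errno.h>

	/*
	 * Try a non-blocking write first; fall back to a normal blocking
	 * write if RWF_NOWAIT isn't supported on this IO path or the
	 * write would block.
	 */
	static ssize_t write_nowait_or_fallback(int fd, const struct iovec *iov,
						int iovcnt, off_t off)
	{
		ssize_t ret = pwritev2(fd, iov, iovcnt, off, RWF_NOWAIT);

		if (ret < 0 && (errno == EOPNOTSUPP || errno == EAGAIN))
			ret = pwritev2(fd, iov, iovcnt, off, 0);
		return ret;
	}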
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, May 10, 2022 at 06:55:50PM -0700, Dan Williams wrote:
> [ add Andrew ]
>
>
> On Tue, May 10, 2022 at 6:49 PM Dave Chinner wrote:
> >
> > On Tue, May 10, 2022 at 05:03:52PM -0700, Darrick J. Wong wrote:
> > > On Sun, May 08, 2022 at 10:36:06PM +0800, Shiy
doubt it would be
ready for merge in the next cycle...
> I could just add the entire series to iomap-5.20-merge and base the
> xfs-5.20-merge off of that? But I'm not sure what else might be landing
> in the other subsystems, so I'm open to input.
It'll need to be a stable branch somewhere, but I don't think it
really matters where as long as it's merged into the xfs for-next
tree so it gets filesystem test coverage...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Fri, Apr 22, 2022 at 02:27:32PM -0700, Dan Williams wrote:
> On Thu, Apr 21, 2022 at 12:47 AM Dave Chinner wrote:
> >
> > On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> > > On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> >
On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > a single release. However, being unable to cleanly merge code we
>
On Wed, Apr 20, 2022 at 07:20:07PM -0700, Dan Williams wrote:
> [ add Andrew and Naoya ]
>
> On Wed, Apr 20, 2022 at 6:48 PM Shiyang Ruan wrote:
> >
> > Hi Dave,
> >
> > 在 2022/4/21 9:20, Dave Chinner 写道:
> > > Hi Ruan,
> > >
> > >
so that we can run it through filesystem
level DAX+reflink testing. That will mean we need this in a stable
shared topic branch and tighter co-ordination between the trees.
So before we go any further we need to know if the dax+reflink
enablement patchset is near being ready to merge
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, Apr 12, 2022 at 07:06:40PM -0700, Dan Williams wrote:
> On Tue, Apr 12, 2022 at 5:04 PM Dave Chinner wrote:
> > On Mon, Apr 11, 2022 at 12:09:03AM +0800, Shiyang Ruan wrote:
> > > Introduce xfs_notify_failure.c to handle failure related works, such as
> > >
structures this rmapbt walk is dependent on
(e.g. perag structures) have been initialised yet, so there are null
pointer dereferences going to happen here.
Perhaps even worse is that the rmapbt is not guaranteed to be in a
consistent state until after log recovery has completed, so this
walk could get stuck forever in a stale on-disk cycle that
recovery would have corrected
Hence these notifications need to be delayed until after the
filesystem is mounted, all the internal structures have been set up
and log recovery has completed.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Fri, Apr 16, 2021 at 10:14:39AM +0530, Bharata B Rao wrote:
> On Wed, Apr 07, 2021 at 08:28:07AM +1000, Dave Chinner wrote:
> > On Mon, Apr 05, 2021 at 11:18:48AM +0530, Bharata B Rao wrote:
> >
> > > As an alternative approach, I have this below hack that does lazy
On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner wrote:
> > > > Profiles would be intere
On Wed, Apr 14, 2021 at 08:43:36AM -0600, Jens Axboe wrote:
> On 4/13/21 5:14 PM, Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> >> On 4/13/21 1:51 AM, SeongJae Park wrote:
> >>> From: SeongJae Park
> >>>
> >&
On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote:
> On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner wrote:
> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
> > > On 4/13/21 1:51 AM, SeongJae Park wrote:
> > > > From: SeongJ
sounds to me like reclaim
*might* be batching page cache removal better (e.g. fewer, larger
batches) and so spending less time contending on the mapping tree
lock...
IOWs, I suspect this result might actually be a result of less lock
contention due to a change in batch processing characteristics of
the new algorithm rather than it being a "better" algorithm...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, Apr 13, 2021 at 01:18:35AM +0200, Thomas Gleixner wrote:
> Dave,
>
> On Tue, Apr 13 2021 at 08:15, Dave Chinner wrote:
> > On Mon, Apr 12, 2021 at 05:20:53PM +0200, Thomas Gleixner wrote:
> >> On Wed, Apr 07 2021 at 07:22, Dave Chinner wrote:
> >> &
On Mon, Apr 12, 2021 at 05:20:53PM +0200, Thomas Gleixner wrote:
> Dave,
>
> On Wed, Apr 07 2021 at 07:22, Dave Chinner wrote:
> > On Tue, Apr 06, 2021 at 02:28:34PM +0100, Matthew Wilcox wrote:
> >> On Tue, Apr 06, 2021 at 10:33:43PM +1000, Dave Chinner wrote:
associated with a memcg very quickly
(via mem_cgroup_lruvec()). This will find pages associated directly
with the memcg, so it gives you a fairly accurate picture of the
page cache usage within the container.
This has none of the issues that arise from "sb != mnt_ns" that
walking superblocks and inode lists have, and it doesn't require you
to play games with mounts, superblocks and inode references
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
SHRINKER_MEMCG_AWARE flag. This could be based on fstype - most
virtual filesystems that expose system information do not really
need full memcg awareness because they are generally only visible to
a single memcg instance...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, Apr 06, 2021 at 02:28:34PM +0100, Matthew Wilcox wrote:
> On Tue, Apr 06, 2021 at 10:33:43PM +1000, Dave Chinner wrote:
> > +++ b/fs/inode.c
> > @@ -57,8 +57,7 @@
> >
> > static unsigned int i_hash_mask __read_mostly;
> > static unsigned int i_hash
From: Dave Chinner
Because scalability of the global inode_hash_lock really, really
sucks and prevents me from doing scalability characterisation and
analysis of bcachefs algorithms.
Profiles of a 32-way concurrent create of 51.2m inodes with fsmark
on a couple of different filesystems on a
From: Dave Chinner
In preparation for switching the VFS inode cache over to hlist_bl
lists, we need to be able to fake a list node that looks like it is
hashed, for correct operation of filesystems that don't directly use
the VFS inode cache.
Signed-off-by: Dave Chinner
---
include/
From: Dave Chinner
In preparation for changing the inode hash table implementation.
Signed-off-by: Dave Chinner
---
fs/inode.c | 44 +---
1 file changed, 25 insertions(+), 19 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index a047ab306f9a
Hi folks,
Recently I've been doing some scalability characterisation of
various filesystems, and one of the limiting factors that has
prevented me from exploring filesystem characteristics is the
inode hash table, namely the global inode_hash_lock that protects
it.
This has long been a problem,
On Thu, Mar 18, 2021 at 12:20:35PM -0700, Dan Williams wrote:
> On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner wrote:
> >
> > On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > > Jason wondered why the get_user_pages_fast() path takes references on a
to run a device-wide invalidation.
So, yeah, I think this should simply be a single ranged call to the
filesystem like:
->memory_failure(dev, 0, -1ULL)
to tell the filesystem that the entire backing device has gone away,
and leave the filesystem to handle failure entirely at the
filesystem level.
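Purely as a sketch of the shape of that interface (every name here is
a placeholder for illustration, not a final API):

	/*
	 * Hypothetical bottom-up failure notification from the pmem
	 * driver to the filesystem holding the dax device.
	 */
	struct pmem_failure_ops {
		int (*memory_failure)(struct dax_device *dax_dev,
				      u64 offset, u64 len, int flags);
	};

	/* "the whole device is gone" then becomes a single ranged call: */
	/*	ops->memory_failure(dax_dev, 0, -1ULL, 0);		  */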
-Dave.
--
Dave Chinner
da...@fromorbit.com
on would be for a "struct cage" as in Compound pAGE
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
gt; synchronous metadata changes being committed to the cache in one go
> (truncates, fallocates, fsync, xattrs, unlink+link of tmpfile) - and this
> can take quite a long time. The cache needs to be more proactive in
> getting stuff committed as it goes along.
Workqueues giv
converted to written until the
data is written back and the filesystem runs a conversion
transaction.
So, yeah, if you use FIEMAP to determine where data lies in a file
that is being actively modified, you're going to get corrupt data
sooner rather than later. SEEK_HOLE/DATA are coherent with in-memory
user data, so they don't have this problem.
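For example, a minimal userspace sketch of walking the data ranges of
a file with SEEK_DATA/SEEK_HOLE (illustrative only):

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <stdio.h>

	/*
	 * Print every data extent. SEEK_DATA/SEEK_HOLE see dirty page
	 * cache data, so the result is coherent with in-flight buffered
	 * writes, unlike FIEMAP.
	 */
	static void dump_data_ranges(int fd)
	{
		off_t data = 0, hole;

		while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
			hole = lseek(fd, data, SEEK_HOLE);
			if (hole < 0)
				break;
			printf("data: [%lld, %lld)\n",
			       (long long)data, (long long)hole);
			data = hole;
		}
	}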
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
through the layers, and device disappearance may in fact manifest to
the user as data corruption rather than causing data to be
inaccessible.
Hence "remove" notifications just don't work in the storage stack.
They need to be translated to block ranges going bad (i.e. media
errors), a
On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner wrote:
> [..]
> > We do not need a DAX specific mechanism to tell us "DAX device
> > gone", we need a generic block device interface that tells us "
On Mon, Mar 01, 2021 at 04:32:36PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner wrote:
> > Now we have the filesytem people providing a mechanism for the pmem
> > devices to tell the filesystems about physical device failures so
> > they can
On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner wrote:
> >
> > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote:
> > > > On F
On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner wrote:
> > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote:
> > > > On Fri, Feb 26,
On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote:
> > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner wrote:
> > > > > My imm
On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> On Fri, Feb 26, 2021 at 12:51 PM Dave Chinner wrote:
> >
> > On Fri, Feb 26, 2021 at 11:24:53AM -0800, Dan Williams wrote:
> > > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong
> > > wrote:
> &
Then when userspace tries to access the
mapped DAX pages we get a new page fault. In processing the fault, the
filesystem will try to get direct access to the pmem from the block
device. This will get an ENODEV error from the block device because
the backing store (pmem) has been unplugged and is no longer
there...
AFAICT, as long as pmem removal invalidates all the active ptes that
point at the pmem being removed, the filesystem doesn't need to
care about device removal at all, DAX or no DAX...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
n care at this
point about cross-device XCOPY?
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote:
> On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote:
> > On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote:
> > > On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner wrote:
> > > >
&
On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote:
> On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner wrote:
> >
> > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote:
> > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > > &
ly breaking? What changed in
> > the kernel that caused this? Procfs has been around for a _very_ long
> > time :)
>
> That would be because of (v5.3):
>
> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
>
> The intention of this change (series) was to
It is not
intended as a copy mechanism for copying data from one random file
descriptor to another.
The use of it as a general file copy mechanism in the Go system
library is incorrect and wrong. It is a userspace bug. Userspace
has done the wrong thing, userspace needs to be fixed.
-Dave.
--
Dave Chinner
da...@fromorbit.com
back. It's likely to be too much work for a bound
workqueue, too, especially when you consider that the workqueue
completion code will merge sequential ioends into one ioend, hence
making the IO completion loop counts bigger and latency problems worse
rather than better...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
to list the requested attributes of all
directories and files in the tree...
So, yeah, we do indeed do thousands of these fsxattr based
operations a second, sometimes tens of thousands a second or more,
and sometimes they are issued in bulk in performance critical paths
for container build/deployment operations
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
mechanisms. Of course, with these
special zero length files that contain ephemeral data, userspace can't
actually tell that they contain data using stat(). So
as far as userspace is concerned, copy_file_range() correctly
returned zero bytes copied from a zero byte long file and there's
nothing more to do.
This zero length file behaviour is, fundamentally, a kernel
filesystem implementation bug, not a copy_file_range() bug.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, Jan 26, 2021 at 11:50:50AM +0800, Nicolas Boichat wrote:
> On Tue, Jan 26, 2021 at 9:34 AM Dave Chinner wrote:
> >
> > On Mon, Jan 25, 2021 at 03:54:31PM +0800, Nicolas Boichat wrote:
> > > Hi copy_file_range experts,
> > >
> > > We hit this in
't check the file
size and just attempts to read unconditionally from the file. Hence
it happily returns non-existent stale data from busted filesystem
implementations that allow data to be read from beyond EOF...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
and so
provide the same benefit to all the filesystems that use it.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > e.g. we run the first transaction into the CIL, it steals the space
> > needed for the CIL checkpoint headers for the transaction. Then if
> > the
On Mon, Jan 11, 2021 at 11:38:48AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> > On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > > >
.com/
and that should also allow the work skipped on each memcg to be
accounted across multiple calls to the shrinkers for the same
memcg. Hence as memory pressure within the memcg goes up, the
repeated calls to direct reclaim within that memcg will result in
all of the freeable items in each cache eventually being freed...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Fri, Jan 08, 2021 at 03:59:22PM +0800, Ming Lei wrote:
> On Thu, Jan 07, 2021 at 09:21:11AM +1100, Dave Chinner wrote:
> > On Wed, Jan 06, 2021 at 04:45:48PM +0800, Ming Lei wrote:
> > > On Tue, Jan 05, 2021 at 07:39:38PM +0100, Christoph Hellwig wrote:
> > > > A
On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote:
> On 02.01.21 23:44, Dave Chinner wrote:
> > On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> > > On 31.12.20 22:59, Dave Chinner wrote:
> > > > On Thu, Dec 31, 2020 at 12:48:5
On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > If the value goes below the limit while some threads are
> > > already
everything we need to
determine whether we should do a large or small bio vec allocation
in the iomap writeback path...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> On 31.12.20 22:59, Dave Chinner wrote:
> > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> > > On 30.12.20 23:16, Dave Chinner wrote:
> > One could argue that, but one should al
lifts of the context setting up into
xfs_trans_alloc() back into the patchset before adding the
current->journal functionality patch.
Also, you need to test XFS code with CONFIG_XFS_DEBUG=y so that
asserts are actually built into the code and exercised, because this
ASSERT should have fired on the first rolling transaction that the
kernel executes...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> On 30.12.20 23:16, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > Threads, which committed items to the CIL, wait in the
> > > xc_push_wait waitqueue when use
> wake_up_all(&cil->xc_push_wait);
That just smells wrong to me. It *might* be correct, but this
condition should pair with the sleep condition, as space used by a
CIL context should never actually decrease
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
is
> related to that, because the md block devices itself are
> responsive (`xxd /dev/md0` )
My bet is that the OOT driver/hardware had dropped a log IO on the
floor - XFS is waiting for the CIL push to complete, and I'm betting
that is stuck waiting for iclog IO completion while writing the CIL
to the journal. The sysrq output will tell us if this is the case,
so that's the first place to look.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
inspection. But I'm
> not a VFS expert so I'm not quite sure.
Uh, if you have a shrinker racing to register and unregister, you've
got a major bug in your object initialisation/teardown code, i.e.
calling register/unregister at the same time for the same shrinker
is a bug, pure and simple.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
way.
So, AFAICT, the dax_lock() stuff is only necessary when the
filesystem can't be used to resolve the owner of the physical page
that went bad.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, Dec 15, 2020 at 02:27:18PM -0800, Yang Shi wrote:
> On Mon, Dec 14, 2020 at 6:46 PM Dave Chinner wrote:
> >
> > On Mon, Dec 14, 2020 at 02:37:19PM -0800, Yang Shi wrote:
> > > Use per memcg's nr_deferred for memcg aware shrinkers. The shrinker's
>
Combine that with the proposed "watch_sb()" syscall for reporting
such errors in a generic manner to interested listeners, and we've
got a fairly solid generic path for reporting data loss events to
userspace for an appropriate user-defined action to be taken...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
you still have a user data
recovery process to perform after this...
> And how does it help in dealing with page faults upon poisoned
> dax page?
It doesn't. If the page is poisoned, the same behaviour will occur
as does now. This is simply error reporting infrastructure, not
error handling.
Future work might change how we correct the faults found in the
storage, but I think the user visible behaviour is going to be "kill
apps mapping corrupted data" for a long time yet
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, Dec 15, 2020 at 02:53:48PM +0100, Johannes Weiner wrote:
> On Tue, Dec 15, 2020 at 01:09:57PM +1100, Dave Chinner wrote:
> > On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
> > > Since memcg_shrinker_map_size just can be changd under holding
&g
return;
>
> kfree(shrinker->nr_deferred);
> shrinker->nr_deferred = NULL;
e.g. then this function can simply do:
{
	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
		return unregister_memcg_shrinker(shrinker);

	kfree(shrinker->nr_deferred);
	shrinker->nr_deferred = NULL;
}
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
acd..693a41e89969 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -201,7 +201,7 @@ DECLARE_RWSEM(shrinker_rwsem);
> #define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
>
> static DEFINE_IDR(shrinker_idr);
> -static int shrinker_nr_max;
> +int shrinker_nr_max;
Then we don't need to make yet another variable global...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
ile it may help your specific corner case,
it's likely to significantly change the reclaim balance of slab
caches, especially under GFP_NOFS intensive workloads where we can
only defer the work to kswapd.
Hence I think this is still a problematic approach as it doesn't
address the reason why deferred counts are increasing out of
control in the first place
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
r will do that for static functions automatically if it makes
sense.
Ok, so you only do the memcg nr_deferred thing if NUMA_AWARE &&
sc->memcg is true, so:
static long shrink_slab_set_nr_deferred_memcg(...)
{
	int nid = sc->nid;

	deferred = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_deferred,
					     true);
	return atomic_long_add_return(nr, &deferred->nr_deferred[id]);
}

static long shrink_slab_set_nr_deferred(...)
{
	int nid = sc->nid;

	if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
		nid = 0;
	else if (sc->memcg)
		return shrink_slab_set_nr_deferred_memcg(, nid);

	return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]);
}
And now there's no duplicated code.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
nd
nr_deferred pointers to the correct offset in the allocated range.
Then this patch really only changes the size of the chunk
being allocated, setting up the pointers and copying the relevant
data from the old allocation to the new one.
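A sketch of that single-allocation layout (structure and names here
are illustrative only, not the actual mm/vmscan.c code):

	struct shrinker_info_sketch {
		struct rcu_head	rcu;
		atomic_long_t	*nr_deferred;	/* points into data[] */
		unsigned long	*map;		/* points into data[] */
		unsigned long	data[];		/* map bits, then deferred counts */
	};

	static struct shrinker_info_sketch *alloc_info(int map_size, int defer_size)
	{
		struct shrinker_info_sketch *info;

		/* map_size and defer_size are both in bytes */
		info = kvzalloc(sizeof(*info) + map_size + defer_size, GFP_KERNEL);
		if (!info)
			return NULL;
		info->map = info->data;
		info->nr_deferred = (void *)((char *)info->data + map_size);
		return info;
	}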
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
is a good idea. This couples the shrinker
infrastructure to internal details of how cgroups are initialised
and managed. Sure, certain operations might be done in certain
shrinker lock contexts, but that doesn't mean we should share global
locks across otherwise independent subsystems
Chee
up
that the barriers enforce.
IOWs, these memory barriers belong inside the cgroup code to
guarantee anything that sees an online cgroup will always see the
fully initialised cgroup structures. They do not belong in the
shrinker infrastructure...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Tue, Dec 15, 2020 at 01:03:45AM +, Pavel Begunkov wrote:
> On 15/12/2020 00:56, Dave Chinner wrote:
> > On Tue, Dec 15, 2020 at 12:20:23AM +, Pavel Begunkov wrote:
> >> As reported, we must not do pressure stall information accounting for
> >> direct IO, beca
On Tue, Dec 15, 2020 at 12:00:23PM +1100, Dave Chinner wrote:
> On Tue, Dec 15, 2020 at 12:20:24AM +, Pavel Begunkov wrote:
> > A preparation patch. It adds a simple helper which abstracts out number
> > of segments we're allocating for a bio from iov_iter_npages().
>
io_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
> {
> + /* reuse iter->bvec */
> + if (iov_iter_is_bvec(iter))
> + return 0;
> return iov_iter_npages(iter, max_segs);
Ah, I'm a blind idiot... :/
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
de this specific patch, so it's not clear what it's
actually needed for...
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
for paging IO */
> + bio_clear_flag(bio, BIO_WORKINGSET);
Why only do this for the old direct IO path? Why isn't this
necessary for the iomap DIO path?
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Wed, Dec 02, 2020 at 03:12:20PM +0800, Ruan Shiyang wrote:
> Hi Dave,
>
> On 2020/11/30 上午6:47, Dave Chinner wrote:
> > On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> > >
> > > The call trace is like this:
> > > memory_fail
On Wed, Dec 02, 2020 at 10:04:17PM +0100, Greg Kroah-Hartman wrote:
> On Thu, Dec 03, 2020 at 07:40:45AM +1100, Dave Chinner wrote:
> > On Wed, Dec 02, 2020 at 08:06:01PM +0100, Greg Kroah-Hartman wrote:
> > > On Wed, Dec 02, 2020 at 06:41:43PM +0100, Miklos Szeredi wrote:
>
correct regressions in fixes before they get propagated to users.
It also creates a clear demarcation between fixes and cc: stable for
maintainers and developers: only patches with a cc: stable will be
backported immediately to stable. Developers know what patches need
urgent backports and, unlike developers, the automated fixes scan
does not have the subject matter expertise or background to make
that judgement
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
r that filesystem instance then,
by definition, it does not support DAX and the bit should never be
set.
e.g. We don't talk about kernels that support reflink - what matters
to userspace is whether the filesystem instance supports reflink.
Think of the useless mess that xfs_info would be if it reported
kernel capabilities instead of filesystem instance capabilities.
i.e. we don't report that a filesystem supports reflink just because
the kernel supports it - it reports whether the filesystem instance
being queried supports reflink. And that also implies the kernel
supports it, because the kernel has to support it to mount the
filesystem...
So, yeah, I think it really does need to be conditional on the
filesystem instance being queried to be actually useful to users
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
is cached then we can
try to re-write it to disk to fix the bad data, otherwise we treat
it like a writeback error and report it on the next
write/fsync/close operation done on that file.
This gets rid of the mf_recover_controller altogether and allows
the interface to be used by any sort of block device for any sort
of bottom-up reporting of media/device failures.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Wed, Nov 25, 2020 at 06:46:54PM -0500, Sasha Levin wrote:
> On Thu, Nov 26, 2020 at 08:52:47AM +1100, Dave Chinner wrote:
> > We've already had one XFS upstream kernel regression in this -rc
> > cycle propagated to the stable kernels in 5.9.9 because the stable
> > pr
On Wed, Nov 25, 2020 at 10:35:50AM -0500, Sasha Levin wrote:
> From: Dave Chinner
>
> [ Upstream commit 883a790a84401f6f55992887fd7263d808d4d05d ]
>
> Jens has reported a situation where partial direct IOs can be issued
> and completed yet still return -EAGAIN. We don't
STATX_ATTR_DAX in statx for either the
attributes or attributes_mask field because the filesystem is not
DAX capable. And given that we have filesystems with multiple block
devices that can have different DAX capabilities, I think this
statx() attr state (and mask) really has to come from the
filesystem, not VFS...
> Extra question: should we only set this in the attributes mask if
> CONFIG_FS_DAX=y ?
IMO, yes, because it will always be false on CONFIG_FS_DAX=n and so
it may as well not be emitted as a supported bit in the mask.
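A minimal userspace sketch of checking it (illustrative only; the
fallback define assumes the current uapi value of STATX_ATTR_DAX):

	#define _GNU_SOURCE
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <stdio.h>

	#ifndef STATX_ATTR_DAX
	#define STATX_ATTR_DAX	0x00200000	/* from linux/stat.h */
	#endif

	/* Distinguish "not using DAX" from "DAX state not reported at all". */
	static void report_dax(const char *path)
	{
		struct statx stx;

		if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) < 0)
			return;
		if (!(stx.stx_attributes_mask & STATX_ATTR_DAX))
			printf("%s: DAX state not reported\n", path);
		else if (stx.stx_attributes & STATX_ATTR_DAX)
			printf("%s: using DAX\n", path);
		else
			printf("%s: not using DAX\n", path);
	}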
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
On Wed, Nov 11, 2020 at 11:28:48AM +0100, Michal Suchánek wrote:
> On Tue, Nov 10, 2020 at 08:08:23AM +1100, Dave Chinner wrote:
> > On Mon, Nov 09, 2020 at 09:27:05PM +0100, Michal Suchánek wrote:
> > > On Mon, Nov 09, 2020 at 11:24:19AM -0800, Darrick J. Wong wrote:
> >
storing its data on a different
filesystem that isn't mounted at install time, so the installer
has no chance of detecting that the application is going to use
DAX enabled storage.
IOWs, the installer cannot make decisions based on DAX state on
behalf of applications because it does not know what environment the
application is going to be configured to run in. DAX can only be
detected reliably by the application at runtime inside its
production execution environment.
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com