Re: [PATCH] kernel: fs: drop_caches: add dds drop_caches_count

2016-02-14 Thread Dave Chinner
On Fri, Feb 12, 2016 at 12:14:39PM -0800, Daniel Walker wrote:
> From: Khalid Mughal 
> 
> Currently there is no way to figure out the droppable pagecache size
> from the meminfo output. The MemFree size can shrink during normal
> system operation, when some of the memory pages get cached and is
> reflected in "Cached" field. Similarly for file operations some of
> the buffer memory gets cached and it is reflected in "Buffers" field.
> The kernel automatically reclaims all this cached & buffered memory,
> when it is needed elsewhere on the system. The only way to manually
> reclaim this memory is by writing 1 to /proc/sys/vm/drop_caches. But
> this can have performance impact. Since it discards cached objects,
> it may cause high CPU & I/O utilization to recreate the dropped
> objects during heavy system load.
> This patch computes the droppable pagecache count, using same
> algorithm as "vm/drop_caches". It is non-destructive and does not
> drop any pages. Therefore it does not have any impact on system
> performance. The computation does not include the size of
> reclaimable slab.

Why, exactly, do you need this? You've described what the patch
does (i.e. redundant, because we can read the code), and described
that the kernel already accounts this reclaimable memory elsewhere
and you can already read that and infer the amount of reclaimable
memory from it. So why isn't that accounting sufficient?

As to the code, I think it is a horrible hack - the calculation
does not come for free. Forcing iteration over all the inodes in the
inode cache is not something we should allow users to do - what's to
stop someone just doing this 100 times in parallel and DOSing the
machine?

Or what happens when someone does 'grep "" /proc/sys/vm/*' to see
what all the VM settings are on a machine with a couple of TB of
page cache spread across a couple of hundred million cached inodes?
It a) takes a long time, b) adds sustained load to an already
contended lock (sb->s_inode_list_lock), and c) isn't configuration
information.
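
Roughly, any such count has to do a walk like this per superblock - a
sketch only, modelled on the existing drop_pagecache_sb() in
fs/drop_caches.c, with count_droppable_pages() as a made-up stand-in
for the per-mapping accounting:

#include <linux/fs.h>

static void count_pagecache_sb(struct super_block *sb, void *arg)
{
	unsigned long *count = arg;
	struct inode *inode;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		/* count_droppable_pages() is a made-up helper for this sketch */
		if (!(inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)))
			*count += count_droppable_pages(inode->i_mapping);
		spin_unlock(&inode->i_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
}

/* driven by iterate_supers(count_pagecache_sb, &count) for every sb */

That's sb->s_inode_list_lock being held and hammered for the entire
walk, on every single read of the proc file.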

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] kernel: fs: drop_caches: add dds drop_caches_count

2016-02-15 Thread Dave Chinner
On Mon, Feb 15, 2016 at 10:19:54AM -0800, Daniel Walker wrote:
> On 02/14/2016 01:18 PM, Dave Chinner wrote:
> >On Fri, Feb 12, 2016 at 12:14:39PM -0800, Daniel Walker wrote:
> >>From: Khalid Mughal 
> >>
> >>Currently there is no way to figure out the droppable pagecache size
> >>from the meminfo output. The MemFree size can shrink during normal
> >>system operation, when some of the memory pages get cached and is
> >>reflected in "Cached" field. Similarly for file operations some of
> >>the buffer memory gets cached and it is reflected in "Buffers" field.
> >>The kernel automatically reclaims all this cached & buffered memory,
> >>when it is needed elsewhere on the system. The only way to manually
> >>reclaim this memory is by writing 1 to /proc/sys/vm/drop_caches. But
> >>this can have performance impact. Since it discards cached objects,
> >>it may cause high CPU & I/O utilization to recreate the dropped
> >>objects during heavy system load.
> >>This patch computes the droppable pagecache count, using same
> >>algorithm as "vm/drop_caches". It is non-destructive and does not
> >>drop any pages. Therefore it does not have any impact on system
> >>performance. The computation does not include the size of
> >>reclaimable slab.
> >Why, exactly, do you need this? You've described what the patch
> >does (i.e. redundant, because we can read the code), and described
> >that the kernel already accounts this reclaimable memory elsewhere
> >and you can already read that and infer the amount of reclaimable
> >memory from it. So why isn't that accounting sufficient?
> 
> We need it to determine accurately what the free memory in the
> system is. If you know where we can get this information already
> please tell, we aren't aware of it. For instance /proc/meminfo isn't
> accurate enough.

What you are proposing isn't accurate, either, because it will be
stale by the time the inode cache traversal is completed and the
count returned to userspace. e.g. pages that have already been
accounted as droppable can be reclaimed or marked dirty and hence
"unreclaimable".

IOWs, the best you are going to get is an approximate point-in-time
indication of how much memory is available for immediate reclaim.
We're never going to get an accurate measure in userspace unless we
accurately account for it in the kernel itself. Which, I think it
has already been pointed out, is prohibitively expensive so isn't
done.

As for a replacement, looking at what pages you consider "droppable"
is really only file pages that are not under dirty or under
writeback. i.e. from /proc/meminfo:

Active(file):     220128 kB
Inactive(file):    60232 kB
Dirty:                 0 kB
Writeback:             0 kB

i.e. reclaimable file cache = Active + inactive - dirty - writeback.

And while you are there, when you drop slab caches:

SReclaimable:  66632 kB

some amount of that may be freed. No guarantees can be made about
the amount, though.
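
If you want that number in userspace, it's a trivial calculation - an
untested sketch, plain userspace C, field names straight out of
/proc/meminfo:

#include <stdio.h>
#include <string.h>

/* return the value of a /proc/meminfo field in kB, 0 if not found */
static long meminfo_kb(const char *field)
{
	char line[128], name[64];
	long val = 0;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		long v;

		if (sscanf(line, "%63[^:]: %ld", name, &v) == 2 &&
		    !strcmp(name, field)) {
			val = v;
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long est = meminfo_kb("Active(file)") + meminfo_kb("Inactive(file)")
		   - meminfo_kb("Dirty") - meminfo_kb("Writeback");

	printf("droppable file cache estimate: %ld kB\n", est);
	printf("slab that may also be freed:   %ld kB\n",
	       meminfo_kb("SReclaimable"));
	return 0;
}

No kernel-side inode cache walk required, and it's as accurate as any
point-in-time snapshot is ever going to be.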

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] kernel: fs: drop_caches: add dds drop_caches_count

2016-02-15 Thread Dave Chinner
On Mon, Feb 15, 2016 at 03:52:31PM -0800, Daniel Walker wrote:
> On 02/15/2016 03:05 PM, Dave Chinner wrote:
> >What you are proposing isn't accurate, either, because it will be
> >stale by the time the inode cache traversal is completed and the
> >count returned to userspace. e.g. pages that have already been
> >accounted as droppable can be reclaimed or marked dirty and hence
> >"unreclaimable".
> >
> >IOWs, the best you are going to get is an approximate point-in-time
> >indication of how much memory is available for immediate reclaim.
> >We're never going to get an accurate measure in userspace unless we
> >accurately account for it in the kernel itself. Which, I think it
> >has already been pointed out, is prohibitively expensive so isn't
> >done.
> >
> >As for a replacement, looking at what pages you consider "droppable"
> >is really only file pages that are not under dirty or under
> >writeback. i.e. from /proc/meminfo:
> >
> >Active(file):     220128 kB
> >Inactive(file):    60232 kB
> >Dirty:                 0 kB
> >Writeback:             0 kB
> >
> >i.e. reclaimable file cache = Active + inactive - dirty - writeback.

> Approximate point-in-time indication is an accurate
> characterization of what we are doing. This is good enough for us.
> No matter what we do, we are never going to be able to address the
> "time of check to time of use" window.  But, this
> approximation works reasonably well for our use case.
> 
> As to his other suggestion of estimating the droppable cache, I
> have considered it but found it unusable. The problem is the
> inactive file pages count a whole lot pages more than the
> droppable pages.

Inactive file pages are supposed to be exactly that - inactive. i.e.
they have not been referenced recently, and are unlikely to be dirty.
They should be immediately reclaimable.

> See the value of these, before and [after] dropping reclaimable
> pages.
> 
> Before:
> 
> Active(file):     183488 kB
> Inactive(file):   180504 kB
> 
> After (the drop caches):
>
> Active(file):      89468 kB
> Inactive(file):    32016 kB
>
> The dirty and the write back are mostly 0KB under our workload as
> we are mostly dealing with the readonly file pages of binaries
> (programs/libraries).

if the pages are read-only, then they are clean. If they are on the
LRUs, then they should be immediately reclaimable.

Let's go back to your counting criteria of all those file pages:

+static int is_page_droppable(struct page *page)
+{
+   struct address_space *mapping = page_mapping(page);
+
+   if (!mapping)
+   return 0;

invalidated page, should be none.

+   if (PageDirty(page))
+   return 0;

Dirty pages get ignored - they are accounted as "Dirty" in /proc/meminfo.

+   if (PageWriteback(page))
+   return 0;

Writeback pages get ignored - they are accounted as "Writeback" in /proc/meminfo.

+   if (page_mapped(page))
+   return 0;

Clean pages mapped into userspace get ignored - they are accounted as "Mapped" in /proc/meminfo.

+   if (page->mapping != mapping)
+   return 0;

Invalidation race, should be none.

+   if (page_has_private(page))
+   return 0;

That's simply wrong. For XFS inodes, that will skip *every page on
every inode* because it attaches bufferheads to every page, even on
read. ext4 behaviour will depend on mount options and whether the
page has been dirtied or not. IOWs, this turns the number of
reclaimable pages in the inode cache into garbage because it counts
clean, reclaimable pages with attached bufferheads as non-reclaimable.

But let's ignore that by assuming you have read-only pages without
bufferheads (e.g. ext4, blocksize = pagesize, nobh mode on read-only
pages). That means the only thing that makes a difference to the
count returned is mapped pages, a count of which is also in
/proc/meminfo.

So, to pick a random active server here:

                     before        after
Active(file):   12103200 kB     24060 kB
Inactive(file):  5976676 kB      1380 kB
Mapped:            31308 kB     31308 kB

How much was not reclaimed? Roughly the same number of pages as the
Mapped count, and that's exactly what we'd expect to see from the
above page walk counting code. Hence a slightly better approximation
of the pages that dropping caches will reclaim is:

reclaimable pages = active + inactive - dirty - writeback - mapped
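
In terms of the untested meminfo_kb() sketch from earlier in the
thread, that's just one more term:

	long est = meminfo_kb("Active(file)") + meminfo_kb("Inactive(file)")
		   - meminfo_kb("Dirty") - meminfo_kb("Writeback")
		   - meminfo_kb("Mapped");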

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] kernel: fs: drop_caches: add dds drop_caches_count

2016-02-15 Thread Dave Chinner
On Tue, Feb 16, 2016 at 02:58:04AM +, Nag Avadhanam (nag) wrote:
> It's the calculation of the # of bytes of non-reclaimable file system cache
> pages that has been troubling us. We do not want to count inactive file
> pages (of programs/binaries) that were once mapped by any process in the
> system as reclaimable because that might lead to thrashing under memory
> pressure (we want to alert admins before system starts dropping text
> pages).

The code presented does not match your requirements. It only treats
pages that are currently mapped into ptes as unreclaimable. Hence it
will tell you that once-used and now unmapped binary pages are
reclaimable, and drop caches will reclaim them. Hence they'll need to
be fetched from disk again if they are faulted in again after a
drop_caches run.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] kernel: fs: drop_caches: add dds drop_caches_count

2016-02-16 Thread Dave Chinner
On Mon, Feb 15, 2016 at 09:57:42PM -0800, Nag Avadhanam wrote:
> On Mon, 15 Feb 2016, Dave Chinner wrote:
> >So, to pick a random active server here:
> >
> >                  before        after
> >Active(file):   12103200 kB     24060 kB
> >Inactive(file):  5976676 kB      1380 kB
> >Mapped:            31308 kB     31308 kB
> >
> >How much was not reclaimed? Roughly the same number of pages as the
> >Mapped count, and that's exactly what we'd expect to see from the
> >above page walk counting code. Hence a slightly better approximation
> >of the pages that dropping caches will reclaim is:
> >
> >reclaimable pages = active + inactive - dirty - writeback - mapped
> 
> Thanks Dave. I considered that, but see this.
> 
> Mapped page count below is much higher than the (active(file) +
> inactive (file)).

Yes, it's all unreclaimable from drop caches, though.

> Mapped seems to include all page cache pages mapped into the process
> memory, including the shared memory pages, file pages and few other
> type
> mappings.
> 
> I suppose the above can be rewritten as (mapped is still high):
> 
> reclaimable pages = active + inactive + shmem - dirty - writeback - mapped
> 
> What about kernel pages mapped into user address space? Does "Mapped"
> include those pages as well? How do we exclude them? What about
> device mappings? Are these excluded in the "Mapped" pages
> calculation?

/me shrugs.

I have no idea - I really don't care about what pages are accounted
as mapped. I assumed that the patch proposed addressed your
requirements and so I suggested an alternative that provided almost
exactly the same information but erred on the side of
underestimation and hence solves your problem of drop_caches not
freeing as much memory as you expected.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] kernel: fs: drop_caches: add dds drop_caches_count

2016-02-16 Thread Dave Chinner
On Mon, Feb 15, 2016 at 11:14:13PM -0800, Nag Avadhanam wrote:
> On Mon, 15 Feb 2016, Dave Chinner wrote:
> 
> >On Tue, Feb 16, 2016 at 02:58:04AM +, Nag Avadhanam (nag) wrote:
> >>It's the calculation of the # of bytes of non-reclaimable file system cache
> >>pages that has been troubling us. We do not want to count inactive file
> >>pages (of programs/binaries) that were once mapped by any process in the
> >>system as reclaimable because that might lead to thrashing under memory
> >>pressure (we want to alert admins before system starts dropping text
> >>pages).
> >
> >The code presented does not match your requirements. It only treats
> >pages that are currently mapped into ptes as unreclaimable. Hence it
> >will tell you that once-used and now unmapped binary pages are
> >reclaimable, and drop caches will reclaim them. Hence they'll need to
> >be fetched from disk again if they are faulted in again after a
> >drop_caches run.
> 
> Will the inactive binary pages be automatically unmapped even if the process
> into whose address space they are mapped is still around? I thought they
> are left mapped until such time there is memory pressure.

Right, page reclaim via memory pressure can unmap mapped pages in
order to reclaim them. Drop caches will skip them.

> We only care for binary pages (active and inactive) mapped into the
> address spaces of live processes. Its okay to aggressively reclaim
> inactive
> pages once mapped into processes that are no longer around.

Ok, if you're only concerned about live processes then drop caches
should behave as you want.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Change URL for the project in xfs.txt

2018-03-02 Thread Dave Chinner
On Fri, Mar 02, 2018 at 09:24:01AM -0800, Darrick J. Wong wrote:
> On Fri, Mar 02, 2018 at 10:30:13PM +0900, Masanari Iida wrote:
> > The oss.sgi.com doesn't exist any more.
> > Change it to current project URL, https://xfs.wiki.kernel.org/
> > 
> > Signed-off-by: Masanari Iida 
> > ---
> >  Documentation/filesystems/xfs.txt | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
> > index 3b9b5c149f32..4d9ff0a7f8e1 100644
> > --- a/Documentation/filesystems/xfs.txt
> > +++ b/Documentation/filesystems/xfs.txt
> > @@ -9,7 +9,7 @@ variable block sizes, is extent based, and makes extensive use of
> >  Btrees (directories, extents, free space) to aid both performance
> >  and scalability.
> >  
> > -Refer to the documentation at http://oss.sgi.com/projects/xfs/
> > +Refer to the documentation at https://xfs.wiki.kernel.org/

Did I miss a memo?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Change URL for the project in xfs.txt

2018-03-02 Thread Dave Chinner
On Fri, Mar 02, 2018 at 04:08:24PM -0600, Eric Sandeen wrote:
> 
> 
> On 3/2/18 3:57 PM, Dave Chinner wrote:
> > On Fri, Mar 02, 2018 at 09:24:01AM -0800, Darrick J. Wong wrote:
> >> On Fri, Mar 02, 2018 at 10:30:13PM +0900, Masanari Iida wrote:
> >>> The oss.sgi.com doesn't exist any more.
> >>> Change it to current project URL, https://xfs.wiki.kernel.org/
> >>>
> >>> Signed-off-by: Masanari Iida 
> >>> ---
> >>>  Documentation/filesystems/xfs.txt | 2 +-
> >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt
> >>> index 3b9b5c149f32..4d9ff0a7f8e1 100644
> >>> --- a/Documentation/filesystems/xfs.txt
> >>> +++ b/Documentation/filesystems/xfs.txt
> >>> @@ -9,7 +9,7 @@ variable block sizes, is extent based, and makes extensive use of
> >>>  Btrees (directories, extents, free space) to aid both performance
> >>>  and scalability.
> >>>  
> >>> -Refer to the documentation at http://oss.sgi.com/projects/xfs/
> >>> +Refer to the documentation at https://xfs.wiki.kernel.org/
> > 
> > Did I miss a memo?
> 
> About which part, the loss of oss.sgi or the addition of the kernel.org wiki?
> 
> The kernel.org wiki is pretty bare though.  OTOH xfs.org is a bit less
> official.  We really need to resolve this issue.

Moving everything to the kernel.org wiki. As I mentioned on IRC, I'd
much prefer we move away from wikis to something we can edit
locally and review via email, that has proper revision control and a
"publish" mechanism that pushes built documentation out to the
public website.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP

2016-10-06 Thread Dave Chinner
On Thu, Oct 06, 2016 at 11:17:18AM -0700, Davidlohr Bueso wrote:
> On Thu, 18 Aug 2016, Waiman Long wrote:
> 
> >Currently, when down_read() fails, the active read locking isn't undone
> >until the rwsem_down_read_failed() function grabs the wait_lock. If the
> >wait_lock is contended, it may take a while to get the lock. During
> >that period, writer lock stealing will be disabled because of the
> >active read lock.
> >
> >This patch will release the active read lock ASAP so that writer lock
> >stealing can happen sooner. The only downside is when the reader is
> >the first one in the wait queue as it has to issue another atomic
> >operation to update the count.
> >
> >On a 4-socket Haswell machine running on a 4.7-rc1 tip-based kernel,
> >the fio test with multithreaded randrw and randwrite tests on the
> >same file on a XFS partition on top of a NVDIMM with DAX were run,
> >the aggregated bandwidths before and after the patch were as follows:
> >
> >  Test        BW before patch   BW after patch   % change
> >  ----        ---------------   --------------   --------
> >  randrw      1210 MB/s         1352 MB/s        +12%
> >  randwrite   1622 MB/s         1710 MB/s        +5.4%
> 
> Yeah, this is really a bad workload to make decisions on locking
> heuristics imo - if I'm thinking of the same workload. Mainly because
> concurrent buffered io to the same file isn't very realistic and you
> end up pathologically pounding on i_rwsem (which used to be until
> recently i_mutex until Al's parallel lookup/readdir). Obviously write
> lock stealing wins in this case.

Except that it's DAX, and in 4.7-rc1 that used shared locking at the
XFS level and never took exclusive locks.

*However*, the DAX IO path locking in XFS  has changed in 4.9-rc1 to
match the buffered IO single writer POSIX semantics - the test is a
bad test based on the fact it exercised a path that is under heavy
development and so can't be used as a regression test across
multiple kernels.

If you want to stress concurrent access to a single file, please
use direct IO, not DAX or buffered IO.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP

2016-10-09 Thread Dave Chinner
On Sun, Oct 09, 2016 at 08:17:48AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 08:47:51AM +1100, Dave Chinner wrote:
> > Except that it's DAX, and in 4.7-rc1 that used shared locking at the
> > XFS level and never took exclusive locks.
> > 
> > *However*, the DAX IO path locking in XFS  has changed in 4.9-rc1 to
> > match the buffered IO single writer POSIX semantics - the test is a
> > bad test based on the fact it exercised a path that is under heavy
> > development and so can't be used as a regression test across
> > multiple kernels.
> 
> That being said - I wonder if we should allow the shared lock on DAX
> files IFF the user is specifying O_DIRECT in the open mode..

It should do - if it doesn't then we screwed up the IO path
selection logic in XFS and we'll need to fix it.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP

2016-10-11 Thread Dave Chinner
On Mon, Oct 10, 2016 at 02:34:34AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 10, 2016 at 05:07:45PM +1100, Dave Chinner wrote:
> > > > *However*, the DAX IO path locking in XFS  has changed in 4.9-rc1 to
> > > > match the buffered IO single writer POSIX semantics - the test is a
> > > > bad test based on the fact it exercised a path that is under heavy
> > > > development and so can't be used as a regression test across
> > > > multiple kernels.
> > > 
> > > That being said - I wonder if we should allow the shared lock on DAX
> > > files IFF the user is specifying O_DIRECT in the open mode..
> > 
> > It should do - if it doesn't then we screwed up the IO path
> > selection logic in XFS and we'll need to fix it.
> 
> Depends on your definition of "we".  The DAX code has always abused the
> direct I/O path, and that abuse is ingrained in the VFS path in a way that
> we can't easily undo it in XFS, e.g. take a look at io_is_direct and
> iocb_flags in include/linux/fs.h, which ensure that DAX I/O will always
> appear as IOCB_DIRECT to the fs. 

Um, I seem to have completely missed that change - when did that
happen and why?

Oh, it was part of the misguided "enable DAX on block devices"
changes - it was supposed to fix a problem with block device access
using buffered IO instead of DAX (commit 65f87ee71852 "fs, block:
force direct-I/O for dax-enabled block devices").  The original
block device DAX code was reverted soon after this because it was
badly flawed, but this change was not removed at the same time
(probably because it was forgotten about).

Hence I'd suggest that DAX check in io_is_direct() should be removed
ASAP; the filesystems don't need it as they check the inode DAX
state directly, and the code it "fixed" is no longer in the tree.
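
For reference, the check in question is the IS_DAX() term in
io_is_direct() - paraphrasing from memory rather than quoting the
tree, it's along the lines of:

/* paraphrased sketch, not the exact tree code */
static inline bool io_is_direct(struct file *filp)
{
	return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
}

Drop the IS_DAX() term and iocb_flags() only sets IOCB_DIRECT for
real O_DIRECT opens, which is all the filesystems expect now that
they check the inode DAX state themselves.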

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] docs: improve readability for people with poorer eyesight

2018-10-05 Thread Dave Chinner
On Thu, Oct 04, 2018 at 06:06:03PM -0700, Darrick J. Wong wrote:
> Hi,
> 
> So my eyesight still hasn't fully recovered, so in the meantime it's
> been difficult to read the online documentation.  Here's some stylesheet
> overrides I've been using to make it easier for me to read them:
> https://djwong.org/docs/kdoc/index.html
> 
> ---
> From: Darrick J. Wong 
> 
> My eyesight is not in good shape, which means that I have difficulty
> reading the online Linux documentation.  Specifically, body text is
> oddly small compared to list items and the contrast of various text
> elements is too low for me to be able to see easily.
> 
> Therefore, alter the HTML theme overrides to make the text larger and
> increase the contrast for better visibility, and trust the typeface
> choices of the reader's browser.
> 
> For the PDF output, increase the text size, use a sans-serif typeface
> for sans-serif text, and use a serif typeface for "roman" serif text.
> 
> Signed-off-by: Darrick J. Wong 

This fixes problems I noticed when trying to review the built html
documentation - the inconsistent font sizes on my high-dpi monitor
made it almost impossible to read even though I have no eyesight
problems.

Acked-by: Dave Chinner 

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 00/22] xfs-4.20: major documentation surgery

2018-10-05 Thread Dave Chinner
On Wed, Oct 03, 2018 at 09:18:11PM -0700, Darrick J. Wong wrote:
> Hi all,
> 
> This series converts the existing in-kernel xfs documentation to rst
> format, links it in with the rest of the kernel's rst documetation, and
> then begins pulling in the contents of the Data Structures & Algorithms
> book from the xfs-documentation git tree.  No changes are made to the
> text during the import process except to fix things that the conversion
> process (asciidoctor + pandoc) didn't do correctly.  The goal of this
> series is to tie together the XFS code with the on-disk format
> documentation for the features supported by the code.
> 
> I've built the docs and put them here, in case you hate reading rst:
> https://djwong.org/docs/kdoc/admin-guide/xfs.html
> https://djwong.org/docs/kdoc/filesystems/xfs-data-structures/index.html
> 
> I've posted a branch here because the png import patch is huge:
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=docs-4.20-merge
> 
> The patchset should apply cleanly against 4.19-rc6.  Comments and
> questions are, as always, welcome.

Jon,

Can you let us know whether the CC-by-SA 4.0 license is acceptable
or not? That's really the only thing that we need clarified at this
point - if it's OK I'll pull this into the XFS tree for the 4.20
merge window. If not, we'll go back to the drawing board.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 00/22] xfs-4.20: major documentation surgery

2018-10-05 Thread Dave Chinner
On Fri, Oct 05, 2018 at 07:01:20PM -0600, Jonathan Corbet wrote:
> On Sat, 6 Oct 2018 10:51:54 +1000
> Dave Chinner  wrote:
> 
> > Can you let us know whether the CC-by-SA 4.0 license is acceptable
> > or not? That's really the only thing that we need clarified at this
> > point - if it's OK I'll pull this into the XFS tree for the 4.20
> > merge window. If not, we'll go back to the drawing board.
> 
> I remain pretty concerned about it, to tell the truth.  Rather than
> continue to guess, though, I've called for help, and will be talking with
> the LF lawyer about this next Thursday.  Before then, I can't say anything
> except "I don't think this works..."
> 
> Will let you know what I hear.

Thanks for the update, Jon. I'll put this on the backburner until I
hear back from you.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 00/22] xfs-4.20: major documentation surgery

2018-10-11 Thread Dave Chinner
On Thu, Oct 11, 2018 at 11:27:35AM -0600, Jonathan Corbet wrote:
> On Sat, 6 Oct 2018 10:51:54 +1000 Dave Chinner
>  wrote:
> 
> > Can you let us know whether the CC-by-SA 4.0 license is
> > acceptable or not? That's really the only thing that we need
> > clarified at this point - if it's OK I'll pull this into the
> > XFS tree for the 4.20 merge window. If not, we'll go back to the
> > drawing board.
> 
> OK, I've had a long conversation with the LF lawyer, and she said
> clearly that we really should not be introducing CC-SA material
> into the kernel source tree.  It's a pain; I really do like CC-SA
> better for documentation, but it's not GPL-compatible, and that
> creates a problem for the processed docs.

Ok, thanks for following up on this, Jon. We'll just keep it all in
the existing external repo and work out what to do from there.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] Documentation: filesystem: fix "Removed Sysctls" table

2019-07-23 Thread Dave Chinner
On Tue, Jul 23, 2019 at 03:52:01PM +0100, Sheriff Esseson wrote:
> On Tue, Jul 23, 2019 at 07:42:18AM -0600, Jonathan Corbet wrote:
> > On Tue, 23 Jul 2019 12:48:13 +0100
> > Sheriff Esseson  wrote:
> > 
> > > the "Removed Sysctls" section is a table - bring it alive with ReST.
> > > 
> > > Signed-off-by: Sheriff Esseson 
> > 
> > So this appears to be identical to the patch you sent three days ago; is
> > there a reason why you are sending it again now?
> > 
> > Thanks,
> > 
> > jon
> 
> Sorry, I was thinking the patch went unnoticed during the merge window - I could
> not find a response.

The correct thing to do in that case is to reply to the original
patch and ask if it has been looked at. The usual way of doing this
is quoting the commit message and replying with a "Ping?" comment
to bump it back to the top of everyone's mail stacks.

But, again, 3 days is not a long time, people tend to be extremely
busy and might take a few days to get to reviewing non-critical
changes, and people may not even review patches during the merge
window. I'd suggest waiting a week before pinging a patch you've
sent if there's been no response.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead

2019-01-07 Thread Dave Chinner
On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> As newer systems have more and more IRQs and CPUs available in their
> system, the performance of reading /proc/stat frequently is getting
> worse and worse.

Because the "roll-your-own" per-cpu counter implementation has been
optimised for the lowest possible addition overhead on the premise that
summing the counters is rare and isn't a performance issue. This
patchset is a direct indication that this "summing is rare and can
be slow" premise is now invalid.

We have percpu counter infrastructure that trades off a small amount
of addition overhead for zero-cost reading of the counter value.
i.e. why not just convert this whole mess to percpu_counters and
then just use percpu_counter_read_positive()? Then we just don't
care how often userspace reads the /proc file because there is no
summing involved at all...
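
i.e. something along these lines - a sketch only, not a patch, with a
single made-up irq_count counter standing in for the existing per-irq
stats just to show the API:

#include <linux/init.h>
#include <linux/percpu_counter.h>

static struct percpu_counter irq_count;	/* made-up name for this sketch */

static int __init irq_count_init(void)
{
	return percpu_counter_init(&irq_count, 0, GFP_KERNEL);
}

/* hot path: cheap per-cpu add, folded into the shared count in batches */
static inline void irq_count_inc(void)
{
	percpu_counter_inc(&irq_count);
}

/* /proc/stat read side: no per-cpu summing, just read the shared total */
static s64 irq_count_read(void)
{
	return percpu_counter_read_positive(&irq_count);
}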

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead

2019-01-07 Thread Dave Chinner
On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
> On 01/07/2019 05:32 PM, Dave Chinner wrote:
> > On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> >> As newer systems have more and more IRQs and CPUs available in their
> >> system, the performance of reading /proc/stat frequently is getting
> >> worse and worse.
> > Because the "roll-your-own" per-cpu counter implementation has been
> > optimised for the lowest possible addition overhead on the premise that
> > summing the counters is rare and isn't a performance issue. This
> > patchset is a direct indication that this "summing is rare and can
> > be slow" premise is now invalid.
> >
> > We have percpu counter infrastructure that trades off a small amount
> > of addition overhead for zero-cost reading of the counter value.
> > i.e. why not just convert this whole mess to percpu_counters and
> > then just use percpu_counter_read_positive()? Then we just don't
> > care how often userspace reads the /proc file because there is no
> > summing involved at all...
> >
> > Cheers,
> >
> > Dave.
> 
> Yes, percpu_counter_read_positive() is cheap. However, you still need to
> pay the price somewhere. In the case of percpu_counter, the update is
> more expensive.

Ummm, that's exactly what I just said. It's a percpu counter that
solves the "sum is expensive and frequent" problem, just like you
are encountering here. I do not need basic scalability algorithms
explained to me.

> I would say the percentage of applications that will hit this problem is
> small. But for them, this problem has some significant performance overhead.

Well, duh!

What I was suggesting is that you change the per-cpu counter
implementation to the /generic infrastructure/ that solves this
problem, and then determine if the extra update overhead is at all
measurable. If you can't measure any difference in update overhead,
then slapping complexity on the existing counter to attempt to
mitigate the summing overhead is the wrong solution.

Indeed, it may be that you need to use a custom batch scaling curve
for the generic per-cpu counter infrastructure to mitigate the
update overhead, but the fact is we already have generic
infrastructure that solves your problem and so the solution should
be "use the generic infrastructure" until it can be proven not to
work.

i.e. prove the generic infrastructure is not fit for purpose and
cannot be improved sufficiently to work for this use case before
implementing a complex, one-off snowflake counter implementation...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead

2019-01-08 Thread Dave Chinner
On Tue, Jan 08, 2019 at 11:58:26AM -0500, Waiman Long wrote:
> On 01/07/2019 09:04 PM, Dave Chinner wrote:
> > On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
> >> On 01/07/2019 05:32 PM, Dave Chinner wrote:
> >>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> > What I was suggesting is that you change the per-cpu counter
> > implementation to the /generic infrastructure/ that solves this
> > problem, and then determine if the extra update overhead is at all
> > measurable. If you can't measure any difference in update overhead,
> > then slapping complexity on the existing counter to attempt to
> > mitigate the summing overhead is the wrong solution.
> >
> > Indeed, it may be that you need to use a custom batch scaling curve
> > for the generic per-cpu counter infrastructure to mitigate the
> > update overhead, but the fact is we already have generic
> > infrastructure that solves your problem and so the solution should
> > be "use the generic infrastructure" until it can be proven not to
> > work.
> >
> > i.e. prove the generic infrastructure is not fit for purpose and
> > cannot be improved sufficiently to work for this use case before
> > implementing a complex, one-off snowflake counter implementation...
> 
> I see your point. I like the deferred summation approach that I am
> currently using. If I have to modify the current per-cpu counter
> implementation to support that

No! Stop that already. The "deferred counter summation" is exactly
the problem the current algorithm has and exactly the problem the
generic counters /don't have/. Changing the generic percpu counter
algorithm to match this specific hand rolled implementation is not
desirable as it will break implementations that rely on the
bound maximum summation deviation of the existing algorithm (e.g.
the ext4 and XFS ENOSPC accounting algorithms).

> and I probably need to add counter
> grouping support to amortize the overhead, that can be a major

The per-cpu counters already have configurable update batching
to amortise the summation update cost across multiple individual
per-cpu updates. You don't need to change the implementation at all,
just tweak the amortisation curve appropriately for the desired
requirements for update scalability (high) vs read accuracy (low).

Let's face it, on large systems where counters are frequently
updated, the resultant sum can be highly inaccurate by the time a
thousand CPU counters have been summed. The generic counters have a
predictable and bound "fast sum" maximum deviation (batch size *
nr_cpus), so over large machines are likely to be much more accurate
than "unlocked summing on demand" algorithms.
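
If the default batch turns out to be too hot for the update side, the
batch is tunable per call site - continuing the made-up irq_count
sketch from my earlier mail, with an arbitrary example value:

#define IRQ_COUNT_BATCH	256	/* example value only - tune to taste */

static inline void irq_count_add(long n)
{
	percpu_counter_add_batch(&irq_count, n, IRQ_COUNT_BATCH);
}

With a batch of 256 on a 1024 CPU machine the fast read is off by at
most 256 * 1024 counts, and that bound trades off directly against
update cost as you move the batch up or down.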

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Documenting the crash consistency guarantees of file systems

2019-02-13 Thread Dave Chinner
On Wed, Feb 13, 2019 at 12:35:16PM -0600, Vijay Chidambaram wrote:
> On Wed, Feb 13, 2019 at 12:22 PM Amir Goldstein  wrote:
> > On Wed, Feb 13, 2019 at 7:06 PM Jayashree Mohan  wrote:
> AFAIK, any file system which persists things out of order to increase
> performance does not provide strictly ordered metadata semantics.

Define "things", please.

And while you are there, define "persist", please.

XFS can "persist" "things" out of order because it has a multi-phase
checkpointing subsystem and we can't control IO ordering during
concurrent checkpoint writeout. The out of order checkpoints don't
get put back in order until recovery is run - it reorders everything
that is in the journal into correct sequence order before it starts
recovery.

IOWs, the assertion that we must "persist" things in strict order to
maintain ordered metadata semantics is incorrect. /Replay/ of the
changes being recovered must be done in order, but that does not
require them to be persisted to stable storage in strict order.

> These semantics seem to indicate a total ordering among all
> operations, and an fsync should persist all previous operations (as
> ext3 used to do).

No, absolutely not.

You're talking about /globally ordered metadata/. This was the
Achilles' heel of ext3, resulting in fsync being indistinguishable
from sync and hence being horrifically slow. This caused a
generation of linux application developers to avoid using fsync and
causing users (and fs developers who got blamed for losing data)
endless amounts of pain when files went missing after crashes.
Hindsight teaches us that ext3's behaviour was a horrible mistake
and not one we want to repeat.

Strictly ordered metadata only requires persisting all the previous
dependent modifications to the objects we need to persist.
i.e. if you fsync an inode we just allocated, then we also have to
persist the changes in the same transaction and then all the
previous changes that are dependent on that set of objects, and so
one all the way back to objects being clean on disk. If we crash,
we can then rebuild all of the information the user persisted
correctly.

There is no requirement for any other newly created inode elsewhere
in the filesystem to be present after fsync of the first file.
Independent file creation will only be persisted by the fsync if
there is a shared object modification dependency between those two
files elsewhere in metadata. e.g. they are both in the same inode
cluster so share an inode btree block modification that marks them
as used, hence if one is persisted, the btree block is persisted,
and hence the other inode and all its dependencies need to be
persisted as well.

That dependency tree is the "strict ordering" we talk about. At
times it can look like "globally ordered metadata", but for
independent changes it will only result in the small number of
dependencies being persisted and not the entire filesystem (as per
ext3).
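
To make that concrete, an illustrative userspace sketch (paths made
up):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd_a = open("/mnt/dir1/A", O_CREAT | O_WRONLY, 0644);
	int fd_b = open("/mnt/dir2/B", O_CREAT | O_WRONLY, 0644);

	fsync(fd_a);
	/*
	 * Crash here: A is guaranteed to exist after recovery, along with
	 * everything its creation depended on (the dir1 entry and the dir1
	 * inode changes that go with it, back to clean objects). B is only
	 * guaranteed if it happens to share a modified metadata object
	 * with A (e.g. the same inode cluster); otherwise it may or may
	 * not have survived.
	 */
	close(fd_b);
	close(fd_a);
	return 0;
}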

> Note that Jayashree and I aren't arguing file systems should provide
> this semantics, merely that ext4 and btrfs violate it at certain
> points.

As does XFS.

http://xfs.org/index.php/XFS_FAQ#Q:_Why_do_I_see_binary_NULLS_in_some_files_after_recovery_when_I_unplugged_the_power.3F

ext4 inherited all its delalloc vs metadata ordering issues from
XFS, as ext4 really just copied the XFS design without understanding
all the problems it had. Then, when users started reporting the
problems we'd already fixed in XFS, they copied a number of the
mitigations from XFS as well.


-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] Documenting the crash-recovery guarantees of Linux file systems

2019-03-05 Thread Dave Chinner
On Tue, Mar 05, 2019 at 08:59:00PM -0600, Jayashree wrote:
>  In this file, we document the crash-recovery guarantees
>  provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
>  present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
>  (SOMC), which is provided by xfs. It is not clear to us if other file systems
>  provide SOMC; we would be happy to modify the document if file-system
>  developers claim that their system provides (or aims to provide) SOMC.

I haven't had time to read this yet, but I will point out that
fstests assumes that filesystems that run
"_require_metadata_journaling" tests present SOMC semantics to
users. i.e. we are not testing POSIX semantics in fstests, we are
testing for SOMC crash-recovery semantics because POSIX does not
define a useful set of semantics.

That was the whole point of genericising that test infrastructure -
to make sure all the linux journalling filesystems behave the same
way in the same circumstances...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] Documenting the crash-recovery guarantees of Linux file systems

2019-03-13 Thread Dave Chinner
erms actually mean.  IOWs, you
need to define "previous dependent modifications", "modification
dependency", etc before using them. Essentially, you need to
describe the observable behaviour here, not the implementation that
creates the behaviour.

> If any file shares an object
> +modification dependency with the fsync-ed file, then that file's
> +directory entry is also persisted.

Which you need to explain with references to the ext4 hardlink
failure and how XFS will persist all the hard link directory entries
for each hardlink all the way back up to the root. i.e. don't
describe the implementation, describe the observable behaviour.

> +fsync(dir) : All file names within the persisted directory will exist,
> +but does not guarantee file data. As with files, fsync(dir) also persists
> +previous dependent metadata operations.
>
> +btrfs
> +--
> +fsync(file) : Ensures that a newly created file's directory entry
> +is persisted, along with the directory entries of all its hard links.
> +You do not need to explicitly fsync individual hard links to the file.

So how is that different to XFS? Why explicitly state the hard link
behaviour, but then not mention anything about dependencies and
propagation? Especially after doing exactly the opposite when
describing XFS.

> +fsync(dir) : All the file names within the directory will persist. All the
> +rename and unlink operations within the directory are persisted. Due
> +to the design choices made by btrfs, fsync of a directory could lead
> +to an iterative fsync on sub-directories, thereby requiring a full
> +file system commit. So btrfs does not advocate fsync of directories
> +[2].

I don't think this "recommendation" is appropriate for a document
describing behaviour. It's also indicative of btrfs not having SOMC
behaviour.

> +F2FS
> +
> +fsync(file) or fsync(dir) : In the default mode (fsync-mode=posix),
> +F2FS only guarantees POSIX behaviour. However, it provides xfs-like

What does "only guarantees POSIX behaviour" actually mean? Because
it can mean "loses all your data on crash".

> +guarantees if mounted with fsync-mode=strict option.

So, by default, f2fs will lose all your data on crash? And they call
that "POSIX" behaviour, despite glibc telling applications that the
system provides data integrity preserving fsync functionality?

Seems like a very badly named mount option and a terrible default -
basically we have "fast-and-loose" behaviour which has "eats your
data" data integrity semantics and "strict" which should be POSIX
SIO conformant.


> +fsync(symlink)
> +-
> +A symlink inode cannot be directly opened for IO, which means there is
> +no such thing as fsync of a symlink [3]. You could be tricked by the
> +fact that open and fsync of a symlink succeeds without returning a
> +error, but what happens in reality is as follows.
> +
> +Suppose we have a symlink “foo”, which points to the file “A/bar”
> +
> +fd = open(“foo”, O_CREAT | O_RDWR)
> +fsync(fd)
> +
> +Both the above operations succeed, but if you crash after fsync, the
> +symlink could be still missing.
> +
> +When you try to open the symlink “foo”, you are actually trying to
> +open the file that the symlink resolves to, which in this case is
> +“A/bar”. When you fsync the inode returned by the open system call, you
> +are actually persisting the file “A/bar” and not the symlink. Note
> +that if the file “A/bar” does not exist and you try the open the
> +symlink “foo” without the O_CREAT flag, then file open will fail. To
> +obtain the file descriptor associated with the symlink inode, you
> +could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the
> +file descriptor obtained this way can be only used to indicate a
> +location in the file-system tree and to perform operations that act
> +purely at the file descriptor level. Operations like read(), write(),
> +fsync() etc cannot be performed on such file descriptors.
> +
> +Bottomline : You cannot fsync() a symlink.

You can fsync() the parent dir after it is created or removed
to persist that operation.
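
i.e. something like this sketch (paths made up, error handling
omitted):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* create the symlink itself */
	symlink("A/bar", "foo");

	/* persisting the new "foo" entry means syncing its parent directory */
	int dfd = open(".", O_RDONLY | O_DIRECTORY);

	fsync(dfd);
	close(dfd);
	return 0;
}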

> +fsync(special files)
> +
> +Special files in Linux include block and character device files
> +(created using mknod), FIFO (created using mkfifo) etc. Just like the
> +behavior of fsync on symlinks described above, these special files do
> +not have an fsync function defined. Similar to symlinks, you
> +cannot fsync a special file [4].

You can fsync() the parent dir after it is created or removed
to persist that operation.

> +Strictly Ordered Metadata Consistency
> +-
> +With each file system providing varying levels of persistence
> +gu

Re: [PATCH v2] Documenting the crash-recovery guarantees of Linux file systems

2019-03-14 Thread Dave Chinner
On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner  wrote:
> > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > +Strictly Ordered Metadata Consistency
> > > +-
> > > +With each file system providing varying levels of persistence
> > > +guarantees, a consensus in this regard, will benefit application
> > > +developers to work with certain fixed assumptions about file system
> > > +guarantees. Dave Chinner proposed a unified model called the
> > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > +
> > > +Under this scheme, the file system guarantees to persist all previous
> > > +dependent modifications to the object upon fsync().  If you fsync() an
> > > +inode, it will persist all the changes required to reference the inode
> > > +and its data. SOMC can be defined as follows [6]:
> > > +
> > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > +user after recovery without also observing op1.
> > > +
> > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > +share a dependency, which could be file-system specific. It might
> > > +require a developer to understand file-system internals to know if
> > > +SOMC would order one operation before another.
> >
> > That's largely an internal implementation detail, and users should
> > not have to care about the internal implementation because the
> > fundamental dependencies are all defined by the directory heirarchy
> > relationships that users can see and manipulate.
> >
> > i.e. fs internal dependencies only increase the size of the graph
> > that is persisted, but it will never be reduced to less than what
> > the user can observe in the directory heirarchy.
> >
> > So this can be further refined:
> >
> > If op1 precedes op2 in program order (in-memory execution
> > order), and op1 and op2 share a user visible reference, then
> > op2 must not be observed by a user after recovery without
> > also observing op1.
> >
> > e.g. in the case of the parent directory - the parent has a link
> > count. Hence every create, unlink, rename, hard link, symlink, etc
> > operation in a directory modifies a user visible link count
> > reference.  Hence fsync of one of those children will persist the
> > directory link count, and then all of the other preceeding
> > transactions that modified the link count also need to be persisted.
> >
> 
> One thing that bothers me is that the definition of SOMC (as well as
> your refined definition) doesn't mention fsync at all, but all the examples
> only discuss use cases with fsync.

You can't discuss operational ordering without a point in time to
use as a reference for that ordering.  SOMC behaviour is preserved
at any point the filesystem checkpoints itself, and the only thing
that changes is the scope of that checkpoint. fsync is just a
convenient, widely understood, minimum dependency reference point
that people can reason from. All the interesting ordering problems
come from the minimum dependency reference point (i.e. fsync()), not from
background filesystem-wide checkpoints.

> I personally find the SOMC guarantee *much* more powerful in the absence
> of fsync. I have an application that creates sparse files, sets xattrs, mtime
> and moves them into place. The observed requirement is that after a crash
> those files either exist with correct mtime and xattrs, or do not exist.

SOMC does not provide the guarantees you seek in the absence of a
known data synchronisation point:

a) a background metadata checkpoint can land anywhere in
that series of operations and hence recovery will land in an
intermediate state.

b) there is data that needs writing, and SOMC provides no
ordering guarantees for data. So after recovery the file could
exist with correct mtime and xattrs, but have no (or
partial) data.

> To my understanding, SOMC provides a guarantee that the application does
> not need to do any fsync at all,

Absolutely not true. If the application has atomic creation
requirements that need multiple syscalls to set up, it must
implement them itself and use fsync to synchronise data and metadata
before the "atomic create" operation that makes it visible to the
application.

SOMC only guarantees what /metadata/ you see at a fileystem
synchronisation point; it does not provide ACID semantics to a
random set of system calls into the filesystem.
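
i.e. the classic create-write-fsync-rename pattern, sketched below
with made-up names and no error handling:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("dir/file.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	write(fd, "data", 4);		/* plus xattrs, mtime, etc. */
	fsync(fd);			/* data and metadata stable first */
	close(fd);

	rename("dir/file.tmp", "dir/file");	/* the "atomic create" */

	/* only needed if the rename itself must survive a crash right now */
	int dfd = open("dir", O_RDONLY | O_DIRECTORY);

	fsync(dfd);
	close(dfd);
	return 0;
}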

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] Documenting the crash-recovery guarantees of Linux file systems

2019-03-17 Thread Dave Chinner
On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner  wrote:
> >
> > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner  wrote:
> > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > +Strictly Ordered Metadata Consistency
> > > > > +-
> > > > > +With each file system providing varying levels of persistence
> > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > +developers to work with certain fixed assumptions about file system
> > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > +
> > > > > +Under this scheme, the file system guarantees to persist all previous
> > > > > +dependent modifications to the object upon fsync().  If you fsync() 
> > > > > an
> > > > > +inode, it will persist all the changes required to reference the 
> > > > > inode
> > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > +
> > > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > > +user after recovery without also observing op1.
> > > > > +
> > > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > > +share a dependency, which could be file-system specific. It might
> > > > > +require a developer to understand file-system internals to know if
> > > > > +SOMC would order one operation before another.
> > > >
> > > > That's largely an internal implementation detail, and users should
> > > > not have to care about the internal implementation because the
> > > > fundamental dependencies are all defined by the directory heirarchy
> > > > relationships that users can see and manipulate.
> > > >
> > > > i.e. fs internal dependencies only increase the size of the graph
> > > > that is persisted, but it will never be reduced to less than what
> > > > the user can observe in the directory heirarchy.
> > > >
> > > > So this can be further refined:
> > > >
> > > > If op1 precedes op2 in program order (in-memory execution
> > > > order), and op1 and op2 share a user visible reference, then
> > > > op2 must not be observed by a user after recovery without
> > > > also observing op1.
> > > >
> > > > e.g. in the case of the parent directory - the parent has a link
> > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > operation in a directory modifies a user visible link count
> > > > reference.  Hence fsync of one of those children will persist the
> > > > directory link count, and then all of the other preceeding
> > > > transactions that modified the link count also need to be persisted.
> > > >
> > >
> > > One thing that bothers me is that the definition of SOMC (as well as
> > > your refined definition) doesn't mention fsync at all, but all the 
> > > examples
> > > only discuss use cases with fsync.
> >
> > You can't discuss operational ordering without a point in time to
> > use as a reference for that ordering.  SOMC behaviour is preserved
> > at any point the filesystem checkpoints itself, and the only thing
> > that changes is the scope of that checkpoint. fsync is just a
> > convenient, widely understood, minimum dependecy reference point
> > that people can reason from. All the interesting ordering problems
> > come from minimum dependecy reference point (i.e. fsync()), not from
> > background filesystem-wide checkpoints.
> >
> 
> Yes, I was referring to rename as an operation commonly used
> by applications as a "metadata barrier".

What is a "metadata barrier" and what are its semantics supposed to
be?

> > > I personally find the SOMC guarantee *much* more powerful in the absence
> > > of fsync. I have an application that creates sparse files, sets xattrs, 
> > > mtime
> > > and moves them into place. The observed requirement is that after crash
> > > those files eit

Re: [PATCH v2] Documenting the crash-recovery guarantees of Linux file systems

2019-03-18 Thread Dave Chinner
On Mon, Mar 18, 2019 at 09:13:58AM +0200, Amir Goldstein wrote:
> On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner  wrote:
> > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner  wrote:
> > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner  
> > > > > wrote:
> > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > > > +Strictly Ordered Metadata Consistency
> > > > > > > +-
> > > > > > > +With each file system providing varying levels of persistence
> > > > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > > > +developers to work with certain fixed assumptions about file 
> > > > > > > system
> > > > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > > > +
> > > > > > > +Under this scheme, the file system guarantees to persist all 
> > > > > > > previous
> > > > > > > +dependent modifications to the object upon fsync().  If you 
> > > > > > > fsync() an
> > > > > > > +inode, it will persist all the changes required to reference the 
> > > > > > > inode
> > > > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > > > +
> > > > > > > +If op1 precedes op2 in program order (in-memory execution 
> > > > > > > order), and
> > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by 
> > > > > > > a
> > > > > > > +user after recovery without also observing op1.
> > > > > > > +
> > > > > > > +Unfortunately, SOMC's definition depends upon whether two 
> > > > > > > operations
> > > > > > > +share a dependency, which could be file-system specific. It might
> > > > > > > +require a developer to understand file-system internals to know 
> > > > > > > if
> > > > > > > +SOMC would order one operation before another.
> > > > > >
> > > > > > That's largely an internal implementation detail, and users should
> > > > > > not have to care about the internal implementation because the
> > > > > > fundamental dependencies are all defined by the directory hierarchy
> > > > > > relationships that users can see and manipulate.
> > > > > >
> > > > > > i.e. fs internal dependencies only increase the size of the graph
> > > > > > that is persisted, but it will never be reduced to less than what
> > > > > > the user can observe in the directory hierarchy.
> > > > > >
> > > > > > So this can be further refined:
> > > > > >
> > > > > > If op1 precedes op2 in program order (in-memory execution
> > > > > > order), and op1 and op2 share a user visible reference, then
> > > > > > op2 must not be observed by a user after recovery without
> > > > > > also observing op1.
> > > > > >
> > > > > > e.g. in the case of the parent directory - the parent has a link
> > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > > > operation in a directory modifies a user visible link count
> > > > > > reference.  Hence fsync of one of those children will persist the
> > > > > > directory link count, and then all of the other preceding
> > > > > > transactions that modified the link count also need to be persisted.
> > > > > >
> > > > >
> > > > > One thing that bothers me is that the definition of SOMC (as well as
> > > > > your refined definition) doesn't mention fsync at all, but all the 
> > > > > examples
> > > > > only discuss use cases with fsync.
> > > >
> > > > You can't discuss operational ordering without a point in time to
> > > > use as a reference for that ordering.  SOMC behaviour is preserved
> > > 
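
To make the refined rule and the parent-directory link-count argument quoted
above concrete, here is a minimal userspace sketch. It assumes a filesystem
that provides SOMC; the directory and file names are hypothetical and error
handling is omitted for brevity.

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int a, b;

	/* op1: creating the directory modifies its parent. */
	mkdir("logs", 0755);

	/* op2: create a file; this modifies the "logs" directory. */
	a = open("logs/0001.rec", O_CREAT | O_WRONLY, 0644);
	close(a);

	/* op3: create a sibling file and fsync only this one. */
	b = open("logs/0002.rec", O_CREAT | O_WRONLY, 0644);
	write(b, "x", 1);
	fsync(b);
	/*
	 * fsync(b) persists everything needed to reference logs/0002.rec:
	 * the creation of "logs" (op1), and -- because op2 modified the
	 * same user-visible directory reference -- the creation of
	 * 0001.rec as well.  After a crash, 0002.rec is never observed
	 * without 0001.rec and "logs".
	 */
	close(b);
	return 0;
}
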

Re: [PATCH v2] Documenting the crash-recovery guarantees of Linux file systems

2019-03-18 Thread Dave Chinner
On Mon, Mar 18, 2019 at 09:37:28PM -0500, Vijay Chidambaram wrote:
> For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin
> and Jayashree's advisor. We recently developed CrashMonkey, a tool for
> finding crash-consistency bugs in file systems. As part of the
> research effort, we had a lot of conversations with file-system
> developers to understand the guarantees provided by different file
> systems. This patch was inspired by the thought that we should quickly
> document what we know about the data integrity guarantees of different
> file systems. We did not expect to spur debate!
> 
> Thanks Dave, Amir, and Ted for the discussion. We will incorporate
> these comments into the next patch. If it is better to wait until a
> consensus is reached after the LSF meeting, we'd be happy to do so.
> 
> On Mon, Mar 18, 2019 at 2:14 AM Amir Goldstein  wrote:
> >
> > On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner  wrote:
> > >
> > > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> > > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner  
> > > > wrote:
> > > > >
> > > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner  
> > > > > > wrote:
> > > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > > > > +Strictly Ordered Metadata Consistency
> > > > > > > > +-
> > > > > > > > +With each file system providing varying levels of persistence
> > > > > > > > > +guarantees, a consensus in this regard will benefit
> > > > > > > > application
> > > > > > > > +developers to work with certain fixed assumptions about file 
> > > > > > > > system
> > > > > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > > > > +
> > > > > > > > +Under this scheme, the file system guarantees to persist all 
> > > > > > > > previous
> > > > > > > > +dependent modifications to the object upon fsync().  If you 
> > > > > > > > fsync() an
> > > > > > > > +inode, it will persist all the changes required to reference 
> > > > > > > > the inode
> > > > > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > > > > +
> > > > > > > > +If op1 precedes op2 in program order (in-memory execution 
> > > > > > > > order), and
> > > > > > > > +op1 and op2 share a dependency, then op2 must not be observed 
> > > > > > > > by a
> > > > > > > > +user after recovery without also observing op1.
> > > > > > > > +
> > > > > > > > +Unfortunately, SOMC's definition depends upon whether two 
> > > > > > > > operations
> > > > > > > > +share a dependency, which could be file-system specific. It 
> > > > > > > > might
> > > > > > > > +require a developer to understand file-system internals to 
> > > > > > > > know if
> > > > > > > > +SOMC would order one operation before another.
> > > > > > >
> > > > > > > That's largely an internal implementation detail, and users should
> > > > > > > not have to care about the internal implementation because the
> > > > > > > fundamental dependencies are all defined by the directory 
> > > > > > > > hierarchy
> > > > > > > relationships that users can see and manipulate.
> > > > > > >
> > > > > > > i.e. fs internal dependencies only increase the size of the graph
> > > > > > > that is persisted, but it will never be reduced to less than what
> > > > > > > > the user can observe in the directory hierarchy.
> > > > > > >
> > > > > > > So this can be further refined:
> > > > > > >
> > > > > > > If op1 precedes op2 in program order (in-memory execution
> > > > > > > order), and op1 and op2 share a user visible refer

Re: [PATCH v2] Documenting the crash-recovery guarantees of Linux file systems

2019-03-19 Thread Dave Chinner
On Tue, Mar 19, 2019 at 09:35:19AM +0200, Amir Goldstein wrote:
> On Tue, Mar 19, 2019 at 5:13 AM Dave Chinner  wrote:
> > That is, sync_file_range() is only safe to use for this specific
> > sort of ordered data integrity algorithm when flushing the entire
> > file.(*)
> >
> > create
> > setxattr
> > write                              metadata volatile
> >   delayed allocation               data volatile
> >
> > sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
> >                 SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> >   Extent Allocation                metadata volatile
> >   <data writeback> ------> device -+
> >                                    data volatile
> >   <-------- complete --------------+
> >
> > rename                             metadata volatile
> >
> > And so at this point, we only need a device cache flush to
> > make the data persistent and a journal flush to make the rename
> > persistent. And so it ends up the same case as non-AIO O_DIRECT.
> >
> 
> Funny, I once told that story and one Dave Chinner told me
> "Nice story, but wrong.":
> https://patchwork.kernel.org/patch/10576303/#22190719
> 
> You pointed to the minor detail that sync_file_range() uses
> WB_SYNC_NONE.

Ah, I forgot about that. That's what I get for not looking at the
code. Did I mention that SFR is a complete crock of shit when it
comes to data integrity operations? :/

> So yes, I agree, it is a nice story and we need to make it right,
> by having an API (perhaps SYNC_FILE_RANGE_ALL).
> When you pointed out my mistake, I switched the application to
> use the FIEMAP_FLAG_SYNC API as a hack.

Yeah, that's a nasty hack :/

> Besides tests and documentation what could be useful is a portable
> user space library that just does the right thing for every filesystem.

*nod*

but before that, we need the model to be defined and documented.
And once we have a library, the fun part of convincing the world
that it should be the glibc default behaviour can begin.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
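
For reference, the sequence in the diagram above can be sketched in C using
the FIEMAP_FLAG_SYNC workaround Amir mentions in place of sync_file_range().
This is only an illustration: the temp-file name, destination path and xattr
are made up, error handling is minimal, and as the thread notes, durability
still needs a journal commit and device cache flush (or an fsync).

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/xattr.h>
#include <unistd.h>
#include <linux/fiemap.h>
#include <linux/fs.h>

/* Force the file's data to be written back before it is renamed into place. */
static int flush_data(int fd)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;
	fm.fm_flags = FIEMAP_FLAG_SYNC;	/* sync the file before mapping */
	fm.fm_extent_count = 0;		/* only the sync side effect is wanted */

	return ioctl(fd, FS_IOC_FIEMAP, &fm);
}

int install_file(const char *tmppath, const char *dstpath,
		 const void *buf, size_t len)
{
	int fd = open(tmppath, O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	/* create + xattr: metadata, volatile until the journal commits */
	fsetxattr(fd, "user.origin", "staged", 6, 0);

	/* data: volatile, delayed allocation until writeback */
	if (write(fd, buf, len) != (ssize_t)len || flush_data(fd) < 0) {
		close(fd);
		unlink(tmppath);
		return -1;
	}
	close(fd);

	/*
	 * rename: metadata, volatile, but now ordered after the data above,
	 * so after a crash the file is never seen at dstpath without its data.
	 */
	return rename(tmppath, dstpath);
}
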


Re: [PATCH v2] Documenting the crash-recovery guarantees of Linux file systems

2019-03-19 Thread Dave Chinner
On Tue, Mar 19, 2019 at 11:17:09AM -0400, Theodore Ts'o wrote:
> On Mon, Mar 18, 2019 at 09:37:28PM -0500, Vijay Chidambaram wrote:
> > For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin
> > and Jayashree's advisor. We recently developed CrashMonkey, a tool for
> > finding crash-consistency bugs in file systems. As part of the
> > research effort, we had a lot of conversations with file-system
> > developers to understand the guarantees provided by different file
> > systems. This patch was inspired by the thought that we should quickly
> > document what we know about the data integrity guarantees of different
> > file systems. We did not expect to spur debate!
> > 
> > Thanks Dave, Amir, and Ted for the discussion. We will incorporate
> > these comments into the next patch. If it is better to wait until a
> > consensus is reached after the LSF meeting, we'd be happy to do so.
> 
> Something to consider is that certain side effects of what fsync(2) or
> fdatasync(2) might drag into the jbd2 transaction might change if we
> were to implement (for example) something like Daejun Park and Dongkun
> Shin's "iJournaling: Fine-grained journaling for improving the latency
> of fsync system call" published in Usenix, ATC 2017:
> 
>https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf
> 
> That's an example of how if we document synchronization that goes
> beyond POSIX, it might change in the future.

Sure, but again this is orthogonal to what we are discussing here:
the user visible ordering of metadata operations after a crash.

If anyone implements a multi-segment or per-inode journal (say, like
NOVA), then it is up to that implementation to maintain the ordering
guarantees that a SOMC model requires. You can implement whatever
fsync() go-fast bits you want, as long as it provides the ordering
behaviour guarantees that the model defines.

IOWs, Ted, I think you have the wrong end of the stick here. This
isn't about optimising fsync() to provide better performance, it's
about guaranteeing order so that fsync() is not necessary and we
improve performance by allowing applications to omit order-only
synchronisation points in their workloads.

i.e. an order-based integrity model /reduces/ the need for a
hyper-optimised fsync operation because applications won't need to
use it as often.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
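
As a sketch of what omitting order-only synchronisation points can mean in
practice: a hypothetical checkpoint writer that fsync()s only when the caller
needs durability, relying on SOMC for the relative order of the creates. The
path format is made up and the "ckpt" directory is assumed to exist.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int write_checkpoint(int n, const char *buf, size_t len, int durable)
{
	char path[64];
	int fd;

	snprintf(path, sizeof(path), "ckpt/%08d", n);
	fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	write(fd, buf, len);
	if (durable)
		fsync(fd);	/* a durability point, not an ordering point */
	/*
	 * No per-checkpoint fsync otherwise: all the creates modify the
	 * same parent directory, so after a crash a later checkpoint's
	 * entry is never observed without the earlier ones.  Whether the
	 * *data* of an un-fsynced checkpoint reached the device is a
	 * separate question -- see the sync_file_range discussion above.
	 */
	close(fd);
	return 0;
}
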


Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

2024-02-28 Thread Dave Chinner
On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> On 24/02/26 04:58PM, Luis Chamberlain wrote:
> > On Mon, Feb 26, 2024 at 1:16 PM John Groves  wrote:
> > >
> > > On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > > > Run status group 0 (all jobs):
> > > > >   WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s 
> > > > > (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> > > >
> > > > > This is run on an xfs file system on a SATA ssd.
> > > >
> > > > To compare more closer apples to apples, wouldn't it make more sense
> > > > to try this with XFS on pmem (with fio -direct=1)?
> > > >
> > > >   Luis
> > >
> > > Makes sense. Here is the same command line I used with xfs before, but
> > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> > > because xfs requires that).
> > >
> > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 
> > > --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none 
> > > --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 
> > > --directory=/mnt/xfs
> > 
> > Could you try with mkfs.xfs -d agcount=1024

Won't change anything for the better, may make things worse.

>bw (  MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, 
> stdev=165.61, samples=719
>iops: min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
>   lat (usec)   : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
>   lat (usec)   : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
>   lat (msec)   : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%

Most of the IO latencies are up round the 4-20ms marks. That seems
kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
should only take a couple of hundred microseconds. For Famfs, the
latencies appear to be around 1-4ms.

So where's all that extra time coming from?


>   lat (msec)   : 100=0.08%
>   cpu  : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511

And why is system time reporting at almost zero instead of almost
all the remaining cpu time (i.e. up at 80-90%)?

Can you run call-graph kernel profiles for XFS and famfs whilst
running this workload so we have some insight into what is behaving
differently here?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

2024-03-10 Thread Dave Chinner
On Thu, Feb 29, 2024 at 08:52:48AM -0600, John Groves wrote:
> On 24/02/29 01:15PM, Dave Chinner wrote:
> > On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> > >bw (  MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, 
> > > stdev=165.61, samples=719
> > >iops: min= 2516, max=13670, avg=7160.17, stdev=82.88, 
> > > samples=719
> > >   lat (usec)   : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
> > >   lat (usec)   : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
> > >   lat (msec)   : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
> > 
> > Most of the IO latencies are up round the 4-20ms marks. That seems
> > kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
> > should only take a couple of hundred microseconds. For Famfs, the
> > latencies appear to be around 1-4ms.
> > 
> > So where's all that extra time coming from?
> 
> Below, you will see two runs with performance and latency distribution
> about the same as famfs (the answer for that was --fallocate=native).

Ah, that is exactly what I suspected, and was wanting profiles
because that will show up in them clearly.

> > >   lat (msec)   : 100=0.08%
> > >   cpu  : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
> > 
> > And why is system time reporting at almost zero instead of almost
> > all the remaining cpu time (i.e. up at 80-90%)?
> 
> Something weird is going on with the cpu reporting. Sometimes sys=~0, but 
> other times
> it's about what you would expect. I suspect some sort of measurement error,
> like maybe the method doesn't work with my cpu model? (I'm grasping, but with
> a somewhat rational basis...)
> 
> I pasted two xfs runs below. The first has the wonky cpu sys value, and
> the second looks about like what one would expect.
> 
> > 
> > Can you run call-graph kernel profiles for XFS and famfs whilst
> > running this workload so we have some insight into what is behaving
> > differently here?
> 
> Can you point me to an example of how to do that?

perf record --call-graph ...
perf report --call-graph ...


> I'd been thinking about the ~2x gap for a few days, and the most obvious
> difference is famfs files must be preallocated (like fallocate, but works
> a bit differently since allocation happens in user space). I just checked 
> one of the xfs files, and it had maybe 80 extents (whereas the famfs 
> files always have 1 extent here).

Which is about 4MB per extent. Extent size is not the problem for
zero-seek-latency storage hardware, though.

Essentially what you are seeing is interleaving extent allocation
between all the files because they are located in the same
directory. The locality algorithm is trying to place the data
extents close to the owner inode, but the inodes are also all close
together because they are located in the same AG as the parent
directory inode. Allocation concurrency is created by placing new
directories in different allocation groups, so we end up with
workloads in different directories being largely isolated from each
other.

However, that means when you are trying to write to many files in
the same directory at the same time, they are largely all competing
for the same AG lock to do block allocation during IO submission.
That creates interleaving of write() sized extents between different
files. We use speculative preallocation for buffered IO to avoid
this, and for direct IO the application needs to use extent size hints
or preallocation to avoid this contention-based interleaving.

IOWs, by using fallocate() to preallocate all the space there will
be no allocation during IO submission and so the serialisation that
occurs due to competing allocations just goes away...

> FWIW I ran xfs with and without io_uring, and there was no apparent
> difference (which makes sense to me because it's not block I/O).
> 
> The prior ~2x gap still seems like a lot of overhead for extent list 
> mapping to memory, but adding --fallocate=native to the xfs test brought 
> it into line with famfs:

As I suspected. :)

As for CPU usage accounting, the number of context switches says it
all.

"Bad":

>   cpu  : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801

"good":

>   cpu  : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708

I'd say that in the "bad" case most of the kernel work is being
shuffled off to kernel threads to do the work and so it doesn't get
accounted to the submission task.  In comparison, in the "good" case
the work is being done in the submission thread and hence there's a
lot fewer context switches and the system time is correctly
accounted to the submission task.

Perhaps an io_uring task accounting problem?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
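
For reference, the preallocate-then-write pattern being compared here
(roughly what fio's --fallocate=native does ahead of its IO phase) looks like
this at the syscall level. The path, file size, IO size and buffer alignment
are illustrative only.

#define _GNU_SOURCE		/* for O_DIRECT and fallocate() */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FILE_SIZE	(256ULL << 20)	/* 256 MiB, as in the fio job */
#define IO_SIZE		(2ULL << 20)	/* 2 MiB writes, as in the fio job */

int write_preallocated(const char *path)
{
	void *buf;
	int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0644);

	if (fd < 0)
		return -1;

	/* Allocate every block up front so the IO path never allocates. */
	if (fallocate(fd, 0, 0, FILE_SIZE) < 0) {
		close(fd);
		return -1;
	}

	/* O_DIRECT needs an aligned buffer; contents don't matter here. */
	if (posix_memalign(&buf, 4096, IO_SIZE)) {
		close(fd);
		return -1;
	}
	memset(buf, 0, IO_SIZE);

	for (off_t off = 0; off < (off_t)FILE_SIZE; off += IO_SIZE)
		pwrite(fd, buf, IO_SIZE, off);	/* no block allocation here */

	free(buf);
	close(fd);
	return 0;
}
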



Re: [PATCH 00/13] fs/dax: Fix FS DAX page reference counts

2024-06-30 Thread Dave Chinner
On Thu, Jun 27, 2024 at 10:54:15AM +1000, Alistair Popple wrote:
> FS DAX pages have always maintained their own page reference counts
> without following the normal rules for page reference counting. In
> particular pages are considered free when the refcount hits one rather
> than zero and refcounts are not added when mapping the page.
> 
> Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> mechanism for allowing GUP to hold references on the page (see
> get_dev_pagemap). However there doesn't seem to be any reason why FS
> DAX pages need their own reference counting scheme.
> 
> By treating the refcounts on these pages the same way as normal pages
> we can remove a lot of special checks. In particular pXd_trans_huge()
> becomes the same as pXd_leaf(), although I haven't made that change
> here. It also frees up a valuable SW define PTE bit on architectures
> that have devmap PTE bits defined.
> 
> It also almost certainly allows further clean-up of the devmap managed
> functions, but I have left that as a future improvement.
> 
> This is an update to the original RFC rebased onto v6.10-rc5. Unlike
> the original RFC it passes the same number of ndctl test suite
> (https://github.com/pmem/ndctl) tests as my current development
> environment does without these patches.

I strongly suggest running fstests on pmem devices with '-o
dax=always' mount options to get much more comprehensive fsdax test
coverage. That exercises a lot of the weird mmap corner cases that
cause problems so it would be good to actually test that nothing new
got broken in FSDAX by this patchset.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com