Re: [PATCH 1/4] module: Add module_subinit{_noexit} and module_subeixt helper macros

2024-07-28 Thread Theodore Ts'o
On Mon, Jul 29, 2024 at 09:46:17AM +0800, Youling Tang wrote:
> 1. Previous version implementation: array mode (see link 1) :
>    Advantages:
>    - Few changes, simple principle, easy to understand code.
>    Disadvantages:
>    - Each modified module needs to maintain an array, more code.
> 
> 2. Current implementation: explicit call subinit in initcall (see link 2) :
>    Advantages:
>    - Direct use of helpes macros, the subinit call sequence is
>  intuitive, and the implementation is relatively simple.
>    Disadvantages:
>    - helper macros need to be implemented compared to array mode.
> 
> 3. Only one module_subinit per file (not implemented, see link 3) :
>    Advantage:
>    - No need to display to call subinit.
>    Disadvantages:
>    - Magic order based on Makefile makes code more fragile,
>    - Make sure that each file has only one module_subinit,
>    - It is not intuitive to know which subinits the module needs
>  and in what order (grep and Makefile are required),
>    - With multiple subinits per module, it would be difficult to
>  define module_{subinit, subexit} by MODULE, and difficult to
>  rollback when initialization fails (I haven't found a good way
>  to do this yet).
> 
>
> Personally, I prefer the implementation of method two.

But there's also method zero --- keep things the way they are, and
don't try to add a new astraction.

Advantage:

 -- Code has worked for decades, so it is very well tested
 -- Very easy to understand and maintain

Disadvantage

 --- A few extra lines of C code.

which we need to weigh against the other choices.

   - Ted



Re: [PATCH 1/4] module: Add module_subinit{_noexit} and module_subeixt helper macros

2024-07-27 Thread Theodore Ts'o
On Fri, Jul 26, 2024 at 11:09:02AM -0700, Christoph Hellwig wrote:
> On Fri, Jul 26, 2024 at 01:58:00PM -0400, Theodore Ts'o wrote:
> > Yeah, that's my reaction as well.  This only saves 50 lines of code in
> > ext4, and that includes unrelated changes such as getting rid of "int
> > i" and putting the declaration into the for loop --- "for (int i =
> > ...").  Sure, that saves two lines of code, but yay?
> > 
> > If the ordering how the functions gets called is based on the magic
> > ordering in the Makefile, I'm not sure this actually makes the code
> > clearer, more robust, and easier to maintain for the long term.
> 
> So you two object to kernel initcalls for the same reason and would
> rather go back to calling everything explicitly?

I don't oject to kernel initcalls which don't have any
interdependencies and where ordering doesn't matter.

- Ted



Re: [PATCH 1/4] module: Add module_subinit{_noexit} and module_subeixt helper macros

2024-07-26 Thread Theodore Ts'o
On Fri, Jul 26, 2024 at 05:22:37PM +0200, David Sterba wrote:
> All of this sounds overengineered for something that is a simple array
> and two helpers. The code is not finalized so I'll wait for the next
> version but specific file order in makefile and linker tricks seems
> fragile and I'm not sure I want this for btrfs.

Yeah, that's my reaction as well.  This only saves 50 lines of code in
ext4, and that includes unrelated changes such as getting rid of "int
i" and putting the declaration into the for loop --- "for (int i =
...").  Sure, that saves two lines of code, but yay?

If the ordering how the functions gets called is based on the magic
ordering in the Makefile, I'm not sure this actually makes the code
clearer, more robust, and easier to maintain for the long term.

 - Ted



Re: page->index limitation on 32bit system?

2021-02-19 Thread Theodore Ts'o
On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:
> So it means the 32bit archs are already 2nd tier targets for at least
> upstream linux kernel?

At least as far as btrfs is concerned, anyway

> Or would it be possible to make it an option to make the index u64?
> So guys who really wants large file support can enable it while most
> other 32bit guys can just keep the existing behavior?

I think if this is going to be done at all, it would need to be a
compile-time CONFIG option to make the index be 64-bits.  That's
because there are a huge number of low-end Android devices (retail
price ~$30 USD in India, for example --- this set of customers is
sometimes called "the next billion users" by some folks) that are
using 32-bit ARM systems.  And they will be using ext4 or f2fs, and it
would be massively unfortunate/unfair/etc. to impose that performance
penalty on them.

It sounds like what Willy is saying is that supporting a 64-bit page
index on 32-bit platforms is going to be have a lot of downsides, and
not just the performance / memory overhead issue.  It's also a code
mainteinance concern, and that tax would land on the mm developers.
And if it's not well-maintained, without regular testing, it's likely
to be heavily subject to bitrot.  (Although I suppose if we don't mind
doubling the number of configs that kernelci has to test, this could
be mitigated.)

In contrast, changing btrfs to not depend on a single address space
for all of its metadata might be a lot of work, but it's something
which lands on the btrfs developers, as opposed to a another (perhaps
more central) kernel subsystem.  Managing at this tradeoff is
something that is going to be between the mm developers and the btrfs
developers, but as someone who doesn't do any work on either of these
subsystems, it seems like a pretty obvious choice.

The final observation I'll make is that if we know which NAS box
vendor can (properly) support volumes > 16 TB, we can probably find
the 64-bit page index patch.  It'll probably be against a fairly old
kernel, so it might not all _that_ helpful, but it might give folks a
bit of a head start.

I can tell you that the NAS box vendor that it _isn't_ is Synology.
Synology boxes uses btrfs, and on 32-bit processors, they have a 16TB
volume size limit, and this is enforced by the Synology NAS
software[1].  However, Synology NAS boxes can support multiple
volumes; until today, I never understood why, since it seemed to be
unnecessary complexity, but I suspect the real answer was this was how
Synology handled storage array sizes > 16TB on their older systems.
(All of their new NAS boxes use 64-bit processors.)

[1] https://www.reddit.com/r/synology/comments/a62xrx/max_volume_size_of_16tb/

Cheers,

- Ted


Re: [PATCH 1/6] mm/fs: don't allow writes to immutable files

2019-06-20 Thread Theodore Ts'o
On Thu, Jun 20, 2019 at 03:13:06PM -0700, Darrick J. Wong wrote:
> > I note that this patch doesn't allow writes to swap files.  So Amir's
> > generic/554 test will still fail for those file systems that don't use
> > copy_file_range.
> 
> I didn't add any IS_SWAPFILE checks here, so I'm not sure to what you're
> referring?

Sorry, my bad; I mistyped.  What I should have said is this patch
doesn't *prohibit* writes to swap files

(And so Amir's generic/554 test, even modified so it allow reads from
swapfiles, but not writes, when using copy_file_range, is still
failing for ext4.  I was looking to see if I could remove it from my
exclude list, but not yet.  :-)

> > I'm indifferent as to whether you add a new patch, or include that
> > change in this patch, but perhaps we should fix this while we're
> > making changes in these code paths?
> 
> The swapfile patches should be in a separate patch, which I was planning
> to work on but hadn't really gotten around to it.

Ok, great, thanks!!

- Ted


Re: [PATCH 1/6] mm/fs: don't allow writes to immutable files

2019-06-20 Thread Theodore Ts'o
On Mon, Jun 10, 2019 at 09:46:17PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> The chattr manpage has this to say about immutable files:
> 
> "A file with the 'i' attribute cannot be modified: it cannot be deleted
> or renamed, no link can be created to this file, most of the file's
> metadata can not be modified, and the file can not be opened in write
> mode."
> 
> Once the flag is set, it is enforced for quite a few file operations,
> such as fallocate, fpunch, fzero, rm, touch, open, etc.  However, we
> don't check for immutability when doing a write(), a PROT_WRITE mmap(),
> a truncate(), or a write to a previously established mmap.
> 
> If a program has an open write fd to a file that the administrator
> subsequently marks immutable, the program still can change the file
> contents.  Weird!
> 
> The ability to write to an immutable file does not follow the manpage
> promise that immutable files cannot be modified.  Worse yet it's
> inconsistent with the behavior of other syscalls which don't allow
> modifications of immutable files.
> 
> Therefore, add the necessary checks to make the write, mmap, and
> truncate behavior consistent with what the manpage says and consistent
> with other syscalls on filesystems which support IMMUTABLE.
> 
> Signed-off-by: Darrick J. Wong 

I note that this patch doesn't allow writes to swap files.  So Amir's
generic/554 test will still fail for those file systems that don't use
copy_file_range.

I'm indifferent as to whether you add a new patch, or include that
change in this patch, but perhaps we should fix this while we're
making changes in these code paths?

- Ted


Re: [PATCH 1/8] mm/fs: don't allow writes to immutable files

2019-06-10 Thread Theodore Ts'o
On Mon, Jun 10, 2019 at 09:09:34AM -0700, Darrick J. Wong wrote:
> > I was planning on only taking 8/8 through the ext4 tree.  I also added
> > a patch which filtered writes, truncates, and page_mkwrites (but not
> > mmap) for immutable files at the ext4 level.
> 
> *Oh*.  I saw your reply attached to the 1/8 patch and thought that was
> the one you were taking.  I was sort of surprised, tbh. :)

Sorry, my bad.  I mis-replied to the wrong e-mail message  :-)

> > I *could* take this patch through the mm/fs tree, but I wasn't sure
> > what your plans were for the rest of the patch series, and it seemed
> > like it hadn't gotten much review/attention from other fs or mm folks
> > (well, I guess Brian Foster weighed in).
> 
> > What do you think?
> 
> Not sure.  The comments attached to the LWN story were sort of nasty,
> and now that a couple of people said "Oh, well, Debian documented the
> inconsistent behavior so just let it be" I haven't felt like
> resurrecting the series for 5.3.

Ah, I had missed the LWN article.   

Yeah, it's the same set of issues that we had discussed when this
first came up.  We can go round and round on this one; It's true that
root can now cause random programs which have a file mmap'ed for
writing to seg fault, but root has a million ways of killing and
otherwise harming running application programs, and it's unlikely
files get marked for immutable all that often.  We just have to pick
one way of doing things, and let it be same across all the file
systems.

My understanding was that XFS had chosen to make the inode immutable
as soon as the flag is set (as opposed to forbidding new fd's to be
opened which were writeable), and I was OK moving ext4 to that common
interpretation of the immmutable bit, even though it would be a change
to ext4.

And then when I saw that Amir had included a patch that would cause
test failures unless that patch series was applied, it seemed that we
had all thought that the change was a done deal.  Perhaps we should
have had a more explicit discussion when the test was sent for review,
but I had assumed it was exclusively a copy_file_range set of tests,
so I didn't realize it was going to cause ext4 failures.

 - Ted


Re: [PATCH 1/8] mm/fs: don't allow writes to immutable files

2019-06-10 Thread Theodore Ts'o
On Sun, Jun 09, 2019 at 09:41:44PM -0700, Darrick J. Wong wrote:
> On Sun, Jun 09, 2019 at 09:51:45PM -0400, Theodore Ts'o wrote:
> > On Wed, Apr 17, 2019 at 12:04:33PM -0700, Darrick J. Wong wrote:
>
> > Shouldn't this check be moved before the modification of vmf->flags?
> > It looks like do_page_mkwrite() isn't supposed to be returning with
> > vmf->flags modified, lest "the caller gets surprised".
> 
> Yeah, I think that was a merge error during a rebase... :(
> 
> Er ... if you're still planning to take this patch through your tree,
> can you move it to above the "vmf->flags = FAULT_FLAG_WRITE..." ?

I was planning on only taking 8/8 through the ext4 tree.  I also added
a patch which filtered writes, truncates, and page_mkwrites (but not
mmap) for immutable files at the ext4 level.

I *could* take this patch through the mm/fs tree, but I wasn't sure
what your plans were for the rest of the patch series, and it seemed
like it hadn't gotten much review/attention from other fs or mm folks
(well, I guess Brian Foster weighed in).

What do you think?

- Ted





Re: [PATCH 1/8] mm/fs: don't allow writes to immutable files

2019-06-09 Thread Theodore Ts'o
On Wed, Apr 17, 2019 at 12:04:33PM -0700, Darrick J. Wong wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index ab650c21bccd..dfd5eba278d6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2149,6 +2149,9 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
>  
>   vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
>  
> + if (vmf->vma->vm_file && IS_IMMUTABLE(file_inode(vmf->vma->vm_file)))
> + return VM_FAULT_SIGBUS;
> +
>   ret = vmf->vma->vm_ops->page_mkwrite(vmf);
>   /* Restore original flags so that caller is not surprised */
>   vmf->flags = old_flags;

Shouldn't this check be moved before the modification of vmf->flags?
It looks like do_page_mkwrite() isn't supposed to be returning with
vmf->flags modified, lest "the caller gets surprised".

   - Ted


Re: [PATCH 1/8] mm/fs: don't allow writes to immutable files

2019-06-09 Thread Theodore Ts'o
On Wed, Apr 17, 2019 at 12:04:33PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong 
> 
> The chattr manpage has this to say about immutable files:
> 
> "A file with the 'i' attribute cannot be modified: it cannot be deleted
> or renamed, no link can be created to this file, most of the file's
> metadata can not be modified, and the file can not be opened in write
> mode."
> 
> Once the flag is set, it is enforced for quite a few file operations,
> such as fallocate, fpunch, fzero, rm, touch, open, etc.  However, we
> don't check for immutability when doing a write(), a PROT_WRITE mmap(),
> a truncate(), or a write to a previously established mmap.
> 
> If a program has an open write fd to a file that the administrator
> subsequently marks immutable, the program still can change the file
> contents.  Weird!
> 
> The ability to write to an immutable file does not follow the manpage
> promise that immutable files cannot be modified.  Worse yet it's
> inconsistent with the behavior of other syscalls which don't allow
> modifications of immutable files.
> 
> Therefore, add the necessary checks to make the write, mmap, and
> truncate behavior consistent with what the manpage says and consistent
> with other syscalls on filesystems which support IMMUTABLE.
> 
> Signed-off-by: Darrick J. Wong 

Thanks, looks good.  I'm going to take this patch through the ext4
tree since it doesn't have any dependencies on the rest of the patch
series.

 - Ted


Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

2019-05-31 Thread Theodore Ts'o
On Fri, May 31, 2019 at 08:22:06PM +0300, Amir Goldstein wrote:
> >
> > This is I think more precise:
> >
> > This guarantee can be achieved by calling fsync(2) before linking
> > the file, but there may be more performant ways to provide these
> > semantics.  In particular, note that the use of the AT_ATOMIC_DATA
> > flag does *not* guarantee that the new link created by linkat(2)
> > will be persisted after a crash.
> 
> OK. Just to be clear, mentioning hardlinks and st_link is not needed
> in your opinion?

Your previous text stated that it was undefined what would happen to
all hardlinks belonging to the file, and that would imply that if a
file had N hard links, some in the directory which we are modifying,
and some in other directories, that somehow any of them might not be
present after the crash.  And that's not the case.  Suppose the file
currently has hardlinks test1/foo, test1/quux, and test2/baz --- and
we've called syncfs(2) on the file system so everything is persisted,
and then linkat(2) is used to create a new hardlink, test1/bar.

After a crash, the existence of test1/foo, test1/quux, and test2/baz
are not in question.  It's only unclear whether or not test1/bar
exists after the crash.

As far as st_nlink is concerned, the presumption is that the file
system itself will be consistent after the crash.  So if the hard link
has been persisted, then st_nlink will be incremented, if it has not,
it won't be.

Finally, one thing which gets hard about trying to state these sorts
of things as guarantees.  Sometimes, the file system won't *know*
whether or not it can make these guarantees.  For example what should
we do if the file system is mounted with nobarrier?  If the overall
hardware design includes UPS's or some other kind of battery backup,
the guarantee may very well exist.  But the file system code can't
know whether or not that is the case.  So my inclination is to allow
the file system to accept the flag even if the mount option nobarrier
is in play --- but in that case, the guarantee is only if the rest of
the system is designed appropriately.

(For that matter, it used to be that there existed hard drives that
lied about whether they had a writeback cache, and/or made the CACHE
FLUSH a no-op so they could win the Winbench benchmarketing wars,
which was worth millions and millions of dollars in sales.  So we can
only assume that the hardware isn't lying to us when we use words like
"guarantee".)

- Ted


Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

2019-05-31 Thread Theodore Ts'o
On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> What do you think of:
> 
> "AT_ATOMIC_DATA (since Linux 5.x)
> A filesystem which accepts this flag will guarantee that if the linked file
> name exists after a system crash, then all of the data written to the file
> and all of the file's metadata at the time of the linkat(2) call will be
> visible.

" will be visible after the the file system is remounted".  (Never
hurts to be explicit.)

> The way to achieve this guarantee on old kernels is to call fsync (2)
> before linking the file, but doing so will also results in flushing of
> volatile disk caches.
>
> A filesystem which accepts this flag does NOT
> guarantee that any of the file hardlinks will exist after a system crash,
> nor that the last observed value of st_nlink (see stat (2)) will persist."
> 

This is I think more precise:

This guarantee can be achieved by calling fsync(2) before linking
the file, but there may be more performant ways to provide these
semantics.  In particular, note that the use of the AT_ATOMIC_DATA
flag does *not* guarantee that the new link created by linkat(2)
will be persisted after a crash.

We should also document that a file system which does not implement
this flag MUST return EINVAL if it is passed this flag to linkat(2).

  - Ted


Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

2019-05-28 Thread Theodore Ts'o
On Mon, May 27, 2019 at 08:26:55PM +0300, Amir Goldstein wrote:
> 
> Following our discussions on LSF/MM and beyond [1][2], here is
> an RFC documentation patch.
> 
> Ted, I know we discussed limiting the API for linking an O_TMPFILE
> to avert the hardlinks issue, but I decided it would be better to
> document the hardlinks non-guaranty instead. This will allow me to
> replicate the same semantics and documentation to renameat(2).
> Let me know how that works out for you.
> 
> I also decided to try out two separate flags for data and metadata.
> I do not find any of those flags very useful without the other, but
> documenting them seprately was easier, because of the fsync/fdatasync
> reference.  In the end, we are trying to solve a social engineering
> problem, so this is the least confusing way I could think of to describe
> the new API.

The way you have stated thigs is very confusing, and prone to be
misunderstood.  I think it would be helpful to state things in the
positive, instead of the negative.

Let's review what you had wanted:

*If* the filename is visible in the directory after the crash,
*then* all of the metadata/data that had been written to the file
before the linkat(2) would be visible.

HOWEVER, you did not want to necessarily force an fsync(2) in
order to provide that guarantee.  That is, the filename would
not necessarily be guaranteed to be visible after a crash when
linkat(2) returns, but if the existence of the filename is
persisted, then the data would be too.

Also, at least initially we talked about this only making
sense for O_TMPFILE file desacriptors.  I believe you were
trying to generalize things so it wouldn't necessarily have to
be a file created using O_TMPFILE.  Is that correct?

So instead of saying "A filesystem that accepts this flag will
guaranty, that old inode data will not be exposed in the new linked
name."  It's much clearer to state this in the affirmative:

A filesystem which accepts this flag will guarantee that if
the new pathname exists after a crash, all of the data written
to the file at the time of the linkat(2) call will be visible.

I would think it's much simpler to say what *will* happen, instead of
what will not be visible.  (After all, technically speaking, returning
all zeros or random garbage data fufills the requirement "old data
will not be exposed", but that's probably not what you had in mind.  :-)

Also please note that it's spelled "guarantee".

Cheers,

- Ted


Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress

2019-05-16 Thread Theodore Ts'o
On Thu, May 16, 2019 at 10:54:57AM +0100, Filipe Manana wrote:
> 
> Haven't tried ext4 with 1 process only (instead of 4), but I can try
> to see if it happens without concurrency as well.

How many CPU's and how much memory were you using?  And I assume this
was using KVM/QEMU?  How was it configured?

Thanks,

- Ted


Re: [PATCH] fstests: generic, fsync fuzz tester with fsstress

2019-05-16 Thread Theodore Ts'o
On Wed, May 15, 2019 at 04:02:21PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Run fsstress, fsync every file and directory, simulate a power failure and
> then verify the all files and directories exist, with the same data and
> metadata they had before the power failure.
> 
> This tes has found already 2 bugs in btrfs, that caused mtime and ctime of
> directories not being preserved after replaying the log/journal and loss
> of a directory's attributes (such a UID and GID) after replaying the log.
> The patches that fix the btrfs issues are titled:
> 
>   "Btrfs: fix wrong ctime and mtime of a directory after log replay"
>   "Btrfs: fix fsync not persisting changed attributes of a directory"
> 
> Running this test 1000 times:
> 
> - on ext4 it has resulted in about a dozen journal checksum errors (on a
>   5.0 kernel) that resulted in failure to mount the filesystem after the
>   simulated power failure with dmflakey, which produces the following
>   error in dmesg/syslog:
> 
> [Mon May 13 12:51:37 2019] JBD2: journal checksum error
> [Mon May 13 12:51:37 2019] EXT4-fs (dm-0): error loading journal

I'm curious what configuration you used when you ran the test.  I
tried to reproduce it, and had no luck:

TESTRUNID: tytso-20190516042341
KERNEL:kernel 5.1.0-rc3-xfstests-00034-g0c72924ef346 #999 SMP Wed May 15 
00:56:08 EDT 2019 x86_64
CMDLINE:   -c 4k -C 1000 generic/547
CPUS:  2
MEM:   7680

ext4/4k: 1000 tests, 1855 seconds
Totals: 1000 tests, 0 skipped, 0 failures, 0 errors, 1855s

FSTESTPRJ: gce-xfstests
FSTESTVER: blktests baccddc (Wed, 13 Mar 2019 00:06:50 -0700)
FSTESTVER: fio  fio-3.2 (Fri, 3 Nov 2017 15:23:49 -0600)
FSTESTVER: fsverity bdebc45 (Wed, 5 Sep 2018 21:32:22 -0700)
FSTESTVER: ima-evm-utils 0267fa1 (Mon, 3 Dec 2018 06:11:35 -0500)
FSTESTVER: nvme-cli v1.7-35-g669d759 (Tue, 12 Mar 2019 11:22:16 -0600)
FSTESTVER: quota  62661bd (Tue, 2 Apr 2019 17:04:37 +0200)
FSTESTVER: stress-ng 7d0353cf (Sun, 20 Jan 2019 03:30:03 +)
FSTESTVER: syzkaller bab43553 (Fri, 15 Mar 2019 09:08:49 +0100)
FSTESTVER: xfsprogs v5.0.0 (Fri, 3 May 2019 12:14:36 -0500)
FSTESTVER: xfstests-bld 9582562 (Sun, 12 May 2019 00:38:51 -0400)
FSTESTVER: xfstests linux-v3.8-2390-g64233614 (Thu, 16 May 2019 00:12:52 -0400)
FSTESTCFG: 4k
FSTESTSET: generic/547
FSTESTOPT: count 1000 aex
GCE ID:8592267165157073108


Re: [GIT PULL] inode->i_version rework for v4.16

2018-01-30 Thread Theodore Ts'o
On Tue, Jan 30, 2018 at 07:05:48AM -0500, Jeff Layton wrote:
> 
> I want to make sure I understand what's actually broken here thoug. Is
> it only broken when the two values are more than 2**63 apart, or is
> there something else more fundamentally wrong here?

The other problem is that returning a s64 (or a u64) is extremely
inefficient on a 32-bit platform.  So in general, it's better to avoid
returning a 64-bit type unless it's really necessary

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Lockdep is less useful than it was

2017-12-08 Thread Theodore Ts'o
On Thu, Dec 07, 2017 at 02:38:03PM -0800, Matthew Wilcox wrote:
> I think it was a mistake to force these on for everybody; they have a
> much higher false-positive rate than the rest of lockdep, so as you say
> forcing them on leads to fewer people using *any* of lockdep.
> 
> The bug you're hitting isn't Byungchul's fault; it's an annotation
> problem.  The same kind of annotation problem that we used to have with
> dozens of other places in the kernel which are now fixed.

The question is whose responsibility is it to annotate the code?  On
another thread it was suggested it was ext4's responsibility to add
annotations to avoid the false positives --- never the mind the fact
that every single file system is going to have add annotations.

Also note that the documentation for how to add annotations is
***horrible***.  It's mostly, "try to figure out how other people
added magic cargo cult code which is not well defined (look at the
definitions of lockdep_set_class, lockdep_set_class_and_name,
lockdep_set_class_and_subclass, and lockdep_set_subclass, and weep) in
other subsystems and hope and pray it works for you."

And the explanation when there are failures, either false positives,
or not, are completely opaque.  For example:

[   16.190198] ext4lazyinit/648 is trying to acquire lock:
[   16.191201]  ((gendisk_completion)1 << part_shift(NUMA_NO_NODE)){+.+.}, at: 
[<8a1ebe9d>] wait_for_completion_io+0x12/0x20

Just try to tell me that:

((gendisk_completion)1 << part_shift(NUMA_NO_NODE)){+.+.}

is human comprehensible with a straight face.  And since the messages
don't even include the subclass/class/name key annotations, as lockdep
tries to handle things that are more and more complex, I'd argue it
has already crossed the boundary where unless you are a lockdep
developer, good luck trying to understand what is going on or how to
add annotations.

So if you are adding complexity to the kernel with the argument,
"lockdep will save us", I'm with Dave --- it's just not a believable
argument.

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-07 Thread Theodore Ts&#x27;o
On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > Unfortunately for you, I don't find arguments along the lines of
> > "lockdep will save us" at all convincing.  lockdep already throws
> > too many false positives to be useful as a tool that reliably and
> > accurately points out rare, exciting, complex, intricate locking
> > problems.
> 
> But it does reliably and accurately point out "dude, you forgot to take
> the lock".  It's caught a number of real problems in my own testing that
> you never got to see.

The problem is that if it has too many false positives --- and it's
gotten *way* worse with the completion callback "feature", people will
just stop using Lockdep as being too annyoing and a waste of developer
time when trying to figure what is a legitimate locking bug versus
lockdep getting confused.

I can't even disable the new Lockdep feature which is throwing
lots of new false positives --- it's just all or nothing.

Dave has just said he's already stopped using Lockdep, as a result.

  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [dm-devel] Ideas to reuse filesystem's checksum to enhance dm-raid1/10/5/6?

2017-11-20 Thread Theodore Ts&#x27;o
On Thu, Nov 16, 2017 at 03:32:05PM -0700, Chris Murphy wrote:
> 
> XFS by default does metadata csums. But ext4 doesn't use it for either
> metadata or the journal by default still, it is still optional. So for
> now it mainly benefits XFS.

Metadata checksums are enabled by default in the version of e2fsprogs
shipped by Debian.  Since there were no real problems reported by
Debian users, in the next release of e2fsprogs, coming soon, it will
be enabled by default for all new ext4 file systems.

Regards,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework

2017-10-09 Thread Theodore Ts&#x27;o
On Mon, Oct 09, 2017 at 08:54:16AM -0400, Josef Bacik wrote:
> I purposefully used as little as possible, just json and sqlite, and I tried 
> to
> use as little python3 isms as possible.  Any rpm based systems should have 
> these
> libraries already installed, I agree that using any of the PyPI stuff is a 
> pain
> and is a non-starter for me as I want to make it as easy as possible to get
> running.
> 
> Where do you fall on the including it in xfstests question?  I expect that the
> perf stuff would be run more as maintainers put their pull requests together 
> to
> make sure things haven't regressed.  To that end I was going to wire up
> xfstests-bld to run this as well.  Whatever you and Dave prefer is what I'll 
> do,
> I'll use it wherever it ends up so you two are the ones that get to decide.

I'm currently using Python 2.7 mainly because the LTM subsystem in
gce-xfstests was implemented using that version of Python, and my
initial efforts to port it to Python 3 were... not smooth.  (Because
it was doing authentication, I got bit by the Python 2 vs Python 3
"bytes vs. strings vs. unicode" transition especially hard.)

So I'm going to be annoyed by needing to package Python 2.7 and Python
3.5 in my test VM's, but I can deal, and this will probably be the
forcing function for me to figure out how make that jump.  To be
honest, the bigger issue I'm going to have to figure out is how to
manage the state in the sqlite database across disposable VM's running
in parallel.  And the assumption being made with having a static
sqllite database on a test machine is that the hardware capabilities
of the are static, and that's not true with a VM, whether it's running
via Qemu or in some cloud environment.

I'm not going to care that much about Python 3 not being available on
enterprise distro's, but I could see that being annoying for some
folks.  I'll let them worry about that.

The main thing I think we'll need to worry about once we let Python
into xfstests is to be pretty careful about specifying what version of
Python the scripts need to be portable against (Python 3.3?  3.4?
3.5?) and what versions of python packages get imported.

The bottom line is that I'm personally supportive of adding Python and
fsperf to xfstests.  We just need to be careful about the portability
concerns, not just now, but any time we check in new Python code.  And
having some documented Python style guidelines, and adding unit tests
so we can notice potential portability breakages would be a good idea.

 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework

2017-10-09 Thread Theodore Ts&#x27;o
On Mon, Oct 09, 2017 at 02:54:34PM +0800, Eryu Guan wrote:
> I have no problem either if python is really needed, after all this is a
> very useful infrastructure improvement. But the python version problem
> brought up by Ted made me a bit nervous, we need to work that round
> carefully.
> 
> OTOH, I'm just curious, what is the specific reason that python is a
> hard requirement? If we can use perl, that'll be much easier for
> fstests.

Note that perl has its own versioning issues (but it's been easier
given that Perl has been frozen so long waiting for Godot^H^H^H^H^H
Perl6).  It's for similar reasons that Python 2.7 is nice and stable.
Python 2 is frozen while Python 3 caught up.  The only downside is
that Python 2.7 is now deprecated, and will stop getting security
updates from Python Core in 2020.

I'll note that Python 3.6 has some nice features that aren't in Python
3.5 --- but Debian Stable only has Python 3.5, and Debian Unstable has
Python 3.6.  Which is why I said, "it looks like you're using some
variant of Python 3.x", but it wasn't obvious what version exactly was
required by fsperf.  This version instability in Python and Perl is
way Larry McVoy ended up using Tcl for Bitkeeper, by the way.  That
was the only thing that was guaranteed to work everywhere, exactly the
same, without random changes added by Perl/Python innovations...

It's a little easier for me with gce/kvm-xfstests, since I'm using a
Debian Stable images/chroots, and I don't even going to _pretend_ that
I care about cross-distro portability for the Test Appliance VM's that
xfstests-bld creates.  But I suspect this will be more of an issue
with xfstests.

Also, the same issues around versioning / "DLL hell" and provenance of
various high-level package/modules exists with Perl just as much with
Python.  Just substitute CPAN with PyPI.  And again, some of the
popular Perl packages have been packaged by the distro's to solve the
versioning / provenance problem, but exactly _which_ packages /
modules are packaged varies from distro to distro.  (Hopefully the
most popular ones will be packaged by both Red Hat, SuSE and Debian
derivitives, but you'll have to check for each package / module you
want to use.)

One way of solving the problem is just including those Perl / Python
package modules in the sources of xfstests itself; that way you're not
depending on which version of a particular module / package is
available on a distro, and you're also not randomly downloading
software over the network and hoping it works / hasn't been taken over
by some hostile state power.  (I'd be much happier if PyPI or CPAN
used SHA checksums of what you expect to be downloaded; even if you
use Easy_Install's requirements.txt, you're still trusting PyPI to
give you what it thinks is version of Junitparser 1.0.0.)

This is why I've dropped my own copy of junitparser into the git repo
of xfstests-bld.  It's the "ship your own version of the DLL" solution
to the "DLL hell" problem, but it was the best I could come up with,
especially since Debian hadn't packaged the Junitparser python module.
I also figured it was much better at lowering the blood pressure of
the friendly local secops types.  :-)

  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework

2017-10-08 Thread Theodore Ts&#x27;o
On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote:
> 
> Probably should have led with that shouldn't I have?  There's nothing keeping 
> me
> from doing it, but I didn't want to try and shoehorn in a python thing into
> fstests.  I need python to do the sqlite and the json parsing to dump into the
> sqlite database.

What version of python are you using?  From inspection it looks like
some variant of python 3.x (you're using print as a function w/o using
"from __future import print_function") but it's not immediately
obvious from the top-level fsperf shell script what version of python
your scripts are dependant upon.

This could potentially be a bit of a portability issue across various
distributions --- RHEL doesn't ship with Python 3.x at all, and on
Debian you need to use python3 to get python 3.x, since
/usr/bin/python still points at Python 2.7 by default.  So I could see
this as a potential issue for xfstests.

I'm currently using Python 2.7 in my wrapper scripts for, among other
things, to parse xUnit XML format and create nice summaries like this:

ext4/4k: 337 tests, 6 failures, 21 skipped, 3814 seconds
  Failures: generic/232 generic/233 generic/361 generic/388
generic/451 generic/459

So I'm not opposed to python, but I will note that if you are using
modules from the Python Package Index, and they are modules which are
not packaged by your distribution (so you're using pip or easy_install
to pull them off the network), it does make doing hermetic builds from
trusted sources to be a bit trickier.

If you have a secops team who wants to know the provenance of software
which get thrown in production data centers (and automated downloading
from random external sources w/o any code review makes them break out
in hives), use of PyPI adds a new wrinkle.  It's not impossible to
solve, by any means, but it's something to consider.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] vfs: add mount umount logs

2017-05-19 Thread Theodore Ts&#x27;o
On Fri, May 19, 2017 at 08:17:55AM +0800, Anand Jain wrote:
> > XFS already logs its own unmounts.
> 
>  Nice. as far as I know its only in XFS.

Ext4 logs mounts, but not unmounts.

> >  I prefer to let each filesystem log
> > its own unmount, because then the mount/unmount messages also have the
> > same prefix as all other messages coming from that filesystem driver.
> 
>  Ok. One nitpick though there are other mount types (remount, bind..),
>  and subsequent mounts of the same FS and FS driver can't log them
>  effectively.

The other issue is there is a difference between when an unmount is
started, or when a file system is unmounted in one namespace, and when
the file system as a whole is unmounted by the low-level because there
are no longer any references to the file system in any namespace.

More than once I've gotten a bug report for ext4 that was really
caused by the fact that some program had forked a daemon process in
its own namespace, so when the USB thumbdrive was unmounted, it wasn't
*really* unmounted.  So being able to clearly express that in the logs
is also a good thing.

I think the way to do that is to have one set of logs for when the
file system is unmounted from one mount namespace (or from a bind
mount), with perhaps an indication of the number of remaining
refcounts, and to make this be distinct from the unmount of the
underlying low file system code, which should be logged by the file
system code itself.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fstests: generic: Check if cycle mount and sleep can affect fiemap result

2017-04-06 Thread Theodore Ts&#x27;o
On Wed, Apr 05, 2017 at 10:35:26AM +0800, Eryu Guan wrote:
> 
> Test fails with ext3/2 when driving with ext4 driver, fiemap changed
> after umount/mount cycle, then changed back to original result after
> sleeping some time. An ext4 bug? (cc'ed linux-ext4 list.)

I haven't had time to look at this, but I'm not sure this test is a
reasonable one on the face of it.

A file system may choose to optimize a file's extent tree for whatever
reason it wants, whenever it wants, including on an unmount --- and
that would not be an invalid thing to do.  So to have an xfstests that
causes a test failure if a file system were to, say, do some cleanup
at mount or unmount time, or when the file is next opened, to merge
adjacent extents together (and hence change what is returned by
FIEMAP) might be strange, or even weird --- but is this any of user
space's business?  Or anything we want to enforce as wrong wrong wrong
by xfstests?

   - Ted
   
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-27 Thread Theodore Ts&#x27;o
On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> If this ever turn out to be a problem and with the vmapped stacks we
> have good chances to get a proper stack traces on a potential overflow
> we can add the scope API around the problematic code path with the
> explanation why it is needed.

Yeah, or maybe we can automate it?  Can the reclaim code check how
much stack space is left and do the right thing automatically?

The reason why I'm nervous is that nojournal mode is not a common
configuration, and "wait until production systems start failing" is
not a strategy that I or many SRE-types find comforting.

So if we can assure ourselves that the right thing will happen
automatically, or that lockdep will detect a required GFP_NOFS when
running tests, the happier I'll be.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-26 Thread Theodore Ts&#x27;o
On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote:
> > > I'm convinced the current series is OK, only real life will tell us 
> > > whether
> > > we missed something or not ;)
> > 
> > I would like to extend the changelog of "jbd2: mark the transaction
> > context with the scope GFP_NOFS context".
> > 
> > "
> > Please note that setups without journal do not suffer from potential
> > recursion problems and so they do not need the scope protection because
> > neither ->releasepage nor ->evict_inode (which are the only fs entry
> > points from the direct reclaim) can reenter a locked context which is
> > doing the allocation currently.
> > "
> 
> Could you comment on this Ted, please?

I guess   so there still is one way this could screw us, and it's this 
reason for GFP_NOFS:

- to prevent from stack overflows during the reclaim because
  the allocation is performed from a deep context already

The writepages call stack can be pretty deep.  (Especially if we're
using ext4 in no journal mode over, say, iSCSI.)

How much stack space can get consumed by a reclaim?

- Ted
 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-17 Thread Theodore Ts&#x27;o
On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> 
> OK, so I've been staring into the code and AFAIU current->journal_info
> can contain my stored information. I could either hijack part of the
> word as the ref counting is only consuming low 12b. But that looks too
> ugly to live. Or I can allocate some placeholder.

Yeah, I was looking at something similar.  Can you guarantee that the
context will only take one or two bits?  (Looks like it only needs one
bit ATM, even though at the moment you're storing the whole GFP mask,
correct?)

> But before going to play with that I am really wondering whether we need
> all this with no journal at all. AFAIU what Jack told me it is the
> journal lock(s) which is the biggest problem from the reclaim recursion
> point of view. What would cause a deadlock in no journal mode?

We still have the original problem for why we need GFP_NOFS even in
ext2.  If we are in a writeback path, and we need to allocate memory,
we don't want to recurse back into the file system's writeback path.
Certainly not for the same inode, and while we could make it work if
the mm was writing back another inode, or another superblock, there
are also stack depth considerations that would make this be a bad
idea.  So we do need to be able to assert GFP_NOFS even in no journal
mode, and for any file system including ext2, for that matter.

Because of the fact that we're going to have to play games with
current->journal_info, maybe this is something that I should take
responsibility for, and to go through the the ext4 tree after the main
patch series go through?  Maybe you could use xfs and ext2 as sample
(simple) implementations?

My only ask is that the memalloc nofs context be a well defined N
bits, where N < 16, and I'll find some place to put them (probably
journal_info).

Thanks,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"

2017-01-16 Thread Theodore Ts&#x27;o
On Fri, Jan 06, 2017 at 03:11:06PM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
> sb_getblk_gfp is not really needed as
> sb_getblk
>   __getblk_gfp
> __getblk_slow
>   grow_buffers
> grow_dev_page
> gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp
> 
> so __GFP_FS is cleared unconditionally and therefore the above commit
> didn't have any real effect in fact.
> 
> This patch should not introduce any functional change. The main point
> of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
> make the review of the remaining usage easier.
> 
> Signed-off-by: Michal Hocko 
> Reviewed-by: Jan Kara 

If I'm not mistaken, this patch is not dependent on any of the other
patches in this series (and the other patches are not dependent on
this one).  Hence, I could take this patch via the ext4 tree, correct?

 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-16 Thread Theodore Ts&#x27;o
On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> the transaction context uses memalloc_nofs_save and all allocations
> within the this context inherit GFP_NOFS automatically, there is no
> reason to mark specific allocations explicitly.
> 
> This patch should not introduce any functional change. The main point
> of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> to make the review of the remaining usage easier.
> 
> Signed-off-by: Michal Hocko 
> Reviewed-by: Jan Kara 

Changes in the jbd2 layer aren't going to guarantee that
memalloc_nofs_save() will be executed if we are running ext4 without a
journal (aka in no journal mode).  And this is a *very* common
configuration; it's how ext4 is used inside Google in our production
servers.

So that means the earlier patches will probably need to be changed so
the nOFS scope is done in the ext4_journal_{start,stop} functions in
fs/ext4/ext4_jbd2.c.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is is possible to submit binary image as fstest test case?

2016-10-06 Thread Theodore Ts&#x27;o
On Thu, Oct 06, 2016 at 08:29:28AM -0400, Brian Foster wrote:
> Doesn't necessarily bother me one way or the other, but something we've
> done with XFS in such situations is introduce a DEBUG mode only sysfs
> tunable that delays certain infrastructure (log recovery in our case) to
> coordinate with test cases that try to reproduce such timing/racing
> problems.

The other approach the btrfs folks might consider is to have a
sufficiently powerful "debugfs" or "xfs_db" program that can generate
test images with the desired property.

I've found that the time I've invested in creating debugfs has repaid
itself a hundred times over, especially when maintaining a regression
test suite for e2fsck.

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ChaCha20 vs. AES performance

2016-09-20 Thread Theodore Ts&#x27;o
On Tue, Sep 20, 2016 at 03:15:19AM -0800, Kent Overstreet wrote:
> Not on the list or I would've replied directly, but on Haswell, ChaCha20 (in
> software) is over 2x as fast as AES (in hardware), at realistic (for a
> filesystem) block sizes:

On Skylake and Broadwell processors, AES is faster (the posting is
from a ChaCha20 enthusiast):

 https://blog.cloudflare.com/it-takes-two-to-chacha-poly/

My big worry though is that schemes that require that nonces/IV's must
**never** be reused are fragile.  It's for the same reason that DSA
makes my skin crawl.  If you ever screw up --- maybe after a crash, or
a file system bug, you end up reusing a nonce, it's game over.

So if there are hardware solutions which are faster or fast enough
that the crypto is no longer dominant cost, why not use a cipher
scheme which is more robust?

- Ted

P.S.  We're also both ignoring the cost of whatever changes are needed in
the file system to guarantee that the nonce is never, ever reused...
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Experimental btrfs encryption

2016-09-19 Thread Theodore Ts&#x27;o
On Mon, Sep 19, 2016 at 08:32:34PM -0400, Chris Mason wrote:
> > > That key is used to protect the contents of the data file, and to
> > > encrypt filenames and symlink targets --- since filenames can leak
> > > significant information about what the user is doing.  (For example,
> > > in the downloads directory of their web browser, leaking filenames is
> > > just as good as leaking part of their browsing history.)
> 
> One of the things that makes per-subvolume encryption attractive to me is
> that we're able to enforce the idea that an entire directory tree is
> encrypted by one key.  It can't be snapshotted again without the key, and it
> just fits with the rest of the btrfs management code.  I do want to support
> the existing vfs interfaces as well too though.

One of the main reasons for doing fs-level encryption is so you can
allow multiple users to have different keys.  In some cases you can
assume that different users will be in different distinct subvolumes
(e.g., each user has their own home directory), but that's not always
going to be possible.

One of the other things that was in the original design, but which got
dropped in our initial implementation, was the concept of having the
per-inode key wrapped by multiple user keys.  This would allow a file
to be accessible by more than one user.  So something to consider is
that there may very well be situations where you *want* to have more
than one key associated with a directory hierarchy.

> > The issue, here, is that inodes are fundamentally not a safe scope to
> > attach that information to in btrfs. As extents can be shared between
> > inodes (and thus both will need to decrypt them), and inodes can be
> > duplicated unmodified (snapshots), attaching keys and nonces to inodes
> > opens up a whole host of (possibly insoluble) issues, including
> > catastrophic nonce reuse via writable snapshots.
> 
> I'm going to have to read harder about nonce reuse.  In btrfs an inode is
> really a pair [ root id, inode number ], so strictly speaking two writable
> snapshots won't have the same inode in memory and when a snapshot is
> modified we'd end up with a different nonce for the new modifications.

Nonce reuse is not necessrily catastrophic.  It all depends on the
context.  In the case of Counter or GCM mode, nonce (or IV) reuse is
absolutely catastrophic.  It must *never* be done or you completely
lose all security.  As the Soviets discovered the hard way courtesy of
the Venona project (well, they didn't discover it until after they
lost the cold war, but...) one time pads are completely secure.
Two-time pads, are most emphatically _not_.  :-)

In the case of the nonces used in fscrypt's key derivation, reuse of
the nonce basically means that two files share the same key.  Assuming
you're using a competently designed block cipher (e.g., AES), reuse of
the key is not necessarily a problem.  What it would mean is that two
files which are are reflinked would share the same key.  And if you
have writable snapshots, that's definitely not a problem, since with
AES we use the a fixed key and a fixed IV given a logical block
number, and we can do block overwrites without having to guarantee
unique nonces (which you *do* need to worry about if you use counter
mode or some other stream cipher such as ChaCha20 --- Kent Overstreet
had some clever tricks to avoid IV reuse since he used a stream cipher
in his proposed bcachefs encryption).

The main issue is if you want to reflink a file and then have the two
files have different permissions / ownerships.  In that case, you
really want to use different keys for user A and for user B --- but if
you are assuming a single key per subvolume, you can't support
different keys for different users anyway, so you're kind of toast for
that use case in any case.

So in any case, assuming you're using block encryption (which is what
fscrypt uses) there really isn't a problem with nonce reuse, although
in some cases if you really do want to reflink a file and have it be
protected by different user keys, this would have to force copy of the
duplicated blocks at that point.  But arguably, that is a feature, not
a bug.  If the two users are mutually suspicious, you don't _want_ to
leak information about who much of a particular file had been changed
by a particular user.  So you would want to break the reflink and have
separate copies for both users anyway.


One final thought --- something which is really going to be a factor
in many use cases is going to be hardware accelerated encryption.  For
example, Qualcomm is already shipping an SOC where the encryption can
be done in the data path between the CPU and the eMMC storage device.
If you want to support certain applications that try to read megabytes
and megabytes of data before painting a single pixel, in-line hardware
crypto at line speeds is going to be critical if you don't want to
sacrifice performance, and keep users from being cranky because it
took extra seconds befo

Re: Experimental btrfs encryption

2016-09-19 Thread Theodore Ts&#x27;o
(I'm not on linux-btrfs@, so please keep me on the cc list.  Or
perhpas better yet, maybe we can move discussion to the linux-fsdevel@
list.)

Hi Anand,

After reading this thread on the web archives, and seeing that some
folks seem to be a bit confused about "vfs level crypto", fs/crypto,
and ext4/f2fs encryption, I thought I would give a few comments.

First of all, these are all the same thing.  Initially ext4 encryption
was implemented targetting ChromeOS as the initial customer, and as a
replacement for ecryptfs.  Folks have already pointed you at the
design document[1].  Also of interest is the is the 2015 Linux
Security Symposium slides set[2].  The first deployed use of this was
for Android N's File-based Encryption and Direct boot[3]; a technical
description which left out some of the product details (since LSS 2016
was before the Android N release) can be found at the 2016 LSS
slides[4].

[1] 
https://docs.google.com/document/d/1ft26lUQyuSpiu6VleP70_npaWdRfXFoNnB8JYnykNTg/preview
[2] http://kernsec.org/files/lss2014/Halcrow_EXT4_Encryption.pdf
[3] 
https://android.googleblog.com/2016/08/android-70-nougat-more-powerful-os-made.html
[4] http://kernsec.org/files/lss2015/halcrow.pdf

The other thing that perhaps would be worth noting is that Michael
Halcrow started this as an encryption/security expert who had dabbled
in file systems, while I was someone for whom encryption/security is a
hobby (although in a previous life I was the tech lead for Kerberos
and chaired the IPSEC working group) who was a file system expert.  In
order to do file system security well, you need people who are well
versed in both discplines working together.

With all due respect, the fact that you chose counter mode and how use
used it pretty clearly demonstrates that you would be well advised to
find someone who is a crypto expert to collaborate with you --- or use
the fs/crypto framework since it was designed and vetted by multiple
crypto experts as well as file system experts.

Having someone who is a product manager who can discuss with you
specific goals is also important, because there are lots of tradeoffs
and lots of design choices  and so what you chose to do is (or at
least should be!)  very much dependent on your threat model, who is
planning on using the feature, what you can and can not upon via-a-vis
hardware support, performance requirements, and so on.


Secondly, in terms of how it all works.  Each user as a "master key"
which is stored on a keyring.  We use a hash of the key to serve as
the key identifier, and associated with each inode we store a nonce (a
random unique string) and the key identifier.  We use the nonce and
the user's master key to generate a unique key for that inode.

That key is used to protect the contents of the data file, and to
encrypt filenames and symlink targets --- since filenames can leak
significant information about what the user is doing.  (For example,
in the downloads directory of their web browser, leaking filenames is
just as good as leaking part of their browsing history.)

As far as using the fs/crypto infrastructure, it's actually pretty
simple.  The file system needs to provide a flag indicating whether or
not the file is encrypted, and support extended attributes.  When you
create an inode in an encrypted directory, you call
fscrypt_inherit_context() and the fscrypto layer will take care of
creating the necessary xattr for the per-inode key.  When you need
open a encrypted file, or operate on an encrypted inode, you call
fscrypt_get_encryption_info() on the inode.  The per-inode encryption
key is cached in the i_crypt_info structure, which hangs off of the
struct inode.

When you write to an encrypted file, you call fscrypt_encrypt_page(),
which returns a struct page with the encrypted contents to be written.
After the write is completed (or in the error case), you call
fscrypt_restore_control_page() to release encrypted page.

To read from an encrypted page, you call fscrypt_get_ctx() to get an
encryption context, which gets stashed in the bio's bi_private
pointer.  (If btrfs is already using bi_private, then you'll need to
add a field in the structure which hangs off of bi_private to stash
the encryption context.)  After the read completes, you call
fscrypt_decrypt_bio_pages() to decrypt all of the pages read as part
of the read/write operation.

It's actually relatively straightforward to use.  If you have any
questions please feel free to ask on linux-fsdevel.


As far as poeple commenting that it might be better to encrypt on the
extent level --- the reason why we didn't chose that path is because
while it does make it easier to do authenticated encryption modes, the
downside is that you can only do the data integrity check if you read
in the entire extent.  This has obvious memory utilization impacts and
will also impact your 4k random read/write performance.

We do have a solution in mind to solve the authenticated encryption
problem; in fact, an intern ha

Re: [PATCH 12/32] dio: unwritten conversion bug tests

2016-02-12 Thread Theodore Ts&#x27;o
On Fri, Feb 12, 2016 at 02:52:53PM +1100, Dave Chinner wrote:
> On Thu, Feb 11, 2016 at 03:40:37PM -0800, Darrick J. Wong wrote:
> > Check that we don't expose old disk contents when a directio write to
> > an unwritten extent fails due to IO errors.  This primarily affects
> > XFS and ext4.
> > 
> > Signed-off-by: Darrick J. Wong 
> 
> aiocp.c: In function 'main':
> aiocp.c:407:1: warning: format '%d' expects argument of type 'int', but 
> argument 3 has type 'off_t' [-Wformat=]
>  printf("tocopy=%d len=%d blk=%d\n", tocopy, length, aio_blksize);
>  ^
> 
> Followup patch is fine, I'll commit as is.

Dave, given that Darrick's patches is going to cause more
_scratch_mount occurences, how quickly do you plan to commit his
patchset?  Should I wait until he makes his change before I do the
s/_scratch_mount/_scratch_cycle_mount/ change?  Should I make that
change, and let you fix it up the commit?

Thanks,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] generic/224: Increase filesystem instance size to 1.5 GiB

2015-08-31 Thread Theodore Ts&#x27;o
On Tue, Sep 01, 2015 at 05:49:14AM +0530, Chandan Rajendra wrote:
> mkfs.btrfs when invoked on small filesystems by "not" specifying any block
> sizes (i.e. mkfs.btrfs -f /dev/sda1) will automatically create filesystem
> instance with "data block size" == "metadata block size". However in the
> subpagesize-blocksize scenario, we need to specify both data and metadata
> block size on the command line (For e.g. mkfs.btrfs -f -s 4096 -n 16384
> /dev/sda1). In this case, Since the user is forcing the block sizes and it is
> impossible to have mixed block groups with differing data and metadata block
> sizes, mkfs.btrfs will fail.

Ok, so the issue is that for this particular test configuration, btrfs
has a minimum file system size.  What about changing
_scratch_mkfs_sized so that if MIN_FS_SIZE is set, the file system
created will be at least MIN_FS_SIZE in size.

This way it sets the minimum file system size for all tests, not just
generic/224, and any test configuration, whether it be ext4, xfs, or
btrfs where the data and metadata block size are the same, don't have
to take extra time -- only the test configuration of btrfs with
data_block_size != metadata_block_size.

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] generic/224: Increase filesystem instance size to 1.5 GiB

2015-08-31 Thread Theodore Ts&#x27;o
On Mon, Aug 31, 2015 at 03:19:22PM -0400, Austin S Hemmelgarn wrote:
> AFAIK, it shouldn't be failing that way, and should automatically switch to
> mixed mode allocation.  A 1G filesystem should work fine for BTRFS, but
> smaller ones will have higher chances of ENOSPC issues (inversely
> proportional to the size of the FS).  I would advise against using BTRFS on
> such a small disk (I avoid using it on anything smaller than 4G personally),
> but I'm not one of the developers, and the fact that I feel it isn't a good
> idea doesn't mean it shouldn't work.

Instead of modifying generic/224, maybe it would be better to have a
way to specify a minimum file system size on a per-file system basis.
That way, if some file system does have a minimum size of say, 1G or
4G, it can be configured in one place, instead of needing to modify
every test that uses a small file system size, or forcing all file
systems to use a larger file systems just for the benefit of a single
file system?

Or maybe just fix mkfs.btrfs?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] generic/224: Increase filesystem instance size to 1.5 GiB

2015-08-31 Thread Theodore Ts&#x27;o
On Sun, Aug 30, 2015 at 08:16:21PM +0530, Chandan Rajendra wrote:
> For small filesystem instances (i.e. size <= 1 GiB), mkfs.btrfs fails when
> "data block size" does not match with the "metadata block size" specified on
> the mkfs.btrfs command line. This commit increases the size of filesystem
> instance created so that the test can be executed on subpagesize-blocksize
> Btrfs instances which have different values for data and metadata blocksizes.

Stupid question --- why isn't this considered a bug in mkfs.btrfs?
Does btrfs simply not support file systems <= 1 GB?  So if someone has
a 1GB USB disk or SD card, what's the official advice from the btrfs
developers?  Use xfs or ext4?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure

2015-08-15 Thread Theodore Ts&#x27;o
On Wed, Aug 12, 2015 at 11:14:11AM +0200, Michal Hocko wrote:
> > Is this "if (!committed_data) {" check now dead code?
> > 
> > I also see other similar suspected dead sites in the rest of the series.
> 
> You are absolutely right. I have updated the patches.

Have you sent out an updated version of these patches?  Maybe I missed
it, but I don't think I saw them.

Thanks,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)

2015-06-25 Thread Theodore Ts&#x27;o
On Thu, Jun 25, 2015 at 02:46:44PM -0400, J. Bruce Fields wrote:
> > Does this sound reasonable?
> 
> Just to make sure I understand, the logic is something like:
> 
>   to read the i_version:
> 
>   inode->i_version_seen = true;
>   return inode->i_version
> 
>   to update the i_version:
> 
>   /*
>* If nobody's seen this value of i_version then we can
>* keep using it, otherwise we need a new one:
>*/
>   if (inode->i_version_seen)
>   inode->i_version++;
>   inode->i_version_seen = false;

Yep, that's what I was proposing.

> Looks OK to me.  As I say I'd expect i_version_seen == true to end up
> being the common case in a lot of v4 workloads, so I'm more skeptical of
> the claim of a performance improvement in the v4 case.

Well, so long as we require i_version to be committed to disk on every
single disk write, we're going to be trading off:

* client-side performance of the advanced NFSv4 cacheing for reads
* server-side performance for writes
* data robustness in case of the server crashing and the client-side 
cache
  getting out of sync with the server after the crash

I don't see any way around that.  (So for example, with lazy mtime
updates we wouldn't be updating the inode after every single
non-allocating write; enabling i_version updates will trash that
optimization.)

I just want to reduce to a bare minimum the performance hit in the
case where NFSv4 exports are not being used (since that is true in a
very *large* number of ext4 deployments --- i.e., every single Android
handset using ext4 :-), such that we can leave i_version updates
turned on by default.

> Could maintaining the new flag be a significant drag in itself?  If not,
> then I guess we're not making things any worse there, so fine.

I don't think so; it's a bit in the in-memory inode, so I don't think
that should be an issue.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)

2015-06-24 Thread Theodore Ts&#x27;o
On Wed, Jun 24, 2015 at 08:02:15PM +0200, David Sterba wrote:
> 
> This sounds similar to what Dave proposed, a per-inode I_VERSION
> attribute that can be changed through chattr. Though the negated meaning
> of the flag could be confusing, I had to reread the paragraph again.

Dave did not specify an I_VERSION attribute that would be stored on
disk.  Instead he talked about a inode flag that would be set when the
struct inode is created by the file system.

This would allow file systems to permanently configure (on a per-inode
basis) whether or not a particular inode would require a forced
i_version update any time the inode's data or metadata is modified.  I
suppose you could initialized the inode flag from an on-disk
attribute, but that wasn't implied by Dave's proposal, at least as I
understood it.

> > This should significantly improve the performance of using the
> > i_version field if the file system is being exported via NFSv4, and if
> > NFSv4 is not in use, no one will be looking at the i_version field, so
> > the performance impact will be very slight, and thus we could enable
> > i_version updates by default for btrfs and ext4.
> 
> Btrfs default is to update i_version and the uscecase gets fixed by the
> per-inode attribute. But from your description above I think that this
> might not be enough for ext4. The reason I see are the different
> defaults. You want to turn it on by default but not impose any
> performance penalty for that, while for our usecase it's sufficient to
> selectively turn it off.

The problem with selectively turning it off is that the user might
decide for a particular file which is getting lots of updates to turn
off i_version updates --- but then at some subsequent time, that file
is part of a file system which is exported NFSv4, clients will
mysteriously break because i_version was manually turned off.

So this is going to be a potential support nightmare for enterprise
distro help desks --- the user will report that a particular file is
constantly getting corrupted, due to the NFSv4 cache invalidation
getting broken, and it might not be obvious why this is only happening
with this one file, and it's because with btrfs, the i_version update
for particular file was selectively turned off.  I don't think it's a
good design where it is easy for the user to set a flag which breaks
functionality, and in a potentially confusing way, especially when the
net result is potential data corruption.

This is why I would much rather have the default be on, but with
minimal (preferably not measurable) performance overhead.  It's the
best of both worlds.

- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: i_version vs iversion (Was: Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION)

2015-06-23 Thread Theodore Ts&#x27;o
On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> Moving the discussion to fsdevel.
> 
> Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> generic 'noiversion' option cannot be used to achieve that. It is
> processed before it reaches btrfs superblock callback, where
> MS_I_VERSION is forced.
> 
> The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> to which I object.

I was talking to Mingming about this on today's ext4 conference call,
and one of the reasons why ext4 turns off i_version update by default
is because it does a real number on our performance as well --- and
furthermore, the only real user of the field from what we can tell is
NFSv4, which not all that many ext4 users actually care about.

This has caused pain for the nfsv4 folks since it means that they need
to tell people to use a special mount option for ext4 if they are
actually using this for nfsv4, and I suspect they won't be all that
eager to hear that btrfs is going to go the same way.

This however got us thinking --- even in if NFSv4 is depending on
i_version, it doesn't actually _look_ at that field all that often.
It's only going to look at it in a response to a client's getattr
call, and that in turn is used to so the client can do its local disk
cache invalidation if anby of the data blocks of the inode has changed.

So what if we have a per-inode flag which "don't update I_VERSION",
which is off by default, but after the i_version has been updated at
least once, is set, so the i_version field won't be updated again ---
at least until something has actually looked at the i_version field,
when the "don't update I_VERSOIN" flag will get cleared again.

So basically, if we know there are no microphones in the forest, we
don't need to make the tree fall.  However, if someone has sampled the
i_version field, then the next time the inode gets updated, we will
update the i_version field so the NFSv4 client can hear the sound of
the tree crashing to the forst floor and so it can invalidate its
local cache of the file.  :-)

This should significantly improve the performance of using the
i_version field if the file system is being exported via NFSv4, and if
NFSv4 is not in use, no one will be looking at the i_version field, so
the performance impact will be very slight, and thus we could enable
i_version updates by default for btrfs and ext4.

And this should make the distribution folks happy, since it will unify
the behavior of all file systems, and make life easier for users who
won't need to set certain magic mount options depending on what file
system they are using and whether they are using NFSv4 or not.

Does this sound reasonable?

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Documenting MS_LAZYTIME

2015-02-27 Thread Theodore Ts&#x27;o
With Omar's suggestions, this looks great.

Thanks!!

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Documenting MS_LAZYTIME

2015-02-26 Thread Theodore Ts&#x27;o
On Thu, Feb 26, 2015 at 02:36:33PM +0100, Michael Kerrisk (man-pages) wrote:
> 
> The disadvantage of MS_STRICTATIME | MS_LAZYTIME is that
> in the case of a system crash, the atime and mtime fields
> on disk might be out of date by at most 24 hours.

I'd change to "The disadvantage of MS_LAZYTIME is that..."  and
perhaps move that so it's clear it applies to any use of MS_LAZYTIME
has this as a downside.

Does that make sense?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Documenting MS_LAZYTIME

2015-02-26 Thread Theodore Ts&#x27;o
On Thu, Feb 26, 2015 at 09:49:39AM +0100, Michael Kerrisk (man-pages) wrote:
> > How about somethign like "This mount significantly reduces writes
> > needed to update the inode's timestamps, especially mtime and actime.
> 
> What is "actime" in the preceding line? Should it be "ctime"?

Sorry, no, it should be "atime".

> I find the wording of there a little confusing. Is the following 
> a correct rewrite:
> 
> The advantage of MS_STRICTATIME | MS_LAZYTIME is that stat(2)
> will return the correctly updated atime, but the atime updates
> will be flushed to disk only when (1) the inode needs to be 
> updated for filesystem / data consistency reasons or (2) the 
> inode is pushed out of memory, or (3) the filesystem is 
> unmounted.)

Yes, that's correct.  The only other thing I might add is that in the
case of a crash, the atime (or mtime) fields on disk might be out of
date by at most 24 hours.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Documenting MS_LAZYTIME

2015-02-20 Thread Theodore Ts&#x27;o
On Fri, Feb 20, 2015 at 09:49:34AM -0600, Eric Sandeen wrote:
> 
> >   This mount option significantly reduces  writes  to  the
> >   inode  table  for workloads that perform frequent random
> >   writes to preallocated files.
> 
> This seems like an overly specific description of a single workload out
> of many which may benefit, but what do others think?  "inode table" is also
> fairly extN-specific.

How about somethign like "This mount significantly reduces writes
needed to update the inode's timestamps, especially mtime and actime.
Examples of workloads where this could be a large win include frequent
random writes to preallocated files, as well as cases where the
MS_STRICTATIME mount option is enabled."?

(The advantage of MS_STRICTATIME | MS_LAZYTIME is that stat system
calls will return the correctly updated atime, but those atime updates
won't get flushed to disk unless the inode needs to be updated for
file system / data consistency reasons, or when the inode is pushed
out of memory, or when the file system is unmounted.)

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Snapshot cannot be deleted

2015-01-20 Thread Theodore Ts&#x27;o
+linux-btrfs,-linux-ext4

On Tue, Jan 20, 2015 at 01:28:04PM +0100, Andreas Philipp wrote:
> 
> Due to the known (and fixed) bug in kernel 3.17.0 one of my btrfs
> volume suffers from unreadable and - even worse - uneraseable
> snapshots. Whenever such a snapshot is accessed there is "parent
> transid verify failed" in dmesg.
> Is there any way to delete these snapshots?

You sent this to the wrong mailing list, so I'm reforwarding this to
the linux-btrfs list.

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Theodore Ts&#x27;o
On Tue, Dec 02, 2014 at 01:37:27PM -0700, Andreas Dilger wrote:
> 
> One thing that comes to mind is touch/utimes()/utimensat().  Those
> should definitely not result in timestamps being kept only in memory
> for 24h, since the whole point of those calls is to update the times.
> It makes sense for these APIs to dirty the inode for proper writeout.

Not a problem.  Touch/utimes* go through notify_change() and
->setattr, so they won't go through the I_DIRTY_TIME code path.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Theodore Ts&#x27;o
On Tue, Dec 02, 2014 at 07:55:48PM +0200, Boaz Harrosh wrote:
> 
> This I do not understand. I thought that I_DIRTY_TIME, and the all
> lazytime mount option, is only for atime. So if there are dirty
> pages then there are also m/ctime that changed and surly we want to
> write these times to disk ASAP.

What are the situations where you are most concerned about mtime or
ctime being accurate after a crash?

I've been running with it on my laptop for a while now, and it's
certainly not a problem for build trees; remember, whenever you need
to update the inode to update i_blocks or i_size, the inode (with its
updated timestamps) will be flushed to disk anyway.

In actual practice, what happens in a build tree is that when make
decides that it needs to update a generated file, when the file is
created as a zero-length inode, m/ctime will be set to the time that
file is created, which is newer than its source files.  As the file is
written, the mtime is updated each time that we actually need to do an
allocating write.  In the case of the linker, it will seek to the
beginning of the file to update ELF header at the very end of its
operation, and *that* time will be left stale, such that the in-memory
mtime is perhaps a millisecond ahead of the on-disk mtime.  But in the
case of a crash, either time is such that make won't be confused.

I'm not aware of an application which is doing a large number of
non-allocating random writes (for example, such as a database), where
said database actually cares about mtime being correct.  In fact, most
databases use fdatasync() to prevent the mtimes from being sync'ed out
to disk on each transaction, so they don't have guaranteed timestamp
accuracy after a crash anyway.  The problem is even if the database is
using fdatasync(), every five seconds we end up updating the mtime
anyway --- and in the case of ext4, we end up needing to take various
journal locks which on a sufficiently parallel workload and a
sufficiently fast disk, can actually cause measurable contention.

Did you have such a use case or application in mind?

> if we are lazytime also with m/ctime then I think I would like an
> option for only atime lazy. because m/ctime is cardinal to some
> operations even though I might want atime lazy.

If there's a sufficiently compelling use case where we do actually
care about mtime/ctime being accurate, and the current semantics don't
provide enough of a guarantee, it's certainly something we could do.
I'd rather keep things simple unless it's really there.  (After all,
we did create the strictatime mount option, but I'm not sure anyone
every ends up using it.  It woud be a shame if we created a
strictcmtime, which had the same usage rate.)

I'll also note that if it's only about atime updates, with the default
relatime mount option, I'm not sure there's enough of a win to hae a
mode to justify a lazyatime only option.  If you really neeed strict
c/mtime after a crash, maybe the best thing to do is to just simply
not use the lazytime mount option and be done with it.

Cheeres,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-02 Thread Theodore Ts&#x27;o
On Tue, Dec 02, 2014 at 01:20:33AM -0800, Christoph Hellwig wrote:
> Why do you need the additional I_DIRTY flag?  A "lesser"
> __mark_inode_dirty should never override a stronger one.

Agreed, will fix.

> Otherwise this looks fine to me, except that I would split the default
> implementation into a new generic_update_time helper.

Sure, I can do that.

> > XFS doesn't have a ->dirty_time yet, but that way XFS would be able to
> > use the I_DIRTY_TIME flag to log the journal timestamps if it so
> > desires, and perhaps drop the need for it to use update_time().
> 
> We will probably always need a ->update_time to proide proper locking
> around the timestamp updates.

Couldn't you let the VFS set the inode timesstamps and then have xfs's
->dirty_time(inode, I_DIRTY_TIME) copy the timestamps to the on-disk
inode structure under the appropriate lock, or am I missing something?

> In the current from the generic lazytime might even be a loss for XFS as
> we're already really good at batching updates from multiple inodes in
> the same cluster for the in-place writeback, so I really don't want
> to just enable it without those optimizations without a lot of testing.

Fair enough; it's not surprising that this might be much more
effective as an optimization for ext4, for no other reason that
timestamp updates are so much heavyweight for us.  I suspect that it
should be a win for btrfs, though, and it should definitely be a win
for those file systems that don't use journalling at all.

 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-01 Thread Theodore Ts&#x27;o
On Mon, Dec 01, 2014 at 01:28:10AM -0800, Christoph Hellwig wrote:
> 
> The ->is_readonly method seems like a clear winner to me, I'm all for
> adding it, and thus suggested moving it first in the series.

It's a real winner for me as well, but the reason why I dropped it is
because if btrfs() has to keep its ->update_time function, we wouldn't
actually have a user for is_readonly().  I suppose we could have
update_time() call ->is_readonly() and then ->update_time() if they
exist, but it only seemed to add an extra call and a bit of extra
overhead without really simplifying things for btrfs.

If there were other users of ->is_readonly, then it would make sense,
but it seemed better to move into a separate code refactoring series.

> I've read a bit more through the series and would like to suggest
> the following approach for the rest:
> 
>  - convert ext3/4 to use ->update_time instead of the ->dirty_time
>callout so it gets and exact notifications (preferably the few
>remaining filesystems as well, although that shouldn't really be a
>blocker)

We could do that, although ext3/4's ->update_time() would be exactly
the same as the generic update_time() function, so there would be code
duplication.  If the goal is to get rid of the magic in
-->dirty_inode() being used to work around how the VFS makes changes
to fields that end up in the on-disk inode, we would need to audit a
lot of extra code paths; at the very least, in how the generic quota
code handles updates to i_size and i_blocks (for example).

And BTW, we don't actually have a dirty_time() function any more in
the current patch series.  update_time() is currently looking like
this:

static int update_time(struct inode *inode, struct timespec *time, int flags)
{
if (inode->i_op->update_time)
return inode->i_op->update_time(inode, time, flags);

if (flags & S_ATIME)
inode->i_atime = *time;
if (flags & S_VERSION)
inode_inc_iversion(inode);
if (flags & S_CTIME)
inode->i_ctime = *time;
if (flags & S_MTIME)
inode->i_mtime = *time;

if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) &&
!(inode->i_state & I_DIRTY))
__mark_inode_dirty(inode, I_DIRTY_TIME);
else
__mark_inode_dirty(inode, I_DIRTY_SYNC);
return 0;
}

>  - Convert xfs, btrfs and the remaining filesystes using ->dirty_inode
>incrementally.

Right, so xfs and btrfs (which are the two file systems that have
update_time at the moment) can just drop update_time() and then check
the ->dirty_time() for (flags & I_DIRTY_TIME).  Hmm, I suspect this
might be better for xfs, yes?

if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) &&
!(inode->i_state & I_DIRTY))
__mark_inode_dirty(inode, I_DIRTY_TIME);
else
__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_TIME);

XFS doesn't have a ->dirty_time yet, but that way XFS would be able to
use the I_DIRTY_TIME flag to log the journal timestamps if it so
desires, and perhaps drop the need for it to use update_time().  (And
with XFS doing logical journalling, it may be that you might want to
include the timestamp update in the journal if you have a journal
transaction open already, so the disk is spun up or likely to be spin
up anyway, right?)

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option

2014-11-27 Thread Theodore Ts&#x27;o
On Thu, Nov 27, 2014 at 06:00:16PM -0500, Theodore Ts'o wrote:
> Well it's not quite enough.  The problem is that for ext3 and
> ext4, the actual work of writing the inode happens in dirty_inode(),
> not in write_inode().  Which means we need to do something like this.
> 
> I'm not entirely sure whether or not this is too ugly to live;
> personally, I think my hack of handling this in update_time() might be
> preferable

 and this doesn't work because it breaks ext3/ext4's transaction
handling, since the writeback thread could be racing against some
transactional update of the inode.  So I don't see a way of making the
queue_io / move_expired_inodes approach to the 24 hour time being
tractable.

So the alternatives that I can see at this point is either give up on
the 24 hour timeout, or we fall back to handling this in
update_time().

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option

2014-11-27 Thread Theodore Ts&#x27;o
On Thu, Nov 27, 2014 at 02:14:21PM +0100, Jan Kara wrote:
> * change queue_io() to also call
>   moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, time + 
> 24hours)
>   For this you need to tweak move_expired_inodes() to take pointer to
>   timestamp instead of pointer to work but that's trivial. Also you want
>   probably leave time ->older_than_this value (i.e. without +24 hours) if
>   we are doing WB_SYNC_ALL writeback. With this you can remove
>   flush_sb_dirty_time() completely.

Well it's not quite enough.  The problem is that for ext3 and
ext4, the actual work of writing the inode happens in dirty_inode(),
not in write_inode().  Which means we need to do something like this.

I'm not entirely sure whether or not this is too ugly to live;
personally, I think my hack of handling this in update_time() might be
preferable

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b93c529..95a42b3 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -253,7 +253,7 @@ static bool inode_dirtied_after(struct inode *inode, 
unsigned long t)
  */
 static int move_expired_inodes(struct list_head *delaying_queue,
   struct list_head *dispatch_queue,
-  struct wb_writeback_work *work)
+  unsigned long *older_than_this)
 {
LIST_HEAD(tmp);
struct list_head *pos, *node;
@@ -264,8 +264,8 @@ static int move_expired_inodes(struct list_head 
*delaying_queue,
 
while (!list_empty(delaying_queue)) {
inode = wb_inode(delaying_queue->prev);
-   if (work->older_than_this &&
-   inode_dirtied_after(inode, *work->older_than_this))
+   if (older_than_this &&
+   inode_dirtied_after(inode, *older_than_this))
break;
list_move(&inode->i_wb_list, &tmp);
moved++;
@@ -309,9 +309,14 @@ out:
 static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
 {
int moved;
+   unsigned long one_day_later = jiffies + (HZ * 86400);
+
assert_spin_locked(&wb->list_lock);
list_splice_init(&wb->b_more_io, &wb->b_io);
-   moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, work);
+   moved = move_expired_inodes(&wb->b_dirty, &wb->b_io,
+   work->older_than_this);
+   moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
+&one_day_later);
trace_writeback_queue_io(wb, work, moved);
 }
 
@@ -637,6 +642,17 @@ static long writeback_sb_inodes(struct super_block *sb,
}
 
/*
+* If the inode is marked dirty time but is not dirty,
+* then at last for ext3 and ext4 we need to call
+* mark_inode_dirty_sync in order to get the inode
+* timestamp transferred to the on disk inode, since
+* write_inode is a no-op for those file systems.  HACK HACK 
HACK
+*/
+   if ((inode->i_state & I_DIRTY_TIME) &&
+   ((inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) == 0))
+   mark_inode_dirty_sync(inode);
+
+   /*
 * Don't bother with new inodes or inodes being freed, first
 * kind does not need periodic writeout yet, and for the latter
 * kind writeout is handled by the freer.
@@ -1233,9 +1249,10 @@ void inode_requeue_dirtytime(struct inode *inode)
spin_lock(&bdi->wb.list_lock);
spin_lock(&inode->i_lock);
if ((inode->i_state & I_DIRTY_WB) == 0) {
-   if (inode->i_state & I_DIRTY_TIME)
+   if (inode->i_state & I_DIRTY_TIME) {
+   inode->dirtied_when = jiffies;
list_move(&inode->i_wb_list, &bdi->wb.b_dirty_time);
-   else
+   } else
list_del_init(&inode->i_wb_list);
}
spin_unlock(&inode->i_lock);

Comments?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Theodore Ts&#x27;o
On Thu, Nov 27, 2014 at 08:49:52AM -0800, Christoph Hellwig wrote:
> I don't think this scheme works well.  As mentioned earlier XFS doesn't
> even use vfs dirty tracking at the moment, so introducing this in a
> hidden way sounds like a bad idea.  Probably the same for btrfs.
> 
> I'd rather keep update_time as-is for now, don't add ->write_time and
> let btrfs and XFS figure out how to implement the semantics on their
> own.

I can do that, but part of the reason why we were doing this rather
involved set of changes was to allow other file systems to be able to
take advantage of lazytime.  I suppose there is value in allowing
other file systems, such as jfs, f2fs, etc., to use it, but still,
it's a bit of a shame to drop btrfs and xfs support for this feature.

I'll note by the way that ext3 and ext4 doesn't really use VFS dirty
tracking either --- see my other comments about the naming of
"mark_inode_dirty" being a bit misleading, at least for all/most of
the major file systems.  The problem seems to be that replacement
schemes that we've all using are slightly different.  :-/

I suppose should let the btrfs folks decide whether they want to add
is_readonly() and write_time() function --- or maybe help with the
cleanup work so that mark_inode_dirty() can reflect an error to its
callers.   Chris, David, what do you think?

 - Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 2/7] vfs: add support for a lazytime mount option

2014-11-27 Thread Theodore Ts&#x27;o
On Thu, Nov 27, 2014 at 02:14:21PM +0100, Jan Kara wrote:
> Looking into the code & your patch I'd prefer to do something like:
> * add support for I_DIRTY_TIME in __mark_inode_dirty() - update_time will
>   call __mark_inode_dirty() with this flag if any of the times was updated.
>   That way we can just remove your ->write_time() callback - filesystems
>   can just handle this in their ->dirty_inode() methods if they wish.
>   __mark_inode_dirty() will take care of moving inode into proper writeback
>   list (i_dirty / i_dirty_time), dirtied_when will be set to current time.

One of the tricky bits about this is that btrfs wants to be able to
return an error from write_time() which gets reflected up through
update_time() to the callers of file_update_time().  Currently
__mark_inode_dirty() and family return a void, and changing this is
going to be a bit of a mess, since doing this correctly would require
auditing all of the callers of mark_inode_dirty(),
mark_inode_dirty_sync(), __mark_inode_dirty(), etc.

Doing this would be a good thing, and eventually I think it would be
nice if we could allow the mark_inode_dirty() functions return an
error instead of void, but I wonder if that's a cleanup that's better
saved for later.  While we were at it, maybe we should rename
mark_inode_dirty() to inode_dirty(), since that way we can be sure
we've looked at all of the call site of mark_inode_dirty() and friends
--- and we have a number of file systems, including btrfs, ext3, and
ext4, where mark_inode_dirty() is doing a lot more than just marking
the inode is dirty, and the only reason why it's named that is mostly
historical.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-27 Thread Theodore Ts&#x27;o
On Thu, Nov 27, 2014 at 04:41:59PM +0100, Jan Kara wrote:
>   Hum, but this puts lots of stuff under inode_hash_lock, including
> writeback list lock. I don't like this too much. I understand that getting
> handle for each inode is rather more CPU intensive but it should still be a
> clear win over the current situation and avoids entangling locks like this.

Hmm, if we dropped the inode_requeue_dirtytime(), then we can avoid
taking the writeback list lock.  The net result is that we would have
some inodes still on the b_dirty_time list that were no longer
I_DIRTY_TIME, but since I_DIRTY_TIME wouldn't be set, it's mostly
harmless since when we do iterate over the b_dirty_time list, those
inodes can be quickly identified and skipped over.  (And if the inode
ever gets dirtied for real, then it will get moved onto the b_dirty
list and that will be that.)

The problem with getting a handle on the inode is not just that it is
more CPU intensive, but that can't let the iput_final() call happen
until after we have finished the transaction handle.  We could keep a
linked list of inodes attached to the handle, and then only call iput
on them once ext4_journal_stop(handle) gets called, but that's a
complication I'd like to avoid if at all possible.

Being able to opporunistically write the timestamps when we are
journalling an inode table block is a pretty big win, so if we end up
extending the hold time on inode_hash_lock (only when we come across a
I_DIRTY_TIME inode that we can clear) a tiny bit, there will be a lot
of workloads where I think it's a worthwhile tradeoff.  If we can
avoid entangling the writebakc list lock, does that make you happier
about this approach?

 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Theodore Ts&#x27;o
Christoph, can you take a quick look at this?  I'm not sure I got the
xfs inode transaction logging correct.

Thanks!!

- Ted

commit cd58addfa340c9cf88b1f9b2d31a42e2e65c7252
Author: Theodore Ts'o 
Date:   Thu Nov 27 10:14:27 2014 -0500

vfs: split update_time() into update_time() and write_time()

In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Previously, only btrfs and xfs uses update_time().

With this patch, btrfs only needs write_time(), and xfs uses
update_time() to synchronize its on-disk and in-memory timestamps.

Signed-off-by: Theodore Ts'o 
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba 

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f4b0ecd..befd5d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1545,7 +1545,8 @@ struct inode_operations {
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
  u64 len);
int (*is_readonly)(struct inode *);
-   int (*update_time)(struct inode *, struct timespec *, int);
+   void (*update_time)(struct inode *);
+   int (*write_time)(struct inode *);
int (*atomic_open)(struct inode *, struct dentry *,
   struct file *, unsigned open_flag,
   umode_t create_mode, int *opened);
diff --git a/fs/inode.c b/fs/inode.c
index 53f0173..94bc908 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1506,9 +1506,6 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (ret)
return ret;
}
-   if (inode->i_op->update_time)
-   return inode->i_op->update_time(inode, time, flags);
-
if (flags & S_ATIME)
inode->i_atime = *time;
if (flags & S_VERSION)
@@ -1517,6 +1514,10 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
inode->i_ctime = *time;
if (flags & S_MTIME)
inode->i_mtime = *time;
+   if (inode->i_op->update_time)
+   inode->i_op->update_time(inode);
+   if (inode->i_op->write_time)
+   return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
return 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..b69493d 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -983,42 +983,42 @@ xfs_vn_setattr(
return error;
 }
 
-STATIC int
+STATIC void
 xfs_vn_update_time(
-   struct inode*inode,
-   struct timespec *now,
-   int flags)
+   struct inode*inode)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+
+   trace_xfs_update_time(ip);
+   xfs_ilock(ip, XFS_ILOCK_EXCL);
+   ip->i_d.di_ctime.t_sec = (__int32_t) inode->i_ctime.tv_sec;
+   ip->i_d.di_ctime.t_nsec = (__int32_t) inode->i_ctime.tv_nsec;
+
+   ip->i_d.di_mtime.t_sec = (__int32_t)inode->i_mtime.tv_sec;
+   ip->i_d.di_mtime.t_nsec = (__int32_t) inode->i_mtime.tv_nsec;
+
+   ip->i_d.di_atime.t_sec = (__int32_t) inode->i_atime.tv_sec;
+   ip->i_d.di_atime.t_nsec = (__int32_t) inode->i_atime.tv_nsec;
+   xfs_iunlock(ip, XFS_ILOCK_EXCL);
+}
+
+STATIC int
+xfs_vn_write_time(
+   struct inode*inode)
 {
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip->i_mount;
struct xfs_trans*tp;
int error;
 
-   trace_xfs_update_time(ip);
-
+   trace_xfs_write_time(ip);
tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
error = xfs_trans_reserve(tp, &M_RES(mp)->tr_fsyncts, 0, 0);
if (error) {
xfs_trans_cancel(tp, 0);
return error;
}
-
xfs_ilock(ip, XFS_ILOCK_EXCL);
-   if (flags & S_CTIME) {
-   inode->i_ctime = *now;
-   ip->i_d.di_ctime.t_sec = (__int32_t)now->tv_sec;
-   ip->i_d.di_ctime.t_nsec = (__int32_t)now->tv_nsec;
-   }
-   if (flags & S_MTIME) {
-   inode->i_mtime = *now;
-   ip->i_d.di_mtime.t_sec = (__int32_t)now->tv_sec;
-   ip->i_d.di_mtime.t_nsec = (__int32_t)now->tv_nsec;
-   }
-   if (flags & S_ATIME) {
-   inode->i_atime = *now;
-   ip->i_d.di_atime.t_sec = (__int32_t)now->tv_sec;
-   ip->i_d.di_atime.t_nsec = (__int32_t)now->tv_nsec;
-   }
xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
return xf

Re: [PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-27 Thread Theodore Ts&#x27;o
This is what I'm currently playing with which I believe fixes the iput()
problem.  In fs/ext4/inode.c:

struct other_inode {
unsigned long   orig_ino;
struct ext4_inode   *raw_inode;
};
static int other_inode_match(struct inode * inode, unsigned long ino,
 void *data);

/*
 * Opportunistically update the other time fields for other inodes in
 * the same inode table block.
 */
static void ext4_update_other_inodes_time(struct super_block *sb,
  unsigned long orig_ino, char *buf)
{
struct other_inode oi;
unsigned long ino;
int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
int inode_size = EXT4_INODE_SIZE(sb);

oi.orig_ino = orig_ino;
ino = orig_ino & ~(inodes_per_block - 1);
for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
if (ino == orig_ino)
continue;
oi.raw_inode = (struct ext4_inode *) buf;
(void) find_inode_nowait(sb, ino, other_inode_match, &oi);
}
}

static int other_inode_match(struct inode * inode, unsigned long ino,
 void *data)
{
struct other_inode *oi = (struct other_inode *) data;

if ((inode->i_ino != ino) ||
(inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
((inode->i_state & I_DIRTY_TIME) == 0))
return 0;
spin_lock(&inode->i_lock);
if (((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) &&
(inode->i_state & I_DIRTY_TIME)) {
struct ext4_inode_info  *ei = EXT4_I(inode);

inode->i_state &= ~I_DIRTY_TIME;
inode->i_ts_dirty_day = 0;
spin_unlock(&inode->i_lock);
inode_requeue_dirtytime(inode);

spin_lock(&ei->i_raw_lock);
EXT4_INODE_SET_XTIME(i_ctime, inode, oi->raw_inode);
EXT4_INODE_SET_XTIME(i_mtime, inode, oi->raw_inode);
EXT4_INODE_SET_XTIME(i_atime, inode, oi->raw_inode);
ext4_inode_csum_set(inode, oi->raw_inode, ei);
spin_unlock(&ei->i_raw_lock);
trace_ext4_other_inode_update_time(inode, oi->orig_ino);
return -1;
}
spin_unlock(&inode->i_lock);
return -1;
}

The above uses the following in fs/inode.c (which gets added instead of
find_active_inode_nowait):

/**
 * find_inode_nowait - find an inode in the inode cache
 * @sb: super block of file system to search
 * @hashval:hash value (usually inode number) to search for
 * @match:  callback used for comparisons between inodes
 * @data:   opaque data pointer to pass to @match
 *
 * Search for the inode specified by @hashval and @data in the inode
 * cache, where the helper function @match will return 0 if the inode
 * does not match, 1 if the inode does match, and -1 if the search
 * should be stopped.  The @match function must be responsible for
 * taking the i_lock spin_lock and checking i_state for an inode being
 * freed or being initialized, and incrementing the reference count
 * before returning 1.  It also must not sleep, since it is called with
 * the inode_hash_lock spinlock held.
 *
 * This is a even more generalized version of ilookup5() when the
 * function must never block --- find_inode() can block in
 * __wait_on_freeing_inode() --- or when the caller can not increment
 * the reference count because the resulting iput() might cause an
 * inode eviction().  The tradeoff is that the @match funtion must be
 * very carefully implemented.
 */
struct inode *find_inode_nowait(struct super_block *sb,
unsigned long hashval,
int (*match)(struct inode *, unsigned long,
 void *),
void *data)
{
struct hlist_head *head = inode_hashtable + hash(sb, hashval);
struct inode *inode, *ret_inode = NULL;
int mval;

spin_lock(&inode_hash_lock);
hlist_for_each_entry(inode, head, i_hash) {
if (inode->i_sb != sb)
continue;
mval = match(inode, hashval, data);
if (mval == 0)
continue;
if (mval == 1)
ret_inode = inode;
goto out;
}
out:
spin_unlock(&inode_hash_lock);
return ret_inode;
}
EXPORT_SYMBOL(find_inode_nowait);

Comments?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-27 Thread Theodore Ts&#x27;o
On Wed, Nov 26, 2014 at 11:23:28AM -0800, Christoph Hellwig wrote:
> As mentioned last round please move the addition of the is_readonly
> operation to the first thing in the series, so that the ordering makes
> more sense.

OK, will fix.

> Second I think this patch is incorrect for XFS - XFS uses ->update_time
> to set the time stampst in the dinode.  These two need to be coherent
> as we can write out a dirty inode any time, so it needs to have the
> timestamp uptodate.
> 
> Third update_time now calls mark_inode_dirty unconditionally, while
> previously it wasn't called when ->update_time was set.  At least
> for XFS that's a major change in behavior as XFS never used VFS dirty
> tracking for metadata updates.

Thanks, I'll fix both of the above.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v4 2/7] vfs: add support for a lazytime mount option

2014-11-26 Thread Theodore Ts&#x27;o
Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.

For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c   | 55 -
 fs/inode.c  | 43 ++-
 fs/proc_namespace.c |  1 +
 fs/sync.c   |  8 +++
 include/linux/backing-dev.h |  1 +
 include/linux/fs.h  | 11 +++--
 include/uapi/linux/fs.h |  1 +
 mm/backing-dev.c|  9 ++--
 8 files changed, 113 insertions(+), 16 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ef8c5d8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -397,7 +397,7 @@ static void requeue_inode(struct inode *inode, struct 
bdi_writeback *wb,
 * shot. If still dirty, it will be redirty_tail()'ed below.  Update
 * the dirty time to prevent enqueue and sync it again.
 */
-   if ((inode->i_state & I_DIRTY) &&
+   if ((inode->i_state & I_DIRTY_WB) &&
(wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
inode->dirtied_when = jiffies;
 
@@ -428,13 +428,15 @@ static void requeue_inode(struct inode *inode, struct 
bdi_writeback *wb,
 */
redirty_tail(inode, wb);
}
-   } else if (inode->i_state & I_DIRTY) {
+   } else if (inode->i_state & I_DIRTY_WB) {
/*
 * Filesystems can dirty the inode during writeback operations,
 * such as delayed allocation during submission or metadata
 * updates after data IO completion.
 */
redirty_tail(inode, wb);
+   } else if (inode->i_state & I_DIRTY_TIME) {
+   list_move(&inode->i_wb_list, &wb->b_dirty_time);
} else {
/* The inode is clean. Remove from writeback lists. */
list_del_init(&inode->i_wb_list);
@@ -482,11 +484,11 @@ __writeback_single_inode(struct inode *inode, struct 
writeback_control *wbc)
/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state &= ~I_DIRTY_PAGES;
-   dirty = inode->i_state & I_DIRTY;
-   inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+   dirty = inode->i_state & I_DIRTY_INODE;
+   inode->i_state &= ~I_DIRTY_INODE;
spin_unlock(&inode->i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
-   if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+   if (dirty) {
int err = write_inode(inode, wbc);
if (ret == 0)
ret = err;
@@ -1162,7 +1164,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 
spin_lock(&inode->i_lock);
if ((inode->i_state & flags) != flags) {
-   const int was_dirty = inode->i_state & I_DIRTY;
+   const int was_dirty = inode->i_state & I_DIRTY_WB;
 
inode->i_state |= flags;
 
@@ -1224,6 +1226,24 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+void inode_requeue_dirtytime(struct inode *inode)
+{
+   struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+   spin_lock(&bdi->wb.list_lock);
+   spin_lock(&inode->i_lock);
+   if ((inode->i_state & I_DIRTY_WB) == 0) {
+   if (inode->i_state & I_DIRTY_TIME)
+   list_move(&inode->i_wb_list, &bdi->wb.b_dirty_time);
+   else
+   list_del_init(&inode->i_wb_list);
+   }
+   spin_unlock(&inode->i_lock);
+   spin_unlock(&bdi->wb.list_lock);
+
+}
+EXPORT_SYMBOL(inode_req

[PATCH-v4 0/7] add support for a lazytime mount option

2014-11-26 Thread Theodore Ts&#x27;o
This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load.  So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code.  It is purely optional,
however.

Please note the changes to update_time() and the new write_time() inode
operations functions, which impact btrfs and xfs.  The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right.   Thanks!!

Any objections to my carrying these patches in the ext4 git tree?

Changes since -v3:
   - inodes with I_DIRTY_TIME set are placed on a new bdi list,
b_dirty_time.  This allows filesystem-level syncs to more
easily iterate over those inodes that need to have their
timestamps written to disk.
   - dirty timestamps will be written out asynchronously on the final
iput, instead of when the inode gets evicted.
   - separate the definition of the new function
find_active_inode_nowait() to a separate patch
   - create separate flag masks: I_DIRTY_WB and I_DIRTY_INODE, which
   indicate whether the inode needs to be on the write back lists,
   or whether the inode itself is dirty, while I_DIRTY means any one
   of the inode dirty flags are set.  This simplifies the fs
   writeback logic which needs to test for different combinations of
   the inode dirty flags in different places.

Changes since -v2:
   - If update_time() updates i_version, it will not use lazytime (i..e,
   the inode will be marked dirty so the change will be persisted on to
   disk sooner rather than later).  Yes, this eliminates the
   benefits of lazytime if the user is experting the file system via
   NFSv4.  Sad, but NFS's requirements seem to mandate this.
   - Fix time wrapping bug 49 days after the system boots (on a system
with a 32-bit jiffies).   Use get_monotonic_boottime() instead.
   - Clean up type warning in include/tracing/ext4.h
   - Added explicit parenthesis for stylistic reasons
   - Added an is_readonly() inode operations method so btrfs doesn't
   have to duplicate code in update_time().

Changes since -v1:
   - Added explanatory comments in update_time() regarding i_ts_dirty_days
   - Fix type used for days_since_boot
   - Improve SMP scalability in update_time and ext4_update_other_inodes_time
   - Added tracepoints to help test and characterize how often and under
 what circumstances inodes have their timestamps lazily updated


Theodore Ts'o (7):
  vfs: split update_time() into update_time() and write_time()
  vfs: add support for a lazytime mount option
  vfs: don't let the dirty time inodes get more than a day stale
  vfs: add lazytime tracepoints for better debugging
  vfs: add find_active_inode_nowait() function
  ext4: add support for a lazytime mount option
  btrfs: add an is_readonly() so btrfs can use common code for
update_time()

 Documentation/filesystems/Locking |   2 +
 fs/btrfs/inode.c  |  34 +---
 fs/ext4/inode.c   |  49 -
 fs/ext4/super.c   |   9 +++
 fs/fs-writeback.c |  57 +--
 fs/inode.c| 113 +++---
 fs/proc_namespace.c   |   1 +
 fs/sync.c |   8 +++
 fs/xfs/xfs_iops.c |  39 ++---
 include/linux/backing-dev.h   |   1 +
 include/linux/fs.h|  17 +-
 include/trace/events/ext4.h   |  30 ++
 include/uapi/linux/fs.h   |   1 +
 mm/backing-dev.c  |   9 ++-
 14 files changed, 306 insertions(+), 64 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-11-26 Thread Theodore Ts&#x27;o
In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Currently, only btrfs and xfs uses update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.

Signed-off-by: Theodore Ts'o 
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba 
---
 Documentation/filesystems/Locking |  2 ++
 fs/btrfs/inode.c  | 10 ++
 fs/inode.c| 29 ++---
 fs/xfs/xfs_iops.c | 39 ---
 include/linux/fs.h|  1 +
 5 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index b30753c..e49861d 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -63,6 +63,7 @@ prototypes:
int (*removexattr) (struct dentry *, const char *);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, 
u64 len);
void (*update_time)(struct inode *, struct timespec *, int);
+   void (*write_time)(struct inode *);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
@@ -95,6 +96,7 @@ listxattr:no
 removexattr:   yes
 fiemap:no
 update_time:   no
+write_time:no
 atomic_open:   yes
 tmpfile:   no
 dentry_open:   no
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct 
timespec *now,
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
+   return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
return btrfs_dirty_inode(inode);
 }
 
@@ -9462,6 +9467,7 @@ static const struct inode_operations 
btrfs_dir_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
.tmpfile= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations 
btrfs_file_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
.getattr= btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations 
btrfs_special_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink   = generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
.listxattr  = btrfs_listxattr,
.removexattr= btrfs_removexattr,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
-   if (inode->i_op->update_time)
-   return inode->i_op->update_time(inode, time, flags);
-
-   if (flags & S_ATIME)
-   inode->i_atime = *time;
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
-   inode->i_ctime = *time;
-   if (flags & S_MTIME)
-   inode->i_mtime = *time;
+   int ret;
+
+   if (inode->i_op->update_time) {
+   ret = inode->i_op->update_time(inode, time, flags);
+   if (ret)
+   return ret;
+   } else {
+   if (flags & S_ATIME)
+

[PATCH-v4 6/7] ext4: add support for a lazytime mount option

2014-11-26 Thread Theodore Ts&#x27;o
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/ext4/inode.c | 49 ++---
 fs/ext4/super.c |  9 +
 include/trace/events/ext4.h | 30 +++
 3 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5653fa4..8308c82 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4140,6 +4140,51 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+ unsigned long orig_ino, char *buf)
+{
+   struct ext4_inode_info  *ei;
+   struct ext4_inode   *raw_inode;
+   unsigned long   ino;
+   struct inode*inode;
+   int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+   int inode_size = EXT4_INODE_SIZE(sb);
+
+   ino = orig_ino & ~(inodes_per_block - 1);
+   for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+   if (ino == orig_ino)
+   continue;
+   inode = find_active_inode_nowait(sb, ino);
+   if (!inode ||
+   (inode->i_state & I_DIRTY_TIME) == 0 ||
+   !spin_trylock(&inode->i_lock)) {
+   iput(inode);
+   continue;
+   }
+   inode->i_state &= ~I_DIRTY_TIME;
+   inode->i_ts_dirty_day = 0;
+   spin_unlock(&inode->i_lock);
+   inode_requeue_dirtytime(inode);
+
+   ei = EXT4_I(inode);
+   raw_inode = (struct ext4_inode *) buf;
+
+   spin_lock(&ei->i_raw_lock);
+   EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+   ext4_inode_csum_set(inode, raw_inode, ei);
+   spin_unlock(&ei->i_raw_lock);
+   trace_ext4_other_inode_update_time(inode, orig_ino);
+   iput(inode);
+   }
+}
+
+
+/*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
  * buffer_head in the inode location struct.
@@ -4237,7 +4282,6 @@ static int ext4_do_update_inode(handle_t *handle,
for (block = 0; block < EXT4_N_BLOCKS; block++)
raw_inode->i_block[block] = ei->i_data[block];
}
-
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
if (ei->i_extra_isize) {
@@ -4248,10 +4292,9 @@ static int ext4_do_update_inode(handle_t *handle,
cpu_to_le16(ei->i_extra_isize);
}
}
-
ext4_inode_csum_set(inode, raw_inode, ei);
-
spin_unlock(&ei->i_raw_lock);
+   ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
 
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 58859bc..93a2b7a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1132,6 +1132,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+   Opt_lazytime, Opt_nolazytime,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
{Opt_i_version, "i_version"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
+   {Opt_lazytime, "lazytime"},
+   {Opt_nolazytime, "nolazytime"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_removed, "mblk_io_submit"},
{Opt_removed, "nomblk_io_submit"},
@@ -1452,6 +1455,12 @@ static int handle_mount_opt(struct super_block *sb, char 
*opt, int token,
case 

[PATCH-v4 3/7] vfs: don't let the dirty time inodes get more than a day stale

2014-11-26 Thread Theodore Ts&#x27;o
Guarantee that the on-disk timestamps will be no more than 24 hours
stale.

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c  |  1 +
 fs/inode.c | 18 ++
 include/linux/fs.h |  1 +
 3 files changed, 20 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef8c5d8..529480a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1143,6 +1143,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
trace_writeback_dirty_inode_start(inode, flags);
 
+   inode->i_ts_dirty_day = 0;
if (sb->s_op->dirty_inode)
sb->s_op->dirty_inode(inode, flags);
 
diff --git a/fs/inode.c b/fs/inode.c
index 9e464cc..37c0429 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1510,6 +1510,8 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
+   struct timespec uptime;
+   unsigned short days_since_boot;
int ret;
 
if (inode->i_op->update_time) {
@@ -1526,11 +1528,26 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
}
+   /*
+* If i_ts_dirty_day is zero, then either we have not deferred
+* timestamp updates, or the system has been up for less than
+* a day (so days_since_boot is zero), so we defer timestamp
+* updates in that case and set the I_DIRTY_TIME flag.  If a
+* day or more has passed, then i_ts_dirty_day will be
+* different from days_since_boot, and then we should update
+* the on-disk inode and then we can clear i_ts_dirty_day.
+*/
if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
!(flags & S_VERSION) &&
!(inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))) {
if (inode->i_state & I_DIRTY_TIME)
return 0;
+   get_monotonic_boottime(&uptime);
+   days_since_boot = div_u64(uptime.tv_sec, 86400);
+   if (inode->i_ts_dirty_day &&
+   (inode->i_ts_dirty_day != days_since_boot))
+   goto force_dirty;
+
spin_lock(&inode->i_lock);
if (inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
spin_unlock(&inode->i_lock);
@@ -1541,6 +1558,7 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
return 0;
}
inode->i_state |= I_DIRTY_TIME;
+   inode->i_ts_dirty_day = days_since_boot;
spin_unlock(&inode->i_lock);
inode_requeue_dirtytime(inode);
return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 55cf34d..d0a2181 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
+   unsigned short  i_ts_dirty_day;
unsigned inti_blkbits;
blkcnt_ti_blocks;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v4 5/7] vfs: add find_active_inode_nowait() function

2014-11-26 Thread Theodore Ts&#x27;o
Add a new function find_active_inode_nowait() which will never block.
If there is an inode being freed or is still being initialized, this
function will return NULL instead of blocking waiting for an inode to
be freed or to finish initializing.  Hence, a negative return from
this function does not mean that inode number is free for use.  It is
useful for callers that want to opportunistically do some work on an
inode only if it is present and available in the cache, and where
blocking is not an option.

Signed-off-by: Theodore Ts'o 
---
 fs/inode.c | 36 
 include/linux/fs.h |  2 ++
 2 files changed, 38 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index b2fea60..0b4c6ae 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1283,6 +1283,42 @@ struct inode *ilookup(struct super_block *sb, unsigned 
long ino)
 }
 EXPORT_SYMBOL(ilookup);
 
+/**
+ * find_active_inode_nowait - find an active inode in the inode cache
+ * @sb:super block of file system to search
+ * @ino:   inode number to search for
+ *
+ * Search for an active inode @ino in the inode cache, and if the
+ * inode is in the cache, the inode is returned with an incremented
+ * reference count.  If the inode is being freed or is newly
+ * initialized, return nothing instead of trying to wait for the inode
+ * initialization or destruction to be complete.
+ */
+struct inode *find_active_inode_nowait(struct super_block *sb,
+  unsigned long ino)
+{
+   struct hlist_head *head = inode_hashtable + hash(sb, ino);
+   struct inode *inode, *ret_inode = NULL;
+
+   spin_lock(&inode_hash_lock);
+   hlist_for_each_entry(inode, head, i_hash) {
+   if ((inode->i_ino != ino) ||
+   (inode->i_sb != sb))
+   continue;
+   spin_lock(&inode->i_lock);
+   if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) == 0) {
+   __iget(inode);
+   ret_inode = inode;
+   }
+   spin_unlock(&inode->i_lock);
+   goto out;
+   }
+out:
+   spin_unlock(&inode_hash_lock);
+   return ret_inode;
+}
+EXPORT_SYMBOL(find_active_inode_nowait);
+
 int insert_inode_locked(struct inode *inode)
 {
struct super_block *sb = inode->i_sb;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d0a2181..dc615ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2419,6 +2419,8 @@ extern struct inode *ilookup(struct super_block *sb, 
unsigned long ino);
 
 extern struct inode * iget5_locked(struct super_block *, unsigned long, int 
(*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
 extern struct inode * iget_locked(struct super_block *, unsigned long);
+extern struct inode *find_active_inode_nowait(struct super_block *,
+ unsigned long);
 extern int insert_inode_locked4(struct inode *, unsigned long, int 
(*test)(struct inode *, void *), void *);
 extern int insert_inode_locked(struct inode *);
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v4 7/7] btrfs: add an is_readonly() so btrfs can use common code for update_time()

2014-11-26 Thread Theodore Ts&#x27;o
The only reason btrfs cloned code from the VFS layer was so it could
add a check to see if a subvolume is read-ony.  Instead of doing that,
let's add a new inode operation which allows a file system to return
an error if the inode is read-only, and use that in update_time().
There may be other places where the VFS layer may want to know that
btrfs would want to treat an inode is read-only.

With this commit, there are no remaining users of update_time() in the
inode operations structure, so we can remove it and simply things
further.

Signed-off-by: Theodore Ts'o 
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba 
---
 fs/btrfs/inode.c   | 26 ++
 fs/inode.c | 22 +++---
 include/linux/fs.h |  2 +-
 3 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a5e0d0d..0bfe3a4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5554,26 +5554,12 @@ static int btrfs_dirty_inode(struct inode *inode)
return ret;
 }
 
-/*
- * This is a copy of file_update_time.  We need this so we can return error on
- * ENOSPC for updating the inode in the case of file write and mmap writes.
- */
-static int btrfs_update_time(struct inode *inode, struct timespec *now,
-int flags)
+static int btrfs_is_readonly(struct inode *inode)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
 
if (btrfs_root_readonly(root))
return -EROFS;
-
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
-   inode->i_ctime = *now;
-   if (flags & S_MTIME)
-   inode->i_mtime = *now;
-   if (flags & S_ATIME)
-   inode->i_atime = *now;
return 0;
 }
 
@@ -9466,8 +9452,8 @@ static const struct inode_operations 
btrfs_dir_inode_operations = {
.permission = btrfs_permission,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
.tmpfile= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9475,8 +9461,8 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations = {
.permission = btrfs_permission,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9546,8 +9532,8 @@ static const struct inode_operations 
btrfs_file_inode_operations = {
.fiemap = btrfs_fiemap,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
.getattr= btrfs_getattr,
@@ -9559,8 +9545,8 @@ static const struct inode_operations 
btrfs_special_inode_operations = {
.removexattr= btrfs_removexattr,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink   = generic_readlink,
@@ -9573,8 +9559,8 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
.getxattr   = btrfs_getxattr,
.listxattr  = btrfs_listxattr,
.removexattr= btrfs_removexattr,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 0b4c6ae..e29bd2d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1554,20 +1554,20 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
unsigned short days_since_boot;
int ret;
 
-   if (inode->i_op->update_time) {
-   ret = inode->i_op->update_time(inode, time, flags);
+   if (inode->i_op->is_readonly) {
+   ret = inode->i_op->is_readonly(inode);
if (ret)
return ret;
-   } else {
-   if (flags & S_ATIME)
-   inode->i_atime = *time;
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
-   inode->i_ctime = *time;
-   if (flags & S_MTIME)
-

[PATCH-v4 4/7] vfs: add lazytime tracepoints for better debugging

2014-11-26 Thread Theodore Ts&#x27;o
Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c | 1 +
 fs/inode.c| 5 +
 2 files changed, 6 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 529480a..3d87174 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index 37c0429..b2fea60 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -20,6 +20,9 @@
 #include 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 /*
  * Inode locking rules:
  *
@@ -1443,6 +1446,7 @@ retry:
inode->i_op->write_time(inode);
else if (inode->i_sb->s_op->write_inode)
mark_inode_dirty_sync(inode);
+   trace_fs_lazytime_iput(inode);
goto retry;
}
iput_final(inode);
@@ -1561,6 +1565,7 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
inode->i_ts_dirty_day = days_since_boot;
spin_unlock(&inode->i_lock);
inode_requeue_dirtytime(inode);
+   trace_fs_lazytime_defer(inode);
return 0;
}
 force_dirty:
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-26 Thread Theodore Ts&#x27;o
On Wed, Nov 26, 2014 at 10:48:51AM +1100, Dave Chinner wrote:
> No abuse necessary at all. Just a different inode_dirtied_after()
> check is requires if the inode is on the time dirty list in
> move_expired_inodes().

I'm still not sure what you have in mind here.  When would this be
checked?  It sounds like you want to set a timeout such that when an
inode which had its timestamps updated lazily 24 hours earlier, the
inode would get written out.  Yes?  But that implies something is
going to have to scan the list of inodes on the dirty time list
periodically.  When are you proposing that this take place?

The various approaches that come to mind all seem more complex than
what I have in this patch 3 of 4, and I'm not sure it's worth the
complexity.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Theodore Ts&#x27;o
On Tue, Nov 25, 2014 at 06:30:40PM +0100, Jan Kara wrote:
>   This would be possible and as Boaz says, it might be possible to reuse
> the same list_head in the inode for this. Getting rid of the full scan of
> all superblock inodes would be nice (as the scan gets really expensive for
> large numbers of inodes (think of i_sb_lock contention) and this makes it
> twice as bad) so I'd prefer to do this if possible.

Fair enough, I'll give this a try.  Personally, I've never been that
solicitous towards the efficiency of sync, since if you ever use it,
you tend to destroy performance just because of contention of the disk
drive head caused by the writeback, never mind the i_sb_lock
contention.  ("I am sync(2), the destroyer of tail latency SLO's...")

In fact there has sometimes been discussion about disabling sync(2)
from non-root users, because the opportunity for mischief when a
developer logs and types sync out of reflex is too high.  Of course,
if we ever did this, I'm sure such a patch would never be accepted
upstream, but that's OK, most people don't seem to care about tail
latency outside of Facebook and Google anyway

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-25 Thread Theodore Ts&#x27;o
On Tue, Nov 25, 2014 at 06:19:27PM +0100, Jan Kara wrote:
>   Actually, I'd also prefer to do the writing from iput_final(). My main
> reason is that shrinker starts behaving very differently when you put
> inodes with I_DIRTY_TIME to the LRU. See inode_lru_isolate() and in
> particular:
> /*
>  * Referenced or dirty inodes are still in use. Give them another
>  * pass
>  * through the LRU as we canot reclaim them now.
>  */
> if (atomic_read(&inode->i_count) ||
> (inode->i_state & ~I_REFERENCED)) {
> list_del_init(&inode->i_lru);
> spin_unlock(&inode->i_lock);
> this_cpu_dec(nr_unused);
> return LRU_REMOVED;
> }

I must be missing something; how would the shirnker behave
differently?  I_DIRTY_TIME shouldn't have any effect on the shrinker;
note that I_DIRTY_TIME is *not* part of I_DIRTY, and this was quite
deliberate, because I didn't want I_DIRTY_TIME to have any affect on
any of the other parts of the writeback or inode management parts.

> Regarding your concern that we'd write the inode when file is closed -
> that's not true. We'll write the inode only after corresponding dentry is
> evicted and thus drops inode reference. That doesn't seem too bad to me.

True, fair enough.  It's not quite so lazy, but it should be close
enough.

I'm still not seeing the benefit in waiting until the last possible
minute to write out the timestamps; evict() can block as it is if
there are any writeback that needs to be completed, and if the
writeback happens to pages subject to delalloc, the timestamp update
could happen for free at that point.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v3 3/6] vfs: don't let the dirty time inodes get more than a day stale

2014-11-25 Thread Theodore Ts&#x27;o
On Tue, Nov 25, 2014 at 03:58:01PM +0100, Rasmus Villemoes wrote:
> 
> I think days_since_boot was a lot clearer than daycode. In any case,
> please make the comment and the code consistent.

Yeah, I was going back and forth between days since the epoch and days
since boot, but found it was more efficient to calculate the days
since boot.  Sure, I'll go back to days_since_boot.

> You should probably divide by the number of seconds in a day, not the
> number of jiffies in a day.

Right, brain hiccup on my part, will fix.

> Isn't div_u64 mostly for when the divisor is not known at compile time?
> Technically, "(u64)uptime.tv_sec / 86400" is of course a u64/u64 division,
> but the compiler should see that the divisor is only 32 bits and hence
> should be able to generate code which is at least as efficient as
> whatever inline asm the arch provides for u64/u32 divisions.

The problem with doing u64/u64 divisions is that the compiler will
call out to a (non-existent) library function on some architectures.
For example, try compiling the following using: with "gcc -S -m32
foo.c"

int main(int argc, char **argv)
{
unsigned long long t = time(0);

printf("%llu\n", t / 86400);
}

You will find in the .S file the following:

...
pushl   $0
pushl   $86400
pushl   %edx
pushl   %eax
call__udivdi3
...

It will work finn compiling for x86_64, but if you do this and push
out a git branch, you will soon get a very nice e-mail from the ktest
bot explaining how you've broken the build on the i386 architecture
since the kernel doesn't provide __udivdi3.

Hence if you are doing any kind of divisions involving u64, you have
to use the functions in include/linux/math64.h, and div_u64 is, per
math64.h:

/**
 * div_u64 - unsigned 64bit divide with 32bit divisor
 *
 * This is the most common 64bit divide and should be used if possible,
 * as many 32bit archs can optimize this variant better than a full 64bit
 * divide.
 */

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v3 6/6] btrfs: add an is_readonly() so btrfs can use common code for update_time()

2014-11-24 Thread Theodore Ts&#x27;o
The only reason btrfs cloned code from the VFS layer was so it could
add a check to see if a subvolume is read-ony.  Instead of doing that,
let's add a new inode operation which allows a file system to return
an error if the inode is read-only, and use that in update_time().
There may be other places where the VFS layer may want to know that
btrfs would want to treat an inode is read-only.

With this commit, there are no remaining users of update_time() in the
inode operations structure, so we can remove it and simply things
further.

Signed-off-by: Theodore Ts'o 
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/inode.c   | 26 ++
 fs/inode.c | 22 +++---
 include/linux/fs.h |  2 +-
 3 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a5e0d0d..0bfe3a4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5554,26 +5554,12 @@ static int btrfs_dirty_inode(struct inode *inode)
return ret;
 }
 
-/*
- * This is a copy of file_update_time.  We need this so we can return error on
- * ENOSPC for updating the inode in the case of file write and mmap writes.
- */
-static int btrfs_update_time(struct inode *inode, struct timespec *now,
-int flags)
+static int btrfs_is_readonly(struct inode *inode)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
 
if (btrfs_root_readonly(root))
return -EROFS;
-
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
-   inode->i_ctime = *now;
-   if (flags & S_MTIME)
-   inode->i_mtime = *now;
-   if (flags & S_ATIME)
-   inode->i_atime = *now;
return 0;
 }
 
@@ -9466,8 +9452,8 @@ static const struct inode_operations 
btrfs_dir_inode_operations = {
.permission = btrfs_permission,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
.tmpfile= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9475,8 +9461,8 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations = {
.permission = btrfs_permission,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9546,8 +9532,8 @@ static const struct inode_operations 
btrfs_file_inode_operations = {
.fiemap = btrfs_fiemap,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
.getattr= btrfs_getattr,
@@ -9559,8 +9545,8 @@ static const struct inode_operations 
btrfs_special_inode_operations = {
.removexattr= btrfs_removexattr,
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink   = generic_readlink,
@@ -9573,8 +9559,8 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
.getxattr   = btrfs_getxattr,
.listxattr  = btrfs_listxattr,
.removexattr= btrfs_removexattr,
-   .update_time= btrfs_update_time,
.write_time = btrfs_write_time,
+   .is_readonly= btrfs_is_readonly,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 1f90591..e83c6d2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1555,20 +1555,20 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
unsigned short daycode;
int ret;
 
-   if (inode->i_op->update_time) {
-   ret = inode->i_op->update_time(inode, time, flags);
+   if (inode->i_op->is_readonly) {
+   ret = inode->i_op->is_readonly(inode);
if (ret)
return ret;
-   } else {
-   if (flags & S_ATIME)
-   inode->i_atime = *time;
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
-   inode->i_ctime = *time;
-   if (flags & S_MTIME)
-   inode->i_mtime

[PATCH-v3 4/6] vfs: add lazytime tracepoints for better debugging

2014-11-24 Thread Theodore Ts&#x27;o
Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c | 5 -
 fs/inode.c| 5 +
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index eb04277..cab2d6d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 /*
@@ -1304,8 +1305,10 @@ static void flush_sb_dirty_time(struct super_block *sb)
iput(old_inode);
old_inode = inode;
 
-   if (dirty_time)
+   if (dirty_time) {
+   trace_fs_lazytime_flush(inode);
mark_inode_dirty(inode);
+   }
cond_resched();
spin_lock(&inode_sb_list_lock);
}
diff --git a/fs/inode.c b/fs/inode.c
index 34a443f..6319ead 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -20,6 +20,9 @@
 #include 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 /*
  * Inode locking rules:
  *
@@ -544,6 +547,7 @@ static void evict(struct inode *inode)
mark_inode_dirty(inode);
inode->i_sb->s_op->write_inode(inode, &wbc);
}
+   trace_fs_lazytime_evict(inode);
}
 
if (!list_empty(&inode->i_wb_list))
@@ -1550,6 +1554,7 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
inode->i_ts_dirty_day = daycode;
+   trace_fs_lazytime_defer(inode);
return 0;
}
}
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v3 2/6] vfs: add support for a lazytime mount option

2014-11-24 Thread Theodore Ts&#x27;o
Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.

For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c   | 38 +-
 fs/inode.c  | 21 +
 fs/proc_namespace.c |  1 +
 fs/sync.c   |  7 +++
 include/linux/fs.h  |  1 +
 include/uapi/linux/fs.h |  1 +
 6 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct 
writeback_control *wbc)
if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state &= ~I_DIRTY_PAGES;
dirty = inode->i_state & I_DIRTY;
-   inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+   inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
spin_unlock(&inode->i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
 }
 
+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+   struct inode *inode, *old_inode = NULL;
+
+   WARN_ON(!rwsem_is_locked(&sb->s_umount));
+   spin_lock(&inode_sb_list_lock);
+   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+   int dirty_time;
+
+   spin_lock(&inode->i_lock);
+   if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
+   dirty_time = inode->i_state & I_DIRTY_TIME;
+   __iget(inode);
+   spin_unlock(&inode->i_lock);
+   spin_unlock(&inode_sb_list_lock);
+
+   iput(old_inode);
+   old_inode = inode;
+
+   if (dirty_time)
+   mark_inode_dirty(inode);
+   cond_resched();
+   spin_lock(&inode_sb_list_lock);
+   }
+   spin_unlock(&inode_sb_list_lock);
+   iput(old_inode);
+}
+
 /**
  * writeback_inodes_sb_nr -writeback dirty inodes from given super_block
  * @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+   flush_sb_dirty_time(sb);
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
 
diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..2093a84 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(!list_empty(&inode->i_lru));
 
+   if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+   if (inode->i_op->write_time)
+   inode->i_op->write_time(inode);
+   else if (inode->i_sb->s_op->write_inode) {
+   struct writeback_control wbc = {
+   .sync_mode = WB_SYNC_NONE,
+   };
+   mark_inode_dirty(inode);
+   inode->i_sb->s_op->write_inode(inode, &wbc);
+   }
+   }
+
if (!list_empty(&inode->i_wb_list))
inode_wb_list_del(inode);
 
@@ -1515,6 +1527,15 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *ti

[PATCH-v3 5/6] ext4: add support for a lazytime mount option

2014-11-24 Thread Theodore Ts&#x27;o
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/ext4/inode.c | 48 ++---
 fs/ext4/super.c |  9 +
 fs/inode.c  | 36 ++
 include/linux/fs.h  |  2 ++
 include/trace/events/ext4.h | 30 
 5 files changed, 122 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3356ab5..03149b4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4163,6 +4163,50 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+ unsigned long orig_ino, char *buf)
+{
+   struct ext4_inode_info  *ei;
+   struct ext4_inode   *raw_inode;
+   unsigned long   ino;
+   struct inode*inode;
+   int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+   int inode_size = EXT4_INODE_SIZE(sb);
+
+   ino = orig_ino & ~(inodes_per_block - 1);
+   for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+   if (ino == orig_ino)
+   continue;
+   inode = find_active_inode_nowait(sb, ino);
+   if (!inode ||
+   (inode->i_state & I_DIRTY_TIME) == 0 ||
+   !spin_trylock(&inode->i_lock)) {
+   iput(inode);
+   continue;
+   }
+   inode->i_state &= ~I_DIRTY_TIME;
+   inode->i_ts_dirty_day = 0;
+   spin_unlock(&inode->i_lock);
+
+   ei = EXT4_I(inode);
+   raw_inode = (struct ext4_inode *) buf;
+
+   spin_lock(&ei->i_raw_lock);
+   EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+   ext4_inode_csum_set(inode, raw_inode, ei);
+   spin_unlock(&ei->i_raw_lock);
+   trace_ext4_other_inode_update_time(inode, orig_ino);
+   iput(inode);
+   }
+}
+
+
+/*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
  * buffer_head in the inode location struct.
@@ -4260,7 +4304,6 @@ static int ext4_do_update_inode(handle_t *handle,
for (block = 0; block < EXT4_N_BLOCKS; block++)
raw_inode->i_block[block] = ei->i_data[block];
}
-
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
if (ei->i_extra_isize) {
@@ -4271,10 +4314,9 @@ static int ext4_do_update_inode(handle_t *handle,
cpu_to_le16(ei->i_extra_isize);
}
}
-
ext4_inode_csum_set(inode, raw_inode, ei);
-
spin_unlock(&ei->i_raw_lock);
+   ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
 
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4b79f39..1ac1914 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1133,6 +1133,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+   Opt_lazytime, Opt_nolazytime,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
{Opt_i_version, "i_version"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
+   {Opt_lazytime, "lazytime"},
+   {Opt_nolazytime, "nolazytime"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_removed, "mblk_io_submit"},
{Opt_removed, "nomblk_io_submit"},
@@ -1450,6 +1453,12 @@ static int handle_mount_opt(struc

[PATCH-v3 1/6] fs: split update_time() into update_time() and write_time()

2014-11-24 Thread Theodore Ts&#x27;o
In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Currently, only btrfs and xfs uses update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.

Signed-off-by: Theodore Ts'o 
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
---
 Documentation/filesystems/Locking |  2 ++
 fs/btrfs/inode.c  | 10 ++
 fs/inode.c| 29 ++---
 fs/xfs/xfs_iops.c | 39 ---
 include/linux/fs.h|  1 +
 5 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index b30753c..e49861d 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -63,6 +63,7 @@ prototypes:
int (*removexattr) (struct dentry *, const char *);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, 
u64 len);
void (*update_time)(struct inode *, struct timespec *, int);
+   void (*write_time)(struct inode *);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode, int *opened);
@@ -95,6 +96,7 @@ listxattr:no
 removexattr:   yes
 fiemap:no
 update_time:   no
+write_time:no
 atomic_open:   yes
 tmpfile:   no
 dentry_open:   no
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct 
timespec *now,
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
+   return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
return btrfs_dirty_inode(inode);
 }
 
@@ -9462,6 +9467,7 @@ static const struct inode_operations 
btrfs_dir_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
.tmpfile= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations 
btrfs_file_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
.getattr= btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations 
btrfs_special_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink   = generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
.listxattr  = btrfs_listxattr,
.removexattr= btrfs_removexattr,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
-   if (inode->i_op->update_time)
-   return inode->i_op->update_time(inode, time, flags);
-
-   if (flags & S_ATIME)
-   inode->i_atime = *time;
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
-   inode->i_ctime = *time;
-   if (flags & S_MTIME)
-   inode->i_mtime = *time;
+   int ret;
+
+   if (inode->i_op->update_time) {
+   ret = inode->i_op->update_time(inode, time, flags);
+   if (ret)
+   return ret;
+   } else {
+   if (flags & S_ATIME)
+   inode->i_atime = *time;
+   

[PATCH-v3 0/6] add support for a lazytime mount option

2014-11-24 Thread Theodore Ts&#x27;o
This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load.  So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code.  It is purely optional,
however.

Please note the changes to update_time() and the new write_time() inode
operations functions, which impact btrfs and xfs.  The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right.   Thanks!!

Any objections to my carrying these patches in the ext4 git tree?

Changes since -v2:
   - If update_time() updates i_version, it will not use lazytime (i..e,
   the inode will be marked dirty so the change will be persisted on to
   disk sooner rather than later).  Yes, this eliminates the
   benefits of lazytime if the user is experting the file system via
   NFSv4.  Sad, but NFS's requirements seem to mandate this.
   - Fix time wrapping bug 49 days after the system boots (on a system
with a 32-bit jiffies).   Use get_monotonic_boottime() instead.
   - Clean up type warning in include/tracing/ext4.h
   - Added explicit parenthesis for stylistic reasons
   - Added an is_readonly() inode operations method so btrfs doesn't
   have to duplicate code in update_time().

Changes since -v1:
   - Added explanatory comments in update_time() regarding i_ts_dirty_days
   - Fix type used for days_since_boot
   - Improve SMP scalability in update_time and ext4_update_other_inodes_time
   - Added tracepoints to help test and characterize how often and under
 what circumstances inodes have their timestamps lazily updated

Theodore Ts'o (6):
  fs: split update_time() into update_time() and write_time()
  vfs: add support for a lazytime mount option
  vfs: don't let the dirty time inodes get more than a day stale
  vfs: add lazytime tracepoints for better debugging
  ext4: add support for a lazytime mount option
  btrfs: add an is_readonly() so btrfs can use common code for
update_time()

 Documentation/filesystems/Locking |  2 +
 fs/btrfs/inode.c  | 34 +++
 fs/ext4/inode.c   | 48 +++--
 fs/ext4/super.c   |  9 
 fs/fs-writeback.c | 42 +-
 fs/inode.c| 91 ++-
 fs/proc_namespace.c   |  1 +
 fs/sync.c |  7 +++
 fs/xfs/xfs_iops.c | 39 +++--
 include/linux/fs.h|  7 ++-
 include/trace/events/ext4.h   | 30 +
 include/uapi/linux/fs.h   |  1 +
 12 files changed, 262 insertions(+), 49 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v3 3/6] vfs: don't let the dirty time inodes get more than a day stale

2014-11-24 Thread Theodore Ts&#x27;o
Guarantee that the on-disk timestamps will be no more than 24 hours
stale.

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c  |  1 +
 fs/inode.c | 28 +++-
 include/linux/fs.h |  1 +
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ce7de22..eb04277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
trace_writeback_dirty_inode_start(inode, flags);
 
+   inode->i_ts_dirty_day = 0;
if (sb->s_op->dirty_inode)
sb->s_op->dirty_inode(inode, flags);
 
diff --git a/fs/inode.c b/fs/inode.c
index 2093a84..34a443f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1511,6 +1511,8 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
+   struct timespec uptime;
+   unsigned short daycode;
int ret;
 
if (inode->i_op->update_time) {
@@ -1525,17 +1527,33 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags & S_CTIME)
inode->i_ctime = *time;
if (flags & S_MTIME)
-   inode->i_mtime = *time;
+   inode->i_mtime = *time;
}
+   /*
+* If i_ts_dirty_day is zero, then either we have not deferred
+* timestamp updates, or the system has been up for less than
+* a day (so days_since_boot is zero), so we defer timestamp
+* updates in that case and set the I_DIRTY_TIME flag.  If a
+* day or more has passed, then i_ts_dirty_day will be
+* different from days_since_boot, and then we should update
+* the on-disk inode and then we can clear i_ts_dirty_day.
+*/
if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
!(flags & S_VERSION)) {
if (inode->i_state & I_DIRTY_TIME)
return 0;
-   spin_lock(&inode->i_lock);
-   inode->i_state |= I_DIRTY_TIME;
-   spin_unlock(&inode->i_lock);
-   return 0;
+   get_monotonic_boottime(&uptime);
+   daycode = div_u64(uptime.tv_sec, (HZ * 86400));
+   if (!inode->i_ts_dirty_day ||
+   inode->i_ts_dirty_day == daycode) {
+   spin_lock(&inode->i_lock);
+   inode->i_state |= I_DIRTY_TIME;
+   spin_unlock(&inode->i_lock);
+   inode->i_ts_dirty_day = daycode;
+   return 0;
+   }
}
+   inode->i_ts_dirty_day = 0;
if (inode->i_op->write_time)
return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 489b2f2..e3574cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
+   unsigned short  i_ts_dirty_day;
unsigned inti_blkbits;
blkcnt_ti_blocks;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-24 Thread Theodore Ts&#x27;o
On Tue, Nov 25, 2014 at 12:53:32PM +1100, Dave Chinner wrote:
> On Fri, Nov 21, 2014 at 02:59:23PM -0500, Theodore Ts'o wrote:
> > Guarantee that the on-disk timestamps will be no more than 24 hours
> > stale.
> > 
> > Signed-off-by: Theodore Ts'o 
> 
> If we put these inodes on the dirty inode list with at writeback
> time of 24 hours, this is completely unnecessary.

What do you mean by "a writeback time of 24 hours"?  Do you mean
creating a new field in the inode which specifies when the writeback
should happen?  I still worry about the dirty inode list getting
somewhat long large in the strictatime && lazytime case, and the inode
bloat nazi's coming after us for adding a new field to struct inode
structure.

Or do you mean trying to abuse the dirtied_when field in some way?

   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-24 Thread Theodore Ts&#x27;o
On Tue, Nov 25, 2014 at 12:52:39PM +1100, Dave Chinner wrote:
> > +static void flush_sb_dirty_time(struct super_block *sb)
> > +{
  ...
> > +}
> 
> This just seems wrong to me, not to mention extremely expensive when we have
> millions of cached inodes on the superblock.

#1, It only gets called on a sync(2) or syncfs(2), which can be pretty
expensive as it is, so I didn't really worry about it.

#2, We're already iterating over all of the inodes in the sync(2) or
syncfs(2) path, so the code path in question is already O(N) in the
number of inodes.

> Why can't we just add a function like mark_inode_dirty_time() which
> puts the inode on the dirty inode list with a writeback time 24
> hours in the future rather than 30s in the future?

I was concerned about putting them on the dirty inode list because it
would be extra inodes for the writeback threads would have to skip
over and ignore (since they would not be dirty in the inde or data
pages sense).

Another solution would be to use a separate linked list for dirtytime
inodes, but that means adding some extra fields to the inode
structure, which some might view as bloat.  Given #1 and #2 above,
yes, we're doubling the CPU cost for sync(2) and syncfs(2), since
we're not iterating over all of the inodes twice, but I believe that
is a better trade-off than bloating the inode structure with an extra
set of linked lists or increasing the CPU cost to the writeback path
(which gets executed way more often than the sync or syncfs paths).


> Eviction is too late for this. I'm pretty sure that it won't get
> this far as iput_final() should catch the I_DIRTY_TIME in the !drop
> case via write_inode_now().

Actually, the tracepoint for fs_lazytime_evict() does get triggered
from time to time; but only when the inode is evicted due to memory
pressure, i.e., via the evict_inodes() path.

I thought about possibly doing this in iput_final(), but that would
mean that whenever we closed the last fd on the file, we would push
the inode out to disk.  For files that we are writing, that's not so
bad; but if we enable strictatime with lazytime, then we would be
updating the atime for inodes that had been only been read on every
close --- which in the case of say, all of the files in the kernel
tree, would be a little unfortunate.

Of course, the combination of strict atime and lazytime would result
in a pretty heavy write load on a umount or sync(2), so I suspect
keeping the relatime mode would make sense for most people, but I for
those people who need strict Posix compliance, it seemed like doing
something that worked well for strictatime plus lazytime would be a
good thing, which is why I tried to defer things as much as possible.

>   if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
> 
> > +   spin_lock(&inode->i_lock);
> > +   inode->i_state |= I_DIRTY_SYNC;
> > +   spin_unlock(&inode->i_lock);
> > +   }
> > return file->f_op->fsync(file, start, end, datasync);
> 
> When we mark the inode I_DIRTY_TIME, we should also be marking it
> I_DIRTY_SYNC so that all the sync operations know that they should
> be writing this inode. That's partly why I also think these inodes
> should be tracked on the dirty inode list

The whole point of what I was doing is that I_DIRTY_TIME was not part
of I_DIRTY, and that when in update_time() we set I_DIRTY_TIME instead
of I_DIRTY_SYNC.  The goal is that these inodes would not end up on
the dirty list, because that way they wouldn't be forced out to disk
until either (a) the inode is written out for other reasons (i.e., a
change in i_size or i_blocks, etc.), (b) the inode is explicitly
fsync'ed, or (c) the the umount(2), sync(2), or syncfs(2) of the file
system.

That way, the timestamps are in the memory copy inode, but *not*
written on disk until they are opportunistically written out due to
some other modification of the inode which sets I_DIRTY_SYNC.

If we were to set I_DIRTY_SYNC alongside I_DIRTY_TIME, and put these
inodes on the dirty inode list, then I_DIRTY_TIME would effectively be
a no-op, and there would be no point to this whole exercise.

It may be that lazytime will never be the default, because it is so
different from what we are currently doing.  But I think it is worth
doing, even if it is an optional mount option which is only used under
special circumstances.  For myself, we will be using it in Google and
I will be using it on my laptop because it definitely reduces the
write load to the SSD.   This I've measured it via the tracepoints.

If there is significant objections to doing this in the VFS layer, I'm
happy to go back to doing this as in ext4-specific code.  There were a
few bits that were a bit more dodgy, and I can't make sync(2) and
syncfs(2) flush the dirty timestamps if I do it as an ext4-specific
hack, but for my use cases, I don't really care about those things.
The main reason why I redid this patch set as a VFS specific change
was because Cristoph and others specificall

Re: [PATCH-v2 0/5] add support for a lazytime mount option

2014-11-24 Thread Theodore Ts&#x27;o
On Mon, Nov 24, 2014 at 05:11:45PM -0500, J. Bruce Fields wrote:
> On Mon, Nov 24, 2014 at 06:57:27AM -0500, Theodore Ts'o wrote:
> > If we want to be paranoid, we handle i_version updates non-lazily; I
> > can see arguments in favor of that.
> > 
> > Ext4 only enables MS_I_VERSION if the user asks for it explicitly, so
> > it wouldn't cause me any problems.  However, xfs and btrfs enables it
> > by default, so that means xfs and btrfs wouldn't see the benefits of
> > lazytime (if you're going to have to push I_VERSION to disk, you might
> > as well update the [acm]time while you're at it).  I've always thought
> > that we *should* do is to only enable it if nfsv4 is serving the file
> > system, and not otherwise, though, which would also give us
> > consistency across all the file systems.
> 
> I guess you need to worry about the case where you shutdown nfsd, modify
> a file, then restart nfsd--you don't want a client to miss the
> modification in that case.

Hmmm, is there a way we can determine whether the file system is being
exported via nfsv4 is running, without using MS_I_VERSION?  Maybe a
new flag?  We could just disable lazytime if nfsd is running but
otherwise always increment i_version if lazytime is enabled.  My main
concern with i_version updates was not the in-memry update of
i_version, but rather the unnecessary extra metadata write "tax" that
would be inflicted on all users, including the many that aren't
serving files via NFSv4.

That way, if you shutdown the nfsv4 server, i_version would still be
updated, but we wouldn't be forcing the writes to disk, but then when
nfs v4 server is updated again, the i_version tax would be paid again.

  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

2014-11-24 Thread Theodore Ts&#x27;o
On Mon, Nov 24, 2014 at 05:38:30PM +0100, David Sterba wrote:
> 
> It is necessary and the whole .update_time callback was added
> intentionally, see commits
> 
> c3b2da314834499f34cba94f7053e55f6d6f92d8
> fs: introduce inode operation ->update_time
> 
> e41f941a23115e84a8550b3d901a13a14b2edc2f
> Btrfs: move over to use ->update_time

Being able to signal an error if the time update fails is still
possible even if we drop update_time(), because the new write_time()
function will return an error.

> 2bc5565286121d2a77ccd728eb3484dff2035b58
> Btrfs: don't update atime on RO subvolumes

Yes, but this doesn't answer my question about other places where the
VFS is only checking MS_RDONLY and MNT_READONLY besides just
update_atime().  Maybe we should be exposing an "is_readonly(inode)"
inode operations function to address this?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v2 3/5] vfs: don't let the dirty time inodes get more than a day stale

2014-11-24 Thread Theodore Ts&#x27;o
On Mon, Nov 24, 2014 at 01:27:21PM +0100, Rasmus Villemoes wrote:
> On Sat, Nov 22 2014, Theodore Ts'o  wrote:
> 
> > Guarantee that the on-disk timestamps will be no more than 24 hours
> > stale.
> >
> > +   unsigned short days_since_boot = jiffies / (HZ * 86400);
> 
> This seems to wrap every 49 days (assuming 32 bit jiffies and HZ==1000),
> so on-disk updates can be delayed indefinitely, assuming just the right
> delays between writes.

Good point, I'll fix this.

> Would it make sense to introduce days_since_boot as a global variable
> and avoid these issues? This would presumably also make update_time a
> few cycles faster (avoiding a division-by-constant), but not sure if
> that's important. And something of course needs to update
> days_since_boot, but that should be doable.

I can do this fairly simply like this:

get_monotonic_boottime(&uptime);
daycode = uptime.tv_sec / (HZ * 86400);

and we only need to do this if lazytime is set, and the inode isn't
marked as I_DIRTY_TIME:

if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
!(flags & S_VERSION)) {
if (inode->i_state & I_DIRTY_TIME)
return 0;
get_monotonic_boottime(&uptime);
daycode = do_div64(uptime.tv_sec do_div, (HZ * 86400));
if (!inode->i_ts_dirty_day ||
inode->i_ts_dirty_day == daycode) {
spin_lock(&inode->i_lock);
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
inode->i_ts_dirty_day = daycode;
return 0;
}
}

So I'm not entirely sure it's worth it to create a global variable for
days since boot; I've been runnin with this patch in my laptop, we
wouldn't be triggering the get_monotonic_bootime() function all that
often.  (Since once the dirty_time flg is set, we don't need to check
about whether we need to set it again.)  And if we *did* care, it
would be simple enough to use a static counter which only recalculates
daycode every 30 or 60 minutes.


Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

2014-11-24 Thread Theodore Ts&#x27;o
On Mon, Nov 24, 2014 at 07:21:01AM -0800, Christoph Hellwig wrote:
> On Fri, Nov 21, 2014 at 02:59:21PM -0500, Theodore Ts'o wrote:
> > We needed to preserve update_time() because btrfs wants to have a
> > special btrfs_root_readonly() check; otherwise we could drop the
> > update_time() inode operation entirely.
> 
> Can't btrfs just set the immutable flag on every inode that is read
> when the root has the BTRFS_ROOT_SUBVOL_RDONLY flag?  That would
> cut down the places that need this check to the ioctl path so that
> we prevent users from clearling the immutable flag.

Sounds like a good plan to me, although I'm not sure I understand how
BTRFS_ROOT_SUBVOL_RDONLY flag works, since I would have thought there
are all sorts of places in the VFS layer where it is currently
checking MS_RDONLY and MNT_READONLY and _not_ checking
BTRFS_ROOT_SUBVOL_RDONLY isn't causing other problems.

But unless there's something more subtle going on, it would seem to me
that setting the immutable flag on each inode would be a better way to
go in any case.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v2 0/5] add support for a lazytime mount option

2014-11-24 Thread Theodore Ts&#x27;o
On Mon, Nov 24, 2014 at 01:07:55AM -0800, Christoph Hellwig wrote:
> What's the test coverage for this?  xfstest generic/192 tests that
> atime is persisted over remounts, which we had a bug with when XFS
> used to have a lazy atime implementation somewhat similar to the
> proposal.
> 
> We should have something similar for c/mtime as well.  Also a test to
> ensure timestamps are persisted afer a fsync, although right now I can't
> imagine how to do that genericly as no other filesystem seems to have
> an equivaent to XFS_IOC_GOINGDOWN.

generic/003 will show up problems if there are any [acm]time
persistences across remounts, so we have it already.  An earlier
(buggy) version of this patch actually tripped generic/003, so I can
attest that it works.

As far as testing to make sure timestamps are persisted after an
fsync, we should be able to do something genericly using dm_flaky, I
would imagine.  We'll need to suppress this in some circumstances
where we know that the file system doesn't have a journal enabled (and
thus has no such guarantees) but I have that issue today with the
various dm_flaky tests, and what I would probably suggest doing is
putting all of the dm_flaky tests in a separate xfstests group, so
that when I test file system configurations in nojournal mode, I can
suppress all of the dm_flakey tests very easily.

> It seems you also handle i_version updates lazily. although that's
> not mentioned anywhere.  I actually have a clarification request out on
> the IETF NFSv4 list about the persistance requirements for the change
> counter but I've not seen an answer to it yet.

If we want to be paranoid, we handle i_version updates non-lazily; I
can see arguments in favor of that.

Ext4 only enables MS_I_VERSION if the user asks for it explicitly, so
it wouldn't cause me any problems.  However, xfs and btrfs enables it
by default, so that means xfs and btrfs wouldn't see the benefits of
lazytime (if you're going to have to push I_VERSION to disk, you might
as well update the [acm]time while you're at it).  I've always thought
that we *should* do is to only enable it if nfsv4 is serving the file
system, and not otherwise, though, which would also give us
consistency across all the file systems.

What do folks think?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v2 0/5] add support for a lazytime mount option

2014-11-22 Thread Theodore Ts&#x27;o
This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load.  So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code.  It is purely optional,
however.

Please note the changes to update_time() and the new write_time() inode
operations functions, which impact btrfs and xfs.  The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right.   Thanks!!

Any objections to my carrying these patches in the ext4 git tree?

Changes since -v1:
   - Added explanatory comments in update_time() regarding i_ts_dirty_days
   - Fix type used for days_since_boot
   - Improve SMP scalability in update_time and ext4_update_other_inodes_time
   - Added tracepoints to help test and characterize how often and under
 what circumstances inodes have their timestamps lazily updated

Theodore Ts'o (5):
  fs: split update_time() into update_time() and write_time()
  vfs: add support for a lazytime mount option
  vfs: don't let the dirty time inodes get more than a day stale
  vfs: add lazytime tracepoints for better debugging
  ext4: add support for a lazytime mount option

 fs/btrfs/inode.c|  10 +
 fs/ext4/inode.c |  48 ++--
 fs/ext4/super.c |   9 
 fs/fs-writeback.c   |  42 +-
 fs/inode.c  | 104 +++-
 fs/proc_namespace.c |   1 +
 fs/sync.c   |   7 +++
 fs/xfs/xfs_iops.c   |  39 +++--
 include/linux/fs.h  |   5 +++
 include/trace/events/ext4.h |  29 
 include/trace/events/fs.h   |  56 
 include/uapi/linux/fs.h |   1 +
 12 files changed, 313 insertions(+), 38 deletions(-)
 create mode 100644 include/trace/events/fs.h

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v2 2/5] vfs: add support for a lazytime mount option

2014-11-22 Thread Theodore Ts&#x27;o
Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.

For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c   | 38 +-
 fs/inode.c  | 20 
 fs/proc_namespace.c |  1 +
 fs/sync.c   |  7 +++
 include/linux/fs.h  |  1 +
 include/uapi/linux/fs.h |  1 +
 6 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct 
writeback_control *wbc)
if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state &= ~I_DIRTY_PAGES;
dirty = inode->i_state & I_DIRTY;
-   inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+   inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
spin_unlock(&inode->i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
 }
 
+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+   struct inode *inode, *old_inode = NULL;
+
+   WARN_ON(!rwsem_is_locked(&sb->s_umount));
+   spin_lock(&inode_sb_list_lock);
+   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+   int dirty_time;
+
+   spin_lock(&inode->i_lock);
+   if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
+   dirty_time = inode->i_state & I_DIRTY_TIME;
+   __iget(inode);
+   spin_unlock(&inode->i_lock);
+   spin_unlock(&inode_sb_list_lock);
+
+   iput(old_inode);
+   old_inode = inode;
+
+   if (dirty_time)
+   mark_inode_dirty(inode);
+   cond_resched();
+   spin_lock(&inode_sb_list_lock);
+   }
+   spin_unlock(&inode_sb_list_lock);
+   iput(old_inode);
+}
+
 /**
  * writeback_inodes_sb_nr -writeback dirty inodes from given super_block
  * @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+   flush_sb_dirty_time(sb);
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
 
diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..11fe81b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(!list_empty(&inode->i_lru));
 
+   if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+   if (inode->i_op->write_time)
+   inode->i_op->write_time(inode);
+   else if (inode->i_sb->s_op->write_inode) {
+   struct writeback_control wbc = {
+   .sync_mode = WB_SYNC_NONE,
+   };
+   mark_inode_dirty(inode);
+   inode->i_sb->s_op->write_inode(inode, &wbc);
+   }
+   }
+
if (!list_empty(&inode->i_wb_list))
inode_wb_list_del(inode);
 
@@ -1515,6 +1527,14 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
   

[PATCH-v2 5/5] ext4: add support for a lazytime mount option

2014-11-22 Thread Theodore Ts&#x27;o
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/ext4/inode.c | 48 ++---
 fs/ext4/super.c |  9 +
 fs/inode.c  | 36 ++
 include/linux/fs.h  |  2 ++
 include/trace/events/ext4.h | 29 +++
 5 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3356ab5..03149b4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4163,6 +4163,50 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+ unsigned long orig_ino, char *buf)
+{
+   struct ext4_inode_info  *ei;
+   struct ext4_inode   *raw_inode;
+   unsigned long   ino;
+   struct inode*inode;
+   int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+   int inode_size = EXT4_INODE_SIZE(sb);
+
+   ino = orig_ino & ~(inodes_per_block - 1);
+   for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+   if (ino == orig_ino)
+   continue;
+   inode = find_active_inode_nowait(sb, ino);
+   if (!inode ||
+   (inode->i_state & I_DIRTY_TIME) == 0 ||
+   !spin_trylock(&inode->i_lock)) {
+   iput(inode);
+   continue;
+   }
+   inode->i_state &= ~I_DIRTY_TIME;
+   inode->i_ts_dirty_day = 0;
+   spin_unlock(&inode->i_lock);
+
+   ei = EXT4_I(inode);
+   raw_inode = (struct ext4_inode *) buf;
+
+   spin_lock(&ei->i_raw_lock);
+   EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+   ext4_inode_csum_set(inode, raw_inode, ei);
+   spin_unlock(&ei->i_raw_lock);
+   trace_ext4_other_inode_update_time(inode, orig_ino);
+   iput(inode);
+   }
+}
+
+
+/*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
  * buffer_head in the inode location struct.
@@ -4260,7 +4304,6 @@ static int ext4_do_update_inode(handle_t *handle,
for (block = 0; block < EXT4_N_BLOCKS; block++)
raw_inode->i_block[block] = ei->i_data[block];
}
-
if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
if (ei->i_extra_isize) {
@@ -4271,10 +4314,9 @@ static int ext4_do_update_inode(handle_t *handle,
cpu_to_le16(ei->i_extra_isize);
}
}
-
ext4_inode_csum_set(inode, raw_inode, ei);
-
spin_unlock(&ei->i_raw_lock);
+   ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
 
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4b79f39..1ac1914 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1133,6 +1133,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+   Opt_lazytime, Opt_nolazytime,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
{Opt_i_version, "i_version"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
+   {Opt_lazytime, "lazytime"},
+   {Opt_nolazytime, "nolazytime"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_removed, "mblk_io_submit"},
{Opt_removed, "nomblk_io_submit"},
@@ -1450,6 +1453,12 @@ static int handle_mount_opt(struc

[PATCH-v2 1/5] fs: split update_time() into update_time() and write_time()

2014-11-22 Thread Theodore Ts&#x27;o
In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Currently, only btrfs and xfs uses update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.

Signed-off-by: Theodore Ts'o 
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/inode.c   | 10 ++
 fs/inode.c | 29 ++---
 fs/xfs/xfs_iops.c  | 39 ---
 include/linux/fs.h |  1 +
 4 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct 
timespec *now,
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
+   return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
return btrfs_dirty_inode(inode);
 }
 
@@ -9462,6 +9467,7 @@ static const struct inode_operations 
btrfs_dir_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
.tmpfile= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations 
btrfs_file_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
.getattr= btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations 
btrfs_special_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink   = generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
.listxattr  = btrfs_listxattr,
.removexattr= btrfs_removexattr,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
-   if (inode->i_op->update_time)
-   return inode->i_op->update_time(inode, time, flags);
-
-   if (flags & S_ATIME)
-   inode->i_atime = *time;
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
-   inode->i_ctime = *time;
-   if (flags & S_MTIME)
-   inode->i_mtime = *time;
+   int ret;
+
+   if (inode->i_op->update_time) {
+   ret = inode->i_op->update_time(inode, time, flags);
+   if (ret)
+   return ret;
+   } else {
+   if (flags & S_ATIME)
+   inode->i_atime = *time;
+   if (flags & S_VERSION)
+   inode_inc_iversion(inode);
+   if (flags & S_CTIME)
+   inode->i_ctime = *time;
+   if (flags & S_MTIME)
+   inode->i_mtime = *time;
+   }
+   if (inode->i_op->write_time)
+   return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
return 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..0e9653c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -984,10 +984,8 @@ xfs_vn_setattr(
 }
 
 STATIC int
-xfs_vn_update_time(
-   struct inode*inode,
-   struct timespec *now,
-   int flags)
+xfs_vn_write_time(
+   struct inode*inode)
 {
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip->i_mount;
@@ -1004,21 +1002,16 @@ xfs_vn_update_time(

[PATCH-v2 3/5] vfs: don't let the dirty time inodes get more than a day stale

2014-11-22 Thread Theodore Ts&#x27;o
Guarantee that the on-disk timestamps will be no more than 24 hours
stale.

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c  |  1 +
 fs/inode.c | 16 +++-
 include/linux/fs.h |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ce7de22..eb04277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
trace_writeback_dirty_inode_start(inode, flags);
 
+   inode->i_ts_dirty_day = 0;
if (sb->s_op->dirty_inode)
sb->s_op->dirty_inode(inode, flags);
 
diff --git a/fs/inode.c b/fs/inode.c
index 11fe81b..0d939a8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
+   unsigned short days_since_boot = jiffies / (HZ * 86400);
int ret;
 
if (inode->i_op->update_time) {
@@ -1527,14 +1528,27 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
}
-   if (inode->i_sb->s_flags & MS_LAZYTIME) {
+   /*
+* If i_ts_dirty_day is zero, then either we have not deferred
+* timestamp updates, or the system has been up for less than
+* a day (so days_since_boot is zero), so we defer timestamp
+* updates in that case and set the I_DIRTY_TIME flag.  If a
+* day or more has passed, then i_ts_dirty_day will be
+* different from days_since_boot, and then we should update
+* the on-disk inode and then we can clear i_ts_dirty_day.
+*/
+   if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
+   (!inode->i_ts_dirty_day ||
+inode->i_ts_dirty_day == days_since_boot)) {
if (inode->i_state & I_DIRTY_TIME)
return 0;
spin_lock(&inode->i_lock);
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
+   inode->i_ts_dirty_day = days_since_boot;
return 0;
}
+   inode->i_ts_dirty_day = 0;
if (inode->i_op->write_time)
return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 489b2f2..e3574cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
+   unsigned short  i_ts_dirty_day;
unsigned inti_blkbits;
blkcnt_ti_blocks;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH-v2 4/5] vfs: add lazytime tracepoints for better debugging

2014-11-22 Thread Theodore Ts&#x27;o
Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c |  5 -
 fs/inode.c|  5 +
 include/trace/events/fs.h | 56 +++
 3 files changed, 65 insertions(+), 1 deletion(-)
 create mode 100644 include/trace/events/fs.h

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index eb04277..cab2d6d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 /*
@@ -1304,8 +1305,10 @@ static void flush_sb_dirty_time(struct super_block *sb)
iput(old_inode);
old_inode = inode;
 
-   if (dirty_time)
+   if (dirty_time) {
+   trace_fs_lazytime_flush(inode);
mark_inode_dirty(inode);
+   }
cond_resched();
spin_lock(&inode_sb_list_lock);
}
diff --git a/fs/inode.c b/fs/inode.c
index 0d939a8..5a9a7b0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -20,6 +20,9 @@
 #include 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 /*
  * Inode locking rules:
  *
@@ -544,6 +547,7 @@ static void evict(struct inode *inode)
mark_inode_dirty(inode);
inode->i_sb->s_op->write_inode(inode, &wbc);
}
+   trace_fs_lazytime_evict(inode);
}
 
if (!list_empty(&inode->i_wb_list))
@@ -1546,6 +1550,7 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
inode->i_ts_dirty_day = days_since_boot;
+   trace_fs_lazytime_defer(inode);
return 0;
}
inode->i_ts_dirty_day = 0;
diff --git a/include/trace/events/fs.h b/include/trace/events/fs.h
new file mode 100644
index 000..ca06d5c
--- /dev/null
+++ b/include/trace/events/fs.h
@@ -0,0 +1,56 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM fs
+
+#if !defined(_TRACE_FS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_FS_H
+
+#include 
+
+DECLARE_EVENT_CLASS(fs__inode,
+   TP_PROTO(struct inode *inode),
+
+   TP_ARGS(inode),
+
+   TP_STRUCT__entry(
+   __field(dev_t,  dev )
+   __field(ino_t,  ino )
+   __field(uid_t,  uid )
+   __field(gid_t,  gid )
+   __field(__u16, mode )
+   ),
+
+   TP_fast_assign(
+   __entry->dev= inode->i_sb->s_dev;
+   __entry->ino= inode->i_ino;
+   __entry->uid= i_uid_read(inode);
+   __entry->gid= i_gid_read(inode);
+   __entry->mode   = inode->i_mode;
+   ),
+
+   TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->mode,
+ __entry->uid, __entry->gid)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_defer,
+   TP_PROTO(struct inode *inode),
+
+   TP_ARGS(inode)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_evict,
+   TP_PROTO(struct inode *inode),
+
+   TP_ARGS(inode)
+);
+
+DEFINE_EVENT(fs__inode, fs_lazytime_flush,
+   TP_PROTO(struct inode *inode),
+
+   TP_ARGS(inode)
+);
+#endif /* _TRACE_FS_H */
+
+/* This part must be outside protection */
+#include 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

2014-11-21 Thread Theodore Ts&#x27;o
Out of curiosity, why does btrfs_update_time() need to call
btrfs_root_readonly()?  Why can't it just depend on the
__mnt_want_write() call in touch_atime()?

Surely if there are times when it's not OK to write into a btrfs file
system and mnt_is_readonly() returns false, the VFS is going to get
very confused abyway.

If the btrfs_update_time() is not necessary, then we could drop
btrfs_update_time() and update_time() from the inode operations
entirely, and depend on the VFS-level code in update_time().

   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-21 Thread Theodore Ts&#x27;o
On Fri, Nov 21, 2014 at 02:19:07PM -0600, Andreas Dilger wrote:
> > -   if (inode->i_sb->s_flags & MS_LAZYTIME) {
> > +   if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
> > +   (!inode->i_ts_dirty_day ||
> > +inode->i_ts_dirty_day == days_since_boot)) {
> > spin_lock(&inode->i_lock);
> > inode->i_state |= I_DIRTY_TIME;
> > spin_unlock(&inode->i_lock);
> > +   inode->i_ts_dirty_day = days_since_boot;
> 
> It isn't clear if this is correct.  It looks like the condition will
> only be entered if i_ts_dirty_day == days_since_boot, but that is only
> set inside the condition?

If i_ts_dirty_day is zero, the timestamps don't have to written to
disk.  This is either because the inode has been written to disk, or
the system has been up for less than a day, such that when we last a
lazy mtime update (i.e., we skipped the call to mark_inode_dirty)
since jiffies / (HZ * 86400) was zero.

If it is non-zero, then the timestamps were updated but were not sent
to disk N days since the system was booted.  So long as it remains N
days since the system was booted, we can skip calling
mark_inode_dirty().  But once it becomes N+1 days since the system was
booted, then we will call mark_inode_dirty() and set i_ts_dirty_day to
zero.

I'll add a comment so it's a bit more obvious what we're doing here,
but I'm pretty sure we currently have is in fact correct.

> and "days_since_boot" should be declared unsigned short so it wraps
> in the same way as i_ts_dirty_day

Good point, thanks.  This will only be an issue after the system has
been up for almost 90 years, but we might as well get it right,

  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-21 Thread Theodore Ts&#x27;o
Guarantee that the on-disk timestamps will be no more than 24 hours
stale.

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c  | 1 +
 fs/inode.c | 7 ++-
 include/linux/fs.h | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ce7de22..eb04277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
trace_writeback_dirty_inode_start(inode, flags);
 
+   inode->i_ts_dirty_day = 0;
if (sb->s_op->dirty_inode)
sb->s_op->dirty_inode(inode, flags);
 
diff --git a/fs/inode.c b/fs/inode.c
index 6e91aca..f0d6232 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
+   int days_since_boot = jiffies / (HZ * 86400);
int ret;
 
if (inode->i_op->update_time) {
@@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
}
-   if (inode->i_sb->s_flags & MS_LAZYTIME) {
+   if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
+   (!inode->i_ts_dirty_day ||
+inode->i_ts_dirty_day == days_since_boot)) {
spin_lock(&inode->i_lock);
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
+   inode->i_ts_dirty_day = days_since_boot;
return 0;
}
+   inode->i_ts_dirty_day = 0;
if (inode->i_op->write_time)
return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 489b2f2..e3574cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
+   unsigned short  i_ts_dirty_day;
unsigned inti_blkbits;
blkcnt_ti_blocks;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] add support for a lazytime mount option

2014-11-21 Thread Theodore Ts&#x27;o
This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load.  So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code.  It is purely optional,
however.

Please note the changes to update_time() and the new write_time() inode
operations functions, which impact btrfs and xfs.  The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right.   Thanks!!

Theodore Ts'o (4):
  fs: split update_time() into update_time() and write_time()
  vfs: add support for a lazytime mount option
  vfs: don't let the dirty time inodes get more than a day stale
  ext4: add support for a lazytime mount option

 fs/btrfs/inode.c| 10 ++
 fs/ext4/inode.c | 46 +-
 fs/ext4/super.c |  9 +
 fs/fs-writeback.c   | 39 +-
 fs/inode.c  | 88 ++---
 fs/proc_namespace.c |  1 +
 fs/sync.c   |  7 
 fs/xfs/xfs_iops.c   | 39 +-
 include/linux/fs.h  |  5 +++
 include/uapi/linux/fs.h |  1 +
 10 files changed, 209 insertions(+), 36 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] ext4: add support for a lazytime mount option

2014-11-21 Thread Theodore Ts&#x27;o
Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/ext4/inode.c| 46 +-
 fs/ext4/super.c|  9 +
 fs/inode.c | 36 
 include/linux/fs.h |  2 ++
 4 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3356ab5..07ceafb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4163,6 +4163,49 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+ unsigned long orig_ino, char *buf)
+{
+   struct ext4_inode_info  *ei;
+   struct ext4_inode   *raw_inode;
+   unsigned long   ino;
+   struct inode*inode;
+   int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+   int inode_size = EXT4_INODE_SIZE(sb);
+
+   ino = orig_ino & ~(inodes_per_block - 1);
+   for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+   if (ino == orig_ino)
+   continue;
+   inode = find_active_inode_nowait(sb, ino);
+   if (!inode ||
+   (inode->i_state & I_DIRTY_TIME) == 0) {
+   iput(inode);
+   continue;
+   }
+   raw_inode = (struct ext4_inode *) buf;
+   ei = EXT4_I(inode);
+
+   spin_lock(&inode->i_lock);
+   inode->i_state &= ~I_DIRTY_TIME;
+   inode->i_ts_dirty_day = 0;
+   spin_unlock(&inode->i_lock);
+
+   spin_lock(&ei->i_raw_lock);
+   EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+   ext4_inode_csum_set(inode, raw_inode, ei);
+   spin_unlock(&ei->i_raw_lock);
+   iput(inode);
+   }
+}
+
+
+/*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
  * buffer_head in the inode location struct.
@@ -4273,9 +4316,10 @@ static int ext4_do_update_inode(handle_t *handle,
}
 
ext4_inode_csum_set(inode, raw_inode, ei);
-
spin_unlock(&ei->i_raw_lock);
 
+   ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
+
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
if (!err)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4b79f39..1ac1914 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1133,6 +1133,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+   Opt_lazytime, Opt_nolazytime,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
{Opt_i_version, "i_version"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
+   {Opt_lazytime, "lazytime"},
+   {Opt_nolazytime, "nolazytime"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_removed, "mblk_io_submit"},
{Opt_removed, "nomblk_io_submit"},
@@ -1450,6 +1453,12 @@ static int handle_mount_opt(struct super_block *sb, char 
*opt, int token,
case Opt_i_version:
sb->s_flags |= MS_I_VERSION;
return 1;
+   case Opt_lazytime:
+   sb->s_flags |= MS_LAZYTIME;
+   return 1;
+   case Opt_nolazytime:
+   sb->s_flags &= ~MS_LAZYTIME;
+   return 1;
}
 
for (m = ext4_mount_opts; m->token != Opt_err; m++)
diff --git a/fs/inode.c b/fs/inode.c
index f0d6232..89cfca7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1292,6 +1292,42 @@ struct inode *ilookup(struct super_block *sb, unsigned 
long ino)
 }
 EXPORT_SYMBOL(ilookup);
 
+/**
+ * find_

[PATCH 1/4] fs: split update_time() into update_time() and write_time()

2014-11-21 Thread Theodore Ts&#x27;o
In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Currently, only btrfs and xfs uses update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.

Signed-off-by: Theodore Ts'o 
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/inode.c   | 10 ++
 fs/inode.c | 29 ++---
 fs/xfs/xfs_iops.c  | 39 ---
 include/linux/fs.h |  1 +
 4 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct 
timespec *now,
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
+   return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
return btrfs_dirty_inode(inode);
 }
 
@@ -9462,6 +9467,7 @@ static const struct inode_operations 
btrfs_dir_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
.tmpfile= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations 
btrfs_file_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
.getattr= btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations 
btrfs_special_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink   = generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
.listxattr  = btrfs_listxattr,
.removexattr= btrfs_removexattr,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
-   if (inode->i_op->update_time)
-   return inode->i_op->update_time(inode, time, flags);
-
-   if (flags & S_ATIME)
-   inode->i_atime = *time;
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
-   inode->i_ctime = *time;
-   if (flags & S_MTIME)
-   inode->i_mtime = *time;
+   int ret;
+
+   if (inode->i_op->update_time) {
+   ret = inode->i_op->update_time(inode, time, flags);
+   if (ret)
+   return ret;
+   } else {
+   if (flags & S_ATIME)
+   inode->i_atime = *time;
+   if (flags & S_VERSION)
+   inode_inc_iversion(inode);
+   if (flags & S_CTIME)
+   inode->i_ctime = *time;
+   if (flags & S_MTIME)
+   inode->i_mtime = *time;
+   }
+   if (inode->i_op->write_time)
+   return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
return 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..0e9653c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -984,10 +984,8 @@ xfs_vn_setattr(
 }
 
 STATIC int
-xfs_vn_update_time(
-   struct inode*inode,
-   struct timespec *now,
-   int flags)
+xfs_vn_write_time(
+   struct inode*inode)
 {
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip->i_mount;
@@ -1004,21 +1002,16 @@ xfs_vn_update_time(

[PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-21 Thread Theodore Ts&#x27;o
Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.

For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c   | 38 +-
 fs/inode.c  | 18 ++
 fs/proc_namespace.c |  1 +
 fs/sync.c   |  7 +++
 include/linux/fs.h  |  1 +
 include/uapi/linux/fs.h |  1 +
 6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct 
writeback_control *wbc)
if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state &= ~I_DIRTY_PAGES;
dirty = inode->i_state & I_DIRTY;
-   inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+   inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
spin_unlock(&inode->i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
 }
 
+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+   struct inode *inode, *old_inode = NULL;
+
+   WARN_ON(!rwsem_is_locked(&sb->s_umount));
+   spin_lock(&inode_sb_list_lock);
+   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+   int dirty_time;
+
+   spin_lock(&inode->i_lock);
+   if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
+   dirty_time = inode->i_state & I_DIRTY_TIME;
+   __iget(inode);
+   spin_unlock(&inode->i_lock);
+   spin_unlock(&inode_sb_list_lock);
+
+   iput(old_inode);
+   old_inode = inode;
+
+   if (dirty_time)
+   mark_inode_dirty(inode);
+   cond_resched();
+   spin_lock(&inode_sb_list_lock);
+   }
+   spin_unlock(&inode_sb_list_lock);
+   iput(old_inode);
+}
+
 /**
  * writeback_inodes_sb_nr -writeback dirty inodes from given super_block
  * @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+   flush_sb_dirty_time(sb);
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
 
diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..6e91aca 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(!list_empty(&inode->i_lru));
 
+   if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+   if (inode->i_op->write_time)
+   inode->i_op->write_time(inode);
+   else if (inode->i_sb->s_op->write_inode) {
+   struct writeback_control wbc = {
+   .sync_mode = WB_SYNC_NONE,
+   };
+   mark_inode_dirty(inode);
+   inode->i_sb->s_op->write_inode(inode, &wbc);
+   }
+   }
+
if (!list_empty(&inode->i_wb_list))
inode_wb_list_del(inode);
 
@@ -1515,6 +1527,12 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = 

Re: ext4 vs btrfs performance on SSD array

2014-09-02 Thread Theodore Ts&#x27;o
On Tue, Sep 02, 2014 at 04:20:24PM +0200, Jan Kara wrote:
> On Tue 02-09-14 07:31:04, Ted Tso wrote:
> > >  - the very small max readahead size
> > 
> > For things like the readahead size, that's probably something that we
> > should autotune, based the time it takes to read N sectors.  i.e.,
> > start N relatively small, such as 128k, and then bump it up based on
> > how long it takes to do a sequential read of N sectors until it hits a
> > given tunable, which is specified in milliseconds instead of kilobytes.
>   Actually the amount of readahead we do is autotuned (based on hit rate).
> So I would keep the setting in sysfs as the maximum size adaptive readahead
> can ever read and we can bump it up. We can possibly add another feedback
> into the readahead code to tune actualy readahead size depending on device
> speed but we'd have to research exactly what algorithm would work best.

I do think we will need to add a time based cap when bump up the max
adaptive readahead; otherwise what could happen is that if we are
streaming off of a slow block device, the readhaead could easily grow
to the point where it starts affecting the latency of competing read
requests to the slow block device.

I suppose we could make the argument that it's not needed, because most of
situations where we might be using slow block devices, the streaming
reader will likely have exclusive use of the device, since no one
would be crazy enough to say, try to run a live CD-ROM image when USB
sticks are so cheap.  :-)

So maybe in practice it won't matter, but I think some kind of time
based cap would probably be a good idea.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ext4 vs btrfs performance on SSD array

2014-09-02 Thread Theodore Ts&#x27;o
>  - the very small max readahead size

For things like the readahead size, that's probably something that we
should autotune, based the time it takes to read N sectors.  i.e.,
start N relatively small, such as 128k, and then bump it up based on
how long it takes to do a sequential read of N sectors until it hits a
given tunable, which is specified in milliseconds instead of kilobytes.

>  - replacing cfq with deadline (or noop)

Unfortunately, that will break ionice and a number of other things...

  - Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Add support to check for FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_ZERO_RANGE crap modes

2014-08-01 Thread Theodore Ts&#x27;o
On Thu, Jul 31, 2014 at 08:09:10PM +0100, Hugo Mills wrote:
> On Thu, Jul 31, 2014 at 01:53:33PM -0400, Nicholas Krause wrote:
> > This adds checks for the stated modes as if they are crap we will return 
> > error
> > not supported.
> 
>You've just enabled two options, but you haven't actually
> implemented the code behind it. I would tell you *NOT* to do anything
> else on this work until you can answer the question: What happens if
> you apply this patch, create a large file called "foo.txt", and then a
> userspace program executes the following code?
> 
> int fd = open("foo.txt", O_RDWR);
> fallocate(fd, FALLOCATE_FL_COLLAPSE_RANGE, 50, 50);
> 
>Try it on a btrfs filesystem, both with and without your patch.
> Also try it on an ext4 filesystem.
> 
>Once you've done all of that, reply to this mail and tell me what
> the problem is with this patch. You need to make two answers: what are
> the technical problems with the patch? What errors have you made in
> the development process?

There are also the conceptual failures.  Before you do anything else,
you need to be able to answer the question, "what do you think the
flags FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_ZERO_RANGE are supposed
to do?"  What are the possible appropriate things for btrfs to do if
it sees these flags?  (Hint: there is more than one correct answer,
and its current choice is one of them.  What is the other one?)

Nick, the fact that you call these modes "crap" is a hint that you
have a fundamental lack of understanding --- and before you waste more
of kernel developers' time, you need to get that understanding first,
for any bit of code that you propose to "improve".

This is why I suggested that you work on userspace testing scripts
first.  It's pretty clear you are (a) incredibly sloppy, and (b)
lacking conceptual understanding of a lot of technical details, and
(c) even worse, aren't letting this lack of understanding stop you
from posting patches.  As a result you are adding negative value to
whatever project or subsystem you try to attach yourself to --- you're
not helping.

- Ted

P.S.   As a further hint, change the above code to read:

int fd = open("foo.txt", O_RDWR);
if (fallocate(fd, FALLOCATE_FL_COLLAPSE_RANGE, 4096, 8192) < 0)
perror("fallocate");

And then run "filefrag -vs foo.txt" before and after running the above
code fragment and then try something like this:

 cp /usr/share/dict/words foo.txt
 filefrag -vs foo.txt
 ls -l foo.txt
 /tmp/fallocate-test-prog
 filefrag -vs foo.txt
 ls -l foo.txt
 diff /usr/share/dict/words foo.txt

Try doing this on an ext4 or xfs system and a btrfs file system.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >