date:20170424

Re: [PATCH] Tidy while loop in end_compressed_writeback

2017-04-24 Thread Sahil Kang


Hi David,

I think my previous email may have failed to send.
Can you merge my patch below?
It's a minor update that I think improves readability.

Thanks,
Sahil

On 04/19/2017 05:43 AM, Sahil Kang wrote:

Hi,

This is my first patch so it's a minor code change.
I think removing the early continue from the loop makes the function a 
little easier to follow.

Please have a look and I'd appreciate any feedback.

Thanks,
Sahil


From 98afe83a570180e841fefe3fd48d450accc42ea3 Mon Sep 17 00:00:00 2001
From: Sahil Kang 
Date: Wed, 19 Apr 2017 04:47:00 -0700
Subject: [PATCH] Tidy while loop in end_compressed_writeback

Instead of continuing early in the loop when ret is 0,
we can decrement/increment nr_pages/index by 1 at the ending.
The for loop will not execute when ret is 0 anyway.

Signed-off-by: Sahil Kang 
---
 fs/btrfs/compression.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index c473c42..c653297 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -238,17 +238,14 @@ static noinline void 
end_compressed_writeback(struct inode *inode,

 ret = find_get_pages_contig(inode->i_mapping, index,
  min_t(unsigned long,
  nr_pages, ARRAY_SIZE(pages)), pages);
-if (ret == 0) {
-nr_pages -= 1;
-index += 1;
-continue;
-}
 for (i = 0; i < ret; i++) {
 if (cb->errors)
 SetPageError(pages[i]);
 end_page_writeback(pages[i]);
 page_cache_release(pages[i]);
 }
+
+ret = ret ? ret : 1; /* set ret to 1 if it's 0 */
 nr_pages -= ret;
 index += ret;
 }


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Btrfs-progs: run_next_block: Return error code when extent buffer read fails

2017-04-24 Thread Praveen K Pandey

run_next_block() ignores the return value of read_tree_block() and
always returns success status. This might cause deal_root_from_list()
to loop infinitely when reading metadata block fails. This bug fixes
the issue by extracting and returning the error code from the return
value of read_tree_block().

Bug ID : https://bugzilla.kernel.org/show_bug.cgi?id=11

Signed-off-by: Praveen K Pandey 
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 17b7efb..40e3c99 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -7625,6 +7625,7 @@ static int run_next_block(struct btrfs_root *root,
if (!extent_buffer_uptodate(buf)) {
record_bad_block_io(root->fs_info,
extent_cache, bytenr, size);
+   ret = PTR_ERR(buf);
goto out;
}
 
-- 
2.5.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 00/20] fs: introduce new writeback error reporting and convert existing API as a wrapper around it

2017-04-24 Thread Jeff Layton

v3: wb_err_t -> errseq_t
clean up places that re-set errors after calling filemap_* functions

v2: introduce wb_err_t, use atomics

Apologies for the wide posting here, but this touches a lot of areas.
This is v3 of the patchset to improve how we're tracking and reporting
errors that occur during pagecache writeback.

There are several situations where the kernel can "lose" errors that
occur during writeback, such that fsync will return success even
though it failed to write back some data previously. The basic idea
here is to have the kernel be more deliberate about the point from
which errors are checked to ensure that that doesn't happen.

Additionally, this set changes the behavior of fsync in Linux to report
writeback errors on all fds instead of just the first one. This allows
writers to reliably tell whether their data made it to the backing
device without having to coordinate fsync calls with other writers.

This set sprawls over a large swath of kernel code. I think the first 12
patches in the series are pretty straightforward and are more or less
ready for merge.

The real changes start with patch 13. That adds support for errseq_t,
builds a new writeback error tracking API on top of that, and converts
the existing code to use it. After that, there are a few cleanup patches
to eliminate some unneeded error re-setting, etc.

Unfortunately, testing this across so many filesystems is rather
difficult. I have a xfstest for block-based filesystems that uses
dm_error that I'll post soon. That works well with ext4, but btrfs and
xfs seem to go r/o soon after the first error. I also don't have a good
general method for testing this on network filesystems (yet!).

I'd like to see better testing here and am open to suggestions. I will
note that the POSIX fsync spec says this:

"It is reasonable to assert that the key aspects of fsync() are
unreasonable to test in a test suite. That does not make the function
any less valuable, just more difficult to test. [...] It would also not
be unreasonable to omit testing for fsync(), allowing it to be treated
as a quality-of-implementation issue."

Of course, they're talking about a POSIX conformance test, but I
think the same point applies here.

At this point, I'd like to start getting some of the preliminary patches
merged (the first 12 or so). Most of those aren't terribly controversial
and seem like reasonable bugfixes and cleanups. If any subsystem
maintainers want to pick those up, then please do.

After that, I'd like to get the larger changes into linux-next with an
aim for merge in v4.13 or v4.14 (depending on how testing goes).

Feedback is of course welcome!

Jeff Layton (20):
  mm: drop "wait" parameter from write_one_page
  mm: fix mapping_set_error call in me_pagecache_dirty
  buffer: use mapping_set_error instead of setting the flag
  fs: check for writeback errors after syncing out buffers in
generic_file_fsync
  orangefs: don't call filemap_write_and_wait from fsync
  dax: set errors in mapping when writeback fails
  nilfs2: set the mapping error when calling SetPageError on writeback
  mm: ensure that we set mapping error if writeout() fails
  9p: set mapping error when writeback fails in launder_page
  fuse: set mapping error in writepage_locked when it fails
  cifs: set mapping error when page writeback fails in writepage or
launder_pages
  lib: add errseq_t type and infrastructure for handling it
  fs: new infrastructure for writeback error handling and reporting
  fs: retrofit old error reporting API onto new infrastructure
  mm: remove AS_EIO and AS_ENOSPC flags
  mm: don't TestClearPageError in __filemap_fdatawait_range
  cifs: cleanup writeback handling errors and comments
  mm: clean up error handling in write_one_page
  jbd2: don't reset error in journal_finish_inode_data_buffers
  gfs2: clean up some filemap_* calls

 Documentation/filesystems/vfs.txt |   9 +-
 fs/9p/vfs_addr.c  |   5 +-
 fs/btrfs/file.c   |  10 +-
 fs/btrfs/tree-log.c   |   9 +-
 fs/buffer.c   |   2 +-
 fs/cifs/cifsfs.c  |   4 +-
 fs/cifs/file.c|  17 ++--
 fs/cifs/inode.c   |  22 ++---
 fs/dax.c  |   4 +-
 fs/exofs/dir.c|   2 +-
 fs/ext2/dir.c |   2 +-
 fs/ext2/file.c|   2 +-
 fs/f2fs/file.c|   3 +
 fs/f2fs/node.c|   6 +-
 fs/fuse/file.c|   8 +-
 fs/gfs2/glops.c   |  12 +--
 fs/gfs2/lops.c|   4 +-
 fs/gfs2/super.c   |   6 +-
 fs/jbd2/commit.c  |  13 +--
 fs/jfs/jfs_metapage.c |   4 +-
 fs/libfs.c|   3 +
 fs/minix/dir.c|   2 +-
 fs/nilfs2/segment.c   |   1 +
 fs/open.c |   3 +
 fs/orangefs/file.c|   5 +-
 fs/sysv/dir.c

[PATCH v3 03/20] buffer: use mapping_set_error instead of setting the flag

2017-04-24 Thread Jeff Layton

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
Reviewed-by: Matthew Wilcox 
---
 fs/buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9196f2a270da..70638941066d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -483,7 +483,7 @@ static void __remove_assoc_queue(struct buffer_head *bh)
list_del_init(&bh->b_assoc_buffers);
WARN_ON(!bh->b_assoc_map);
if (buffer_write_io_error(bh))
-   set_bit(AS_EIO, &bh->b_assoc_map->flags);
+   mapping_set_error(bh->b_assoc_map, -EIO);
bh->b_assoc_map = NULL;
 }
 
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 09/20] 9p: set mapping error when writeback fails in launder_page

2017-04-24 Thread Jeff Layton

launder_page is just writeback under the page lock. We still need to
mark the mapping for errors there when they occur.

Signed-off-by: Jeff Layton 
---
 fs/9p/vfs_addr.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index adaf6f6dd858..7af6e6501698 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -223,8 +223,11 @@ static int v9fs_launder_page(struct page *page)
v9fs_fscache_wait_on_page_write(inode, page);
if (clear_page_dirty_for_io(page)) {
retval = v9fs_vfs_writepage_locked(page);
-   if (retval)
+   if (retval) {
+   if (retval != -EAGAIN)
+   mapping_set_error(page->mapping, retval);
return retval;
+   }
}
return 0;
 }
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 08/20] mm: ensure that we set mapping error if writeout() fails

2017-04-24 Thread Jeff Layton

If writepage fails during a page migration, then we need to ensure that
fsync will see it by flagging the mapping.

Signed-off-by: Jeff Layton 
---
 mm/migrate.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 738f1d5f8350..3a59830bdae2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -792,7 +792,11 @@ static int writeout(struct address_space *mapping, struct 
page *page)
/* unlocked. Relock */
lock_page(page);
 
-   return (rc < 0) ? -EIO : -EAGAIN;
+   if (rc < 0) {
+   mapping_set_error(mapping, rc);
+   return -EIO;
+   }
+   return -EAGAIN;
 }
 
 /*
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 19/20] jbd2: don't reset error in journal_finish_inode_data_buffers

2017-04-24 Thread Jeff Layton

Now that we don't clear writeback errors after fetching them, there is
no need to reset them. This is also potentially racy.

Signed-off-by: Jeff Layton 
---
 fs/jbd2/commit.c | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b6b194ec1b4f..4c6262652028 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -264,17 +264,8 @@ static int journal_finish_inode_data_buffers(journal_t 
*journal,
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
-   if (err) {
-   /*
-* Because AS_EIO is cleared by
-* filemap_fdatawait_range(), set it again so
-* that user process can get -EIO from fsync().
-*/
-   mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO);
-
-   if (!ret)
-   ret = err;
-   }
+   if (err && !ret)
+   ret = err;
spin_lock(&journal->j_list_lock);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
smp_mb();
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 20/20] gfs2: clean up some filemap_* calls

2017-04-24 Thread Jeff Layton

In some places, it's trying to reset the mapping error after calling
filemap_fdatawait. That's no longer required. Also, turn several
filemap_fdatawrite+filemap_fdatawait calls into filemap_write_and_wait.
That will at least return writeback errors that occur during the write
phase.

Signed-off-by: Jeff Layton 
---
 fs/gfs2/glops.c | 12 
 fs/gfs2/lops.c  |  4 +---
 fs/gfs2/super.c |  6 ++
 3 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 5db59d444838..7362d19fdc4c 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -158,9 +158,7 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
 
gfs2_log_flush(sdp, gl, NORMAL_FLUSH);
-   filemap_fdatawrite_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
-   error = filemap_fdatawait_range(mapping, gl->gl_vm.start, 
gl->gl_vm.end);
-   mapping_set_error(mapping, error);
+   filemap_write_and_wait_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
gfs2_ail_empty_gl(gl);
 
spin_lock(&gl->gl_lockref.lock);
@@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl)
filemap_fdatawrite(metamapping);
if (ip) {
struct address_space *mapping = ip->i_inode.i_mapping;
-   filemap_fdatawrite(mapping);
-   error = filemap_fdatawait(mapping);
-   mapping_set_error(mapping, error);
+   filemap_write_and_wait(mapping);
+   } else {
+   filemap_fdatawait(metamapping);
}
-   error = filemap_fdatawait(metamapping);
-   mapping_set_error(metamapping, error);
gfs2_ail_empty_gl(gl);
/*
 * Writeback of the data mapping may cause the dirty flag to be set
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index b1f9144b42c7..565ce33dec9b 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -586,9 +586,7 @@ static void gfs2_meta_sync(struct gfs2_glock *gl)
if (mapping == NULL)
mapping = &sdp->sd_aspace;
 
-   filemap_fdatawrite(mapping);
-   error = filemap_fdatawait(mapping);
-
+   error = filemap_write_and_wait(mapping);
if (error)
gfs2_io_error(gl->gl_name.ln_sbd);
 }
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 361796a84fce..675c39566ea1 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1593,10 +1593,8 @@ static void gfs2_evict_inode(struct inode *inode)
 out_truncate:
gfs2_log_flush(sdp, ip->i_gl, NORMAL_FLUSH);
metamapping = gfs2_glock2aspace(ip->i_gl);
-   if (test_bit(GLF_DIRTY, &ip->i_gl->gl_flags)) {
-   filemap_fdatawrite(metamapping);
-   filemap_fdatawait(metamapping);
-   }
+   if (test_bit(GLF_DIRTY, &ip->i_gl->gl_flags))
+   filemap_write_and_wait(metamapping);
write_inode_now(inode, 1);
gfs2_ail_flush(ip->i_gl, 0);
 
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 15/20] mm: remove AS_EIO and AS_ENOSPC flags

2017-04-24 Thread Jeff Layton

They're no longer used.

Signed-off-by: Jeff Layton 
---
 include/linux/pagemap.h | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 32512ffc15fa..9593eac41499 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -20,13 +20,11 @@
  * Bits in mapping->flags.
  */
 enum mapping_flags {
-   AS_EIO  = 0,/* IO error on async write */
-   AS_ENOSPC   = 1,/* ENOSPC on async write */
-   AS_MM_ALL_LOCKS = 2,/* under mm_take_all_locks() */
-   AS_UNEVICTABLE  = 3,/* e.g., ramdisk, SHM_LOCK */
-   AS_EXITING  = 4,/* final truncate in progress */
+   AS_MM_ALL_LOCKS = 0,/* under mm_take_all_locks() */
+   AS_UNEVICTABLE  = 1,/* e.g., ramdisk, SHM_LOCK */
+   AS_EXITING  = 2,/* final truncate in progress */
/* writeback related tags are not used */
-   AS_NO_WRITEBACK_TAGS = 5,
+   AS_NO_WRITEBACK_TAGS = 3,
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 16/20] mm: don't TestClearPageError in __filemap_fdatawait_range

2017-04-24 Thread Jeff Layton

The -EIO returned here can end up overriding whatever error is marked in
the address space, and be returned at fsync time, even when there is a
more appropriate error stored in the mapping.

Read errors are also sometimes tracked on a per-page level using
PG_error. Suppose we have a read error on a page, and then that page is
subsequently dirtied by overwriting the whole page. Writeback doesn't
clear PG_error, so we can then end up successfully writing back that
page and still return -EIO on fsync.

Worse yet, PG_error is cleared during a sync() syscall, but the -EIO
return from that is silently discarded. Any subsystem that is relying on
PG_error to report errors during fsync can easily lose writeback errors
due to this. All you need is a stray sync() call on the box at the wrong
time and you've lost the error.

Since the handling of the PG_error flag is somewhat inconsistent across
subsystems, let's just rely on marking the address space when there are
writeback errors. Change the TestClearPageError call to ClearPageError,
and make __filemap_fdatawait_range a void return function.

Signed-off-by: Jeff Layton 
---
 mm/filemap.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index d94a76d4e023..47e7f50fb830 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -363,17 +363,16 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
-static int __filemap_fdatawait_range(struct address_space *mapping,
+static void __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
pgoff_t index = start_byte >> PAGE_SHIFT;
pgoff_t end = end_byte >> PAGE_SHIFT;
struct pagevec pvec;
int nr_pages;
-   int ret = 0;
 
if (end_byte < start_byte)
-   goto out;
+   return;
 
pagevec_init(&pvec, 0);
while ((index <= end) &&
@@ -390,14 +389,11 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
continue;
 
wait_on_page_writeback(page);
-   if (TestClearPageError(page))
-   ret = -EIO;
+   ClearPageError(page);
}
pagevec_release(&pvec);
cond_resched();
}
-out:
-   return ret;
 }
 
 /**
@@ -417,15 +413,10 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
 int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
 {
-   int ret, ret2;
errseq_t since = filemap_sample_wb_error(mapping);
 
-   ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
-   ret2 = filemap_check_wb_error(mapping, since);
-   if (!ret)
-   ret = ret2;
-
-   return ret;
+   __filemap_fdatawait_range(mapping, start_byte, end_byte);
+   return filemap_check_wb_error(mapping, since);
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 18/20] mm: clean up error handling in write_one_page

2017-04-24 Thread Jeff Layton

Don't try to check PageError since that's potentially racy and not
necessarily going to be set after writepage errors out.

Instead, sample the mapping error early on, and use that value to tell
us whether we got a writeback error since then.

Signed-off-by: Jeff Layton 
---
 mm/page-writeback.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index de0dbf12e2c1..1643456881b4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2373,11 +2373,12 @@ int do_writepages(struct address_space *mapping, struct 
writeback_control *wbc)
 int write_one_page(struct page *page)
 {
struct address_space *mapping = page->mapping;
-   int ret = 0;
+   int ret = 0, ret2;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 1,
};
+   errseq_t since = filemap_sample_wb_error(mapping);
 
BUG_ON(!PageLocked(page));
 
@@ -2386,16 +2387,14 @@ int write_one_page(struct page *page)
if (clear_page_dirty_for_io(page)) {
get_page(page);
ret = mapping->a_ops->writepage(page, &wbc);
-   if (ret == 0) {
+   if (ret == 0)
wait_on_page_writeback(page);
-   if (PageError(page))
-   ret = -EIO;
-   }
put_page(page);
} else {
unlock_page(page);
}
-   return ret;
+   ret2 = filemap_check_wb_error(mapping, since);
+   return ret ? : ret2;
 }
 EXPORT_SYMBOL(write_one_page);
 
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 17/20] cifs: cleanup writeback handling errors and comments

2017-04-24 Thread Jeff Layton

Now that writeback errors are handled on a per-file basis using the new
sequence counter method at the vfs layer, we no longer need to re-set
errors in the mapping after doing writeback in non-fsync codepaths.

Also, fix up some bogus comments.

Signed-off-by: Jeff Layton 
---
 fs/cifs/cifsfs.c |  4 +---
 fs/cifs/file.c   |  7 ++-
 fs/cifs/inode.c  | 22 +++---
 3 files changed, 10 insertions(+), 23 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index dd3f5fabfdf6..017a2d1d02c7 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -829,10 +829,8 @@ static loff_t cifs_llseek(struct file *file, loff_t 
offset, int whence)
if (!CIFS_CACHE_READ(CIFS_I(inode)) && inode->i_mapping &&
inode->i_mapping->nrpages != 0) {
rc = filemap_fdatawait(inode->i_mapping);
-   if (rc) {
-   mapping_set_error(inode->i_mapping, rc);
+   if (rc)
return rc;
-   }
}
/*
 * Some applications poll for the file length in this strange
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 4b696a23b0b1..9b4f7f182add 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -722,9 +722,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush)
cinode = CIFS_I(inode);
 
if (can_flush) {
-   rc = filemap_write_and_wait(inode->i_mapping);
-   mapping_set_error(inode->i_mapping, rc);
-
+   filemap_write_and_wait(inode->i_mapping);
if (tcon->unix_ext)
rc = cifs_get_inode_info_unix(&inode, full_path,
  inode->i_sb, xid);
@@ -3906,8 +3904,7 @@ void cifs_oplock_break(struct work_struct *work)
break_lease(inode, O_WRONLY);
rc = filemap_fdatawrite(inode->i_mapping);
if (!CIFS_CACHE_READ(cinode)) {
-   rc = filemap_fdatawait(inode->i_mapping);
-   mapping_set_error(inode->i_mapping, rc);
+   filemap_fdatawait(inode->i_mapping);
cifs_zap_mapping(inode);
}
cifs_dbg(FYI, "Oplock flush inode %p rc %d\n", inode, rc);
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index b261db34103c..a58e605240fc 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2008,10 +2008,8 @@ int cifs_getattr(const struct path *path, struct kstat 
*stat,
if (!CIFS_CACHE_READ(CIFS_I(inode)) && inode->i_mapping &&
inode->i_mapping->nrpages != 0) {
rc = filemap_fdatawait(inode->i_mapping);
-   if (rc) {
-   mapping_set_error(inode->i_mapping, rc);
+   if (rc)
return rc;
-   }
}
 
rc = cifs_revalidate_dentry_attr(dentry);
@@ -2171,15 +2169,12 @@ cifs_setattr_unix(struct dentry *direntry, struct iattr 
*attrs)
 * Attempt to flush data before changing attributes. We need to do
 * this for ATTR_SIZE and ATTR_MTIME for sure, and if we change the
 * ownership or mode then we may also need to do this. Here, we take
-* the safe way out and just do the flush on all setattr requests. If
-* the flush returns error, store it to report later and continue.
+* the safe way out and just do the flush on all setattr requests.
 *
 * BB: This should be smarter. Why bother flushing pages that
-* will be truncated anyway? Also, should we error out here if
-* the flush returns error?
+* will be truncated anyway?
 */
-   rc = filemap_write_and_wait(inode->i_mapping);
-   mapping_set_error(inode->i_mapping, rc);
+   filemap_write_and_wait(inode->i_mapping);
rc = 0;
 
if (attrs->ia_valid & ATTR_SIZE) {
@@ -2314,15 +2309,12 @@ cifs_setattr_nounix(struct dentry *direntry, struct 
iattr *attrs)
 * Attempt to flush data before changing attributes. We need to do
 * this for ATTR_SIZE and ATTR_MTIME for sure, and if we change the
 * ownership or mode then we may also need to do this. Here, we take
-* the safe way out and just do the flush on all setattr requests. If
-* the flush returns error, store it to report later and continue.
+* the safe way out and just do the flush on all setattr requests.
 *
 * BB: This should be smarter. Why bother flushing pages that
-* will be truncated anyway? Also, should we error out here if
-* the flush returns error?
+* will be truncated anyway?
 */
-   rc = filemap_write_and_wait(inode->i_mapping);
-   mapping_set_error(inode->i_mapping, rc);
+   filemap_write_and_wait(inode->i_mapping);
rc = 0;
 
if (attrs->ia_valid & ATTR_SIZE) {
-- 
2.9

[PATCH v3 14/20] fs: retrofit old error reporting API onto new infrastructure

2017-04-24 Thread Jeff Layton

Now that we have a better way to store and report errors that occur
during writeback, we need to convert the existing codebase to use it. We
could just adapt all of the filesystem code and related infrastructure
to the new API, but that's a lot of churn.

When it comes to setting errors in the mapping, filemap_set_wb_error is
a drop-in replacement for mapping_set_error. Turn that function into a
simple wrapper around the new one.

Because we want to ensure that writeback errors are always reported at
fsync time, inject filemap_report_wb_error calls much closer to the
syscall boundary, in call_fsync.

For fsync calls (and things like the nfsd equivalent), we either return
the error that the fsync operation returns, or the one returned by
filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
the latest value. This allows us to provide new fsync semantics that
will return errors that may have occurred previously and been viewed
via other file descriptors.

The final piece of the puzzle is what to do about filemap_check_errors
calls that are being called directly or via filemap_* functions. Here,
we must take a little "creative license".

Since we now handle advancing the file->f_wb_err value at the generic
filesystem layer, we no longer need those callers to clear errors out
of the mapping or advance an errseq_t.

A lot of the existing codebase relies on being getting an error back
from those functions when there is a writeback problem, so we do still
want to have them report writeback errors somehow.

When reporting writeback errors, we will always report errors that have
occurred since a particular point in time. With the old writeback error
reporting, the time we used was "since it was last tested/cleared" which
is entirely arbitrary and potentially racy. Now, we can at least report
the latest error that has occurred since an arbitrary point in time
(represented as a sampled errseq_t value).

In the case where we don't have a struct file to work with, this patch
just has the wrappers sample the current mapping->wb_err value, and use
that as an arbitrary point from which to check for errors.

That's probably not "correct" in all cases, particularly in the case of
something like filemap_fdatawait, but I'm not sure it's any worse than
what we already have, and this gives us a basis from which to work.

A lot of those callers will likely want to change to a model where they
sample the errseq_t much earlier (perhaps when starting a transaction),
store it in an appropriate place and then use that value later when
checking to see if an error occurred.

That will almost certainly take some involvement from other subsystem
maintainers. I'm quite open to adding new API functions to help enable
this if that would be helpful, but I don't really want to do that until
I better understand what's needed.

Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/vfs.txt |  9 -
 fs/btrfs/file.c   | 10 ++
 fs/btrfs/tree-log.c   |  9 ++---
 fs/f2fs/file.c|  3 +++
 fs/f2fs/node.c|  6 +-
 fs/fuse/file.c|  7 +--
 fs/libfs.c|  6 --
 include/linux/fs.h| 17 ++---
 include/linux/pagemap.h   |  8 ++--
 mm/filemap.c  | 33 +++--
 10 files changed, 44 insertions(+), 64 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index ed06fb39822b..f201a77873f7 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -577,7 +577,7 @@ written at any point after PG_Dirty is clear.  Once it is 
known to be
 safe, PG_Writeback is cleared.
 
 If there is an error during writeback, then the address_space should be
-marked with an error (typically using filemap_set_wb_error), in order to
+marked with an error (typically using mapping_set_error), in order to
 ensure that the error can later be reported to the application when an
 fsync is issued.
 
@@ -893,10 +893,9 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call. Filesystems that use the
-   pagecache should call filemap_report_wb_error before returning
-   to ensure that any errors that occurred during writeback are
-   reported and the file's error sequence advanced.
+  fsync: called by the fsync(2) system call. Errors that were previously
+recorded using mapping_set_error will automatically be returned to
+the application and the file's error sequence advanced.
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb7230b2d..e15faf240b51 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1962,6 +1962,7 @@ int btrfs_sync_file(struct file *

[PATCH v3 13/20] fs: new infrastructure for writeback error handling and reporting

2017-04-24 Thread Jeff Layton

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this. Later patches will change the existing
code to use this new infrastructure.

Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/vfs.txt | 10 +-
 fs/open.c |  3 +++
 include/linux/fs.h| 24 
 mm/filemap.c  | 38 ++
 4 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index 94dd27ef4a76..ed06fb39822b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,6 +576,11 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
+If there is an error during writeback, then the address_space should be
+marked with an error (typically using filemap_set_wb_error), in order to
+ensure that the error can later be reported to the application when an
+fsync is issued.
+
 Writeback makes use of a writeback_control structure...
 
 struct address_space_operations
@@ -888,7 +893,10 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Filesystems that use the
+   pagecache should call filemap_report_wb_error before returning
+   to ensure that any errors that occurred during writeback are
+   reported and the file's error sequence advanced.
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
diff --git a/fs/open.c b/fs/open.c
index 949cef29c3bb..88bfed8d3c88 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f,
f->f_inode = inode;
f->f_mapping = inode->i_mapping;
 
+   /* Ensure that we skip any errors that predate opening of the file */
+   f->f_wb_err = filemap_sample_wb_error(f->f_mapping);
+
if (unlikely(f->f_flags & O_PATH)) {
f->f_mode = FMODE_PATH;

[PATCH v3 12/20] lib: add errseq_t type and infrastructure for handling it

2017-04-24 Thread Jeff Layton

An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether an error has been set again
since a previous time.

It's implemented as an unsigned 32-bit value that is managed with atomic
operations. The low order bits are designated to hold an error code
(max size of MAX_ERRNO). The upper bits are used as a counter.

The API works with consumers sampling an errseq_t value at a particular
point in time. Later, that value can be used to tell whether new errors
have been set since that time.

Note that there is a 1 in 512k risk of collisions here if new errors
are being recorded frequently, since we have so few bits to use as a
counter. To mitigate this, one bit is used as a flag to tell whether the
value has been sampled since a new value was recorded. That allows
us to avoid bumping the counter if no one has sampled it since it
was last bumped.

Later patches will build on this infrastructure to change how writeback
errors are tracked in the kernel.

Signed-off-by: Jeff Layton 
---
 include/linux/errseq.h |  16 
 lib/Makefile   |   2 +-
 lib/errseq.c   | 199 +
 3 files changed, 216 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/errseq.h
 create mode 100644 lib/errseq.c

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
new file mode 100644
index ..e2d374dbf683
--- /dev/null
+++ b/include/linux/errseq.h
@@ -0,0 +1,16 @@
+#ifndef _LINUX_ERRSEQ_H
+#define _LINUX_ERRSEQ_H
+typedef u32errseq_t;
+
+void __errseq_set(errseq_t *eseq, int err);
+static inline void errseq_set(errseq_t *eseq, int err)
+{
+   /* Optimize for the common case of no error */
+   if (unlikely(err))
+   __errseq_set(eseq, err);
+}
+
+errseq_t errseq_sample(errseq_t *eseq);
+int errseq_check(errseq_t *eseq, errseq_t since);
+int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a8725..2423afef40f7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o 
random32.o \
 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-once.o refcount.o
+once.o refcount.o errseq.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/errseq.c b/lib/errseq.c
new file mode 100644
index ..c783d0a39cb0
--- /dev/null
+++ b/lib/errseq.c
@@ -0,0 +1,199 @@
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * An errseq_t is a way of recording errors in one place, and allowing any
+ * number of "subscribers" to tell whether it has changed since an arbitrary
+ * time of their choosing.
+ *
+ * It's implemented as an unsigned 32-bit value. The low order bits are
+ * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
+ * are used as a counter. This is done with atomics instead of locking so that
+ * these functions can be called from any context.
+ *
+ * The general idea is for consumers to sample an errseq_t value at a
+ * particular point in time. Later, that value can be used to tell whether any
+ * new errors have occurred since that time.
+ *
+ * Note that there is a risk of collisions, if new errors are being recorded
+ * frequently, since we have so few bits to use as a counter.
+ *
+ * To mitigate this, one bit is used as a flag to tell whether the value has
+ * been sampled since a new value was recorded. That allows us to avoid bumping
+ * the counter if no one has sampled it since the last time an error was
+ * recorded.
+ *
+ * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
+ * is the special (but common) case where there has never been an error. An all
+ * zero value thus serves as the "epoch" if one wishes to know whether there
+ * has ever been an error set since it was first initialized.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define ERRSEQ_SHIFT   ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define ERRSEQ_SEEN(1 << ERRSEQ_SHIFT)
+
+/* The "ones" bit for the counter */
+#define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1))
+
+/**
+ * __errseq_set - set a errseq_t for later reporting
+ * @eseq: errseq_t field that should be set
+ * @err: error to set
+ *
+ * This function sets the error in *eseq, and increments the sequence counter
+ * if the last sequence was sampled at some point in the past.
+ *
+ * Any error set will always overwrite an existing error.
+ *
+ * Most callers will want to use the errseq_set inline wrapper to efficiently
+ * handle the common case where err is 0.
+ */
+void __errseq_set(errseq_t *eseq, int err)
+{

[PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-24 Thread Jeff Layton

This ensures that we see errors on fsync when writeback fails.

Signed-off-by: Jeff Layton 
---
 fs/fuse/file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ec238fb5a584..07d0efcb050c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page)
 err_free:
fuse_request_free(req);
 err:
+   mapping_set_error(page->mapping, error);
end_page_writeback(page);
return error;
 }
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 11/20] cifs: set mapping error when page writeback fails in writepage or launder_pages

2017-04-24 Thread Jeff Layton

Signed-off-by: Jeff Layton 
---
 fs/cifs/file.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 21d404535739..4b696a23b0b1 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct 
writeback_control *wbc)
set_page_writeback(page);
 retry_write:
rc = cifs_partialpagewrite(page, 0, PAGE_SIZE);
-   if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL)
+   if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL) {
goto retry_write;
-   else if (rc == -EAGAIN)
+   } else if (rc == -EAGAIN) {
redirty_page_for_writepage(wbc, page);
-   else if (rc != 0)
+   } else if (rc != 0) {
SetPageError(page);
-   else
+   mapping_set_error(page->mapping, rc);
+   } else {
SetPageUptodate(page);
+   }
end_page_writeback(page);
put_page(page);
free_xid(xid);
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 07/20] nilfs2: set the mapping error when calling SetPageError on writeback

2017-04-24 Thread Jeff Layton

In a later patch, we're going to want to make the fsync codepath not do
a TestClearPageError call as that can override the error set in the
address space. To do that though, we need to ensure that filesystems
that are relying on the PG_error bit for reporting writeback errors
also set an error in the address space.

Ensure that this is set in nilfs2.

Cc: Ryusuke Konishi 
Signed-off-by: Jeff Layton 
---
 fs/nilfs2/segment.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index febed1217b3f..612d4b446793 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1745,6 +1745,7 @@ static void nilfs_end_page_io(struct page *page, int err)
} else {
__set_page_dirty_nobuffers(page);
SetPageError(page);
+   mapping_set_error(page_mapping(page), err);
}
 
end_page_writeback(page);
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 06/20] dax: set errors in mapping when writeback fails

2017-04-24 Thread Jeff Layton

In order to get proper error codes from fsync, we must set an error in
the mapping range when writeback fails.

Signed-off-by: Jeff Layton 
---
 fs/dax.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 85abd741253d..9b6b04030c3f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -901,8 +901,10 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
 
ret = dax_writeback_one(bdev, mapping, indices[i],
pvec.pages[i]);
-   if (ret < 0)
+   if (ret < 0) {
+   mapping_set_error(mapping, ret);
return ret;
+   }
}
}
return 0;
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs-progs: tests: add variable quotation to fsck-tests

2017-04-24 Thread Lakshmipathi.G

Signed-off-by: Lakshmipathi.G 
---
 tests/fsck-tests/006-bad-root-items/test.sh|  6 +++---
 tests/fsck-tests/012-leaf-corruption/test.sh   | 24 +++---
 tests/fsck-tests/013-extent-tree-rebuild/test.sh   | 22 ++--
 tests/fsck-tests/018-leaf-crossing-stripes/test.sh |  4 ++--
 .../fsck-tests/019-non-skinny-false-alert/test.sh  |  4 ++--
 tests/fsck-tests/020-extent-ref-cases/test.sh  |  4 ++--
 .../021-partially-dropped-snapshot-case/test.sh|  4 ++--
 tests/fsck-tests/022-qgroup-rescan-halfway/test.sh |  4 ++--
 tests/fsck-tests/023-qgroup-stack-overflow/test.sh |  4 ++--
 tests/fsck-tests/024-clear-space-cache/test.sh | 16 +++
 tests/fsck-tests/025-file-extents/test.sh  |  2 +-
 tests/fsck-tests/026-check-inode-link/test.sh  |  2 +-
 12 files changed, 48 insertions(+), 48 deletions(-)

diff --git a/tests/fsck-tests/006-bad-root-items/test.sh 
b/tests/fsck-tests/006-bad-root-items/test.sh
index 8433234..bf3ef78 100755
--- a/tests/fsck-tests/006-bad-root-items/test.sh
+++ b/tests/fsck-tests/006-bad-root-items/test.sh
@@ -1,15 +1,15 @@
 #!/bin/bash
 
-source $TOP/tests/common
+source "$TOP/tests/common"
 
 check_prereq btrfs
 
-echo "extracting image default_case.tar.xz" >> $RESULTS
+echo "extracting image default_case.tar.xz" >> "$RESULTS"
 tar --no-same-owner -xJf default_case.tar.xz || \
_fail "failed to extract default_case.tar.xz"
 check_image test.img
 
-echo "extracting image skinny_case.tar.xz" >> $RESULTS
+echo "extracting image skinny_case.tar.xz" >> "$RESULTS"
 tar --no-same-owner -xJf skinny_case.tar.xz || \
_fail "failed to extract skinny_case.tar.xz"
 check_image test.img
diff --git a/tests/fsck-tests/012-leaf-corruption/test.sh 
b/tests/fsck-tests/012-leaf-corruption/test.sh
index a308727..43b0e6d 100755
--- a/tests/fsck-tests/012-leaf-corruption/test.sh
+++ b/tests/fsck-tests/012-leaf-corruption/test.sh
@@ -1,6 +1,6 @@
 #!/bin/bash
 
-source $TOP/tests/common
+source "$TOP/tests/common"
 
 check_prereq btrfs-image
 
@@ -37,16 +37,16 @@ leaf_no_data_ext_list=(
 generate_leaf_corrupt_no_data_ext()
 {
dest=$1
-   echo "generating leaf_corrupt_no_data_ext.btrfs-image" >> $RESULTS
+   echo "generating leaf_corrupt_no_data_ext.btrfs-image" >> "$RESULTS"
tar --no-same-owner -xJf ./no_data_extent.tar.xz || \
_fail "failed to extract leaf_corrupt_no_data_ext.btrfs-image"
-   $TOP/btrfs-image -r test.img.btrfs-image $dest || \
+   "$TOP/btrfs-image" -r test.img.btrfs-image "$dest" || \
_fail "failed to extract leaf_corrupt_no_data_ext.btrfs-image"
 
# leaf at 4206592 and 20905984 contains no regular data
# extent, clear its csum to corrupt the leaf.
for x in 4206592 20905984; do
-   dd if=/dev/zero of=$dest bs=1 count=32 conv=notrunc seek=$x \
+   dd if=/dev/zero of="$dest" bs=1 count=32 conv=notrunc seek="$x" 
\
1>/dev/null 2>&1
done
 }
@@ -60,21 +60,21 @@ check_inode()
name=$5
 
# Check whether the inode exists
-   exists=$($SUDO_HELPER find $path -inum $ino)
+   exists=$($SUDO_HELPER find "$path" -inum "$ino")
if [ -z "$exists" ]; then
_fail "inode $ino not recovered correctly"
fi
 
# Check inode type
-   found_mode=$(printf "%o" 0x$($SUDO_HELPER stat $exists -c %f))
-   if [ $found_mode -ne $mode ]; then
+   found_mode=$(printf "%o" 0x$($SUDO_HELPER stat "$exists" -c %f))
+   if [ "$found_mode" -ne "$mode" ]; then
echo "$found_mode"
_fail "inode $ino modes not recovered"
fi
 
# Check inode size
-   found_size=$($SUDO_HELPER stat $exists -c %s)
-   if [ $mode -ne 41700 -a $found_size -ne $size ]; then
+   found_size=$($SUDO_HELPER stat "$exists" -c %s)
+   if [ $mode -ne 41700 -a "$found_size" -ne "$size" ]; then
_fail "inode $ino size not recovered correctly"
fi
 
@@ -90,11 +90,11 @@ check_inode()
 check_leaf_corrupt_no_data_ext()
 {
image=$1
-   $SUDO_HELPER mount -o loop $image -o ro $TEST_MNT
+   $SUDO_HELPER mount -o loop "$image" -o ro "$TEST_MNT"
 
i=0
while [ $i -lt ${#leaf_no_data_ext_list[@]} ]; do
-   check_inode $TEST_MNT/lost+found \
+   check_inode "$TEST_MNT/lost+found" \
${leaf_no_data_ext_list[i]} \
${leaf_no_data_ext_list[i + 1]} \
${leaf_no_data_ext_list[i + 2]} \
@@ -102,7 +102,7 @@ check_leaf_corrupt_no_data_ext()
${leaf_no_data_ext_list[i + 4]}
((i+=4))
done
-   $SUDO_HELPER umount $TEST_MNT
+   $SUDO_HELPER umount "$TEST_MNT"
 }
 
 setup_root_helper
diff --git a/tests/fsck-tests/013-extent-tree-rebuild/test.sh 
b/tests/fsck-tests/013-extent-tree-rebuild/test.sh
index 08c

[PATCH v3 05/20] orangefs: don't call filemap_write_and_wait from fsync

2017-04-24 Thread Jeff Layton

Orangefs doesn't do buffered writes yet, so there's no point in
initiating and waiting for writeback.

Signed-off-by: Jeff Layton 
---
 fs/orangefs/file.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index e6bbc8083d77..17ab42c4db52 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -646,14 +646,11 @@ static int orangefs_fsync(struct file *file,
   loff_t end,
   int datasync)
 {
-   int ret = -EINVAL;
+   int ret;
struct orangefs_inode_s *orangefs_inode =
ORANGEFS_I(file_inode(file));
struct orangefs_kernel_op_s *new_op = NULL;
 
-   /* required call */
-   filemap_write_and_wait_range(file->f_mapping, start, end);
-
new_op = op_alloc(ORANGEFS_VFS_OP_FSYNC);
if (!new_op)
return -ENOMEM;
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 04/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-04-24 Thread Jeff Layton

ext2 currently does a test+clear of the AS_EIO flag, which is
is problematic for some coming changes.

What we really need to do instead is call filemap_check_errors
in __generic_file_fsync after syncing out the buffers. That
will be sufficient for this case, and help other callers detect
these errors properly as well.

With that, we don't need to twiddle it in ext2.

Suggested-by: Jan Kara 
Signed-off-by: Jeff Layton 
---
 fs/ext2/file.c | 2 +-
 fs/libfs.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index b21891a6bfca..ed00e7ae0ef3 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -177,7 +177,7 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, 
int datasync)
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
 
ret = generic_file_fsync(file, start, end, datasync);
-   if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
+   if (ret == -EIO) {
/* We don't really know where the IO error happened... */
ext2_error(sb, __func__,
   "detected IO error when writing metadata buffers");
diff --git a/fs/libfs.c b/fs/libfs.c
index a8b62e5d43a9..12a48ee442d3 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -991,7 +991,8 @@ int __generic_file_fsync(struct file *file, loff_t start, 
loff_t end,
 
 out:
inode_unlock(inode);
-   return ret;
+   err = filemap_check_errors(inode->i_mapping);
+   return ret ? : err;
 }
 EXPORT_SYMBOL(__generic_file_fsync);
 
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 01/20] mm: drop "wait" parameter from write_one_page

2017-04-24 Thread Jeff Layton

The callers all set it to 1.

Also, make it clear that this function will not set any sort of AS_*
error, and that the caller must do so if necessary. No existing caller
uses this on normal files, so none of them need it.

Also, add __must_check here since, in general, the callers need to
handle an error here in some fashion.

Signed-off-by: Jeff Layton 
Reviewed-by: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Matthew Wilcox 
---
 fs/exofs/dir.c|  2 +-
 fs/ext2/dir.c |  2 +-
 fs/jfs/jfs_metapage.c |  4 ++--
 fs/minix/dir.c|  2 +-
 fs/sysv/dir.c |  2 +-
 fs/ufs/dir.c  |  2 +-
 include/linux/mm.h|  2 +-
 mm/page-writeback.c   | 14 +++---
 8 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 42f9a0a0c4ca..e163ed980c20 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -72,7 +72,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
set_page_dirty(page);
 
if (IS_DIRSYNC(dir))
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
else
unlock_page(page);
 
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index d9650c9508e4..e2709695b177 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -100,7 +100,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
}
 
if (IS_DIRSYNC(dir)) {
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
if (!err)
err = sync_inode_metadata(dir, 1);
} else {
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 489aaa1403e5..744fa3c079e6 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -711,7 +711,7 @@ void force_metapage(struct metapage *mp)
get_page(page);
lock_page(page);
set_page_dirty(page);
-   write_one_page(page, 1);
+   write_one_page(page);
clear_bit(META_forcewrite, &mp->flag);
put_page(page);
 }
@@ -756,7 +756,7 @@ void release_metapage(struct metapage * mp)
set_page_dirty(page);
if (test_bit(META_sync, &mp->flag)) {
clear_bit(META_sync, &mp->flag);
-   write_one_page(page, 1);
+   write_one_page(page);
lock_page(page); /* write_one_page unlocks the page */
}
} else if (mp->lsn) /* discard_metapage doesn't remove it */
diff --git a/fs/minix/dir.c b/fs/minix/dir.c
index 7edc9b395700..baa9721f1299 100644
--- a/fs/minix/dir.c
+++ b/fs/minix/dir.c
@@ -57,7 +57,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
mark_inode_dirty(dir);
}
if (IS_DIRSYNC(dir))
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
else
unlock_page(page);
return err;
diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c
index 5bdae85ceef7..f5191cb2c947 100644
--- a/fs/sysv/dir.c
+++ b/fs/sysv/dir.c
@@ -45,7 +45,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
mark_inode_dirty(dir);
}
if (IS_DIRSYNC(dir))
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
else
unlock_page(page);
return err;
diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index de01b8f2aa78..48609f1d9580 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -53,7 +53,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, 
unsigned len)
mark_inode_dirty(dir);
}
if (IS_DIRSYNC(dir))
-   err = write_one_page(page, 1);
+   err = write_one_page(page);
else
unlock_page(page);
return err;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00a8fa7e366a..403f78afee20 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2187,7 +2187,7 @@ extern void filemap_map_pages(struct vm_fault *vmf,
 extern int filemap_page_mkwrite(struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
-int write_one_page(struct page *page, int wait);
+int __must_check write_one_page(struct page *page);
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d8ac2a7fb9e7..de0dbf12e2c1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2361,15 +2361,16 @@ int do_writepages(struct address_space *mapping, struct 
writeback_control *wbc)
 }
 
 /**
- * write_one_page - write out a single page and optionally wait on I/O
+ * write_one_page - write out a single page and wait on I/O
  * @page: the page to write
- * @wait: if true, wait on writeout
  *
  * The page must be locked by the caller and will be unlocked upon return.
  *
- * write_one_page() returns a negative error code if I/O failed.
+ * write_one_pag

[PATCH v3 02/20] mm: fix mapping_set_error call in me_pagecache_dirty

2017-04-24 Thread Jeff Layton

The error code should be negative. Since this ends up in the default
case anyway, this is harmless, but it's less confusing to negate it.
Also, later patches will require a negative error code here.

Signed-off-by: Jeff Layton 
Reviewed-by: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Matthew Wilcox 
---
 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 27f7210e7fab..4b56e53e5378 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -674,7 +674,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long 
pfn)
 * the first EIO, but we're not worse than other parts
 * of the kernel.
 */
-   mapping_set_error(mapping, EIO);
+   mapping_set_error(mapping, -EIO);
}
 
return me_pagecache_clean(p, pfn);
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs-progs: Introduce 'btrfs inspect-internal dump-csum' option

2017-04-24 Thread Lakshmipathi.G

Example usage:
btrfs inspect-internal dump-csum /btrfs/50gbfile /dev/sda4
csum for /btrfs/50gbfile dumped to /btrfs/50gbfile.csumdump

Signed-off-by: Lakshmipathi.G 
---
 Makefile |   3 +-
 cmds-inspect-dump-csum.c | 256 +++
 cmds-inspect-dump-csum.h |  24 +
 cmds-inspect.c   |   3 +
 4 files changed, 285 insertions(+), 1 deletion(-)
 create mode 100644 cmds-inspect-dump-csum.c
 create mode 100644 cmds-inspect-dump-csum.h

diff --git a/Makefile b/Makefile
index 81598df..6e63e18 100644
--- a/Makefile
+++ b/Makefile
@@ -102,7 +102,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o 
cmds-device.o cmds-scrub.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
   cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \
   cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o 
\
-  mkfs/common.o
+  cmds-inspect-dump-csum.o mkfs/common.o
 libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o 
\
   kernel-lib/crc32c.o messages.o \
   uuid-tree.o utils-lib.o rbtree-utils.o
@@ -191,6 +191,7 @@ btrfs_convert_cflags = 
-DBTRFSCONVERT_EXT2=$(BTRFSCONVERT_EXT2)
 btrfs_fragments_libs = -lgd -lpng -ljpeg -lfreetype
 btrfs_debug_tree_objects = cmds-inspect-dump-tree.o
 btrfs_show_super_objects = cmds-inspect-dump-super.o
+btrfs_dump_csum_objects = cmds-inspect-dump-csum.o
 btrfs_calc_size_objects = cmds-inspect-tree-stats.o
 
 # collect values of the variables above
diff --git a/cmds-inspect-dump-csum.c b/cmds-inspect-dump-csum.c
new file mode 100644
index 000..6065eb4
--- /dev/null
+++ b/cmds-inspect-dump-csum.c
@@ -0,0 +1,256 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "kerncompat.h"
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "ctree.h"
+#include "disk-io.h"
+#include "print-tree.h"
+#include "transaction.h"
+#include "list.h"
+#include "utils.h"
+#include "commands.h"
+#include "crc32c.h"
+#include "cmds-inspect-dump-csum.h"
+#include "help.h"
+#include "volumes.h"
+
+
+const char * const cmd_inspect_dump_csum_usage[] = {
+   "btrfs inspect-internal dump-csum  ",
+   "Get csums for the given file.",
+   NULL
+};
+
+int btrfs_lookup_csums(struct btrfs_trans_handle *trans, struct btrfs_root 
*root,
+   struct btrfs_path *path, u64 bytenr, int cow, int total_csums, FILE *fp)
+{
+   int ret;
+   int i;
+   int start_pos = 0;
+   struct btrfs_key file_key;
+   struct btrfs_key found_key;
+   struct btrfs_csum_item *item;
+   struct extent_buffer *leaf;
+   u64 csum_offset = 0;
+   u16 csum_size =
+   btrfs_super_csum_size(root->fs_info->super_copy);
+   int csums_in_item = 0;
+   unsigned int tree_csum = 0;
+   int pending_csums = total_csums;
+   static int cnt=1;
+
+   file_key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+   file_key.offset = bytenr;
+   file_key.type = BTRFS_EXTENT_CSUM_KEY;
+   ret = btrfs_search_slot(trans, root, &file_key, path, 0, cow);
+   if (ret < 0)
+   goto fail;
+   while(1){
+   leaf = path->nodes[0];
+   if (ret > 0) {
+   ret = 1;
+   if (path->slots[0] == 0)
+   goto fail; 
+   path->slots[0]--;
+   btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+   if (found_key.type != BTRFS_EXTENT_CSUM_KEY){
+   fprintf(stderr, "\nInvalid key found.");
+   goto fail;
+   }
+
+   csum_offset = ((bytenr - found_key.offset) / 
root->sectorsize) * csum_size;
+   csums_in_item = btrfs_item_size_nr(leaf, 
path->slots[0]);
+   csums_in_item /= csum_size;
+   csums_in_item -= ( bytenr - found_key.offset ) / 
root->sectorsize;
+   start_pos=csum_offset;
+   }
+   if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+   if (pending_csums > 0){
+   ret = btrfs_next_leaf(root, path);
+

[RFC xfstests PATCH] xfstests: add a writeback error handling test

2017-04-24 Thread Jeff Layton

This is just an RFC set for now. I've numbered it 999 for the moment so
as not to collide with tests being added.

I'm working on a set of kernel patches to change how writeback errors
are handled and reported in the kernel. Instead of reporting a
writeback error to only the first fsync caller on the file, I aim
to make the kernel report them once on every file description:

https://lkml.org/lkml/2017/4/24/438

This patch adds a test for the new behavior. Basically, open many fds to
the same file, turn on dm_error, write to each of the fds, and then
fsync them all to ensure that they all get an error back. 

With the patch series above, ext4 now passes. xfs and btrfs end up in
r/o mode after the test. xfs returns -EIO at that point though, and
btrfs returns -EROFS. What behavior we actually want there, I'm not
certain. We might be able to mitigate that by putting the journals on a
separate device?

Signed-off-by: Jeff Layton 
---
 common/dmerror|  13 ---
 src/Makefile  |   2 +-
 src/fsync-err.c   | 102 ++
 tests/generic/999 |  74 
 tests/generic/999.out |   3 ++
 tests/generic/group   |   1 +
 tools/dmerror |  47 +++
 7 files changed, 236 insertions(+), 6 deletions(-)
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out
 create mode 100755 tools/dmerror

diff --git a/common/dmerror b/common/dmerror
index d46c5d0b7266..238baa213b1f 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
_notrun "Cannot run tests with DAX on dmerror devices"
 fi
 
-_dmerror_init()
+_dmerror_setup()
 {
local dm_backing_dev=$SCRATCH_DEV
 
-   $DMSETUP_PROG remove error-test > /dev/null 2>&1
-
local blk_dev_size=`blockdev --getsz $dm_backing_dev`
 
DMERROR_DEV='/dev/mapper/error-test'
 
DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
 
+   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
+}
+
+_dmerror_init()
+{
+   _dmerror_setup
+   $DMSETUP_PROG remove error-test > /dev/null 2>&1
$DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
_fatal "failed to create dm linear device"
-
-   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
 }
 
 _dmerror_mount()
diff --git a/src/Makefile b/src/Makefile
index e62d7a9774d7..056a75b9f7bb 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -12,7 +12,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
godown resvtest writemod makeextents itrash rename \
multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
-   holetest t_truncate_self t_mmap_dio af_unix
+   holetest t_truncate_self t_mmap_dio af_unix fsync-err
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/fsync-err.c b/src/fsync-err.c
new file mode 100644
index ..8ebfd145bd70
--- /dev/null
+++ b/src/fsync-err.c
@@ -0,0 +1,102 @@
+/*
+ * fsync-err.c: test whether writeback errors are reported to all open fds
+ * Copyright (c) 2017: Jeff Layton 
+ *
+ * Open a file several times, write to it and then fsync. Flip dm_error over
+ * to make the backing device stop working. Overwrite the same section and
+ * call fsync on all fds and verify that we get errors on all of them. Then,
+ * fsync one more time on all of them and verify that they return 0.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define NUM_FDS10
+
+static void usage() {
+   fprintf(stderr, "Usage: fsync-err \n");
+}
+
+int main(int argc, char **argv)
+{
+   int fd[NUM_FDS], ret, i;
+   char *fname, *buf;
+
+   if (argc < 1) {
+   usage();
+   return 1;
+   }
+
+   /* First argument is filename */
+   fname = argv[1];
+
+   for (i = 0; i < NUM_FDS; ++i) {
+   fd[i] = open(fname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
+   if (fd[i] < 0) {
+   printf("open of fd[%d] failed: %m\n", i);
+   return 1;
+   }
+   }
+
+   buf = "foobar";
+   for (i = 0; i < NUM_FDS; ++i) {
+   ret = write(fd[i], buf, strlen(buf) + 1);
+   if (ret < 0) {
+   printf("First write on fd[%d] failed: %m\n", i);
+   return 1;
+   }
+   }
+
+   for (i = 0; i < NUM_FDS; ++i) {
+   ret = fsync(fd[i]);
+   if (ret < 0) {
+   printf("First fsync on fd[%d] failed: %m\n", i);
+   return 1;
+   }
+   }
+
+   /* flip the device to non-working mode */
+

Re: [PATCH v3 20/20] gfs2: clean up some filemap_* calls

2017-04-24 Thread Bob Peterson

- Original Message -
| In some places, it's trying to reset the mapping error after calling
| filemap_fdatawait. That's no longer required. Also, turn several
| filemap_fdatawrite+filemap_fdatawait calls into filemap_write_and_wait.
| That will at least return writeback errors that occur during the write
| phase.
| 
| Signed-off-by: Jeff Layton 
| ---
|  fs/gfs2/glops.c | 12 
|  fs/gfs2/lops.c  |  4 +---
|  fs/gfs2/super.c |  6 ++
|  3 files changed, 7 insertions(+), 15 deletions(-)
| 
| diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
| index 5db59d444838..7362d19fdc4c 100644
| --- a/fs/gfs2/glops.c
| +++ b/fs/gfs2/glops.c
| @@ -158,9 +158,7 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
|   GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
|  
|   gfs2_log_flush(sdp, gl, NORMAL_FLUSH);
| - filemap_fdatawrite_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
| - error = filemap_fdatawait_range(mapping, gl->gl_vm.start, 
gl->gl_vm.end);
| - mapping_set_error(mapping, error);
| + filemap_write_and_wait_range(mapping, gl->gl_vm.start, gl->gl_vm.end);

This should probably have "error = ", no?

|   gfs2_ail_empty_gl(gl);
|  
|   spin_lock(&gl->gl_lockref.lock);
| @@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl)
|   filemap_fdatawrite(metamapping);
|   if (ip) {
|   struct address_space *mapping = ip->i_inode.i_mapping;
| - filemap_fdatawrite(mapping);
| - error = filemap_fdatawait(mapping);
| - mapping_set_error(mapping, error);
| + filemap_write_and_wait(mapping);
| + } else {
| + filemap_fdatawait(metamapping);
|   }
| - error = filemap_fdatawait(metamapping);
| - mapping_set_error(metamapping, error);

This part doesn't look right at all. There's a big difference in gfs2 between
mapping and metamapping. We need to wait for metamapping regardless.

(snip)

Regards,

Bob Peterson
Red Hat File Systems
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC xfstests PATCH] xfstests: add a writeback error handling test

2017-04-24 Thread Christoph Hellwig

On Mon, Apr 24, 2017 at 09:45:51AM -0400, Jeff Layton wrote:
> With the patch series above, ext4 now passes. xfs and btrfs end up in
> r/o mode after the test. xfs returns -EIO at that point though, and
> btrfs returns -EROFS. What behavior we actually want there, I'm not
> certain. We might be able to mitigate that by putting the journals on a
> separate device?

This looks like XFS shut down because of a permanent write error from
dm-error.  Which seems like the expected behavior.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 04/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-04-24 Thread Christoph Hellwig

>  out:
>   inode_unlock(inode);
> - return ret;
> + err = filemap_check_errors(inode->i_mapping);
> + return ret ? : err;

Can you spell out the whole unary operation instead of this weird GCC
extension?

Otherwise looks fine:

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 01/20] mm: drop "wait" parameter from write_one_page

2017-04-24 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 03/20] buffer: use mapping_set_error instead of setting the flag

2017-04-24 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 02/20] mm: fix mapping_set_error call in me_pagecache_dirty

2017-04-24 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 06/20] dax: set errors in mapping when writeback fails

2017-04-24 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 05/20] orangefs: don't call filemap_write_and_wait from fsync

2017-04-24 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 07/20] nilfs2: set the mapping error when calling SetPageError on writeback

2017-04-24 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 08/20] mm: ensure that we set mapping error if writeout() fails

2017-04-24 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 09/20] 9p: set mapping error when writeback fails in launder_page

2017-04-24 Thread Christoph Hellwig

On Mon, Apr 24, 2017 at 09:22:48AM -0400, Jeff Layton wrote:
> launder_page is just writeback under the page lock. We still need to
> mark the mapping for errors there when they occur.
> 
> Signed-off-by: Jeff Layton 

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-24 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 11/20] cifs: set mapping error when page writeback fails in writepage or launder_pages

2017-04-24 Thread Christoph Hellwig

On Mon, Apr 24, 2017 at 09:22:50AM -0400, Jeff Layton wrote:
> Signed-off-by: Jeff Layton 
> ---
>  fs/cifs/file.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 21d404535739..4b696a23b0b1 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct 
> writeback_control *wbc)
>   set_page_writeback(page);
>  retry_write:
>   rc = cifs_partialpagewrite(page, 0, PAGE_SIZE);
> + if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL) {
>   goto retry_write;
> + } else if (rc == -EAGAIN) {
>   redirty_page_for_writepage(wbc, page);
> + } else if (rc != 0) {
>   SetPageError(page);
> + mapping_set_error(page->mapping, rc);
> + } else {
>   SetPageUptodate(page);
> + }

Hmmm.  I might be a little too nitpicky, but I hate having the same
partial condition duplicated if possible.  Why not:

if (rc == -EAGAIN) {
if (wbc->sync_mode == WB_SYNC_ALL)
goto retry_write;
redirty_page_for_writepage(wbc, page);
} else if (rc) {
SetPageError(page);
mapping_set_error(page->mapping, rc);
} else {
SetPageUptodate(page);
}

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Problem with file system

2017-04-24 Thread Fred Van Andel

I have a btrfs file system with a few thousand snapshots.  When I
attempted to delete 20 or so of them the problems started.

The disks are being read but except for the first few minutes there
are no writes.

Memory usage keeps growing until all the memory (24 Gb) is used in a
few hours. Eventually the system will crash with out of memory errors.

The CPU load is low (<5%) but iowait is around 30 to 50%

The drives are mounted but any process that attempts to access them
will just hang so I cannot access any data on the drives.

Smartctl does not show any issues with the drives.

The problem restarts after a reboot once you mount the drives.

I tried to zero the log hoping it wouldn't restart after a reboot but
that didn't work

I am assuming that the attempt to remove the snapshots caused this
problem.  How do I interrupt the process so I can access the
filesystem again?

# uname -a
Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

#   btrfs --version
btrfs-progs v4.9.1

#   btrfs fi show
Label: none  uuid: 79ba7374-bf77-4868-bb64-656ff5736c44
Total devices 6 FS bytes used 5.65TiB
devid1 size 1.82TiB used 1.29TiB path /dev/sdb
devid2 size 1.82TiB used 1.29TiB path /dev/sdc
devid3 size 1.82TiB used 1.29TiB path /dev/sdd
devid4 size 1.82TiB used 1.29TiB path /dev/sde
devid5 size 3.64TiB used 3.11TiB path /dev/sdf
devid6 size 3.64TiB used 3.11TiB path /dev/sdg

# btrfs fi df /pubroot
Data, RAID1: total=5.58TiB, used=5.58TiB
System, RAID1: total=32.00MiB, used=828.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=104.00GiB, used=70.64GiB
GlobalReserve, single: total=512.00MiB, used=28.51MiB
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 04/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-04-24 Thread Jan Kara

On Mon 24-04-17 09:22:43, Jeff Layton wrote:
> ext2 currently does a test+clear of the AS_EIO flag, which is
> is problematic for some coming changes.
> 
> What we really need to do instead is call filemap_check_errors
> in __generic_file_fsync after syncing out the buffers. That
> will be sufficient for this case, and help other callers detect
> these errors properly as well.
> 
> With that, we don't need to twiddle it in ext2.
> 
> Suggested-by: Jan Kara 
> Signed-off-by: Jeff Layton 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza
> ---
>  fs/ext2/file.c | 2 +-
>  fs/libfs.c | 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index b21891a6bfca..ed00e7ae0ef3 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -177,7 +177,7 @@ int ext2_fsync(struct file *file, loff_t start, loff_t 
> end, int datasync)
>   struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
>  
>   ret = generic_file_fsync(file, start, end, datasync);
> - if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
> + if (ret == -EIO) {
>   /* We don't really know where the IO error happened... */
>   ext2_error(sb, __func__,
>  "detected IO error when writing metadata buffers");
> diff --git a/fs/libfs.c b/fs/libfs.c
> index a8b62e5d43a9..12a48ee442d3 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -991,7 +991,8 @@ int __generic_file_fsync(struct file *file, loff_t start, 
> loff_t end,
>  
>  out:
>   inode_unlock(inode);
> - return ret;
> + err = filemap_check_errors(inode->i_mapping);
> + return ret ? : err;
>  }
>  EXPORT_SYMBOL(__generic_file_fsync);
>  
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 06/20] dax: set errors in mapping when writeback fails

2017-04-24 Thread Jan Kara

On Mon 24-04-17 09:22:45, Jeff Layton wrote:
> In order to get proper error codes from fsync, we must set an error in
> the mapping range when writeback fails.
> 
> Signed-off-by: Jeff Layton 

So I'm fine with the change but please expand the changelog to something
like:

DAX currently doesn't set errors in the mapping when cache flushing fails
in dax_writeback_mapping_range(). Since this function can get called only
from fsync(2) or sync(2), this is actually as good as it can currently get
since we correctly propagate the error up from dax_writeback_mapping_range()
to filemap_fdatawrite(). However in the future better writeback error
handling will enable us to properly report these errors on fsync(2) even if
there are multiple file descriptors open against the file or if sync(2)
gets called before fsync(2). So convert DAX to using standard error
reporting through the mapping.

After improving the changelog you can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/dax.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 85abd741253d..9b6b04030c3f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -901,8 +901,10 @@ int dax_writeback_mapping_range(struct address_space 
> *mapping,
>  
>   ret = dax_writeback_one(bdev, mapping, indices[i],
>   pvec.pages[i]);
> - if (ret < 0)
> + if (ret < 0) {
> + mapping_set_error(mapping, ret);
>   return ret;
> + }
>   }
>   }
>   return 0;
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 08/20] mm: ensure that we set mapping error if writeout() fails

2017-04-24 Thread Jan Kara

On Mon 24-04-17 09:22:47, Jeff Layton wrote:
> If writepage fails during a page migration, then we need to ensure that
> fsync will see it by flagging the mapping.
> 
> Signed-off-by: Jeff Layton 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza
> ---
>  mm/migrate.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 738f1d5f8350..3a59830bdae2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -792,7 +792,11 @@ static int writeout(struct address_space *mapping, 
> struct page *page)
>   /* unlocked. Relock */
>   lock_page(page);
>  
> - return (rc < 0) ? -EIO : -EAGAIN;
> + if (rc < 0) {
> + mapping_set_error(mapping, rc);
> + return -EIO;
> + }
> + return -EAGAIN;
>  }
>  
>  /*
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 09/20] 9p: set mapping error when writeback fails in launder_page

2017-04-24 Thread Jan Kara

On Mon 24-04-17 09:22:48, Jeff Layton wrote:
> launder_page is just writeback under the page lock. We still need to
> mark the mapping for errors there when they occur.
> 
> Signed-off-by: Jeff Layton 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/9p/vfs_addr.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
> index adaf6f6dd858..7af6e6501698 100644
> --- a/fs/9p/vfs_addr.c
> +++ b/fs/9p/vfs_addr.c
> @@ -223,8 +223,11 @@ static int v9fs_launder_page(struct page *page)
>   v9fs_fscache_wait_on_page_write(inode, page);
>   if (clear_page_dirty_for_io(page)) {
>   retval = v9fs_vfs_writepage_locked(page);
> - if (retval)
> + if (retval) {
> + if (retval != -EAGAIN)
> + mapping_set_error(page->mapping, retval);
>   return retval;
> + }
>   }
>   return 0;
>  }
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-24 Thread Jan Kara

On Mon 24-04-17 09:22:49, Jeff Layton wrote:
> This ensures that we see errors on fsync when writeback fails.
> 
> Signed-off-by: Jeff Layton 

Hum, but do we really want to clobber mapping errors with temporary stuff
like ENOMEM? Or do you want to handle that in mapping_set_error?

Honza

> ---
>  fs/fuse/file.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index ec238fb5a584..07d0efcb050c 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page)
>  err_free:
>   fuse_request_free(req);
>  err:
> + mapping_set_error(page->mapping, error);
>   end_page_writeback(page);
>   return error;
>  }
> -- 
> 2.9.3
> 
> 
-- 
Jan Kara 
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 20/20] gfs2: clean up some filemap_* calls

2017-04-24 Thread Jeff Layton

On Mon, 2017-04-24 at 10:12 -0400, Bob Peterson wrote:
> - Original Message -
> > In some places, it's trying to reset the mapping error after calling
> > filemap_fdatawait. That's no longer required. Also, turn several
> > filemap_fdatawrite+filemap_fdatawait calls into filemap_write_and_wait.
> > That will at least return writeback errors that occur during the write
> > phase.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/gfs2/glops.c | 12 
> >  fs/gfs2/lops.c  |  4 +---
> >  fs/gfs2/super.c |  6 ++
> >  3 files changed, 7 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> > index 5db59d444838..7362d19fdc4c 100644
> > --- a/fs/gfs2/glops.c
> > +++ b/fs/gfs2/glops.c
> > @@ -158,9 +158,7 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
> > GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE);
> >  
> > gfs2_log_flush(sdp, gl, NORMAL_FLUSH);
> > -   filemap_fdatawrite_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
> > -   error = filemap_fdatawait_range(mapping, gl->gl_vm.start, 
> > gl->gl_vm.end);
> > -   mapping_set_error(mapping, error);
> > +   filemap_write_and_wait_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
> 
> This should probably have "error = ", no?
> 

This error is discarded in the current code after resetting the error in
the mapping. With the earlier patches in this set we don't need to reset
the error like this anymore.

Now, if this code should doing something else with those errors, then
that's a separate problem.

> > gfs2_ail_empty_gl(gl);
> >  
> > spin_lock(&gl->gl_lockref.lock);
> > @@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl)
> > filemap_fdatawrite(metamapping);
> > if (ip) {
> > struct address_space *mapping = ip->i_inode.i_mapping;
> > -   filemap_fdatawrite(mapping);
> > -   error = filemap_fdatawait(mapping);
> > -   mapping_set_error(mapping, error);
> > +   filemap_write_and_wait(mapping);
> > +   } else {
> > +   filemap_fdatawait(metamapping);
> > }
> > -   error = filemap_fdatawait(metamapping);
> > -   mapping_set_error(metamapping, error);
> 
> This part doesn't look right at all. There's a big difference in gfs2 between
> mapping and metamapping. We need to wait for metamapping regardless.
> 

...and this should wait. Basically, filemap_write_and_wait does
filemap_fdatawrite and then filemap_fdatawait. This is mostly just
replacing the existing code with a more concise helper.

-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Problem with file system

2017-04-24 Thread Chris Murphy

On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel  wrote:
> I have a btrfs file system with a few thousand snapshots.  When I
> attempted to delete 20 or so of them the problems started.
>
> The disks are being read but except for the first few minutes there
> are no writes.
>
> Memory usage keeps growing until all the memory (24 Gb) is used in a
> few hours. Eventually the system will crash with out of memory errors.

Boot with these boot parameters
log_buf_len=1M

I find it easier to remotely login with another computer to capture
problems in case of a crash and I can't save things locally. So on the
remote computer use 'journalctl -kf -o short-monotonic'

Either on the 1st computer, or from an additional ssh connection from the 2nd:

echo 1 >/proc/sys/kernel/sysrq
btrfs fi show   #you need the UUID for the volume you're going to
mount, best to have it in advance

mount the file system normally, and once it's starting to have the
problem (I guess it happens pretty quickly?)

echo t > /proc/sysrq-trigger
grep . -IR /sys/fs/btrfs/UUID/allocation/

Paste in the UUID from fi show. If the computer is hanging due to
running out of memory, each of these commands can take a while to
complete. So it's best to have them all ready to go before you mount
the file system, and the problem starts happening. Best if you can
issue the commands more than once as the problem gets worse, if you
can keep them all organized and labeled.

Then attach them (rather than pasting them into the message).

> I tried to zero the log hoping it wouldn't restart after a reboot but
> that didn't work

Yeah don't just start randomly hitting the fs with a hammer like
zeroing the log tree. That's for a specific problem and this isn't it.

> I am assuming that the attempt to remove the snapshots caused this
> problem.  How do I interrupt the process so I can access the
> filesystem again?

Snapshot creation is essentially free. Snapshot removal is expensive.
There's no way to answer your questions because your email doesn't
even include a call trace. So a developer will need at least the call
trace, but there might be some other useful information in a sysrq +
t, as well as the allocation states.

> # btrfs fi df /pubroot
> Data, RAID1: total=5.58TiB, used=5.58TiB
> System, RAID1: total=32.00MiB, used=828.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=104.00GiB, used=70.64GiB
> GlobalReserve, single: total=512.00MiB, used=28.51MiB

Later, after this problem is solved, you'll want to get rid of that
single system chunk that isn't being used, but might cause a problem
in a device failure.

sudo btrfs balance start -mconvert=raid1,soft 

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails

2017-04-24 Thread Jeff Layton

On Mon, 2017-04-24 at 18:04 +0200, Jan Kara wrote:
> On Mon 24-04-17 09:22:49, Jeff Layton wrote:
> > This ensures that we see errors on fsync when writeback fails.
> > 
> > Signed-off-by: Jeff Layton 
> 
> Hum, but do we really want to clobber mapping errors with temporary stuff
> like ENOMEM? Or do you want to handle that in mapping_set_error?
> 

Right now we don't really have such a thing as temporary errors in the
writeback codepath. If you return an error here, the data doesn't stay
dirty or anything, and I think we want to ensure that that gets reported
via fsync.

I'd like to see us add better handling for retryable errors for stuff
like ENOMEM or EAGAIN. I think this is the first step toward that
though. Once we have more consistent handling of writeback errors in
general, then we can start doing more interesting things with retryable
errors.

So yeah, I this is the right thing to do for now.

> 
> > ---
> >  fs/fuse/file.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index ec238fb5a584..07d0efcb050c 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page)
> >  err_free:
> > fuse_request_free(req);
> >  err:
> > +   mapping_set_error(page->mapping, error);
> > end_page_writeback(page);
> > return error;
> >  }
> > -- 
> > 2.9.3
> > 
> > 

-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 11/20] cifs: set mapping error when page writeback fails in writepage or launder_pages

2017-04-24 Thread Jeff Layton

On Mon, 2017-04-24 at 08:27 -0700, Christoph Hellwig wrote:
> On Mon, Apr 24, 2017 at 09:22:50AM -0400, Jeff Layton wrote:
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/cifs/file.c | 10 ++
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> > index 21d404535739..4b696a23b0b1 100644
> > --- a/fs/cifs/file.c
> > +++ b/fs/cifs/file.c
> > @@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct 
> > writeback_control *wbc)
> > set_page_writeback(page);
> >  retry_write:
> > rc = cifs_partialpagewrite(page, 0, PAGE_SIZE);
> > +   if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL) {
> > goto retry_write;
> > +   } else if (rc == -EAGAIN) {
> > redirty_page_for_writepage(wbc, page);
> > +   } else if (rc != 0) {
> > SetPageError(page);
> > +   mapping_set_error(page->mapping, rc);
> > +   } else {
> > SetPageUptodate(page);
> > +   }
> 
> Hmmm.  I might be a little too nitpicky, but I hate having the same
> partial condition duplicated if possible.  Why not:
> 
>   if (rc == -EAGAIN) {
>   if (wbc->sync_mode == WB_SYNC_ALL)
>   goto retry_write;
>   redirty_page_for_writepage(wbc, page);
>   } else if (rc) {
>   SetPageError(page);
>   mapping_set_error(page->mapping, rc);
>   } else {
>   SetPageUptodate(page);
>   }
> 
> Otherwise this looks fine to me:
> 
> Reviewed-by: Christoph Hellwig 

No, you're absolutely right. I merged that change into the patch.

Thanks for the review so far!
-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 20/20] gfs2: clean up some filemap_* calls

2017-04-24 Thread Bob Peterson

Hi,

- Original Message -
| On Mon, 2017-04-24 at 10:12 -0400, Bob Peterson wrote:
| > > + filemap_write_and_wait_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
| > 
| > This should probably have "error = ", no?
| > 
| 
| This error is discarded in the current code after resetting the error in
| the mapping. With the earlier patches in this set we don't need to reset
| the error like this anymore.
| 
| Now, if this code should doing something else with those errors, then
| that's a separate problem.

Okay, I see. My bad.
 
| > >   gfs2_ail_empty_gl(gl);
| > >  
| > >   spin_lock(&gl->gl_lockref.lock);
| > > @@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl)
| > >   filemap_fdatawrite(metamapping);
| > >   if (ip) {
| > >   struct address_space *mapping = ip->i_inode.i_mapping;
| > > - filemap_fdatawrite(mapping);
| > > - error = filemap_fdatawait(mapping);
| > > - mapping_set_error(mapping, error);
| > > + filemap_write_and_wait(mapping);
| > > + } else {
| > > + filemap_fdatawait(metamapping);
| > >   }
| > > - error = filemap_fdatawait(metamapping);
| > > - mapping_set_error(metamapping, error);
| > 
| > This part doesn't look right at all. There's a big difference in gfs2
| > between
| > mapping and metamapping. We need to wait for metamapping regardless.
| > 
| 
| ...and this should wait. Basically, filemap_write_and_wait does
| filemap_fdatawrite and then filemap_fdatawait. This is mostly just
| replacing the existing code with a more concise helper.

But this isn't a simple replacement with a helper. This is two different
address spaces (mapping and metamapping) and you added an else in there.

So with this patch metamapping gets written, and if there's an ip,
mapping gets written but it doesn't wait for metamapping. Unless
I'm missing something.

You could replace both filemap_fdatawrites with the helper instead.
Today's code is structured as:

(a) write metamapping
if (ip)
(b) write mapping
(c) wait for mapping
(d) wait for metamapping

If you use the helper for both, it becomes, (a & d)(b & c) which is probably
acceptable. (I think we just tried to optimize what the elevator was doing).

But the way you've got it coded here still looks wrong. It looks like:
(a)
if (ip)
   (b & c)
ELSE -
   (d)

So (d) (metamapping) isn't guaranteed to be synced at the end of the function.
Of course, you know the modified helper functions better than I do.
What am I missing?

Regards,

Bob Peterson
Red Hat File Systems
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 20/20] gfs2: clean up some filemap_* calls

2017-04-24 Thread Jeff Layton

On Mon, 2017-04-24 at 13:41 -0400, Bob Peterson wrote:
> Hi,
> 
> - Original Message -
> > On Mon, 2017-04-24 at 10:12 -0400, Bob Peterson wrote:
> > > > +   filemap_write_and_wait_range(mapping, gl->gl_vm.start, 
> > > > gl->gl_vm.end);
> > > 
> > > This should probably have "error = ", no?
> > > 
> > 
> > This error is discarded in the current code after resetting the error in
> > the mapping. With the earlier patches in this set we don't need to reset
> > the error like this anymore.
> > 
> > Now, if this code should doing something else with those errors, then
> > that's a separate problem.
> 
> Okay, I see. My bad.
>  
> > > > gfs2_ail_empty_gl(gl);
> > > >  
> > > > spin_lock(&gl->gl_lockref.lock);
> > > > @@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl)
> > > > filemap_fdatawrite(metamapping);
> > > > if (ip) {
> > > > struct address_space *mapping = ip->i_inode.i_mapping;
> > > > -   filemap_fdatawrite(mapping);
> > > > -   error = filemap_fdatawait(mapping);
> > > > -   mapping_set_error(mapping, error);
> > > > +   filemap_write_and_wait(mapping);
> > > > +   } else {
> > > > +   filemap_fdatawait(metamapping);
> > > > }
> > > > -   error = filemap_fdatawait(metamapping);
> > > > -   mapping_set_error(metamapping, error);
> > > 
> > > This part doesn't look right at all. There's a big difference in gfs2
> > > between
> > > mapping and metamapping. We need to wait for metamapping regardless.
> > > 
> > 
> > ...and this should wait. Basically, filemap_write_and_wait does
> > filemap_fdatawrite and then filemap_fdatawait. This is mostly just
> > replacing the existing code with a more concise helper.
> 
> But this isn't a simple replacement with a helper. This is two different
> address spaces (mapping and metamapping) and you added an else in there.
> 
> So with this patch metamapping gets written, and if there's an ip,
> mapping gets written but it doesn't wait for metamapping. Unless
> I'm missing something.
> 
> You could replace both filemap_fdatawrites with the helper instead.
> Today's code is structured as:
> 
> (a) write metamapping
> if (ip)
> (b) write mapping
> (c) wait for mapping
> (d) wait for metamapping
> 
> If you use the helper for both, it becomes, (a & d)(b & c) which is probably
> acceptable. (I think we just tried to optimize what the elevator was doing).
> 
> But the way you've got it coded here still looks wrong. It looks like:
> (a)
> if (ip)
>(b & c)
> ELSE -
>(d)
> 
> So (d) (metamapping) isn't guaranteed to be synced at the end of the function.
> Of course, you know the modified helper functions better than I do.
> What am I missing?
> 
> 


You're right of course. I'll fix that up in my tree.

Thanks!
-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 05/20] orangefs: don't call filemap_write_and_wait from fsync

2017-04-24 Thread Mike Marshall

I've been running it here...

Acked-by: Mike Marshall 

On Mon, Apr 24, 2017 at 11:23 AM, Christoph Hellwig  wrote:
> Looks fine,
>
> Reviewed-by: Christoph Hellwig 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dearest

2017-04-24 Thread mariawarlor...@ono.com

Good day,

I hope that you have a wonderful health, I saw your profile and I would 
like to discuss an important and confidential business with you, your 
profile is perfect for a project i need your assistance for that so , 
let me know if you agree and I will send you all the necessary details.

Maria.
Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 06/20] dax: set errors in mapping when writeback fails

2017-04-24 Thread Ross Zwisler

On Mon, Apr 24, 2017 at 09:22:45AM -0400, Jeff Layton wrote:
> In order to get proper error codes from fsync, we must set an error in
> the mapping range when writeback fails.
> 
> Signed-off-by: Jeff Layton 

Works fine in some error injection testing.

Tested-by: Ross Zwisler 

> ---
>  fs/dax.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 85abd741253d..9b6b04030c3f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -901,8 +901,10 @@ int dax_writeback_mapping_range(struct address_space 
> *mapping,
>  
>   ret = dax_writeback_one(bdev, mapping, indices[i],
>   pvec.pages[i]);
> - if (ret < 0)
> + if (ret < 0) {
> + mapping_set_error(mapping, ret);
>   return ret;
> + }
>   }
>   }
>   return 0;
> -- 
> 2.9.3
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/8] nowait aio: return on congested block device

2017-04-24 Thread Goldwyn Rodrigues

On 04/19/2017 01:45 AM, Christoph Hellwig wrote:
> On Fri, Apr 14, 2017 at 07:02:54AM -0500, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues 
>>
> 
>> +/* Request queue supports BIO_NOWAIT */
>> +queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
> 
> BIO_NOWAIT is gone.  And the comment would not be needed if the
> flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT.
> 
> And I think all request based drivers should set the flag implicitly
> as ->queuecommand can't sleep, and ->queue_rq only when it's always
> offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set.
> 

We introduced QUEUE_FLAG_NOWAIT for devices which would not wait for
request completions. The ones which wait are MD devices because of sync
or suspend operations.

The only user of BLK_MQ_F_NONBLOCKING seems to be nbd. As you mentioned,
it uses the flag to offload it to a workqueue.

The other way to do it implicitly is to change the flag to
BLK_MAY_BLOCK_REQS and use it for devices which do wait such as md/dm.
Is that what you are hinting at? Or do you have something else in mind?

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Problem with file system

2017-04-24 Thread Qu Wenruo




At 04/24/2017 11:27 PM, Fred Van Andel wrote:

I have a btrfs file system with a few thousand snapshots.  When I
attempted to delete 20 or so of them the problems started.

The disks are being read but except for the first few minutes there
are no writes.

Memory usage keeps growing until all the memory (24 Gb) is used in a
few hours. Eventually the system will crash with out of memory errors.


Are you using qgroup/quota?

IIRC qgroup for subvolume deletion will cause full subtree rescan which 
can cause tons of memory.


Thanks,
Qu



The CPU load is low (<5%) but iowait is around 30 to 50%

The drives are mounted but any process that attempts to access them
will just hang so I cannot access any data on the drives.

Smartctl does not show any issues with the drives.

The problem restarts after a reboot once you mount the drives.

I tried to zero the log hoping it wouldn't restart after a reboot but
that didn't work

I am assuming that the attempt to remove the snapshots caused this
problem.  How do I interrupt the process so I can access the
filesystem again?

# uname -a
Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

#   btrfs --version
btrfs-progs v4.9.1

#   btrfs fi show
Label: none  uuid: 79ba7374-bf77-4868-bb64-656ff5736c44
 Total devices 6 FS bytes used 5.65TiB
 devid1 size 1.82TiB used 1.29TiB path /dev/sdb
 devid2 size 1.82TiB used 1.29TiB path /dev/sdc
 devid3 size 1.82TiB used 1.29TiB path /dev/sdd
 devid4 size 1.82TiB used 1.29TiB path /dev/sde
 devid5 size 3.64TiB used 3.11TiB path /dev/sdf
 devid6 size 3.64TiB used 3.11TiB path /dev/sdg

# btrfs fi df /pubroot
Data, RAID1: total=5.58TiB, used=5.58TiB
System, RAID1: total=32.00MiB, used=828.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=104.00GiB, used=70.64GiB
GlobalReserve, single: total=512.00MiB, used=28.51MiB
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/8] nowait aio: return on congested block device

2017-04-24 Thread Jens Axboe

On 04/24/2017 03:10 PM, Goldwyn Rodrigues wrote:
> 
> 
> On 04/19/2017 01:45 AM, Christoph Hellwig wrote:
>> On Fri, Apr 14, 2017 at 07:02:54AM -0500, Goldwyn Rodrigues wrote:
>>> From: Goldwyn Rodrigues 
>>>
>>
>>> +   /* Request queue supports BIO_NOWAIT */
>>> +   queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q);
>>
>> BIO_NOWAIT is gone.  And the comment would not be needed if the
>> flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT.
>>
>> And I think all request based drivers should set the flag implicitly
>> as ->queuecommand can't sleep, and ->queue_rq only when it's always
>> offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set.
>>
> 
> We introduced QUEUE_FLAG_NOWAIT for devices which would not wait for
> request completions. The ones which wait are MD devices because of sync
> or suspend operations.
> 
> The only user of BLK_MQ_F_NONBLOCKING seems to be nbd. As you mentioned,
> it uses the flag to offload it to a workqueue.
> 
> The other way to do it implicitly is to change the flag to
> BLK_MAY_BLOCK_REQS and use it for devices which do wait such as md/dm.
> Is that what you are hinting at? Or do you have something else in mind?

You are misunderstanding. What Christoph (correctly) says is that request
based drivers do not block, so all of those are fine. What you need to
worry about is drivers that are NOT request based. In other words, drivers
that hook into make_request_fn and process bio's.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

backing up a collection of snapshot subvolumes

2017-04-24 Thread J. Hart

I have a remote machine with a filesystem for which I periodically take 
incremental snapshots for historical reasons.  These snapshots are 
stored in an archival filesystem tree on a file server.  Older snapshots 
are removed and newer ones added on a rotational basis.  I need to be 
able to backup this archive by syncing it with a set of backup drives. 
Due to the size, I need to back it up incrementally rather than sending 
the entire content each time.  Due to the snapshot rotation, I need to 
be able to update the state of the archive backup filesystem as a whole, 
in much the same manner that rsync handles file trees.


It seems that I cannot use "btrfs send", as the archive directory 
contains the snapshots as subvolumes.


I cannot use rsync as it treats subvolumes as simple directories, and 
does not preserve subvolume attributes.  Rsync also does not support 
reflinks, so the snapshot directory content will no longer be reflinked 
to other snapshots on the archive backup.  I cannot use hard links in 
the incrementals as hard links do not cross subvolume boundaries.


Thoughts anyone ?

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Problem with file system

2017-04-24 Thread Duncan

Chris Murphy posted on Mon, 24 Apr 2017 11:02:02 -0600 as excerpted:

> On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel 
> wrote:
>> I have a btrfs file system with a few thousand snapshots.  When I
>> attempted to delete 20 or so of them the problems started.
>>
>> The disks are being read but except for the first few minutes there are
>> no writes.
>>
>> Memory usage keeps growing until all the memory (24 Gb) is used in a
>> few hours. Eventually the system will crash with out of memory errors.

In addition to what CMurphy and QW suggested (both valid), I have a 
couple other suggestions/pointers.  They won't help you get out of the 
current situation, but they might help you stay out of it in the future.

1) A "few thousand snapshots", but no mention of how many subvolumes 
those snapshots are of, or how many per subvolume.

As CMurphy says but I'll expand it here, taking a snapshot is nearly 
free, just a bit of metadata to write because btrfs is COW-based and all 
a snapshot does is lock down a copy of everything in the subvolume as it 
exists currently and the filesystem's already tracking that, but removal 
is expensive, because btrfs must go thru and check everything to see if 
it can actually be deleted (no other snapshots referencing the block) or 
not (something else referencing it).

Obviously, then, this checking gets much more complicated the more 
snapshots of the same subvolume that exist.  IOW, it's a scaling issue.

The same scaling issue applies to various other btrfs maintenance tasks, 
including btrfs check (aka btrfsck), and btrfs balance (and thus btrfs 
device remove, which does an implicit balance).  Both of these take *far* 
longer if the number of snapshots per subvolume is allowed to get out of 
hand.

Due to this scaling issue, the recommendation is no more than 200-300 
snapshots per subvolume, and keeping it down to 50-100 max is even 
better, if you can do it reasonably.  That helps keep scaling issues and 
thus time for any necessary maintenance manageable.  Otherwise... well, 
we've had reports of device removes (aka balances) that would take 
/months/ to finish at the rate they were going.  Obviously, well before 
it gets to that point it's far faster to simply blow away the filesystem 
and restore from backups.[1]

It follows that if you have an automated system doing the snapshots, it's 
equally important to have an automated system doing snapshot thinning as 
well, keeping the number of snapshots per subvolume within manageable 
scaling limits.

So if that's "a few thousand snapshots", I hope you that's of (at least) 
a double-digit number of subvolumes, keeping the number of snapshots per 
subvolume under 300, and under 100 if your snapshot rotation schedule 
will allow it.

2) As Qu suggests, btrfs quotas increase the scaling issues significantly.

Additionally, there have been and continue to be accuracy issues with 
certain quota corner-cases, so they can't be entirely relied upon anyway.

Generally, people using btrfs quotas fall into three categories:

a) Those who know the problems and are working with Qu and the other devs 
to report and trace issues so they will eventually work well, ideally 
with less of a scaling issue as well.

Bless them!  Keep it up! =:^)

b) Those who have a use-case that really depends on quotas.

Because btrfs quotas are buggy and not entirely reliable now, not to 
mention the scaling issues, these users are almost certainly better 
served using more mature filesystems with mature and dependable quotas.

c) Those who don't really care about quotas specifically, and are just 
using them because it's a nice feature.  This likely includes some who 
are simply running distros that enable quotas.

My recommendation for these users is to simply turn btrfs quotas off for 
now, as they're presently in general more trouble than they're worth, due 
to both the accuracy and scaling issues.  Hopefully quotas will be stable 
in a couple years, and with developer and tester hard work perhaps the 
scaling issues will have been reduced as well, and that recommendation 
can change.  But for now, if you don't really need them, leaving quotas 
off will significantly reduce scaling issues.  And if you do need them, 
they're not yet reliable on btrfs anyway, so better off using something 
more mature where they actually work.

3) Similarly (tho unlikely to apply in your case), beware of the scaling 
implications of the various reflink-based copying and dedup utilities, 
which work via the same copy-on-write and reflinking technology that's 
behind snapshotting.

Tho snapshotting is effectively reflinking /everything/ in the subvolume, 
so the scaling issues compound much faster there than they will with a 
more trivial level of reflinking.  Of course, when it comes to dedup, a 
more trivial level of reflinking means less benefit from doing the dedup 
in the first place, so there's a limit to the effectiveness of dedup 
before it starts having the same scal

Re: Problem with file system

2017-04-24 Thread Marat Khalili


On 25/04/17 03:26, Qu Wenruo wrote:
IIRC qgroup for subvolume deletion will cause full subtree rescan 
which can cause tons of memory. 
Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even 
use this absurd amount of memory for? Is it swappable?


Haven't read about RAM limitations for running qgroups before, only 
about CPU load (which importantly only requires patience, does not crash 
servers).


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Problem with file system

2017-04-24 Thread Qu Wenruo




At 04/25/2017 01:33 PM, Marat Khalili wrote:

On 25/04/17 03:26, Qu Wenruo wrote:
IIRC qgroup for subvolume deletion will cause full subtree rescan 
which can cause tons of memory. 
Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even 
use this absurd amount of memory for? Is it swappable?


The memory is used for 2 reasons.

1) Record which extents are needed to trace
   Freed at transaction commit.

   Need better idea to handle them. Maybe create a new tree so that we
   can write it to disk?
   Or another qgroup rework?

2) Record current roots referring to this extent
   Only after v4.10 IIRC.

The memory allocated is not swappable.

How many memory it uses depends on the number of extents of that subvolume.

It's 56 bytes for one extent, both tree block and data extent.
To use up 16G ram, it's about 300 million extents.
For 5.6T volume, its average extent size is about 20K.

It seems that your volume is highly fragmented though.

If that's the problem, disabling qgroup may be the best workaround.

Thanks,
Qu



Haven't read about RAM limitations for running qgroups before, only 
about CPU load (which importantly only requires patience, does not crash 
servers).


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

61 matches

Mail list logo