Re: [PATCH] Tidy while loop in end_compressed_writeback
Hi David, I think my previous email may have failed to send. Can you merge my patch below? It's a minor update that I think improves readability. Thanks, Sahil On 04/19/2017 05:43 AM, Sahil Kang wrote: Hi, This is my first patch, so it's a minor code change. I think removing the early continue from the loop makes the function a little easier to follow. Please have a look and I'd appreciate any feedback. Thanks, Sahil From 98afe83a570180e841fefe3fd48d450accc42ea3 Mon Sep 17 00:00:00 2001 From: Sahil Kang Date: Wed, 19 Apr 2017 04:47:00 -0700 Subject: [PATCH] Tidy while loop in end_compressed_writeback Instead of continuing early in the loop when ret is 0, we can decrement nr_pages and increment index by 1 at the end. The for loop will not execute when ret is 0 anyway. Signed-off-by: Sahil Kang --- fs/btrfs/compression.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index c473c42..c653297 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -238,17 +238,14 @@ static noinline void end_compressed_writeback(struct inode *inode, ret = find_get_pages_contig(inode->i_mapping, index, min_t(unsigned long, nr_pages, ARRAY_SIZE(pages)), pages); -if (ret == 0) { -nr_pages -= 1; -index += 1; -continue; -} for (i = 0; i < ret; i++) { if (cb->errors) SetPageError(pages[i]); end_page_writeback(pages[i]); page_cache_release(pages[i]); } + +ret = ret ? ret : 1; /* set ret to 1 if it's 0 */ nr_pages -= ret; index += ret; }
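The forward-progress argument in the commit message can be seen in isolation with a small standalone sketch. The program below is not the btrfs code: fake_lookup() is an invented stand-in for find_get_pages_contig(), and the page handling is reduced to a printf. It only demonstrates that skipping the inner loop when the lookup returns 0, and then advancing by at least one, keeps the outer while loop terminating.

#include <stdio.h>

/*
 * Stand-in for find_get_pages_contig(): every third index has nothing
 * ready, and the result is capped by the remaining page count, just as
 * the min_t() in the real call caps it.
 */
static int fake_lookup(unsigned long index, unsigned long remaining)
{
	int found = (index % 3 == 0) ? 0 : 2;

	return found < (int)remaining ? found : (int)remaining;
}

int main(void)
{
	unsigned long index = 0;
	unsigned long nr_pages = 10;

	while (nr_pages > 0) {
		int i, ret = fake_lookup(index, nr_pages);

		/* when ret == 0 this loop is simply a no-op */
		for (i = 0; i < ret; i++)
			printf("handled page at index %lu\n", index + i);

		ret = ret ? ret : 1;	/* still advance by at least one page */
		nr_pages -= ret;
		index += ret;
	}
	return 0;
}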
[PATCH] Btrfs-progs: run_next_block: Return error code when extent buffer read fails
run_next_block() ignores the return value of read_tree_block() and always returns a success status. This might cause deal_root_from_list() to loop infinitely when reading a metadata block fails. This patch fixes the issue by extracting and returning the error code from the return value of read_tree_block(). Bug ID : https://bugzilla.kernel.org/show_bug.cgi?id=11 Signed-off-by: Praveen K Pandey --- cmds-check.c | 1 + 1 file changed, 1 insertion(+) diff --git a/cmds-check.c b/cmds-check.c index 17b7efb..40e3c99 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -7625,6 +7625,7 @@ static int run_next_block(struct btrfs_root *root, if (!extent_buffer_uptodate(buf)) { record_bad_block_io(root->fs_info, extent_cache, bytenr, size); + ret = PTR_ERR(buf); goto out; } -- 2.5.5
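The one-line fix relies on the kernel's ERR_PTR convention, in which a pointer-returning function encodes a negative errno directly in the pointer value and PTR_ERR() recovers it. The sketch below is a simplified userspace stand-in for the macros from include/linux/err.h, with fake_read_block() as a hypothetical substitute for read_tree_block(); it illustrates the convention only and is not the btrfs-progs code path.

#include <errno.h>
#include <stdio.h>

#define MAX_ERRNO	4095

/* Simplified stand-ins for the kernel's ERR_PTR/PTR_ERR/IS_ERR macros. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* A stand-in for a pointer-returning read path that can fail. */
static void *fake_read_block(int fail)
{
	static char buf[16];

	return fail ? ERR_PTR(-EIO) : buf;
}

int main(void)
{
	void *buf = fake_read_block(1);

	if (IS_ERR(buf))
		printf("read failed: %ld\n", PTR_ERR(buf));	/* prints -5 */
	return 0;
}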
[PATCH v3 00/20] fs: introduce new writeback error reporting and convert existing API as a wrapper around it
v3: wb_err_t -> errseq_t; clean up places that re-set errors after calling filemap_* functions. v2: introduce wb_err_t, use atomics. Apologies for the wide posting here, but this touches a lot of areas. This is v3 of the patchset to improve how we're tracking and reporting errors that occur during pagecache writeback. There are several situations where the kernel can "lose" errors that occur during writeback, such that fsync will return success even though it failed to write back some data previously. The basic idea here is to have the kernel be more deliberate about the point from which errors are checked to ensure that this doesn't happen. Additionally, this set changes the behavior of fsync in Linux to report writeback errors on all fds instead of just the first one. This allows writers to reliably tell whether their data made it to the backing device without having to coordinate fsync calls with other writers. This set sprawls over a large swath of kernel code. I think the first 12 patches in the series are pretty straightforward and are more or less ready for merge. The real changes start with patch 13. That adds support for errseq_t, builds a new writeback error tracking API on top of that, and converts the existing code to use it. After that, there are a few cleanup patches to eliminate some unneeded error re-setting, etc. Unfortunately, testing this across so many filesystems is rather difficult. I have an xfstest for block-based filesystems that uses dm_error, which I'll post soon. That works well with ext4, but btrfs and xfs seem to go r/o soon after the first error. I also don't have a good general method for testing this on network filesystems (yet!). I'd like to see better testing here and am open to suggestions. I will note that the POSIX fsync spec says this: "It is reasonable to assert that the key aspects of fsync() are unreasonable to test in a test suite. That does not make the function any less valuable, just more difficult to test. [...] It would also not be unreasonable to omit testing for fsync(), allowing it to be treated as a quality-of-implementation issue." Of course, they're talking about a POSIX conformance test, but I think the same point applies here. At this point, I'd like to start getting some of the preliminary patches merged (the first 12 or so). Most of those aren't terribly controversial and seem like reasonable bugfixes and cleanups. If any subsystem maintainers want to pick those up, then please do. After that, I'd like to get the larger changes into linux-next with an aim for merge in v4.13 or v4.14 (depending on how testing goes). Feedback is of course welcome!
Jeff Layton (20): mm: drop "wait" parameter from write_one_page mm: fix mapping_set_error call in me_pagecache_dirty buffer: use mapping_set_error instead of setting the flag fs: check for writeback errors after syncing out buffers in generic_file_fsync orangefs: don't call filemap_write_and_wait from fsync dax: set errors in mapping when writeback fails nilfs2: set the mapping error when calling SetPageError on writeback mm: ensure that we set mapping error if writeout() fails 9p: set mapping error when writeback fails in launder_page fuse: set mapping error in writepage_locked when it fails cifs: set mapping error when page writeback fails in writepage or launder_pages lib: add errseq_t type and infrastructure for handling it fs: new infrastructure for writeback error handling and reporting fs: retrofit old error reporting API onto new infrastructure mm: remove AS_EIO and AS_ENOSPC flags mm: don't TestClearPageError in __filemap_fdatawait_range cifs: cleanup writeback handling errors and comments mm: clean up error handling in write_one_page jbd2: don't reset error in journal_finish_inode_data_buffers gfs2: clean up some filemap_* calls Documentation/filesystems/vfs.txt | 9 +- fs/9p/vfs_addr.c | 5 +- fs/btrfs/file.c | 10 +- fs/btrfs/tree-log.c | 9 +- fs/buffer.c | 2 +- fs/cifs/cifsfs.c | 4 +- fs/cifs/file.c| 17 ++-- fs/cifs/inode.c | 22 ++--- fs/dax.c | 4 +- fs/exofs/dir.c| 2 +- fs/ext2/dir.c | 2 +- fs/ext2/file.c| 2 +- fs/f2fs/file.c| 3 + fs/f2fs/node.c| 6 +- fs/fuse/file.c| 8 +- fs/gfs2/glops.c | 12 +-- fs/gfs2/lops.c| 4 +- fs/gfs2/super.c | 6 +- fs/jbd2/commit.c | 13 +-- fs/jfs/jfs_metapage.c | 4 +- fs/libfs.c| 3 + fs/minix/dir.c| 2 +- fs/nilfs2/segment.c | 1 + fs/open.c | 3 + fs/orangefs/file.c| 5 +- fs/sysv/dir.c
[PATCH v3 03/20] buffer: use mapping_set_error instead of setting the flag
Signed-off-by: Jeff Layton Reviewed-by: Jan Kara Reviewed-by: Matthew Wilcox --- fs/buffer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/buffer.c b/fs/buffer.c index 9196f2a270da..70638941066d 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -483,7 +483,7 @@ static void __remove_assoc_queue(struct buffer_head *bh) list_del_init(&bh->b_assoc_buffers); WARN_ON(!bh->b_assoc_map); if (buffer_write_io_error(bh)) - set_bit(AS_EIO, &bh->b_assoc_map->flags); + mapping_set_error(bh->b_assoc_map, -EIO); bh->b_assoc_map = NULL; } -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 09/20] 9p: set mapping error when writeback fails in launder_page
launder_page is just writeback under the page lock. We still need to mark the mapping for errors there when they occur. Signed-off-by: Jeff Layton --- fs/9p/vfs_addr.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c index adaf6f6dd858..7af6e6501698 100644 --- a/fs/9p/vfs_addr.c +++ b/fs/9p/vfs_addr.c @@ -223,8 +223,11 @@ static int v9fs_launder_page(struct page *page) v9fs_fscache_wait_on_page_write(inode, page); if (clear_page_dirty_for_io(page)) { retval = v9fs_vfs_writepage_locked(page); - if (retval) + if (retval) { + if (retval != -EAGAIN) + mapping_set_error(page->mapping, retval); return retval; + } } return 0; } -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 08/20] mm: ensure that we set mapping error if writeout() fails
If writepage fails during a page migration, then we need to ensure that fsync will see it by flagging the mapping. Signed-off-by: Jeff Layton --- mm/migrate.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index 738f1d5f8350..3a59830bdae2 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -792,7 +792,11 @@ static int writeout(struct address_space *mapping, struct page *page) /* unlocked. Relock */ lock_page(page); - return (rc < 0) ? -EIO : -EAGAIN; + if (rc < 0) { + mapping_set_error(mapping, rc); + return -EIO; + } + return -EAGAIN; } /* -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 19/20] jbd2: don't reset error in journal_finish_inode_data_buffers
Now that we don't clear writeback errors after fetching them, there is no need to reset them. This is also potentially racy. Signed-off-by: Jeff Layton --- fs/jbd2/commit.c | 13 ++--- 1 file changed, 2 insertions(+), 11 deletions(-) diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c index b6b194ec1b4f..4c6262652028 100644 --- a/fs/jbd2/commit.c +++ b/fs/jbd2/commit.c @@ -264,17 +264,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal, jinode->i_flags |= JI_COMMIT_RUNNING; spin_unlock(&journal->j_list_lock); err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping); - if (err) { - /* -* Because AS_EIO is cleared by -* filemap_fdatawait_range(), set it again so -* that user process can get -EIO from fsync(). -*/ - mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO); - - if (!ret) - ret = err; - } + if (err && !ret) + ret = err; spin_lock(&journal->j_list_lock); jinode->i_flags &= ~JI_COMMIT_RUNNING; smp_mb(); -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 20/20] gfs2: clean up some filemap_* calls
In some places, it's trying to reset the mapping error after calling filemap_fdatawait. That's no longer required. Also, turn several filemap_fdatawrite+filemap_fdatawait calls into filemap_write_and_wait. That will at least return writeback errors that occur during the write phase. Signed-off-by: Jeff Layton --- fs/gfs2/glops.c | 12 fs/gfs2/lops.c | 4 +--- fs/gfs2/super.c | 6 ++ 3 files changed, 7 insertions(+), 15 deletions(-) diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c index 5db59d444838..7362d19fdc4c 100644 --- a/fs/gfs2/glops.c +++ b/fs/gfs2/glops.c @@ -158,9 +158,7 @@ static void rgrp_go_sync(struct gfs2_glock *gl) GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE); gfs2_log_flush(sdp, gl, NORMAL_FLUSH); - filemap_fdatawrite_range(mapping, gl->gl_vm.start, gl->gl_vm.end); - error = filemap_fdatawait_range(mapping, gl->gl_vm.start, gl->gl_vm.end); - mapping_set_error(mapping, error); + filemap_write_and_wait_range(mapping, gl->gl_vm.start, gl->gl_vm.end); gfs2_ail_empty_gl(gl); spin_lock(&gl->gl_lockref.lock); @@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl) filemap_fdatawrite(metamapping); if (ip) { struct address_space *mapping = ip->i_inode.i_mapping; - filemap_fdatawrite(mapping); - error = filemap_fdatawait(mapping); - mapping_set_error(mapping, error); + filemap_write_and_wait(mapping); + } else { + filemap_fdatawait(metamapping); } - error = filemap_fdatawait(metamapping); - mapping_set_error(metamapping, error); gfs2_ail_empty_gl(gl); /* * Writeback of the data mapping may cause the dirty flag to be set diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c index b1f9144b42c7..565ce33dec9b 100644 --- a/fs/gfs2/lops.c +++ b/fs/gfs2/lops.c @@ -586,9 +586,7 @@ static void gfs2_meta_sync(struct gfs2_glock *gl) if (mapping == NULL) mapping = &sdp->sd_aspace; - filemap_fdatawrite(mapping); - error = filemap_fdatawait(mapping); - + error = filemap_write_and_wait(mapping); if (error) gfs2_io_error(gl->gl_name.ln_sbd); } diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c index 361796a84fce..675c39566ea1 100644 --- a/fs/gfs2/super.c +++ b/fs/gfs2/super.c @@ -1593,10 +1593,8 @@ static void gfs2_evict_inode(struct inode *inode) out_truncate: gfs2_log_flush(sdp, ip->i_gl, NORMAL_FLUSH); metamapping = gfs2_glock2aspace(ip->i_gl); - if (test_bit(GLF_DIRTY, &ip->i_gl->gl_flags)) { - filemap_fdatawrite(metamapping); - filemap_fdatawait(metamapping); - } + if (test_bit(GLF_DIRTY, &ip->i_gl->gl_flags)) + filemap_write_and_wait(metamapping); write_inode_now(inode, 1); gfs2_ail_flush(ip->i_gl, 0); -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 15/20] mm: remove AS_EIO and AS_ENOSPC flags
They're no longer used. Signed-off-by: Jeff Layton --- include/linux/pagemap.h | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 32512ffc15fa..9593eac41499 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -20,13 +20,11 @@ * Bits in mapping->flags. */ enum mapping_flags { - AS_EIO = 0,/* IO error on async write */ - AS_ENOSPC = 1,/* ENOSPC on async write */ - AS_MM_ALL_LOCKS = 2,/* under mm_take_all_locks() */ - AS_UNEVICTABLE = 3,/* e.g., ramdisk, SHM_LOCK */ - AS_EXITING = 4,/* final truncate in progress */ + AS_MM_ALL_LOCKS = 0,/* under mm_take_all_locks() */ + AS_UNEVICTABLE = 1,/* e.g., ramdisk, SHM_LOCK */ + AS_EXITING = 2,/* final truncate in progress */ /* writeback related tags are not used */ - AS_NO_WRITEBACK_TAGS = 5, + AS_NO_WRITEBACK_TAGS = 3, }; static inline void mapping_set_error(struct address_space *mapping, int error) -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 16/20] mm: don't TestClearPageError in __filemap_fdatawait_range
The -EIO returned here can end up overriding whatever error is marked in the address space, and be returned at fsync time, even when there is a more appropriate error stored in the mapping. Read errors are also sometimes tracked on a per-page level using PG_error. Suppose we have a read error on a page, and then that page is subsequently dirtied by overwriting the whole page. Writeback doesn't clear PG_error, so we can then end up successfully writing back that page and still return -EIO on fsync. Worse yet, PG_error is cleared during a sync() syscall, but the -EIO return from that is silently discarded. Any subsystem that is relying on PG_error to report errors during fsync can easily lose writeback errors due to this. All you need is a stray sync() call on the box at the wrong time and you've lost the error. Since the handling of the PG_error flag is somewhat inconsistent across subsystems, let's just rely on marking the address space when there are writeback errors. Change the TestClearPageError call to ClearPageError, and make __filemap_fdatawait_range a void return function. Signed-off-by: Jeff Layton --- mm/filemap.c | 19 +-- 1 file changed, 5 insertions(+), 14 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index d94a76d4e023..47e7f50fb830 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -363,17 +363,16 @@ int filemap_flush(struct address_space *mapping) } EXPORT_SYMBOL(filemap_flush); -static int __filemap_fdatawait_range(struct address_space *mapping, +static void __filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte) { pgoff_t index = start_byte >> PAGE_SHIFT; pgoff_t end = end_byte >> PAGE_SHIFT; struct pagevec pvec; int nr_pages; - int ret = 0; if (end_byte < start_byte) - goto out; + return; pagevec_init(&pvec, 0); while ((index <= end) && @@ -390,14 +389,11 @@ static int __filemap_fdatawait_range(struct address_space *mapping, continue; wait_on_page_writeback(page); - if (TestClearPageError(page)) - ret = -EIO; + ClearPageError(page); } pagevec_release(&pvec); cond_resched(); } -out: - return ret; } /** @@ -417,15 +413,10 @@ static int __filemap_fdatawait_range(struct address_space *mapping, int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte) { - int ret, ret2; errseq_t since = filemap_sample_wb_error(mapping); - ret = __filemap_fdatawait_range(mapping, start_byte, end_byte); - ret2 = filemap_check_wb_error(mapping, since); - if (!ret) - ret = ret2; - - return ret; + __filemap_fdatawait_range(mapping, start_byte, end_byte); + return filemap_check_wb_error(mapping, since); } EXPORT_SYMBOL(filemap_fdatawait_range); -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 18/20] mm: clean up error handling in write_one_page
Don't try to check PageError since that's potentially racy and not necessarily going to be set after writepage errors out. Instead, sample the mapping error early on, and use that value to tell us whether we got a writeback error since then. Signed-off-by: Jeff Layton --- mm/page-writeback.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index de0dbf12e2c1..1643456881b4 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2373,11 +2373,12 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) int write_one_page(struct page *page) { struct address_space *mapping = page->mapping; - int ret = 0; + int ret = 0, ret2; struct writeback_control wbc = { .sync_mode = WB_SYNC_ALL, .nr_to_write = 1, }; + errseq_t since = filemap_sample_wb_error(mapping); BUG_ON(!PageLocked(page)); @@ -2386,16 +2387,14 @@ int write_one_page(struct page *page) if (clear_page_dirty_for_io(page)) { get_page(page); ret = mapping->a_ops->writepage(page, &wbc); - if (ret == 0) { + if (ret == 0) wait_on_page_writeback(page); - if (PageError(page)) - ret = -EIO; - } put_page(page); } else { unlock_page(page); } - return ret; + ret2 = filemap_check_wb_error(mapping, since); + return ret ? : ret2; } EXPORT_SYMBOL(write_one_page); -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 17/20] cifs: cleanup writeback handling errors and comments
Now that writeback errors are handled on a per-file basis using the new sequence counter method at the vfs layer, we no longer need to re-set errors in the mapping after doing writeback in non-fsync codepaths. Also, fix up some bogus comments. Signed-off-by: Jeff Layton --- fs/cifs/cifsfs.c | 4 +--- fs/cifs/file.c | 7 ++- fs/cifs/inode.c | 22 +++--- 3 files changed, 10 insertions(+), 23 deletions(-) diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index dd3f5fabfdf6..017a2d1d02c7 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -829,10 +829,8 @@ static loff_t cifs_llseek(struct file *file, loff_t offset, int whence) if (!CIFS_CACHE_READ(CIFS_I(inode)) && inode->i_mapping && inode->i_mapping->nrpages != 0) { rc = filemap_fdatawait(inode->i_mapping); - if (rc) { - mapping_set_error(inode->i_mapping, rc); + if (rc) return rc; - } } /* * Some applications poll for the file length in this strange diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 4b696a23b0b1..9b4f7f182add 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -722,9 +722,7 @@ cifs_reopen_file(struct cifsFileInfo *cfile, bool can_flush) cinode = CIFS_I(inode); if (can_flush) { - rc = filemap_write_and_wait(inode->i_mapping); - mapping_set_error(inode->i_mapping, rc); - + filemap_write_and_wait(inode->i_mapping); if (tcon->unix_ext) rc = cifs_get_inode_info_unix(&inode, full_path, inode->i_sb, xid); @@ -3906,8 +3904,7 @@ void cifs_oplock_break(struct work_struct *work) break_lease(inode, O_WRONLY); rc = filemap_fdatawrite(inode->i_mapping); if (!CIFS_CACHE_READ(cinode)) { - rc = filemap_fdatawait(inode->i_mapping); - mapping_set_error(inode->i_mapping, rc); + filemap_fdatawait(inode->i_mapping); cifs_zap_mapping(inode); } cifs_dbg(FYI, "Oplock flush inode %p rc %d\n", inode, rc); diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c index b261db34103c..a58e605240fc 100644 --- a/fs/cifs/inode.c +++ b/fs/cifs/inode.c @@ -2008,10 +2008,8 @@ int cifs_getattr(const struct path *path, struct kstat *stat, if (!CIFS_CACHE_READ(CIFS_I(inode)) && inode->i_mapping && inode->i_mapping->nrpages != 0) { rc = filemap_fdatawait(inode->i_mapping); - if (rc) { - mapping_set_error(inode->i_mapping, rc); + if (rc) return rc; - } } rc = cifs_revalidate_dentry_attr(dentry); @@ -2171,15 +2169,12 @@ cifs_setattr_unix(struct dentry *direntry, struct iattr *attrs) * Attempt to flush data before changing attributes. We need to do * this for ATTR_SIZE and ATTR_MTIME for sure, and if we change the * ownership or mode then we may also need to do this. Here, we take -* the safe way out and just do the flush on all setattr requests. If -* the flush returns error, store it to report later and continue. +* the safe way out and just do the flush on all setattr requests. * * BB: This should be smarter. Why bother flushing pages that -* will be truncated anyway? Also, should we error out here if -* the flush returns error? +* will be truncated anyway? */ - rc = filemap_write_and_wait(inode->i_mapping); - mapping_set_error(inode->i_mapping, rc); + filemap_write_and_wait(inode->i_mapping); rc = 0; if (attrs->ia_valid & ATTR_SIZE) { @@ -2314,15 +2309,12 @@ cifs_setattr_nounix(struct dentry *direntry, struct iattr *attrs) * Attempt to flush data before changing attributes. We need to do * this for ATTR_SIZE and ATTR_MTIME for sure, and if we change the * ownership or mode then we may also need to do this. Here, we take -* the safe way out and just do the flush on all setattr requests. If -* the flush returns error, store it to report later and continue. 
+* the safe way out and just do the flush on all setattr requests. * * BB: This should be smarter. Why bother flushing pages that -* will be truncated anyway? Also, should we error out here if -* the flush returns error? +* will be truncated anyway? */ - rc = filemap_write_and_wait(inode->i_mapping); - mapping_set_error(inode->i_mapping, rc); + filemap_write_and_wait(inode->i_mapping); rc = 0; if (attrs->ia_valid & ATTR_SIZE) { -- 2.9
[PATCH v3 14/20] fs: retrofit old error reporting API onto new infrastructure
Now that we have a better way to store and report errors that occur during writeback, we need to convert the existing codebase to use it. We could just adapt all of the filesystem code and related infrastructure to the new API, but that's a lot of churn. When it comes to setting errors in the mapping, filemap_set_wb_error is a drop-in replacement for mapping_set_error. Turn that function into a simple wrapper around the new one. Because we want to ensure that writeback errors are always reported at fsync time, inject filemap_report_wb_error calls much closer to the syscall boundary, in call_fsync. For fsync calls (and things like the nfsd equivalent), we either return the error that the fsync operation returns, or the one returned by filemap_report_wb_error. In both cases, we advance the file->f_wb_err to the latest value. This allows us to provide new fsync semantics that will return errors that may have occurred previously and been viewed via other file descriptors. The final piece of the puzzle is what to do about filemap_check_errors calls that are being called directly or via filemap_* functions. Here, we must take a little "creative license". Since we now handle advancing the file->f_wb_err value at the generic filesystem layer, we no longer need those callers to clear errors out of the mapping or advance an errseq_t. A lot of the existing codebase relies on being getting an error back from those functions when there is a writeback problem, so we do still want to have them report writeback errors somehow. When reporting writeback errors, we will always report errors that have occurred since a particular point in time. With the old writeback error reporting, the time we used was "since it was last tested/cleared" which is entirely arbitrary and potentially racy. Now, we can at least report the latest error that has occurred since an arbitrary point in time (represented as a sampled errseq_t value). In the case where we don't have a struct file to work with, this patch just has the wrappers sample the current mapping->wb_err value, and use that as an arbitrary point from which to check for errors. That's probably not "correct" in all cases, particularly in the case of something like filemap_fdatawait, but I'm not sure it's any worse than what we already have, and this gives us a basis from which to work. A lot of those callers will likely want to change to a model where they sample the errseq_t much earlier (perhaps when starting a transaction), store it in an appropriate place and then use that value later when checking to see if an error occurred. That will almost certainly take some involvement from other subsystem maintainers. I'm quite open to adding new API functions to help enable this if that would be helpful, but I don't really want to do that until I better understand what's needed. Signed-off-by: Jeff Layton --- Documentation/filesystems/vfs.txt | 9 - fs/btrfs/file.c | 10 ++ fs/btrfs/tree-log.c | 9 ++--- fs/f2fs/file.c| 3 +++ fs/f2fs/node.c| 6 +- fs/fuse/file.c| 7 +-- fs/libfs.c| 6 -- include/linux/fs.h| 17 ++--- include/linux/pagemap.h | 8 ++-- mm/filemap.c | 33 +++-- 10 files changed, 44 insertions(+), 64 deletions(-) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index ed06fb39822b..f201a77873f7 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -577,7 +577,7 @@ written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared. 
If there is an error during writeback, then the address_space should be -marked with an error (typically using filemap_set_wb_error), in order to +marked with an error (typically using mapping_set_error), in order to ensure that the error can later be reported to the application when an fsync is issued. @@ -893,10 +893,9 @@ otherwise noted. release: called when the last reference to an open file is closed - fsync: called by the fsync(2) system call. Filesystems that use the - pagecache should call filemap_report_wb_error before returning - to ensure that any errors that occurred during writeback are - reported and the file's error sequence advanced. + fsync: called by the fsync(2) system call. Errors that were previously +recorded using mapping_set_error will automatically be returned to +the application and the file's error sequence advanced. fasync: called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 520cb7230b2d..e15faf240b51 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1962,6 +1962,7 @@ int btrfs_sync_file(struct file *
[PATCH v3 13/20] fs: new infrastructure for writeback error handling and reporting
Most filesystems currently use mapping_set_error and filemap_check_errors for setting and reporting/clearing writeback errors at the mapping level. filemap_check_errors is indirectly called from most of the filemap_fdatawait_* functions and from filemap_write_and_wait*. These functions are called from all sorts of contexts to wait on writeback to finish -- e.g. mostly in fsync, but also in truncate calls, getattr, etc. The non-fsync callers are problematic. We should be reporting writeback errors during fsync, but many places spread over the tree clear out errors before they can be properly reported, or report errors at nonsensical times. If I get -EIO on a stat() call, there is no reason for me to assume that it is because some previous writeback failed. The fact that it also clears out the error such that a subsequent fsync returns 0 is a bug, and a nasty one since that's potentially silent data corruption. This patch adds a small bit of new infrastructure for setting and reporting errors during address_space writeback. While the above was my original impetus for adding this, I think it's also the case that current fsync semantics are just problematic for userland. Most applications that call fsync do so to ensure that the data they wrote has hit the backing store. In the case where there are multiple writers to the file at the same time, this is really hard to determine. The first one to call fsync will see any stored error, and the rest get back 0. The processes with open fds may not be associated with one another in any way. They could even be in different containers, so ensuring coordination between all fsync callers is not really an option. One way to remedy this would be to track what file descriptor was used to dirty the file, but that's rather cumbersome and would likely be slow. However, there is a simpler way to improve the semantics here without incurring too much overhead. This set adds an errseq_t to struct address_space, and a corresponding one is added to struct file. Writeback errors are recorded in the mapping's errseq_t, and the one in struct file is used as the "since" value. This changes the semantics of the Linux fsync implementation such that applications can now use it to determine whether there were any writeback errors since fsync(fd) was last called (or since the file was opened in the case of fsync having never been called). Note that those writeback errors may have occurred when writing data that was dirtied via an entirely different fd, but that's the case now with the current mapping_set_error/filemap_check_error infrastructure. This will at least prevent you from getting a false report of success. The new behavior is still consistent with the POSIX spec, and is more reliable for application developers. This patch just adds some basic infrastructure for doing this. Later patches will change the existing code to use this new infrastructure. Signed-off-by: Jeff Layton --- Documentation/filesystems/vfs.txt | 10 +- fs/open.c | 3 +++ include/linux/fs.h| 24 mm/filemap.c | 38 ++ 4 files changed, 74 insertions(+), 1 deletion(-) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 94dd27ef4a76..ed06fb39822b 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -576,6 +576,11 @@ should clear PG_Dirty and set PG_Writeback. It can be actually written at any point after PG_Dirty is clear. Once it is known to be safe, PG_Writeback is cleared. 
+If there is an error during writeback, then the address_space should be +marked with an error (typically using filemap_set_wb_error), in order to +ensure that the error can later be reported to the application when an +fsync is issued. + Writeback makes use of a writeback_control structure... struct address_space_operations @@ -888,7 +893,10 @@ otherwise noted. release: called when the last reference to an open file is closed - fsync: called by the fsync(2) system call + fsync: called by the fsync(2) system call. Filesystems that use the + pagecache should call filemap_report_wb_error before returning + to ensure that any errors that occurred during writeback are + reported and the file's error sequence advanced. fasync: called by the fcntl(2) system call when asynchronous (non-blocking) mode is enabled for a file diff --git a/fs/open.c b/fs/open.c index 949cef29c3bb..88bfed8d3c88 100644 --- a/fs/open.c +++ b/fs/open.c @@ -709,6 +709,9 @@ static int do_dentry_open(struct file *f, f->f_inode = inode; f->f_mapping = inode->i_mapping; + /* Ensure that we skip any errors that predate opening of the file */ + f->f_wb_err = filemap_sample_wb_error(f->f_mapping); + if (unlikely(f->f_flags & O_PATH)) { f->f_mode = FMODE_PATH;
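The model this patch builds toward can be summarized with a toy userspace sketch. All of the toy_* names below are invented for illustration: the "mapping" keeps a sequence/error pair (the real kernel packs both into a single errseq_t in struct address_space), each open file samples the current sequence (as do_dentry_open now does into f->f_wb_err), and an fsync-style check reports an error only if one was recorded since that file last looked. This is a sketch of the intended semantics, not the kernel implementation.

#include <stdio.h>

struct toy_mapping {
	unsigned int seq;	/* bumped whenever a new error is recorded */
	int err;		/* most recent writeback error */
};

struct toy_file {
	struct toy_mapping *mapping;
	unsigned int since;	/* sequence seen at open / last fsync */
};

/* What mapping_set_error-style reporting does in this toy model. */
static void record_wb_error(struct toy_mapping *m, int err)
{
	m->err = err;
	m->seq++;
}

/* Report an error only if one was recorded since this file last checked. */
static int toy_fsync(struct toy_file *f)
{
	if (f->since == f->mapping->seq)
		return 0;		/* nothing new since we last looked */
	f->since = f->mapping->seq;	/* advance this file's cursor */
	return f->mapping->err;
}

int main(void)
{
	struct toy_mapping m = { 0, 0 };
	struct toy_file a = { &m, m.seq };	/* two fds open on one file */
	struct toy_file b = { &m, m.seq };

	record_wb_error(&m, -5);		/* background writeback fails */

	printf("fsync(a) = %d\n", toy_fsync(&a));	/* -5 */
	printf("fsync(b) = %d\n", toy_fsync(&b));	/* -5: both fds see it */
	printf("fsync(a) = %d\n", toy_fsync(&a));	/* 0: reported once per fd */
	return 0;
}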
[PATCH v3 12/20] lib: add errseq_t type and infrastructure for handling it
An errseq_t is a way of recording errors in one place, and allowing any number of "subscribers" to tell whether an error has been set again since a previous time. It's implemented as an unsigned 32-bit value that is managed with atomic operations. The low order bits are designated to hold an error code (max size of MAX_ERRNO). The upper bits are used as a counter. The API works with consumers sampling an errseq_t value at a particular point in time. Later, that value can be used to tell whether new errors have been set since that time. Note that there is a 1 in 512k risk of collisions here if new errors are being recorded frequently, since we have so few bits to use as a counter. To mitigate this, one bit is used as a flag to tell whether the value has been sampled since a new value was recorded. That allows us to avoid bumping the counter if no one has sampled it since it was last bumped. Later patches will build on this infrastructure to change how writeback errors are tracked in the kernel. Signed-off-by: Jeff Layton --- include/linux/errseq.h | 16 lib/Makefile | 2 +- lib/errseq.c | 199 + 3 files changed, 216 insertions(+), 1 deletion(-) create mode 100644 include/linux/errseq.h create mode 100644 lib/errseq.c diff --git a/include/linux/errseq.h b/include/linux/errseq.h new file mode 100644 index ..e2d374dbf683 --- /dev/null +++ b/include/linux/errseq.h @@ -0,0 +1,16 @@ +#ifndef _LINUX_ERRSEQ_H +#define _LINUX_ERRSEQ_H +typedef u32errseq_t; + +void __errseq_set(errseq_t *eseq, int err); +static inline void errseq_set(errseq_t *eseq, int err) +{ + /* Optimize for the common case of no error */ + if (unlikely(err)) + __errseq_set(eseq, err); +} + +errseq_t errseq_sample(errseq_t *eseq); +int errseq_check(errseq_t *eseq, errseq_t since); +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since); +#endif diff --git a/lib/Makefile b/lib/Makefile index 320ac46a8725..2423afef40f7 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \ gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \ bsearch.o find_bit.o llist.o memweight.o kfifo.o \ percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \ -once.o refcount.o +once.o refcount.o errseq.o obj-y += string_helpers.o obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o obj-y += hexdump.o diff --git a/lib/errseq.c b/lib/errseq.c new file mode 100644 index ..c783d0a39cb0 --- /dev/null +++ b/lib/errseq.c @@ -0,0 +1,199 @@ +#include +#include +#include +#include + +/* + * An errseq_t is a way of recording errors in one place, and allowing any + * number of "subscribers" to tell whether it has changed since an arbitrary + * time of their choosing. + * + * It's implemented as an unsigned 32-bit value. The low order bits are + * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits + * are used as a counter. This is done with atomics instead of locking so that + * these functions can be called from any context. + * + * The general idea is for consumers to sample an errseq_t value at a + * particular point in time. Later, that value can be used to tell whether any + * new errors have occurred since that time. + * + * Note that there is a risk of collisions, if new errors are being recorded + * frequently, since we have so few bits to use as a counter. + * + * To mitigate this, one bit is used as a flag to tell whether the value has + * been sampled since a new value was recorded. 
That allows us to avoid bumping + * the counter if no one has sampled it since the last time an error was + * recorded. + * + * A new errseq_t should always be zeroed out. A errseq_t value of all zeroes + * is the special (but common) case where there has never been an error. An all + * zero value thus serves as the "epoch" if one wishes to know whether there + * has ever been an error set since it was first initialized. + */ + +/* The low bits are designated for error code (max of MAX_ERRNO) */ +#define ERRSEQ_SHIFT ilog2(MAX_ERRNO + 1) + +/* This bit is used as a flag to indicate whether the value has been seen */ +#define ERRSEQ_SEEN(1 << ERRSEQ_SHIFT) + +/* The "ones" bit for the counter */ +#define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1)) + +/** + * __errseq_set - set a errseq_t for later reporting + * @eseq: errseq_t field that should be set + * @err: error to set + * + * This function sets the error in *eseq, and increments the sequence counter + * if the last sequence was sampled at some point in the past. + * + * Any error set will always overwrite an existing error. + * + * Most callers will want to use the errseq_set inline wrapper to efficiently + * handle the common case where err is 0. + */ +void __errseq_set(errseq_t *eseq, int err) +{
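To see the encoding concretely, here is a rough, non-atomic userspace model of the layout described above: 12 low bits for the error (up to MAX_ERRNO), one "seen" flag, and the remaining bits used as a counter that is only bumped when the previous value had been sampled. The toy_errseq_* helpers are illustrative only and choose one plausible place to set the seen flag; the in-kernel lib/errseq.c uses atomic operations and differs in detail.

#include <stdio.h>

typedef unsigned int errseq_t;

#define MAX_ERRNO	4095
#define ERRSEQ_SHIFT	12			/* ilog2(MAX_ERRNO + 1) */
#define ERRSEQ_SEEN	(1u << ERRSEQ_SHIFT)	/* sampled since last set? */
#define ERRSEQ_CTR_INC	(1u << (ERRSEQ_SHIFT + 1))

/* Record an error; bump the counter only if the old value was sampled. */
static void toy_errseq_set(errseq_t *eseq, int err)
{
	errseq_t old = *eseq;
	errseq_t new = (old & ~(errseq_t)(MAX_ERRNO | ERRSEQ_SEEN)) | (errseq_t)-err;

	if (old & ERRSEQ_SEEN)
		new += ERRSEQ_CTR_INC;
	*eseq = new;
}

/* Remember the current state so we can later ask "anything new?". */
static errseq_t toy_errseq_sample(errseq_t *eseq)
{
	*eseq |= ERRSEQ_SEEN;
	return *eseq;
}

/* Report the error if one was recorded after 'since' was sampled. */
static int toy_errseq_check(errseq_t *eseq, errseq_t since)
{
	errseq_t cur = *eseq;

	if (cur == since)
		return 0;
	return -(int)(cur & MAX_ERRNO);
}

int main(void)
{
	errseq_t wb_err = 0;
	errseq_t since = toy_errseq_sample(&wb_err);

	toy_errseq_set(&wb_err, -5);	/* -EIO recorded */
	printf("check = %d\n", toy_errseq_check(&wb_err, since));	/* -5 */

	since = toy_errseq_sample(&wb_err);
	printf("check = %d\n", toy_errseq_check(&wb_err, since));	/* 0 */
	return 0;
}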
[PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails
This ensures that we see errors on fsync when writeback fails. Signed-off-by: Jeff Layton --- fs/fuse/file.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index ec238fb5a584..07d0efcb050c 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page) err_free: fuse_request_free(req); err: + mapping_set_error(page->mapping, error); end_page_writeback(page); return error; } -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 11/20] cifs: set mapping error when page writeback fails in writepage or launder_pages
Signed-off-by: Jeff Layton --- fs/cifs/file.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 21d404535739..4b696a23b0b1 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct writeback_control *wbc) set_page_writeback(page); retry_write: rc = cifs_partialpagewrite(page, 0, PAGE_SIZE); - if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL) + if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL) { goto retry_write; - else if (rc == -EAGAIN) + } else if (rc == -EAGAIN) { redirty_page_for_writepage(wbc, page); - else if (rc != 0) + } else if (rc != 0) { SetPageError(page); - else + mapping_set_error(page->mapping, rc); + } else { SetPageUptodate(page); + } end_page_writeback(page); put_page(page); free_xid(xid); -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 07/20] nilfs2: set the mapping error when calling SetPageError on writeback
In a later patch, we're going to want to make the fsync codepath not do a TestClearPageError call as that can override the error set in the address space. To do that though, we need to ensure that filesystems that are relying on the PG_error bit for reporting writeback errors also set an error in the address space. Ensure that this is set in nilfs2. Cc: Ryusuke Konishi Signed-off-by: Jeff Layton --- fs/nilfs2/segment.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index febed1217b3f..612d4b446793 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1745,6 +1745,7 @@ static void nilfs_end_page_io(struct page *page, int err) } else { __set_page_dirty_nobuffers(page); SetPageError(page); + mapping_set_error(page_mapping(page), err); } end_page_writeback(page); -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 06/20] dax: set errors in mapping when writeback fails
In order to get proper error codes from fsync, we must set an error in the mapping range when writeback fails. Signed-off-by: Jeff Layton --- fs/dax.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index 85abd741253d..9b6b04030c3f 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -901,8 +901,10 @@ int dax_writeback_mapping_range(struct address_space *mapping, ret = dax_writeback_one(bdev, mapping, indices[i], pvec.pages[i]); - if (ret < 0) + if (ret < 0) { + mapping_set_error(mapping, ret); return ret; + } } } return 0; -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: tests: add variable quotation to fsck-tests
Signed-off-by: Lakshmipathi.G --- tests/fsck-tests/006-bad-root-items/test.sh| 6 +++--- tests/fsck-tests/012-leaf-corruption/test.sh | 24 +++--- tests/fsck-tests/013-extent-tree-rebuild/test.sh | 22 ++-- tests/fsck-tests/018-leaf-crossing-stripes/test.sh | 4 ++-- .../fsck-tests/019-non-skinny-false-alert/test.sh | 4 ++-- tests/fsck-tests/020-extent-ref-cases/test.sh | 4 ++-- .../021-partially-dropped-snapshot-case/test.sh| 4 ++-- tests/fsck-tests/022-qgroup-rescan-halfway/test.sh | 4 ++-- tests/fsck-tests/023-qgroup-stack-overflow/test.sh | 4 ++-- tests/fsck-tests/024-clear-space-cache/test.sh | 16 +++ tests/fsck-tests/025-file-extents/test.sh | 2 +- tests/fsck-tests/026-check-inode-link/test.sh | 2 +- 12 files changed, 48 insertions(+), 48 deletions(-) diff --git a/tests/fsck-tests/006-bad-root-items/test.sh b/tests/fsck-tests/006-bad-root-items/test.sh index 8433234..bf3ef78 100755 --- a/tests/fsck-tests/006-bad-root-items/test.sh +++ b/tests/fsck-tests/006-bad-root-items/test.sh @@ -1,15 +1,15 @@ #!/bin/bash -source $TOP/tests/common +source "$TOP/tests/common" check_prereq btrfs -echo "extracting image default_case.tar.xz" >> $RESULTS +echo "extracting image default_case.tar.xz" >> "$RESULTS" tar --no-same-owner -xJf default_case.tar.xz || \ _fail "failed to extract default_case.tar.xz" check_image test.img -echo "extracting image skinny_case.tar.xz" >> $RESULTS +echo "extracting image skinny_case.tar.xz" >> "$RESULTS" tar --no-same-owner -xJf skinny_case.tar.xz || \ _fail "failed to extract skinny_case.tar.xz" check_image test.img diff --git a/tests/fsck-tests/012-leaf-corruption/test.sh b/tests/fsck-tests/012-leaf-corruption/test.sh index a308727..43b0e6d 100755 --- a/tests/fsck-tests/012-leaf-corruption/test.sh +++ b/tests/fsck-tests/012-leaf-corruption/test.sh @@ -1,6 +1,6 @@ #!/bin/bash -source $TOP/tests/common +source "$TOP/tests/common" check_prereq btrfs-image @@ -37,16 +37,16 @@ leaf_no_data_ext_list=( generate_leaf_corrupt_no_data_ext() { dest=$1 - echo "generating leaf_corrupt_no_data_ext.btrfs-image" >> $RESULTS + echo "generating leaf_corrupt_no_data_ext.btrfs-image" >> "$RESULTS" tar --no-same-owner -xJf ./no_data_extent.tar.xz || \ _fail "failed to extract leaf_corrupt_no_data_ext.btrfs-image" - $TOP/btrfs-image -r test.img.btrfs-image $dest || \ + "$TOP/btrfs-image" -r test.img.btrfs-image "$dest" || \ _fail "failed to extract leaf_corrupt_no_data_ext.btrfs-image" # leaf at 4206592 and 20905984 contains no regular data # extent, clear its csum to corrupt the leaf. 
for x in 4206592 20905984; do - dd if=/dev/zero of=$dest bs=1 count=32 conv=notrunc seek=$x \ + dd if=/dev/zero of="$dest" bs=1 count=32 conv=notrunc seek="$x" \ 1>/dev/null 2>&1 done } @@ -60,21 +60,21 @@ check_inode() name=$5 # Check whether the inode exists - exists=$($SUDO_HELPER find $path -inum $ino) + exists=$($SUDO_HELPER find "$path" -inum "$ino") if [ -z "$exists" ]; then _fail "inode $ino not recovered correctly" fi # Check inode type - found_mode=$(printf "%o" 0x$($SUDO_HELPER stat $exists -c %f)) - if [ $found_mode -ne $mode ]; then + found_mode=$(printf "%o" 0x$($SUDO_HELPER stat "$exists" -c %f)) + if [ "$found_mode" -ne "$mode" ]; then echo "$found_mode" _fail "inode $ino modes not recovered" fi # Check inode size - found_size=$($SUDO_HELPER stat $exists -c %s) - if [ $mode -ne 41700 -a $found_size -ne $size ]; then + found_size=$($SUDO_HELPER stat "$exists" -c %s) + if [ $mode -ne 41700 -a "$found_size" -ne "$size" ]; then _fail "inode $ino size not recovered correctly" fi @@ -90,11 +90,11 @@ check_inode() check_leaf_corrupt_no_data_ext() { image=$1 - $SUDO_HELPER mount -o loop $image -o ro $TEST_MNT + $SUDO_HELPER mount -o loop "$image" -o ro "$TEST_MNT" i=0 while [ $i -lt ${#leaf_no_data_ext_list[@]} ]; do - check_inode $TEST_MNT/lost+found \ + check_inode "$TEST_MNT/lost+found" \ ${leaf_no_data_ext_list[i]} \ ${leaf_no_data_ext_list[i + 1]} \ ${leaf_no_data_ext_list[i + 2]} \ @@ -102,7 +102,7 @@ check_leaf_corrupt_no_data_ext() ${leaf_no_data_ext_list[i + 4]} ((i+=4)) done - $SUDO_HELPER umount $TEST_MNT + $SUDO_HELPER umount "$TEST_MNT" } setup_root_helper diff --git a/tests/fsck-tests/013-extent-tree-rebuild/test.sh b/tests/fsck-tests/013-extent-tree-rebuild/test.sh index 08c
[PATCH v3 05/20] orangefs: don't call filemap_write_and_wait from fsync
Orangefs doesn't do buffered writes yet, so there's no point in initiating and waiting for writeback. Signed-off-by: Jeff Layton --- fs/orangefs/file.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c index e6bbc8083d77..17ab42c4db52 100644 --- a/fs/orangefs/file.c +++ b/fs/orangefs/file.c @@ -646,14 +646,11 @@ static int orangefs_fsync(struct file *file, loff_t end, int datasync) { - int ret = -EINVAL; + int ret; struct orangefs_inode_s *orangefs_inode = ORANGEFS_I(file_inode(file)); struct orangefs_kernel_op_s *new_op = NULL; - /* required call */ - filemap_write_and_wait_range(file->f_mapping, start, end); - new_op = op_alloc(ORANGEFS_VFS_OP_FSYNC); if (!new_op) return -ENOMEM; -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 04/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync
ext2 currently does a test+clear of the AS_EIO flag, which is problematic for some coming changes. What we really need to do instead is call filemap_check_errors in __generic_file_fsync after syncing out the buffers. That will be sufficient for this case, and help other callers detect these errors properly as well. With that, we don't need to twiddle it in ext2. Suggested-by: Jan Kara Signed-off-by: Jeff Layton --- fs/ext2/file.c | 2 +- fs/libfs.c | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ext2/file.c b/fs/ext2/file.c index b21891a6bfca..ed00e7ae0ef3 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -177,7 +177,7 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync) struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; ret = generic_file_fsync(file, start, end, datasync); - if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { + if (ret == -EIO) { /* We don't really know where the IO error happened... */ ext2_error(sb, __func__, "detected IO error when writing metadata buffers"); diff --git a/fs/libfs.c b/fs/libfs.c index a8b62e5d43a9..12a48ee442d3 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -991,7 +991,8 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end, out: inode_unlock(inode); - return ret; + err = filemap_check_errors(inode->i_mapping); + return ret ? : err; } EXPORT_SYMBOL(__generic_file_fsync); -- 2.9.3
[PATCH v3 01/20] mm: drop "wait" parameter from write_one_page
The callers all set it to 1. Also, make it clear that this function will not set any sort of AS_* error, and that the caller must do so if necessary. No existing caller uses this on normal files, so none of them need it. Also, add __must_check here since, in general, the callers need to handle an error here in some fashion. Signed-off-by: Jeff Layton Reviewed-by: Ross Zwisler Reviewed-by: Jan Kara Reviewed-by: Matthew Wilcox --- fs/exofs/dir.c| 2 +- fs/ext2/dir.c | 2 +- fs/jfs/jfs_metapage.c | 4 ++-- fs/minix/dir.c| 2 +- fs/sysv/dir.c | 2 +- fs/ufs/dir.c | 2 +- include/linux/mm.h| 2 +- mm/page-writeback.c | 14 +++--- 8 files changed, 15 insertions(+), 15 deletions(-) diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c index 42f9a0a0c4ca..e163ed980c20 100644 --- a/fs/exofs/dir.c +++ b/fs/exofs/dir.c @@ -72,7 +72,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len) set_page_dirty(page); if (IS_DIRSYNC(dir)) - err = write_one_page(page, 1); + err = write_one_page(page); else unlock_page(page); diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index d9650c9508e4..e2709695b177 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -100,7 +100,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len) } if (IS_DIRSYNC(dir)) { - err = write_one_page(page, 1); + err = write_one_page(page); if (!err) err = sync_inode_metadata(dir, 1); } else { diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index 489aaa1403e5..744fa3c079e6 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -711,7 +711,7 @@ void force_metapage(struct metapage *mp) get_page(page); lock_page(page); set_page_dirty(page); - write_one_page(page, 1); + write_one_page(page); clear_bit(META_forcewrite, &mp->flag); put_page(page); } @@ -756,7 +756,7 @@ void release_metapage(struct metapage * mp) set_page_dirty(page); if (test_bit(META_sync, &mp->flag)) { clear_bit(META_sync, &mp->flag); - write_one_page(page, 1); + write_one_page(page); lock_page(page); /* write_one_page unlocks the page */ } } else if (mp->lsn) /* discard_metapage doesn't remove it */ diff --git a/fs/minix/dir.c b/fs/minix/dir.c index 7edc9b395700..baa9721f1299 100644 --- a/fs/minix/dir.c +++ b/fs/minix/dir.c @@ -57,7 +57,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len) mark_inode_dirty(dir); } if (IS_DIRSYNC(dir)) - err = write_one_page(page, 1); + err = write_one_page(page); else unlock_page(page); return err; diff --git a/fs/sysv/dir.c b/fs/sysv/dir.c index 5bdae85ceef7..f5191cb2c947 100644 --- a/fs/sysv/dir.c +++ b/fs/sysv/dir.c @@ -45,7 +45,7 @@ static int dir_commit_chunk(struct page *page, loff_t pos, unsigned len) mark_inode_dirty(dir); } if (IS_DIRSYNC(dir)) - err = write_one_page(page, 1); + err = write_one_page(page); else unlock_page(page); return err; diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c index de01b8f2aa78..48609f1d9580 100644 --- a/fs/ufs/dir.c +++ b/fs/ufs/dir.c @@ -53,7 +53,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len) mark_inode_dirty(dir); } if (IS_DIRSYNC(dir)) - err = write_one_page(page, 1); + err = write_one_page(page); else unlock_page(page); return err; diff --git a/include/linux/mm.h b/include/linux/mm.h index 00a8fa7e366a..403f78afee20 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2187,7 +2187,7 @@ extern void filemap_map_pages(struct vm_fault *vmf, extern int filemap_page_mkwrite(struct vm_fault *vmf); /* mm/page-writeback.c */ -int write_one_page(struct page *page, int wait); +int __must_check write_one_page(struct 
page *page); void task_dirty_inc(struct task_struct *tsk); /* readahead.c */ diff --git a/mm/page-writeback.c b/mm/page-writeback.c index d8ac2a7fb9e7..de0dbf12e2c1 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2361,15 +2361,16 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc) } /** - * write_one_page - write out a single page and optionally wait on I/O + * write_one_page - write out a single page and wait on I/O * @page: the page to write - * @wait: if true, wait on writeout * * The page must be locked by the caller and will be unlocked upon return. * - * write_one_page() returns a negative error code if I/O failed. + * write_one_pag
[PATCH v3 02/20] mm: fix mapping_set_error call in me_pagecache_dirty
The error code should be negative. Since this ends up in the default case anyway, this is harmless, but it's less confusing to negate it. Also, later patches will require a negative error code here. Signed-off-by: Jeff Layton Reviewed-by: Ross Zwisler Reviewed-by: Jan Kara Reviewed-by: Matthew Wilcox --- mm/memory-failure.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 27f7210e7fab..4b56e53e5378 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -674,7 +674,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) * the first EIO, but we're not worse than other parts * of the kernel. */ - mapping_set_error(mapping, EIO); + mapping_set_error(mapping, -EIO); } return me_pagecache_clean(p, pfn); -- 2.9.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: Introduce 'btrfs inspect-internal dump-csum' option
Example usage: btrfs inspect-internal dump-csum /btrfs/50gbfile /dev/sda4 csum for /btrfs/50gbfile dumped to /btrfs/50gbfile.csumdump Signed-off-by: Lakshmipathi.G --- Makefile | 3 +- cmds-inspect-dump-csum.c | 256 +++ cmds-inspect-dump-csum.h | 24 + cmds-inspect.c | 3 + 4 files changed, 285 insertions(+), 1 deletion(-) create mode 100644 cmds-inspect-dump-csum.c create mode 100644 cmds-inspect-dump-csum.h diff --git a/Makefile b/Makefile index 81598df..6e63e18 100644 --- a/Makefile +++ b/Makefile @@ -102,7 +102,7 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \ cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \ cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o \ - mkfs/common.o + cmds-inspect-dump-csum.o mkfs/common.o libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o \ kernel-lib/crc32c.o messages.o \ uuid-tree.o utils-lib.o rbtree-utils.o @@ -191,6 +191,7 @@ btrfs_convert_cflags = -DBTRFSCONVERT_EXT2=$(BTRFSCONVERT_EXT2) btrfs_fragments_libs = -lgd -lpng -ljpeg -lfreetype btrfs_debug_tree_objects = cmds-inspect-dump-tree.o btrfs_show_super_objects = cmds-inspect-dump-super.o +btrfs_dump_csum_objects = cmds-inspect-dump-csum.o btrfs_calc_size_objects = cmds-inspect-tree-stats.o # collect values of the variables above diff --git a/cmds-inspect-dump-csum.c b/cmds-inspect-dump-csum.c new file mode 100644 index 000..6065eb4 --- /dev/null +++ b/cmds-inspect-dump-csum.c @@ -0,0 +1,256 @@ +/* + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. 
+ */ + +#include "kerncompat.h" +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ctree.h" +#include "disk-io.h" +#include "print-tree.h" +#include "transaction.h" +#include "list.h" +#include "utils.h" +#include "commands.h" +#include "crc32c.h" +#include "cmds-inspect-dump-csum.h" +#include "help.h" +#include "volumes.h" + + +const char * const cmd_inspect_dump_csum_usage[] = { + "btrfs inspect-internal dump-csum ", + "Get csums for the given file.", + NULL +}; + +int btrfs_lookup_csums(struct btrfs_trans_handle *trans, struct btrfs_root *root, + struct btrfs_path *path, u64 bytenr, int cow, int total_csums, FILE *fp) +{ + int ret; + int i; + int start_pos = 0; + struct btrfs_key file_key; + struct btrfs_key found_key; + struct btrfs_csum_item *item; + struct extent_buffer *leaf; + u64 csum_offset = 0; + u16 csum_size = + btrfs_super_csum_size(root->fs_info->super_copy); + int csums_in_item = 0; + unsigned int tree_csum = 0; + int pending_csums = total_csums; + static int cnt=1; + + file_key.objectid = BTRFS_EXTENT_CSUM_OBJECTID; + file_key.offset = bytenr; + file_key.type = BTRFS_EXTENT_CSUM_KEY; + ret = btrfs_search_slot(trans, root, &file_key, path, 0, cow); + if (ret < 0) + goto fail; + while(1){ + leaf = path->nodes[0]; + if (ret > 0) { + ret = 1; + if (path->slots[0] == 0) + goto fail; + path->slots[0]--; + btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]); + if (found_key.type != BTRFS_EXTENT_CSUM_KEY){ + fprintf(stderr, "\nInvalid key found."); + goto fail; + } + + csum_offset = ((bytenr - found_key.offset) / root->sectorsize) * csum_size; + csums_in_item = btrfs_item_size_nr(leaf, path->slots[0]); + csums_in_item /= csum_size; + csums_in_item -= ( bytenr - found_key.offset ) / root->sectorsize; + start_pos=csum_offset; + } + if (path->slots[0] >= btrfs_header_nritems(leaf)) { + if (pending_csums > 0){ + ret = btrfs_next_leaf(root, path); +
[RFC xfstests PATCH] xfstests: add a writeback error handling test
This is just an RFC set for now. I've numbered it 999 for the moment so as not to collide with tests being added. I'm working on a set of kernel patches to change how writeback errors are handled and reported in the kernel. Instead of reporting a writeback error to only the first fsync caller on the file, I aim to make the kernel report them once on every file description: https://lkml.org/lkml/2017/4/24/438 This patch adds a test for the new behavior. Basically, open many fds to the same file, turn on dm_error, write to each of the fds, and then fsync them all to ensure that they all get an error back. With the patch series above, ext4 now passes. xfs and btrfs end up in r/o mode after the test. xfs returns -EIO at that point though, and btrfs returns -EROFS. What behavior we actually want there, I'm not certain. We might be able to mitigate that by putting the journals on a separate device? Signed-off-by: Jeff Layton --- common/dmerror| 13 --- src/Makefile | 2 +- src/fsync-err.c | 102 ++ tests/generic/999 | 74 tests/generic/999.out | 3 ++ tests/generic/group | 1 + tools/dmerror | 47 +++ 7 files changed, 236 insertions(+), 6 deletions(-) create mode 100644 src/fsync-err.c create mode 100755 tests/generic/999 create mode 100644 tests/generic/999.out create mode 100755 tools/dmerror diff --git a/common/dmerror b/common/dmerror index d46c5d0b7266..238baa213b1f 100644 --- a/common/dmerror +++ b/common/dmerror @@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then _notrun "Cannot run tests with DAX on dmerror devices" fi -_dmerror_init() +_dmerror_setup() { local dm_backing_dev=$SCRATCH_DEV - $DMSETUP_PROG remove error-test > /dev/null 2>&1 - local blk_dev_size=`blockdev --getsz $dm_backing_dev` DMERROR_DEV='/dev/mapper/error-test' DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0" + DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0" +} + +_dmerror_init() +{ + _dmerror_setup + $DMSETUP_PROG remove error-test > /dev/null 2>&1 $DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \ _fatal "failed to create dm linear device" - - DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0" } _dmerror_mount() diff --git a/src/Makefile b/src/Makefile index e62d7a9774d7..056a75b9f7bb 100644 --- a/src/Makefile +++ b/src/Makefile @@ -12,7 +12,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \ godown resvtest writemod makeextents itrash rename \ multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \ t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \ - holetest t_truncate_self t_mmap_dio af_unix + holetest t_truncate_self t_mmap_dio af_unix fsync-err LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \ preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \ diff --git a/src/fsync-err.c b/src/fsync-err.c new file mode 100644 index ..8ebfd145bd70 --- /dev/null +++ b/src/fsync-err.c @@ -0,0 +1,102 @@ +/* + * fsync-err.c: test whether writeback errors are reported to all open fds + * Copyright (c) 2017: Jeff Layton + * + * Open a file several times, write to it and then fsync. Flip dm_error over + * to make the backing device stop working. Overwrite the same section and + * call fsync on all fds and verify that we get errors on all of them. Then, + * fsync one more time on all of them and verify that they return 0. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include + +#define NUM_FDS 10 + +static void usage() { + fprintf(stderr, "Usage: fsync-err \n"); +} + +int main(int argc, char **argv) +{ + int fd[NUM_FDS], ret, i; + char *fname, *buf; + + if (argc < 2) { + usage(); + return 1; + } + + /* First argument is filename */ + fname = argv[1]; + + for (i = 0; i < NUM_FDS; ++i) { + fd[i] = open(fname, O_WRONLY | O_CREAT | O_TRUNC, 0644); + if (fd[i] < 0) { + printf("open of fd[%d] failed: %m\n", i); + return 1; + } + } + + buf = "foobar"; + for (i = 0; i < NUM_FDS; ++i) { + ret = write(fd[i], buf, strlen(buf) + 1); + if (ret < 0) { + printf("First write on fd[%d] failed: %m\n", i); + return 1; + } + } + + for (i = 0; i < NUM_FDS; ++i) { + ret = fsync(fd[i]); + if (ret < 0) { + printf("First fsync on fd[%d] failed: %m\n", i); + return 1; + } + } + + /* flip the device to non-working mode */ +
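The rest of fsync-err.c is cut off above. Going only by the comment block at the top of the file, the remaining part of main() flips dm-error, overwrites the same range, expects an error from fsync on every fd, and then expects a follow-up fsync to succeed. A rough sketch of those remaining steps is below; how the test actually switches the device (the system() call into the new tools/dmerror helper and its "load_error_table" argument) is an assumption, not taken from the posted patch.

        /* flip the device to non-working mode (assumed helper invocation) */
        if (system("./tools/dmerror load_error_table") != 0) {
                printf("unable to switch dm-error table: %m\n");
                return 1;
        }

        /* overwrite the same range so there is dirty data to flush */
        for (i = 0; i < NUM_FDS; ++i) {
                if (pwrite(fd[i], buf, strlen(buf) + 1, 0) < 0) {
                        printf("Second write on fd[%d] failed: %m\n", i);
                        return 1;
                }
        }

        /* every open fd should now see the writeback error */
        for (i = 0; i < NUM_FDS; ++i) {
                if (fsync(fd[i]) == 0) {
                        printf("Second fsync on fd[%d] unexpectedly succeeded\n", i);
                        return 1;
                }
        }

        /* ...and a further fsync should return 0 again */
        for (i = 0; i < NUM_FDS; ++i) {
                if (fsync(fd[i]) != 0) {
                        printf("Third fsync on fd[%d] failed: %m\n", i);
                        return 1;
                }
        }

        printf("Test passed!\n");
        return 0;
}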
Re: [PATCH v3 20/20] gfs2: clean up some filemap_* calls
- Original Message - | In some places, it's trying to reset the mapping error after calling | filemap_fdatawait. That's no longer required. Also, turn several | filemap_fdatawrite+filemap_fdatawait calls into filemap_write_and_wait. | That will at least return writeback errors that occur during the write | phase. | | Signed-off-by: Jeff Layton | --- | fs/gfs2/glops.c | 12 | fs/gfs2/lops.c | 4 +--- | fs/gfs2/super.c | 6 ++ | 3 files changed, 7 insertions(+), 15 deletions(-) | | diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c | index 5db59d444838..7362d19fdc4c 100644 | --- a/fs/gfs2/glops.c | +++ b/fs/gfs2/glops.c | @@ -158,9 +158,7 @@ static void rgrp_go_sync(struct gfs2_glock *gl) | GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE); | | gfs2_log_flush(sdp, gl, NORMAL_FLUSH); | - filemap_fdatawrite_range(mapping, gl->gl_vm.start, gl->gl_vm.end); | - error = filemap_fdatawait_range(mapping, gl->gl_vm.start, gl->gl_vm.end); | - mapping_set_error(mapping, error); | + filemap_write_and_wait_range(mapping, gl->gl_vm.start, gl->gl_vm.end); This should probably have "error = ", no? | gfs2_ail_empty_gl(gl); | | spin_lock(&gl->gl_lockref.lock); | @@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl) | filemap_fdatawrite(metamapping); | if (ip) { | struct address_space *mapping = ip->i_inode.i_mapping; | - filemap_fdatawrite(mapping); | - error = filemap_fdatawait(mapping); | - mapping_set_error(mapping, error); | + filemap_write_and_wait(mapping); | + } else { | + filemap_fdatawait(metamapping); | } | - error = filemap_fdatawait(metamapping); | - mapping_set_error(metamapping, error); This part doesn't look right at all. There's a big difference in gfs2 between mapping and metamapping. We need to wait for metamapping regardless. (snip) Regards, Bob Peterson Red Hat File Systems -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC xfstests PATCH] xfstests: add a writeback error handling test
On Mon, Apr 24, 2017 at 09:45:51AM -0400, Jeff Layton wrote: > With the patch series above, ext4 now passes. xfs and btrfs end up in > r/o mode after the test. xfs returns -EIO at that point though, and > btrfs returns -EROFS. What behavior we actually want there, I'm not > certain. We might be able to mitigate that by putting the journals on a > separate device? This looks like XFS shut down because of a permanent write error from dm-error. Which seems like the expected behavior. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 04/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync
> out: > inode_unlock(inode); > - return ret; > + err = filemap_check_errors(inode->i_mapping); > + return ret ? : err; Can you spell out the whole unary operation instead of this weird GCC extension? Otherwise looks fine: Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
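For reference, "ret ? : err" is the GCC extension being objected to: a conditional expression with the middle operand omitted, which evaluates to ret when ret is non-zero and to err otherwise. The spelled-out form being requested is simply:

        /* equivalent to "return ret ? : err;" without the GCC extension */
        return ret ? ret : err;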
Re: [PATCH v3 01/20] mm: drop "wait" parameter from write_one_page
Looks fine, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 03/20] buffer: use mapping_set_error instead of setting the flag
Looks fine, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 02/20] mm: fix mapping_set_error call in me_pagecache_dirty
Looks fine, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 06/20] dax: set errors in mapping when writeback fails
Looks fine, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 05/20] orangefs: don't call filemap_write_and_wait from fsync
Looks fine, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 07/20] nilfs2: set the mapping error when calling SetPageError on writeback
Looks fine, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 08/20] mm: ensure that we set mapping error if writeout() fails
Looks fine, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 09/20] 9p: set mapping error when writeback fails in launder_page
On Mon, Apr 24, 2017 at 09:22:48AM -0400, Jeff Layton wrote: > launder_page is just writeback under the page lock. We still need to > mark the mapping for errors there when they occur. > > Signed-off-by: Jeff Layton Looks fine, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails
Looks fine, Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 11/20] cifs: set mapping error when page writeback fails in writepage or launder_pages
On Mon, Apr 24, 2017 at 09:22:50AM -0400, Jeff Layton wrote: > Signed-off-by: Jeff Layton > --- > fs/cifs/file.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) > > diff --git a/fs/cifs/file.c b/fs/cifs/file.c > index 21d404535739..4b696a23b0b1 100644 > --- a/fs/cifs/file.c > +++ b/fs/cifs/file.c > @@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct > writeback_control *wbc) > set_page_writeback(page); > retry_write: > rc = cifs_partialpagewrite(page, 0, PAGE_SIZE); > + if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL) { > goto retry_write; > + } else if (rc == -EAGAIN) { > redirty_page_for_writepage(wbc, page); > + } else if (rc != 0) { > SetPageError(page); > + mapping_set_error(page->mapping, rc); > + } else { > SetPageUptodate(page); > + } Hmmm. I might be a little too nitpicky, but I hate having the same partial condition duplicated if possible. Why not: if (rc == -EAGAIN) { if (wbc->sync_mode == WB_SYNC_ALL) goto retry_write; redirty_page_for_writepage(wbc, page); } else if (rc) { SetPageError(page); mapping_set_error(page->mapping, rc); } else { SetPageUptodate(page); } Otherwise this looks fine to me: Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Problem with file system
I have a btrfs file system with a few thousand snapshots. When I attempted to delete 20 or so of them the problems started. The disks are being read but except for the first few minutes there are no writes. Memory usage keeps growing until all the memory (24 Gb) is used in a few hours. Eventually the system will crash with out of memory errors. The CPU load is low (<5%) but iowait is around 30 to 50% The drives are mounted but any process that attempts to access them will just hang so I cannot access any data on the drives. Smartctl does not show any issues with the drives. The problem restarts after a reboot once you mount the drives. I tried to zero the log hoping it wouldn't restart after a reboot but that didn't work I am assuming that the attempt to remove the snapshots caused this problem. How do I interrupt the process so I can access the filesystem again? # uname -a Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux # btrfs --version btrfs-progs v4.9.1 # btrfs fi show Label: none uuid: 79ba7374-bf77-4868-bb64-656ff5736c44 Total devices 6 FS bytes used 5.65TiB devid1 size 1.82TiB used 1.29TiB path /dev/sdb devid2 size 1.82TiB used 1.29TiB path /dev/sdc devid3 size 1.82TiB used 1.29TiB path /dev/sdd devid4 size 1.82TiB used 1.29TiB path /dev/sde devid5 size 3.64TiB used 3.11TiB path /dev/sdf devid6 size 3.64TiB used 3.11TiB path /dev/sdg # btrfs fi df /pubroot Data, RAID1: total=5.58TiB, used=5.58TiB System, RAID1: total=32.00MiB, used=828.00KiB System, single: total=4.00MiB, used=0.00B Metadata, RAID1: total=104.00GiB, used=70.64GiB GlobalReserve, single: total=512.00MiB, used=28.51MiB -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 04/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync
On Mon 24-04-17 09:22:43, Jeff Layton wrote: > ext2 currently does a test+clear of the AS_EIO flag, which is > is problematic for some coming changes. > > What we really need to do instead is call filemap_check_errors > in __generic_file_fsync after syncing out the buffers. That > will be sufficient for this case, and help other callers detect > these errors properly as well. > > With that, we don't need to twiddle it in ext2. > > Suggested-by: Jan Kara > Signed-off-by: Jeff Layton Looks good to me. You can add: Reviewed-by: Jan Kara Honza > --- > fs/ext2/file.c | 2 +- > fs/libfs.c | 3 ++- > 2 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/fs/ext2/file.c b/fs/ext2/file.c > index b21891a6bfca..ed00e7ae0ef3 100644 > --- a/fs/ext2/file.c > +++ b/fs/ext2/file.c > @@ -177,7 +177,7 @@ int ext2_fsync(struct file *file, loff_t start, loff_t > end, int datasync) > struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping; > > ret = generic_file_fsync(file, start, end, datasync); > - if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) { > + if (ret == -EIO) { > /* We don't really know where the IO error happened... */ > ext2_error(sb, __func__, > "detected IO error when writing metadata buffers"); > diff --git a/fs/libfs.c b/fs/libfs.c > index a8b62e5d43a9..12a48ee442d3 100644 > --- a/fs/libfs.c > +++ b/fs/libfs.c > @@ -991,7 +991,8 @@ int __generic_file_fsync(struct file *file, loff_t start, > loff_t end, > > out: > inode_unlock(inode); > - return ret; > + err = filemap_check_errors(inode->i_mapping); > + return ret ? : err; > } > EXPORT_SYMBOL(__generic_file_fsync); > > -- > 2.9.3 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 06/20] dax: set errors in mapping when writeback fails
On Mon 24-04-17 09:22:45, Jeff Layton wrote: > In order to get proper error codes from fsync, we must set an error in > the mapping range when writeback fails. > > Signed-off-by: Jeff Layton So I'm fine with the change but please expand the changelog to something like: DAX currently doesn't set errors in the mapping when cache flushing fails in dax_writeback_mapping_range(). Since this function can get called only from fsync(2) or sync(2), this is actually as good as it can currently get since we correctly propagate the error up from dax_writeback_mapping_range() to filemap_fdatawrite(). However in the future better writeback error handling will enable us to properly report these errors on fsync(2) even if there are multiple file descriptors open against the file or if sync(2) gets called before fsync(2). So convert DAX to using standard error reporting through the mapping. After improving the changelog you can add: Reviewed-by: Jan Kara Honza > --- > fs/dax.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 85abd741253d..9b6b04030c3f 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -901,8 +901,10 @@ int dax_writeback_mapping_range(struct address_space > *mapping, > > ret = dax_writeback_one(bdev, mapping, indices[i], > pvec.pages[i]); > - if (ret < 0) > + if (ret < 0) { > + mapping_set_error(mapping, ret); > return ret; > + } > } > } > return 0; > -- > 2.9.3 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 08/20] mm: ensure that we set mapping error if writeout() fails
On Mon 24-04-17 09:22:47, Jeff Layton wrote: > If writepage fails during a page migration, then we need to ensure that > fsync will see it by flagging the mapping. > > Signed-off-by: Jeff Layton Looks good to me. You can add: Reviewed-by: Jan Kara Honza > --- > mm/migrate.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/mm/migrate.c b/mm/migrate.c > index 738f1d5f8350..3a59830bdae2 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -792,7 +792,11 @@ static int writeout(struct address_space *mapping, > struct page *page) > /* unlocked. Relock */ > lock_page(page); > > - return (rc < 0) ? -EIO : -EAGAIN; > + if (rc < 0) { > + mapping_set_error(mapping, rc); > + return -EIO; > + } > + return -EAGAIN; > } > > /* > -- > 2.9.3 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 09/20] 9p: set mapping error when writeback fails in launder_page
On Mon 24-04-17 09:22:48, Jeff Layton wrote: > launder_page is just writeback under the page lock. We still need to > mark the mapping for errors there when they occur. > > Signed-off-by: Jeff Layton Looks good. You can add: Reviewed-by: Jan Kara Honza > --- > fs/9p/vfs_addr.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c > index adaf6f6dd858..7af6e6501698 100644 > --- a/fs/9p/vfs_addr.c > +++ b/fs/9p/vfs_addr.c > @@ -223,8 +223,11 @@ static int v9fs_launder_page(struct page *page) > v9fs_fscache_wait_on_page_write(inode, page); > if (clear_page_dirty_for_io(page)) { > retval = v9fs_vfs_writepage_locked(page); > - if (retval) > + if (retval) { > + if (retval != -EAGAIN) > + mapping_set_error(page->mapping, retval); > return retval; > + } > } > return 0; > } > -- > 2.9.3 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails
On Mon 24-04-17 09:22:49, Jeff Layton wrote: > This ensures that we see errors on fsync when writeback fails. > > Signed-off-by: Jeff Layton Hum, but do we really want to clobber mapping errors with temporary stuff like ENOMEM? Or do you want to handle that in mapping_set_error? Honza > --- > fs/fuse/file.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > index ec238fb5a584..07d0efcb050c 100644 > --- a/fs/fuse/file.c > +++ b/fs/fuse/file.c > @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page) > err_free: > fuse_request_free(req); > err: > + mapping_set_error(page->mapping, error); > end_page_writeback(page); > return error; > } > -- > 2.9.3 > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 20/20] gfs2: clean up some filemap_* calls
On Mon, 2017-04-24 at 10:12 -0400, Bob Peterson wrote: > - Original Message - > > In some places, it's trying to reset the mapping error after calling > > filemap_fdatawait. That's no longer required. Also, turn several > > filemap_fdatawrite+filemap_fdatawait calls into filemap_write_and_wait. > > That will at least return writeback errors that occur during the write > > phase. > > > > Signed-off-by: Jeff Layton > > --- > > fs/gfs2/glops.c | 12 > > fs/gfs2/lops.c | 4 +--- > > fs/gfs2/super.c | 6 ++ > > 3 files changed, 7 insertions(+), 15 deletions(-) > > > > diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c > > index 5db59d444838..7362d19fdc4c 100644 > > --- a/fs/gfs2/glops.c > > +++ b/fs/gfs2/glops.c > > @@ -158,9 +158,7 @@ static void rgrp_go_sync(struct gfs2_glock *gl) > > GLOCK_BUG_ON(gl, gl->gl_state != LM_ST_EXCLUSIVE); > > > > gfs2_log_flush(sdp, gl, NORMAL_FLUSH); > > - filemap_fdatawrite_range(mapping, gl->gl_vm.start, gl->gl_vm.end); > > - error = filemap_fdatawait_range(mapping, gl->gl_vm.start, > > gl->gl_vm.end); > > - mapping_set_error(mapping, error); > > + filemap_write_and_wait_range(mapping, gl->gl_vm.start, gl->gl_vm.end); > > This should probably have "error = ", no? > This error is discarded in the current code after resetting the error in the mapping. With the earlier patches in this set we don't need to reset the error like this anymore. Now, if this code should doing something else with those errors, then that's a separate problem. > > gfs2_ail_empty_gl(gl); > > > > spin_lock(&gl->gl_lockref.lock); > > @@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl) > > filemap_fdatawrite(metamapping); > > if (ip) { > > struct address_space *mapping = ip->i_inode.i_mapping; > > - filemap_fdatawrite(mapping); > > - error = filemap_fdatawait(mapping); > > - mapping_set_error(mapping, error); > > + filemap_write_and_wait(mapping); > > + } else { > > + filemap_fdatawait(metamapping); > > } > > - error = filemap_fdatawait(metamapping); > > - mapping_set_error(metamapping, error); > > This part doesn't look right at all. There's a big difference in gfs2 between > mapping and metamapping. We need to wait for metamapping regardless. > ...and this should wait. Basically, filemap_write_and_wait does filemap_fdatawrite and then filemap_fdatawait. This is mostly just replacing the existing code with a more concise helper. -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with file system
On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel wrote: > I have a btrfs file system with a few thousand snapshots. When I > attempted to delete 20 or so of them the problems started. > > The disks are being read but except for the first few minutes there > are no writes. > > Memory usage keeps growing until all the memory (24 Gb) is used in a > few hours. Eventually the system will crash with out of memory errors. Boot with these boot parameters log_buf_len=1M I find it easier to remotely login with another computer to capture problems in case of a crash and I can't save things locally. So on the remote computer use 'journalctl -kf -o short-monotonic' Either on the 1st computer, or from an additional ssh connection from the 2nd: echo 1 >/proc/sys/kernel/sysrq btrfs fi show #you need the UUID for the volume you're going to mount, best to have it in advance mount the file system normally, and once it's starting to have the problem (I guess it happens pretty quickly?) echo t > /proc/sysrq-trigger grep . -IR /sys/fs/btrfs/UUID/allocation/ Paste in the UUID from fi show. If the computer is hanging due to running out of memory, each of these commands can take a while to complete. So it's best to have them all ready to go before you mount the file system, and the problem starts happening. Best if you can issue the commands more than once as the problem gets worse, if you can keep them all organized and labeled. Then attach them (rather than pasting them into the message). > I tried to zero the log hoping it wouldn't restart after a reboot but > that didn't work Yeah don't just start randomly hitting the fs with a hammer like zeroing the log tree. That's for a specific problem and this isn't it. > I am assuming that the attempt to remove the snapshots caused this > problem. How do I interrupt the process so I can access the > filesystem again? Snapshot creation is essentially free. Snapshot removal is expensive. There's no way to answer your questions because your email doesn't even include a call trace. So a developer will need at least the call trace, but there might be some other useful information in a sysrq + t, as well as the allocation states. > # btrfs fi df /pubroot > Data, RAID1: total=5.58TiB, used=5.58TiB > System, RAID1: total=32.00MiB, used=828.00KiB > System, single: total=4.00MiB, used=0.00B > Metadata, RAID1: total=104.00GiB, used=70.64GiB > GlobalReserve, single: total=512.00MiB, used=28.51MiB Later, after this problem is solved, you'll want to get rid of that single system chunk that isn't being used, but might cause a problem in a device failure. sudo btrfs balance start -mconvert=raid1,soft -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 10/20] fuse: set mapping error in writepage_locked when it fails
On Mon, 2017-04-24 at 18:04 +0200, Jan Kara wrote: > On Mon 24-04-17 09:22:49, Jeff Layton wrote: > > This ensures that we see errors on fsync when writeback fails. > > > > Signed-off-by: Jeff Layton > > Hum, but do we really want to clobber mapping errors with temporary stuff > like ENOMEM? Or do you want to handle that in mapping_set_error? > Right now we don't really have such a thing as temporary errors in the writeback codepath. If you return an error here, the data doesn't stay dirty or anything, and I think we want to ensure that that gets reported via fsync. I'd like to see us add better handling for retryable errors for stuff like ENOMEM or EAGAIN. I think this is the first step toward that though. Once we have more consistent handling of writeback errors in general, then we can start doing more interesting things with retryable errors. So yeah, I this is the right thing to do for now. > > > --- > > fs/fuse/file.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c > > index ec238fb5a584..07d0efcb050c 100644 > > --- a/fs/fuse/file.c > > +++ b/fs/fuse/file.c > > @@ -1669,6 +1669,7 @@ static int fuse_writepage_locked(struct page *page) > > err_free: > > fuse_request_free(req); > > err: > > + mapping_set_error(page->mapping, error); > > end_page_writeback(page); > > return error; > > } > > -- > > 2.9.3 > > > > -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
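For concreteness, the "handle that in mapping_set_error" option raised earlier in the thread might look something like the sketch below. This is purely illustrative and not part of the posted series: the body mirrors the existing helper in include/linux/pagemap.h, and the choice of -EAGAIN/-ENOMEM as the "transient" errors to skip is an assumption.

        /*
         * Hypothetical variant of mapping_set_error() that skips transient
         * errors so they do not clobber the mapping's sticky error bits.
         * Which errors count as transient is exactly the open question in
         * this thread; -EAGAIN/-ENOMEM here are only an example.
         */
        static inline void mapping_set_error(struct address_space *mapping, int error)
        {
                if (unlikely(error && error != -EAGAIN && error != -ENOMEM)) {
                        if (error == -ENOSPC)
                                set_bit(AS_ENOSPC, &mapping->flags);
                        else
                                set_bit(AS_EIO, &mapping->flags);
                }
        }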
Re: [PATCH v3 11/20] cifs: set mapping error when page writeback fails in writepage or launder_pages
On Mon, 2017-04-24 at 08:27 -0700, Christoph Hellwig wrote: > On Mon, Apr 24, 2017 at 09:22:50AM -0400, Jeff Layton wrote: > > Signed-off-by: Jeff Layton > > --- > > fs/cifs/file.c | 10 ++ > > 1 file changed, 6 insertions(+), 4 deletions(-) > > > > diff --git a/fs/cifs/file.c b/fs/cifs/file.c > > index 21d404535739..4b696a23b0b1 100644 > > --- a/fs/cifs/file.c > > +++ b/fs/cifs/file.c > > @@ -2234,14 +2234,16 @@ cifs_writepage_locked(struct page *page, struct > > writeback_control *wbc) > > set_page_writeback(page); > > retry_write: > > rc = cifs_partialpagewrite(page, 0, PAGE_SIZE); > > + if (rc == -EAGAIN && wbc->sync_mode == WB_SYNC_ALL) { > > goto retry_write; > > + } else if (rc == -EAGAIN) { > > redirty_page_for_writepage(wbc, page); > > + } else if (rc != 0) { > > SetPageError(page); > > + mapping_set_error(page->mapping, rc); > > + } else { > > SetPageUptodate(page); > > + } > > Hmmm. I might be a little too nitpicky, but I hate having the same > partial condition duplicated if possible. Why not: > > if (rc == -EAGAIN) { > if (wbc->sync_mode == WB_SYNC_ALL) > goto retry_write; > redirty_page_for_writepage(wbc, page); > } else if (rc) { > SetPageError(page); > mapping_set_error(page->mapping, rc); > } else { > SetPageUptodate(page); > } > > Otherwise this looks fine to me: > > Reviewed-by: Christoph Hellwig No, you're absolutely right. I merged that change into the patch. Thanks for the review so far! -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 20/20] gfs2: clean up some filemap_* calls
Hi, - Original Message - | On Mon, 2017-04-24 at 10:12 -0400, Bob Peterson wrote: | > > + filemap_write_and_wait_range(mapping, gl->gl_vm.start, gl->gl_vm.end); | > | > This should probably have "error = ", no? | > | | This error is discarded in the current code after resetting the error in | the mapping. With the earlier patches in this set we don't need to reset | the error like this anymore. | | Now, if this code should doing something else with those errors, then | that's a separate problem. Okay, I see. My bad. | > > gfs2_ail_empty_gl(gl); | > > | > > spin_lock(&gl->gl_lockref.lock); | > > @@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl) | > > filemap_fdatawrite(metamapping); | > > if (ip) { | > > struct address_space *mapping = ip->i_inode.i_mapping; | > > - filemap_fdatawrite(mapping); | > > - error = filemap_fdatawait(mapping); | > > - mapping_set_error(mapping, error); | > > + filemap_write_and_wait(mapping); | > > + } else { | > > + filemap_fdatawait(metamapping); | > > } | > > - error = filemap_fdatawait(metamapping); | > > - mapping_set_error(metamapping, error); | > | > This part doesn't look right at all. There's a big difference in gfs2 | > between | > mapping and metamapping. We need to wait for metamapping regardless. | > | | ...and this should wait. Basically, filemap_write_and_wait does | filemap_fdatawrite and then filemap_fdatawait. This is mostly just | replacing the existing code with a more concise helper. But this isn't a simple replacement with a helper. This is two different address spaces (mapping and metamapping) and you added an else in there. So with this patch metamapping gets written, and if there's an ip, mapping gets written but it doesn't wait for metamapping. Unless I'm missing something. You could replace both filemap_fdatawrites with the helper instead. Today's code is structured as: (a) write metamapping if (ip) (b) write mapping (c) wait for mapping (d) wait for metamapping If you use the helper for both, it becomes, (a & d)(b & c) which is probably acceptable. (I think we just tried to optimize what the elevator was doing). But the way you've got it coded here still looks wrong. It looks like: (a) if (ip) (b & c) ELSE - (d) So (d) (metamapping) isn't guaranteed to be synced at the end of the function. Of course, you know the modified helper functions better than I do. What am I missing? Regards, Bob Peterson Red Hat File Systems -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 20/20] gfs2: clean up some filemap_* calls
On Mon, 2017-04-24 at 13:41 -0400, Bob Peterson wrote: > Hi, > > - Original Message - > > On Mon, 2017-04-24 at 10:12 -0400, Bob Peterson wrote: > > > > + filemap_write_and_wait_range(mapping, gl->gl_vm.start, > > > > gl->gl_vm.end); > > > > > > This should probably have "error = ", no? > > > > > > > This error is discarded in the current code after resetting the error in > > the mapping. With the earlier patches in this set we don't need to reset > > the error like this anymore. > > > > Now, if this code should doing something else with those errors, then > > that's a separate problem. > > Okay, I see. My bad. > > > > > gfs2_ail_empty_gl(gl); > > > > > > > > spin_lock(&gl->gl_lockref.lock); > > > > @@ -225,12 +223,10 @@ static void inode_go_sync(struct gfs2_glock *gl) > > > > filemap_fdatawrite(metamapping); > > > > if (ip) { > > > > struct address_space *mapping = ip->i_inode.i_mapping; > > > > - filemap_fdatawrite(mapping); > > > > - error = filemap_fdatawait(mapping); > > > > - mapping_set_error(mapping, error); > > > > + filemap_write_and_wait(mapping); > > > > + } else { > > > > + filemap_fdatawait(metamapping); > > > > } > > > > - error = filemap_fdatawait(metamapping); > > > > - mapping_set_error(metamapping, error); > > > > > > This part doesn't look right at all. There's a big difference in gfs2 > > > between > > > mapping and metamapping. We need to wait for metamapping regardless. > > > > > > > ...and this should wait. Basically, filemap_write_and_wait does > > filemap_fdatawrite and then filemap_fdatawait. This is mostly just > > replacing the existing code with a more concise helper. > > But this isn't a simple replacement with a helper. This is two different > address spaces (mapping and metamapping) and you added an else in there. > > So with this patch metamapping gets written, and if there's an ip, > mapping gets written but it doesn't wait for metamapping. Unless > I'm missing something. > > You could replace both filemap_fdatawrites with the helper instead. > Today's code is structured as: > > (a) write metamapping > if (ip) > (b) write mapping > (c) wait for mapping > (d) wait for metamapping > > If you use the helper for both, it becomes, (a & d)(b & c) which is probably > acceptable. (I think we just tried to optimize what the elevator was doing). > > But the way you've got it coded here still looks wrong. It looks like: > (a) > if (ip) >(b & c) > ELSE - >(d) > > So (d) (metamapping) isn't guaranteed to be synced at the end of the function. > Of course, you know the modified helper functions better than I do. > What am I missing? > > You're right of course. I'll fix that up in my tree. Thanks! -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
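For reference, keeping the (a)(b)(c)(d) ordering described above while still using the helper would look roughly like the sketch below (illustrative only, not the eventual patch): start writeback on metamapping, write and wait the inode mapping if there is one, then always wait on metamapping.

        /* (a) start metamapping writeback */
        filemap_fdatawrite(metamapping);
        if (ip) {
                struct address_space *mapping = ip->i_inode.i_mapping;

                /* (b) + (c) write and wait the inode mapping */
                filemap_write_and_wait(mapping);
        }
        /* (d) always wait for metamapping */
        filemap_fdatawait(metamapping);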
Re: [PATCH v3 05/20] orangefs: don't call filemap_write_and_wait from fsync
I've been running it here... Acked-by: Mike Marshall On Mon, Apr 24, 2017 at 11:23 AM, Christoph Hellwig wrote: > Looks fine, > > Reviewed-by: Christoph Hellwig -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 06/20] dax: set errors in mapping when writeback fails
On Mon, Apr 24, 2017 at 09:22:45AM -0400, Jeff Layton wrote: > In order to get proper error codes from fsync, we must set an error in > the mapping range when writeback fails. > > Signed-off-by: Jeff Layton Works fine in some error injection testing. Tested-by: Ross Zwisler > --- > fs/dax.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 85abd741253d..9b6b04030c3f 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -901,8 +901,10 @@ int dax_writeback_mapping_range(struct address_space > *mapping, > > ret = dax_writeback_one(bdev, mapping, indices[i], > pvec.pages[i]); > - if (ret < 0) > + if (ret < 0) { > + mapping_set_error(mapping, ret); > return ret; > + } > } > } > return 0; > -- > 2.9.3 > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/8] nowait aio: return on congested block device
On 04/19/2017 01:45 AM, Christoph Hellwig wrote: > On Fri, Apr 14, 2017 at 07:02:54AM -0500, Goldwyn Rodrigues wrote: >> From: Goldwyn Rodrigues >> > >> +/* Request queue supports BIO_NOWAIT */ >> +queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q); > > BIO_NOWAIT is gone. And the comment would not be needed if the > flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT. > > And I think all request based drivers should set the flag implicitly > as ->queuecommand can't sleep, and ->queue_rq only when it's always > offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set. > We introduced QUEUE_FLAG_NOWAIT for devices which would not wait for request completions. The ones which wait are MD devices because of sync or suspend operations. The only user of BLK_MQ_F_NONBLOCKING seems to be nbd. As you mentioned, it uses the flag to offload it to a workqueue. The other way to do it implicitly is to change the flag to BLK_MAY_BLOCK_REQS and use it for devices which do wait such as md/dm. Is that what you are hinting at? Or do you have something else in mind? -- Goldwyn -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with file system
At 04/24/2017 11:27 PM, Fred Van Andel wrote: I have a btrfs file system with a few thousand snapshots. When I attempted to delete 20 or so of them the problems started. The disks are being read but except for the first few minutes there are no writes. Memory usage keeps growing until all the memory (24 Gb) is used in a few hours. Eventually the system will crash with out of memory errors. Are you using qgroup/quota? IIRC qgroup for subvolume deletion will cause full subtree rescan which can cause tons of memory. Thanks, Qu The CPU load is low (<5%) but iowait is around 30 to 50% The drives are mounted but any process that attempts to access them will just hang so I cannot access any data on the drives. Smartctl does not show any issues with the drives. The problem restarts after a reboot once you mount the drives. I tried to zero the log hoping it wouldn't restart after a reboot but that didn't work I am assuming that the attempt to remove the snapshots caused this problem. How do I interrupt the process so I can access the filesystem again? # uname -a Linux Backup 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux # btrfs --version btrfs-progs v4.9.1 # btrfs fi show Label: none uuid: 79ba7374-bf77-4868-bb64-656ff5736c44 Total devices 6 FS bytes used 5.65TiB devid1 size 1.82TiB used 1.29TiB path /dev/sdb devid2 size 1.82TiB used 1.29TiB path /dev/sdc devid3 size 1.82TiB used 1.29TiB path /dev/sdd devid4 size 1.82TiB used 1.29TiB path /dev/sde devid5 size 3.64TiB used 3.11TiB path /dev/sdf devid6 size 3.64TiB used 3.11TiB path /dev/sdg # btrfs fi df /pubroot Data, RAID1: total=5.58TiB, used=5.58TiB System, RAID1: total=32.00MiB, used=828.00KiB System, single: total=4.00MiB, used=0.00B Metadata, RAID1: total=104.00GiB, used=70.64GiB GlobalReserve, single: total=512.00MiB, used=28.51MiB -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/8] nowait aio: return on congested block device
On 04/24/2017 03:10 PM, Goldwyn Rodrigues wrote: > > > On 04/19/2017 01:45 AM, Christoph Hellwig wrote: >> On Fri, Apr 14, 2017 at 07:02:54AM -0500, Goldwyn Rodrigues wrote: >>> From: Goldwyn Rodrigues >>> >> >>> + /* Request queue supports BIO_NOWAIT */ >>> + queue_flag_set_unlocked(QUEUE_FLAG_NOWAIT, q); >> >> BIO_NOWAIT is gone. And the comment would not be needed if the >> flag had a more descriptive name, e.g. QUEUE_FLAG_NOWAIT_SUPPORT. >> >> And I think all request based drivers should set the flag implicitly >> as ->queuecommand can't sleep, and ->queue_rq only when it's always >> offloaded to a workqueue when the BLK_MQ_F_BLOCKING flag is set. >> > > We introduced QUEUE_FLAG_NOWAIT for devices which would not wait for > request completions. The ones which wait are MD devices because of sync > or suspend operations. > > The only user of BLK_MQ_F_NONBLOCKING seems to be nbd. As you mentioned, > it uses the flag to offload it to a workqueue. > > The other way to do it implicitly is to change the flag to > BLK_MAY_BLOCK_REQS and use it for devices which do wait such as md/dm. > Is that what you are hinting at? Or do you have something else in mind? You are misunderstanding. What Christoph (correctly) says is that request based drivers do not block, so all of those are fine. What you need to worry about is drivers that are NOT request based. In other words, drivers that hook into make_request_fn and process bio's. -- Jens Axboe -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
backing up a collection of snapshot subvolumes
I have a remote machine with a filesystem for which I periodically take incremental snapshots for historical reasons. These snapshots are stored in an archival filesystem tree on a file server. Older snapshots are removed and newer ones added on a rotational basis. I need to be able to backup this archive by syncing it with a set of backup drives. Due to the size, I need to back it up incrementally rather than sending the entire content each time. Due to the snapshot rotation, I need to be able to update the state of the archive backup filesystem as a whole, in much the same manner that rsync handles file trees. It seems that I cannot use "btrfs send", as the archive directory contains the snapshots as subvolumes. I cannot use rsync as it treats subvolumes as simple directories, and does not preserve subvolume attributes. Rsync also does not support reflinks, so the snapshot directory content will no longer be reflinked to other snapshots on the archive backup. I cannot use hard links in the incrementals as hard links do not cross subvolume boundaries. Thoughts anyone ? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with file system
Chris Murphy posted on Mon, 24 Apr 2017 11:02:02 -0600 as excerpted: > On Mon, Apr 24, 2017 at 9:27 AM, Fred Van Andel > wrote: >> I have a btrfs file system with a few thousand snapshots. When I >> attempted to delete 20 or so of them the problems started. >> >> The disks are being read but except for the first few minutes there are >> no writes. >> >> Memory usage keeps growing until all the memory (24 Gb) is used in a >> few hours. Eventually the system will crash with out of memory errors. In addition to what CMurphy and QW suggested (both valid), I have a couple other suggestions/pointers. They won't help you get out of the current situation, but they might help you stay out of it in the future. 1) A "few thousand snapshots", but no mention of how many subvolumes those snapshots are of, or how many per subvolume. As CMurphy says but I'll expand it here, taking a snapshot is nearly free, just a bit of metadata to write because btrfs is COW-based and all a snapshot does is lock down a copy of everything in the subvolume as it exists currently and the filesystem's already tracking that, but removal is expensive, because btrfs must go thru and check everything to see if it can actually be deleted (no other snapshots referencing the block) or not (something else referencing it). Obviously, then, this checking gets much more complicated the more snapshots of the same subvolume that exist. IOW, it's a scaling issue. The same scaling issue applies to various other btrfs maintenance tasks, including btrfs check (aka btrfsck), and btrfs balance (and thus btrfs device remove, which does an implicit balance). Both of these take *far* longer if the number of snapshots per subvolume is allowed to get out of hand. Due to this scaling issue, the recommendation is no more than 200-300 snapshots per subvolume, and keeping it down to 50-100 max is even better, if you can do it reasonably. That helps keep scaling issues and thus time for any necessary maintenance manageable. Otherwise... well, we've had reports of device removes (aka balances) that would take /months/ to finish at the rate they were going. Obviously, well before it gets to that point it's far faster to simply blow away the filesystem and restore from backups.[1] It follows that if you have an automated system doing the snapshots, it's equally important to have an automated system doing snapshot thinning as well, keeping the number of snapshots per subvolume within manageable scaling limits. So if that's "a few thousand snapshots", I hope you that's of (at least) a double-digit number of subvolumes, keeping the number of snapshots per subvolume under 300, and under 100 if your snapshot rotation schedule will allow it. 2) As Qu suggests, btrfs quotas increase the scaling issues significantly. Additionally, there have been and continue to be accuracy issues with certain quota corner-cases, so they can't be entirely relied upon anyway. Generally, people using btrfs quotas fall into three categories: a) Those who know the problems and are working with Qu and the other devs to report and trace issues so they will eventually work well, ideally with less of a scaling issue as well. Bless them! Keep it up! =:^) b) Those who have a use-case that really depends on quotas. Because btrfs quotas are buggy and not entirely reliable now, not to mention the scaling issues, these users are almost certainly better served using more mature filesystems with mature and dependable quotas. 
c) Those who don't really care about quotas specifically, and are just using them because it's a nice feature. This likely includes some who are simply running distros that enable quotas. My recommendation for these users is to simply turn btrfs quotas off for now, as they're presently in general more trouble than they're worth, due to both the accuracy and scaling issues. Hopefully quotas will be stable in a couple years, and with developer and tester hard work perhaps the scaling issues will have been reduced as well, and that recommendation can change. But for now, if you don't really need them, leaving quotas off will significantly reduce scaling issues. And if you do need them, they're not yet reliable on btrfs anyway, so better off using something more mature where they actually work. 3) Similarly (tho unlikely to apply in your case), beware of the scaling implications of the various reflink-based copying and dedup utilities, which work via the same copy-on-write and reflinking technology that's behind snapshotting. Tho snapshotting is effectively reflinking /everything/ in the subvolume, so the scaling issues compound much faster there than they will with a more trivial level of reflinking. Of course, when it comes to dedup, a more trivial level of reflinking means less benefit from doing the dedup in the first place, so there's a limit to the effectiveness of dedup before it starts having the same scal
Re: Problem with file system
On 25/04/17 03:26, Qu Wenruo wrote: IIRC qgroup for subvolume deletion will cause full subtree rescan which can cause tons of memory. Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even use this absurd amount of memory for? Is it swappable? Haven't read about RAM limitations for running qgroups before, only about CPU load (which importantly only requires patience, does not crash servers). -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with file system
At 04/25/2017 01:33 PM, Marat Khalili wrote: On 25/04/17 03:26, Qu Wenruo wrote: IIRC qgroup for subvolume deletion will cause full subtree rescan which can cause tons of memory. Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even use this absurd amount of memory for? Is it swappable? The memory is used for 2 reasons. 1) Record which extents are needed to trace Freed at transaction commit. Need better idea to handle them. Maybe create a new tree so that we can write it to disk? Or another qgroup rework? 2) Record current roots referring to this extent Only after v4.10 IIRC. The memory allocated is not swappable. How many memory it uses depends on the number of extents of that subvolume. It's 56 bytes for one extent, both tree block and data extent. To use up 16G ram, it's about 300 million extents. For 5.6T volume, its average extent size is about 20K. It seems that your volume is highly fragmented though. If that's the problem, disabling qgroup may be the best workaround. Thanks, Qu Haven't read about RAM limitations for running qgroups before, only about CPU load (which importantly only requires patience, does not crash servers). -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
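Spelling out the estimate above (assuming roughly 16 GiB of the 24 GiB went to this per-extent tracking, at 56 bytes per traced extent):

        16 GiB / 56 bytes per extent ~= 17,179,869,184 / 56 ~= 3.1e8, i.e. about 300 million extents
        5.6 TiB / 3.1e8 extents ~= 6,157,000,000,000 / 310,000,000 ~= 20 KiB average extent size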