[PATCH] Btrfs: disk-io: replace root args iff only fs_info used
This is the 3rd independent patch of a larger project to cleanup btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will cleanup between 1 and 34 functions apiece. Each patch covers a single file's functions. This patch affects the following function(s): 1) csum_tree_block 2) csum_dirty_buffer 3) check_tree_block_fsid 4) btrfs_find_tree_block 5) clean_tree_block Signed-off-by: Daniel Dressler danieru.dress...@gmail.com --- fs/btrfs/ctree.c | 26 +- fs/btrfs/disk-io.c | 32 fs/btrfs/disk-io.h | 4 ++-- fs/btrfs/extent-tree.c | 6 +++--- 4 files changed, 34 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 19bc616..e76a6ba 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1075,7 +1075,7 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, ret = btrfs_dec_ref(trans, root, buf, 1); BUG_ON(ret); /* -ENOMEM */ } - clean_tree_block(trans, root, buf); + clean_tree_block(trans, root-fs_info, buf); *last_ref = 1; } return 0; @@ -1681,7 +1681,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans, continue; } - cur = btrfs_find_tree_block(root, blocknr); + cur = btrfs_find_tree_block(root-fs_info, blocknr); if (cur) uptodate = btrfs_buffer_uptodate(cur, gen, 0); else @@ -1946,7 +1946,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans, path-locks[level] = 0; path-nodes[level] = NULL; - clean_tree_block(trans, root, mid); + clean_tree_block(trans, root-fs_info, mid); btrfs_tree_unlock(mid); /* once for the path */ free_extent_buffer(mid); @@ -2000,7 +2000,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans, if (wret 0 wret != -ENOSPC) ret = wret; if (btrfs_header_nritems(right) == 0) { - clean_tree_block(trans, root, right); + clean_tree_block(trans, root-fs_info, right); btrfs_tree_unlock(right); del_ptr(root, path, level + 1, pslot + 1); root_sub_used(root, right-len); @@ -2044,7 +2044,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans, BUG_ON(wret == 1); } if (btrfs_header_nritems(mid) == 0) { - clean_tree_block(trans, root, mid); + clean_tree_block(trans, root-fs_info, mid); btrfs_tree_unlock(mid); del_ptr(root, path, level + 1, pslot); root_sub_used(root, mid-len); @@ -2262,7 +2262,7 @@ static void reada_for_search(struct btrfs_root *root, search = btrfs_node_blockptr(node, slot); blocksize = root-nodesize; - eb = btrfs_find_tree_block(root, search); + eb = btrfs_find_tree_block(root-fs_info, search); if (eb) { free_extent_buffer(eb); return; @@ -2324,7 +2324,7 @@ static noinline void reada_for_balance(struct btrfs_root *root, if (slot 0) { block1 = btrfs_node_blockptr(parent, slot - 1); gen = btrfs_node_ptr_generation(parent, slot - 1); - eb = btrfs_find_tree_block(root, block1); + eb = btrfs_find_tree_block(root-fs_info, block1); /* * if we get -eagain from btrfs_buffer_uptodate, we * don't want to return eagain here. That will loop @@ -2337,7 +2337,7 @@ static noinline void reada_for_balance(struct btrfs_root *root, if (slot + 1 nritems) { block2 = btrfs_node_blockptr(parent, slot + 1); gen = btrfs_node_ptr_generation(parent, slot + 1); - eb = btrfs_find_tree_block(root, block2); + eb = btrfs_find_tree_block(root-fs_info, block2); if (eb btrfs_buffer_uptodate(eb, gen, 1) != 0) block2 = 0; free_extent_buffer(eb); @@ -2455,7 +2455,7 @@ read_block_for_search(struct btrfs_trans_handle *trans, blocknr = btrfs_node_blockptr(b, slot); gen = btrfs_node_ptr_generation(b, slot); - tmp =
Re: Changing label few times killed filesystem?
On 2014-11-21 04:35, Roman Mamedov wrote: On Fri, 21 Nov 2014 01:27:17 + Boris Chernov aqs1...@hotmail.com wrote: I have changed file system label few times in total. When I tried to mount it after that, it became not mountable: # mount /dev/sdb1 /mnt mount: Not a directory I'd say that implies something is wrong with your /mnt, rather than /dev/sdb1. Before mounting try things like ls -la /mnt/, umount /mnt, etc. Or simply mounting somewhere else other than /mnt/ Before I attempted mounting to /mnt I tried to mount with KDE Device Notifier to /media/username/label, then I have tried to create directory manually in /media/ and tried to mount in the command-line, then tried /mnt, and error was the same. So I'm sure there is nothing wrong with my mount points. Now I have rebooted and tried to mount in KDE Device Notifier to /media/username/label, it failed again, so I tried from command-line as root: # mkdir /media/sdb1 ls -la /media/sdb1 mount /dev/sdb1 /media/sdb1 total 8 drwxr-sr-x 2 root disk 4096 Nov 21 08:12 . drwsrwsrwT 7 root disk 4096 Nov 21 08:12 .. ...and that's it, no output from mount command (it just hanged and become unkillable process). Please let me know if there is anything else I could try to either restore it or debug it (to at least understand why exactly it screwed up itself so it will not happen again to me or anyone else). If it matters, the disk is with single partition (BTRFS-only), was plugged-in all the time and I use Xeon-based workstation with ECC memory. In the dmesg I see the following, it seems after encountering btrfs bugs in its recovery tools (mentioned in my previous mail) I have also encountered btrfs bug in the kernel: [ 339.349260] BTRFS info (device sdb1): disk space caching is enabled [ 339.397438] parent transid verify failed on 29458432 wanted 5 found 2759 [ 339.397505] [ cut here ] [ 339.397510] kernel BUG at fs/btrfs/locking.c:269! [ 339.397513] invalid opcode: [#1] SMP [ 339.397517] Modules linked in: ppp_deflate bsd_comp ppp_async crc_ccitt ppp_generic slhc snd_aloop snd_hrtimer xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables snd_ice1724 snd_ak4113 snd_pt2258 snd_ak4114 snd_i2c snd_ice17xx_ak4xxx snd_ak4xxx_adda snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd soundcore ac97_bus vmnet(O) parport_pc parport vmw_vsock_vmci_transport vsock vmw_vmci vmmon(O) cpufreq_conservative cpufreq_powersave cpufreq_userspace cpufreq_stats zram nvidia(PO) cfg80211 rfkill binfmt_misc uinput zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc iTCO_wdt iTCO_vendor_support usblp kvm_intel kvm ses enclosure cdc_ether psmouse option i2c_i801 pcspkr usbnet mii usb_wwan usbserial serio_raw i7core_edac edac_core uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media evdev joydev jc42 w83627ehf lm90 coretemp adt7475 hwmon_vid adm1021 ttm drm_kms_helper drm i2c_algo_bit i2c_core msr loop fuse tpm_infineon tpm_tis lpc_ich mfd_core tpm button acpi_cpufreq processor thermal_sys autofs4 ext4 crc16 mbcache jbd2 btrfs xor raid6_pq usb_storage sg sd_mod sr_mod cdrom crc_t10dif crct10dif_common hid_generic usbhid hid ahci libahci libata crc32c_intel scsi_mod e1000e ptp pps_core xhci_hcd ehci_pci ehci_hcd usbcore usb_common [last unloaded: vmnet] [ 339.397584] CPU: 0 PID: 25752 Comm: mount Tainted: P O 3.15.0-pf2 #1 [ 339.397585] Hardware name: Supermicro X8SIE/X8SIE, BIOS 1.2 08/19/11 [ 339.397586] task: 880036c93f80 ti: 8805702b4000 task.ti: 8805702b4000 [ 339.397587] RIP: 0010:[a0245050] [a0245050] btrfs_assert_tree_read_locked.part.0+0x0/0x10 [btrfs] [ 339.397604] RSP: 0018:8805702b7bf0 EFLAGS: 00010246 [ 339.397605] RAX: RBX: 8804db6da800 RCX: 0581 [ 339.397606] RDX: RSI: 8804db58d0e0 RDI: 8804db6da800 [ 339.397607] RBP: 0001 R08: 0001b830 R09: 88063fc1b830 [ 339.397608] R10: 88061afec700 R11: ea00136d6300 R12: 0005 [ 339.397609] R13: 88008c978820 R14: 88061af51000 R15: 8804db6da800 [ 339.397610] FS: 7f55bf45b840() GS:88063fc0() knlGS: [ 339.397612] CS: 0010 DS: ES: CR0: 8005003b [ 339.397613] CR2: 7f6b280af000 CR3: 0004da047000 CR4: 07f0 [ 339.397614] Stack: [ 339.397614] a024557d 8804db6da800 a0208838 [ 339.397616] 88008c978820 [ 339.397617] a02093a0 1c18 0005 8804db6da800 [ 339.397619] Call Trace: [ 339.397629]
Re: [RFC PATCH 0/6] btrfs: implement swap file support
On Tue, Nov 18, 2014 at 11:22:35PM -0800, Omar Sandoval wrote: Here's a nice little bit of insanity I put together in that direction -- consider it a discussion point more than a patch. It does two things: - Uses an ITER_BVEC iov_iter to do direct_IO for swap_readpage. This makes swap_readpage a synchronous operation, but I think that's the best we can do with the existing interface. Note that -read_iter for direct-io supports async I/O in general. By resurrecting some of the older attempts to do in-kernel aio this could be made async easily. - Unless I'm missing something, there don't appear to be any instances of ITER_BVEC | READ in the kernel, so the dio path doesn't know not to dirty pages it gets that way. Dave Kleikamp and Ming Lei each previously submitted patches doing this as part of adding an aio_kernel interface. (The NFS direct I/O implementation doesn't know how to deal with these either, so this patch actually breaks the only existing user of this code path, but in the interest of keeping the patch short, I didn't try to fix it :) Right, we'd need to look into. Bonus points of allowing this as a zero copy read. Btw, in the long run I would much prefer killing of the current horrible swap using bmap path in favor of an enhanced direct I/O path. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/5] nfs: don't dirty ITER_BVEC pages read through direct I/O
As with the generic blockdev code, kernel pages shouldn't be dirtied by the direct I/O path. Signed-off-by: Omar Sandoval osan...@osandov.com --- fs/nfs/direct.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c index 10bf072..a67fa2c 100644 --- a/fs/nfs/direct.c +++ b/fs/nfs/direct.c @@ -88,6 +88,7 @@ struct nfs_direct_req { struct pnfs_ds_commit_info ds_cinfo;/* Storage for cinfo */ struct work_struct work; int flags; + int should_dirty; /* should we mark read pages dirty? */ #define NFS_ODIRECT_DO_COMMIT (1) /* an unstable reply was received */ #define NFS_ODIRECT_RESCHED_WRITES (2) /* write verification failed */ struct nfs_writeverfverf; /* unstable write verifier */ @@ -370,7 +371,8 @@ static void nfs_direct_read_completion(struct nfs_pgio_header *hdr) struct nfs_page *req = nfs_list_entry(hdr-pages.next); struct page *page = req-wb_page; - if (!PageCompound(page) bytes hdr-good_bytes) + if (!PageCompound(page) bytes hdr-good_bytes + dreq-should_dirty) set_page_dirty(page); bytes += req-wb_bytes; nfs_list_remove_request(req); @@ -542,6 +544,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct iov_iter *iter, dreq-inode = inode; dreq-bytes_left = count; dreq-ctx = get_nfs_open_context(nfs_file_open_context(iocb-ki_filp)); + dreq-should_dirty = !(iter-type ITER_BVEC); l_ctx = nfs_get_lock_context(dreq-ctx); if (IS_ERR(l_ctx)) { result = PTR_ERR(l_ctx); -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 5/5] btrfs: enable swap file support
Implement the swap file a_ops on btrfs. Activation simply checks for a usable swap file: it must be fully allocated (no holes), support direct I/O (so no compressed or inline extents) and should be nocow (I'm not sure about that last one). Signed-off-by: Omar Sandoval osan...@osandov.com --- fs/btrfs/inode.c | 71 1 file changed, 71 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d23362f..b8fd36b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9442,6 +9442,75 @@ out_inode: } +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file, + sector_t *span) +{ + struct inode *inode = file_inode(file); + struct btrfs_inode *ip = BTRFS_I(inode); + int ret = 0; + u64 isize = inode-i_size; + struct extent_state *cached_state = NULL; + struct extent_map *em; + u64 start, len; + + if (ip-flags BTRFS_INODE_COMPRESS) { + /* Can't do direct I/O on a compressed file. */ + pr_err(BTRFS: swapfile is compressed); + return -EINVAL; + } + if (!(ip-flags BTRFS_INODE_NODATACOW)) { + /* The swap file can't be copy-on-write. */ + pr_err(BTRFS: swapfile is copy-on-write); + return -EINVAL; + } + + lock_extent_bits(ip-io_tree, 0, isize - 1, 0, cached_state); + + /* +* All of the extents must be allocated and support direct I/O. Inline +* extents and compressed extents fall back to buffered I/O, so those +* are no good. +*/ + start = 0; + while (start isize) { + len = isize - start; + em = btrfs_get_extent(inode, NULL, 0, start, len, 0); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out; + } + + if (test_bit(EXTENT_FLAG_VACANCY, em-flags) || + em-block_start == EXTENT_MAP_HOLE) { + pr_err(BTRFS: swapfile has holes); + ret = -EINVAL; + goto out; + } + if (em-block_start == EXTENT_MAP_INLINE) { + pr_err(BTRFS: swapfile is inline); + ret = -EINVAL; + goto out; + } + if (test_bit(EXTENT_FLAG_COMPRESSED, em-flags)) { + pr_err(BTRFS: swapfile is compresed); + ret = -EINVAL; + goto out; + } + + start = extent_map_end(em); + free_extent_map(em); + } + +out: + unlock_extent_cached(ip-io_tree, 0, isize - 1, cached_state, +GFP_NOFS); + return ret; +} + +static void btrfs_swap_deactivate(struct file *file) +{ +} + static const struct inode_operations btrfs_dir_inode_operations = { .getattr= btrfs_getattr, .lookup = btrfs_lookup, @@ -9519,6 +9588,8 @@ static const struct address_space_operations btrfs_aops = { .releasepage= btrfs_releasepage, .set_page_dirty = btrfs_set_page_dirty, .error_remove_page = generic_error_remove_page, + .swap_activate = btrfs_swap_activate, + .swap_deactivate = btrfs_swap_deactivate, }; static const struct address_space_operations btrfs_symlink_aops = { -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/5] direct-io: don't dirty ITER_BVEC pages on read
Reads through the iov_iter infrastructure for kernel pages shouldn't be dirtied by the direct I/O code. This is based on Dave Kleikamp's and Ming Lei's previously posted patches. Cc: Dave Kleikamp dave.kleik...@oracle.com Cc: Ming Lei ming@canonical.com Signed-off-by: Omar Sandoval osan...@osandov.com --- fs/direct-io.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index e181b6b..e542ce4 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -120,6 +120,7 @@ struct dio { spinlock_t bio_lock;/* protects BIO fields below */ int page_errors;/* errno from get_user_pages() */ int is_async; /* is IO async ? */ + int should_dirty; /* should we mark read pages dirty? */ bool defer_completion; /* defer AIO completion to workqueue? */ int io_error; /* IO error in completion path */ unsigned long refcount; /* direct_io_worker() and bios */ @@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio) dio-refcount++; spin_unlock_irqrestore(dio-bio_lock, flags); - if (dio-is_async dio-rw == READ) + if (dio-is_async dio-rw == READ dio-should_dirty) bio_set_pages_dirty(bio); if (sdio-submit_io) @@ -463,13 +464,13 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio) if (!uptodate) dio-io_error = -EIO; - if (dio-is_async dio-rw == READ) { + if (dio-is_async dio-rw == READ dio-should_dirty) { bio_check_pages_dirty(bio); /* transfers ownership */ } else { bio_for_each_segment_all(bvec, bio, i) { struct page *page = bvec-bv_page; - if (dio-rw == READ !PageCompound(page)) + if (dio-rw == READ !PageCompound(page) dio-should_dirty) set_page_dirty_lock(page); page_cache_release(page); } @@ -1177,6 +1178,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, dio-inode = inode; dio-rw = rw; + dio-should_dirty = !(iter-type ITER_BVEC); /* * For AIO O_(D)SYNC writes we need to defer completions to a workqueue -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/6] btrfs: implement swap file support
On Fri, Nov 21, 2014 at 02:06:57AM -0800, Christoph Hellwig wrote: On Tue, Nov 18, 2014 at 11:22:35PM -0800, Omar Sandoval wrote: Here's a nice little bit of insanity I put together in that direction -- consider it a discussion point more than a patch. It does two things: - Uses an ITER_BVEC iov_iter to do direct_IO for swap_readpage. This makes swap_readpage a synchronous operation, but I think that's the best we can do with the existing interface. Note that -read_iter for direct-io supports async I/O in general. By resurrecting some of the older attempts to do in-kernel aio this could be made async easily. - Unless I'm missing something, there don't appear to be any instances of ITER_BVEC | READ in the kernel, so the dio path doesn't know not to dirty pages it gets that way. Dave Kleikamp and Ming Lei each previously submitted patches doing this as part of adding an aio_kernel interface. (The NFS direct I/O implementation doesn't know how to deal with these either, so this patch actually breaks the only existing user of this code path, but in the interest of keeping the patch short, I didn't try to fix it :) Right, we'd need to look into. Bonus points of allowing this as a zero copy read. Btw, in the long run I would much prefer killing of the current horrible swap using bmap path in favor of an enhanced direct I/O path. Looks like I just raced on your email and sent a v2 of my patch series before seeing this response :) I'll take a closer look at this tomorrow, thanks for getting back to me. -- Omar -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 3.18.0-rc4 - WARNING: CPU: 3 PID: 3889 at fs/btrfs/extent-tree.c:3785 btrfs_free_reserved_data_space+0xc8/0xe0 [btrfs]()
On 12. nov. 2014 09:29, Torbjørn wrote: Hi, After upgrade to 3.18.0-rc4 get a lot of these warnings. The system seems to otherwise behave normally. I have not observed any actual issues. --snip-- Just to close off this email thread. After upgrading to 3.18.0-rc5 these messages have disappeared. For rc-4 it was about 80 000 messages a day. -- Torbjørn -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 4/5] btrfs: don't allow -C or +c chattrs on a swap file
swap_activate will check for a compressed or copy-on-write file; we shouldn't allow it to become either once it has already been activated. Signed-off-by: Omar Sandoval osan...@osandov.com --- fs/btrfs/ioctl.c | 50 +++--- 1 file changed, 31 insertions(+), 19 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 4399f0c..f022dce 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -293,14 +293,21 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) } } else { /* -* Revert back under same assuptions as above +* swap_activate checks that we don't swapon a copy-on-write +* file, but we must also make sure that it doesn't become +* copy-on-write. */ - if (S_ISREG(mode)) { - if (inode-i_size == 0) - ip-flags = ~(BTRFS_INODE_NODATACOW -| BTRFS_INODE_NODATASUM); - } else { - ip-flags = ~BTRFS_INODE_NODATACOW; + if (!IS_SWAPFILE(inode)) { + /* +* Revert back under same assumptions as above +*/ + if (S_ISREG(mode)) { + if (inode-i_size == 0) + ip-flags = ~(BTRFS_INODE_NODATACOW | + BTRFS_INODE_NODATASUM); + } else { + ip-flags = ~BTRFS_INODE_NODATACOW; + } } } @@ -317,20 +324,25 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) if (ret ret != -ENODATA) goto out_drop; } else if (flags FS_COMPR_FL) { - const char *comp; - - ip-flags |= BTRFS_INODE_COMPRESS; - ip-flags = ~BTRFS_INODE_NOCOMPRESS; + /* +* Like nodatacow, swap_activate checks that we don't swapon a +* compressed file, so we shouldn't let it become compressed. +*/ + if (!IS_SWAPFILE(inode)) { + const char *comp; - if (root-fs_info-compress_type == BTRFS_COMPRESS_LZO) - comp = lzo; - else - comp = zlib; - ret = btrfs_set_prop(inode, btrfs.compression, -comp, strlen(comp), 0); - if (ret) - goto out_drop; + ip-flags |= BTRFS_INODE_COMPRESS; + ip-flags = ~BTRFS_INODE_NOCOMPRESS; + if (root-fs_info-compress_type == BTRFS_COMPRESS_LZO) + comp = lzo; + else + comp = zlib; + ret = btrfs_set_prop(inode, btrfs.compression, +comp, strlen(comp), 0); + if (ret) + goto out_drop; + } } else { ret = btrfs_set_prop(inode, btrfs.compression, NULL, 0, 0); if (ret ret != -ENODATA) -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 3/5] swap: use direct I/O for SWP_FILE swap_readpage
On Mon, Nov 17, 2014 at 07:48:17AM -0800, Christoph Hellwig wrote: With the new iov_iter infrastructure that supprots direct I/O to kernel pages please get rid of the -readpage hack first. I'm still utterly disapoined that this crap ever got merged. Cc: Mel Gorman mgor...@suse.de Signed-off-by: Omar Sandoval osan...@osandov.com --- mm/page_io.c | 32 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/mm/page_io.c b/mm/page_io.c index 955db8b..10715e0 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -283,8 +283,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, set_page_writeback(page); unlock_page(page); - ret = mapping-a_ops-direct_IO(ITER_BVEC | WRITE, - kiocb, from, + ret = mapping-a_ops-direct_IO(WRITE, kiocb, from, kiocb.ki_pos); if (ret == PAGE_SIZE) { count_vm_event(PSWPOUT); @@ -348,12 +347,37 @@ int swap_readpage(struct page *page) } if (sis-flags SWP_FILE) { + struct kiocb kiocb; struct file *swap_file = sis-swap_file; struct address_space *mapping = swap_file-f_mapping; + struct bio_vec bv = { + .bv_page = page, + .bv_len = PAGE_SIZE, + .bv_offset = 0, + }; + struct iov_iter to = { + .type = ITER_BVEC | READ, + .count = PAGE_SIZE, + .iov_offset = 0, + .nr_segs = 1, + }; + to.bvec = bv; /* older gcc versions are broken */ + + init_sync_kiocb(kiocb, swap_file); + kiocb.ki_pos = page_file_offset(page); + kiocb.ki_nbytes = PAGE_SIZE; - ret = mapping-a_ops-readpage(swap_file, page); - if (!ret) + ret = mapping-a_ops-direct_IO(READ, kiocb, to, + kiocb.ki_pos); + if (ret == PAGE_SIZE) { + SetPageUptodate(page); count_vm_event(PSWPIN); + ret = 0; + } else { + ClearPageUptodate(page); + SetPageError(page); + } + unlock_page(page); return ret; } -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/5] btrfs: implement swap file support
This patch series, based on 3.18-rc5, implements support for swap files on BTRFS. The standard swap file implementation uses the filesystem's implementation of bmap() to get a list of physical blocks on disk, which the swap file code then does I/O on directly without going through the filesystem. This doesn't work for BTRFS, which is copy-on-write and therefore moves disk blocks around (COW isn't the only thing that can shuffle around disk blocks: consider defragmentation, balancing, etc.). Swap-over-NFS introduced an interface through which a filesystem can arbitrate swap I/O through address space operations: - swap_activate() is called by swapon() and informs the address space that the given file is going to be used for swap, so it should take adequate measures like reserving space on disk and pinning block lookup information in memory - swap_deactivate() is used to clean up on swapoff() - readpage() is used to page in (read a page from disk) - direct_IO() is used to page out (write a page out to disk) Version 2 modifies this interface in the first part of the patch series to use direct_IO for both reads and writes, which makes things much cleaner. The second part of the patch series implements support for the interface on BTRFS, which just means implementing swap_{,de}activate and adding some chattr checks, which raises the following considerations: - We can't do direct I/O on compressed or inline extents, so we can't use files with either for swap. - Supporting COW swapfiles might also come with some weird edge cases? This functionality is tenuously tested in a virtual machine with some artificial workloads. Comment away. Omar Sandoval (5): direct-io: don't dirty ITER_BVEC pages on read nfs: don't dirty ITER_BVEC pages read through direct I/O swap: use direct I/O for SWP_FILE swap_readpage btrfs: don't allow -C or +c chattrs on a swap file btrfs: enable swap file support v2: use direct_IO for swap_readpage fs/btrfs/inode.c | 71 fs/btrfs/ioctl.c | 50 --- fs/direct-io.c | 8 --- fs/nfs/direct.c | 5 +++- mm/page_io.c | 32 + 5 files changed, 139 insertions(+), 27 deletions(-) -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/5] btrfs: implement swap file support
Sorry for the noise, looks like Christoph got back to me on the previous RFC just before I sent this out -- disregard this for now. -- Omar -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/5] btrfs: implement swap file support
On Fri, Nov 21, 2014 at 02:15:31AM -0800, Omar Sandoval wrote: Sorry for the noise, looks like Christoph got back to me on the previous RFC just before I sent this out -- disregard this for now. If the NFS people are fine with this version I'd certainly welcome it as a first step. Additional improvements are of course always welcome. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: remove empty fs_devices to prevent memory runout
Hi Gui, Thanks for attempting this. There was this patch previously attempted for the same problem, which I had to nack.. [PATCH 1/2] btrfs: device list could grow infinite I haven't seen your fix in full yet, But looks like you are using dev_t to suffice the problem of identifying unique device. wonder how does this would this react when device goes through the disappear and reappear events ? Thanks, Anand On 11/20/14 10:15 AM, Gui Hecheng wrote: There is a global list @fs_uuids to keep @fs_devices object for each created btrfs. But when a btrfs becomes empty (all devices belong to it are gone), its @fs_devices remains in @fs_uuids list until module exit. If we keeps mkfs.btrfs on the same device again and again, all empty @fs_devices produced are sure to eat up our memory. So this case has better to be prevented. I think that each time we setup btrfs on that device, we should check whether we are stealing some device from another btrfs seen before. To faciliate the search procedure, we could insert all @btrfs_device in a rb_root, one @btrfs_device per each physical device, with @bdev-bd_dev as key. Each time device stealing happens, we should replace the corresponding @btrfs_device in the rb_root with an up-to-date version. If the stolen device is the last device in its @fs_devices, then we have an empty btrfs to be deleted. Actually there are 3 ways to steal devices and lead to empty btrfs 1. mkfs, with -f option 2. device add, with -f option 3. device replace, with -f option We should act under these cases. Moreover, if there are seed devices, then it is asured that the devices in cloned @fs_devices are not treated as valid devices. Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com --- fs/btrfs/super.c | 1 + fs/btrfs/volumes.c | 181 - fs/btrfs/volumes.h | 6 ++ 3 files changed, 172 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 54bd91e..ee09a56 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -2154,6 +2154,7 @@ static void __exit exit_btrfs_fs(void) btrfs_end_io_wq_exit(); unregister_filesystem(btrfs_fs_type); btrfs_exit_sysfs(); + btrfs_cleanup_valid_dev_root(); btrfs_cleanup_fs_uuids(); btrfs_exit_compress(); btrfs_hash_exit(); diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 0192051..ba86b1b 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -27,6 +27,7 @@ #include linux/kthread.h #include linux/raid/pq.h #include linux/semaphore.h +#include linux/rbtree.h #include asm/div64.h #include ctree.h #include extent_map.h @@ -52,6 +53,120 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device); DEFINE_MUTEX(uuid_mutex); static LIST_HEAD(fs_uuids); +static struct rb_root valid_dev_root = RB_ROOT; + +static struct btrfs_device *insert_valid_device(struct btrfs_device *new_dev) +{ + struct rb_node **p; + struct rb_node *parent; + struct rb_node *new; + struct btrfs_device *old_dev; + + WARN_ON(!mutex_is_locked(uuid_mutex)); + + parent = NULL; + new = new_dev-rb_node; + + p = valid_dev_root.rb_node; + while (*p) { + parent = *p; + old_dev = rb_entry(parent, struct btrfs_device, rb_node); + + if (new_dev-devnum old_dev-devnum) + p = parent-rb_left; + else if (new_dev-devnum old_dev-devnum) + p = parent-rb_right; + else { + rb_replace_node(parent, new, valid_dev_root); + RB_CLEAR_NODE(parent); + + goto out; + } + } + + old_dev = NULL; + rb_link_node(new, parent, p); + rb_insert_color(new, valid_dev_root); + +out: + return old_dev; +} + +static void free_fs_devices(struct btrfs_fs_devices *fs_devices) +{ + struct btrfs_device *device; + WARN_ON(fs_devices-opened); + while (!list_empty(fs_devices-devices)) { + device = list_entry(fs_devices-devices.next, + struct btrfs_device, dev_list); + list_del(device-dev_list); + rcu_string_free(device-name); + kfree(device); + } + kfree(fs_devices); +} + +static void remove_empty_fs_if_need(struct btrfs_fs_devices *old_fs) +{ + struct btrfs_fs_devices *seed_fs; + + if (!list_empty(old_fs-devices)) + return; + + list_del(old_fs-list); + + /* free the seed clones */ + seed_fs = old_fs-seed; + free_fs_devices(old_fs); + while (seed_fs) { + old_fs = seed_fs; + seed_fs = seed_fs-seed; + free_fs_devices(old_fs); + } + +} + +static void replace_invalid_device(struct btrfs_device *new_dev) +{ + struct btrfs_device
Re: BTRFS messes up snapshot LV with origin
On 11/20/2014 10:22 PM, Duncan wrote: But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs =:^), because they're no longer unique, requiring them to be unique just as the label says cannot be considered a bug. It's simply stricter enforcement of the rules, which are, after all, plainly stated in the descriptive name. You take Us away, not add them UID = unique ID GUID = globally unique ID UUID = universally unique ID And other file systems have the same issues. XFS, for example uses UUIDs in the same way. It just has a command to re-brand the filesystem's UUID which you apply to the LVM snapshot immediately after taking the snapshot. (problem long-since established and understood since 2009 or so.) I don't know if this approach would work for BRFS with subvolumes. Example Citation :: http://www.miljan.org/main/2009/11/16/lvm-snapshots-and-xfs/ XFS also has the nouuids mount option. btrfs has device= mount option. But any system with unique ids will have this identical issue when block-snapshot support is added underneath. -- Rob. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Changing label few times killed filesystem?
Chris Murphy posted on Thu, 20 Nov 2014 19:20:22 -0700 as excerpted: On Thu, Nov 20, 2014 at 6:27 PM, Boris Chernov aqs1...@hotmail.com wrote: Since it failed after checking extents I decided to try --init-extent-tree: There might be hope yet if you didn't use --repair which is said on the wiki and many times on this list is kindof a last resort. But at the very least before going with the hammer approach you should upgrade your btrfs-progs which is kind old. Current is 3.17.2. I suggest upgrading and just posting the results from 'btrfs check device' without any options and see what you get. This check and --repair code are mostly in btrfs-progs, whereas the mount time fixing code is in the kernel. So upgrading btrfs-progs may be sufficient for your case, but ultimately it might be necessary to go to a newer kernel also. Btrfs v3.14.1 I'm with Chris here. Additionally, I note that you (OP) are using kernel 3.15.x, while the entire kernel 3.15 series (which wasn't long-term supported so the last kernel update was shortly after 3.16 was released) is effectively blacklisted for btrfs, as it had a major btrfs bug in the compression handling code. (However, if you are not now and never did use compression on that filesystem, that bug shouldn't affect you, but others might.) The same bug was in 3.16.0 and 3.16.1, but was fixed in 3.16.2 (or was it 3.16.3) plus. So later 3.16 series kernels should be reasonably good. Unfortunately, 3.17 added another bug, this time with read-only snapshot handling. I don't do snapshots here and have been running it fine, but you'll want 3.17.2 plus if you do read-only snapshots. I've not yet switched to kernel 3.18 series (late development stage at this point) here, but while there was a problem in rc4, rc5 appears to be good according to reports I've seen. Meanwhile, userspace-side, there have been a number of fixes to btrfs check and the restore code in the 3.16 and 3.17 series, and while running the latest userspace isn't as critical as the kernel for normal operations (online operations) since for them the kernel is the operational code, for fixup (offline operations like btrfs check and btrfs restore), you really do want to be running the latest userspace, because in that case it's the userspace code that's actually doing the work. Meanwhile, in the other subthread you mentioned not understanding transid. FWIW transaction ID and generation are used interchangeably in btrfs discussions and refer to the the same thing -- a monotonically increasing number that gets bumped every time the root tree and superblocks are committed. Normally later generations/transids indicate later commits and thus closer a filesystem state closer to current. Note that you can use btrfs-show-super to display information from the superblocks including what it thinks the current generation/transid should be. Which brings us to the output. In most cases when there's problems with the transid/generation, wanted will be a bit higher than found, something like found 25456, wanted 25459. That simply means that the three latest commits got lost somewhere and you may have to settle for an older one (which is where the wiki restore article you mentioned comes in). But there were a number of reports recently where wanted was *MUCH* *LOWER* than found (like wanted 5, found 2753), which is what you're seeing. Unfortunately I don't remember the resolution of those reports, or indeed, if the bug has been traced yet. There is another bug (or possibly the above was after this one hit if it didn't stop further commits in some cases, thus resetting the generation to zero and increasing it again from there), however, where the transid was being zeroed. Wanted 26473, found 0. One of the devs mentioned tracing that one, tho again I'm not sure current status except that they mentioned it so they're obviously working on it. To my knowledge, these were *NOT* in the context of relabeling, however, so it's quite possible you're seeing the one bug, and the relabeling is simply coincidence. Again, however, you're running a 3.15 kernel that's effectively btrfs blacklisted, and and an even older 3.14 userspace. I can't promise upgrading will give you an actual fix, but certainly, getting current on your kernel and userspace will at least get you on the same page as most folks here, so we know we're not dealing with old and in the case of the kernel known blacklisted versions, and the bugs in play will at least be current ones, not long since fixed ones. And for the kernel, avoid 3.15 series entirely, along with early 3.16 (before 3.16.3) and 3.17 (before 3.17.2), plus early development 3.18 (current rcs should be better). -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the
Re: BTRFS messes up snapshot LV with origin
Robert White posted on Fri, 21 Nov 2014 03:35:05 -0800 as excerpted: On 11/20/2014 10:22 PM, Duncan wrote: But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs =:^), because they're no longer unique, requiring them to be unique just as the label says cannot be considered a bug. It's simply stricter enforcement of the rules, which are, after all, plainly stated in the descriptive name. You take Us away, not add them UID = unique ID GUID = globally unique ID UUID = universally unique ID I was making a joke, as I happened to notice un-UUID =3 U-s just as I was writing that. Universally unique ID = UUID, un-UUID (not universally unique ID) = UUUID = U^3ID. =:^) Of course formally it'd be NUID (not/non- unique) or some such, but un- UUID served my purpose well enough, including the joke once I noticed it, so... -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs send and an existing backup
On 2014-11-20 09:10, Duncan wrote: Bardur Arantsson posted on Thu, 20 Nov 2014 14:17:52 +0100 as excerpted: If you have no other backups, I would really recommend that you *don't* use btrfs for your backup, or at least have a *third* backup which isn't on btrfs -- there are *still* problems with btrfs that can potentially wreck your backup filesystem. (Although it's obviously less likely if the external HDD will only be connected occasionally.) Don't get me wrong, btrfs is becoming more and more stable, but I wouldn't trust it with my *only* backup, especially if also running btrfs on the backed-up filesystem. This. My working versions and first backups are btrfs. My secondary backups are reiserfs (my old filesystem of choice, which has been very reliable for me), just in case both the btrfs versions bite the dust due to a bug in btrfs itself. Likewise, except I use compressed, encrypted tarballs stored on both Amazon S3 and Dropbox. smime.p7s Description: S/MIME Cryptographic Signature
Re: soft lockup - CPU#0 stuck - Kernel 3.17.2
On 15.11.2014 00:47, Chris Mason wrote: Ok, I think this is related to the new fair read/writer lock implementation in the generic kernel code. btrfs_clear_path_blocking() will end up taking locks outside of the strict locking order the rest of Btrfs uses. This used to be fine because we hold the blocking lock while we do it, but with the new queued locks we're running into trouble. We hit a similar bug earlier and I convinced myself the problem was only with btrfs_next_leaf and our trylock. But it's a bigger problem than I realized. So for now I've changed btrfs_clear_path_blocking to honor the rules and fixed up our trylock to make it faster. I'm letting a test run on this patch over the weekend, since I don't want any surprises with your backup farm. -chris Hi Chris Here comes a short feedback I applied your patch (Fix lockups from btrfs_clear_path_blocking) yesterday and my system survived this night! ;-) Thank you for the quick fix. Regards Patrick -- Patrick Schmid sch...@phys.ethz.ch support: +41 44 633 2668 IT Services Group, HPT H 8voice: +41 44 633 3997 Departement Physik, ETH Zurich CH-8093 Zurich, Switzerland -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, 21 Nov 2014 09:05:32 +0200, Brendan Hide wrote: On 2014/11/21 06:58, Zygo Blaxell wrote: I also notice you are not running regular SMART self-tests (e.g. by smartctl -t long) and the last (and first, and only!) self-test the drive ran was ~12000 hours ago. That means most of your SMART data is about 18 months old. The drive won't know about sectors that went bad in the last year and a half unless the host happens to stumble across them during a read. The drive is over five years old in operating hours alone. It is probably so fragile now that it will break if you try to move it. All interesting points. Do you schedule SMART self-tests on your own systems? I have smartd running. In theory it tracks changes and sends alerts if it figures a drive is going to fail. But, based on what you've indicated, that isn't good enough. Simply monitoring the smart status without a self-test isn't really that great. I'm not sure on the default config, but smartd can be made to initiate a smart self-test at regular intervals. Depending on the test type (short, long, etc) it could include a full surface scan. This can reveal things like bad sectors before you ever hit them during normal system usage. WARNING: errors detected during scrubbing, corrected. [snip] scrub device /dev/sdb2 (id 2) done scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds total bytes scrubbed: 189.49GiB with 5420 errors error details: read=5 csum=5415 corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 That seems a little off. If there were 5 read errors, I'd expect the drive to have errors in the SMART error log. Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. There have been a number of fixes to csums in btrfs pulled into the kernel recently, and I've retired two five-year-old computers this summer due to RAM/CPU failures. The difference here is that the issue only affects the one drive. This leaves the probable cause at: - the drive itself - the cable/ports with a negligibly-possible cause at the motherboard chipset. This is the same problem that I'm currently trying to resolve. I have one drive in a raid1 setup which shows no issues in smart status but often has checksum errors. In my situation what I've found is that if I scrub let it fix the errors then a second pass immediately after will show no errors. If I then leave it a few days try again there will be errors, even in old files which have not been accessed for months. If I do a read-only scrub to get a list of errors, a second scrub immediately after will show exactly the same errors. Apart from the scrub errors the system logs shows no issues with that particular drive. My next step is to disable autodefrag see if the problem persists. (I'm not suggesting a problem with autodefrag, I just want to remove it from the equation ensure that outside of normal file access, data isn't being rewritten between scrubs) -- Ian -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: soft lockup - CPU#0 stuck - Kernel 3.17.2
On Fri, Nov 21, 2014 at 8:01 AM, Patrick Schmid sch...@phys.ethz.ch wrote: On 15.11.2014 00:47, Chris Mason wrote: Ok, I think this is related to the new fair read/writer lock implementation in the generic kernel code. btrfs_clear_path_blocking() will end up taking locks outside of the strict locking order the rest of Btrfs uses. This used to be fine because we hold the blocking lock while we do it, but with the new queued locks we're running into trouble. We hit a similar bug earlier and I convinced myself the problem was only with btrfs_next_leaf and our trylock. But it's a bigger problem than I realized. So for now I've changed btrfs_clear_path_blocking to honor the rules and fixed up our trylock to make it faster. I'm letting a test run on this patch over the weekend, since I don't want any surprises with your backup farm. -chris Hi Chris Here comes a short feedback I applied your patch (Fix lockups from btrfs_clear_path_blocking) yesterday and my system survived this night! ;-) Thank you for the quick fix. Great to hear, thanks for following up. I'm sending this to Linus for the next 3.18 rc. -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 11/20/2014 5:45 PM, Robert White wrote: Nice attempt at saving face, but wrong as _always_. The CONFIG_PATA_ACPI option has been in the kernel since 2008 and lots of people have used it. If you search for ACPI ide you'll find people complaining in 2008-2010 about windows error messages indicating the device is present in their system but no OS driver is available. Nope... not finding it. The closest thing was one or two people who said ACPI when they meant AHCI ( and were quickly corrected ). This is probably what you were thinking of since windows xp did not ship with an ahci driver so it was quite common for winxp users to have this problem when in _AHCI_ mode. That you have yet to see a single system that implements it is about the worst piece of internet research I've ever seen. Do you not _get_ that your opinion about what exists and how it works is not authoritative? Show me one and I'll give you a cookie. I have disassembled a number of acpi tables and yet to see one that has it. What's more, motherboard vendors tend to implement only the absolute minimum they have to. Since nobody actually needs this feature, they aren't going to bother with it. Do you not get that your hand waving arguments of you can google for it are not authoritative? You can also find articles about both windows and linux systems actively using ACPI fan control going back to 2009 Maybe you should have actually read those articles. Linux supports acpi fan control, unfortunately, almost no motherboards actually implement it. Almost everyone who wants fan control working in linux has to install lm-sensors and load a driver that directly accesses one of the embedded controllers that motherboards tend to use and run the fancontrol script to manipulate the pwm channels on that controller. These days you also have to boot with a kernel argument to allow loading the driver since ACPI claims those IO ports for its own use which creates a conflict. Windows users that want to do this have to install a program... I believe a popular one is called q-fan, that likewise directly accesses the embedded controller registers to control the fan, since the acpi tables don't bother properly implementing the acpi fan spec. Then there are thinkpads, and one or two other laptops ( asus comes to mind ) that went and implemented their own proprietary acpi interfaces for fancontrol instead of following the spec, which required some reverse engineering and yet more drivers to handle these proprietary acpi interfaces. You can google for thinkfan if you want to see this. These are not hard searches to pull off. These are not obscure references. Go to the google box and start typing ACPI fan... and check the autocomplete. I'll skip ovea all the parts where you don't know how a chipset works and blah, blah, blah... You really should have just stopped at I don't know and I've never because you keep demonstrating that you _don't_ know, and that you really _should_ _never_. Tell us more about the lizard aliens controlling your computer, I find your versions of realty fascinating... By all means, keep embarrassing yourself with nonsense and trying to cover it up by being rude and insulting. -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUb1YxAAoJEI5FoCIzSKrwi54H/Rkd7DloqC9x9QwN4QdmWcAZ /UQg3hcRbtB3wpmp34Mnb3SS0Ii2mCh/dtKmdRGBNE/x5nU1WiQEHHCicKX3Avvq 8OXLNQrsf+xZL9/HGtUJ3RefpEkmwIG5NgFfKJHtv6Iq204Umq32JUxRla+ZQE5s MrUparigpUlj26lrnShc6ByDUqYK3wOTsDxEMxrOyAgi/n/7ESHV/dZVaqsE6jGQ OvPynf1FqJoJSSYC7sNE0XLqfHMu2wnSxcoF6MpuHXlDiwtrSH07tuwgrhCNPagY j7gQyxucew8oim8lcfs+4rrQ60wwVzlsEJwjA9rAXQF7U2x/WoB+ArYhgmJUMgA= =cXJr -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 11/20/2014 6:08 PM, Robert White wrote: Well you should have _actually_ trimmed your response down to not pressing send. _Many_ motherboards have complete RAID support at levels 0, 1, 10, and five 5. A few have RAID6. Some of them even use the LSI chip-set. Yes, there are some expensive server class motherboards out there with integrated real raid chips. Your average consumer class motherboards are not those. They contain intel, nvidia, sil, promise, and via chipsets that are fake raid. Seriously... are you trolling this list with disinformation or just repeating tribal knowledge from fifteen year old copies of PC Magazine? Please drop the penis measuring. Yea, some of the IDE motherboards and that only had RAID1 and RAID0 (and indeed some of the add-on controllers) back in the IDE-only days were really lame just-forked-write devices with no integrity checks (hence fake raid) but that's from like the 1990s; it's paleolithic age wisdom at this point. Wrong again... fakeraid became popular with the advent of SATA since it was easy to add a knob to the bios to switch it between AHCI and RAID mode, and just change the pci device id. These chipsets are still quite common today and several of them do support raid5 and raid10 ( well, really it's raid 0 + raid1, but that's a whole nother can of worms ). Recent intel chips also now have a caching mode for having an SSD cache a larger HDD. Intel has also done a lot of work integrating support for their chipset into mdadm in the last year or three. -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.17 (MingW32) iQEcBAEBAgAGBQJUb1ngAAoJEI5FoCIzSKrwqMQIAJ3MfA4n74aJ1KUdfHYOz96o vwPBNQJ953yozmCHfjERbTCQlKT5AzwQHWpHoFWsQ4gYoNGmeE1jy2rsqxMfujff eQekfISyX3POExnsr3LnfHWI2/Om39+EAxVPxbA5LN6SC1SCWRut7Q3bQqkuxj/S bYRU65XJ9BZ6eYznutMDFdEELyAr8b9/wnatI/ohzmebOBDgFzBrn8gwilCctz7X DI39HTkCvciWKVXNyVdUZKI5S+MRCEB2JZAkCy3x8LLsENmMnO0xN32o5Od0zlGn nFLcLQFrZfz5dY2ZusxP+z0z0x4RW3sikd4RZ99PEHBkFa5CgJIFrBxtQAsLi1c= =4Yg+ -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/5] direct-io: don't dirty ITER_BVEC pages on read
On 11/21/2014 04:08 AM, Omar Sandoval wrote: Reads through the iov_iter infrastructure for kernel pages shouldn't be dirtied by the direct I/O code. This is based on Dave Kleikamp's and Ming Lei's previously posted patches. Acked-by: Dave Kleikamp dave.kleik...@oracle.com Cc: Ming Lei ming@canonical.com Signed-off-by: Omar Sandoval osan...@osandov.com --- fs/direct-io.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index e181b6b..e542ce4 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -120,6 +120,7 @@ struct dio { spinlock_t bio_lock;/* protects BIO fields below */ int page_errors;/* errno from get_user_pages() */ int is_async; /* is IO async ? */ + int should_dirty; /* should we mark read pages dirty? */ bool defer_completion; /* defer AIO completion to workqueue? */ int io_error; /* IO error in completion path */ unsigned long refcount; /* direct_io_worker() and bios */ @@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio) dio-refcount++; spin_unlock_irqrestore(dio-bio_lock, flags); - if (dio-is_async dio-rw == READ) + if (dio-is_async dio-rw == READ dio-should_dirty) bio_set_pages_dirty(bio); if (sdio-submit_io) @@ -463,13 +464,13 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio) if (!uptodate) dio-io_error = -EIO; - if (dio-is_async dio-rw == READ) { + if (dio-is_async dio-rw == READ dio-should_dirty) { bio_check_pages_dirty(bio); /* transfers ownership */ } else { bio_for_each_segment_all(bvec, bio, i) { struct page *page = bvec-bv_page; - if (dio-rw == READ !PageCompound(page)) + if (dio-rw == READ !PageCompound(page) dio-should_dirty) set_page_dirty_lock(page); page_cache_release(page); } @@ -1177,6 +1178,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, dio-inode = inode; dio-rw = rw; + dio-should_dirty = !(iter-type ITER_BVEC); /* * For AIO O_(D)SYNC writes we need to defer completions to a workqueue -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: qgroup: add BUILD_BUG to report pointer cast breakage
On Thu, Nov 13, 2014 at 12:54:39AM +0900, Daniel Dressler wrote: Our ulist data structure stores at max 64bit values. qgroup has used this structure to store pointers. In the future when we upgrade to 128bit this casting of pointers to uint64_t will break. What are we going to upgrade to 128 bits? This patch adds a BUILD_BUG ensuring that this code will not be left untouched in the upgrade. It also marks this issue on the TODO list so it may be addressed before such an upgrade. + (BUILD_BUG_ON_ZERO(sizeof(uintptr_t) sizeof(u64)) + \ So it's pointers, this is similar to the 32bit - 64bit transition, a change that will affect everything, one BUILD_BUG on does not make any difference. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: ctree: reduce args where only fs_info used
On Wed, Nov 12, 2014 at 01:43:09PM +0900, Daniel Dressler wrote: This patch is part of a larger project to cleanup btrfs's internal usage of struct btrfs_root. Many functions take btrfs_root only to grab a pointer to fs_info. Thanks for picking up the project. A mere formality, can you please justify the paragraphs to 74 chars? -- This patch is part of a larger project to cleanup btrfs's internal usage of struct btrfs_root. Many functions take btrfs_root only to grab a pointer to fs_info. -- This patch does not address the two functions in ctree.c (insert_ptr, and split_item) which only use root for BUG_ONs in ctree.c This patch affects the following functions: 1) fixup_low_keys 2) btrfs_set_item_key_safe Please send one patch per function change, unless there are more that are somehow entangled that it would make it hard to separate. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: delayed-inode: replace root args iff only fs_info used
On Mon, Nov 17, 2014 at 10:05:02PM +0900, Daniel Dressler wrote: This is the second independent patch of a larger project to cleanup btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will cleanup between 1 and 34 functions apiece. Each patch covers a single file's functions. This patch affects the following function(s): 1) btrfs_wq_run_delayed_node Signed-off-by: Daniel Dressler danieru.dress...@gmail.com Apart from the narrow paragraphs, Reviewed-by: David Sterba dste...@suse.cz -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: ctree: reduce args where only fs_info used
Ah thanks David for looking at this. Sorry for the thin paragraphs my vim was warming too early about long lines. I will reformat it to break at 74 chars. No problem, I'll redo everything so it is one function per patch. Now fair warning: there are about 102 functions to cleanup. I was a bit worried that many patches would cause too much maintainer overhead but it is no problem for me. Only a few functions have dependecies on other functions needing cleanup. Thus there will be some small patch series for those function sets. A big benefit of one function one patch is that extent-io.c will no longer be a 34 function monster patch. Thank you David, I'll redo all these patches. Is there any rate limiting I should be doing? I don't want to flood the list with burst of dozen plus patches, or is that an okay volume? Daniel 2014-11-22 0:55 GMT+09:00 David Sterba dste...@suse.cz: On Wed, Nov 12, 2014 at 01:43:09PM +0900, Daniel Dressler wrote: This patch is part of a larger project to cleanup btrfs's internal usage of struct btrfs_root. Many functions take btrfs_root only to grab a pointer to fs_info. Thanks for picking up the project. A mere formality, can you please justify the paragraphs to 74 chars? -- This patch is part of a larger project to cleanup btrfs's internal usage of struct btrfs_root. Many functions take btrfs_root only to grab a pointer to fs_info. -- This patch does not address the two functions in ctree.c (insert_ptr, and split_item) which only use root for BUG_ONs in ctree.c This patch affects the following functions: 1) fixup_low_keys 2) btrfs_set_item_key_safe Please send one patch per function change, unless there are more that are somehow entangled that it would make it hard to separate. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: disk-io: replace root args iff only fs_info used
On Fri, Nov 21, 2014 at 05:15:07PM +0900, Daniel Dressler wrote: This is the 3rd independent patch of a larger project to cleanup btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will cleanup between 1 and 34 functions apiece. Each patch covers a single file's functions. It's good to have this kind of introduction but it really belongs ot the cover letter not the individual patches. This patch affects the following function(s): 1) csum_tree_block 2) csum_dirty_buffer 3) check_tree_block_fsid 4) btrfs_find_tree_block 5) clean_tree_block Now that I see that, I'm not sure that my previous comment about 'one patch per function' is the right way to go. This patch looks good as it stands. The change is simple enough that I won't be opposed to grouping even more functions together as long as it stays revieweable. The patches are likely to clash with a lot of pending patches, so you may want to base it on the integration branch next time. This would make maintainers' life easier and also raises chances to merge the patches. Reviewed-by: David Sterba dste...@suse.cz -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: disk-io: replace root args iff only fs_info used
Thank you David this is helpful feedback. What would a cover letter be like? Would that be a separate email to the list, or maybe the first email in a patch series? Sorry I've twice looked for the integration repo. I found some that look like it could be but those had older commits. Could you direct me to the exact branch I'd love to work against it. These patches were done against linux-next. I think small one function patches might be best. I have the codebase mapped out and each file's functions-to-be-cleaned count varies wildly. If I did batch files together and split large files apart there would be no rhyme or reason for the groupings. With single function patches it is very clear what changes are justified since they should only occur in the affected function or in a call-site. With multiple functions the call-site changes get mixed up would it would be harder to review. Daniel 2014-11-22 1:15 GMT+09:00 David Sterba dste...@suse.cz: On Fri, Nov 21, 2014 at 05:15:07PM +0900, Daniel Dressler wrote: This is the 3rd independent patch of a larger project to cleanup btrfs's internal usage of btrfs_root. Many functions take btrfs_root only to grab the fs_info struct. By requiring a root these functions cause programmer overhead. That these functions can accept any valid root is not obvious until inspection. This patch reduces the specificity of such functions to accept the fs_info directly. These patches can be applied independently and thus are not being submitted as a patch series. There should be about 26 patches by the project's completion. Each patch will cleanup between 1 and 34 functions apiece. Each patch covers a single file's functions. It's good to have this kind of introduction but it really belongs ot the cover letter not the individual patches. This patch affects the following function(s): 1) csum_tree_block 2) csum_dirty_buffer 3) check_tree_block_fsid 4) btrfs_find_tree_block 5) clean_tree_block Now that I see that, I'm not sure that my previous comment about 'one patch per function' is the right way to go. This patch looks good as it stands. The change is simple enough that I won't be opposed to grouping even more functions together as long as it stays revieweable. The patches are likely to clash with a lot of pending patches, so you may want to base it on the integration branch next time. This would make maintainers' life easier and also raises chances to merge the patches. Reviewed-by: David Sterba dste...@suse.cz -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 4/5] btrfs: don't allow -C or +c chattrs on a swap file
On Fri, Nov 21, 2014 at 02:08:30AM -0800, Omar Sandoval wrote: @@ -293,14 +293,21 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg) } } else { You can put the condition here, instead of shifting the nested block. } else if (!IS_SWAPFILE(inode)) { /* - * Revert back under same assuptions as above + * swap_activate checks that we don't swapon a copy-on-write + * file, but we must also make sure that it doesn't become + * copy-on-write. */ - if (S_ISREG(mode)) { - if (inode-i_size == 0) - ip-flags = ~(BTRFS_INODE_NODATACOW - | BTRFS_INODE_NODATASUM); - } else { - ip-flags = ~BTRFS_INODE_NODATACOW; + if (!IS_SWAPFILE(inode)) { + /* + * Revert back under same assumptions as above + */ + if (S_ISREG(mode)) { + if (inode-i_size == 0) + ip-flags = ~(BTRFS_INODE_NODATACOW | +BTRFS_INODE_NODATASUM); + } else { + ip-flags = ~BTRFS_INODE_NODATACOW; + } } } -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong bt...@iarmst.co.uk wrote: In my situation what I've found is that if I scrub let it fix the errors then a second pass immediately after will show no errors. If I then leave it a few days try again there will be errors, even in old files which have not been accessed for months. What are the devices? And if they're SSDs are they powered off for these few days? I take it the scrub error type is corruption? You can use badblocks to write a known pattern to the drive. Then power off and leave it for a few days. Then read the drive, matching against the pattern, and see if there are any discrepancies. Doing this outside the code path of Btrfs would fairly conclusively indicate whether it's hardware or software induced. Assuming you have another copy of all of these files :-) you could just sha256sum the two copies to see if they have in fact changed. If they have, well then you've got some silent data corruption somewhere somehow. But if they always match, then that suggests a bug. I don't see how you can get bogus corruption messages, and for it to not be a bug. When you do these scrubs that come up clean, and then later come up with corruptions, have you done any software updates? My next step is to disable autodefrag see if the problem persists. (I'm not suggesting a problem with autodefrag, I just want to remove it from the equation ensure that outside of normal file access, data isn't being rewritten between scrubs) I wouldn't expect autodefrag to touch old files not accessed for months. Doesn't it only affect actively used files? -- Chris Murphy -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, Nov 21, 2014 at 09:05:32AM +0200, Brendan Hide wrote: On 2014/11/21 06:58, Zygo Blaxell wrote: You have one reallocated sector, so the drive has lost some data at some time in the last 49000(!) hours. Normally reallocations happen during writes so the data that was lost was data you were in the process of overwriting anyway; however, the reallocated sector count could also be a sign of deteriorating drive integrity. In /var/lib/smartmontools there might be a csv file with logged error attribute data that you could use to figure out whether that reallocation was recent. I also notice you are not running regular SMART self-tests (e.g. by smartctl -t long) and the last (and first, and only!) self-test the drive ran was ~12000 hours ago. That means most of your SMART data is about 18 months old. The drive won't know about sectors that went bad in the last year and a half unless the host happens to stumble across them during a read. The drive is over five years old in operating hours alone. It is probably so fragile now that it will break if you try to move it. All interesting points. Do you schedule SMART self-tests on your own systems? I have smartd running. In theory it tracks changes and sends alerts if it figures a drive is going to fail. But, based on what you've indicated, that isn't good enough. I run 'smartctl -t long' from cron overnight (or whenever the drives are most idle). You can also set up smartd.conf to launch the self tests; however, the syntax for test scheduling is byzantine compared to cron (and that's saying something!). On multi-drive systems I schedule a different drive for each night. If you are also doing btrfs scrub, then stagger the scheduling so e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. smartd is OK for monitoring test logs and email alerts. I've had no problems there. WARNING: errors detected during scrubbing, corrected. [snip] scrub device /dev/sdb2 (id 2) done scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds total bytes scrubbed: 189.49GiB with 5420 errors error details: read=5 csum=5415 corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164 That seems a little off. If there were 5 read errors, I'd expect the drive to have errors in the SMART error log. Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem. There have been a number of fixes to csums in btrfs pulled into the kernel recently, and I've retired two five-year-old computers this summer due to RAM/CPU failures. The difference here is that the issue only affects the one drive. This leaves the probable cause at: - the drive itself - the cable/ports with a negligibly-possible cause at the motherboard chipset. If it was cable, there should be UDMA CRC errors or similar in the SMART counters, but they are zero. You can also try swapping the cable and seeing whether the errors move. I've found many bad cables that way. The drive itself could be failing in some way that prevents recording SMART errors (e.g. because of host timeouts triggering a bus reset, which also prevents the SMART counter update for what was going wrong at the time). This is unfortunately quite common, especially with drives configured for non-RAID workloads. -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: Digital signature
Re: BTRFS messes up snapshot LV with origin
On Fri, Nov 21, 2014 at 06:22:57AM +, Duncan wrote: After all, an LVM block-level snapshot takes the same space as a file containing the same raw data, and if there's room for the data in an LVM snapshot, given a different layout, there's room for exactly the same amount of data as a file on a different filesystem, piped thru some compressor if necessary due to tight datasize constraints. That isn't true at all. A repairing fsck can take less than 1% of the overall volume size, and a full conversion from another filesystem type can take less than 10%. Usually I can find enough space by blowing away the swap LV for a few hours. I do NOT usually have 13TB of slack space lying around in a 26TB disk array, nor do I have enough bandwidth to move those 13TB to another machine without great inconvenience. But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs =:^), because they're no longer unique, requiring them to be unique just as the label says cannot be considered a bug. It's simply stricter enforcement of the rules, which are, after all, plainly stated in the descriptive name. It's not a bug as long as I can completely control which devices are searched for UUIDs, and the system behaves sanely when multiple UUIDs are found through automatic discovery; otherwise, it's not only a bug, it's a DoS attack security vulnerability. Consider what happens if someone looks at /sys/fs/btrfs, reads the non-secret UUIDs, builds a fake filesystem with those UUIDs, puts the fake filesystem on a USB stick, and plugs it back into the victim machine... -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: Digital signature
Re: [PATCH v2 5/5] btrfs: enable swap file support
On Fri, Nov 21, 2014 at 02:08:31AM -0800, Omar Sandoval wrote: Implement the swap file a_ops on btrfs. Activation simply checks for a usable swap file: it must be fully allocated (no holes), support direct I/O (so no compressed or inline extents) and should be nocow (I'm not sure about that last one). Signed-off-by: Omar Sandoval osan...@osandov.com --- fs/btrfs/inode.c | 71 1 file changed, 71 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d23362f..b8fd36b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9442,6 +9442,75 @@ out_inode: } +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file, +sector_t *span) +{ + struct inode *inode = file_inode(file); + struct btrfs_inode *ip = BTRFS_I(inode); 'ip' looks strange in context of a filesystem, please pick a different name (eg. 'inode' or whatever). + int ret = 0; + u64 isize = inode-i_size; + struct extent_state *cached_state = NULL; + struct extent_map *em; + u64 start, len; + + if (ip-flags BTRFS_INODE_COMPRESS) { + /* Can't do direct I/O on a compressed file. */ + pr_err(BTRFS: swapfile is compressed); Please use the btrfs_err macros everywhere. + return -EINVAL; + } + if (!(ip-flags BTRFS_INODE_NODATACOW)) { + /* The swap file can't be copy-on-write. */ + pr_err(BTRFS: swapfile is copy-on-write); + return -EINVAL; + } + + lock_extent_bits(ip-io_tree, 0, isize - 1, 0, cached_state); + + /* + * All of the extents must be allocated and support direct I/O. Inline + * extents and compressed extents fall back to buffered I/O, so those + * are no good. + */ + start = 0; + while (start isize) { + len = isize - start; + em = btrfs_get_extent(inode, NULL, 0, start, len, 0); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out; + } + + if (test_bit(EXTENT_FLAG_VACANCY, em-flags) || + em-block_start == EXTENT_MAP_HOLE) { If the no-holes feature is enabled on the filesystem, there won't be any such extent representing the hole. You have to check that each two consecutive extents are adjacent. + pr_err(BTRFS: swapfile has holes); + ret = -EINVAL; + goto out; + } + if (em-block_start == EXTENT_MAP_INLINE) { + pr_err(BTRFS: swapfile is inline); While the test is valid, this would mean that the file is smaller than the inline limit, which is now one page. I think the generic swap code would refuse such a small file anyway. + ret = -EINVAL; + goto out; + } + if (test_bit(EXTENT_FLAG_COMPRESSED, em-flags)) { + pr_err(BTRFS: swapfile is compresed); + ret = -EINVAL; + goto out; + } I think the preallocated extents should be refused as well. This means the filesystem has enough space to hold the data but it would still have to go through the allocation and could in turn stress the memory management code that triggered the swapping activity in the first place. Though it's probably still possible to reach such corner case even with fully allocated nodatacow file, this should be reviewed anyway. + + start = extent_map_end(em); + free_extent_map(em); + } + +out: + unlock_extent_cached(ip-io_tree, 0, isize - 1, cached_state, + GFP_NOFS); + return ret; +} -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell zblax...@furryterror.org wrote: I run 'smartctl -t long' from cron overnight (or whenever the drives are most idle). You can also set up smartd.conf to launch the self tests; however, the syntax for test scheduling is byzantine compared to cron (and that's saying something!). On multi-drive systems I schedule a different drive for each night. If you are also doing btrfs scrub, then stagger the scheduling so e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. smartd is OK for monitoring test logs and email alerts. I've had no problems there. Most attributes are always updated without issuing a smart test of any kind. A drive I have here only has four offline updateable attributes. When it comes to bad sectors, the drive won't use a sector that persistently fails writes. So you don't really have to worry about latent bad sectors that don't have data on them already. The sectors you care about are the ones with data. A scrub reads all of those sectors. First the drive could report a read error in which case Btrfs raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or rebuild it from parity, and write to the affected sector; and also this same mechanism happens in normal reads so it's a kind of passive scrub. But it happens to miss checking inactively read data, which a scrub will check. Second, the drive could report no problem, and Btrfs raid1/10 could still fix the problem in case of a csum mismatch. And it looks like soonish we'll see this apply to raid5/6. So I think a nightly long smart test is a bit overkill. I think you could do nightly -t short tests which will report problems scrub won't notice, such as higher seek times or lower throughput performance. And then scrub once a week. The drive itself could be failing in some way that prevents recording SMART errors (e.g. because of host timeouts triggering a bus reset, which also prevents the SMART counter update for what was going wrong at the time). This is unfortunately quite common, especially with drives configured for non-RAID workloads. Libata resetting the link should be recorded in kernel messages. -- Chris Murphy -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS messes up snapshot LV with origin
On Thu, Nov 20, 2014 at 11:22 PM, Duncan 1i5t5.dun...@cox.net wrote: When I have such a filesystem level problem, I simply dd from the backing device to some other location, generally to a file that's on a different filesystem (preferrably non-btrfs, I use reiserfs as I've found it very resilient, here), in which case btrfs device scan won't see the UUID on the copy as it scans block devices, not inside non-device files. That's hours of dd and you have to find space to do it. After all, an LVM block-level snapshot takes the same space as a file containing the same raw data, and if there's room for the data in an LVM snapshot, given a different layout, there's room for exactly the same amount of data as a file on a different filesystem, piped thru some compressor if necessary due to tight datasize constraints. That's not true for thin volume snapshots. They take up next to no space upon creation, they don't need space reserved in advance. They're more like a qcow2 snapshot than a conventional LVM snapshot; a big difference being if you delete the snapshot, or you delete a bunch of files in a thin volume and follow it with fstrim, the unused extents are returned to the thin pool. There has been a fragmentation problem with thin volumes; I don't know if that's solved yet. And I don't know if it exacerbates things with Btrfs fragmentation. -- Chris Murphy -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 5/5] btrfs: enable swap file support
On Fri, Nov 21, 2014 at 6:00 PM, David Sterba dste...@suse.cz wrote: On Fri, Nov 21, 2014 at 02:08:31AM -0800, Omar Sandoval wrote: Implement the swap file a_ops on btrfs. Activation simply checks for a usable swap file: it must be fully allocated (no holes), support direct I/O (so no compressed or inline extents) and should be nocow (I'm not sure about that last one). Signed-off-by: Omar Sandoval osan...@osandov.com --- fs/btrfs/inode.c | 71 1 file changed, 71 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d23362f..b8fd36b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -9442,6 +9442,75 @@ out_inode: } +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file, +sector_t *span) +{ + struct inode *inode = file_inode(file); + struct btrfs_inode *ip = BTRFS_I(inode); 'ip' looks strange in context of a filesystem, please pick a different name (eg. 'inode' or whatever). + int ret = 0; + u64 isize = inode-i_size; + struct extent_state *cached_state = NULL; + struct extent_map *em; + u64 start, len; + + if (ip-flags BTRFS_INODE_COMPRESS) { + /* Can't do direct I/O on a compressed file. */ + pr_err(BTRFS: swapfile is compressed); Please use the btrfs_err macros everywhere. + return -EINVAL; + } + if (!(ip-flags BTRFS_INODE_NODATACOW)) { + /* The swap file can't be copy-on-write. */ + pr_err(BTRFS: swapfile is copy-on-write); + return -EINVAL; + } + + lock_extent_bits(ip-io_tree, 0, isize - 1, 0, cached_state); + + /* + * All of the extents must be allocated and support direct I/O. Inline + * extents and compressed extents fall back to buffered I/O, so those + * are no good. + */ + start = 0; + while (start isize) { + len = isize - start; + em = btrfs_get_extent(inode, NULL, 0, start, len, 0); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out; + } + + if (test_bit(EXTENT_FLAG_VACANCY, em-flags) || + em-block_start == EXTENT_MAP_HOLE) { If the no-holes feature is enabled on the filesystem, there won't be any such extent representing the hole. You have to check that each two consecutive extents are adjacent. If no-holes is enabled it means file extent items aren't used to represent holes. But extent maps are still used to represent holes in memory, and that's what the code is looking for and therefore it's correct. + pr_err(BTRFS: swapfile has holes); + ret = -EINVAL; + goto out; + } + if (em-block_start == EXTENT_MAP_INLINE) { + pr_err(BTRFS: swapfile is inline); While the test is valid, this would mean that the file is smaller than the inline limit, which is now one page. I think the generic swap code would refuse such a small file anyway. + ret = -EINVAL; + goto out; + } + if (test_bit(EXTENT_FLAG_COMPRESSED, em-flags)) { + pr_err(BTRFS: swapfile is compresed); + ret = -EINVAL; + goto out; + } I think the preallocated extents should be refused as well. This means the filesystem has enough space to hold the data but it would still have to go through the allocation and could in turn stress the memory management code that triggered the swapping activity in the first place. Though it's probably still possible to reach such corner case even with fully allocated nodatacow file, this should be reviewed anyway. + + start = extent_map_end(em); + free_extent_map(em); + } + +out: + unlock_extent_cached(ip-io_tree, 0, isize - 1, cached_state, + GFP_NOFS); + return ret; +} -- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: make sure logged extents complete in the current transaction V3
Liu Bo pointed out that my previous fix would lose the generation update in the scenario I described. It is actually much worse than that, we could lose the entire extent if we lose power right after the transaction commits. Consider the following write extent 0-4k log extent in log tree commit transaction power fail happens here ordered extent completes We would lose the 0-4k extent because it hasn't updated the actual fs tree, and the transaction commit will reset the log so it isn't replayed. If we lose power before the transaction commit we are save, otherwise we are not. Fix this by keeping track of all extents we logged in this transaction. Then when we go to commit the transaction make sure we wait for all of those ordered extents to complete before proceeding. This will make sure that if we lose power after the transaction commit we still have our data. This also fixes the problem of the improperly updated extent generation. Thanks, cc: sta...@vger.kernel.org Signed-off-by: Josef Bacik jba...@fb.com --- V2-V3: Cleanup pending ordered properly on transaction abort. fs/btrfs/disk-io.c | 20 fs/btrfs/ordered-data.c | 9 +++-- fs/btrfs/ordered-data.h | 8 +++- fs/btrfs/transaction.c | 33 + fs/btrfs/transaction.h | 2 ++ fs/btrfs/tree-log.c | 6 +++--- 6 files changed, 72 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 2409718..fe4db97 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4128,6 +4128,25 @@ again: return 0; } +static void btrfs_free_pending_ordered(struct btrfs_transaction *cur_trans, + struct btrfs_fs_info *fs_info) +{ + struct btrfs_ordered_extent *ordered; + + spin_lock(fs_info-trans_lock); + while (!list_empty(cur_trans-pending_ordered)) { + ordered = list_first_entry(cur_trans-pending_ordered, + struct btrfs_ordered_extent, + trans_list); + list_del_init(ordered-trans_list); + spin_unlock(fs_info-trans_lock); + + btrfs_put_ordered_extent(ordered); + spin_lock(fs_info-trans_lock); + } + spin_unlock(fs_info-trans_lock); +} + void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans, struct btrfs_root *root) { @@ -4139,6 +4158,7 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans, cur_trans-state = TRANS_STATE_UNBLOCKED; wake_up(root-fs_info-transaction_wait); + btrfs_free_pending_ordered(cur_trans, root-fs_info); btrfs_destroy_delayed_inodes(root); btrfs_assert_delayed_root_empty(root); diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index ac734ec..269e21d 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -220,6 +220,7 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset, INIT_LIST_HEAD(entry-work_list); init_completion(entry-completion); INIT_LIST_HEAD(entry-log_list); + INIT_LIST_HEAD(entry-trans_list); trace_btrfs_ordered_extent_add(inode, entry); @@ -443,6 +444,8 @@ void btrfs_get_logged_extents(struct inode *inode, ordered = rb_entry(n, struct btrfs_ordered_extent, rb_node); if (!list_empty(ordered-log_list)) continue; + if (test_bit(BTRFS_ORDERED_LOGGED, ordered-flags)) + continue; list_add_tail(ordered-log_list, logged_list); atomic_inc(ordered-refs); } @@ -472,7 +475,8 @@ void btrfs_submit_logged_extents(struct list_head *logged_list, spin_unlock_irq(log-log_extents_lock[index]); } -void btrfs_wait_logged_extents(struct btrfs_root *log, u64 transid) +void btrfs_wait_logged_extents(struct btrfs_trans_handle *trans, + struct btrfs_root *log, u64 transid) { struct btrfs_ordered_extent *ordered; int index = transid % 2; @@ -497,7 +501,8 @@ void btrfs_wait_logged_extents(struct btrfs_root *log, u64 transid) wait_event(ordered-wait, test_bit(BTRFS_ORDERED_IO_DONE, ordered-flags)); - btrfs_put_ordered_extent(ordered); + if (!test_and_set_bit(BTRFS_ORDERED_LOGGED, ordered-flags)) + list_add_tail(ordered-trans_list, trans-ordered); spin_lock_irq(log-log_extents_lock[index]); } spin_unlock_irq(log-log_extents_lock[index]); diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index d81a274..0124bff 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -71,6 +71,8 @@ struct btrfs_ordered_sum { ordered
[PATCH 1/4] fs: split update_time() into update_time() and write_time()
In preparation for adding support for the lazytime mount option, we need to be able to separate out the update_time() and write_time() inode operations. Currently, only btrfs and xfs uses update_time(). We needed to preserve update_time() because btrfs wants to have a special btrfs_root_readonly() check; otherwise we could drop the update_time() inode operation entirely. Signed-off-by: Theodore Ts'o ty...@mit.edu Cc: x...@oss.sgi.com Cc: linux-btrfs@vger.kernel.org --- fs/btrfs/inode.c | 10 ++ fs/inode.c | 29 ++--- fs/xfs/xfs_iops.c | 39 --- include/linux/fs.h | 1 + 4 files changed, 45 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d23362f..a5e0d0d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now, inode-i_mtime = *now; if (flags S_ATIME) inode-i_atime = *now; + return 0; +} + +static int btrfs_write_time(struct inode *inode) +{ return btrfs_dirty_inode(inode); } @@ -9462,6 +9467,7 @@ static const struct inode_operations btrfs_dir_inode_operations = { .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, .tmpfile= btrfs_tmpfile, }; static const struct inode_operations btrfs_dir_ro_inode_operations = { @@ -9470,6 +9476,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations = { .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, }; static const struct file_operations btrfs_dir_file_operations = { @@ -9540,6 +9547,7 @@ static const struct inode_operations btrfs_file_inode_operations = { .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, }; static const struct inode_operations btrfs_special_inode_operations = { .getattr= btrfs_getattr, @@ -9552,6 +9560,7 @@ static const struct inode_operations btrfs_special_inode_operations = { .get_acl= btrfs_get_acl, .set_acl= btrfs_set_acl, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, }; static const struct inode_operations btrfs_symlink_inode_operations = { .readlink = generic_readlink, @@ -9565,6 +9574,7 @@ static const struct inode_operations btrfs_symlink_inode_operations = { .listxattr = btrfs_listxattr, .removexattr= btrfs_removexattr, .update_time= btrfs_update_time, + .write_time = btrfs_write_time, }; const struct dentry_operations btrfs_dentry_operations = { diff --git a/fs/inode.c b/fs/inode.c index 26753ba..8f5c4b5 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode, */ static int update_time(struct inode *inode, struct timespec *time, int flags) { - if (inode-i_op-update_time) - return inode-i_op-update_time(inode, time, flags); - - if (flags S_ATIME) - inode-i_atime = *time; - if (flags S_VERSION) - inode_inc_iversion(inode); - if (flags S_CTIME) - inode-i_ctime = *time; - if (flags S_MTIME) - inode-i_mtime = *time; + int ret; + + if (inode-i_op-update_time) { + ret = inode-i_op-update_time(inode, time, flags); + if (ret) + return ret; + } else { + if (flags S_ATIME) + inode-i_atime = *time; + if (flags S_VERSION) + inode_inc_iversion(inode); + if (flags S_CTIME) + inode-i_ctime = *time; + if (flags S_MTIME) + inode-i_mtime = *time; + } + if (inode-i_op-write_time) + return inode-i_op-write_time(inode); mark_inode_dirty_sync(inode); return 0; } diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index ec6dcdc..0e9653c 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -984,10 +984,8 @@ xfs_vn_setattr( } STATIC int -xfs_vn_update_time( - struct inode*inode, - struct timespec *now, - int flags) +xfs_vn_write_time( + struct inode*inode) { struct xfs_inode*ip = XFS_I(inode); struct xfs_mount*mp = ip-i_mount; @@ -1004,21 +1002,16 @@ xfs_vn_update_time( } xfs_ilock(ip, XFS_ILOCK_EXCL); - if (flags S_CTIME) { - inode-i_ctime = *now; -
[PATCH 2/4] vfs: add support for a lazytime mount option
Add a new mount option which enables a new lazytime mode. This mode causes atime, mtime, and ctime updates to only be made to the in-memory version of the inode. The on-disk times will only get updated when (a) if the inode needs to be updated for some non-time related change, (b) if userspace calls fsync(), syncfs() or sync(), or (c) just before an undeleted inode is evicted from memory. This is OK according to POSIX because there are no guarantees after a crash unless userspace explicitly requests via a fsync(2) call. For workloads which feature a large number of random write to a preallocated file, the lazytime mount option significantly reduces writes to the inode table. The repeated 4k writes to a single block will result in undesirable stress on flash devices and SMR disk drives. Even on conventional HDD's, the repeated writes to the inode table block will trigger Adjacent Track Interference (ATI) remediation latencies, which very negatively impact 99.9 percentile latencies --- which is a very big deal for web serving tiers (for example). Google-Bug-Id: 18297052 Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/fs-writeback.c | 38 +- fs/inode.c | 18 ++ fs/proc_namespace.c | 1 + fs/sync.c | 7 +++ include/linux/fs.h | 1 + include/uapi/linux/fs.h | 1 + 6 files changed, 65 insertions(+), 1 deletion(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index ef9bef1..ce7de22 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) inode-i_state = ~I_DIRTY_PAGES; dirty = inode-i_state I_DIRTY; - inode-i_state = ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC); + inode-i_state = ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME); spin_unlock(inode-i_lock); /* Don't write the inode if only I_DIRTY_PAGES was set */ if (dirty (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) { @@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb) iput(old_inode); } +/* + * This works like wait_sb_inodes(), but it is called *before* we kick + * the bdi so the inodes can get written out. + */ +static void flush_sb_dirty_time(struct super_block *sb) +{ + struct inode *inode, *old_inode = NULL; + + WARN_ON(!rwsem_is_locked(sb-s_umount)); + spin_lock(inode_sb_list_lock); + list_for_each_entry(inode, sb-s_inodes, i_sb_list) { + int dirty_time; + + spin_lock(inode-i_lock); + if (inode-i_state (I_FREEING|I_WILL_FREE|I_NEW)) { + spin_unlock(inode-i_lock); + continue; + } + dirty_time = inode-i_state I_DIRTY_TIME; + __iget(inode); + spin_unlock(inode-i_lock); + spin_unlock(inode_sb_list_lock); + + iput(old_inode); + old_inode = inode; + + if (dirty_time) + mark_inode_dirty(inode); + cond_resched(); + spin_lock(inode_sb_list_lock); + } + spin_unlock(inode_sb_list_lock); + iput(old_inode); +} + /** * writeback_inodes_sb_nr -writeback dirty inodes from given super_block * @sb: the superblock @@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb) return; WARN_ON(!rwsem_is_locked(sb-s_umount)); + flush_sb_dirty_time(sb); bdi_queue_work(sb-s_bdi, work); wait_for_completion(done); diff --git a/fs/inode.c b/fs/inode.c index 8f5c4b5..6e91aca 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -534,6 +534,18 @@ static void evict(struct inode *inode) BUG_ON(!(inode-i_state I_FREEING)); BUG_ON(!list_empty(inode-i_lru)); + if (inode-i_nlink inode-i_state I_DIRTY_TIME) { + if (inode-i_op-write_time) + inode-i_op-write_time(inode); + else if (inode-i_sb-s_op-write_inode) { + struct writeback_control wbc = { + .sync_mode = WB_SYNC_NONE, + }; + mark_inode_dirty(inode); + inode-i_sb-s_op-write_inode(inode, wbc); + } + } + if (!list_empty(inode-i_wb_list)) inode_wb_list_del(inode); @@ -1515,6 +1527,12 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) if (flags S_MTIME) inode-i_mtime = *time; } + if (inode-i_sb-s_flags MS_LAZYTIME) { + spin_lock(inode-i_lock); + inode-i_state |= I_DIRTY_TIME; + spin_unlock(inode-i_lock); + return 0; + } if (inode-i_op-write_time) return
[PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale
Guarantee that the on-disk timestamps will be no more than 24 hours stale. Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/fs-writeback.c | 1 + fs/inode.c | 7 ++- include/linux/fs.h | 1 + 3 files changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index ce7de22..eb04277 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) if (flags (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) { trace_writeback_dirty_inode_start(inode, flags); + inode-i_ts_dirty_day = 0; if (sb-s_op-dirty_inode) sb-s_op-dirty_inode(inode, flags); diff --git a/fs/inode.c b/fs/inode.c index 6e91aca..f0d6232 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode, */ static int update_time(struct inode *inode, struct timespec *time, int flags) { + int days_since_boot = jiffies / (HZ * 86400); int ret; if (inode-i_op-update_time) { @@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) if (flags S_MTIME) inode-i_mtime = *time; } - if (inode-i_sb-s_flags MS_LAZYTIME) { + if ((inode-i_sb-s_flags MS_LAZYTIME) + (!inode-i_ts_dirty_day || +inode-i_ts_dirty_day == days_since_boot)) { spin_lock(inode-i_lock); inode-i_state |= I_DIRTY_TIME; spin_unlock(inode-i_lock); + inode-i_ts_dirty_day = days_since_boot; return 0; } + inode-i_ts_dirty_day = 0; if (inode-i_op-write_time) return inode-i_op-write_time(inode); mark_inode_dirty_sync(inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 489b2f2..e3574cd 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -575,6 +575,7 @@ struct inode { struct timespec i_ctime; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ unsigned short i_bytes; + unsigned short i_ts_dirty_day; unsigned inti_blkbits; blkcnt_ti_blocks; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()
On Fri, Nov 21, 2014 at 2:59 PM, Theodore Ts'o ty...@mit.edu wrote: In preparation for adding support for the lazytime mount option, we need to be able to separate out the update_time() and write_time() inode operations. Currently, only btrfs and xfs uses update_time(). We needed to preserve update_time() because btrfs wants to have a special btrfs_root_readonly() check; otherwise we could drop the update_time() inode operation entirely. No objections here, I'll give the queue a whirl. You can add my acked-by-or-Ill-fix-whatever-breaks I look forward to the patch that makes us all lazy by default. -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale
On Nov 21, 2014, at 1:59 PM, Theodore Ts'o ty...@mit.edu wrote: Guarantee that the on-disk timestamps will be no more than 24 hours stale. Signed-off-by: Theodore Ts'o ty...@mit.edu --- fs/fs-writeback.c | 1 + fs/inode.c | 7 ++- include/linux/fs.h | 1 + 3 files changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index ce7de22..eb04277 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) if (flags (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) { trace_writeback_dirty_inode_start(inode, flags); + inode-i_ts_dirty_day = 0; if (sb-s_op-dirty_inode) sb-s_op-dirty_inode(inode, flags); diff --git a/fs/inode.c b/fs/inode.c index 6e91aca..f0d6232 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode, */ static int update_time(struct inode *inode, struct timespec *time, int flags) { + int days_since_boot = jiffies / (HZ * 86400); int ret; if (inode-i_op-update_time) { @@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct timespec *time, int flags) if (flags S_MTIME) inode-i_mtime = *time; } - if (inode-i_sb-s_flags MS_LAZYTIME) { + if ((inode-i_sb-s_flags MS_LAZYTIME) + (!inode-i_ts_dirty_day || + inode-i_ts_dirty_day == days_since_boot)) { spin_lock(inode-i_lock); inode-i_state |= I_DIRTY_TIME; spin_unlock(inode-i_lock); + inode-i_ts_dirty_day = days_since_boot; It isn't clear if this is correct. It looks like the condition will only be entered if i_ts_dirty_day == days_since_boot, but that is only set inside the condition? Shouldn't this be: inode-i_ts_dirty_day = ~0U; if ((inode-i_sb-s_flags MS_LAZYTIME) inode-i_ts_dirty_day != days_since_boot)) { spin_lock(inode-i_lock); inode-i_state |= I_DIRTY_TIME; spin_unlock(inode-i_lock); inode-i_ts_dirty_day = days_since_boot; } and days_since_boot should be declared unsigned short so it wraps in the same way as i_ts_dirty_day, maybe using typeof(i_ts_dirty_day) so that there isn't any need to update this in the future if the type changes. Conceivably this could be an unsigned char, if that packed into struct inode better, at the risk of losing a timestamp update on an inode in cache for 256+ days and only modifying it 256 days later. Cheers, Andreas return 0; } + inode-i_ts_dirty_day = 0; if (inode-i_op-write_time) return inode-i_op-write_time(inode); mark_inode_dirty_sync(inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 489b2f2..e3574cd 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -575,6 +575,7 @@ struct inode { struct timespec i_ctime; spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ unsigned short i_bytes; + unsigned short i_ts_dirty_day; unsigned inti_blkbits; blkcnt_ti_blocks; -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas signature.asc Description: Message signed with OpenPGP using GPGMail
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/21/2014 07:11 AM, Phillip Susi wrote: On 11/20/2014 5:45 PM, Robert White wrote: If you search for ACPI ide you'll find people complaining in 2008-2010 about windows error messages indicating the device is present in their system but no OS driver is available. Nope... not finding it. The closest thing was one or two people who said ACPI when they meant AHCI ( and were quickly corrected ). This is probably what you were thinking of since windows xp did not ship with an ahci driver so it was quite common for winxp users to have this problem when in _AHCI_ mode. I have to give you that one... I should have never trusted any reference to windows. Most of those references to windows support were getting AHCI and ACPI mixed up. Lolz windows users... They didn't get into ACPI disk support till 2010. I should have known they were behind the times. I had to scroll down almost a whole page to find the linux support. So lets just look at the top of the ide/ide-acpi.c from linux 2.6 to consult about when ACPI got into the IDE business... linux/drivers/ide/ide-acpi.c /* * Provides ACPI support for IDE drives. * * Copyright (C) 2005 Intel Corp. * Copyright (C) 2005 Randy Dunlap * Copyright (C) 2006 SUSE Linux Products GmbH * Copyright (C) 2006 Hannes Reinecke */ Here's a bug from 2005 of someone having a problem with the ACPI IDE support... https://www.google.com/url?sa=trct=jq=esrc=ssource=webcd=6cad=rjauact=8ved=0CDkQFjAFurl=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D5604ei=g6VvVL73K-HLsASIrYKIDgusg=AFQjCNGTuuXPJk91svGJtRAf35DUqVqrLgsig2=eHxwbLYXn4ED5jG-guoZqg People debating the merits of the ACPI IDE drivers in 2005. https://www.google.com/url?sa=trct=jq=esrc=ssource=webcd=12cad=rjauact=8ved=0CGUQFjALurl=http%3A%2F%2Fwww.linuxquestions.org%2Fquestions%2Fslackware-14%2Fbare-ide-and-bare-acpi-kernels-297525%2Fei=g6VvVL73K-HLsASIrYKIDgusg=AFQjCNFoyKgH2sOteWwRN_Tdrfw9hOmVGQsig2=BmMVcZl24KRz4s4gEvLN_w So you got me... windows was behind the curve by five years instead of just three... my bad... But yea, nobody has ever used that ACPI disk drive support that's been in the kernel for nine years. Even when you get me for referencing windows, you're still wrong... How many times will you try get out of being hideously horribly wrong about ACPI supporting disk/storage IO? It is neither recent nor rare. How much egg does your face really need before you just see that your fantasy that it's new and uncommon is a delusional mistake? Methinks Misters Dunning and Kruger need a word with you... -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale
On Fri, Nov 21, 2014 at 02:19:07PM -0600, Andreas Dilger wrote: - if (inode-i_sb-s_flags MS_LAZYTIME) { + if ((inode-i_sb-s_flags MS_LAZYTIME) + (!inode-i_ts_dirty_day || +inode-i_ts_dirty_day == days_since_boot)) { spin_lock(inode-i_lock); inode-i_state |= I_DIRTY_TIME; spin_unlock(inode-i_lock); + inode-i_ts_dirty_day = days_since_boot; It isn't clear if this is correct. It looks like the condition will only be entered if i_ts_dirty_day == days_since_boot, but that is only set inside the condition? If i_ts_dirty_day is zero, the timestamps don't have to written to disk. This is either because the inode has been written to disk, or the system has been up for less than a day, such that when we last a lazy mtime update (i.e., we skipped the call to mark_inode_dirty) since jiffies / (HZ * 86400) was zero. If it is non-zero, then the timestamps were updated but were not sent to disk N days since the system was booted. So long as it remains N days since the system was booted, we can skip calling mark_inode_dirty(). But once it becomes N+1 days since the system was booted, then we will call mark_inode_dirty() and set i_ts_dirty_day to zero. I'll add a comment so it's a bit more obvious what we're doing here, but I'm pretty sure we currently have is in fact correct. and days_since_boot should be declared unsigned short so it wraps in the same way as i_ts_dirty_day Good point, thanks. This will only be an issue after the system has been up for almost 90 years, but we might as well get it right, - Ted -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
On 11/21/2014 01:12 PM, Robert White wrote: (wrong links included in post...) Dangit... those two links were bad... wrong clipboard... /sigh... I'll just stand on the pasted text from the driver. 8-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()
Out of curiosity, why does btrfs_update_time() need to call btrfs_root_readonly()? Why can't it just depend on the __mnt_want_write() call in touch_atime()? Surely if there are times when it's not OK to write into a btrfs file system and mnt_is_readonly() returns false, the VFS is going to get very confused abyway. If the btrfs_update_time() is not necessary, then we could drop btrfs_update_time() and update_time() from the inode operations entirely, and depend on the VFS-level code in update_time(). - Ted -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH] dm: add dm-power-fail target
Hello, I'm hoping some FS guys can weigh in and verify my approach for testing power fail conditions, and the DM guys to of course verify I didn't completely fail at making a DM target. All suggestions welcome, I want to have a nice robust tool for testing power fail so we can be sure our fsync/commit code is all working properly. Thanks, Josef For power fail testing we want to be able to catch cases where the fs isn't waiting for IO to complete properly. This target aims to do this by creating a linear mapping on a device and keeping track of all WRITEs and FLUSH/FUA operations. Any WRITEs are added to a global list on completion, then when we submit a FLUSH this list is spliced onto the request and once that request completes then it marks those blocks as valid. FUA bypasses this logic altogether and is considered valid once it completes. There are two modes of operation here 1) Zero - any WRITE that wasn't flushed will return 0's when we try to read from it. This is meant for BTRFS (or any COW fs) where we only write blocks once, so if we try to read a block we didn't properly flush then we can just send back 0's. 2) Split - the device is split in half and written to alternating sides. This is for overwriting fs'es (ext*/xfs). Once a FLUSH occurs we walk all the completed WRITEs and set their mirror to whichever mirror they last wrote to. Anything that wasn't flushed properly will point to its previous mirror. We then have 3 different power fail events we can do 1) DROP_WRITES - just drop all writes immediately, any new writes are immediately completed and outstanding ones are treated as if they never completed. 2) DROP_AFTER_FUA - this allows a FUA to go through, and as soon as it completes we drop all writes. This is meant to be really evil about making sure your commit code is doing the right thing by dropping at the worst possible moment. 3) DROP_AFTER_FLUSH - this allows a FLUSH to go through, and as soon as it completes we drop all writes. This can be good for testing fdatasync on overwrite fs'es that may only issue a FLUSH and not have to update metadata. There is also an option to return -EIO as soon as we have our power fail event to make it easier to script stress tests. The idea is to be as evil as possible wrt how a device's cache would react to a powerfail event. I had originally bolted this onto dm-flakey but it got complicated trying to work in these new options with its original behavior so I created a new target instead. Signed-off-by: Josef Bacik jba...@fb.com --- drivers/md/Kconfig | 14 + drivers/md/Makefile| 1 + drivers/md/dm-power-fail.c | 691 + 3 files changed, 706 insertions(+) create mode 100644 drivers/md/dm-power-fail.c diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index 5bdedf6..bc3d6ca 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -432,4 +432,18 @@ config DM_SWITCH If unsure, say N. +config DM_POWER_FAIL + tristate Power fail target support + depends on BLK_DEV_DM + ---help--- + This device-mapper target creates a device that can be used for + testing a file systems ability to survive power fails. There are + several modes of operation in order to test a variety of power fail + scenarios. + + To compile this code as a module, choos M here: the module will be + called dm-power-fail. + + If unsure, say N. + endif # MD diff --git a/drivers/md/Makefile b/drivers/md/Makefile index a2da532..c667218 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -55,6 +55,7 @@ obj-$(CONFIG_DM_CACHE)+= dm-cache.o obj-$(CONFIG_DM_CACHE_MQ) += dm-cache-mq.o obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o obj-$(CONFIG_DM_ERA) += dm-era.o +obj-$(CONFIG_DM_POWER_FAIL)+= dm-power-fail.o ifeq ($(CONFIG_DM_UEVENT),y) dm-mod-objs+= dm-uevent.o diff --git a/drivers/md/dm-power-fail.c b/drivers/md/dm-power-fail.c new file mode 100644 index 000..44c5045 --- /dev/null +++ b/drivers/md/dm-power-fail.c @@ -0,0 +1,691 @@ +/* + * Copyright (C) 2014 Facebook. All rights reserved. + * + * This file is released under the GPL. + */ + +#include linux/device-mapper.h + +#include linux/module.h +#include linux/init.h +#include linux/blkdev.h +#include linux/bio.h +#include linux/slab.h + +#define DM_MSG_PREFIX power-fail + +/* + * The way this interface is meant to be used is like this + * + * dmsetup create powerfail + * mkfs /dev/powerfail + * mount /dev/powerfail /mnt/test + * do some stuff + * sleep 30 + * dmsetup message powerfail (drop_after_flush|drop_after_fua|drop_writes) + * umount /mnt/test + * dmsetup message powerfail redirect_reads + * fsck /dev/powerfail || exit 1 + * mount /dev/powerfail /mnt/test + * verify contents + * + * You can set redirect_reads whenever you want, but the idea is that you want + * the teardown
Re: BTRFS messes up snapshot LV with origin
Chris Murphy posted on Fri, 21 Nov 2014 11:23:45 -0700 as excerpted: On Thu, Nov 20, 2014 at 11:22 PM, Duncan 1i5t5.dun...@cox.net wrote: When I have such a filesystem level problem, I simply dd from the backing device to some other location, generally to a file that's on a different filesystem (preferrably non-btrfs, I use reiserfs as I've found it very resilient, here), in which case btrfs device scan won't see the UUID on the copy as it scans block devices, not inside non-device files. That's hours of dd and you have to find space to do it. I did it recently here. There's a method to my sub-100-GiB partition madness! =:^) The partitions in question were on SSD, and were small enough I could simply DD them to files on my media filesystem, which was after all designed to be able to take full ISO images, etc. Additionally, due to size and reasonably consistent linear intra-file access patterns, the media filesystem's still on much cheaper spinning rust, while most of the system's on much faster to random-access but far more expensive SSD, so in this case one side was SSD, the other spinning rust. Tho granted, if you're doing single-partition/filesystem multi-TiB filesystems, it does get to be a problem. As there would have been if the filesystem in question was the media filesystem, altho that one's not yet btrfs for a reason. But still, if there's room enough for an LVM snapshot in the first place, with a different layout, there'd be room for the same data as a file. That's pretty basic. After all, an LVM block-level snapshot takes the same space as a file containing the same raw data, and if there's room for the data in an LVM snapshot, given a different layout, there's room for exactly the same amount of data as a file on a different filesystem, piped thru some compressor if necessary due to tight datasize constraints. That's not true for thin volume snapshots. They take up next to no space upon creation, they don't need space reserved in advance. Thus the mention of compression if necessary. Thin-volume snapshots are effectively compression by another name, and a raw dd from them should compress pretty much equally well, depending on compression method chosen, of course. =:^) -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS messes up snapshot LV with origin
Zygo Blaxell posted on Fri, 21 Nov 2014 12:56:23 -0500 as excerpted: It's not a bug as long as I can completely control which devices are searched for UUIDs, and the system behaves sanely when multiple UUIDs are found through automatic discovery; otherwise, it's not only a bug, it's a DoS attack security vulnerability. Consider what happens if someone looks at /sys/fs/btrfs, reads the non-secret UUIDs, builds a fake filesystem with those UUIDs, puts the fake filesystem on a USB stick, and plugs it back into the victim machine... With the current state of USB vulnerability (firmware reprogrammed as an input device, etc, the vuln has been all over the tech news for some months now), anyone with USB access to the machine is simply another case of anyone with physical access to the machine, they're normally assumed to be able to be able to at minimum take down the machine, the ultimate DoS, in any case, and often to have effective root, tho that can be mitigated to some extent with encryption, etc. It's generally assumed that if you have physical access, as required to plug in that USB, game over, the machine is effectively p40wn3d. At the /very/ least, with physical access it's vulnerable to the sledgehammer DoS, and there's little to be done about that but prevent physical access by all means necessary (armed guards, nuclear silo hosting, etc) in the first place. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS messes up snapshot LV with origin
Duncan posted on Fri, 21 Nov 2014 22:49:06 + as excerpted: Chris Murphy posted... That's not true for thin volume snapshots. They take up next to no space upon creation, they don't need space reserved in advance. Thus the mention of compression if necessary. Thin-volume snapshots are effectively compression by another name, and a raw dd from them should compress pretty much equally well, depending on compression method chosen, of course. =:^) Oops, I mis-parsed thin. Good point and thanks, Chris. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: disk-io: replace root args iff only fs_info used
On Sat, Nov 22, 2014 at 01:37:10AM +0900, Daniel Dressler wrote: What would a cover letter be like? Would that be a separate email to the list, or maybe the first email in a patch series? It's a separate mail that does not carry any diff but an overview of the following patches. The patches are threaded under that mail. This is what the git command does for you: git format-patch -o output --thread --cover-letter from..to and in the directory 'output' you'll find the cover letter plus patches. The cover contains some stub and should be edited. Then send them via 'git send-email'. Sorry I've twice looked for the integration repo. I found some that look like it could be but those had older commits. Could you direct me to the exact branch I'd love to work against it. These patches were done against linux-next. The integration is in Chris' git, but the branch may not be the most recent compared to Linus' tree or the pending for-linus branches. This depens on the phase of the development cycle or the stability of the patches in the integration branch as it's supposed to be base of the next pull. What you did is fine under current conditions. If the integration is made public you can check if your patches are merged or not and then refresh the patch series eventually. I think small one function patches might be best. I have the codebase mapped out and each file's functions-to-be-cleaned count varies wildly. If I did batch files together and split large files apart there would be no rhyme or reason for the groupings. With single function patches it is very clear what changes are justified since they should only occur in the affected function or in a call-site. With multiple functions the call-site changes get mixed up would it would be harder to review. Up to you. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS messes up snapshot LV with origin
Duncan posted on Fri, 21 Nov 2014 23:41:49 + as excerpted: Duncan posted on Fri, 21 Nov 2014 22:49:06 + as excerpted: Chris Murphy posted... That's not true for thin volume snapshots. They take up next to no space upon creation, they don't need space reserved in advance. Thus the mention of compression if necessary. Thin-volume snapshots are effectively compression by another name, and a raw dd from them should compress pretty much equally well, depending on compression method chosen, of course. =:^) Oops, I mis-parsed thin. Good point and thanks, Chris. ... And Zygo, who pointed out my error as well. =:^) -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: ctree: reduce args where only fs_info used
On Sat, Nov 22, 2014 at 01:03:32AM +0900, Daniel Dressler wrote: No problem, I'll redo everything so it is one function per patch. Now fair warning: there are about 102 functions to cleanup. I was a bit worried that many patches would cause too much maintainer overhead but it is no problem for me. Yeah, I'm aware that it's all over the sources. I'd say send no more than 30 patches in a burst first and see how it'd work. Only a few functions have dependecies on other functions needing cleanup. Thus there will be some small patch series for those function sets. A big benefit of one function one patch is that extent-io.c will no longer be a 34 function monster patch. Is there any rate limiting I should be doing? I don't want to flood the list with burst of dozen plus patches, or is that an okay volume? Should be fine. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, Nov 21, 2014 at 11:06:19AM -0700, Chris Murphy wrote: On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell zblax...@furryterror.org wrote: I run 'smartctl -t long' from cron overnight (or whenever the drives are most idle). You can also set up smartd.conf to launch the self tests; however, the syntax for test scheduling is byzantine compared to cron (and that's saying something!). On multi-drive systems I schedule a different drive for each night. If you are also doing btrfs scrub, then stagger the scheduling so e.g. smart runs in even weeks and btrfs scrub runs in odd weeks. smartd is OK for monitoring test logs and email alerts. I've had no problems there. Most attributes are always updated without issuing a smart test of any kind. A drive I have here only has four offline updateable attributes. One of those four is Offline_Uncorrectable, which is a really important attribute to monitor! When it comes to bad sectors, the drive won't use a sector that persistently fails writes. So you don't really have to worry about latent bad sectors that don't have data on them already. The sectors you care about are the ones with data. A scrub reads all of those sectors. A scrub reads all the _allocated_ sectors. A long selftest reads _everything_, and also exercises the electronics and mechanics of the drive in ways that normal operation doesn't. I have several disks that are less than 25% occupied, which means scrubs will ignore 75% of the disk surface at any given time. A sharp increase in the number of bad sectors (no matter how they are detected) usually indicates a total drive failure is coming. Many drives have been nice enough to give me enough warning for their RMA replacements to be delivered just a few hours before the drive totally fails. First the drive could report a read error in which case Btrfs raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or rebuild it from parity, and write to the affected sector; and also this same mechanism happens in normal reads so it's a kind of passive scrub. But it happens to miss checking inactively read data, which a scrub will check. Second, the drive could report no problem, and Btrfs raid1/10 could still fix the problem in case of a csum mismatch. And it looks like soonish we'll see this apply to raid5/6. So I think a nightly long smart test is a bit overkill. I think you could do nightly -t short tests which will report problems scrub won't notice, such as higher seek times or lower throughput performance. And then scrub once a week. Drives quite often drop a sector or two over the years, and it can be harmless. What you want to be watching out for is hundreds of bad sectors showing up over a period of few days--that means something is rattling around on the disk platters, damaging the hardware as it goes. To get that data, you have to test the disks every few days. The drive itself could be failing in some way that prevents recording SMART errors (e.g. because of host timeouts triggering a bus reset, which also prevents the SMART counter update for what was going wrong at the time). This is unfortunately quite common, especially with drives configured for non-RAID workloads. Libata resetting the link should be recorded in kernel messages. This is true, but the original question was about SMART data coverage. This is why it's important to monitor both. -- Chris Murphy signature.asc Description: Digital signature
Re: scrub implies failing drive - smartctl blissfully unaware
On Fri, 21 Nov 2014 10:45:21 -0700 Chris Murphy wrote: On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong bt...@iarmst.co.uk wrote: In my situation what I've found is that if I scrub let it fix the errors then a second pass immediately after will show no errors. If I then leave it a few days try again there will be errors, even in old files which have not been accessed for months. What are the devices? And if they're SSDs are they powered off for these few days? I take it the scrub error type is corruption? It's spinning rust and the checksum error is always on the one drive (a SAMSUNG HD204UI). The firmware has been updated, since some were shipped with a bad version which could result in data corruption. You can use badblocks to write a known pattern to the drive. Then power off and leave it for a few days. Then read the drive, matching against the pattern, and see if there are any discrepancies. Doing this outside the code path of Btrfs would fairly conclusively indicate whether it's hardware or software induced. Unfortunately I'm reluctant to go the badblock route for the entire drive since it's the second drive in a 2 drive raid1 and I don't currently have a spare. There is a small 6G partition that I can use, but given that the drive is large and the errors are few, it could take a while for anything to show. I also have a second 2 drive btrfs raid1 in the same machine that doesn't have this problem. All the drives are running off the same controller. Assuming you have another copy of all of these files :-) you could just sha256sum the two copies to see if they have in fact changed. If they have, well then you've got some silent data corruption somewhere somehow. But if they always match, then that suggests a bug. Some of the files already have an md5 linked to them, while others have parity files to give some level of recovery from corruption or damage. Checking against these show no problems, so I assume that btrfs is doing its job only serving an intact file. I don't see how you can get bogus corruption messages, and for it to not be a bug. When you do these scrubs that come up clean, and then later come up with corruptions, have you done any software updates? No software updates between clean corrupt. I don't have to power down or reboot either for checksum errors to appear. I don't think the corruption messages are bogus, but are indicating a genuine problem. What I would like to be able to do is compare the corrupt block with the one used to repair it and see what the difference is. As I've already stated, the system logs are clean the smart logs aren't showing any issues. (Well, until today when a self-test failed with a read error, but it must be an unused sector since the scrub doesn't hit it there are no re-allocated sectors yet) My next step is to disable autodefrag see if the problem persists. (I'm not suggesting a problem with autodefrag, I just want to remove it from the equation ensure that outside of normal file access, data isn't being rewritten between scrubs) I wouldn't expect autodefrag to touch old files not accessed for months. Doesn't it only affect actively used files? The drive is mainly used to hold old archive files, though there are daily rotating files on it as well. The corruption affects both new and old files. -- Ian -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html