date:20141121

[PATCH] Btrfs: disk-io: replace root args iff only fs_info used

2014-11-21 Thread Daniel Dressler

This is the 3rd independent patch of a larger
project to cleanup btrfs's internal usage of
btrfs_root. Many functions take btrfs_root
only to grab the fs_info struct.

By requiring a root these functions cause
programmer overhead. That these functions can
accept any valid root is not obvious until
inspection.

This patch reduces the specificity of such
functions to accept the fs_info directly.

These patches can be applied independently
and thus are not being submitted as a patch
series. There should be about 26 patches by
the project's completion. Each patch will
cleanup between 1 and 34 functions apiece.
Each patch covers a single file's functions.

This patch affects the following function(s):
  1) csum_tree_block
  2) csum_dirty_buffer
  3) check_tree_block_fsid
  4) btrfs_find_tree_block
  5) clean_tree_block

Signed-off-by: Daniel Dressler 
---
 fs/btrfs/ctree.c   | 26 +-
 fs/btrfs/disk-io.c | 32 
 fs/btrfs/disk-io.h |  4 ++--
 fs/btrfs/extent-tree.c |  6 +++---
 4 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 19bc616..e76a6ba 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1075,7 +1075,7 @@ static noinline int update_ref_for_cow(struct 
btrfs_trans_handle *trans,
ret = btrfs_dec_ref(trans, root, buf, 1);
BUG_ON(ret); /* -ENOMEM */
}
-   clean_tree_block(trans, root, buf);
+   clean_tree_block(trans, root->fs_info, buf);
*last_ref = 1;
}
return 0;
@@ -1681,7 +1681,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
continue;
}
 
-   cur = btrfs_find_tree_block(root, blocknr);
+   cur = btrfs_find_tree_block(root->fs_info, blocknr);
if (cur)
uptodate = btrfs_buffer_uptodate(cur, gen, 0);
else
@@ -1946,7 +1946,7 @@ static noinline int balance_level(struct 
btrfs_trans_handle *trans,
 
path->locks[level] = 0;
path->nodes[level] = NULL;
-   clean_tree_block(trans, root, mid);
+   clean_tree_block(trans, root->fs_info, mid);
btrfs_tree_unlock(mid);
/* once for the path */
free_extent_buffer(mid);
@@ -2000,7 +2000,7 @@ static noinline int balance_level(struct 
btrfs_trans_handle *trans,
if (wret < 0 && wret != -ENOSPC)
ret = wret;
if (btrfs_header_nritems(right) == 0) {
-   clean_tree_block(trans, root, right);
+   clean_tree_block(trans, root->fs_info, right);
btrfs_tree_unlock(right);
del_ptr(root, path, level + 1, pslot + 1);
root_sub_used(root, right->len);
@@ -2044,7 +2044,7 @@ static noinline int balance_level(struct 
btrfs_trans_handle *trans,
BUG_ON(wret == 1);
}
if (btrfs_header_nritems(mid) == 0) {
-   clean_tree_block(trans, root, mid);
+   clean_tree_block(trans, root->fs_info, mid);
btrfs_tree_unlock(mid);
del_ptr(root, path, level + 1, pslot);
root_sub_used(root, mid->len);
@@ -2262,7 +2262,7 @@ static void reada_for_search(struct btrfs_root *root,
 
search = btrfs_node_blockptr(node, slot);
blocksize = root->nodesize;
-   eb = btrfs_find_tree_block(root, search);
+   eb = btrfs_find_tree_block(root->fs_info, search);
if (eb) {
free_extent_buffer(eb);
return;
@@ -2324,7 +2324,7 @@ static noinline void reada_for_balance(struct btrfs_root 
*root,
if (slot > 0) {
block1 = btrfs_node_blockptr(parent, slot - 1);
gen = btrfs_node_ptr_generation(parent, slot - 1);
-   eb = btrfs_find_tree_block(root, block1);
+   eb = btrfs_find_tree_block(root->fs_info, block1);
/*
 * if we get -eagain from btrfs_buffer_uptodate, we
 * don't want to return eagain here.  That will loop
@@ -2337,7 +2337,7 @@ static noinline void reada_for_balance(struct btrfs_root 
*root,
if (slot + 1 < nritems) {
block2 = btrfs_node_blockptr(parent, slot + 1);
gen = btrfs_node_ptr_generation(parent, slot + 1);
-   eb = btrfs_find_tree_block(root, block2);
+   eb = btrfs_find_tree_block(root->fs_info, block2);
if (eb && btrfs_buffer_uptodate(eb, gen, 1) != 0)
block2 = 0;
free_extent_buffer(eb);
@@ -2455,7 +2455,7 @@ read_block_for_search(struct btrfs_trans_handle *trans,
blocknr = btrfs_node_blockptr(b, slot);
gen = btrfs_node_ptr_generation(b, slot);
 
-   tmp = btrfs_find_tr

Re: Changing label few times killed filesystem?

2014-11-21 Thread Boris Chernov


On 2014-11-21 04:35, Roman Mamedov wrote:

On Fri, 21 Nov 2014 01:27:17 +
Boris Chernov  wrote:

  I have changed file system label few times in total. When I tried
to mount it after that, it became not mountable:

# mount /dev/sdb1 /mnt
mount: Not a directory

I'd say that implies something is wrong with your /mnt, rather than /dev/sdb1.
Before mounting try things like "ls -la /mnt/", "umount /mnt", etc.
Or simply mounting somewhere else other than /mnt/
Before I attempted mounting to /mnt I tried to mount with KDE 
Device Notifier to /media/username/label, then I have tried to create 
directory manually in /media/ and tried to mount in the command-line, 
then tried /mnt, and error was the same. So I'm sure there is nothing 
wrong with my mount points.
Now I have rebooted and tried to mount in KDE Device Notifier to 
/media/username/label, it failed again, so I tried from command-line as 
root:


# mkdir /media/sdb1 && ls -la /media/sdb1 && mount /dev/sdb1 /media/sdb1
total 8
drwxr-sr-x 2 root disk 4096 Nov 21 08:12 .
drwsrwsrwT 7 root disk 4096 Nov 21 08:12 ..

...and that's it, no output from mount command (it just hanged and 
become unkillable process). Please let me know if there is anything else 
I could try to either restore it or debug it (to at least understand why 
exactly it screwed up itself so it will not happen again to me or anyone 
else). If it matters, the disk is with single partition (BTRFS-only), 
was plugged-in all the time and I use Xeon-based workstation with ECC 
memory. In the dmesg I see the following, it seems after encountering 
btrfs bugs in its recovery tools (mentioned in my previous mail) I have 
also encountered btrfs bug in the kernel:


[  339.349260] BTRFS info (device sdb1): disk space caching is enabled
[  339.397438] parent transid verify failed on 29458432 wanted 5 found 2759
[  339.397505] [ cut here ]
[  339.397510] kernel BUG at fs/btrfs/locking.c:269!
[  339.397513] invalid opcode:  [#1] SMP
[  339.397517] Modules linked in: ppp_deflate bsd_comp ppp_async 
crc_ccitt ppp_generic slhc snd_aloop snd_hrtimer xt_conntrack 
iptable_filter ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 
snd_ice1724 snd_ak4113 snd_pt2258 snd_ak4114 snd_i2c snd_ice17xx_ak4xxx 
snd_ak4xxx_adda snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm 
snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device 
snd_timer snd soundcore ac97_bus vmnet(O) parport_pc parport 
vmw_vsock_vmci_transport vsock vmw_vmci vmmon(O) cpufreq_conservative 
cpufreq_powersave cpufreq_userspace cpufreq_stats zram nvidia(PO) 
cfg80211 rfkill binfmt_misc uinput zfs(PO) zunicode(PO) zavl(PO) 
zcommon(PO) znvpair(PO) spl(O) nfsd auth_rpcgss oid_registry nfs_acl nfs 
lockd fscache sunrpc iTCO_wdt iTCO_vendor_support usblp kvm_intel kvm 
ses enclosure cdc_ether psmouse option i2c_i801 pcspkr usbnet mii 
usb_wwan usbserial serio_raw i7core_edac edac_core uvcvideo 
videobuf2_vmalloc videobuf2_memops videobuf2_core videodev media evdev 
joydev jc42 w83627ehf lm90 coretemp adt7475 hwmon_vid adm1021 ttm 
drm_kms_helper drm i2c_algo_bit i2c_core msr loop fuse tpm_infineon 
tpm_tis lpc_ich mfd_core tpm button acpi_cpufreq processor thermal_sys 
autofs4 ext4 crc16 mbcache jbd2 btrfs xor raid6_pq usb_storage sg sd_mod 
sr_mod cdrom crc_t10dif crct10dif_common hid_generic usbhid hid ahci 
libahci libata crc32c_intel scsi_mod e1000e ptp pps_core xhci_hcd 
ehci_pci ehci_hcd usbcore usb_common [last unloaded: vmnet]
[  339.397584] CPU: 0 PID: 25752 Comm: mount Tainted: P   O 
3.15.0-pf2 #1
[  339.397585] Hardware name: Supermicro X8SIE/X8SIE, BIOS 1.2
08/19/11
[  339.397586] task: 880036c93f80 ti: 8805702b4000 task.ti: 
8805702b4000
[  339.397587] RIP: 0010:[] [] 
btrfs_assert_tree_read_locked.part.0+0x0/0x10 [btrfs]

[  339.397604] RSP: 0018:8805702b7bf0  EFLAGS: 00010246
[  339.397605] RAX:  RBX: 8804db6da800 RCX: 
0581
[  339.397606] RDX:  RSI: 8804db58d0e0 RDI: 
8804db6da800
[  339.397607] RBP: 0001 R08: 0001b830 R09: 
88063fc1b830
[  339.397608] R10: 88061afec700 R11: ea00136d6300 R12: 
0005
[  339.397609] R13: 88008c978820 R14: 88061af51000 R15: 
8804db6da800
[  339.397610] FS:  7f55bf45b840() GS:88063fc0() 
knlGS:

[  339.397612] CS:  0010 DS:  ES:  CR0: 8005003b
[  339.397613] CR2: 7f6b280af000 CR3: 0004da047000 CR4: 
07f0

[  339.397614] Stack:
[  339.397614]  a024557d 8804db6da800 a0208838 

[  339.397616]     
88008c978820
[  339.397617]  a02093a0 1c18 0005 
8804db6da800

[  339.397619] Call Trace:
[  339.397629]  [] ? 
btrfs_tree_read_unlock_blocking+0x8d/0xc0 [btrfs]
[

Re: [RFC PATCH 0/6] btrfs: implement swap file support

2014-11-21 Thread Christoph Hellwig

On Tue, Nov 18, 2014 at 11:22:35PM -0800, Omar Sandoval wrote:
> Here's a nice little bit of insanity I put together in that direction --
> consider it a discussion point more than a patch. It does two things:
> 
> - Uses an ITER_BVEC iov_iter to do direct_IO for swap_readpage. This makes
>   swap_readpage a synchronous operation, but I think that's the best we can do
>   with the existing interface.

Note that ->read_iter for direct-io supports async I/O in general.  By
resurrecting some of the older attempts to do in-kernel aio this could
be made async easily.

> - Unless I'm missing something, there don't appear to be any instances of
>   ITER_BVEC | READ in the kernel, so the dio path doesn't know not to dirty
>   pages it gets that way. Dave Kleikamp and Ming Lei each previously submitted
>   patches doing this as part of adding an aio_kernel interface. (The NFS 
> direct
>   I/O implementation doesn't know how to deal with these either, so this patch
>   actually breaks the only existing user of this code path, but in the 
> interest
>   of keeping the patch short, I didn't try to fix it :)

Right, we'd need to look into.  Bonus points of allowing this as a zero
copy read.


Btw, in the long run I would much prefer killing of the current horrible
swap using bmap path in favor of an enhanced direct I/O path.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 2/5] nfs: don't dirty ITER_BVEC pages read through direct I/O

2014-11-21 Thread Omar Sandoval

As with the generic blockdev code, kernel pages shouldn't be dirtied by the
direct I/O path.

Signed-off-by: Omar Sandoval 
---
 fs/nfs/direct.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 10bf072..a67fa2c 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -88,6 +88,7 @@ struct nfs_direct_req {
struct pnfs_ds_commit_info ds_cinfo;/* Storage for cinfo */
struct work_struct  work;
int flags;
+   int should_dirty;   /* should we mark read pages 
dirty? */
 #define NFS_ODIRECT_DO_COMMIT  (1) /* an unstable reply was 
received */
 #define NFS_ODIRECT_RESCHED_WRITES (2) /* write verification failed */
struct nfs_writeverfverf;   /* unstable write verifier */
@@ -370,7 +371,8 @@ static void nfs_direct_read_completion(struct 
nfs_pgio_header *hdr)
struct nfs_page *req = nfs_list_entry(hdr->pages.next);
struct page *page = req->wb_page;
 
-   if (!PageCompound(page) && bytes < hdr->good_bytes)
+   if (!PageCompound(page) && bytes < hdr->good_bytes &&
+   dreq->should_dirty)
set_page_dirty(page);
bytes += req->wb_bytes;
nfs_list_remove_request(req);
@@ -542,6 +544,7 @@ ssize_t nfs_file_direct_read(struct kiocb *iocb, struct 
iov_iter *iter,
dreq->inode = inode;
dreq->bytes_left = count;
dreq->ctx = get_nfs_open_context(nfs_file_open_context(iocb->ki_filp));
+   dreq->should_dirty = !(iter->type & ITER_BVEC);
l_ctx = nfs_get_lock_context(dreq->ctx);
if (IS_ERR(l_ctx)) {
result = PTR_ERR(l_ctx);
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 5/5] btrfs: enable swap file support

2014-11-21 Thread Omar Sandoval

Implement the swap file a_ops on btrfs. Activation simply checks for a usable
swap file: it must be fully allocated (no holes), support direct I/O (so no
compressed or inline extents) and should be nocow (I'm not sure about that last
one).

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/inode.c | 71 
 1 file changed, 71 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..b8fd36b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9442,6 +9442,75 @@ out_inode:
 
 }
 
+static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
+  sector_t *span)
+{
+   struct inode *inode = file_inode(file);
+   struct btrfs_inode *ip = BTRFS_I(inode);
+   int ret = 0;
+   u64 isize = inode->i_size;
+   struct extent_state *cached_state = NULL;
+   struct extent_map *em;
+   u64 start, len;
+
+   if (ip->flags & BTRFS_INODE_COMPRESS) {
+   /* Can't do direct I/O on a compressed file. */
+   pr_err("BTRFS: swapfile is compressed");
+   return -EINVAL;
+   }
+   if (!(ip->flags & BTRFS_INODE_NODATACOW)) {
+   /* The swap file can't be copy-on-write. */
+   pr_err("BTRFS: swapfile is copy-on-write");
+   return -EINVAL;
+   }
+
+   lock_extent_bits(&ip->io_tree, 0, isize - 1, 0, &cached_state);
+
+   /*
+* All of the extents must be allocated and support direct I/O. Inline
+* extents and compressed extents fall back to buffered I/O, so those
+* are no good.
+*/
+   start = 0;
+   while (start < isize) {
+   len = isize - start;
+   em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
+   if (IS_ERR(em)) {
+   ret = PTR_ERR(em);
+   goto out;
+   }
+
+   if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
+   em->block_start == EXTENT_MAP_HOLE) {
+   pr_err("BTRFS: swapfile has holes");
+   ret = -EINVAL;
+   goto out;
+   }
+   if (em->block_start == EXTENT_MAP_INLINE) {
+   pr_err("BTRFS: swapfile is inline");
+   ret = -EINVAL;
+   goto out;
+   }
+   if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+   pr_err("BTRFS: swapfile is compresed");
+   ret = -EINVAL;
+   goto out;
+   }
+
+   start = extent_map_end(em);
+   free_extent_map(em);
+   }
+
+out:
+   unlock_extent_cached(&ip->io_tree, 0, isize - 1, &cached_state,
+GFP_NOFS);
+   return ret;
+}
+
+static void btrfs_swap_deactivate(struct file *file)
+{
+}
+
 static const struct inode_operations btrfs_dir_inode_operations = {
.getattr= btrfs_getattr,
.lookup = btrfs_lookup,
@@ -9519,6 +9588,8 @@ static const struct address_space_operations btrfs_aops = 
{
.releasepage= btrfs_releasepage,
.set_page_dirty = btrfs_set_page_dirty,
.error_remove_page = generic_error_remove_page,
+   .swap_activate  = btrfs_swap_activate,
+   .swap_deactivate = btrfs_swap_deactivate,
 };
 
 static const struct address_space_operations btrfs_symlink_aops = {
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 1/5] direct-io: don't dirty ITER_BVEC pages on read

2014-11-21 Thread Omar Sandoval

Reads through the iov_iter infrastructure for kernel pages shouldn't be dirtied
by the direct I/O code.

This is based on Dave Kleikamp's and Ming Lei's previously posted patches.

Cc: Dave Kleikamp 
Cc: Ming Lei 
Signed-off-by: Omar Sandoval 
---
 fs/direct-io.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index e181b6b..e542ce4 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -120,6 +120,7 @@ struct dio {
spinlock_t bio_lock;/* protects BIO fields below */
int page_errors;/* errno from get_user_pages() */
int is_async;   /* is IO async ? */
+   int should_dirty;   /* should we mark read pages dirty? */
bool defer_completion;  /* defer AIO completion to workqueue? */
int io_error;   /* IO error in completion path */
unsigned long refcount; /* direct_io_worker() and bios */
@@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct 
dio_submit *sdio)
dio->refcount++;
spin_unlock_irqrestore(&dio->bio_lock, flags);
 
-   if (dio->is_async && dio->rw == READ)
+   if (dio->is_async && dio->rw == READ && dio->should_dirty)
bio_set_pages_dirty(bio);
 
if (sdio->submit_io)
@@ -463,13 +464,13 @@ static int dio_bio_complete(struct dio *dio, struct bio 
*bio)
if (!uptodate)
dio->io_error = -EIO;
 
-   if (dio->is_async && dio->rw == READ) {
+   if (dio->is_async && dio->rw == READ && dio->should_dirty) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;
 
-   if (dio->rw == READ && !PageCompound(page))
+   if (dio->rw == READ && !PageCompound(page) && 
dio->should_dirty)
set_page_dirty_lock(page);
page_cache_release(page);
}
@@ -1177,6 +1178,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct 
inode *inode,
 
dio->inode = inode;
dio->rw = rw;
+   dio->should_dirty = !(iter->type & ITER_BVEC);
 
/*
 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 0/6] btrfs: implement swap file support

2014-11-21 Thread Omar Sandoval

On Fri, Nov 21, 2014 at 02:06:57AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 18, 2014 at 11:22:35PM -0800, Omar Sandoval wrote:
> > Here's a nice little bit of insanity I put together in that direction --
> > consider it a discussion point more than a patch. It does two things:
> > 
> > - Uses an ITER_BVEC iov_iter to do direct_IO for swap_readpage. This makes
> >   swap_readpage a synchronous operation, but I think that's the best we can 
> > do
> >   with the existing interface.
> 
> Note that ->read_iter for direct-io supports async I/O in general.  By
> resurrecting some of the older attempts to do in-kernel aio this could
> be made async easily.
> 
> > - Unless I'm missing something, there don't appear to be any instances of
> >   ITER_BVEC | READ in the kernel, so the dio path doesn't know not to dirty
> >   pages it gets that way. Dave Kleikamp and Ming Lei each previously 
> > submitted
> >   patches doing this as part of adding an aio_kernel interface. (The NFS 
> > direct
> >   I/O implementation doesn't know how to deal with these either, so this 
> > patch
> >   actually breaks the only existing user of this code path, but in the 
> > interest
> >   of keeping the patch short, I didn't try to fix it :)
> 
> Right, we'd need to look into.  Bonus points of allowing this as a zero
> copy read.
> 
> 
> Btw, in the long run I would much prefer killing of the current horrible
> swap using bmap path in favor of an enhanced direct I/O path.
Looks like I just raced on your email and sent a v2 of my patch series before
seeing this response :) I'll take a closer look at this tomorrow, thanks for
getting back to me.

-- 
Omar
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 3.18.0-rc4 - WARNING: CPU: 3 PID: 3889 at fs/btrfs/extent-tree.c:3785 btrfs_free_reserved_data_space+0xc8/0xe0 [btrfs]()

2014-11-21 Thread Torbjørn


On 12. nov. 2014 09:29, Torbjørn wrote:

Hi,

After upgrade to 3.18.0-rc4 get a lot of these warnings.
The system seems to otherwise behave normally. I have not observed any 
actual issues.



--snip--

Just to close off this email thread.
After upgrading to 3.18.0-rc5 these messages have disappeared.
For rc-4 it was about 80 000 messages a day.

--
Torbjørn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 4/5] btrfs: don't allow -C or +c chattrs on a swap file

2014-11-21 Thread Omar Sandoval

swap_activate will check for a compressed or copy-on-write file; we shouldn't
allow it to become either once it has already been activated.

Signed-off-by: Omar Sandoval 
---
 fs/btrfs/ioctl.c | 50 +++---
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4399f0c..f022dce 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -293,14 +293,21 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
}
} else {
/*
-* Revert back under same assuptions as above
+* swap_activate checks that we don't swapon a copy-on-write
+* file, but we must also make sure that it doesn't become
+* copy-on-write.
 */
-   if (S_ISREG(mode)) {
-   if (inode->i_size == 0)
-   ip->flags &= ~(BTRFS_INODE_NODATACOW
-| BTRFS_INODE_NODATASUM);
-   } else {
-   ip->flags &= ~BTRFS_INODE_NODATACOW;
+   if (!IS_SWAPFILE(inode)) {
+   /*
+* Revert back under same assumptions as above
+*/
+   if (S_ISREG(mode)) {
+   if (inode->i_size == 0)
+   ip->flags &= ~(BTRFS_INODE_NODATACOW |
+  BTRFS_INODE_NODATASUM);
+   } else {
+   ip->flags &= ~BTRFS_INODE_NODATACOW;
+   }
}
}
 
@@ -317,20 +324,25 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
if (ret && ret != -ENODATA)
goto out_drop;
} else if (flags & FS_COMPR_FL) {
-   const char *comp;
-
-   ip->flags |= BTRFS_INODE_COMPRESS;
-   ip->flags &= ~BTRFS_INODE_NOCOMPRESS;
+   /*
+* Like nodatacow, swap_activate checks that we don't swapon a
+* compressed file, so we shouldn't let it become compressed.
+*/
+   if (!IS_SWAPFILE(inode)) {
+   const char *comp;
 
-   if (root->fs_info->compress_type == BTRFS_COMPRESS_LZO)
-   comp = "lzo";
-   else
-   comp = "zlib";
-   ret = btrfs_set_prop(inode, "btrfs.compression",
-comp, strlen(comp), 0);
-   if (ret)
-   goto out_drop;
+   ip->flags |= BTRFS_INODE_COMPRESS;
+   ip->flags &= ~BTRFS_INODE_NOCOMPRESS;
 
+   if (root->fs_info->compress_type == BTRFS_COMPRESS_LZO)
+   comp = "lzo";
+   else
+   comp = "zlib";
+   ret = btrfs_set_prop(inode, "btrfs.compression",
+comp, strlen(comp), 0);
+   if (ret)
+   goto out_drop;
+   }
} else {
ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
if (ret && ret != -ENODATA)
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 3/5] swap: use direct I/O for SWP_FILE swap_readpage

2014-11-21 Thread Omar Sandoval

On Mon, Nov 17, 2014 at 07:48:17AM -0800, Christoph Hellwig wrote:
> With the new iov_iter infrastructure that supprots direct I/O to kernel
> pages please get rid of the ->readpage hack first.  I'm still utterly
> disapoined that this crap ever got merged.

Cc: Mel Gorman 
Signed-off-by: Omar Sandoval 
---
 mm/page_io.c | 32 
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 955db8b..10715e0 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -283,8 +283,7 @@ int __swap_writepage(struct page *page, struct 
writeback_control *wbc,
 
set_page_writeback(page);
unlock_page(page);
-   ret = mapping->a_ops->direct_IO(ITER_BVEC | WRITE,
-   &kiocb, &from,
+   ret = mapping->a_ops->direct_IO(WRITE, &kiocb, &from,
kiocb.ki_pos);
if (ret == PAGE_SIZE) {
count_vm_event(PSWPOUT);
@@ -348,12 +347,37 @@ int swap_readpage(struct page *page)
}
 
if (sis->flags & SWP_FILE) {
+   struct kiocb kiocb;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
+   struct bio_vec bv = {
+   .bv_page = page,
+   .bv_len = PAGE_SIZE,
+   .bv_offset = 0,
+   };
+   struct iov_iter to = {
+   .type = ITER_BVEC | READ,
+   .count = PAGE_SIZE,
+   .iov_offset = 0,
+   .nr_segs = 1,
+   };
+   to.bvec = &bv;  /* older gcc versions are broken */
+
+   init_sync_kiocb(&kiocb, swap_file);
+   kiocb.ki_pos = page_file_offset(page);
+   kiocb.ki_nbytes = PAGE_SIZE;
 
-   ret = mapping->a_ops->readpage(swap_file, page);
-   if (!ret)
+   ret = mapping->a_ops->direct_IO(READ, &kiocb, &to,
+   kiocb.ki_pos);
+   if (ret == PAGE_SIZE) {
+   SetPageUptodate(page);
count_vm_event(PSWPIN);
+   ret = 0;
+   } else {
+   ClearPageUptodate(page);
+   SetPageError(page);
+   }
+   unlock_page(page);
return ret;
}
 
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2 0/5] btrfs: implement swap file support

2014-11-21 Thread Omar Sandoval

This patch series, based on 3.18-rc5, implements support for swap files on
BTRFS.

The standard swap file implementation uses the filesystem's implementation of
bmap() to get a list of physical blocks on disk, which the swap file code then
does I/O on directly without going through the filesystem. This doesn't work
for BTRFS, which is copy-on-write and therefore moves disk blocks around (COW
isn't the only thing that can shuffle around disk blocks: consider
defragmentation, balancing, etc.).

Swap-over-NFS introduced an interface through which a filesystem can arbitrate
swap I/O through address space operations:

- swap_activate() is called by swapon() and informs the address space that the
  given file is going to be used for swap, so it should take adequate measures
  like reserving space on disk and pinning block lookup information in memory
- swap_deactivate() is used to clean up on swapoff()
- readpage() is used to page in (read a page from disk)
- direct_IO() is used to page out (write a page out to disk)

Version 2 modifies this interface in the first part of the patch series to use
direct_IO for both reads and writes, which makes things much cleaner.

The second part of the patch series implements support for the interface on
BTRFS, which just means implementing swap_{,de}activate and adding some chattr
checks, which raises the following considerations:

- We can't do direct I/O on compressed or inline extents, so we can't use files
  with either for swap.
- Supporting COW swapfiles might also come with some weird edge cases?

This functionality is tenuously tested in a virtual machine with some
artificial workloads. Comment away.

Omar Sandoval (5):
  direct-io: don't dirty ITER_BVEC pages on read
  nfs: don't dirty ITER_BVEC pages read through direct I/O
  swap: use direct I/O for SWP_FILE swap_readpage
  btrfs: don't allow -C or +c chattrs on a swap file
  btrfs: enable swap file support

v2: use direct_IO for swap_readpage

 fs/btrfs/inode.c | 71 
 fs/btrfs/ioctl.c | 50 ---
 fs/direct-io.c   |  8 ---
 fs/nfs/direct.c  |  5 +++-
 mm/page_io.c | 32 +
 5 files changed, 139 insertions(+), 27 deletions(-)

-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 0/5] btrfs: implement swap file support

2014-11-21 Thread Omar Sandoval

Sorry for the noise, looks like Christoph got back to me on the previous RFC
just before I sent this out -- disregard this for now.

-- 
Omar
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 0/5] btrfs: implement swap file support

2014-11-21 Thread Christoph Hellwig

On Fri, Nov 21, 2014 at 02:15:31AM -0800, Omar Sandoval wrote:
> Sorry for the noise, looks like Christoph got back to me on the previous RFC
> just before I sent this out -- disregard this for now.

If the NFS people are fine with this version I'd certainly welcome it as
a first step.  Additional improvements are of course always welcome.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: remove empty fs_devices to prevent memory runout

2014-11-21 Thread Anand Jain


Hi Gui,

 Thanks for attempting this.

 There was this patch previously attempted for the same problem,
 which I had to nack..
   [PATCH 1/2] btrfs: device list could grow infinite

 I haven't seen your fix in full yet, But looks like you are using
 dev_t to suffice the problem of identifying unique device. wonder
 how does this would this react when device goes through the disappear
 and reappear events ?

Thanks, Anand


On 11/20/14 10:15 AM, Gui Hecheng wrote:

There is a global list @fs_uuids to keep @fs_devices object
for each created btrfs. But when a btrfs becomes "empty"
(all devices belong to it are gone), its @fs_devices remains
in @fs_uuids list until module exit.
If we keeps mkfs.btrfs on the same device again and again,
all empty @fs_devices produced are sure to eat up our memory.
So this case has better to be prevented.

I think that each time we setup btrfs on that device, we should
check whether we are stealing some device from another btrfs
seen before. To faciliate the search procedure, we could insert
all @btrfs_device in a rb_root, one @btrfs_device per each physical
device, with @bdev->bd_dev as key. Each time device stealing happens,
we should replace the corresponding @btrfs_device in the rb_root with
an up-to-date version.
If the stolen device is the last device in its @fs_devices,
then we have an empty btrfs to be deleted.

Actually there are 3 ways to steal devices and lead to empty btrfs
 1. mkfs, with -f option
 2. device add, with -f option
 3. device replace, with -f option
We should act under these cases.

Moreover, if there are seed devices, then it is asured that
the devices in cloned @fs_devices are not treated as valid devices.

Signed-off-by: Gui Hecheng 
---
  fs/btrfs/super.c   |   1 +
  fs/btrfs/volumes.c | 181 -
  fs/btrfs/volumes.h |   6 ++
  3 files changed, 172 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 54bd91e..ee09a56 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2154,6 +2154,7 @@ static void __exit exit_btrfs_fs(void)
btrfs_end_io_wq_exit();
unregister_filesystem(&btrfs_fs_type);
btrfs_exit_sysfs();
+   btrfs_cleanup_valid_dev_root();
btrfs_cleanup_fs_uuids();
btrfs_exit_compress();
btrfs_hash_exit();
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0192051..ba86b1b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include "ctree.h"
  #include "extent_map.h"
@@ -52,6 +53,120 @@ static void btrfs_dev_stat_print_on_load(struct 
btrfs_device *device);

  DEFINE_MUTEX(uuid_mutex);
  static LIST_HEAD(fs_uuids);
+static struct rb_root valid_dev_root = RB_ROOT;
+
+static struct btrfs_device *insert_valid_device(struct btrfs_device *new_dev)
+{
+   struct rb_node **p;
+   struct rb_node *parent;
+   struct rb_node *new;
+   struct btrfs_device *old_dev;
+
+   WARN_ON(!mutex_is_locked(&uuid_mutex));
+
+   parent = NULL;
+   new = &new_dev->rb_node;
+
+   p = &valid_dev_root.rb_node;
+   while (*p) {
+   parent = *p;
+   old_dev = rb_entry(parent, struct btrfs_device, rb_node);
+
+   if (new_dev->devnum < old_dev->devnum)
+   p = &parent->rb_left;
+   else if (new_dev->devnum > old_dev->devnum)
+   p = &parent->rb_right;
+   else {
+   rb_replace_node(parent, new, &valid_dev_root);
+   RB_CLEAR_NODE(parent);
+
+   goto out;
+   }
+   }
+
+   old_dev = NULL;
+   rb_link_node(new, parent, p);
+   rb_insert_color(new, &valid_dev_root);
+
+out:
+   return old_dev;
+}
+
+static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
+{
+   struct btrfs_device *device;
+   WARN_ON(fs_devices->opened);
+   while (!list_empty(&fs_devices->devices)) {
+   device = list_entry(fs_devices->devices.next,
+   struct btrfs_device, dev_list);
+   list_del(&device->dev_list);
+   rcu_string_free(device->name);
+   kfree(device);
+   }
+   kfree(fs_devices);
+}
+
+static void remove_empty_fs_if_need(struct btrfs_fs_devices *old_fs)
+{
+   struct btrfs_fs_devices *seed_fs;
+
+   if (!list_empty(&old_fs->devices))
+   return;
+
+   list_del(&old_fs->list);
+
+   /* free the seed clones */
+   seed_fs = old_fs->seed;
+   free_fs_devices(old_fs);
+   while (seed_fs) {
+   old_fs = seed_fs;
+   seed_fs = seed_fs->seed;
+   free_fs_devices(old_fs);
+   }
+
+}
+
+static void replace_invalid_device(struct btrfs_device *new_dev)
+{
+   struct btrfs_device *invalid_dev;
+   struct btrfs_fs_devices *old_fs;
+
+

Re: BTRFS messes up snapshot LV with origin

2014-11-21 Thread Robert White


On 11/20/2014 10:22 PM, Duncan wrote:

But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs
=:^), because they're no longer unique, requiring them to be unique just
as the label says cannot be considered a bug.  It's simply stricter
enforcement of the rules, which are, after all, plainly stated in the
descriptive name.


You take "U"s away, not add them

UID = unique ID
GUID = globally unique ID
UUID = universally unique ID


And other file systems have the same issues. XFS, for example uses UUIDs 
in the same way. It just has a command to re-brand the filesystem's UUID 
which you apply to the LVM snapshot immediately after taking the 
snapshot. (problem long-since established and understood since 2009 or so.)


I don't know if this approach would work for BRFS with subvolumes.

Example Citation :: 
http://www.miljan.org/main/2009/11/16/lvm-snapshots-and-xfs/


XFS also has the nouuids mount option.

btrfs has device= mount option.

But any system with unique ids will have this identical issue when 
block-snapshot support is added underneath.


-- Rob.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Changing label few times killed filesystem?

2014-11-21 Thread Duncan

Chris Murphy posted on Thu, 20 Nov 2014 19:20:22 -0700 as excerpted:

> On Thu, Nov 20, 2014 at 6:27 PM, Boris Chernov 
> wrote:
> 
>> Since it failed after "checking extents" I decided to try
>> --init-extent-tree:
> 
> There might be hope yet if you didn't use --repair which is said on the
> wiki and many times on this list is kindof a last resort. But at the
> very least before going with the hammer approach you should upgrade your
> btrfs-progs which is kind old. Current is 3.17.2. I suggest upgrading
> and just posting the results from 'btrfs check ' without any
> options and see what you get. This check and --repair code are mostly in
> btrfs-progs, whereas the mount time fixing code is in the kernel. So
> upgrading btrfs-progs may be sufficient for your case, but ultimately it
> might be necessary to go to a newer kernel also.
> 
>> Btrfs v3.14.1

I'm with Chris here.

Additionally, I note that you (OP) are using kernel 3.15.x, while the 
entire kernel 3.15 series (which wasn't long-term supported so the last 
kernel update was shortly after 3.16 was released) is effectively 
blacklisted for btrfs, as it had a major btrfs bug in the compression 
handling code.  (However, if you are not now and never did use 
compression on that filesystem, that bug shouldn't affect you, but others 
might.)  The same bug was in 3.16.0 and 3.16.1, but was fixed in 3.16.2 
(or was it 3.16.3) plus.  So later 3.16 series kernels should be 
reasonably good.

Unfortunately, 3.17 added another bug, this time with read-only snapshot 
handling.  I don't do snapshots here and have been running it fine, but 
you'll want 3.17.2 plus if you do read-only snapshots.

I've not yet switched to kernel 3.18 series (late development stage at 
this point) here, but while there was a problem in rc4, rc5 appears to be 
good according to reports I've seen.

Meanwhile, userspace-side, there have been a number of fixes to btrfs 
check and the restore code in the 3.16 and 3.17 series, and while running 
the latest userspace isn't as critical as the kernel for normal 
operations (online operations) since for them the kernel is the 
operational code, for fixup (offline operations like btrfs check and 
btrfs restore), you really do want to be running the latest userspace, 
because in that case it's the userspace code that's actually doing the 
work.

Meanwhile, in the other subthread you mentioned not understanding 
transid.  FWIW transaction ID and generation are used interchangeably in 
btrfs discussions and refer to the the same thing -- a monotonically 
increasing number that gets bumped every time the root tree and 
superblocks are committed.  Normally later generations/transids indicate 
later commits and thus closer a filesystem state closer to current.  Note 
that you can use btrfs-show-super to display information from the 
superblocks including what it thinks the current generation/transid 
should be.

Which brings us to the output.  In most cases when there's problems with 
the transid/generation, wanted will be a bit higher than found, something 
like found 25456, wanted 25459.  That simply means that the three latest 
commits got lost somewhere and you may have to settle for an older one 
(which is where the wiki restore article you mentioned comes in).

But there were a number of reports recently where wanted was *MUCH* 
*LOWER* than found (like wanted 5, found 2753), which is what you're 
seeing.  Unfortunately I don't remember the resolution of those reports, 
or indeed, if the bug has been traced yet.

There is another bug (or possibly the above was after this one hit if it 
didn't stop further commits in some cases, thus resetting the generation 
to zero and increasing it again from there), however, where the transid 
was being zeroed.  Wanted 26473, found 0.  One of the devs mentioned 
tracing that one, tho again I'm not sure current status except that they 
mentioned it so they're obviously working on it.

To my knowledge, these were *NOT* in the context of relabeling, however, 
so it's quite possible you're seeing the one bug, and the relabeling is 
simply coincidence.

Again, however, you're running a 3.15 kernel that's effectively btrfs 
blacklisted, and and an even older 3.14 userspace.  I can't promise 
upgrading will give you an actual fix, but certainly, getting current on 
your kernel and userspace will at least get you on the same page as most 
folks here, so we know we're not dealing with old and in the case of the 
kernel known blacklisted versions, and the bugs in play will at least be 
current ones, not long since fixed ones.  And for the kernel, avoid 3.15 
series entirely, along with early 3.16 (before 3.16.3) and 3.17 (before 
3.17.2), plus early development 3.18 (current rcs should be better).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line

Re: BTRFS messes up snapshot LV with origin

2014-11-21 Thread Duncan

Robert White posted on Fri, 21 Nov 2014 03:35:05 -0800 as excerpted:

> On 11/20/2014 10:22 PM, Duncan wrote:
>> But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs
>> =:^), because they're no longer unique, requiring them to be unique
>> just as the label says cannot be considered a bug.  It's simply
>> stricter enforcement of the rules, which are, after all, plainly stated
>> in the descriptive name.
> 
> You take "U"s away, not add them
> 
> UID = unique ID GUID = globally unique ID UUID = universally unique ID

I was making a joke, as I happened to notice un-UUID =3 U-s just as I was 
writing that.  Universally unique ID = UUID, un-UUID (not universally 
unique ID) = UUUID = U^3ID. =:^)

Of course formally it'd be NUID (not/non- unique) or some such, but un-
UUID served my purpose well enough, including the joke once I noticed it, 
so...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs send and an existing backup

2014-11-21 Thread Austin S Hemmelgarn


On 2014-11-20 09:10, Duncan wrote:

Bardur Arantsson posted on Thu, 20 Nov 2014 14:17:52 +0100 as excerpted:


If you have no other backups, I would really recommend that you *don't*
use btrfs for your backup, or at least have a *third* backup which isn't
on btrfs -- there are *still* problems with btrfs that can potentially
wreck your backup filesystem. (Although it's obviously less likely if
the external HDD will only be connected occasionally.)

Don't get me wrong, btrfs is becoming more and more stable, but I
wouldn't trust it with my *only* backup, especially if also running
btrfs on the backed-up filesystem.


This.

My working versions and first backups are btrfs.  My secondary backups
are reiserfs (my old filesystem of choice, which has been very reliable
for me), just in case both the btrfs versions bite the dust due to a bug
in btrfs itself.

Likewise, except I use compressed, encrypted tarballs stored on both 
Amazon S3 and Dropbox.




smime.p7s
Description: S/MIME Cryptographic Signature

Re: soft lockup - CPU#0 stuck - Kernel 3.17.2

2014-11-21 Thread Patrick Schmid

On 15.11.2014 00:47, Chris Mason wrote:
> 
> Ok, I think this is related to the new fair read/writer lock 
> implementation in the generic kernel code.  btrfs_clear_path_blocking() 
> will end up taking locks outside of the strict locking order the rest 
> of Btrfs uses.  This used to be fine because we hold the blocking lock 
> while we do it, but with the new queued locks we're running into 
> trouble.  We hit a similar bug earlier and I convinced myself the 
> problem was only with btrfs_next_leaf and our trylock.  But it's a 
> bigger problem than I realized.
> 
> So for now I've changed btrfs_clear_path_blocking to honor the rules 
> and fixed up our trylock to make it faster.
> 
> I'm letting a test run on this patch over the weekend, since I don't 
> want any surprises with your backup farm.
> 
> -chris

Hi Chris

Here comes a short feedback I applied your patch (Fix lockups from
btrfs_clear_path_blocking) yesterday and my system survived this night! ;-)

Thank you for the quick fix.

Regards Patrick
-- 
Patrick Schmid   support: +41 44 633 2668
IT Services Group, HPT H 8voice:   +41 44 633 3997
Departement Physik, ETH Zurich
CH-8093 Zurich, Switzerland
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Ian Armstrong

On Fri, 21 Nov 2014 09:05:32 +0200, Brendan Hide wrote:

> On 2014/11/21 06:58, Zygo Blaxell wrote:

> > I also notice you are not running regular SMART self-tests (e.g.
> > by smartctl -t long) and the last (and first, and only!) self-test
> > the drive ran was ~12000 hours ago.  That means most of your SMART
> > data is about 18 months old.  The drive won't know about sectors
> > that went bad in the last year and a half unless the host happens
> > to stumble across them during a read.
> >
> > The drive is over five years old in operating hours alone.  It is
> > probably so fragile now that it will break if you try to move it.

> All interesting points. Do you schedule SMART self-tests on your own 
> systems? I have smartd running. In theory it tracks changes and sends 
> alerts if it figures a drive is going to fail. But, based on what
> you've indicated, that isn't good enough.

Simply monitoring the smart status without a self-test isn't really that
great. I'm not sure on the default config, but smartd can be made to
initiate a smart self-test at regular intervals. Depending on the test
type (short, long, etc) it could include a full surface scan. This can
reveal things like bad sectors before you ever hit them during normal
system usage.

> 
> > WARNING: errors detected during scrubbing, corrected.
> > [snip]
> > scrub device /dev/sdb2 (id 2) done
> > scrub started at Tue Nov 18 03:22:58 2014 and finished
> > after 2682 seconds total bytes scrubbed: 189.49GiB with 5420 errors
> > error details: read=5 csum=5415
> > corrected errors: 5420, uncorrectable errors: 0, unverified
> > errors: 164 That seems a little off.  If there were 5 read errors,
> > I'd expect the drive to have errors in the SMART error log.
> >
> > Checksum errors could just as easily be a btrfs bug or a RAM/CPU
> > problem. There have been a number of fixes to csums in btrfs pulled
> > into the kernel recently, and I've retired two five-year-old
> > computers this summer due to RAM/CPU failures.

> The difference here is that the issue only affects the one drive.
> This leaves the probable cause at:
> - the drive itself
> - the cable/ports
> 
> with a negligibly-possible cause at the motherboard chipset.

This is the same problem that I'm currently trying to resolve. I have
one drive in a raid1 setup which shows no issues in smart status but
often has checksum errors.

In my situation what I've found is that if I scrub & let it fix the
errors then a second pass immediately after will show no errors. If I
then leave it a few days & try again there will be errors, even in
old files which have not been accessed for months.

If I do a read-only scrub to get a list of errors, a second scrub
immediately after will show exactly the same errors.

Apart from the scrub errors the system logs shows no issues with that
particular drive.

My next step is to disable autodefrag & see if the problem persists.
(I'm not suggesting a problem with autodefrag, I just want to remove it
from the equation & ensure that outside of normal file access, data
isn't being rewritten between scrubs)

-- 
Ian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: soft lockup - CPU#0 stuck - Kernel 3.17.2

2014-11-21 Thread Chris Mason




On Fri, Nov 21, 2014 at 8:01 AM, Patrick Schmid  
wrote:

On 15.11.2014 00:47, Chris Mason wrote:


 Ok, I think this is related to the new fair read/writer lock
 implementation in the generic kernel code.  
btrfs_clear_path_blocking()
 will end up taking locks outside of the strict locking order the 
rest
 of Btrfs uses.  This used to be fine because we hold the blocking 
lock

 while we do it, but with the new queued locks we're running into
 trouble.  We hit a similar bug earlier and I convinced myself the
 problem was only with btrfs_next_leaf and our trylock.  But it's a
 bigger problem than I realized.

 So for now I've changed btrfs_clear_path_blocking to honor the rules
 and fixed up our trylock to make it faster.

 I'm letting a test run on this patch over the weekend, since I don't
 want any surprises with your backup farm.

 -chris


Hi Chris

Here comes a short feedback I applied your patch (Fix lockups from
btrfs_clear_path_blocking) yesterday and my system survived this 
night! ;-)


Thank you for the quick fix.


Great to hear, thanks for following up.  I'm sending this to Linus for 
the next 3.18 rc.


-chris



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Phillip Susi

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/20/2014 5:45 PM, Robert White wrote:
> Nice attempt at saving face, but wrong as _always_.
> 
> The CONFIG_PATA_ACPI option has been in the kernel since 2008 and
> lots of people have used it.
> 
> If you search for "ACPI ide" you'll find people complaining in
> 2008-2010 about windows error messages indicating the device is
> present in their system but no OS driver is available.

Nope... not finding it.  The closest thing was one or two people who
said ACPI when they meant AHCI ( and were quickly corrected ).  This
is probably what you were thinking of since windows xp did not ship
with an ahci driver so it was quite common for winxp users to have
this problem when in _AHCI_ mode.

> That you "have yet to see a single system that implements it" is
> about the worst piece of internet research I've ever seen. Do you
> not _get_ that your opinion about what exists and how it works is
> not authoritative?

Show me one and I'll give you a cookie.  I have disassembled a number
of acpi tables and yet to see one that has it.  What's more,
motherboard vendors tend to implement only the absolute minimum they
have to.  Since nobody actually needs this feature, they aren't going
to bother with it.  Do you not get that your hand waving arguments of
"you can google for it" are not authoritative?

> You can also find articles about both windows and linux systems
> actively using ACPI fan control going back to 2009

Maybe you should have actually read those articles.  Linux supports
acpi fan control, unfortunately, almost no motherboards actually
implement it.  Almost everyone who wants fan control working in linux
has to install lm-sensors and load a driver that directly accesses one
of the embedded controllers that motherboards tend to use and run the
fancontrol script to manipulate the pwm channels on that controller.
These days you also have to boot with a kernel argument to allow
loading the driver since ACPI claims those IO ports for its own use
which creates a conflict.

Windows users that want to do this have to install a program... I
believe a popular one is called q-fan, that likewise directly accesses
the embedded controller registers to control the fan, since the acpi
tables don't bother properly implementing the acpi fan spec.

Then there are thinkpads, and one or two other laptops ( asus comes to
mind ) that went and implemented their own proprietary acpi interfaces
for fancontrol instead of following the spec, which required some
reverse engineering and yet more drivers to handle these proprietary
acpi interfaces.  You can google for "thinkfan" if you want to see this.

> These are not hard searches to pull off. These are not obscure 
> references. Go to the google box and start typing "ACPI fan..."
> and check the autocomplete.
> 
> I'll skip ovea all the parts where you don't know how a chipset
> works and blah, blah, blah...
> 
> You really should have just stopped at "I don't know" and "I've
> never" because you keep demonstrating that you _don't_ know, and
> that you really _should_ _never_.
> 
> Tell us more about the lizard aliens controlling your computer, I
> find your versions of realty fascinating...

By all means, keep embarrassing yourself with nonsense and trying to
cover it up by being rude and insulting.

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUb1YxAAoJEI5FoCIzSKrwi54H/Rkd7DloqC9x9QwN4QdmWcAZ
/UQg3hcRbtB3wpmp34Mnb3SS0Ii2mCh/dtKmdRGBNE/x5nU1WiQEHHCicKX3Avvq
8OXLNQrsf+xZL9/HGtUJ3RefpEkmwIG5NgFfKJHtv6Iq204Umq32JUxRla+ZQE5s
MrUparigpUlj26lrnShc6ByDUqYK3wOTsDxEMxrOyAgi/n/7ESHV/dZVaqsE6jGQ
OvPynf1FqJoJSSYC7sNE0XLqfHMu2wnSxcoF6MpuHXlDiwtrSH07tuwgrhCNPagY
j7gQyxucew8oim8lcfs+4rrQ60wwVzlsEJwjA9rAXQF7U2x/WoB+ArYhgmJUMgA=
=cXJr
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Phillip Susi

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/20/2014 6:08 PM, Robert White wrote:
> Well you should have _actually_ trimmed your response down to not 
> pressing send.
> 
> _Many_ motherboards have complete RAID support at levels 0, 1, 10,
> and five 5. A few have RAID6.
> 
> Some of them even use the LSI chip-set.

Yes, there are some expensive server class motherboards out there with
integrated real raid chips.  Your average consumer class motherboards
are not those.  They contain intel, nvidia, sil, promise, and via
chipsets that are fake raid.

> Seriously... are you trolling this list with disinformation or
> just repeating tribal knowledge from fifteen year old copies of PC
> Magazine?

Please drop the penis measuring.

> Yea, some of the IDE motherboards and that only had RAID1 and RAID0
> (and indeed some of the add-on controllers) back in the IDE-only
> days were really lame just-forked-write devices with no integrity
> checks (hence "fake raid") but that's from like the 1990s; it's
> paleolithic age "wisdom" at this point.

Wrong again... fakeraid became popular with the advent of SATA since
it was easy to add a knob to the bios to switch it between AHCI and
RAID mode, and just change the pci device id.  These chipsets are
still quite common today and several of them do support raid5 and
raid10 ( well, really it's raid 0 + raid1, but that's a whole nother
can of worms ).  Recent intel chips also now have a caching mode for
having an SSD cache a larger HDD.  Intel has also done a lot of work
integrating support for their chipset into mdadm in the last year or
three.

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJUb1ngAAoJEI5FoCIzSKrwqMQIAJ3MfA4n74aJ1KUdfHYOz96o
vwPBNQJ953yozmCHfjERbTCQlKT5AzwQHWpHoFWsQ4gYoNGmeE1jy2rsqxMfujff
eQekfISyX3POExnsr3LnfHWI2/Om39+EAxVPxbA5LN6SC1SCWRut7Q3bQqkuxj/S
bYRU65XJ9BZ6eYznutMDFdEELyAr8b9/wnatI/ohzmebOBDgFzBrn8gwilCctz7X
DI39HTkCvciWKVXNyVdUZKI5S+MRCEB2JZAkCy3x8LLsENmMnO0xN32o5Od0zlGn
nFLcLQFrZfz5dY2ZusxP+z0z0x4RW3sikd4RZ99PEHBkFa5CgJIFrBxtQAsLi1c=
=4Yg+
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 1/5] direct-io: don't dirty ITER_BVEC pages on read

2014-11-21 Thread Dave Kleikamp

On 11/21/2014 04:08 AM, Omar Sandoval wrote:
> Reads through the iov_iter infrastructure for kernel pages shouldn't be 
> dirtied
> by the direct I/O code.
> 
> This is based on Dave Kleikamp's and Ming Lei's previously posted patches.

Acked-by: Dave Kleikamp 

> Cc: Ming Lei 
> Signed-off-by: Omar Sandoval 
> ---
>  fs/direct-io.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index e181b6b..e542ce4 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -120,6 +120,7 @@ struct dio {
>   spinlock_t bio_lock;/* protects BIO fields below */
>   int page_errors;/* errno from get_user_pages() */
>   int is_async;   /* is IO async ? */
> + int should_dirty;   /* should we mark read pages dirty? */
>   bool defer_completion;  /* defer AIO completion to workqueue? */
>   int io_error;   /* IO error in completion path */
>   unsigned long refcount; /* direct_io_worker() and bios */
> @@ -392,7 +393,7 @@ static inline void dio_bio_submit(struct dio *dio, struct 
> dio_submit *sdio)
>   dio->refcount++;
>   spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
> - if (dio->is_async && dio->rw == READ)
> + if (dio->is_async && dio->rw == READ && dio->should_dirty)
>   bio_set_pages_dirty(bio);
>  
>   if (sdio->submit_io)
> @@ -463,13 +464,13 @@ static int dio_bio_complete(struct dio *dio, struct bio 
> *bio)
>   if (!uptodate)
>   dio->io_error = -EIO;
>  
> - if (dio->is_async && dio->rw == READ) {
> + if (dio->is_async && dio->rw == READ && dio->should_dirty) {
>   bio_check_pages_dirty(bio); /* transfers ownership */
>   } else {
>   bio_for_each_segment_all(bvec, bio, i) {
>   struct page *page = bvec->bv_page;
>  
> - if (dio->rw == READ && !PageCompound(page))
> + if (dio->rw == READ && !PageCompound(page) && 
> dio->should_dirty)
>   set_page_dirty_lock(page);
>   page_cache_release(page);
>   }
> @@ -1177,6 +1178,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, 
> struct inode *inode,
>  
>   dio->inode = inode;
>   dio->rw = rw;
> + dio->should_dirty = !(iter->type & ITER_BVEC);
>  
>   /*
>* For AIO O_(D)SYNC writes we need to defer completions to a workqueue
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] Btrfs: qgroup: add BUILD_BUG to report pointer cast breakage

2014-11-21 Thread David Sterba

On Thu, Nov 13, 2014 at 12:54:39AM +0900, Daniel Dressler wrote:
> Our ulist data structure stores at max 64bit
> values. qgroup has used this structure to store
> pointers. In the future when we upgrade to 128bit
> this casting of pointers to uint64_t will break.

What are we going to upgrade to 128 bits?

> This patch adds a BUILD_BUG ensuring that this
> code will not be left untouched in the upgrade.
> 
> It also marks this issue on the TODO list so it
> may be addressed before such an upgrade.
> 
> + (BUILD_BUG_ON_ZERO(sizeof(uintptr_t) > sizeof(u64)) + \

So it's pointers, this is similar to the 32bit -> 64bit transition, a
change that will affect everything, one BUILD_BUG on does not make any
difference.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: ctree: reduce args where only fs_info used

2014-11-21 Thread David Sterba

On Wed, Nov 12, 2014 at 01:43:09PM +0900, Daniel Dressler wrote:
> This patch is part of a larger project to cleanup
> btrfs's internal usage of struct btrfs_root. Many
> functions take btrfs_root only to grab a pointer
> to fs_info.

Thanks for picking up the project.

A mere formality, can you please justify the paragraphs to 74 chars?

--
This patch is part of a larger project to cleanup btrfs's internal usage
of struct btrfs_root. Many functions take btrfs_root only to grab a
pointer to fs_info.
--

> This patch does not address the two functions in
> ctree.c (insert_ptr, and split_item) which only
> use root for BUG_ONs in ctree.c
> 
> This patch affects the following functions:
>   1) fixup_low_keys
>   2) btrfs_set_item_key_safe

Please send one patch per function change, unless there are more that
are somehow entangled that it would make it hard to separate.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: delayed-inode: replace root args iff only fs_info used

2014-11-21 Thread David Sterba

On Mon, Nov 17, 2014 at 10:05:02PM +0900, Daniel Dressler wrote:
> This is the second independent patch of a larger
> project to cleanup btrfs's internal usage of
> btrfs_root. Many functions take btrfs_root
> only to grab the fs_info struct.
> 
> By requiring a root these functions cause
> programmer overhead. That these functions can
> accept any valid root is not obvious until
> inspection.
> 
> This patch reduces the specificity of such
> functions to accept the fs_info directly.
> 
> These patches can be applied independently
> and thus are not being submitted as a patch
> series. There should be about 26 patches by
> the project's completion. Each patch will
> cleanup between 1 and 34 functions apiece.
> Each patch covers a single file's functions.
> 
> This patch affects the following function(s):
>   1) btrfs_wq_run_delayed_node
> 
> Signed-off-by: Daniel Dressler 

Apart from the narrow paragraphs,

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: ctree: reduce args where only fs_info used

2014-11-21 Thread Daniel Dressler

Ah thanks David for looking at this.

Sorry for the thin paragraphs my vim was warming too early about long
lines. I will reformat it to break at 74 chars.

No problem, I'll redo everything so it is one function per patch. Now
fair warning: there are about 102 functions to cleanup. I was a bit
worried that many patches would cause too much maintainer overhead but
it is no problem for me. Only a few functions have dependecies on
other functions needing cleanup. Thus there will be some small patch
series for those function sets. A big benefit of one function one
patch is that extent-io.c will no longer be a 34 function monster
patch.

Thank you David, I'll redo all these patches.

Is there any rate limiting I should be doing? I don't want to flood
the list with burst of dozen plus patches, or is that an okay volume?

Daniel

2014-11-22 0:55 GMT+09:00 David Sterba :
> On Wed, Nov 12, 2014 at 01:43:09PM +0900, Daniel Dressler wrote:
>> This patch is part of a larger project to cleanup
>> btrfs's internal usage of struct btrfs_root. Many
>> functions take btrfs_root only to grab a pointer
>> to fs_info.
>
> Thanks for picking up the project.
>
> A mere formality, can you please justify the paragraphs to 74 chars?
>
> --
> This patch is part of a larger project to cleanup btrfs's internal usage
> of struct btrfs_root. Many functions take btrfs_root only to grab a
> pointer to fs_info.
> --
>
>> This patch does not address the two functions in
>> ctree.c (insert_ptr, and split_item) which only
>> use root for BUG_ONs in ctree.c
>>
>> This patch affects the following functions:
>>   1) fixup_low_keys
>>   2) btrfs_set_item_key_safe
>
> Please send one patch per function change, unless there are more that
> are somehow entangled that it would make it hard to separate.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: disk-io: replace root args iff only fs_info used

2014-11-21 Thread David Sterba

On Fri, Nov 21, 2014 at 05:15:07PM +0900, Daniel Dressler wrote:
> This is the 3rd independent patch of a larger
> project to cleanup btrfs's internal usage of
> btrfs_root. Many functions take btrfs_root
> only to grab the fs_info struct.
> 
> By requiring a root these functions cause
> programmer overhead. That these functions can
> accept any valid root is not obvious until
> inspection.
> 
> This patch reduces the specificity of such
> functions to accept the fs_info directly.
> 
> These patches can be applied independently
> and thus are not being submitted as a patch
> series. There should be about 26 patches by
> the project's completion. Each patch will
> cleanup between 1 and 34 functions apiece.
> Each patch covers a single file's functions.

It's good to have this kind of introduction but it really belongs ot the
cover letter not the individual patches.
> 
> This patch affects the following function(s):
>   1) csum_tree_block
>   2) csum_dirty_buffer
>   3) check_tree_block_fsid
>   4) btrfs_find_tree_block
>   5) clean_tree_block

Now that I see that, I'm not sure that my previous comment about 'one
patch per function' is the right way to go. This patch looks good as it
stands. The change is simple enough that I won't be opposed to grouping
even more functions together as long as it stays revieweable.

The patches are likely to clash with a lot of pending patches, so you
may want to base it on the integration branch next time. This would make
maintainers' life easier and also raises chances to merge the patches.

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: disk-io: replace root args iff only fs_info used

2014-11-21 Thread Daniel Dressler

Thank you David this is helpful feedback.

What would a cover letter be like? Would that be a separate email to
the list, or maybe the first email in a patch series?

Sorry I've twice looked for the integration repo. I found some that
look like it could be but those had older commits. Could you direct me
to the exact branch I'd love to work against it. These patches were
done against linux-next.

I think small one function patches might be best. I have the codebase
mapped out and each file's functions-to-be-cleaned count varies
wildly. If I did batch files together and split large files apart
there would be no rhyme or reason for the groupings. With single
function patches it is very clear what changes are justified since
they should only occur in the affected function or in a call-site.
With multiple functions the call-site changes get mixed up would it
would be harder to review.

Daniel

2014-11-22 1:15 GMT+09:00 David Sterba :
> On Fri, Nov 21, 2014 at 05:15:07PM +0900, Daniel Dressler wrote:
>> This is the 3rd independent patch of a larger
>> project to cleanup btrfs's internal usage of
>> btrfs_root. Many functions take btrfs_root
>> only to grab the fs_info struct.
>>
>> By requiring a root these functions cause
>> programmer overhead. That these functions can
>> accept any valid root is not obvious until
>> inspection.
>>
>> This patch reduces the specificity of such
>> functions to accept the fs_info directly.
>>
>> These patches can be applied independently
>> and thus are not being submitted as a patch
>> series. There should be about 26 patches by
>> the project's completion. Each patch will
>> cleanup between 1 and 34 functions apiece.
>> Each patch covers a single file's functions.
>
> It's good to have this kind of introduction but it really belongs ot the
> cover letter not the individual patches.
>>
>> This patch affects the following function(s):
>>   1) csum_tree_block
>>   2) csum_dirty_buffer
>>   3) check_tree_block_fsid
>>   4) btrfs_find_tree_block
>>   5) clean_tree_block
>
> Now that I see that, I'm not sure that my previous comment about 'one
> patch per function' is the right way to go. This patch looks good as it
> stands. The change is simple enough that I won't be opposed to grouping
> even more functions together as long as it stays revieweable.
>
> The patches are likely to clash with a lot of pending patches, so you
> may want to base it on the integration branch next time. This would make
> maintainers' life easier and also raises chances to merge the patches.
>
> Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 4/5] btrfs: don't allow -C or +c chattrs on a swap file

2014-11-21 Thread David Sterba

On Fri, Nov 21, 2014 at 02:08:30AM -0800, Omar Sandoval wrote:
> @@ -293,14 +293,21 @@ static int btrfs_ioctl_setflags(struct file *file, void 
> __user *arg)
>   }
>   } else {

You can put the condition here, instead of shifting the nested block.

} else if (!IS_SWAPFILE(inode)) {

>   /*
> -  * Revert back under same assuptions as above
> +  * swap_activate checks that we don't swapon a copy-on-write
> +  * file, but we must also make sure that it doesn't become
> +  * copy-on-write.
>*/
> - if (S_ISREG(mode)) {
> - if (inode->i_size == 0)
> - ip->flags &= ~(BTRFS_INODE_NODATACOW
> -  | BTRFS_INODE_NODATASUM);
> - } else {
> - ip->flags &= ~BTRFS_INODE_NODATACOW;
> + if (!IS_SWAPFILE(inode)) {
> + /*
> +  * Revert back under same assumptions as above
> +  */
> + if (S_ISREG(mode)) {
> + if (inode->i_size == 0)
> + ip->flags &= ~(BTRFS_INODE_NODATACOW |
> +BTRFS_INODE_NODATASUM);
> + } else {
> + ip->flags &= ~BTRFS_INODE_NODATACOW;
> + }
>   }
>   }
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Chris Murphy

On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong  wrote:

> In my situation what I've found is that if I scrub & let it fix the
> errors then a second pass immediately after will show no errors. If I
> then leave it a few days & try again there will be errors, even in
> old files which have not been accessed for months.

What are the devices? And if they're SSDs are they powered off for
these few days? I take it the scrub error type is corruption?

You can use badblocks to write a known pattern to the drive. Then
power off and leave it for a few days. Then read the drive, matching
against the pattern, and see if there are any discrepancies. Doing
this outside the code path of Btrfs would fairly conclusively indicate
whether it's hardware or software induced.

Assuming you have another copy of all of these files :-) you could
just sha256sum the two copies to see if they have in fact changed. If
they have, well then you've got some silent data corruption somewhere
somehow. But if they always match, then that suggests a bug. I don't
see how you can get bogus corruption messages, and for it to not be a
bug. When you do these scrubs that come up clean, and then later come
up with corruptions, have you done any software updates?

> My next step is to disable autodefrag & see if the problem persists.
> (I'm not suggesting a problem with autodefrag, I just want to remove it
> from the equation & ensure that outside of normal file access, data
> isn't being rewritten between scrubs)

I wouldn't expect autodefrag to touch old files not accessed for
months. Doesn't it only affect actively used files?

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Zygo Blaxell

On Fri, Nov 21, 2014 at 09:05:32AM +0200, Brendan Hide wrote:
> On 2014/11/21 06:58, Zygo Blaxell wrote:
> >You have one reallocated sector, so the drive has lost some data at some
> >time in the last 49000(!) hours.  Normally reallocations happen during
> >writes so the data that was "lost" was data you were in the process of
> >overwriting anyway; however, the reallocated sector count could also be
> >a sign of deteriorating drive integrity.
> >
> >In /var/lib/smartmontools there might be a csv file with logged error
> >attribute data that you could use to figure out whether that reallocation
> >was recent.
> >
> >I also notice you are not running regular SMART self-tests (e.g.
> >by smartctl -t long) and the last (and first, and only!) self-test the
> >drive ran was ~12000 hours ago.  That means most of your SMART data is
> >about 18 months old.  The drive won't know about sectors that went bad
> >in the last year and a half unless the host happens to stumble across
> >them during a read.
> >
> >The drive is over five years old in operating hours alone.  It is probably
> >so fragile now that it will break if you try to move it.
> All interesting points. Do you schedule SMART self-tests on your own
> systems? I have smartd running. In theory it tracks changes and
> sends alerts if it figures a drive is going to fail. But, based on
> what you've indicated, that isn't good enough.

I run 'smartctl -t long' from cron overnight (or whenever the drives
are most idle).  You can also set up smartd.conf to launch the self
tests; however, the syntax for test scheduling is byzantine compared to
cron (and that's saying something!).  On multi-drive systems I schedule
a different drive for each night.

If you are also doing btrfs scrub, then stagger the scheduling so
e.g. smart runs in even weeks and btrfs scrub runs in odd weeks.

smartd is OK for monitoring test logs and email alerts.  I've had no
problems there.

> >WARNING: errors detected during scrubbing, corrected.
> >[snip]
> >scrub device /dev/sdb2 (id 2) done
> > scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 
> > seconds
> > total bytes scrubbed: 189.49GiB with 5420 errors
> > error details: read=5 csum=5415
> > corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
> >That seems a little off.  If there were 5 read errors, I'd expect the drive 
> >to
> >have errors in the SMART error log.
> >
> >Checksum errors could just as easily be a btrfs bug or a RAM/CPU problem.
> >There have been a number of fixes to csums in btrfs pulled into the kernel
> >recently, and I've retired two five-year-old computers this summer due
> >to RAM/CPU failures.
> The difference here is that the issue only affects the one drive.
> This leaves the probable cause at:
> - the drive itself
> - the cable/ports
> 
> with a negligibly-possible cause at the motherboard chipset.

If it was cable, there should be UDMA CRC errors or similar in the SMART
counters, but they are zero.  You can also try swapping the cable and
seeing whether the errors move.  I've found many bad cables that way.

The drive itself could be failing in some way that prevents recording
SMART errors (e.g. because of host timeouts triggering a bus reset,
which also prevents the SMART counter update for what was going wrong at
the time).  This is unfortunately quite common, especially with drives
configured for non-RAID workloads.

> 
> -- 
> __
> Brendan Hide
> http://swiftspirit.co.za/
> http://www.webafrica.co.za/?AFF1E97
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


signature.asc
Description: Digital signature

Re: BTRFS messes up snapshot LV with origin

2014-11-21 Thread Zygo Blaxell

On Fri, Nov 21, 2014 at 06:22:57AM +, Duncan wrote:
> After all, an LVM block-level snapshot takes the same space as a file 
> containing the same raw data, and if there's room for the data in an LVM 
> snapshot, given a different layout, there's room for exactly the same 
> amount of data as a file on a different filesystem, piped thru some 
> compressor if necessary due to tight datasize constraints.

That isn't true at all.  A repairing fsck can take less than 1% of the
overall volume size, and a full conversion from another filesystem type
can take less than 10%.  Usually I can find enough space by blowing away
the swap LV for a few hours.

I do NOT usually have 13TB of slack space lying around in a 26TB disk
array, nor do I have enough bandwidth to move those 13TB to another
machine without great inconvenience.

> But while other filesystems might allow un-UUIDs (heh, UUUIDs or U3IDs 
> =:^), because they're no longer unique, requiring them to be unique just 
> as the label says cannot be considered a bug.  It's simply stricter 
> enforcement of the rules, which are, after all, plainly stated in the 
> descriptive name.

It's not a bug as long as I can completely control which devices are
searched for UUIDs, and the system behaves sanely when multiple UUIDs
are found through automatic discovery; otherwise, it's not only a bug,
it's a DoS attack security vulnerability.  Consider what happens if
someone looks at /sys/fs/btrfs, reads the non-secret UUIDs, builds a fake
filesystem with those UUIDs, puts the fake filesystem on a USB stick,
and plugs it back into the victim machine...

> -- 
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

signature.asc
Description: Digital signature

Re: [PATCH v2 5/5] btrfs: enable swap file support

2014-11-21 Thread David Sterba

On Fri, Nov 21, 2014 at 02:08:31AM -0800, Omar Sandoval wrote:
> Implement the swap file a_ops on btrfs. Activation simply checks for a usable
> swap file: it must be fully allocated (no holes), support direct I/O (so no
> compressed or inline extents) and should be nocow (I'm not sure about that 
> last
> one).
> 
> Signed-off-by: Omar Sandoval 
> ---
>  fs/btrfs/inode.c | 71 
> 
>  1 file changed, 71 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index d23362f..b8fd36b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9442,6 +9442,75 @@ out_inode:
>  
>  }
>  
> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file 
> *file,
> +sector_t *span)
> +{
> + struct inode *inode = file_inode(file);
> + struct btrfs_inode *ip = BTRFS_I(inode);

'ip' looks strange in context of a filesystem, please pick a different
name (eg. 'inode' or whatever).

> + int ret = 0;
> + u64 isize = inode->i_size;
> + struct extent_state *cached_state = NULL;
> + struct extent_map *em;
> + u64 start, len;
> +
> + if (ip->flags & BTRFS_INODE_COMPRESS) {
> + /* Can't do direct I/O on a compressed file. */
> + pr_err("BTRFS: swapfile is compressed");

Please use the btrfs_err macros everywhere.

> + return -EINVAL;
> + }
> + if (!(ip->flags & BTRFS_INODE_NODATACOW)) {
> + /* The swap file can't be copy-on-write. */
> + pr_err("BTRFS: swapfile is copy-on-write");
> + return -EINVAL;
> + }
> +
> + lock_extent_bits(&ip->io_tree, 0, isize - 1, 0, &cached_state);
> +
> + /*
> +  * All of the extents must be allocated and support direct I/O. Inline
> +  * extents and compressed extents fall back to buffered I/O, so those
> +  * are no good.
> +  */
> + start = 0;
> + while (start < isize) {
> + len = isize - start;
> + em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> + if (IS_ERR(em)) {
> + ret = PTR_ERR(em);
> + goto out;
> + }
> +
> + if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
> + em->block_start == EXTENT_MAP_HOLE) {

If the no-holes feature is enabled on the filesystem, there won't be any
such extent representing the hole. You have to check that each two
consecutive extents are adjacent.

> + pr_err("BTRFS: swapfile has holes");
> + ret = -EINVAL;
> + goto out;
> + }
> + if (em->block_start == EXTENT_MAP_INLINE) {
> + pr_err("BTRFS: swapfile is inline");

While the test is valid, this would mean that the file is smaller than
the inline limit, which is now one page. I think the generic swap code
would refuse such a small file anyway.

> + ret = -EINVAL;
> + goto out;
> + }
> + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> + pr_err("BTRFS: swapfile is compresed");
> + ret = -EINVAL;
> + goto out;
> + }

I think the preallocated extents should be refused as well. This means
the filesystem has enough space to hold the data but it would still have
to go through the allocation and could in turn stress the memory
management code that triggered the swapping activity in the first place.

Though it's probably still possible to reach such corner case even with
fully allocated nodatacow file, this should be reviewed anyway.

> +
> + start = extent_map_end(em);
> + free_extent_map(em);
> + }
> +
> +out:
> + unlock_extent_cached(&ip->io_tree, 0, isize - 1, &cached_state,
> +  GFP_NOFS);
> + return ret;
> +}
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Chris Murphy

On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell  wrote:

> I run 'smartctl -t long' from cron overnight (or whenever the drives
> are most idle).  You can also set up smartd.conf to launch the self
> tests; however, the syntax for test scheduling is byzantine compared to
> cron (and that's saying something!).  On multi-drive systems I schedule
> a different drive for each night.
>
> If you are also doing btrfs scrub, then stagger the scheduling so
> e.g. smart runs in even weeks and btrfs scrub runs in odd weeks.
>
> smartd is OK for monitoring test logs and email alerts.  I've had no
> problems there.

Most attributes are always updated without issuing a smart test of any
kind. A drive I have here only has four offline updateable attributes.

When it comes to bad sectors, the drive won't use a sector that
persistently fails writes. So you don't really have to worry about
latent bad sectors that don't have data on them already. The sectors
you care about are the ones with data. A scrub reads all of those
sectors.

First the drive could report a read error in which case Btrfs
raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or
rebuild it from parity, and write to the affected sector; and also
this same mechanism happens in normal reads so it's a kind of passive
scrub. But it happens to miss checking inactively read data, which a
scrub will check.

Second, the drive could report no problem, and Btrfs raid1/10 could
still fix the problem in case of a csum mismatch. And it looks like
soonish we'll see this apply to raid5/6.

So I think a nightly long smart test is a bit overkill. I think you
could do nightly -t short tests which will report problems scrub won't
notice, such as higher seek times or lower throughput performance. And
then scrub once a week.

> The drive itself could be failing in some way that prevents recording
> SMART errors (e.g. because of host timeouts triggering a bus reset,
> which also prevents the SMART counter update for what was going wrong at
> the time).  This is unfortunately quite common, especially with drives
> configured for non-RAID workloads.

Libata resetting the link should be recorded in kernel messages.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BTRFS messes up snapshot LV with origin

2014-11-21 Thread Chris Murphy

On Thu, Nov 20, 2014 at 11:22 PM, Duncan <1i5t5.dun...@cox.net> wrote:

>
> When I have such a filesystem level problem, I simply dd from the backing
> device to some other location, generally to a file that's on a different
> filesystem (preferrably non-btrfs, I use reiserfs as I've found it very
> resilient, here), in which case btrfs device scan won't see the UUID on
> the copy as it scans block devices, not inside non-device files.

That's hours of dd and you have to find space to do it.

> After all, an LVM block-level snapshot takes the same space as a file
> containing the same raw data, and if there's room for the data in an LVM
> snapshot, given a different layout, there's room for exactly the same
> amount of data as a file on a different filesystem, piped thru some
> compressor if necessary due to tight datasize constraints.

That's not true for thin volume snapshots. They take up next to no
space upon creation, they don't need space reserved in advance.
They're more like a qcow2 snapshot than a conventional LVM snapshot; a
big difference being if you delete the snapshot, or you delete a bunch
of files in a thin volume and follow it with fstrim, the unused
extents are returned to the thin pool.

There has been a fragmentation problem with thin volumes; I don't know
if that's solved yet. And I don't know if it exacerbates things with
Btrfs fragmentation.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 5/5] btrfs: enable swap file support

2014-11-21 Thread Filipe David Manana

On Fri, Nov 21, 2014 at 6:00 PM, David Sterba  wrote:
> On Fri, Nov 21, 2014 at 02:08:31AM -0800, Omar Sandoval wrote:
>> Implement the swap file a_ops on btrfs. Activation simply checks for a usable
>> swap file: it must be fully allocated (no holes), support direct I/O (so no
>> compressed or inline extents) and should be nocow (I'm not sure about that 
>> last
>> one).
>>
>> Signed-off-by: Omar Sandoval 
>> ---
>>  fs/btrfs/inode.c | 71 
>> 
>>  1 file changed, 71 insertions(+)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index d23362f..b8fd36b 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -9442,6 +9442,75 @@ out_inode:
>>
>>  }
>>
>> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file 
>> *file,
>> +sector_t *span)
>> +{
>> + struct inode *inode = file_inode(file);
>> + struct btrfs_inode *ip = BTRFS_I(inode);
>
> 'ip' looks strange in context of a filesystem, please pick a different
> name (eg. 'inode' or whatever).
>
>> + int ret = 0;
>> + u64 isize = inode->i_size;
>> + struct extent_state *cached_state = NULL;
>> + struct extent_map *em;
>> + u64 start, len;
>> +
>> + if (ip->flags & BTRFS_INODE_COMPRESS) {
>> + /* Can't do direct I/O on a compressed file. */
>> + pr_err("BTRFS: swapfile is compressed");
>
> Please use the btrfs_err macros everywhere.
>
>> + return -EINVAL;
>> + }
>> + if (!(ip->flags & BTRFS_INODE_NODATACOW)) {
>> + /* The swap file can't be copy-on-write. */
>> + pr_err("BTRFS: swapfile is copy-on-write");
>> + return -EINVAL;
>> + }
>> +
>> + lock_extent_bits(&ip->io_tree, 0, isize - 1, 0, &cached_state);
>> +
>> + /*
>> +  * All of the extents must be allocated and support direct I/O. Inline
>> +  * extents and compressed extents fall back to buffered I/O, so those
>> +  * are no good.
>> +  */
>> + start = 0;
>> + while (start < isize) {
>> + len = isize - start;
>> + em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
>> + if (IS_ERR(em)) {
>> + ret = PTR_ERR(em);
>> + goto out;
>> + }
>> +
>> + if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
>> + em->block_start == EXTENT_MAP_HOLE) {
>
> If the no-holes feature is enabled on the filesystem, there won't be any
> such extent representing the hole. You have to check that each two
> consecutive extents are adjacent.

If no-holes is enabled it means file extent items aren't used to
represent holes. But extent maps are still used to represent holes in
memory, and that's what the code is looking for and therefore it's
correct.

>
>> + pr_err("BTRFS: swapfile has holes");
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> + if (em->block_start == EXTENT_MAP_INLINE) {
>> + pr_err("BTRFS: swapfile is inline");
>
> While the test is valid, this would mean that the file is smaller than
> the inline limit, which is now one page. I think the generic swap code
> would refuse such a small file anyway.
>
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
>> + pr_err("BTRFS: swapfile is compresed");
>> + ret = -EINVAL;
>> + goto out;
>> + }
>
> I think the preallocated extents should be refused as well. This means
> the filesystem has enough space to hold the data but it would still have
> to go through the allocation and could in turn stress the memory
> management code that triggered the swapping activity in the first place.
>
> Though it's probably still possible to reach such corner case even with
> fully allocated nodatacow file, this should be reviewed anyway.
>
>> +
>> + start = extent_map_end(em);
>> + free_extent_map(em);
>> + }
>> +
>> +out:
>> + unlock_extent_cached(&ip->io_tree, 0, isize - 1, &cached_state,
>> +  GFP_NOFS);
>> + return ret;
>> +}
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] Btrfs: make sure logged extents complete in the current transaction V3

2014-11-21 Thread Josef Bacik

Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described.  It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits.  Consider
the following

write extent 0-4k
log extent in log tree
commit transaction
< power fail happens here
ordered extent completes

We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed.  If we lose
power before the transaction commit we are save, otherwise we are not.

Fix this by keeping track of all extents we logged in this transaction.  Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding.  This will make sure that if we lose
power after the transaction commit we still have our data.  This also fixes the
problem of the improperly updated extent generation.  Thanks,

cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik 
---
V2->V3: Cleanup pending ordered properly on transaction abort.

 fs/btrfs/disk-io.c  | 20 
 fs/btrfs/ordered-data.c |  9 +++--
 fs/btrfs/ordered-data.h |  8 +++-
 fs/btrfs/transaction.c  | 33 +
 fs/btrfs/transaction.h  |  2 ++
 fs/btrfs/tree-log.c |  6 +++---
 6 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2409718..fe4db97 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4128,6 +4128,25 @@ again:
return 0;
 }
 
+static void btrfs_free_pending_ordered(struct btrfs_transaction *cur_trans,
+  struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_ordered_extent *ordered;
+
+   spin_lock(&fs_info->trans_lock);
+   while (!list_empty(&cur_trans->pending_ordered)) {
+   ordered = list_first_entry(&cur_trans->pending_ordered,
+  struct btrfs_ordered_extent,
+  trans_list);
+   list_del_init(&ordered->trans_list);
+   spin_unlock(&fs_info->trans_lock);
+
+   btrfs_put_ordered_extent(ordered);
+   spin_lock(&fs_info->trans_lock);
+   }
+   spin_unlock(&fs_info->trans_lock);
+}
+
 void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
   struct btrfs_root *root)
 {
@@ -4139,6 +4158,7 @@ void btrfs_cleanup_one_transaction(struct 
btrfs_transaction *cur_trans,
cur_trans->state = TRANS_STATE_UNBLOCKED;
wake_up(&root->fs_info->transaction_wait);
 
+   btrfs_free_pending_ordered(cur_trans, root->fs_info);
btrfs_destroy_delayed_inodes(root);
btrfs_assert_delayed_root_empty(root);
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index ac734ec..269e21d 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -220,6 +220,7 @@ static int __btrfs_add_ordered_extent(struct inode *inode, 
u64 file_offset,
INIT_LIST_HEAD(&entry->work_list);
init_completion(&entry->completion);
INIT_LIST_HEAD(&entry->log_list);
+   INIT_LIST_HEAD(&entry->trans_list);
 
trace_btrfs_ordered_extent_add(inode, entry);
 
@@ -443,6 +444,8 @@ void btrfs_get_logged_extents(struct inode *inode,
ordered = rb_entry(n, struct btrfs_ordered_extent, rb_node);
if (!list_empty(&ordered->log_list))
continue;
+   if (test_bit(BTRFS_ORDERED_LOGGED, &ordered->flags))
+   continue;
list_add_tail(&ordered->log_list, logged_list);
atomic_inc(&ordered->refs);
}
@@ -472,7 +475,8 @@ void btrfs_submit_logged_extents(struct list_head 
*logged_list,
spin_unlock_irq(&log->log_extents_lock[index]);
 }
 
-void btrfs_wait_logged_extents(struct btrfs_root *log, u64 transid)
+void btrfs_wait_logged_extents(struct btrfs_trans_handle *trans,
+  struct btrfs_root *log, u64 transid)
 {
struct btrfs_ordered_extent *ordered;
int index = transid % 2;
@@ -497,7 +501,8 @@ void btrfs_wait_logged_extents(struct btrfs_root *log, u64 
transid)
wait_event(ordered->wait, test_bit(BTRFS_ORDERED_IO_DONE,
   &ordered->flags));
 
-   btrfs_put_ordered_extent(ordered);
+   if (!test_and_set_bit(BTRFS_ORDERED_LOGGED, &ordered->flags))
+   list_add_tail(&ordered->trans_list, &trans->ordered);
spin_lock_irq(&log->log_extents_lock[index]);
}
spin_unlock_irq(&log->log_extents_lock[index]);
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index d81a274..0124bff 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -71,6 +71,8 @@ struct btrfs_ordered_sum {

[PATCH 1/4] fs: split update_time() into update_time() and write_time()

2014-11-21 Thread Theodore Ts'o

In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Currently, only btrfs and xfs uses update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.

Signed-off-by: Theodore Ts'o 
Cc: x...@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
---
 fs/btrfs/inode.c   | 10 ++
 fs/inode.c | 29 ++---
 fs/xfs/xfs_iops.c  | 39 ---
 include/linux/fs.h |  1 +
 4 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d23362f..a5e0d0d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5574,6 +5574,11 @@ static int btrfs_update_time(struct inode *inode, struct 
timespec *now,
inode->i_mtime = *now;
if (flags & S_ATIME)
inode->i_atime = *now;
+   return 0;
+}
+
+static int btrfs_write_time(struct inode *inode)
+{
return btrfs_dirty_inode(inode);
 }
 
@@ -9462,6 +9467,7 @@ static const struct inode_operations 
btrfs_dir_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
.tmpfile= btrfs_tmpfile,
 };
 static const struct inode_operations btrfs_dir_ro_inode_operations = {
@@ -9470,6 +9476,7 @@ static const struct inode_operations 
btrfs_dir_ro_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 static const struct file_operations btrfs_dir_file_operations = {
@@ -9540,6 +9547,7 @@ static const struct inode_operations 
btrfs_file_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_special_inode_operations = {
.getattr= btrfs_getattr,
@@ -9552,6 +9560,7 @@ static const struct inode_operations 
btrfs_special_inode_operations = {
.get_acl= btrfs_get_acl,
.set_acl= btrfs_set_acl,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 static const struct inode_operations btrfs_symlink_inode_operations = {
.readlink   = generic_readlink,
@@ -9565,6 +9574,7 @@ static const struct inode_operations 
btrfs_symlink_inode_operations = {
.listxattr  = btrfs_listxattr,
.removexattr= btrfs_removexattr,
.update_time= btrfs_update_time,
+   .write_time = btrfs_write_time,
 };
 
 const struct dentry_operations btrfs_dentry_operations = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..8f5c4b5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1499,17 +1499,24 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
-   if (inode->i_op->update_time)
-   return inode->i_op->update_time(inode, time, flags);
-
-   if (flags & S_ATIME)
-   inode->i_atime = *time;
-   if (flags & S_VERSION)
-   inode_inc_iversion(inode);
-   if (flags & S_CTIME)
-   inode->i_ctime = *time;
-   if (flags & S_MTIME)
-   inode->i_mtime = *time;
+   int ret;
+
+   if (inode->i_op->update_time) {
+   ret = inode->i_op->update_time(inode, time, flags);
+   if (ret)
+   return ret;
+   } else {
+   if (flags & S_ATIME)
+   inode->i_atime = *time;
+   if (flags & S_VERSION)
+   inode_inc_iversion(inode);
+   if (flags & S_CTIME)
+   inode->i_ctime = *time;
+   if (flags & S_MTIME)
+   inode->i_mtime = *time;
+   }
+   if (inode->i_op->write_time)
+   return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
return 0;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec6dcdc..0e9653c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -984,10 +984,8 @@ xfs_vn_setattr(
 }
 
 STATIC int
-xfs_vn_update_time(
-   struct inode*inode,
-   struct timespec *now,
-   int flags)
+xfs_vn_write_time(
+   struct inode*inode)
 {
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip->i_mount;
@@ -1004,21 +1002,16 @@ xfs_vn_update_time(
}
 
xfs_ilock(ip, XFS_ILOCK_EXCL);
-   if (flags & S_CTIME) {
-   inode->i

[PATCH 2/4] vfs: add support for a lazytime mount option

2014-11-21 Thread Theodore Ts'o

Add a new mount option which enables a new "lazytime" mode.  This mode
causes atime, mtime, and ctime updates to only be made to the
in-memory version of the inode.  The on-disk times will only get
updated when (a) if the inode needs to be updated for some non-time
related change, (b) if userspace calls fsync(), syncfs() or sync(), or
(c) just before an undeleted inode is evicted from memory.

This is OK according to POSIX because there are no guarantees after a
crash unless userspace explicitly requests via a fsync(2) call.

For workloads which feature a large number of random write to a
preallocated file, the lazytime mount option significantly reduces
writes to the inode table.  The repeated 4k writes to a single block
will result in undesirable stress on flash devices and SMR disk
drives.  Even on conventional HDD's, the repeated writes to the inode
table block will trigger Adjacent Track Interference (ATI) remediation
latencies, which very negatively impact 99.9 percentile latencies ---
which is a very big deal for web serving tiers (for example).

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c   | 38 +-
 fs/inode.c  | 18 ++
 fs/proc_namespace.c |  1 +
 fs/sync.c   |  7 +++
 include/linux/fs.h  |  1 +
 include/uapi/linux/fs.h |  1 +
 6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ef9bef1..ce7de22 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -483,7 +483,7 @@ __writeback_single_inode(struct inode *inode, struct 
writeback_control *wbc)
if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
inode->i_state &= ~I_DIRTY_PAGES;
dirty = inode->i_state & I_DIRTY;
-   inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+   inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_TIME);
spin_unlock(&inode->i_lock);
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
@@ -1277,6 +1277,41 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
 }
 
+/*
+ * This works like wait_sb_inodes(), but it is called *before* we kick
+ * the bdi so the inodes can get written out.
+ */
+static void flush_sb_dirty_time(struct super_block *sb)
+{
+   struct inode *inode, *old_inode = NULL;
+
+   WARN_ON(!rwsem_is_locked(&sb->s_umount));
+   spin_lock(&inode_sb_list_lock);
+   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+   int dirty_time;
+
+   spin_lock(&inode->i_lock);
+   if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+   spin_unlock(&inode->i_lock);
+   continue;
+   }
+   dirty_time = inode->i_state & I_DIRTY_TIME;
+   __iget(inode);
+   spin_unlock(&inode->i_lock);
+   spin_unlock(&inode_sb_list_lock);
+
+   iput(old_inode);
+   old_inode = inode;
+
+   if (dirty_time)
+   mark_inode_dirty(inode);
+   cond_resched();
+   spin_lock(&inode_sb_list_lock);
+   }
+   spin_unlock(&inode_sb_list_lock);
+   iput(old_inode);
+}
+
 /**
  * writeback_inodes_sb_nr -writeback dirty inodes from given super_block
  * @sb: the superblock
@@ -1388,6 +1423,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+   flush_sb_dirty_time(sb);
bdi_queue_work(sb->s_bdi, &work);
wait_for_completion(&done);
 
diff --git a/fs/inode.c b/fs/inode.c
index 8f5c4b5..6e91aca 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,18 @@ static void evict(struct inode *inode)
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(!list_empty(&inode->i_lru));
 
+   if (inode->i_nlink && inode->i_state & I_DIRTY_TIME) {
+   if (inode->i_op->write_time)
+   inode->i_op->write_time(inode);
+   else if (inode->i_sb->s_op->write_inode) {
+   struct writeback_control wbc = {
+   .sync_mode = WB_SYNC_NONE,
+   };
+   mark_inode_dirty(inode);
+   inode->i_sb->s_op->write_inode(inode, &wbc);
+   }
+   }
+
if (!list_empty(&inode->i_wb_list))
inode_wb_list_del(inode);
 
@@ -1515,6 +1527,12 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
}
+   if (inode->i_sb->s_flags & MS_LAZYTIME) {
+   spin_lock(&inode->i_lock);
+   inode->i_state |= I_DIRTY_TIME;
+   spin_unlock(&inode->i_lock);
+   return 0;
+   }

[PATCH 4/4] ext4: add support for a lazytime mount option

2014-11-21 Thread Theodore Ts'o

Add an optimization for the MS_LAZYTIME mount option so that we will
opportunistically write out any inodes with the I_DIRTY_TIME flag set
in a particular inode table block when we need to update some inode in
that inode table block anyway.

Also add some temporary code so that we can set the lazytime mount
option without needing a modified /sbin/mount program which can set
MS_LAZYTIME.  We can eventually make this go away once util-linux has
added support.

Google-Bug-Id: 18297052

Signed-off-by: Theodore Ts'o 
---
 fs/ext4/inode.c| 46 +-
 fs/ext4/super.c|  9 +
 fs/inode.c | 36 
 include/linux/fs.h |  2 ++
 4 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3356ab5..07ceafb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4163,6 +4163,49 @@ static int ext4_inode_blocks_set(handle_t *handle,
 }
 
 /*
+ * Opportunistically update the other time fields for other inodes in
+ * the same inode table block.
+ */
+static void ext4_update_other_inodes_time(struct super_block *sb,
+ unsigned long orig_ino, char *buf)
+{
+   struct ext4_inode_info  *ei;
+   struct ext4_inode   *raw_inode;
+   unsigned long   ino;
+   struct inode*inode;
+   int i, inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
+   int inode_size = EXT4_INODE_SIZE(sb);
+
+   ino = orig_ino & ~(inodes_per_block - 1);
+   for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size) {
+   if (ino == orig_ino)
+   continue;
+   inode = find_active_inode_nowait(sb, ino);
+   if (!inode ||
+   (inode->i_state & I_DIRTY_TIME) == 0) {
+   iput(inode);
+   continue;
+   }
+   raw_inode = (struct ext4_inode *) buf;
+   ei = EXT4_I(inode);
+
+   spin_lock(&inode->i_lock);
+   inode->i_state &= ~I_DIRTY_TIME;
+   inode->i_ts_dirty_day = 0;
+   spin_unlock(&inode->i_lock);
+
+   spin_lock(&ei->i_raw_lock);
+   EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+   ext4_inode_csum_set(inode, raw_inode, ei);
+   spin_unlock(&ei->i_raw_lock);
+   iput(inode);
+   }
+}
+
+
+/*
  * Post the struct inode info into an on-disk inode location in the
  * buffer-cache.  This gobbles the caller's reference to the
  * buffer_head in the inode location struct.
@@ -4273,9 +4316,10 @@ static int ext4_do_update_inode(handle_t *handle,
}
 
ext4_inode_csum_set(inode, raw_inode, ei);
-
spin_unlock(&ei->i_raw_lock);
 
+   ext4_update_other_inodes_time(inode->i_sb, inode->i_ino, bh->b_data);
+
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
if (!err)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4b79f39..1ac1914 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1133,6 +1133,7 @@ enum {
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_usrquota, Opt_grpquota, Opt_i_version,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+   Opt_lazytime, Opt_nolazytime,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
@@ -1195,6 +1196,8 @@ static const match_table_t tokens = {
{Opt_i_version, "i_version"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
+   {Opt_lazytime, "lazytime"},
+   {Opt_nolazytime, "nolazytime"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_removed, "mblk_io_submit"},
{Opt_removed, "nomblk_io_submit"},
@@ -1450,6 +1453,12 @@ static int handle_mount_opt(struct super_block *sb, char 
*opt, int token,
case Opt_i_version:
sb->s_flags |= MS_I_VERSION;
return 1;
+   case Opt_lazytime:
+   sb->s_flags |= MS_LAZYTIME;
+   return 1;
+   case Opt_nolazytime:
+   sb->s_flags &= ~MS_LAZYTIME;
+   return 1;
}
 
for (m = ext4_mount_opts; m->token != Opt_err; m++)
diff --git a/fs/inode.c b/fs/inode.c
index f0d6232..89cfca7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1292,6 +1292,42 @@ struct inode *ilookup(struct super_block *sb, unsigned 
long ino)
 }
 EXPORT_SYMBOL(ilookup);
 
+/**
+ * find_active_inode_nowait - find an active inode in the inode cache
+ * @sb:super block of file system to search
+ * @ino:   inode number to search for
+ *
+ * Search for an a

[PATCH 0/4] add support for a lazytime mount option

2014-11-21 Thread Theodore Ts'o

This is an updated version of what had originally been an
ext4-specific patch which significantly improves performance by lazily
writing timestamp updates (and in particular, mtime updates) to disk.
The in-memory timestamps are always correct, but they are only written
to disk when required for correctness.

This provides a huge performance boost for ext4 due to how it handles
journalling, but it's valuable for all file systems running on flash
storage or drive-managed SMR disks by reducing the metadata write
load.  So upon request, I've moved the functionality to the VFS layer.
Once the /sbin/mount program adds support for MS_LAZYTIME, all file
systems should be able to benefit from this optimization.

There is still an ext4-specific optimization, which may be applicable
for other file systems which store more than one inode in a block, but
it will require file system specific code.  It is purely optional,
however.

Please note the changes to update_time() and the new write_time() inode
operations functions, which impact btrfs and xfs.  The changes are
fairly simple, but I would appreciate confirmation from the btrfs and
xfs teams that I got things right.   Thanks!!

Theodore Ts'o (4):
  fs: split update_time() into update_time() and write_time()
  vfs: add support for a lazytime mount option
  vfs: don't let the dirty time inodes get more than a day stale
  ext4: add support for a lazytime mount option

 fs/btrfs/inode.c| 10 ++
 fs/ext4/inode.c | 46 +-
 fs/ext4/super.c |  9 +
 fs/fs-writeback.c   | 39 +-
 fs/inode.c  | 88 ++---
 fs/proc_namespace.c |  1 +
 fs/sync.c   |  7 
 fs/xfs/xfs_iops.c   | 39 +-
 include/linux/fs.h  |  5 +++
 include/uapi/linux/fs.h |  1 +
 10 files changed, 209 insertions(+), 36 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-21 Thread Theodore Ts'o

Guarantee that the on-disk timestamps will be no more than 24 hours
stale.

Signed-off-by: Theodore Ts'o 
---
 fs/fs-writeback.c  | 1 +
 fs/inode.c | 7 ++-
 include/linux/fs.h | 1 +
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ce7de22..eb04277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
trace_writeback_dirty_inode_start(inode, flags);
 
+   inode->i_ts_dirty_day = 0;
if (sb->s_op->dirty_inode)
sb->s_op->dirty_inode(inode, flags);
 
diff --git a/fs/inode.c b/fs/inode.c
index 6e91aca..f0d6232 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, 
struct inode *inode,
  */
 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
+   int days_since_boot = jiffies / (HZ * 86400);
int ret;
 
if (inode->i_op->update_time) {
@@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct 
timespec *time, int flags)
if (flags & S_MTIME)
inode->i_mtime = *time;
}
-   if (inode->i_sb->s_flags & MS_LAZYTIME) {
+   if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
+   (!inode->i_ts_dirty_day ||
+inode->i_ts_dirty_day == days_since_boot)) {
spin_lock(&inode->i_lock);
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
+   inode->i_ts_dirty_day = days_since_boot;
return 0;
}
+   inode->i_ts_dirty_day = 0;
if (inode->i_op->write_time)
return inode->i_op->write_time(inode);
mark_inode_dirty_sync(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 489b2f2..e3574cd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -575,6 +575,7 @@ struct inode {
struct timespec i_ctime;
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
+   unsigned short  i_ts_dirty_day;
unsigned inti_blkbits;
blkcnt_ti_blocks;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

2014-11-21 Thread Chris Mason


On Fri, Nov 21, 2014 at 2:59 PM, Theodore Ts'o  wrote:

In preparation for adding support for the lazytime mount option, we
need to be able to separate out the update_time() and write_time()
inode operations.  Currently, only btrfs and xfs uses update_time().

We needed to preserve update_time() because btrfs wants to have a
special btrfs_root_readonly() check; otherwise we could drop the
update_time() inode operation entirely.


No objections here, I'll give the queue a whirl.  You can add my 
acked-by-or-Ill-fix-whatever-breaks


I look forward to the patch that makes us all lazy by default.

-chris



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-21 Thread Andreas Dilger

On Nov 21, 2014, at 1:59 PM, Theodore Ts'o  wrote:
> Guarantee that the on-disk timestamps will be no more than 24 hours
> stale.
> 
> Signed-off-by: Theodore Ts'o 
> ---
> fs/fs-writeback.c  | 1 +
> fs/inode.c | 7 ++-
> include/linux/fs.h | 1 +
> 3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ce7de22..eb04277 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1141,6 +1141,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>   if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>   trace_writeback_dirty_inode_start(inode, flags);
> 
> + inode->i_ts_dirty_day = 0;
>   if (sb->s_op->dirty_inode)
>   sb->s_op->dirty_inode(inode, flags);
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 6e91aca..f0d6232 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1511,6 +1511,7 @@ static int relatime_need_update(struct vfsmount *mnt, 
> struct inode *inode,
>  */
> static int update_time(struct inode *inode, struct timespec *time, int flags)
> {
> + int days_since_boot = jiffies / (HZ * 86400);
>   int ret;
> 
>   if (inode->i_op->update_time) {
> @@ -1527,12 +1528,16 @@ static int update_time(struct inode *inode, struct 
> timespec *time, int flags)
>   if (flags & S_MTIME)
>   inode->i_mtime = *time;
>   }
> - if (inode->i_sb->s_flags & MS_LAZYTIME) {
> + if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
> + (!inode->i_ts_dirty_day ||
> +  inode->i_ts_dirty_day == days_since_boot)) {
>   spin_lock(&inode->i_lock);
>   inode->i_state |= I_DIRTY_TIME;
>   spin_unlock(&inode->i_lock);
> + inode->i_ts_dirty_day = days_since_boot;

It isn't clear if this is correct.  It looks like the condition will
only be entered if i_ts_dirty_day == days_since_boot, but that is only
set inside the condition?  Shouldn't this be:

inode->i_ts_dirty_day = ~0U;

if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
inode->i_ts_dirty_day != days_since_boot)) {
spin_lock(&inode->i_lock);
inode->i_state |= I_DIRTY_TIME;
spin_unlock(&inode->i_lock);
inode->i_ts_dirty_day = days_since_boot;
}

and "days_since_boot" should be declared unsigned short so it wraps
in the same way as i_ts_dirty_day, maybe using typeof(i_ts_dirty_day)
so that there isn't any need to update this in the future if the type
changes.  Conceivably this could be an unsigned char, if that packed
into struct inode better, at the risk of losing a timestamp update on
an inode in cache for 256+ days and only modifying it 256 days later.

Cheers, Andreas

>   return 0;
>   }
> + inode->i_ts_dirty_day = 0;
>   if (inode->i_op->write_time)
>   return inode->i_op->write_time(inode);
>   mark_inode_dirty_sync(inode);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 489b2f2..e3574cd 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -575,6 +575,7 @@ struct inode {
>   struct timespec i_ctime;
>   spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
>   unsigned short  i_bytes;
> + unsigned short  i_ts_dirty_day;
>   unsigned inti_blkbits;
>   blkcnt_ti_blocks;
> 
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas







signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Robert White


On 11/21/2014 07:11 AM, Phillip Susi wrote:

On 11/20/2014 5:45 PM, Robert White wrote:

If you search for "ACPI ide" you'll find people complaining in
2008-2010 about windows error messages indicating the device is
present in their system but no OS driver is available.


Nope... not finding it.  The closest thing was one or two people who
said ACPI when they meant AHCI ( and were quickly corrected ).  This
is probably what you were thinking of since windows xp did not ship
with an ahci driver so it was quite common for winxp users to have
this problem when in _AHCI_ mode.


I have to give you that one... I should have never trusted any reference 
to windows.


Most of those references to windows support were getting AHCI and ACPI 
mixed up. Lolz windows users... They didn't get into ACPI disk support 
till 2010. I should have known they were behind the times. I had to 
scroll down almost a whole page to find the linux support.


So lets just look at the top of the ide/ide-acpi.c from linux 2.6 to 
consult about when ACPI got into the IDE business...


linux/drivers/ide/ide-acpi.c
/*
 * Provides ACPI support for IDE drives.
 *
 * Copyright (C) 2005 Intel Corp.
 * Copyright (C) 2005 Randy Dunlap
 * Copyright (C) 2006 SUSE Linux Products GmbH
 * Copyright (C) 2006 Hannes Reinecke
 */

Here's a bug from 2005 of someone having a problem with the ACPI IDE 
support...


https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=6&cad=rja&uact=8&ved=0CDkQFjAF&url=https%3A%2F%2Fbugzilla.kernel.org%2Fshow_bug.cgi%3Fid%3D5604&ei=g6VvVL73K-HLsASIrYKIDg&usg=AFQjCNGTuuXPJk91svGJtRAf35DUqVqrLg&sig2=eHxwbLYXn4ED5jG-guoZqg

People debating the merits of the ACPI IDE drivers in 2005.

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=12&cad=rja&uact=8&ved=0CGUQFjAL&url=http%3A%2F%2Fwww.linuxquestions.org%2Fquestions%2Fslackware-14%2Fbare-ide-and-bare-acpi-kernels-297525%2F&ei=g6VvVL73K-HLsASIrYKIDg&usg=AFQjCNFoyKgH2sOteWwRN_Tdrfw9hOmVGQ&sig2=BmMVcZl24KRz4s4gEvLN_w

So "you got me"... windows was behind the curve by five years instead of 
just three... my bad...


But yea, nobody has ever used that ACPI disk drive support that's been 
in the kernel for nine years.


Even when you "get me" for referencing windows, you're still wrong...

How many times will you try get out of being hideously horribly wrong 
about ACPI supporting disk/storage IO? It is neither "recent" nor "rare".


How much egg does your face really need before you just see that your 
fantasy that it's "new" and uncommon is a delusional mistake?



Methinks Misters Dunning and Kruger need a word with you...


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/4] vfs: don't let the dirty time inodes get more than a day stale

2014-11-21 Thread Theodore Ts'o

On Fri, Nov 21, 2014 at 02:19:07PM -0600, Andreas Dilger wrote:
> > -   if (inode->i_sb->s_flags & MS_LAZYTIME) {
> > +   if ((inode->i_sb->s_flags & MS_LAZYTIME) &&
> > +   (!inode->i_ts_dirty_day ||
> > +inode->i_ts_dirty_day == days_since_boot)) {
> > spin_lock(&inode->i_lock);
> > inode->i_state |= I_DIRTY_TIME;
> > spin_unlock(&inode->i_lock);
> > +   inode->i_ts_dirty_day = days_since_boot;
> 
> It isn't clear if this is correct.  It looks like the condition will
> only be entered if i_ts_dirty_day == days_since_boot, but that is only
> set inside the condition?

If i_ts_dirty_day is zero, the timestamps don't have to written to
disk.  This is either because the inode has been written to disk, or
the system has been up for less than a day, such that when we last a
lazy mtime update (i.e., we skipped the call to mark_inode_dirty)
since jiffies / (HZ * 86400) was zero.

If it is non-zero, then the timestamps were updated but were not sent
to disk N days since the system was booted.  So long as it remains N
days since the system was booted, we can skip calling
mark_inode_dirty().  But once it becomes N+1 days since the system was
booted, then we will call mark_inode_dirty() and set i_ts_dirty_day to
zero.

I'll add a comment so it's a bit more obvious what we're doing here,
but I'm pretty sure we currently have is in fact correct.

> and "days_since_boot" should be declared unsigned short so it wraps
> in the same way as i_ts_dirty_day

Good point, thanks.  This will only be an issue after the system has
been up for almost 90 years, but we might as well get it right,

  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Robert White


On 11/21/2014 01:12 PM, Robert White wrote:
> (wrong links included in post...)
Dangit... those two links were bad... wrong clipboard... /sigh...

I'll just stand on the pasted text from the driver. 8-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/4] fs: split update_time() into update_time() and write_time()

2014-11-21 Thread Theodore Ts'o

Out of curiosity, why does btrfs_update_time() need to call
btrfs_root_readonly()?  Why can't it just depend on the
__mnt_want_write() call in touch_atime()?

Surely if there are times when it's not OK to write into a btrfs file
system and mnt_is_readonly() returns false, the VFS is going to get
very confused abyway.

If the btrfs_update_time() is not necessary, then we could drop
btrfs_update_time() and update_time() from the inode operations
entirely, and depend on the VFS-level code in update_time().

   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC][PATCH] dm: add dm-power-fail target

2014-11-21 Thread Josef Bacik

Hello,

I'm hoping some FS guys can weigh in and verify my approach for testing power
fail conditions, and the DM guys to of course verify I didn't completely fail at
making a DM target.  All suggestions welcome, I want to have a nice robust tool
for testing power fail so we can be sure our fsync/commit code is all working
properly.  Thanks,

Josef


For power fail testing we want to be able to catch cases where the fs isn't
waiting for IO to complete properly.  This target aims to do this by creating a
linear mapping on a device and keeping track of all WRITEs and FLUSH/FUA
operations.  Any WRITEs are added to a global list on completion, then when we
submit a FLUSH this list is spliced onto the request and once that request
completes then it marks those blocks as valid.  FUA bypasses this logic
altogether and is considered valid once it completes.  There are two modes of
operation here

1) Zero - any WRITE that wasn't flushed will return 0's when we try to read from
it.  This is meant for BTRFS (or any COW fs) where we only write blocks once, so
if we try to read a block we didn't properly flush then we can just send back
0's.

2) Split - the device is split in half and written to alternating sides.  This
is for overwriting fs'es (ext*/xfs).  Once a FLUSH occurs we walk all the
completed WRITEs and set their mirror to whichever mirror they last wrote to.
Anything that wasn't flushed properly will point to its previous mirror.

We then have 3 different power fail events we can do

1) DROP_WRITES - just drop all writes immediately, any new writes are
immediately completed and outstanding ones are treated as if they never
completed.

2) DROP_AFTER_FUA - this allows a FUA to go through, and as soon as it completes
we drop all writes.  This is meant to be really evil about making sure your
commit code is doing the right thing by dropping at the worst possible moment.

3) DROP_AFTER_FLUSH - this allows a FLUSH to go through, and as soon as it
completes we drop all writes.  This can be good for testing fdatasync on
overwrite fs'es that may only issue a FLUSH and not have to update metadata.

There is also an option to return -EIO as soon as we have our power fail event
to make it easier to script stress tests.

The idea is to be as evil as possible wrt how a device's cache would react to a
powerfail event.  I had originally bolted this onto dm-flakey but it got
complicated trying to work in these new options with its original behavior so I
created a new target instead.

Signed-off-by: Josef Bacik 
---
 drivers/md/Kconfig |  14 +
 drivers/md/Makefile|   1 +
 drivers/md/dm-power-fail.c | 691 +
 3 files changed, 706 insertions(+)
 create mode 100644 drivers/md/dm-power-fail.c

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 5bdedf6..bc3d6ca 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -432,4 +432,18 @@ config DM_SWITCH
 
  If unsure, say N.
 
+config DM_POWER_FAIL
+   tristate "Power fail target support"
+   depends on BLK_DEV_DM
+   ---help---
+ This device-mapper target creates a device that can be used for
+ testing a file systems ability to survive power fails.  There are
+ several modes of operation in order to test a variety of power fail
+ scenarios.
+
+ To compile this code as a module, choos M here: the module will be
+ called dm-power-fail.
+
+ If unsure, say N.
+
 endif # MD
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index a2da532..c667218 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -55,6 +55,7 @@ obj-$(CONFIG_DM_CACHE)+= dm-cache.o
 obj-$(CONFIG_DM_CACHE_MQ)  += dm-cache-mq.o
 obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o
 obj-$(CONFIG_DM_ERA)   += dm-era.o
+obj-$(CONFIG_DM_POWER_FAIL)+= dm-power-fail.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs+= dm-uevent.o
diff --git a/drivers/md/dm-power-fail.c b/drivers/md/dm-power-fail.c
new file mode 100644
index 000..44c5045
--- /dev/null
+++ b/drivers/md/dm-power-fail.c
@@ -0,0 +1,691 @@
+/*
+ * Copyright (C) 2014 Facebook. All rights reserved.
+ *
+ * This file is released under the GPL.
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DM_MSG_PREFIX "power-fail"
+
+/*
+ * The way this interface is meant to be used is like this
+ *
+ * dmsetup create powerfail
+ * mkfs /dev/powerfail
+ * mount /dev/powerfail /mnt/test
+ * do some stuff &
+ * sleep 30
+ * dmsetup message powerfail (drop_after_flush|drop_after_fua|drop_writes)
+ * umount /mnt/test
+ * dmsetup message powerfail redirect_reads
+ * fsck /dev/powerfail || exit 1
+ * mount /dev/powerfail /mnt/test
+ * 
+ *
+ * You can set redirect_reads whenever you want, but the idea is that you want
+ * the teardown stuff to work like normal, and then flip the switch for us to
+ * start returning garbage on any writes that wou

Re: BTRFS messes up snapshot LV with origin

2014-11-21 Thread Duncan

Chris Murphy posted on Fri, 21 Nov 2014 11:23:45 -0700 as excerpted:

> On Thu, Nov 20, 2014 at 11:22 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> 
> 
>> When I have such a filesystem level problem, I simply dd from the
>> backing device to some other location, generally to a file that's on a
>> different filesystem (preferrably non-btrfs, I use reiserfs as I've
>> found it very resilient, here), in which case btrfs device scan won't
>> see the UUID on the copy as it scans block devices, not inside
>> non-device files.
> 
> That's hours of dd and you have to find space to do it.

I did it recently here.  There's a method to my sub-100-GiB partition 
madness! =:^)  The partitions in question were on SSD, and were small 
enough I could simply DD them to files on my media filesystem, which was 
after all designed to be able to take full ISO images, etc.

Additionally, due to size and reasonably consistent linear intra-file 
access patterns, the media filesystem's still on much cheaper spinning 
rust, while most of the system's on much faster to random-access but far 
more expensive SSD, so in this case one side was SSD, the other spinning 
rust.

Tho granted, if you're doing single-partition/filesystem multi-TiB 
filesystems, it does get to be a problem.  As there would have been if 
the filesystem in question was the media filesystem, altho that one's not 
yet btrfs for a reason.  But still, if there's room enough for an LVM 
snapshot in the first place, with a different layout, there'd be room for 
the same data as a file.  That's pretty basic.

>> After all, an LVM block-level snapshot takes the same space as a file
>> containing the same raw data, and if there's room for the data in an
>> LVM snapshot, given a different layout, there's room for exactly the
>> same amount of data as a file on a different filesystem, piped thru
>> some compressor if necessary due to tight datasize constraints.
> 
> That's not true for thin volume snapshots. They take up next to no space
> upon creation, they don't need space reserved in advance.

Thus the mention of compression if necessary.  Thin-volume snapshots are 
effectively compression by another name, and a raw dd from them should 
compress pretty much equally well, depending on compression method 
chosen, of course. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BTRFS messes up snapshot LV with origin

2014-11-21 Thread Duncan

Zygo Blaxell posted on Fri, 21 Nov 2014 12:56:23 -0500 as excerpted:

> It's not a bug as long as I can completely control which devices are
> searched for UUIDs, and the system behaves sanely when multiple UUIDs
> are found through automatic discovery; otherwise, it's not only a bug,
> it's a DoS attack security vulnerability.  Consider what happens if
> someone looks at /sys/fs/btrfs, reads the non-secret UUIDs, builds a
> fake filesystem with those UUIDs, puts the fake filesystem on a USB
> stick, and plugs it back into the victim machine...

With the current state of USB vulnerability (firmware reprogrammed as an 
input device, etc, the vuln has been all over the tech news for some 
months now), anyone with USB access to the machine is simply another case 
of anyone with physical access to the machine, they're normally assumed 
to be able to be able to at minimum take down the machine, the ultimate 
DoS, in any case, and often to have effective root, tho that can be 
mitigated to some extent with encryption, etc.  It's generally assumed 
that if you have physical access, as required to plug in that USB, game 
over, the machine is effectively p40wn3d.  At the /very/ least, with 
physical access it's vulnerable to the sledgehammer DoS, and there's 
little to be done about that but prevent physical access by all means 
necessary (armed guards, nuclear silo hosting, etc) in the first place.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BTRFS messes up snapshot LV with origin

2014-11-21 Thread Duncan

Duncan posted on Fri, 21 Nov 2014 22:49:06 + as excerpted:

> Chris Murphy posted...

>> That's not true for thin volume snapshots. They take up next to no
>> space upon creation, they don't need space reserved in advance.
> 
> Thus the mention of compression if necessary.  Thin-volume snapshots are
> effectively compression by another name, and a raw dd from them should
> compress pretty much equally well, depending on compression method
> chosen, of course. =:^)

Oops, I mis-parsed "thin".  Good point and thanks, Chris.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: disk-io: replace root args iff only fs_info used

2014-11-21 Thread David Sterba

On Sat, Nov 22, 2014 at 01:37:10AM +0900, Daniel Dressler wrote:
> What would a cover letter be like? Would that be a separate email to
> the list, or maybe the first email in a patch series?

It's a separate mail that does not carry any diff but an overview of
the following patches. The patches are threaded under that mail. This is
what the git command does for you:

  git format-patch -o output --thread --cover-letter from..to

and in the directory 'output' you'll find the cover letter plus patches.
The cover contains some stub and should be edited.  Then send them via
'git send-email'.

> Sorry I've twice looked for the integration repo. I found some that
> look like it could be but those had older commits. Could you direct me
> to the exact branch I'd love to work against it. These patches were
> done against linux-next.

The integration is in Chris' git, but the branch may not be the most
recent compared to Linus' tree or the pending for-linus branches. This
depens on the phase of the development cycle or the stability of the
patches in the integration branch as it's supposed to be base of the
next pull.

What you did is fine under current conditions. If the integration is
made public you can check if your patches are merged or not and then
refresh the patch series eventually.

> I think small one function patches might be best. I have the codebase
> mapped out and each file's functions-to-be-cleaned count varies
> wildly. If I did batch files together and split large files apart
> there would be no rhyme or reason for the groupings. With single
> function patches it is very clear what changes are justified since
> they should only occur in the affected function or in a call-site.
> With multiple functions the call-site changes get mixed up would it
> would be harder to review.

Up to you.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: BTRFS messes up snapshot LV with origin

2014-11-21 Thread Duncan

Duncan posted on Fri, 21 Nov 2014 23:41:49 + as excerpted:

> Duncan posted on Fri, 21 Nov 2014 22:49:06 + as excerpted:
> 
>> Chris Murphy posted...
> 
>>> That's not true for thin volume snapshots. They take up next to no
>>> space upon creation, they don't need space reserved in advance.
>> 
>> Thus the mention of compression if necessary.  Thin-volume snapshots
>> are effectively compression by another name, and a raw dd from them
>> should compress pretty much equally well, depending on compression
>> method chosen, of course. =:^)
> 
> Oops, I mis-parsed "thin".  Good point and thanks, Chris.

... And Zygo, who pointed out my error as well. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: ctree: reduce args where only fs_info used

2014-11-21 Thread David Sterba

On Sat, Nov 22, 2014 at 01:03:32AM +0900, Daniel Dressler wrote:
> No problem, I'll redo everything so it is one function per patch. Now
> fair warning: there are about 102 functions to cleanup. I was a bit
> worried that many patches would cause too much maintainer overhead but
> it is no problem for me.

Yeah, I'm aware that it's all over the sources. I'd say send no more
than 30 patches in a burst first and see how it'd work.

> Only a few functions have dependecies on
> other functions needing cleanup. Thus there will be some small patch
> series for those function sets. A big benefit of one function one
> patch is that extent-io.c will no longer be a 34 function monster
> patch.
> 
> Is there any rate limiting I should be doing? I don't want to flood
> the list with burst of dozen plus patches, or is that an okay volume?

Should be fine.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Zygo Blaxell

On Fri, Nov 21, 2014 at 11:06:19AM -0700, Chris Murphy wrote:
> On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell  
> wrote:
> 
> > I run 'smartctl -t long' from cron overnight (or whenever the drives
> > are most idle).  You can also set up smartd.conf to launch the self
> > tests; however, the syntax for test scheduling is byzantine compared to
> > cron (and that's saying something!).  On multi-drive systems I schedule
> > a different drive for each night.
> >
> > If you are also doing btrfs scrub, then stagger the scheduling so
> > e.g. smart runs in even weeks and btrfs scrub runs in odd weeks.
> >
> > smartd is OK for monitoring test logs and email alerts.  I've had no
> > problems there.
> 
> Most attributes are always updated without issuing a smart test of any
> kind. A drive I have here only has four offline updateable attributes.

One of those four is Offline_Uncorrectable, which is a really important
attribute to monitor!

> When it comes to bad sectors, the drive won't use a sector that
> persistently fails writes. So you don't really have to worry about
> latent bad sectors that don't have data on them already. The sectors
> you care about are the ones with data. A scrub reads all of those
> sectors.

A scrub reads all the _allocated_ sectors.  A long selftest reads
_everything_, and also exercises the electronics and mechanics of the
drive in ways that normal operation doesn't.  I have several disks that
are less than 25% occupied, which means scrubs will ignore 75% of the
disk surface at any given time.

A sharp increase in the number of bad sectors (no matter how they are
detected) usually indicates a total drive failure is coming.  Many drives
have been nice enough to give me enough warning for their RMA replacements
to be delivered just a few hours before the drive totally fails.

> First the drive could report a read error in which case Btrfs
> raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or
> rebuild it from parity, and write to the affected sector; and also
> this same mechanism happens in normal reads so it's a kind of passive
> scrub. But it happens to miss checking inactively read data, which a
> scrub will check.
> 
> Second, the drive could report no problem, and Btrfs raid1/10 could
> still fix the problem in case of a csum mismatch. And it looks like
> soonish we'll see this apply to raid5/6.
> 
> So I think a nightly long smart test is a bit overkill. I think you
> could do nightly -t short tests which will report problems scrub won't
> notice, such as higher seek times or lower throughput performance. And
> then scrub once a week.

Drives quite often drop a sector or two over the years, and it can
be harmless.  What you want to be watching out for is hundreds of bad
sectors showing up over a period of few days--that means something is
rattling around on the disk platters, damaging the hardware as it goes.
To get that data, you have to test the disks every few days.

> > The drive itself could be failing in some way that prevents recording
> > SMART errors (e.g. because of host timeouts triggering a bus reset,
> > which also prevents the SMART counter update for what was going wrong at
> > the time).  This is unfortunately quite common, especially with drives
> > configured for non-RAID workloads.
> 
> Libata resetting the link should be recorded in kernel messages.

This is true, but the original question was about SMART data coverage.
This is why it's important to monitor both.

> -- 
> Chris Murphy

signature.asc
Description: Digital signature

Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-21 Thread Ian Armstrong

On Fri, 21 Nov 2014 10:45:21 -0700 Chris Murphy wrote:

> On Fri, Nov 21, 2014 at 5:55 AM, Ian Armstrong 
> wrote:
> 
> > In my situation what I've found is that if I scrub & let it fix the
> > errors then a second pass immediately after will show no errors. If
> > I then leave it a few days & try again there will be errors, even in
> > old files which have not been accessed for months.
> 
> What are the devices? And if they're SSDs are they powered off for
> these few days? I take it the scrub error type is corruption?

It's spinning rust and the checksum error is always on the one drive
(a SAMSUNG HD204UI). The firmware has been updated, since some were
shipped with a bad version which could result in data corruption.

> You can use badblocks to write a known pattern to the drive. Then
> power off and leave it for a few days. Then read the drive, matching
> against the pattern, and see if there are any discrepancies. Doing
> this outside the code path of Btrfs would fairly conclusively indicate
> whether it's hardware or software induced.

Unfortunately I'm reluctant to go the badblock route for the entire
drive since it's the second drive in a 2 drive raid1 and I don't
currently have a spare. There is a small 6G partition that I can use,
but given that the drive is large and the errors are few, it could take
a while for anything to show.

I also have a second 2 drive btrfs raid1 in the same machine that
doesn't have this problem. All the drives are running off the same
controller.

> Assuming you have another copy of all of these files :-) you could
> just sha256sum the two copies to see if they have in fact changed. If
> they have, well then you've got some silent data corruption somewhere
> somehow. But if they always match, then that suggests a bug.

Some of the files already have an md5 linked to them, while others have
parity files to give some level of recovery from corruption or damage.
Checking against these show no problems, so I assume that btrfs is
doing its job & only serving an intact file.

> I don't
> see how you can get bogus corruption messages, and for it to not be a
> bug. When you do these scrubs that come up clean, and then later come
> up with corruptions, have you done any software updates?

No software updates between clean & corrupt. I don't have to power down
or reboot either for checksum errors to appear.

I don't think the corruption messages are bogus, but are indicating a
genuine problem. What I would like to be able to do is compare the
corrupt block with the one used to repair it and see what the difference
is. As I've already stated, the system logs are clean & the smart logs
aren't showing any issues. (Well, until today when a self-test failed
with a read error, but it must be an unused sector since the scrub
doesn't hit it & there are no re-allocated sectors yet)

> > My next step is to disable autodefrag & see if the problem persists.
> > (I'm not suggesting a problem with autodefrag, I just want to
> > remove it from the equation & ensure that outside of normal file
> > access, data isn't being rewritten between scrubs)
> 
> I wouldn't expect autodefrag to touch old files not accessed for
> months. Doesn't it only affect actively used files?

The drive is mainly used to hold old archive files, though there are
daily rotating files on it as well. The corruption affects both new and
old files.

-- 
Ian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

59 matches

Mail list logo