Instant reboot while defragging on 3.13-rc5

2013-12-29 Thread Szőts Ákos
Dear list members,

I upgraded to the 3.13-rc5 kernel and started to defragment my whole file 
system with the following commands:

cd /
for i in */*; do if [[ $i != windows* ]]; then echo --$i--; btrfs fi defrag -clzo -r $i; fi; done 2>&1 | tee /root/defrag

After 10 or 15 seconds my computer reboots itself without any warning. In 
dmesg I see only the following:
[  531.866566] btrfs: device fsid 2cc2b266-3f04-4f1c-8477-cf7efd6ac139 devid 1 
transid 396679 /dev/sda7
[  532.001919] watchdog watchdog0: watchdog did not stop!
[  532.002719] watchdog watchdog0: watchdog did not stop!

Alongside this I'm constantly getting sysfs WARNINGs in dmesg, but I 
don't think they are related:

WARNING: CPU: 5 PID: 1644 at /home/abuild/rpmbuild/BUILD/kernel-
desktop-3.13.rc5/linux-3.13-rc5/fs/sysfs/group.c:214 device_del+0x3b/0x1b0()
sysfs group 81c7e040 not found for kobject 'hidraw2'

Versions:
OS: openSUSE 13.1
Kernel: 3.13.0-rc5-1.g7127d5f-desktop
btrfs: v3.12+20131125

Ákos
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is anyone using btrfs send/receive for backups instead of rsync?

2013-12-29 Thread Duncan
Chris Murphy posted on Sat, 28 Dec 2013 16:11:37 -0700 as excerpted:

> I am slightly bugged about a 16MB file having nearly 2000 extents,
> basically it's being turned into a bunch of 8KB fragments. I know
> nothing of the pros and cons of how systemd is writing journals, but
> they don't seem very big so I don't understand why they're preallocated,
> which on btrfs appears instantly defeated by COW upon the journal being
> modified. It seems to me either the journal doesn't need to be
> preallocated (at least on btrfs) or maybe systemd should set xattr +C on
> /var/log/journal?
> 
> That does disable checksumming though, along with data cow.

While I don't (yet?) use systemd here, from what I understand of its 
journal, it's essentially a binary database, and isn't necessarily even 
sequentially written as is traditional with log files.  That would 
explain the preallocation.  And given your mention of 8KB fragments, I 
wonder if that's its record-size?

Meanwhile, as with any pre-allocated-then-written-into file, including VM 
images and pre-allocated bittorrent files, the systemd journal is a known 
worst-case for COW filesystems including btrfs.

And setting NOCOW on the file (xattr +C) before those write-intos is the 
knob btrfs exposes to deal with the problem... for all the above cases.
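For anyone following along, the attribute is set with chattr(1) from e2fsprogs, and it only affects data written after it is set, so it is normally applied to the empty directory before the files are created. A minimal sketch (the paths are examples; the fallback covers filesystems without NOCOW support):

```shell
# Set NOCOW on a directory so that files created inside it afterwards
# inherit the attribute (btrfs only; the directory here is an example).
d=$(mktemp -d)
# chattr +C fails on filesystems without NOCOW support, so fall back gracefully.
if chattr +C "$d" 2>/dev/null; then
    status="nocow set"
else
    status="nocow unsupported on this filesystem"
fi
touch "$d/journal.bin"           # new files inherit +C on btrfs
lsattr -d "$d" 2>/dev/null || true
echo "$status"
rm -rf "$d"
```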

Yes, it does turn off checksumming as well as COW, but given the write-
into scenario, that's actually best anyway. Otherwise btrfs has to keep 
updating the checksums as the internal writes occur, which is both CPU 
intensive and potentially rate-limiting, and an invitation to race 
conditions, since the writing application and btrfs' checksumming are 
constantly lock-fighting: the one to update the file, the other to 
update the checksum based on the new data.

But in all these cases, it's also quite common for the application doing 
the writing to have its own checksumming/error-detection and possible 
correction -- it pretty much comes with the territory -- in which case 
btrfs attempting to do the same is simply superfluous even if it weren't 
a race-condition trigger.  Certainly torrents include checksumming -- 
that automatic guarantee of download integrity is part of what makes 
the protocol as popular as it is.  So do databases.  I don't actually 
know enough about VMs to know if it's the case there or not, but 
certainly, unexpected bit-flipping is likely to corrupt and crash the VM, 
just as it tends to do with on-the-metal operating systems.

If/when the file reaches effective stasis, as with a torrented file once 
it's fully downloaded, /then/ it's reasonable to drop the NOCOW and do a 
final (sequential-write) copy/move so btrfs has it checksummed too.  And 
database and VM backups... if they're not being actively used, btrfs 
checksumming can guard against bitrot there too.  Similarly systemd's 
binary journal, once those are taken out of active logging, yeah, let 
btrfs do its normal thing.
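That final copy/move can be done with ordinary tools, since a plain sequential rewrite produces fresh (COW, checksummed) extents on btrfs; removing the attribute alone does not retroactively checksum existing data. A minimal sketch (file names are examples):

```shell
# Rewrite a formerly-NOCOW file so btrfs checksums it (sketch; paths are examples).
src=$(mktemp)
printf 'finished download\n' > "$src"   # stand-in for the completed file
dst="$src.cow"
# A plain sequential copy writes fresh (checksummed) extents on btrfs.
cp "$src" "$dst"
mv "$dst" "$src"                        # replace the original in one rename
content=$(cat "$src")
rm -f "$src"
```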

But for all these cases, as long as the files are being actively written 
into, NOCOW, including its NOSUM implications, is exactly and precisely 
what they SHOULD be when the filesystem hosting them is btrfs.

And I'm predicting that, since btrfs is the assumed successor to the ext* 
series as the Linux default filesystem, and systemd is targeting Linux 
default init-system status as well, at some point systemd will detect 
what filesystem it's logging to and automatically set NOCOW on the 
journal file when that filesystem is btrfs.  Most Linux-targeted 
databases and file-preallocating torrent 
clients will no doubt do exactly the same thing.  Either that, or in 
their documentation, they'll suggest setting NOCOW on the target 
directory when setting up the app in the first place.
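Detecting the filesystem is the easy part; a sketch of the check such a program might make, using GNU stat (the path is an example):

```shell
# Print the filesystem type backing a path; an application could decide to
# set NOCOW only when this reports btrfs. (Sketch; the path is an example.)
fstype=$(stat -f -c %T /)
echo "root filesystem: $fstype"
if [ "$fstype" = "btrfs" ]; then
    echo "would set NOCOW on the journal directory"
fi
```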

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



[PATCH 1/2] Btrfs: return free space to global_rsv as much as possible

2013-12-29 Thread Liu Bo
@full is not protected by global_rsv's lock, so we may think global_rsv
is already full when in fact it's not, and thus miss the opportunity to
return free space to global_rsv directly when we release other block_rsvs.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9c01509..009980c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4637,7 +4637,7 @@ void btrfs_block_rsv_release(struct btrfs_root *root,
 			     u64 num_bytes)
 {
 	struct btrfs_block_rsv *global_rsv = &root->fs_info->global_block_rsv;
-	if (global_rsv->full || global_rsv == block_rsv ||
+	if (global_rsv == block_rsv ||
 	    block_rsv->space_info != global_rsv->space_info)
 		global_rsv = NULL;
 	block_rsv_release_bytes(root->fs_info, block_rsv, global_rsv,
-- 
1.8.2.1



[PATCH 2/2] Btrfs: release subvolume's block_rsv before transaction commit

2013-12-29 Thread Liu Bo
We don't have to keep the subvolume's block_rsv during transaction commit;
moreover, within the commit we may need the free space reclaimed from this
block_rsv to process delayed refs.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/ioctl.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 21da576..347bf61 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -417,7 +417,8 @@ static noinline int create_subvol(struct inode *dir,
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
-		goto out;
+		btrfs_subvolume_release_metadata(root, &block_rsv, qgroup_reserved);
+		return ret;
 	}
 	trans->block_rsv = &block_rsv;
 	trans->bytes_reserved = block_rsv.size;
@@ -542,6 +543,8 @@ static noinline int create_subvol(struct inode *dir,
 fail:
 	trans->block_rsv = NULL;
 	trans->bytes_reserved = 0;
+	btrfs_subvolume_release_metadata(root, &block_rsv, qgroup_reserved);
+
 	if (async_transid) {
 		*async_transid = trans->transid;
 		err = btrfs_commit_transaction_async(trans, root, 1);
@@ -555,8 +558,6 @@ fail:
 
 	if (!ret)
 		d_instantiate(dentry, btrfs_lookup_dentry(dir, dentry));
-out:
-	btrfs_subvolume_release_metadata(root, &block_rsv, qgroup_reserved);
 	return ret;
 }
 
-- 
1.8.2.1



Migrate to bcache: A few questions

2013-12-29 Thread Kai Krakow
Hello list!

I'm planning to buy a small SSD (around 60GB) and use it for bcache in front 
of my 3x 1TB HDD btrfs setup (mraid1+draid0) with write-back caching. Btrfs 
is my root device, so the system must be able to boot from bcache using an 
init ramdisk. My /boot is a separate filesystem outside of btrfs and will be 
outside of bcache. I am running Gentoo.

I have a few questions:

* How stable is it? I've read about some csum errors lately...

* I want to migrate my current storage to bcache without replaying a backup.
  Is it possible?

* Did others already use it? What is the perceived performance for desktop
  workloads in comparison to not using bcache?

* How well does bcache handle power outages? Btrfs has handled them very
  well for many months now.

* How well does it play with dracut as initrd? Is it as simple as telling it
  the new device nodes or is there something complicated to configure?

* How does bcache handle a failing SSD when it starts to wear out in a few
  years?

* Is it worth waiting for hot-relocation support in btrfs to natively use
  a SSD as cache?

* Would you recommend going with a bigger/smaller SSD? I'm planning to use
  only 75% of it for bcache so wear-leveling can work better, maybe use
  another part of it for hibernation (suspend to disk).

Regards,
Kai



Re: systemd-journal, nodatacow, was: Is anyone using btrfs send/receive for backups instead of rsync?

2013-12-29 Thread Chris Murphy

On Dec 29, 2013, at 5:39 AM, Duncan 1i5t5.dun...@cox.net wrote:

> Yes, it does turn off checksumming as well as COW, but given the write-
> into scenario, that's actually best anyway, because otherwise btrfs has
> to keep updating the checksums

On second thought, I'm less concerned with bitrot and checksumming being lost 
with nodatacow, than I am with significantly increasing the chance the journal 
is irreparably lost due to corruption during an unclean shutdown.


> But in all these cases, it's also quite common for the application doing
> the writing to have its own checksumming/error-detection and possible
> correction -- it pretty much comes with the territory -- in which case
> btrfs attempting to do the same is simply superfluous even if it weren't
> a race-condition trigger.

I don't know what kind of checksumming systemd performs on the journal, but 
whenever Btrfs has found corruption with the journal file(s), systemd-journald 
has also found corruption and started a new log. So it makes sense to rely on 
its own mechanisms rather than Btrfs's.

> And I'm predicting that since btrfs is the assumed successor to the ext*
> series as the Linux default filesystem, and systemd is targeting Linux
> default initsystem status as well, it's only logical that at some point
> systemd will detect what filesystem it's logging to, and will
> automatically set NOCOW on the journal file when that filesystem is
> btrfs.

Is this something that should be brought up on the systemd-devel@ list? Or 
maybe file it as an RFE against systemd at freedesktop.org?

Chris Murphy


Re: Migrate to bcache: A few questions

2013-12-29 Thread Chris Murphy

On Dec 29, 2013, at 2:11 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:

 
> * How stable is it? I've read about some csum errors lately…

Seems like bcache devs are still looking into the recent btrfs csum issues.

 
> * I want to migrate my current storage to bcache without replaying a backup.
>   Is it possible?
> 
> * Did others already use it? What is the perceived performance for desktop
>   workloads in comparison to not using bcache?
> 
> * How well does bcache handle power outages? Btrfs has handled them very
>   well for many months now.
> 
> * How well does it play with dracut as initrd? Is it as simple as telling it
>   the new device nodes or is there something complicated to configure?
> 
> * How does bcache handle a failing SSD when it starts to wear out in a few
>   years?

I think most of these questions are better suited for the bcache list. There 
are still many uncertainties about the behavior of SSDs during power failures 
when they aren't explicitly designed with power failure protection in mind. At 
best I'd hope for a rollback involving data loss, but hopefully not a corrupt 
file system. I'd rather lose the last minute of data supposedly written to the 
drive than have to do a full restore from backup.

 
> * Is it worth waiting for hot-relocation support in btrfs to natively use
>   a SSD as cache?

I haven't read anything about it. Don't see it listed in project ideas.

 
> * Would you recommend going with a bigger/smaller SSD? I'm planning to use
>   only 75% of it for bcache so wear-leveling can work better, maybe use
>   another part of it for hibernation (suspend to disk).

I think that depends greatly on workload. If you're writing or reading a lot 
of disparate files, or doing a lot of small random writes (as on a mail 
server), I'd go bigger. By default sequential IO isn't cached. So I think you 
can get a big boost in responsiveness with a relatively small bcache size.
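For what it's worth, that sequential-bypass threshold is tunable at runtime through bcache's sysfs interface; a sketch, assuming a cache device is already registered as bcache0 (the device name is an example, and these commands only work on a system with bcache active):

```shell
# Tune bcache's sequential cutoff (sketch; bcache0 is an example device).
# Writing 0 caches everything, including sequential IO; 4M restores a
# typical threshold.
echo 4M > /sys/block/bcache0/bcache/sequential_cutoff
# Check hit statistics to judge whether the cache size is adequate:
cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio
```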


Chris Murphy


Re: Migrate to bcache: A few questions

2013-12-29 Thread Kai Krakow
Chris Murphy li...@colorremedies.com schrieb:

> I think most of these questions are better suited for the bcache list.

Ah yes, you are right. I will repost the non-btrfs-related questions to the 
bcache list. But actually I am most interested in using bcache together with 
btrfs, so getting a general picture of the current state of this combination 
would be nice - and so these questions may be partially appropriate here.

> I think there are still many uncertainties about the behavior of SSDs during
> power failures when they aren't explicitly designed with power failure
> protection in mind. At best I'd hope for a rollback involving data loss,
> but hopefully not a corrupt file system. I'd rather lose the last minute
> of data supposedly written to the drive, than have to do a full restore
> from backup.

These thoughts are actually quite interesting. So you are saying that data 
may not be fully written to the SSD although the kernel thinks it is? This is 
probably very dangerous: the bcache module could not ensure coherence 
between its backing devices and its own contents, and data loss would occur 
and probably destroy important file system structures.

I understand you as saying that data may be only partially written. This, of 
course, may happen with HDDs as well. But a file system usually works with 
transactions, so the last incomplete transaction can simply be thrown away. I 
hope bcache implements the same architecture. But what does it mean for the 
stacked write-back architecture?

As I understand, bcache may use write-through for sequential writes, but 
write-back for random writes. In this case, part of the data may have hit 
the backing device, other data does only exist in the bcache. If that last 
transaction is not closed due to power-loss, and then thrown away, we have 
part of the transaction already written to the backing device that the 
filesystem does not know of after resume.

I'd appreciate some thoughts about it but this topic is probably also best 
moved over to the bcache list.

Thanks,
Kai 



Re: Migrate to bcache: A few questions

2013-12-29 Thread Chris Murphy

On Dec 29, 2013, at 6:22 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote:

> So you are saying that data
> may not be fully written to SSD although the kernel thinks so?

Drives shouldn't lie when asked to flush to disk, but they do. This older 
article at LWN is a decent primer on the subject of write barriers:

http://lwn.net/Articles/283161/

> This is
> probably very dangerous. The bcache module could not ensure coherence
> between its backing devices and its own contents - and data loss will occur
> and probably destroy important file system structures.

I don't know the details; there's more on lkml.org and the bcache lists. My 
impression is that, short of bugs, it should be much safer than you describe. 
It's not like a linear/concat md or LVM device-failure scenario. There's good 
info in the bcache.h file:

http://lxr.free-electrons.com/source/drivers/md/bcache/bcache.h

If anything, once the kinks are worked out, under heavy random write IO I'd 
expect bcache to improve the likelihood data isn't lost. Faster speed of SSD 
means we get a faster commit of the data to stable media. Also bcache assumes 
the cache is always dirty on startup, no matter whether the shutdown was clean 
or dirty, so the code is explicitly designed to resolve the state of the cache 
relative to the backing device. It's actually pretty fascinating work.

It may not be required, but I'd expect we'd want the write cache on the backing 
device disabled. It should still honor write barriers but it kinda seems 
unnecessary and riskier to have it enabled (which is the default with consumer 
drives).
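For reference, the drive's write-cache state can be inspected and changed with hdparm; a sketch (the device node is an example, this needs root, and whether disabling it is advisable depends on the setup):

```shell
# Inspect and disable the volatile write cache on a backing device
# (sketch; /dev/sdb is an example).
hdparm -W /dev/sdb      # report current write-cache state
hdparm -W 0 /dev/sdb    # disable the drive's write cache
# To make this persistent, distributions typically provide a udev rule
# or an hdparm config file rather than a one-shot command.
```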


> As I understand, bcache may use write-through for sequential writes, but
> write-back for random writes. In this case, part of the data may have hit
> the backing device, other data does only exist in the bcache. If that last
> transaction is not closed due to power-loss, and then thrown away, we have
> part of the transaction already written to the backing device that the
> filesystem does not know of after resume.

In the write through case we should be no worse off than the bare drive in a 
power loss. In the write back case the SSD should have committed more data than 
the HDD could have in the same situation. I don't understand the details of how 
partially successful writes to the backing media are handled when the system 
comes back up. Since bcache is also COW, SSD blocks aren't reused until data is 
committed to the backing device.


Chris Murphy


Re: Migrate to bcache: A few questions

2013-12-29 Thread Duncan
Kai Krakow posted on Sun, 29 Dec 2013 22:11:16 +0100 as excerpted:

> Hello list!
> 
> I'm planning to buy a small SSD (around 60GB) and use it for bcache in
> front of my 3x 1TB HDD btrfs setup (mraid1+draid0) using write-back
> caching. Btrfs is my root device, thus the system must be able to boot
> from bcache using init ramdisk. My /boot is a separate filesystem
> outside of btrfs and will be outside of bcache. I am using Gentoo as my
> system.

Gentooer here too. =:^)

> I have a few questions:
> 
> * How stable is it? I've read about some csum errors lately...

FWIW, both bcache and btrfs are new and still-developing technology.  
While I'm using btrfs here, I have tested-usable backups (which for root 
means either directly bootable, or that you have tested booting to a 
recovery image and restoring from there; I do the former here), as 
STRONGLY recommended for btrfs in its current state, but haven't had 
to use them.

And I considered bcache previously and might otherwise be using it, but 
at least personally, I'm not willing to try BOTH of them at once, since 
neither one is mature yet and if there are problems as there very well 
might be, I'd have the additional issue of figuring out which one was the 
problem, and I'm personally not prepared to deal with that.

Instead, at this point I'd recommend choosing /either/ bcache /or/ btrfs, 
and using bcache with a more mature filesystem like ext4 or (what I used 
for years previous and still use for spinning rust) reiserfs.

And as I said, keep your backups as current as you're willing to deal 
with losing what's not backed up, and tested usable and (for root) either 
bootable or restorable from alternate boot, because while at least btrfs 
is /reasonably/ stable for /ordinary/ daily use, there remain corner-
cases and you never know when your case is going to BE a corner-case!

> * I want to migrate my current storage to bcache without replaying a
>   backup.  Is it possible?

Since I've not actually used bcache, I won't try to answer some of these, 
but will answer based on what I've seen on the list where I can...  I 
don't know on this one.

> * Did others already use it? What is the perceived performance for
>   desktop workloads in comparison to not using bcache?

Others are indeed already using it.  I've seen some btrfs/bcache problems 
reported on this list, but as mentioned above, when both are in use that 
means figuring out which is the problem, and at least from the btrfs side 
I've not seen a lot of resolution in that regard.  From here it /looks/ 
like that's simply being punted at this time, as there are still more 
easily traceable problems, without the additional bcache variable, to work 
on first.  But it's quite possible the bcache list is actively tackling 
btrfs/bcache combination problems; I'm not subscribed there.

So I can't answer the desktop performance comparison question directly, 
but given that I /am/ running btrfs on SSD, I /can/ say I'm quite happy 
with that. =:^)

Keep in mind...

We're talking storage cache here.  Given the cost of memory and common 
system configurations these days, 4-16 gig of memory on a desktop isn't 
unusual or cost prohibitive, and a common desktop working set should well 
fit.

I suspect my desktop setup, 16 gigs memory backing a 6-core AMD fx6100 
(bulldozer-1) @ 3.6 GHz, is probably a bit toward the high side even for 
a gentooer, but not inordinately so.  Based on my usage...

Typical app memory usage runs 1-2 GiB (that's with KDE 4.12.49. from 
the gentoo/kde overlay, but USE=-semantic-desktop, etc).  Buffer memory 
runs a few MiB but isn't normally significant, so it can fold into that 
same 1-2 GiB too.

That leaves a full 14 GiB for cache.  But at least with /my/ usage, 
normal non-update cache memory usage tends to be below ~6 GiB too, so 
total apps/buffer/cache memory usage tends to be below 8 GiB as well.

When I'm doing multi-job builds or working with big media files, I'll 
sometimes go above 8 gig usage, and that occasional cache-spill was why I 
upgraded to 16 gig.  But in practice, 10 gig would take care of that most 
of the time, and were it not for the accident of powers-of-two meaning 
16 gig is the notch above 8 gig, 10 or 12 gig would be plenty.  Truth be 
told, I so seldom use that last 4 gig that it's almost embarrassing.

* Tho if I ran multi-GiB VMs that'd use up that extra memory real fast!  
But while that /is/ becoming more common, I'm not exactly sure I'd 
classify 4 gigs plus of VM usage as desktop usage just yet.  
Workstation, yes, and definitely server, but not really desktop.

All that as background to this...

* Cache works only after first access.  If you only access something 
occasionally, it may not be worth caching at all.

* Similarly, if access isn't time critical, think of playing a huge video 
file where only a few meg in memory at once is plenty, and where storage 
access is several times faster than play-speed, cache isn't particularly 
useful.

* Bcache 

[PATCH] Btrfs: use WARN_ON_ONCE instead for btrfs_invalidate_inodes

2013-12-29 Thread Liu Bo
After a transaction is aborted, we need to clean up inode resources by
calling btrfs_invalidate_inodes(). btrfs_invalidate_inodes() historically
expects the root's refs to be zero and warns via WARN_ON(); however, this
is not always true while cleaning up an aborted transaction, so
WARN_ON_ONCE() is better and we won't get a syslog message bomb.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f1a7744..ef7f6af 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4761,7 +4761,7 @@ void btrfs_invalidate_inodes(struct btrfs_root *root)
 	struct inode *inode;
 	u64 objectid = 0;
 
-	WARN_ON(btrfs_root_refs(&root->root_item) != 0);
+	WARN_ON_ONCE(btrfs_root_refs(&root->root_item) != 0);
 
 	spin_lock(&root->inode_lock);
 again:
-- 
1.8.2.1
