Re: Copying related snapshots to another server with btrfs send/receive?

2014-05-07 Thread Marc MERLIN
On Mon, May 05, 2014 at 03:24:45AM +, Duncan wrote:
 *However*: snapshotting a read-only snapshot and making the new one 
 writable is easy enough[1].  Just keep the originals read-only so they 
 can be used as parents/clones, and make a second, writable snapshot of 
 the first, to do your writable stuff in.
 
 ---
 [1]  Snapshotting a snapshot: I'm getting a metaphorical flashing light 

I already snapshot ro snapshots as rw snapshots and that works fine.
I actually rely on this in my script:
http://marc.merlins.org/perso/btrfs/post_2014-03-22_Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive.html
(skip to the bottom)
# We make a read-write snapshot in case you want to use it for a chroot
# and some testing with a writeable filesystem or want to boot from a
# last good known snapshot.
btrfs subvolume snapshot $src_newsnap $src_newsnaprw
$ssh btrfs subvolume snapshot $dest_pool/$src_newsnap $dest_pool/$src_newsnaprw

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/2] btrfs: Add missing device check in dev_info/rm_dev ioctl

2014-05-07 Thread Anand Jain



Thanks for working on this.
I am running some tests and will let you know.

Anand


On 05/06/2014 02:33 PM, Qu Wenruo wrote:

Old btrfs can't find a missing btrfs device since there is no
mechanism for the block layer to inform the fs layer.

But we can use a workaround that checks the status (by using
request_queue->queue_flags) of every device in a btrfs
filesystem only when the dev_info/rm_dev ioctls are called, since other ioctls
do not really care about missing devices.

Cc: Anand Jain anand.j...@oracle.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/ioctl.c   |  1 +
 fs/btrfs/volumes.c | 25 ++++++++++++++++++++++++-
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0401397..7680a40 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2606,6 +2606,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 		goto out;
 	}
 
+	btrfs_check_dev_missing(root, dev, 1);
 	di_args->devid = dev->devid;
 	di_args->bytes_used = dev->bytes_used;
 	di_args->total_bytes = dev->total_bytes;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d241130a..c7d7908 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1548,9 +1548,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		 * is held.
 		 */
 		list_for_each_entry(tmp, devices, dev_list) {
+			btrfs_check_dev_missing(root, tmp, 0);
 			if (tmp->in_fs_metadata &&
 			    !tmp->is_tgtdev_for_dev_replace &&
-			    !tmp->bdev) {
+			    (!tmp->bdev || tmp->missing)) {
 				device = tmp;
 				break;
 			}
@@ -6300,3 +6301,25 @@ int btrfs_scratch_superblock(struct btrfs_device *device)
 
 	return 0;
 }
+
+/* If need_lock is set, uuid_mutex will be used */
+int btrfs_check_dev_missing(struct btrfs_root *root, struct btrfs_device *dev,
+			    int need_lock)
+{
+	struct request_queue *q;
+
+	if (unlikely(!dev || !dev->bdev || !dev->bdev->bd_queue))
+		return -ENOENT;
+	q = dev->bdev->bd_queue;
+
+	if (need_lock)
+		mutex_lock(&uuid_mutex);
+	if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+	    test_bit(QUEUE_FLAG_DYING, &q->queue_flags)) {
+		dev->missing = 1;
+		root->fs_info->fs_devices->missing_devices++;
+	}
+	if (need_lock)
+		mutex_unlock(&uuid_mutex);
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 80754f9..47a44af 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -356,6 +356,8 @@ unsigned long btrfs_full_stripe_len(struct btrfs_root *root,
 int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 				struct btrfs_root *extent_root,
 				u64 chunk_offset, u64 chunk_size);
+int btrfs_check_dev_missing(struct btrfs_root *root, struct btrfs_device *dev,
+			    int need_lock);
 static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
 				      int index)
 {




Re: How does btrfs fi show show full?

2014-05-07 Thread Marc MERLIN
On Tue, May 06, 2014 at 08:10:00PM +, Duncan wrote:
 Marc MERLIN posted on Sun, 04 May 2014 22:50:29 -0700 as excerpted:
 
  In the second FS:
  Label: btrfs_pool1  uuid: [...]
Total devices 1 FS bytes used 442.17GiB
devid1 size 865.01GiB used 751.04GiB path [...]
  
  The difference is huge between 'Total used' and 'devid used'.
  
  Is btrfs going to fix this on its own, or likely not and I'm stuck doing
  a full balance (without filters since I'm balancing data and not
  metadata)?
  
  If that helps.
  legolas:~# btrfs fi df /mnt/btrfs_pool1
  Data, single: total=734.01GiB, used=435.29GiB
  System, DUP: total=8.00MiB, used=96.00KiB
  System, single: total=4.00MiB, used=0.00
  Metadata, DUP: total=8.50GiB, used=6.74GiB
  Metadata, single: total=8.00MiB, used=0.00
 
 Definitely helps.  The spread is in data.
 
 Try
 
 btrfs balance start -dusage=20 /mnt/btrfs_pool1

So, I had already tried -dusage=50 yesterday, and the numbers now look reasonable:
Label: btrfs_pool1  uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
	Total devices 1 FS bytes used 443.22GiB
	devid    1 size 865.01GiB used 514.04GiB path /dev/mapper/cryptroot
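
For anyone following the numbers: the gap under discussion is 'devid used' (space allocated to chunks) minus 'FS bytes used'. Here is a minimal sketch of that subtraction. The helper name alloc_slack is made up for illustration, and the figures are copied from the two btrfs fi show outputs in this thread.

```shell
#!/bin/bash
# alloc_slack ALLOCATED_GIB USED_GIB
# Hypothetical helper: prints how many GiB are allocated to chunks but do not
# hold data. awk is used because bash arithmetic is integer-only.
alloc_slack() {
    awk -v alloc="$1" -v used="$2" 'BEGIN { printf "%.2f\n", alloc - used }'
}

alloc_slack 751.04 442.17   # before the balance
alloc_slack 514.04 443.22   # after -dusage=50
```

The first call gives the slack before the balance (roughly 309 GiB) and the second the slack afterwards (roughly 71 GiB), which is why the later output looks so much healthier.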


 something like -dusage=50 or -dusage=80, likely MUCH faster, but will 
 return less chunks to unallocated, as well.  Still, your spread between 

(fewer)

 data-total and data-used is high enough, I expect -dusage=20 will give 
 you pretty good results.

So, on
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
I wrote:
In the case above, because the filesystem is only 55% full, I can ask
balance to rewrite all chunks that are more than 55% full:
legolas:~# btrfs balance start -dusage=55 /mnt/btrfs_pool1

Did I get this right?
I'm not sure I did, since it seems the bigger the -dusage number, the
more work balance has to do.

If I asked for -dusage=85, would it do all chunks that are more than 15%
full?
So, do I need to change the text above to say more than 45% full?

More generally, does it not make sense to just use the same percentage
in -dusage as the percentage of the total filesystem that is full?

Thanks,
Marc


Re: [RFC PATCH 1/2] btrfs: Add missing device check in dev_info/rm_dev ioctl

2014-05-07 Thread Qu Wenruo


 Original Message 
Subject: Re: [RFC PATCH 1/2] btrfs: Add missing device check in 
dev_info/rm_dev ioctl

From: Anand Jain anand.j...@oracle.com
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014-05-07 16:00



Thanks for working on this.
I am running some tests will let you know.

Anand


Thanks for your tests.
I have only checked using the scsi_device/X:X:X:X/device/delete interface to
remove the device, so if you have some other device removal tests, that would
be much nicer.

Thanks,
Qu


On 05/06/2014 02:33 PM, Qu Wenruo wrote:

[snip]


Re: raid0 vs single, and should we allow -mdup by default on SSDs?

2014-05-07 Thread Marc MERLIN
Hi Chris and other devs,

Does it really make sense to turn off -mdup on SSDs? I would argue that
it does not. In my case dmcrypt protected me from that, so I'm happy, but even if
I didn't use it, I'd want the protection of -mdup, even if the
protection might only be partial.

On Tue, May 06, 2014 at 05:16:08PM +, Duncan wrote:
 Single only stripes in such extremely large (1 GiB data, quarter-GiB 
 metadata, per strip) chunks that it doesn't matter for speed, and then 
 only as a result of its chunk allocation policy.  If one can define such 
 large strips as striping, which it is in a way, but not really in the 
 practical sense.
 
Oh good, I didn't know it was that big.

 The effect of a lost device, then, is more or less random, tho for single 
 metadata the effect is likely to be quite large up to total loss, due to 
 the damage to the tree.  It's not out of thin air that the multi-device 

Yes. I totally use either -mdup or -mraid1.

 That contrasts with raid0, where the striping is at sizes well under a 
 chunk (memory page size or 4 MiB on x86/amd64 data I believe, tho the 
 fact that files under the 16 MiB node size may actually be entirely 
 folded into metadata and not have a data extent allocation at all skews 
 things for up to the 16 MiB metadata node size), so the definition of 
 small file likely to be recovered is **MUCH** smaller on raid0, than on 
 single.

Great to know, I'll use -m raid1 -d single next time.

 Effectively, raid0 data you're only (relatively) likely to recover files 
 smaller than 16 MiB, while single data, it's files smaller than 1 GiB.

Thanks much for that.

On Tue, May 06, 2014 at 07:05:52PM +, Duncan wrote:
 1) In order to do that, btrfs (I guess mkfs.btrfs in this case) must be 
 able to detect that the device *IS* ssd.  Depending on the SSD, the 
 kernel version, and whether the btrfs is being created direct on bare-
 metal device or on some device layered (lvm or dmcrypt or whatever) on 
 top of the bare metal, btrfs may or may not successfully detect that.
 
 Obviously in your case[1] the ssd wasn't detected.

Indeed.  I also found out why my SSD has -mdup: It's on top of dmcrypt
so btrfs failed to see it was an SSD and gave me -mdup. Good, that's
what I wanted anyway :)
 
 I believe I've seen you mention using dmcrypt or the like, however, which 
 probably doesn't pass whatever is used for ssd protection on thru, thus 
 explaining btrfs not seeing it and having to specify it yourself, if you 
 wish.
 
You guessed correctly, congrats.

 2) The only reason I happen to know about the SSD metadata single-device 
 single mode default exception (where metadata otherwise defaults to dup 
 mode on single-device, and to raid1 mode on multi-device regardless of 
 the media), is as a result of I believe Chris Mason commenting on it in 
 an on-list reply.
 The reasoning given in that reply was not the erase-block reason I've 
 seen someone else mention here (and which doesn't quite make sense to me, 
 since I don't know why that would make a difference), but rather:

Yes. I personally don't think it's a good idea. Basically when having 2
copies, they could still end up on the same erase block, making them
less redundant.
My answer to that is 'so what?'
There are plenty of other times where dup would be useful on an SSD. I
really don't see the point of trying to turn it off by default just because
maybe in one case it would not offer extra protection.

 Some SSD firmware does automatic deduplication and compression.  On these 
 devices, DUP-mode would almost certainly be stored as a single internal 
 data block with two external address references anyway, so it would 
 actually be single in any case, and defaulting to single (a) doesn't hide 
 that fact, and (b) reduces overhead that's justified for safety 
 otherwise, but if the firmware is doing an end run around that safety 
 anyway, might as well just shortcut the overhead as well.

If some SSDs do this, let's not punish those who have SSDs that don't.
 
 However, while the btrfs default will apply to all (detected) ssds, not 
 all ssds have firmware that does this internal deduplication!

Exactly.

On Tue, May 06, 2014 at 07:39:12PM +, Duncan wrote:
 Well, assuming that by -d linear you meant -d single. Btrfs doesn't call 
 it linear, tho at the data safety level, btrfs single is actually quite 
 comparable to mdadm linear.  =:^)  

Yes, I meant single, sorry :)
(aka linear for mdadm)
 
  At the time I used -m raid1 -d raid0, but it sounds like, for slightly extra
  recoverability, I should have used -m raid1 -d linear (and yes, I
  understand that one should not consider a -d linear recoverable when a
  drive went missing).
 
 That appears to be a very good use of either -d raid0 or -d single, yes.  
 And since you're apparently not streaming such high resolution video that 
 you NEED the raid0, single does indeed give you a somewhat better chance 
 at recovery.
 
zoneminder saves 'video' as a stream of independent small jpegs, 

Re: raid0 vs single, and should we allow -mdup by default on SSDs?

2014-05-07 Thread Hugo Mills
On Wed, May 07, 2014 at 01:18:40AM -0700, Marc MERLIN wrote:
 On Tue, May 06, 2014 at 07:39:12PM +, Duncan wrote:
  That appears to be a very good use of either -d raid0 or -d single, yes.  
  And since you're apparently not streaming such high resolution video that 
  you NEED the raid0, single does indeed give you a somewhat better chance 
  at recovery.
  
 zoneminder saves 'video' as a stream of independent small jpegs, so I'm
 good. Actually come to think of it they're so small that they probably
 all ended up in the raid1 metadata. That also means that I'm not getting
 twice the storage space like I planned to. Oh well...

   There's a mount option to change the threshold at which files are
inlined in metadata: maxinline=bytes. You could play with that for
this particular use-case.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- I am but mad north-north-west:  when the wind is southerly, I ---  
   know a hawk from a handsaw.   




Re: raid0 vs single, and should we allow -mdup by default on SSDs?

2014-05-07 Thread Marc MERLIN
On Wed, May 07, 2014 at 09:29:41AM +0100, Hugo Mills wrote:
 On Wed, May 07, 2014 at 01:18:40AM -0700, Marc MERLIN wrote:
  On Tue, May 06, 2014 at 07:39:12PM +, Duncan wrote:
   That appears to be a very good use of either -d raid0 or -d single, yes.  
   And since you're apparently not streaming such high resolution video that 
   you NEED the raid0, single does indeed give you a somewhat better chance 
   at recovery.
   
  zoneminder saves 'video' as a stream of independent small jpegs, so I'm
  good. Actually come to think of it they're so small that they probably
  all ended up in the raid1 metadata. That also means that I'm not getting
  twice the storage space like I planned to. Oh well...
 
There's a mount option to change the threshold at which files are
 inlined in metadata: maxinline=bytes. You could play with that for
 this particular use-case.

Oh cool, thank you.

Marc


Re: How does Suse do live filesystem revert with btrfs?

2014-05-07 Thread Marc MERLIN
On Tue, May 06, 2014 at 04:26:48PM +, Duncan wrote:
 Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted:
 
  On Mon, May 05, 2014 at 01:36:39AM +0100, Hugo Mills wrote:
 I'm guessing it involves reflink copies of files from the snapshot
  back to the original, and then restarting affected services. That's
  about the only other thing that I can think of, but it's got loads of
  race conditions in it (albeit difficult to hit in most cases, I
  suspect).
  
  Aaah, right, you can use a script to see the file differences between
  two snapshots, and then restore that with reflink if you can truly get a
  list of all changed files.
  However, that is indeed not atomic at all, even if faster than rsync.
 
 Would send/receive help in such a script?

Not really, you still end up with a new snapshot that you can't live
switch to.

It's really either
1) reboot
2) use cp --reflink to copy a list of changed files (as well as rm to
delete the ones that were removed).
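
Option 2 could look roughly like the sketch below. It is only an illustration: restore_changed is a name invented here, the list of changed paths is assumed to arrive on stdin as paths of regular files relative to the snapshot root, and cp --reflink=auto (GNU coreutils) is used so that it falls back to a plain copy where reflinks are not supported. It is not atomic, for the reasons already given above.

```shell
#!/bin/bash
# restore_changed SNAPSHOT LIVE_ROOT < list-of-relative-paths
# Hypothetical helper: for each listed path, copy the snapshot's version over
# the live one, or delete the live file if the snapshot does not have it.
restore_changed() {
    local snap=$1 live=$2 rel
    while IFS= read -r rel; do
        if [ -e "$snap/$rel" ] || [ -L "$snap/$rel" ]; then
            # Make sure the parent directory exists in the live tree.
            mkdir -p -- "$(dirname -- "$live/$rel")"
            # -a keeps ownership, modes and timestamps; reflink shares extents.
            cp -a --reflink=auto -- "$snap/$rel" "$live/$rel"
        else
            # The file was created after the snapshot was taken: remove it.
            rm -f -- "$live/$rel"
        fi
    done
}
```

It would be called along the lines of: some-list-of-changed-paths | restore_changed /mnt/pool/root_snapshot / , where both paths are placeholders. Producing a complete list, including deletions, is still the open question.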

I'm currently using btrfs-diff (below) which shows changed files but it
doesn't show files deleted.

Is there something better that would show me which files changed and how
between 2 snapshots?

btrfs-diff:
-
#!/bin/bash

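usage() { echo "$@" >&2; echo "Usage: $0 <older-snapshot> <newer-snapshot>" >&2; exit 1; }

[ $# -eq 2 ] || usage "Incorrect invocation"
SNAPSHOT_OLD=$1
SNAPSHOT_NEW=$2

[ -d "$SNAPSHOT_OLD" ] || usage "$SNAPSHOT_OLD does not exist"
[ -d "$SNAPSHOT_NEW" ] || usage "$SNAPSHOT_NEW does not exist"

# Ask find-new for a generation number and keep only the "transid marker was N"
# line; this appears to rely on 999 being higher than the snapshot's generation.
OLD_TRANSID=$(btrfs subvolume find-new "$SNAPSHOT_OLD" 999)
OLD_TRANSID=${OLD_TRANSID#transid marker was }
[ -n "$OLD_TRANSID" ] && [ "$OLD_TRANSID" -gt 0 ] || usage "Failed to find generation for $SNAPSHOT_OLD"

# List what changed since that generation, drop the trailing marker line,
# and keep only the path column.
btrfs subvolume find-new "$SNAPSHOT_NEW" "$OLD_TRANSID" | sed '$d' | cut -f17- -d' ' | sort | uniq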
-

Thanks,
Marc


Re: btrfs on software RAID0

2014-05-07 Thread Marc MERLIN
On Tue, May 06, 2014 at 09:02:46AM +0200, john terragon wrote:
 just one last doubt:
 
 why do you use --align-payload=1024? (or 8912)
 Cryptsetup man says that the default for the payload alignment is 2048
 (512-byte sectors). So, it's already aligned by default to 4K-byte
 physical sectors (if that was your concern). Am I missing something?

With 4K sectors, I agree that 2048 would be better.

What I was trying to do there is avoid write amplification.

After reading 
http://wiki.drewhess.com/wiki/Creating_an_encrypted_filesystem_on_a_partition
I went with
mdadm --create /dev/md8 --level=5 --raid-devices=5 /dev/sd[abdef]1 --chunk=256 --bitmap=/boot/bitmap-md8
which I believe required me to use
cryptsetup luksFormat --align-payload=1024 -s 256 -c aes-xts-plain64  /dev/md8
(that was with 5 drives, or 4 drives with data).

Would you agree with the math?

If so, for 4K sector sizes, if we have to use align-payload=1024, in
turn I'd have to use --chunk=512. 
Does that sound right?
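
As a sanity check on the units: mdadm's --chunk is given in KiB, while (per the man page quoted above) --align-payload counts 512-byte sectors. Below is a small sketch of the conversion. The name full_stripe_sectors is invented here, and 4 is the number of data-bearing drives in the 5-drive RAID5 above.

```shell
#!/bin/bash
# full_stripe_sectors CHUNK_KIB DATA_DISKS
# Hypothetical helper: size of one full RAID stripe, in 512-byte sectors.
full_stripe_sectors() {
    local chunk_kib=$1 data_disks=$2
    # A full stripe holds one chunk on every data-bearing disk.
    local stripe_kib=$(( chunk_kib * data_disks ))
    # 1 KiB is two 512-byte sectors.
    echo $(( stripe_kib * 2 ))
}

full_stripe_sectors 256 4   # the array created above
full_stripe_sectors 512 4   # the --chunk=512 variant
```

For --chunk=256 this prints 2048, and for --chunk=512 it prints 4096. The full stripe is 1024 KiB in the first case, so it may be worth double-checking whether 1024 was meant as KiB or as sectors. That reading is a guess, not something stated in the thread.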

Thanks,
Marc


Re: How does btrfs fi show show full?

2014-05-07 Thread Brendan Hide

On 2014/05/07 09:59 AM, Marc MERLIN wrote:

[snip]

Did I get this right?
I'm not sure I did, since it seems the bigger the -dusage number, the
more work balance has to do.

If I asked for -dusage=85, would it do all chunks that are more than 15%
full?


-dusage=85 balances all chunks that are up to 85% full. The higher the 
number, the more work needs to be done.
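
To make the direction of the filter concrete, here is a toy sketch. The function name select_chunks and the usage figures are made up, and the "at or below the threshold" comparison follows the wording above rather than the kernel source, so the exact boundary behaviour is an assumption.

```shell
#!/bin/bash
# select_chunks THRESHOLD USAGE_PERCENT...
# Hypothetical helper: prints the usage of every chunk that a -dusage=THRESHOLD
# filter would pick, assuming "up to THRESHOLD% full" means usage <= THRESHOLD.
select_chunks() {
    local threshold=$1; shift
    local usage
    for usage in "$@"; do
        if [ "$usage" -le "$threshold" ]; then
            echo "$usage"
        fi
    done
}

select_chunks 55 10 40 55 70 90
select_chunks 85 10 40 55 70 90
```

With these five made-up chunks, a threshold of 55 picks three of them and a threshold of 85 picks four, which matches the observation that a bigger number means more work.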

So, do I need to change the text above to say more than 45% full ?

More generally, does it not make sense to just use the same percentage
in -dusage as the percentage of the total filesystem that is full?

Thanks,
Marc


Separately, Duncan has made me realise my halfway up algorithm is not 
very good - it was probably just good enough at the time and worked 
well enough that I wasn't prompted to analyse it further.


Doing a simulation with randomly-semi-filled chunks, df at 55%, and 
chunk utilisation at 86%, -dusage=55 balances 30% of the chunks, almost 
perfectly bringing chunk utilisation down to 56%. In my algorithm I 
would have used -dusage=70 which in my simulation would have balanced 
34% of the chunks - but bringing chunk utilisation down to 55% - a bit 
of wasted effort and unnecessary SSD wear.
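
A much cruder version of that kind of simulation can be sketched as follows. This is not the simulation described above: chunks_after_balance is an invented name, the chunk fill levels are arbitrary, and the model simply assumes that every selected chunk's data is repacked into as few completely full chunks as possible.

```shell
#!/bin/bash
# chunks_after_balance THRESHOLD USAGE_PERCENT...
# Toy model (an assumption, not btrfs code): every chunk whose usage is at or
# below THRESHOLD is rewritten and its data packed into full chunks.
# Prints the number of chunks left afterwards.
chunks_after_balance() {
    local threshold=$1; shift
    local usage kept=0 moved=0
    for usage in "$@"; do
        if [ "$usage" -le "$threshold" ]; then
            moved=$(( moved + usage ))   # data that gets repacked
        else
            kept=$(( kept + 1 ))         # chunk left untouched
        fi
    done
    # Round the repacked data up to whole chunks.
    echo $(( kept + (moved + 99) / 100 ))
}

chunks_after_balance 55 10 20 30 40 50 60 70 80 90 100
chunks_after_balance 70 10 20 30 40 50 60 70 80 90 100
```

With these ten made-up chunks the two calls print 7 and 6. The higher threshold rewrites two extra chunks to free only one more, which is the diminishing return being described.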


I think now that I need to experiment with a much lower -dusage value 
and perhaps to repeat the balance with the df value (55 in the example) 
if the chunk usage is still too high. Getting an optimal first value 
algorithmically might prove a challenge - I might just end up picking 
some arbitrary percentage point below the df value.


Pathological use-cases still apply however (for example if all chunks 
except one are exactly 54% full). The up-side is that if the algorithm 
is applied regularly (as in scripted and scheduled) then the situation 
will always be that the majority of chunks are going to be relatively 
full, avoiding the pathological use-case.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: Using mount -o bind vs mount -o subvol=vol

2014-05-07 Thread Marc MERLIN
On Mon, May 05, 2014 at 02:12:30AM +, Duncan wrote:
 Marc MERLIN posted on Sat, 03 May 2014 17:47:32 -0700 as excerpted:

Just as an FYI, like (likely) most subscribers, I do prefer Cc on
replies. Without that, I'm much less likely to see your message timely,
or at all if I'm behind on Email.
 
 TL;DR: Put simply, with certain sometimes major exceptions, IMO subvolumes 
 are /mostly/ a solution looking for a problem.  In the /general/ case, I 
 don't see the point and personally STRONGLY prefer multiple independent 
 partitions for their much stronger data safety and mounting/backup 
 flexibility.  That's why I use independent partitions, here.
 
I'm a partitions guy, but now that I have subvolumes which can be
snapshotted/backed up independently, I'm much happier with a single
shared pool.
Look at a btrfs pool like an LVM pool, except more flexible.

To each their own I guess.

 1) Multiple subvolumes on a common filesystem share the filesystem tree- 
 and super-structure.  If something happens to that filesystem, you had 
 all your data eggs in that one basket and the bottom just dropped out of 
 it!  If you can't recover, kiss **ALL** those data eggs goodbye!

Backups :)
(and having your boot filesystem on a different pool from your data
pool).
 
 3) Filesystem size and time to complete whole-filesystem operations such 
 as balance, scrub and check are directly related; the larger the 
 filesystem, the longer such operations take.  There are reports here of 
 balances taking days on multi-terabyte filesystems, and double-digit 
 hours isn't unusual at all.

True, but if I have a 10TB array, I'm not going to cut it into 10 1TB
arrays just for that.
 
 Now ask yourself, how likely are you to routinely run a scrub or balance 
 as preventive maintenance if you know it's going to take the entire day 
 to finish?  Here, the times are literally so trivial I can and do run a 
 full filesystem rebalance to time it and make this point and maintenance 
 such as scrub or balance simply ceases to be an issue.

It runs nightly from cron on my laptop. 1TB filesystem on SSD, no sweat.
 
 4) Many distros are using btrfs subvolumes on a single btrfs storage 
 pool the way they formerly used LVM volume groups, as a common storage 
 pool allowing them the flexibility to (re)allocate space to whatever lvm 
 volume or btrfs subvolume needs it.

Yep.
 
 OTOH, for users and distros with a pretty good idea of what their 
 allocations are going to look like, generally due to the experience 
 they've gained over the years, that extra flexibility isn't a big benefit 

You and me yes, most other people no.
And to be honest, I've been doing this for 20 years, and my guesses are
not always right 10 years later on a machine that's still running :)
(of which I have several)

 6) Subvolumes can be used to control snapshotting since snapshots stop at 
 subvolume boundaries.  In the presence of point #5 storage pools, and 
 given the reality of btrfs NOCOW attribute behavior when mixed with 
 snapshots, subvolumes become an important tool for limiting snapshot 
 coverage area, in particular, for demarcing areas that should NOT be 
 snapshotted when the filesystem or parent subvolume is snapshotted, due 
 for instance to the horrible interaction between large heavy-internal-
 rewrite files and COW, which means they should be set NOCOW, coupled with 
 the horrible interaction between NOCOW on such files and snapshotting.

Yep.
 
 Similarly, subvolumes and their boundaries can be used to set borders for 
 frequency or timing of snapshotting, say snapshotting the general
 root/system tree before updates, while snapshotting /home hourly.

Yep.

 Point #6 is, I'd argue, one of the few legitimate use-cases for 
 subvolumes as opposed to independent filesystems, and it actually loses 
 relevancy if #4 is subsumed to point #1 and #3, already.  However, given 
 the reality of popular distro btrfs layouts and usage, #4 is in practice 
 overruling all the others in many distro-default btrfs deployments today, 
 and #6 then becomes relevant.

subvolumes are also used as units of backup for btrfs send.
 
 So my vote would be, for example (modified slightly for posting from my 
 own mounts):
 
 mount /dev/sda5 /
 mount /dev/sda4 /var/log
 mount /dev/sda6 /home

On my laptop:
/dev/mapper/cryptroot on / type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /usr type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /var type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /home type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /tmp type btrfs (rw,noexec,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /var/local/nobckd2 type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/disk2 on /var/local/space type btrfs (rw,noatime,compress=lzo,discard,space_cache)

/dev/mapper/cryptroot 

[PATCH 2/3] Crypto: xxhash: add tests

2014-05-07 Thread Liu Bo
Signed-off-by: Liu Bo bo.li@oracle.com
---
 crypto/testmgr.c | 10 ++++++++++
 crypto/testmgr.h | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index dc3cf35..27ba702 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3153,6 +3153,16 @@ static const struct alg_test_desc alg_test_descs[] = {
 			}
 		}
 	}, {
+		.alg = "xxh32",
+		.test = alg_test_hash,
+		.fips_allowed = 1,
+		.suite = {
+			.hash = {
+				.vecs = xxh32_tv_template,
+				.count = XXH32_TEST_VECTORS
+			}
+		}
+	}, {
 		.alg = "zlib",
 		.test = alg_test_pcomp,
 		.fips_allowed = 1,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 3db83db..8e56884 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -26660,6 +26660,39 @@ static struct hash_testvec michael_mic_tv_template[] = {
 	}
 };
 
+#define XXH32_TEST_VECTORS 3
+
+static struct hash_testvec xxh32_tv_template[] = {
+	{
+		.plaintext = "\x9e",
+		.psize = 1,
+		.digest  = "\xe5\xbe\x5c\xb8",
+	},
+	{
+		.plaintext = "\x9e\xff\x1f\x4b\x5e\x53\x2f\xdd"
+			     "\xb5\x54\x4d\x2a\x95\x2b",
+		.psize = 14,
+		.digest = "\xb4\x0a\xaa\xe5",
+	},
+	{
+		.plaintext = "\x9e\xff\x1f\x4b\x5e\x53\x2f\xdd"
+			     "\xb5\x54\x4d\x2a\x95\x2b\x57\xae"
+			     "\x5d\xba\x74\xe9\xd3\xa6\x4c\x98"
+			     "\x30\x60\xc0\x80\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00",
+		.psize = 101,
+		.digest = "\x12\xa4\x1a\x1f",
+	}
+};
+
 /*
  * CRC32C test vectors
  */
-- 
1.8.1.4



[PATCH 3/3] Btrfs: add another checksum algorithm xxhash

2014-05-07 Thread Liu Bo
xxHash is an extremely fast non-cryptographic hash algorithm, working at speeds
close to RAM limits.[1]  xxhash is also a 32-bit hash, the same as crc32.

This modifies btrfs's checksum API a bit and adopts xxhash as an alternative
checksum algorithm.

Note: We need to update the btrfs-progs side as well to set it up.

[1]: https://code.google.com/p/xxhash/

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/Kconfig            |  22
 fs/btrfs/compression.c      |   6 +--
 fs/btrfs/ctree.h            |  12 +++--
 fs/btrfs/dir-item.c         |  10 ++--
 fs/btrfs/disk-io.c          | 126
 fs/btrfs/disk-io.h          |   2 -
 fs/btrfs/extent-tree.c      |  43 ++-
 fs/btrfs/file-item.c        |   9 ++--
 fs/btrfs/free-space-cache.c |  15 +++---
 fs/btrfs/hash.c             |  75 --
 fs/btrfs/hash.h             |  22
 fs/btrfs/inode-item.c       |   6 +--
 fs/btrfs/inode.c            |  16 +++---
 fs/btrfs/props.c            |  37 +++--
 fs/btrfs/props.h            |   3 +-
 fs/btrfs/scrub.c            |  70 +++-
 fs/btrfs/send.c             |   7 ++-
 fs/btrfs/super.c            |   9 ++--
 fs/btrfs/tree-log.c         |   2 +-
 19 files changed, 331 insertions(+), 161 deletions(-)

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index a66768e..ef45456 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -2,6 +2,7 @@ config BTRFS_FS
 	tristate "Btrfs filesystem support"
 	select CRYPTO
 	select CRYPTO_CRC32C
+	select CRYPTO_XXH32
 	select ZLIB_INFLATE
 	select ZLIB_DEFLATE
 	select LZO_COMPRESS
@@ -88,3 +89,24 @@ config BTRFS_ASSERT
 	  any of the assertions trip.  This is meant for btrfs developers only.
 
 	  If unsure, say N.
+
+choice
+	prompt "choose checksum algorithm"
+	default BTRFS_CRC32C
+	help
+	  This option allows to select a checksum algorithm
+
+config BTRFS_CRC32C
+	depends on CRYPTO_CRC32C
+	bool "BTRFS_CRC32C"
+	help
+	  crc32c
+
+config BTRFS_XXH32
+	depends on CRYPTO_XXH32
+	bool "BTRFS_XXH32"
+	help
+	  xxhash
+
+endchoice
+
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d43c544..889b0f1 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -41,6 +41,7 @@
 #include "compression.h"
 #include "extent_io.h"
 #include "extent_map.h"
+#include "hash.h"
 
 struct compressed_bio {
 	/* number of bios pending for this compressed extent */
@@ -114,17 +115,16 @@ static int check_compressed_csum(struct inode *inode,
 	char *kaddr;
 	u32 csum;
 	u32 *cb_sum = &cb->sums;
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		return 0;
 
 	for (i = 0; i < cb->nr_pages; i++) {
 		page = cb->compressed_pages[i];
-		csum = ~(u32)0;
 
 		kaddr = kmap_atomic(page);
-		csum = btrfs_csum_data(kaddr, csum, PAGE_CACHE_SIZE);
-		btrfs_csum_final(csum, (char *)&csum);
+		btrfs_csum_data(fs_info, kaddr, PAGE_CACHE_SIZE, (char *)&csum);
 		kunmap_atomic(kaddr);
 
 		if (csum != *cb_sum) {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba6b885..cbb6533 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -176,12 +176,16 @@ struct btrfs_ordered_sum;
 /* 32 bytes in various csum fields */
 #define BTRFS_CSUM_SIZE 32
 
-/* csum types */
+/*
+ * csum types,
+ * - 4 bytes for CRC32(crc32c)
+ * - 4 bytes for XXH32(xxhash)
+ */
 #define BTRFS_CSUM_TYPE_CRC32  0
+#define BTRFS_CSUM_TYPE_XXH32  1
 
-static int btrfs_csum_sizes[] = { 4, 0 };
+static int btrfs_csum_sizes[] = { 4, 4, 0 };
 
-/* four bytes for CRC32 */
 #define BTRFS_EMPTY_DIR_SIZE 0
 
 /* spefic to btrfs_map_block(), therefore not in include/linux/blk_types.h */
@@ -1688,6 +1692,8 @@ struct btrfs_fs_info {
 
struct semaphore uuid_tree_rescan_sem;
unsigned int update_uuid_tree_gen:1;
+
+   struct crypto_shash *tfm;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index a0691df..1332858 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -87,7 +87,7 @@ int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
 
 	key.objectid = objectid;
 	btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
-	key.offset = btrfs_name_hash(name, name_len);
+	key.offset = btrfs_name_hash(root->fs_info, name, name_len);
 
 	data_size = sizeof(*dir_item) + name_len + data_len;
 	dir_item = insert_with_overflow(trans, root, path, &key, data_size,
@@ -138,7 +138,7 @@ int btrfs_insert_dir_item(struct btrfs_trans_handle *trans, struct btrfs_root
 
 	key.objectid = btrfs_ino(dir);
 	btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
-	key.offset = btrfs_name_hash(name, name_len);

[PATCH] Btrfs-progs: add xxhash

2014-05-07 Thread Liu Bo
From: root <root@localhost.localdomain>

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 Makefile  |   4 +-
 crc32c.h  |   4 +-
 disk-io.c |   2 +-
 hash.h    |   2 +-
 xxhash.c  | 448 +
 xxhash.h  | 171 +++
 6 files changed, 626 insertions(+), 5 deletions(-)
 create mode 100644 xxhash.c
 create mode 100644 xxhash.h

diff --git a/Makefile b/Makefile
index 369df6c..1d70bc9 100644
--- a/Makefile
+++ b/Makefile
@@ -16,10 +16,10 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o 
cmds-device.o cmds-scrub.o \
   cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
   cmds-property.o cmds-dedup.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
-  uuid-tree.o
+  uuid-tree.o xxhash.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
   crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \
-  extent_io.h ioctl.h ctree.h btrfsck.h
+  extent_io.h ioctl.h ctree.h btrfsck.h xxhash.h
 TESTS = fsck-tests.sh
 
 INSTALL = install
diff --git a/crc32c.h b/crc32c.h
index c552ef6..6dd0ce2 100644
--- a/crc32c.h
+++ b/crc32c.h
@@ -25,9 +25,11 @@
 #include <btrfs/kerncompat.h>
 #endif /* BTRFS_FLAT_INCLUDES */
 
+#include "xxhash.h"
+
 u32 crc32c_le(u32 seed, unsigned char const *data, size_t length);
 void crc32c_optimization_init(void);
 
-#define crc32c(seed, data, length) crc32c_le(seed, (unsigned char const *)data, length)
+#define crc32c(seed, data, length) XXH32(data, length, 0)
 #define btrfs_crc32c crc32c
 #endif
diff --git a/disk-io.c b/disk-io.c
index 19b95a7..2c72f7f 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -67,7 +67,7 @@ u32 btrfs_csum_data(struct btrfs_root *root, char *data, u32 
seed, size_t len)
 
 void btrfs_csum_final(u32 crc, char *result)
 {
-   *(__le32 *)result = ~cpu_to_le32(crc);
+   *(__le32 *)result = cpu_to_le32(crc);
 }
 
 static int __csum_tree_block_size(struct extent_buffer *buf, u16 csum_size,
diff --git a/hash.h b/hash.h
index c0b88a1..2d1a71d 100644
--- a/hash.h
+++ b/hash.h
@@ -22,6 +22,6 @@
 
 static inline u64 btrfs_name_hash(const char *name, int len)
 {
-   return ~(crc32c((u32)(~0), name, len));
+   return crc32c((u32)(~0), name, len);
 }
 #endif
diff --git a/xxhash.c b/xxhash.c
new file mode 100644
index 000..f855a58
--- /dev/null
+++ b/xxhash.c
@@ -0,0 +1,448 @@
+/*
+xxHash - Fast Hash algorithm
+Copyright (C) 2012-2014, Yann Collet.
+BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+* Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+* Redistributions in binary form must reproduce the above
+copyright notice, this list of conditions and the following disclaimer
+in the documentation and/or other materials provided with the
+distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+You can contact the author at :
+- xxHash source repository : http://code.google.com/p/xxhash/
+*/
+
+
+//**
+// Tuning parameters
+//**
+// Unaligned memory access is automatically enabled for common CPU, such as 
x86.
+// For others CPU, the compiler will be more cautious, and insert extra code 
to ensure aligned access is respected.
+// If you know your target CPU supports unaligned memory access, you want to 
force this option manually to improve performance.
+// You can also enable this parameter if you know your input data will always 
be aligned (boundaries of 4, for U32).
+#if defined(__ARM_FEATURE_UNALIGNED) || defined(__i386) || defined(_M_IX86) || 
defined(__x86_64__) || defined(_M_X64)
+#  define XXH_USE_UNALIGNED_ACCESS 1
+#endif
+
+// XXH_ACCEPT_NULL_INPUT_POINTER :
+// If the input pointer is a null pointer, xxHash default behavior is to 
trigger a memory access error, since it is a bad pointer.
+// When this option is enabled, xxHash output for null input pointers will be 
the same as a null-length 

[PATCH 1/3] Crypto: add xxhash algorithm

2014-05-07 Thread Liu Bo
This will be used in btrfs, and maybe in others in the future.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 crypto/Kconfig          |   7 +
 crypto/Makefile         |   1 +
 crypto/xxhash.c         | 383
 include/crypto/xxhash.h | 209 ++
 4 files changed, 600 insertions(+)
 create mode 100644 crypto/xxhash.c
 create mode 100644 include/crypto/xxhash.h

diff --git a/crypto/Kconfig b/crypto/Kconfig
index ce4012a..2e56de0 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -622,6 +622,13 @@ config CRYPTO_GHASH_CLMUL_NI_INTEL
 	  GHASH is message digest algorithm for GCM (Galois/Counter Mode).
 	  The implementation is accelerated by CLMUL-NI of Intel.
 
+config CRYPTO_XXH32
+	tristate "XXHASH digest algorithm"
+	select CRYPTO_HASH
+	help
+	  xxHash - Fast Hash Algorithm
+	  source repository : http://code.google.com/p/xxhash/
+
 comment "Ciphers"
 
 config CRYPTO_AES
diff --git a/crypto/Makefile b/crypto/Makefile
index 38e64231..7c3f363 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CRYPTO_GHASH) += ghash-generic.o
 obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
 obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
+obj-$(CONFIG_CRYPTO_XXH32) += xxhash.o
 
 #
 # generic algorithms and the async_tx api
diff --git a/crypto/xxhash.c b/crypto/xxhash.c
new file mode 100644
index 000..b84c7cf
--- /dev/null
+++ b/crypto/xxhash.c
@@ -0,0 +1,383 @@
+/*
+ * xxHash - Fast Hash algorithm
+ * Copyright (C) 2012-2014, Yann Collet.
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+ * this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the distribution.
+
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+ * You can contact the author at :
+ * xxHash source repository : http://code.google.com/p/xxhash/
+ */
+
+#include <crypto/internal/hash.h>
+#include <crypto/xxhash.h>
+#include <linux/string.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+
+static inline u32 XXH_readLE32_align(const u32 * ptr, XXH_endianess endian,
+XXH_alignment align)
+{
+   if (align == XXH_unaligned)
+   return endian ==
+   XXH_littleEndian ? A32(ptr) : XXH_swap32(A32(ptr));
+   else
+   return endian == XXH_littleEndian ? *ptr : XXH_swap32(*ptr);
+}
+
+static inline u32 XXH_readLE32(const u32 * ptr, XXH_endianess endian)
+{
+   return XXH_readLE32_align(ptr, endian, XXH_unaligned);
+}
+
+/* Simple Hash Functions */
+static inline u32 XXH32_endian_align(const void *input, int len, u32 seed,
+XXH_endianess endian,
+XXH_alignment align)
+{
+   const u8 *p = (const u8 *)input;
+   const u8 *const bEnd = p + len;
+   u32 h32;
+
+#ifdef XXH_ACCEPT_NULL_INPUT_POINTER
+   if (p == NULL) {
+   len = 0;
+   p = (const u8 *)(size_t) 16;
+   }
+#endif
+   if (len >= 16) {
+   const u8 *const limit = bEnd - 16;
+   u32 v1 = seed + PRIME32_1 + PRIME32_2;
+   u32 v2 = seed + PRIME32_2;
+   u32 v3 = seed + 0;
+   u32 v4 = seed - PRIME32_1;
+   u32 tmp;
+
+   do {
+   tmp = XXH_readLE32_align((const u32 *)p, endian, 
align); 
+   v1 += tmp * PRIME32_2;
+   v1 = XXH_rotl32(v1, 13);
+   v1 *= PRIME32_1;
+   p += 4;
+
+   tmp = XXH_readLE32_align((const u32 *)p, endian, 
align); 
+   

[RFC PATCH 0/3] Btrfs: add xxhash algorithm

2014-05-07 Thread Liu Bo
xxHash is an extremely fast non-cryptographic hash algorithm, working at speeds
close to RAM limits.[1]  xxhash is also a 32-bit hash, the same size as crc32.

Here is the hash comparison extracted from the link[1]:
(single thread, Windows Seven 32 bits, using Open Source's SMHasher on a Core 2
Duo @3GHz)


Name      Speed       Q.Score   Author
xxHash    5.4 GB/s    10
CRC32     0.43 GB/s   9



This patch set adds xxhash into linux kernel and then modifies btrfs's checksum
API a bit and adopts xxhash as an alternative checksum algorithm.

At the very first stage of RFC, I only ran xfstests through to make sure it
can work.  A bunch of performance tests will be made in the future.
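
For readers who want to check the test vectors from patch 2/3 outside the
kernel, here is a minimal userspace sketch of the 32-bit xxHash routine. It is
not the code from this patch set: the names xxh32_sketch, read_le32 and round32
are made up for illustration, the primes and rotation counts are the published
XXH32 ones as far as I know, and it assumes the .digest bytes in the test
vectors are the 32-bit result stored little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRIME32_1 2654435761U
#define PRIME32_2 2246822519U
#define PRIME32_3 3266489917U
#define PRIME32_4  668265263U
#define PRIME32_5  374761393U

static uint32_t rotl32(uint32_t x, int r)
{
	return (x << r) | (x >> (32 - r));
}

/* Read 4 bytes as little-endian, whatever the host byte order or alignment. */
static uint32_t read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* One accumulator step of the 16-byte stripe loop. */
static uint32_t round32(uint32_t acc, uint32_t lane)
{
	acc += lane * PRIME32_2;
	acc = rotl32(acc, 13);
	return acc * PRIME32_1;
}

/* Hypothetical helper name; hashes len bytes of input with the given seed. */
uint32_t xxh32_sketch(const void *input, size_t len, uint32_t seed)
{
	const uint8_t *p = input;
	size_t i = 0;
	uint32_t h32;

	assert(p != NULL || len == 0);

	if (len >= 16) {
		uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
		uint32_t v2 = seed + PRIME32_2;
		uint32_t v3 = seed;
		uint32_t v4 = seed - PRIME32_1;

		/* Consume 16 bytes per iteration, four lanes of 4 bytes. */
		for (; i + 16 <= len; i += 16) {
			v1 = round32(v1, read_le32(p + i));
			v2 = round32(v2, read_le32(p + i + 4));
			v3 = round32(v3, read_le32(p + i + 8));
			v4 = round32(v4, read_le32(p + i + 12));
		}
		h32 = rotl32(v1, 1) + rotl32(v2, 7) +
		      rotl32(v3, 12) + rotl32(v4, 18);
	} else {
		h32 = seed + PRIME32_5;
	}

	h32 += (uint32_t)len;

	/* Remaining whole 32-bit words. */
	for (; i + 4 <= len; i += 4) {
		h32 += read_le32(p + i) * PRIME32_3;
		h32 = rotl32(h32, 17) * PRIME32_4;
	}
	/* Remaining single bytes. */
	for (; i < len; i++) {
		h32 += p[i] * PRIME32_5;
		h32 = rotl32(h32, 11) * PRIME32_1;
	}

	/* Final avalanche. */
	h32 ^= h32 >> 15;
	h32 *= PRIME32_2;
	h32 ^= h32 >> 13;
	h32 *= PRIME32_3;
	h32 ^= h32 >> 16;
	return h32;
}
```

Reading bytes one at a time avoids the aligned/unaligned and endianness
variants the kernel version carries, at the cost of speed, so this is only
suitable for cross-checking values and not for benchmarking.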

Note:

We need to update the btrfs-progs side as well to set it up; I attach a hacky
patch just for users to play with ;-)



[1]: https://code.google.com/p/xxhash/

Liu Bo (3):
  Crypto: add xxhash algorithm
  Crypto: xxhash: add tests
  Btrfs: add another checksum algorithm xxhash

 crypto/Kconfig              |   7 +
 crypto/Makefile             |   1 +
 crypto/testmgr.c            |  10 ++
 crypto/testmgr.h            |  33
 crypto/xxhash.c             | 383
 fs/btrfs/Kconfig            |  22 +++
 fs/btrfs/compression.c      |   6 +-
 fs/btrfs/ctree.h            |  12 +-
 fs/btrfs/dir-item.c         |  10 +-
 fs/btrfs/disk-io.c          | 126 ---
 fs/btrfs/disk-io.h          |   2 -
 fs/btrfs/extent-tree.c      |  43 +++--
 fs/btrfs/file-item.c        |   9 +-
 fs/btrfs/free-space-cache.c |  15 +-
 fs/btrfs/hash.c             |  75 +++--
 fs/btrfs/hash.h             |  22 ++-
 fs/btrfs/inode-item.c       |   6 +-
 fs/btrfs/inode.c            |  16 +-
 fs/btrfs/props.c            |  37 -
 fs/btrfs/props.h            |   3 +-
 fs/btrfs/scrub.c            |  70 ++--
 fs/btrfs/send.c             |   7 +-
 fs/btrfs/super.c            |   9 +-
 fs/btrfs/tree-log.c         |   2 +-
 include/crypto/xxhash.h     | 209
 25 files changed, 974 insertions(+), 161 deletions(-)
 create mode 100644 crypto/xxhash.c
 create mode 100644 include/crypto/xxhash.h

-- 
1.8.1.4



Re: [RFC PATCH 0/3] Btrfs: add xxhash algorithm

2014-05-07 Thread Tomasz Torcz
On Wed, May 07, 2014 at 06:56:29PM +, Liu Bo wrote:
> xxHash is an extremely fast non-cryptographic hash algorithm, working at
> speeds close to RAM limits.[1]  xxhash is also a 32-bit hash, the same size
> as crc32.
>
> Here is the hash comparison extracted from the link[1]:
> (single thread, Windows Seven 32 bits, using Open Source's SMHasher on a
> Core 2 Duo @3GHz)
>
> Name      Speed       Q.Score   Author
> xxHash    5.4 GB/s    10
> CRC32     0.43 GB/s   9

  Core 2 Duo is an awfully old CPU. Since 2008, Intel CPUs have had a crc32
instruction, which hugely speeds up CRC operations.
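
To make concrete what is being compared, here is a bit-at-a-time software
sketch of CRC32C, the checksum btrfs uses by default. It is deliberately the
slowest possible form; the hardware crc32 instruction computes, as far as I
know, the same Castagnoli polynomial a word at a time. The function names
crc32c_update and crc32c_checksum are made up for this sketch and are not the
kernel's or btrfs-progs' API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the Castagnoli polynomial 0x1EDC6F41. */
#define CRC32C_POLY_REFLECTED 0x82F63B78U

/*
 * Raw update step with no initial or final inversion, so the caller
 * chooses the seed (hypothetical helper, for illustration only).
 */
uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	assert(p != NULL || len == 0);

	for (size_t i = 0; i < len; i++) {
		crc ^= p[i];
		for (int bit = 0; bit < 8; bit++) {
			/* Mask is all ones when the low bit is set. */
			uint32_t mask = 0U - (crc & 1U);

			crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & mask);
		}
	}
	return crc;
}

/* Standard CRC-32C: seed with all ones, invert the result. */
uint32_t crc32c_checksum(const void *data, size_t len)
{
	return ~crc32c_update(~(uint32_t)0, data, len);
}
```

The all-ones seed and the final inversion correspond to the "csum = ~(u32)0"
and "~cpu_to_le32(crc)" lines visible in the diffs earlier in this thread.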
 

-- 
Tomasz Torcz God, root, what's the difference?
xmpp: zdzich...@chrome.pl God is more forgiving.



Re: Using noCow with snapshots ?

2014-05-07 Thread Duncan
Russell Coker posted on Wed, 07 May 2014 15:36:15 +1000 as excerpted:

> How could BTRFS and a database fight about data recovery?
>
> BTRFS offers similar guarantees about data durability etc to other
> journalled filesystems and only differs by having checksums so that
> while a snapshot might have half the data that was written by an app you
> at least know that the half will be consistent.
>
> If you had database files on a separate subvol to the database log then
> you would be at risk of having problems making any sort of consistent
> snapshot (the Debian approach of /var/log/mysql and /var/lib/mysql is a
> bad idea). But there would be no difference with LVM snapshots in that
> regard.

Race conditions having to do with unsynced checkpoints, primarily.  And 
it's actually the btrfs checksumming that seems to create the problem.

The symptom being reported (tho I can say I've not seen further reports 
recently, maybe it's fixed now) was that the checksummed values btrfs 
restored as correct were considered corrupted by the database or vm.  
If the checksums checked out after btrfs did its replay (as they did or 
btrfs would error on access), but the databases and VMs were still 
reporting corruption, then the explanation that was left was that the 
btrfs replay and checksum validation was screwing up the application's 
own checksumming validation, which could be explained if the two were 
sufficiently out of sync that btrfs fixing its own view was actually 
breaking the view as seen by the data validating app.

Tho as I said I've not seen that sort of report in several kernel cycles 
now.  But I'm not sure whether that's because the issues have been fixed 
or for some other reason (maybe everybody experiencing the problem gave 
up and switched to some other filesystem now, and the message is out 
there well enough that new people see it before they experience and 
report the same thing, or similar but everybody's switched to NOCOW now 
and knows not to do snapshotting on the NOCOW files, or...).

Regardless, NOCOW and not doing snapshotting (because it triggers COW 
anyway) on gig-plus internal-write files remains a very good idea.  
(Also, quotas and quota sequence numbers play into the combinational 
explosion problem along with snapshot-aware-defrag, too.  See the writeup 
on that that Dave wrote while he was on paternity leave.)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



[PATCH] Btrfs-progs: check, fix csum check in the presence of non-inlined refs

2014-05-07 Thread Filipe David Borba Manana
When we have non-inlined extent references, we were failing to find the
corresponding extent item for an existing csum item in the csum tree.

Reproducer:

   mkfs.btrfs -f /dev/sdd
   mount /dev/sdd /mnt

   xfs_io -f -c "falloc 780366 135302" /mnt/foo
   xfs_io -c "falloc 327680 151552" /mnt/foo
   xfs_io -c "pwrite -S 0xff -b 131072 0 131072" /mnt/foo
   sync

   for i in `seq 1 40`; do btrfs subvolume snapshot /mnt /mnt/snap$i ; done
   umount /mnt

   btrfs check /dev/sdd

The check command exited with status 1 and the following output:

   Checking filesystem on /dev/sdd
   UUID: 2416ab5f-9d71-457e-bb13-a27d4f6b399a
   checking extents
   checking free space cache
   checking fs roots
   checking csums
   There are no extents for csum range 12980224-12984320
   Csum exists for 12980224-12984320 but there is no extent record
   found 1388544 bytes used err is 1
   total csum bytes: 132
   total tree bytes: 704512
   total fs tree bytes: 573440
   total extent tree bytes: 16384
   btree space waste bytes: 564479
   file data blocks allocated: 19341312
referenced 14606336
   Btrfs v3.14.1-94-g80597e7

After this change it no longer erroneously reports a missing extent for the
csum item and exits with a status of 0.

Also added missing btrfs_prev_leaf() return value checks, as we were ignoring
errors and non-existence of left siblings completely.
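
The lookup this fix relies on can be sketched in isolation: search for the
largest key not above (bytenr, EXTENT_ITEM, -1), then keep stepping back while
the item found is of a higher type than an extent item, which is what happens
when separate ref items follow the extent item. The sketch below uses a flat
sorted array and a linear scan in place of the real b-tree search, the struct
and function names are made up, and the key type values are as I remember
them from ctree.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Key type values as remembered from ctree.h (treat as assumptions). */
#define EXTENT_ITEM_KEY     168
#define EXTENT_DATA_REF_KEY 178
#define SHARED_DATA_REF_KEY 184

/* Simplified stand-in for a btrfs key: ordered by objectid, type, offset. */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

static int sketch_key_cmp(const struct sketch_key *a,
			  const struct sketch_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Return the index of the extent item at or before bytenr in a sorted
 * key array, or -1 if there is none.
 */
long find_extent_item(const struct sketch_key *keys, size_t nr,
		      uint64_t bytenr)
{
	struct sketch_key target = { bytenr, EXTENT_ITEM_KEY, UINT64_MAX };
	long slot = -1;

	/* Linear scan standing in for the tree search plus one step back. */
	for (size_t i = 0; i < nr && sketch_key_cmp(&keys[i], &target) <= 0; i++)
		slot = (long)i;

	/* Skip ref items (and anything else sorted after extent items). */
	while (slot >= 0 && keys[slot].type > EXTENT_ITEM_KEY)
		slot--;

	return slot;
}
```

With the old check for only the block group item type, a lookup that landed
on one of the ref items would have stopped there and never reached the extent
item in front of it.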

Signed-off-by: Filipe David Borba Manana <fdman...@gmail.com>
---
 cmds-check.c | 38 +++---
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 103efc5..18612c8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3650,8 +3650,7 @@ static int check_extent_exists(struct btrfs_root *root, u64 bytenr,
 
 	key.objectid = bytenr;
 	key.type = BTRFS_EXTENT_ITEM_KEY;
-	key.offset = 0;
-
+	key.offset = (u64)-1;
 
 again:
 	ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
@@ -3661,10 +3660,17 @@ again:
 		btrfs_free_path(path);
 		return ret;
 	} else if (ret) {
-		if (path->slots[0])
+		if (path->slots[0] > 0) {
 			path->slots[0]--;
-		else
-			btrfs_prev_leaf(root, path);
+		} else {
+			ret = btrfs_prev_leaf(root, path);
+			if (ret < 0) {
+				goto out;
+			} else if (ret > 0) {
+				ret = 0;
+				goto out;
+			}
+		}
 	}
 
 	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
@@ -3674,13 +3680,22 @@ again:
 	 * bytenr, so walk back one more just in case.  Dear future traveler,
 	 * first congrats on mastering time travel.  Now if it's not too much
 	 * trouble could you go back to 2006 and tell Chris to make the
-	 * BLOCK_GROUP_ITEM_KEY lower than the EXTENT_ITEM_KEY please?
+	 * BLOCK_GROUP_ITEM_KEY (and BTRFS_*_REF_KEY) lower than the
+	 * EXTENT_ITEM_KEY please?
 	 */
-	if (key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-		if (path->slots[0])
+	while (key.type > BTRFS_EXTENT_ITEM_KEY) {
+		if (path->slots[0] > 0) {
 			path->slots[0]--;
-		else
-			btrfs_prev_leaf(root, path);
+		} else {
+			ret = btrfs_prev_leaf(root, path);
+			if (ret < 0) {
+				goto out;
+			} else if (ret > 0) {
+				ret = 0;
+				goto out;
+			}
+		}
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
 	}
 
 	while (num_bytes) {
@@ -3752,7 +3767,8 @@ again:
 	}
 	ret = 0;
 
-	if (num_bytes) {
+out:
+	if (num_bytes && !ret) {
 		fprintf(stderr, "There are no extents for csum range "
 			"%Lu-%Lu\n", bytenr, bytenr+num_bytes);
 		ret = 1;
-- 
1.9.1



btrfs snapshot sizes

2014-05-07 Thread Marc MERLIN
So have others found a good way to have an idea about how much space is
taken by each snapshot?

I've tried quota trees, but I'm not sure how to read the output, or if it's
correct (including the negative numbers some have mentioned). Are there
other options?

I think the main problem is that the shared data field is not working,
making it harder to know which blocks are only used in a given snapshot.


subvol                                    group   total     unshared
---------------------------------------------------------------------
backup/debian32                           0/262   403.84G   -5.46G
backup/debian32_daily_20140504_00:03:01   0/3660  446.45G    0.00G
backup/debian32_daily_20140505_00:03:01   0/3687  431.11G    0.00G
backup/debian32_daily_20140506_00:03:00   0/3705  420.83G    0.00G
backup/debian32_daily_20140507_00:03:01   0/3724  411.87G    0.00G
backup/debian32_weekly_20140504_00:04:01  0/3675  446.45G    0.00G

backup/debian64                           0/263   855.97G   -1.50G
backup/debian64_daily_20140504_00:03:01   0/3662  860.19G    0.00G
backup/debian64_daily_20140505_00:03:01   0/3690  859.32G    0.00G
backup/debian64_daily_20140506_00:03:00   0/3707  858.15G    0.00G
backup/debian64_daily_20140507_00:03:01   0/3726  857.47G    0.00G
backup/debian64_weekly_20140504_00:04:01  0/3676  860.19G    0.00G

backup/ubuntu                             0/264   360.28G    0.00G
backup/ubuntu_daily_20140504_00:03:01     0/3664  364.53G    0.00G
backup/ubuntu_daily_20140505_00:03:01     0/3692  362.44G    0.00G
backup/ubuntu_daily_20140506_00:03:00     0/3709  360.91G    0.00G
backup/ubuntu_daily_20140507_00:03:01     0/3727  360.33G    0.00G
backup/ubuntu_weekly_20140504_00:04:01    0/3677  364.53G    0.00G


Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: How does Suse do live filesystem revert with btrfs?

2014-05-07 Thread Duncan
Marc MERLIN posted on Wed, 07 May 2014 01:56:12 -0700 as excerpted:

 On Tue, May 06, 2014 at 04:26:48PM +, Duncan wrote:
 Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted:
 
  
  Aaah, right, you can use a script to see the file differences between
  two snapshots, and then restore that with reflink if you can truly
  get a list of all changed files.
  However, that is indeed not atomic at all, even if faster than rsync.
 
 Would send/receive help in such a script?
 
 Not really, you still end up with a new snapshot that you can't live
 switch to.
 
 It's really either 1) reboot 2) use cp --reflink to copy a list of
 changed files (as well as rm to delete the ones that were removed).

What I meant was... use send/receive locally, in place of the
cp --reflink.

But now that I think of it, at least in the normal sense that wouldn't 
work, since send is like diff and receive like patch, but what would be 
needed would actually be an option similar to patch --reverse.  With 
something like that, you could (in theory, in practice it'd be racy if 
other running apps were writing to it too) reverse the live subvolume 
to the state of the snapshot.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: How does Suse do live filesystem revert with btrfs?

2014-05-07 Thread Marc MERLIN
On Wed, May 07, 2014 at 11:35:52AM +, Duncan wrote:
 Marc MERLIN posted on Wed, 07 May 2014 01:56:12 -0700 as excerpted:
 
  On Tue, May 06, 2014 at 04:26:48PM +, Duncan wrote:
  Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted:
  
   
   Aaah, right, you can use a script to see the file differences between
   two snapshots, and then restore that with reflink if you can truly
   get a list of all changed files.
   However, that is indeed not atomic at all, even if faster than rsync.
  
  Would send/receive help in such a script?
  
  Not really, you still end up with a new snapshot that you can't live
  switch to.
  
  It's really either 1) reboot 2) use cp --reflink to copy a list of
  changed files (as well as rm to delete the ones that were removed).
 
 What I meant was... use send/receive locally, in place of the
 cp --reflink.

This won't work since it can only work on another read-only subvolume.

But you could use btrfs send -p to get a list of changes between 2
snapshots, decode that (without btrfs receive) just to spit out the
names of the files that changed or got deleted.
It would be wasteful since it would cause all the changed blocks to be
read on the source, but still better than nothing.

Really, we'd just need a btrfs --send --dry-run -v -p vol1 vol2 
which would spit out a list of the file ops it would do.

That'd be enough to simply grep out the deletes, do them locally and
then use cp --reflink on everything else.
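
The "grep out the deletes" step could be sketched as a small classifier over
operation names. Everything here is hypothetical: no such dry-run output
exists in this thread, and the operation names ("unlink", "rmdir", "write",
"clone", "truncate") are only guesses at what such a listing might print.

```c
#include <assert.h>
#include <string.h>

/* What a revert script would do for one listed operation. */
enum revert_action {
	REVERT_DELETE,		/* remove the path from the live subvolume */
	REVERT_REFLINK_COPY,	/* cp --reflink the path from the snapshot */
	REVERT_IGNORE		/* nothing to do for this operation */
};

/*
 * Map a (hypothetical) operation name from a dry-run listing to the
 * action a revert script would take.
 */
enum revert_action classify_op(const char *op)
{
	assert(op != NULL);

	if (strcmp(op, "unlink") == 0 || strcmp(op, "rmdir") == 0)
		return REVERT_DELETE;

	if (strcmp(op, "write") == 0 || strcmp(op, "clone") == 0 ||
	    strcmp(op, "truncate") == 0)
		return REVERT_REFLINK_COPY;

	return REVERT_IGNORE;
}
```

A real script would also have to handle renames and metadata-only changes,
and it would still not be atomic, as noted earlier in the thread.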

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


[PATCH] btrfs/035: update clone test to expect EOPNOTSUPP

2014-05-07 Thread David Disseldorp
With kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, the first
clone-range overwrite attempt now fails with EOPNOTSUPP, rather than
tripping a Btrfs BUG_ON().

This test now trips a new Btrfs bug, in which EIO is returned for
subsequent reads following the second clone range ioctl.

Signed-off-by: David Disseldorp <dd...@suse.de>
---
 tests/btrfs/035 | 11 +++
 tests/btrfs/035.out |  5 +
 2 files changed, 16 insertions(+)

diff --git a/tests/btrfs/035 b/tests/btrfs/035
index 6808179..c9530f6 100755
--- a/tests/btrfs/035
+++ b/tests/btrfs/035
@@ -57,21 +57,32 @@ src_str=aa
 echo -n $src_str > $SCRATCH_MNT/src
 
 $CLONER_PROG $SCRATCH_MNT/src  $SCRATCH_MNT/src.clone1
+cat $SCRATCH_MNT/src.clone1
+echo
 
 src_str=bbcc
 
 echo -n $src_str > $SCRATCH_MNT/src
 
 $CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone2
+cat $SCRATCH_MNT/src.clone2
+echo
 
+# Prior to kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, this clone
+# resulted in a BUG_ON in __btrfs_drop_extents(). The kernel now returns
+# EOPNOTSUPP up to userspace.
 snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone1 | awk '{print $5}'`
 echo "attempting ioctl (src.clone1 src)"
 $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
$SCRATCH_MNT/src.clone1 $SCRATCH_MNT/src
+cat $SCRATCH_MNT/src
+echo
 
 snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone2 | awk '{print $5}'`
 echo "attempting ioctl (src.clone2 src)"
 $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
$SCRATCH_MNT/src.clone2 $SCRATCH_MNT/src
+# BUG: subsequent access attempts currently result in EIO...
+cat $SCRATCH_MNT/src
 
 status=0 ; exit
diff --git a/tests/btrfs/035.out b/tests/btrfs/035.out
index f86cadf..0ea2c4f 100644
--- a/tests/btrfs/035.out
+++ b/tests/btrfs/035.out
@@ -1,3 +1,8 @@
 QA output created by 035
+aa
+bbcc
 attempting ioctl (src.clone1 src)
+clone failed: Operation not supported
+bbcc
 attempting ioctl (src.clone2 src)
+bbcc
-- 
1.8.4.5



Re: Using mount -o bind vs mount -o subvol=vol

2014-05-07 Thread Duncan
Marc MERLIN posted on Wed, 07 May 2014 03:55:51 -0700 as excerpted:

 subvolumes are also used as units of backup for btrfs send.

Hmm, yes.  Thanks.  I don't use send/receive here so forgot about that.

 So my vote would be, for example (modified slightly for posting from my
 own mounts):
 
 mount /dev/sda5 /
 mount /dev/sda4 /var/log
 mount /dev/sda6 /home
 
 On my laptop: [snip]

FWIW, those were examples.  I actually have more.

 But to each their own :)

Indeed.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: How does btrfs fi show show full?

2014-05-07 Thread Duncan
On Wed, 7 May 2014 04:30:30 -0700
Marc MERLIN m...@merlins.org wrote:

> > -dusage=85 balances all chunks that are up to 85% full. The higher the
> > number, the more work that needs to be done.
>
> Aah, right. I see why it's more work. =20 only makes it process the
> few chunks that are up to 20% full which won't be many if your FS
> is almost full.

It's actually even less work than you imply.

Balance only has to rewrite the actual content, not the empty space in
the chunk.  So 20% full means it's only writing 20% of the
(possible/full) content, thus only taking 20% of the time to rewrite
that chunk that it'd take to rewrite a full chunk.

Which is why a usage=5 or 20 goes so fast, even if the system's
actually mostly empty but is all allocated.  With a 20% full chunk it's
rewriting five chunks into one; at 5%, it's rewriting 20 chunks into
one.  That goes pretty fast, even if there's a bunch of them to write!
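
The arithmetic above can be written down as two small helpers. This is only a
sketch: the function names are made up, and the 1024 MiB chunk size in the
usage note is an assumed nominal data chunk size, not something stated in
this thread.

```c
#include <assert.h>

/* How many chunks at usage_pct percent full fit into one full chunk. */
unsigned int chunks_merged_into_one(unsigned int usage_pct)
{
	assert(usage_pct > 0 && usage_pct <= 100);
	return 100 / usage_pct;
}

/*
 * Data actually rewritten, in MiB, when balancing nr_chunks chunks of
 * chunk_mib MiB each that are usage_pct percent full. Only the used
 * part of each chunk is rewritten, not the empty space.
 */
unsigned long long rewritten_mib(unsigned int nr_chunks,
				 unsigned int chunk_mib,
				 unsigned int usage_pct)
{
	assert(usage_pct <= 100);
	return (unsigned long long)nr_chunks * chunk_mib * usage_pct / 100;
}
```

For example, twenty assumed 1024 MiB chunks at 5% usage give
rewritten_mib(20, 1024, 5), which is 1024 MiB: the same amount of writing as
one full chunk, while freeing nineteen chunks' worth of allocation.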

-- 
Duncan - No HTML messages please, as they are filtered as spam.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman


Re: btrfs issues in 3.14

2014-05-07 Thread Kenny MacDermid
On Wed, May 7, 2014 at 9:35 AM, Kenny MacDermid
kenny.macder...@gmail.com wrote:
 On Tue, May 6, 2014 at 11:22 PM, Liu Bo bo.li@oracle.com wrote:

 What does sysrq+w say when the hang happens?

 The whole system isn't hung, I may have explained that wrong. The
 system will hang if I try to shutdown, and the process will hang if I
 try to kill -9 it.

 It looks like the browser is in this state currently so I did an 'echo
 w > /proc/sysrq-trigger' and have attached the full dmesg with the
 browser issues and the output.

I had to hard reboot to clear that issue, and I decided to do another
'btrfs check' while /home was unmounted. It generated the following
output:

checking extents
checking free space cache
Wanted bytes 45056, found 32768 for off 63805808640
Wanted bytes 90016, found 32768 for off 63805808640
cache appears valid but isnt 62843256832
Checking filesystem on //dev/mapper/home
UUID: 9a60a25f-eeb4-494c-b1af-ebd8e4f79b6b
found 13672418478 bytes used err is -22
total csum bytes: 72089212
total tree bytes: 906100736
total fs tree bytes: 808370176
total extent tree bytes: 18153472
btree space waste bytes: 116247440
file data blocks allocated: 101046853632
 referenced 73680674816
Btrfs v3.14.1

This is on the new filesystem. I redid the dmcrypt and the lvm lv when
I recreated the filesystem as well, so it's less than a week old.
Before rebuilding, the old one was telling me:

Checking filesystem on /dev/mapper/home
UUID: 4f5d7a10-d003-48a7-a901-bf22d534888f
free space inode generation (0) did not match free space cache
generation (115200)
found 29963117667 bytes used err is 1
total csum bytes: 63740440
total tree bytes: 745504768
total fs tree bytes: 624951296
total extent tree bytes: 36749312
btree space waste bytes: 119018687
file data blocks allocated: 181026942976
 referenced 73759866880
Btrfs v0.20-rc1-358-g194aa4a-dirty

and

checking extents
checking free space cache
checking fs roots
root 257 inode 29647 errors 200, dir isize wrong
root 257 inode 391917 errors 200, dir isize wrong
root 257 inode 497392 errors 410, odd dir item, nbytes wrong
Checking filesystem on /dev/mapper/home
UUID: 4f5d7a10-d003-48a7-a901-bf22d534888f
free space inode generation (0) did not match free space cache
generation (115200)
found 31310902624 bytes used err is 1
total csum bytes: 63579480
total tree bytes: 743342080
total fs tree bytes: 623198208
total extent tree bytes: 36601856
btree space waste bytes: 118906643
file data blocks allocated: 180831965184
 referenced 73631731712
Btrfs v3.14


Re: Please review and comment, dealing with btrfs full issues

2014-05-07 Thread David Sterba
On Tue, May 06, 2014 at 05:43:24PM +0100, Hugo Mills wrote:
  So in my case when I hit that case, I had to use dusage=0 to recover.
  Anything above that just didn't work.
  
  I suspect when using more than zero the first chunk it wanted to balance
  wasn't empty - and it had nowhere to put it. Then when you did dusage=0, it
  didn't need a destination for the data. That is actually an interesting
  workaround for that case.
 
I've actually looked into implementing a smallest=n filter that
 would take only the n least-full chunks (by fraction) and balance
 those. However, it's not entirely trivial to do efficiently with the
 current filtering code.

I've prototyped something similar, to limit the number of balanced
chunks by a number. To achieve n least-full chunks would be an
iterative process of increasing the usage filter and limiting the number
of chunks until the desired N is reached.

N=n
F=0
while (N > 0) {
balance -dusage=F,limit=N
N -= number of balanced chunks
F++
}

The patch is in branch dev/balance-limit in my git repos.

We can then implement the n-least-full as a synthetic filter from
userspace.
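
A rough Python sketch of how that loop behaves, assuming each
"balance -dusage=F,limit=N" pass relocates at most N chunks whose usage is
at most F percent, visits chunks in a fixed scan order, and that relocated
chunks drop out of later passes. The function and its arguments are
hypothetical and nothing here calls btrfs:

```python
def n_least_full(usage_by_chunk, n, thresholds=range(0, 101)):
    """Emulate the iterative usage+limit loop.

    usage_by_chunk maps a chunk id to its usage in percent; the dict order
    stands in for the kernel's scan order (an assumption of this model).
    Returns the chunk ids in the order they would be balanced.
    """
    remaining = dict(usage_by_chunk)
    balanced = []
    for f in thresholds:
        if n <= 0:
            break
        # One pass of "balance -dusage=f,limit=n".
        matching = [cid for cid, usage in remaining.items() if usage <= f]
        taken = matching[:n]
        for cid in taken:
            del remaining[cid]
        balanced.extend(taken)
        n -= len(taken)  # N -= number of balanced chunks
    return balanced


chunks = {"a": 50, "b": 3, "c": 0, "d": 17, "e": 3}
print(n_least_full(chunks, 3))
```

With a step of 1 the passes pick the empty chunk first and then the two
chunks at 3%, at the cost of one scan over all chunks per threshold.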


Smallest-n balance filter (was Re: Please review and comment, dealing with btrfs full issues)

2014-05-07 Thread Hugo Mills
On Wed, May 07, 2014 at 04:09:27PM +0200, David Sterba wrote:
 On Tue, May 06, 2014 at 05:43:24PM +0100, Hugo Mills wrote:
   So in my case when I hit that case, I had to use dusage=0 to recover.
   Anything above that just didn't work.
   
   I suspect when using more than zero the first chunk it wanted to balance
   wasn't empty - and it had nowhere to put it. Then when you did dusage=0, 
   it
   didn't need a destination for the data. That is actually an interesting
   workaround for that case.
  
 I've actually looked into implementing a smallest=n filter that
  would take only the n least-full chunks (by fraction) and balance
  those. However, it's not entirely trivial to do efficiently with the
  current filtering code.
 
 I've prototyped something similar, to limit the number of balanced
 chunks by a number. To achieve n least-full chunks would be an
 iterative process of increasing the usage filter and limiting the number
 of chunks until the desired N is reached.
 
 N=n
 F=0
 while (N > 0) {
   balance -dusage=F,limit=N
   N -= number of balanced chunks
   F++
 }
 
 The patch is in branch dev/balance-limit in my git repos.
 
 We can then implement the n-least-full as a synthetic filter from
 userspace.

   This is inefficient, because we've got an O(m) pass through all the
chunks for every call. If we reduce the number of calls by increasing
the increment of F (F+=3, say), then we risk overbalancing, or missing
out on smaller chunks we could have balanced earlier. From a practical
point of view, it may make little difference, but the computer
scientist in me is going ew.

   The other method, for small n only, would be to construct the list
first, an O(m log n) operation for a filesystem of size m, requiring
O(n) storage, and then iterate over just those chunks. The problem
with that is the storage requirements, and keeping track of the state
of the list for restart purposes. [actually, there's probably an O(m)
algorithm to get the n smallest items, but those are a bit
complicated]
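
   For illustration, the list-first method can be sketched with a
bounded heap. This is only a userspace model in Python, assuming the
per-chunk usage is already available as a mapping; least_full_chunks and
its arguments are made-up names, not anything in the balance code:

```python
import heapq


def least_full_chunks(usage_by_chunk, n):
    """Return the n least-full chunks as (chunk_id, usage) pairs,
    ordered from emptiest to fullest.

    heapq.nsmallest makes a single pass over all m entries while keeping
    at most n candidates, i.e. O(m log n) time and O(n) storage.
    """
    return heapq.nsmallest(n, usage_by_chunk.items(), key=lambda item: item[1])


chunks = {"a": 50, "b": 3, "c": 0, "d": 17}
print(least_full_chunks(chunks, 2))
```

   The restart problem remains: the returned list is a snapshot, so it
would have to be stored or recomputed if the balance is interrupted.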

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- A diverse working environment:  Di longer you vork here, di ---   
 verse it gets.  




Re: Smallest-n balance filter (was Re: Please review and comment, dealing with btrfs full issues)

2014-05-07 Thread David Sterba
On Wed, May 07, 2014 at 03:23:01PM +0100, Hugo Mills wrote:
  N=n
  F=0
  while (N > 0) {
  balance -dusage=F,limit=N
  N -= number of balanced chunks
  F++
  }
  
  The patch is in branch dev/balance-limit in my git repos.
  
  We can then implement the n-least-full as a synthetic filter from
  userspace.
 
This is inefficient, because we've got an O(m) pass through all the
 chunks for every call. If we reduce the number of calls by increasing
 the increment of F (F+=3, say), then we risk overbalancing, or missing
 out on smaller chunks we could have balanced earlier. From a practical
 point of view, it may make little difference, but the computer
 scientist in me is going ew.

I'm trying to find the practical way, no doubts about the inefficiencies.
The +1 increment was meant to outline the idea, I'm usually using the
sequence 0, 1, 5, 10, [etc +10].

I think we can afford some inaccuracy, I as a user would not mind if
there's some overbalancing (within a sane margin).
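
To make the trade-off concrete, here is a small Python model of the
coarser sequence. It assumes the same simplified behaviour as the
pseudocode (each pass takes at most N matching chunks in scan order); the
helper names are hypothetical:

```python
def usage_steps(maximum=100):
    """Yield the thresholds 0, 1, 5, 10, 20, ... up to `maximum`."""
    yield from (0, 1, 5)
    yield from range(10, maximum + 1, 10)


def balance_in_steps(usage_by_chunk, n, steps):
    """Pick up to n chunks by raising the usage threshold step by step.

    The dict order stands in for the scan order (a model assumption).
    """
    remaining = dict(usage_by_chunk)
    picked = []
    for f in steps:
        if n <= 0:
            break
        taken = [cid for cid, usage in remaining.items() if usage <= f][:n]
        for cid in taken:
            del remaining[cid]
        picked.extend(taken)
        n -= len(taken)
    return picked


chunks = {"a": 9, "b": 6, "c": 0, "d": 7}
# The 10% step sees a, b and d at once and takes the first in scan order.
print(balance_in_steps(chunks, 2, usage_steps()))
```

In this example the coarse step picks the 9% chunk instead of the 6% one,
which is the kind of inaccuracy meant above: fewer passes, slightly more
data moved.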

The other method, for small n only, would be to construct the list
 first, an O(m log n) operation for a filesystem of size m, requiring
 O(n) storage, and then iterate over just those chunks.

The size of filesystem matters, but the scanning phase of balance uses
in-memory structures and this should not be that bad for terabyte-sized
filesystems (i.e. the number of block groups will be some thousands).

Possibly we can stop looking for new chunks in the first phase of
balance when there are already N candidate chunks found, and process them.

 The problem with that is the storage requirements, and keeping track
 of the state of the list for restart purposes. [actually, there's
 probably an O(m) algorithm to get the n smallest items, but those are
 a bit complicated]

If the filesystem is under load, the chunks' usage may increase or
decrease in time and as we know, balance takes time, so the
chunk-todo-list may look different when next one is about to be
processed.

But yeah, this could be a cheaper check to skip a given chunk if it's
out of the filter criteria than going through the whole list again.


Re: [PATCH] Btrfs: faster/more efficient insertion of file extent items

2014-05-07 Thread Liu Bo
On Sun, Feb 09, 2014 at 11:45:12PM +, Filipe David Borba Manana wrote:
 This is an extension to my previous commit titled:
 
   Btrfs: faster file extent item replace operations
   (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9)
 
 Instead of inserting the new file extent item if we deleted existing
 file extent items covering our target file range, also allow to insert
 the new file extent item if we didn't find any existing items to delete
 and replace_extent != 0, since in this case our caller would do another
 tree search to insert the new file extent item anyway, therefore just
 combine the two tree searches into a single one, saving cpu time, reducing
 lock contention and reducing btree node/leaf COW operations.
 
 This covers the case where applications keep doing tail append writes to
 files, which for example is the case of Apache CouchDB (its database and
 view index files are always open with O_APPEND).

(I'm tracking a bug which is very hard to reproduce, and the stack seems to
point to this area.)

Even though I know that this has been merged, I still have to say that it
makes the code very hard to maintain.

__btrfs_drop_extents() has been one of the most complex functions since
it was written, and now it's become even more complex!

I'm not sure whether the performance gained justifies that kind of
complexity. To be honest, try to ask yourself how much time you'll spend
re-understanding the code and all the details.

thanks,
-liubo

 
 Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
 ---
  fs/btrfs/file.c |   52 ++--
  1 file changed, 30 insertions(+), 22 deletions(-)
 
 diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
 index 0165b86..006af2f 100644
 --- a/fs/btrfs/file.c
 +++ b/fs/btrfs/file.c
 @@ -720,7 +720,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
  	if (drop_cache)
  		btrfs_drop_extent_cache(inode, start, end - 1, 0);
  
 -	if (start >= BTRFS_I(inode)->disk_i_size)
 +	if (start >= BTRFS_I(inode)->disk_i_size && !replace_extent)
  		modify_tree = 0;
  
  	while (1) {
 @@ -938,34 +938,42 @@ next_slot:
  		 * Set path->slots[0] to first slot, so that after the delete
  		 * if items are move off from our leaf to its immediate left or
  		 * right neighbor leafs, we end up with a correct and adjusted
 -		 * path->slots[0] for our insertion.
 +		 * path->slots[0] for our insertion (if replace_extent != 0).
  		 */
  		path->slots[0] = del_slot;
  		ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
  		if (ret)
  			btrfs_abort_transaction(trans, root, ret);
 +	}
  
 -		leaf = path->nodes[0];
 -		/*
 -		 * leaf eb has flag EXTENT_BUFFER_STALE if it was deleted (that
 -		 * is, its contents got pushed to its neighbors), in which case
 -		 * it means path->locks[0] == 0
 -		 */
 -		if (!ret && replace_extent && leafs_visited == 1 &&
 -		    path->locks[0] &&
 -		    btrfs_leaf_free_space(root, leaf) >=
 -		    sizeof(struct btrfs_item) + extent_item_size) {
 -
 -			key.objectid = ino;
 -			key.type = BTRFS_EXTENT_DATA_KEY;
 -			key.offset = start;
 -			setup_items_for_insert(root, path, &key,
 -					       &extent_item_size,
 -					       &extent_item_size,
 -					       sizeof(struct btrfs_item) +
 -					       extent_item_size, 1);
 -			*key_inserted = 1;
 +	leaf = path->nodes[0];
 +	/*
 +	 * If btrfs_del_items() was called, it might have deleted a leaf, in
 +	 * which case it unlocked our path, so check path->locks[0] matches a
 +	 * write lock.
 +	 */
 +	if (!ret && replace_extent && leafs_visited == 1 &&
 +	    (path->locks[0] == BTRFS_WRITE_LOCK_BLOCKING ||
 +	     path->locks[0] == BTRFS_WRITE_LOCK) &&
 +	    btrfs_leaf_free_space(root, leaf) >=
 +	    sizeof(struct btrfs_item) + extent_item_size) {
 +
 +		key.objectid = ino;
 +		key.type = BTRFS_EXTENT_DATA_KEY;
 +		key.offset = start;
 +		if (!del_nr && path->slots[0] < btrfs_header_nritems(leaf)) {
 +			struct btrfs_key slot_key;
 +
 +			btrfs_item_key_to_cpu(leaf, &slot_key, path->slots[0]);
 +			if (btrfs_comp_cpu_keys(&key, &slot_key) > 0)
 +				path->slots[0]++;
  		}
 +		setup_items_for_insert(root, path, &key,
 +				       &extent_item_size,
 +				       &extent_item_size,
 +				       sizeof(struct btrfs_item) +
 + 

Re: [PATCH] Btrfs: faster/more efficient insertion of file extent items

2014-05-07 Thread Josef Bacik

On 05/07/2014 11:21 AM, Liu Bo wrote:

On Sun, Feb 09, 2014 at 11:45:12PM +, Filipe David Borba Manana wrote:

This is an extension to my previous commit titled:

   Btrfs: faster file extent item replace operations
   (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9)

Instead of inserting the new file extent item if we deleted existing
file extent items covering our target file range, also allow to insert
the new file extent item if we didn't find any existing items to delete
and replace_extent != 0, since in this case our caller would do another
tree search to insert the new file extent item anyway, therefore just
combine the two tree searches into a single one, saving cpu time, reducing
lock contention and reducing btree node/leaf COW operations.

This covers the case where applications keep doing tail append writes to
files, which for example is the case of Apache CouchDB (its database and
view index files are always open with O_APPEND).


(I'm tracking a bug which is very hard to reproduce, and the stack seems to
point to this area.)

Even though I know that this has been merged, I still have to say that it
makes the code very hard to maintain.

__btrfs_drop_extents() has been one of the most complex functions since
it was written, and now it's become even more complex!

I'm not sure whether the performance gained justifies that kind of
complexity. To be honest, try to ask yourself how much time you'll spend
re-understanding the code and all the details.



It's just a complex operation anyway, so really it's going to suck no 
matter what.  What I would like to see is some sanity tests committed 
that test the various corner cases of btrfs_drop_extents so when we make 
these sort of changes we can be sure we're not breaking anything.


So in fact that's the new requirement, whoever wants to touch 
btrfs_drop_extents next has to make sanity tests for it first, and then 
they can do what they want, this includes cleaning it up.  Thanks,


Josef





[PATCH] btrfs-progs: balance filter: add limit of processed chunks

2014-05-07 Thread David Sterba
Add more control to the balance behaviour.

The usage filter may not be fine-grained enough and can lead to moving too
many chunks at once. Another example use is in connection with the
drange+devid or vrange filters, which allow working with a specific chunk
or even with a chunk on a given device.

The limit filter applies last; a value of 0 means no limiting.

CC: Ilya Dryomov idryo...@gmail.com
CC: Hugo Mills h...@carfax.org.uk
Signed-off-by: David Sterba dste...@suse.cz
---
 cmds-balance.c | 14 ++
 ioctl.h|  4 +++-
 volumes.h  |  1 +
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/cmds-balance.c b/cmds-balance.c
index 8a743ecabd33..5de51bd463c4 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -218,6 +218,18 @@ static int parse_filters(char *filters, struct btrfs_balance_args *args)
 			args->flags |= BTRFS_BALANCE_ARGS_CONVERT;
 		} else if (!strcmp(this_char, "soft")) {
 			args->flags |= BTRFS_BALANCE_ARGS_SOFT;
+		} else if (!strcmp(this_char, "limit")) {
+			if (!value || !*value) {
+				fprintf(stderr,
+					"the limit filter requires an argument\n");
+				return 1;
+			}
+			if (parse_u64(value, &args->limit)) {
+				fprintf(stderr, "Invalid limit argument: %s\n",
+				       value);
+				return 1;
+			}
+			args->flags |= BTRFS_BALANCE_ARGS_LIMIT;
 		} else {
 			fprintf(stderr, "Unrecognized balance option '%s'\n",
 				this_char);
@@ -252,6 +264,8 @@ static void dump_balance_args(struct btrfs_balance_args *args)
 		printf(", vrange=%llu..%llu",
 		       (unsigned long long)args->vstart,
 		       (unsigned long long)args->vend);
+	if (args->flags & BTRFS_BALANCE_ARGS_LIMIT)
+		printf(", limit=%llu", (unsigned long long)args->limit);
 
 	printf("\n");
 }
diff --git a/ioctl.h b/ioctl.h
index 9627e8d1bac6..f0fc06086c3e 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -194,7 +194,9 @@ struct btrfs_balance_args {
 
__u64 flags;
 
-   __u64 unused[8];
+   __u64 limit;
+
+   __u64 unused[7];
 } __attribute__ ((__packed__));
 
 struct btrfs_balance_progress {
diff --git a/volumes.h b/volumes.h
index b1ff3d04f931..8405aef2cc0a 100644
--- a/volumes.h
+++ b/volumes.h
@@ -130,6 +130,7 @@ struct map_lookup {
 #define BTRFS_BALANCE_ARGS_DEVID	(1ULL << 2)
 #define BTRFS_BALANCE_ARGS_DRANGE	(1ULL << 3)
 #define BTRFS_BALANCE_ARGS_VRANGE	(1ULL << 4)
+#define BTRFS_BALANCE_ARGS_LIMIT	(1ULL << 5)
 
 /*
  * Profile changing flags.  When SOFT is set we won't relocate chunk if
-- 
1.9.0



[PATCH] btrfs: balance filter: add limit of processed chunks

2014-05-07 Thread David Sterba
This started as a debugging helper, to watch the effects of converting
between raid levels on multiple devices, but could be useful standalone.

In my case the usage filter was not fine-grained enough and led to
converting too many chunks at once. Another example use is in connection
with the drange+devid or vrange filters, which allow working with a
specific chunk or even with a chunk on a given device.

The limit filter applies last; a value of 0 means no limiting.

CC: Ilya Dryomov idryo...@gmail.com
CC: Hugo Mills h...@carfax.org.uk
Signed-off-by: David Sterba dste...@suse.cz
---

The name 'limit' should resemble the meaning from SQL SELECT.

Though it may not be that useful on its own, we can use it as a building block
for more complex filters.
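
As a rough illustration of "applies last", here is a small Python model
of the filter chain. It is only a sketch: the usage comparison is
simplified, limit=None stands for "LIMIT flag not set", the handling of an
explicit 0 is left out, and select_chunks is a made-up name that does not
exist in the kernel:

```python
def select_chunks(chunks, usage_max=None, limit=None):
    """Return the ids of the chunks that would be balanced.

    chunks is an iterable of (chunk_id, usage_percent) in scan order.
    usage_max models an earlier filter; limit is applied after it.
    """
    selected = []
    remaining = limit
    for chunk_id, usage in chunks:
        # Earlier filters run first; a rejected chunk does not
        # consume any of the limit.
        if usage_max is not None and usage > usage_max:
            continue
        # The limit is the last filter: a plain countdown.
        if remaining is not None:
            if remaining == 0:
                continue
            remaining -= 1
        selected.append(chunk_id)
    return selected


scan = [("a", 10), ("b", 90), ("c", 5), ("d", 15)]
print(select_chunks(scan, usage_max=20, limit=2))
```

Because the counter is consumed during a pass, the patch saves the three
limit values and restores them at the "again:" label before the real pass
that follows the counting pass.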

 fs/btrfs/ctree.h   |7 ++-
 fs/btrfs/volumes.c |   18 ++
 fs/btrfs/volumes.h |1 +
 include/uapi/linux/btrfs.h |3 ++-
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba6b88528dc7..e6f899dc5e47 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -840,7 +840,10 @@ struct btrfs_disk_balance_args {
/* BTRFS_BALANCE_ARGS_* */
__le64 flags;
 
-   __le64 unused[8];
+   /* BTRFS_BALANCE_ARGS_LIMIT value */
+   __le64 limit;
+
+   __le64 unused[7];
 } __attribute__ ((__packed__));
 
 /*
@@ -2897,6 +2900,7 @@ btrfs_disk_balance_args_to_cpu(struct btrfs_balance_args *cpu,
 	cpu->vend = le64_to_cpu(disk->vend);
 	cpu->target = le64_to_cpu(disk->target);
 	cpu->flags = le64_to_cpu(disk->flags);
+	cpu->limit = le64_to_cpu(disk->limit);
 }
 
 static inline void
@@ -2914,6 +2918,7 @@ btrfs_cpu_balance_args_to_disk(struct btrfs_disk_balance_args *disk,
 	disk->vend = cpu_to_le64(cpu->vend);
 	disk->target = cpu_to_le64(cpu->target);
 	disk->flags = cpu_to_le64(cpu->flags);
+	disk->limit = cpu_to_le64(cpu->limit);
 }
 
 /* struct btrfs_super_block */
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 49d7fab73360..3b761a456acd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2922,6 +2922,16 @@ static int should_balance_chunk(struct btrfs_root *root,
 		return 0;
 	}
 
+	/*
+	 * limited by count, must be the last filter
+	 */
+	if ((bargs->flags & BTRFS_BALANCE_ARGS_LIMIT)) {
+		if (bargs->limit == 0)
+			return 0;
+		else
+			bargs->limit--;
+	}
+
 	return 1;
 }
 
@@ -2944,6 +2954,9 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 	int ret;
 	int enospc_errors = 0;
 	bool counting = true;
+	u64 limit_data = bctl->data.limit;
+	u64 limit_meta = bctl->meta.limit;
+	u64 limit_sys = bctl->sys.limit;
 
 	/* step one make some room on all the devices */
 	devices = &fs_info->fs_devices->devices;
@@ -2982,6 +2995,11 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 	memset(&bctl->stat, 0, sizeof(bctl->stat));
 	spin_unlock(&fs_info->balance_lock);
 again:
+	if (!counting) {
+		bctl->data.limit = limit_data;
+		bctl->meta.limit = limit_meta;
+		bctl->sys.limit = limit_sys;
+	}
 	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
 	key.offset = (u64)-1;
 	key.type = BTRFS_CHUNK_ITEM_KEY;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 80754f9dd3df..1a15bbeb65e2 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -255,6 +255,7 @@ struct map_lookup {
 #define BTRFS_BALANCE_ARGS_DEVID	(1ULL << 2)
 #define BTRFS_BALANCE_ARGS_DRANGE	(1ULL << 3)
 #define BTRFS_BALANCE_ARGS_VRANGE	(1ULL << 4)
+#define BTRFS_BALANCE_ARGS_LIMIT	(1ULL << 5)
 
 /*
  * Profile changing flags.  When SOFT is set we won't relocate chunk if
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b4d69092fbdb..901a3c563f60 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -211,7 +211,8 @@ struct btrfs_balance_args {
 
__u64 flags;
 
-   __u64 unused[8];
+   __u64 limit;/* limit number of processed chunks */
+   __u64 unused[7];
 } __attribute__ ((__packed__));
 
 /* report balance progress to userspace */
-- 
1.7.9



[PATCH] btrfs: retrieve more info from FS_INFO ioctl

2014-05-07 Thread David Sterba
Provide the basic information about filesystem through the ioctl:
* b-tree node size (same as leaf size)
* sector size
* expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments

Backward compatibility: if the values are 0, the kernel does not provide
this information and applications should ignore them.
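
A hedged sketch of how an application might decode the extended structure
and apply the "0 means not provided" rule. It works on a raw 1024-byte
buffer only; issuing the ioctl itself is not shown, and parse_fs_info and
the dictionary keys are names made up for this example. The field layout
follows the struct in this patch:

```python
import struct

# u64 max_id, u64 num_devices, u8 fsid[16], u32 nodesize, u32 sectorsize,
# u32 clone_alignment, u32 reserved32, then u64 reserved[122] as padding.
FS_INFO = struct.Struct("=QQ16sIIII976x")


def parse_fs_info(buf):
    """Decode a btrfs_ioctl_fs_info_args buffer as laid out in the patch.

    Fields that read as 0 are reported as None, meaning the kernel did
    not fill them in (older kernels leave the reserved area zeroed).
    """
    max_id, num_devices, fsid, nodesize, sectorsize, clone_alignment, _ = (
        FS_INFO.unpack(buf)
    )
    return {
        "max_id": max_id,
        "num_devices": num_devices,
        "fsid": fsid,
        "nodesize": nodesize or None,
        "sectorsize": sectorsize or None,
        "clone_alignment": clone_alignment or None,
    }


# Example with a synthetic buffer, as a newer kernel might fill it.
sample = FS_INFO.pack(5, 2, b"\x11" * 16, 16384, 4096, 4096, 0)
print(parse_fs_info(sample)["nodesize"])
```

The padding count keeps the structure at 1k, matching the "pad to 1k"
comment in the header.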

Signed-off-by: David Sterba dste...@suse.cz
---
 fs/btrfs/ioctl.c   |4 
 include/uapi/linux/btrfs.h |6 +-
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2ad7de94efef..74530f226e50 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2574,6 +2574,10 @@ static long btrfs_ioctl_fs_info(struct btrfs_root *root, void __user *arg)
 	}
 	mutex_unlock(&fs_devices->device_list_mutex);
 
+	fi_args->nodesize = root->fs_info->super_copy->nodesize;
+	fi_args->sectorsize = root->fs_info->super_copy->sectorsize;
+	fi_args->clone_alignment = root->fs_info->super_copy->sectorsize;
+
 	if (copy_to_user(arg, fi_args, sizeof(*fi_args)))
 		ret = -EFAULT;
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b4d69092fbdb..aad9391e0a6d 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -181,7 +181,11 @@ struct btrfs_ioctl_fs_info_args {
__u64 max_id;   /* out */
__u64 num_devices;  /* out */
__u8 fsid[BTRFS_FSID_SIZE]; /* out */
-   __u64 reserved[124];/* pad to 1k */
+   __u32 nodesize; /* out */
+   __u32 sectorsize;   /* out */
+   __u32 clone_alignment;  /* out */
+   __u32 reserved32;
+   __u64 reserved[122];/* pad to 1k */
 };
 
 struct btrfs_ioctl_feature_flags {
-- 
1.7.9



[PATCH] btrfs: export more from FS_INFO to sysfs

2014-05-07 Thread David Sterba
Similar to the FS_INFO updates, export the basic filesystem info through
sysfs: node size, sector size and clone alignment.
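
For illustration, a userspace reader for these attributes could look like
the Python sketch below. The directory that holds the files is passed in
by the caller; where exactly it lives under /sys is an assumption not
stated in this patch, and read_fs_attrs is a hypothetical helper:

```python
from pathlib import Path


def read_fs_attrs(sysfs_dir, names=("nodesize", "sectorsize", "clone_alignment")):
    """Read the numeric attributes exported by the patch from `sysfs_dir`.

    Attributes that are missing (for example on a kernel without this
    patch) are returned as None instead of raising.
    """
    attrs = {}
    for name in names:
        path = Path(sysfs_dir) / name
        try:
            # Each show() callback prints a single "%u\n" value.
            attrs[name] = int(path.read_text().strip())
        except FileNotFoundError:
            attrs[name] = None
    return attrs
```

The attributes are created with mode 0444 and a store callback that
returns -EPERM, so a reader like this never needs write access.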

Signed-off-by: David Sterba dste...@suse.cz
---
 fs/btrfs/sysfs.c |   40 
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index c5eb2143dc66..ba2a645dee07 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -396,8 +396,48 @@ static ssize_t btrfs_label_store(struct kobject *kobj,
 }
 BTRFS_ATTR_RW(label, 0644, btrfs_label_show, btrfs_label_store);
 
+static ssize_t btrfs_no_store(struct kobject *kobj,
+				struct kobj_attribute *a,
+				const char *buf, size_t len)
+{
+	return -EPERM;
+}
+
+static ssize_t btrfs_nodesize_show(struct kobject *kobj,
+				struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->nodesize);
+}
+
+BTRFS_ATTR_RW(nodesize, 0444, btrfs_nodesize_show, btrfs_no_store);
+
+static ssize_t btrfs_sectorsize_show(struct kobject *kobj,
+				struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->sectorsize);
+}
+
+BTRFS_ATTR_RW(sectorsize, 0444, btrfs_sectorsize_show, btrfs_no_store);
+
+static ssize_t btrfs_clone_alignment_show(struct kobject *kobj,
+				struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->sectorsize);
+}
+
+BTRFS_ATTR_RW(clone_alignment, 0444, btrfs_clone_alignment_show, btrfs_no_store);
+
 static struct attribute *btrfs_attrs[] = {
 	BTRFS_ATTR_PTR(label),
+	BTRFS_ATTR_PTR(nodesize),
+	BTRFS_ATTR_PTR(sectorsize),
+	BTRFS_ATTR_PTR(clone_alignment),
 	NULL,
 };
 
-- 
1.7.9



Re: Back from leave

2014-05-07 Thread David Sterba
Hi back,

On Mon, May 05, 2014 at 10:28:13AM -0400, Josef Bacik wrote:
 I had way too much email so I just deleted it all, if there was something
 you wanted my specific attention on then bounce it back at me and I'll look
 at it.  Thanks,

it would be really great if you resurrect btrfs-next. Most of the current
patches have been merged to 3.15 so for now it's IMHO ok to do a hard reset
to linus/master. Please push anything you've already queued so we can
let you know about the rest.

thanks.


Re: Back from leave

2014-05-07 Thread Chris Mason

On 05/07/2014 12:38 PM, David Sterba wrote:

Hi back,

On Mon, May 05, 2014 at 10:28:13AM -0400, Josef Bacik wrote:

I had way too much email so I just deleted it all, if there was something
you wanted my specific attention on then bounce it back at me and I'll look
at it.  Thanks,


it would be really great if you resurrect btrfs-next. Most of the current
patches have been merged to 3.15 so for now it's IMHO ok to do a hard reset
to linus/master. Please push anything you've already queued so we can
let you know about the rest.



I've got them all queued up here, but I'm having trouble getting through 
an overnight stress.sh run (hangs).  As soon as I nail down the problem 
I'll push out to my linux-next queue.


At least for the next release, trying to help Josef focus on qgroups and 
other work he's had queued up.


-chris



Re: How does Suse do live filesystem revert with btrfs?

2014-05-07 Thread Goffredo Baroncelli
On 05/07/2014 01:39 PM, Marc MERLIN wrote:
 On Wed, May 07, 2014 at 11:35:52AM +, Duncan wrote:
 Marc MERLIN posted on Wed, 07 May 2014 01:56:12 -0700 as excerpted:

 On Tue, May 06, 2014 at 04:26:48PM +, Duncan wrote:
 Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted:


 Aaah, right, you can use a script to see the file differences between
 two snapshots, and then restore that with reflink if you can truly
 get a list of all changed files.
 However, that is indeed not atomic at all, even if faster than rsync.

 Would send/receive help in such a script?

 Not really, you still end up with a new snapshot that you can't live
 switch to.

 It's really either 1) reboot 2) use cp --reflink to copy a list of
 changed files (as well as rm to delete the ones that were removed).

 What I meant was... use send/receive locally, in place of the
 cp --reflink.
 
 This won't work since it can only work on another read-only subvolume.
 
 But you could use btrfs send -p to get a list of changes between 2
 snapshots, decode that (without btrfs receive) just to spit out the
 names of the files that changed or got deleted.
 It would be wasteful since it would cause all the changed blocks to be
 read on the source, but still better than nothing.
 
 Really, we'd just need a btrfs --send --dry-run -v -p vol1 vol2 
 which would spit out a list of the file ops it would do.
 
 That'd be enough to simply grep out the deletes, do them locally and
 then use cp --reflink on everything else.
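
A rough Python sketch of that last step, assuming some tool has already
produced a list of (action, path) pairs from the snapshot diff. The list
format, the action names and revert_commands are all hypothetical, since
no such dry-run output exists in this thread. The function only builds
the command lines and does not run them:

```python
import posixpath


def revert_commands(ops, snapshot_root, live_root):
    """Build rm/cp command lines that bring live_root back to the snapshot.

    ops is a list of (action, relative_path) with action "delete" or
    "update". Deletes are emitted first, then reflink copies.
    """
    deletes, copies = [], []
    for action, rel in ops:
        live = posixpath.join(live_root, rel)
        if action == "delete":
            deletes.append(["rm", "-f", "--", live])
        elif action == "update":
            src = posixpath.join(snapshot_root, rel)
            copies.append(["cp", "-a", "--reflink=always", "--", src, live])
        else:
            raise ValueError("unknown action: %r" % (action,))
    return deletes + copies


ops = [("update", "etc/fstab"), ("delete", "tmp/junk")]
for cmd in revert_commands(ops, "/snap/good", "/"):
    print(" ".join(cmd))
```

As noted earlier in the thread, running such a list is not atomic: files
are reverted one by one while the system keeps running.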

What happens to the already opened files? I suppose that a process which has
already opened a file sees the old one, while a new open would see the new
one.
If this is acceptable, why not do mount --bind /snapshot /, or use
pivot_root(2), or an overlay filesystem?
Maybe we would also need to move the other already mounted filesystems (like
/proc, /sys)...



 
 Marc
 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] Btrfs: faster/more efficient insertion of file extent items

2014-05-07 Thread Filipe David Manana
On Wed, May 7, 2014 at 4:21 PM, Liu Bo bo.li@oracle.com wrote:
 On Sun, Feb 09, 2014 at 11:45:12PM +, Filipe David Borba Manana wrote:
 This is an extension to my previous commit titled:

   Btrfs: faster file extent item replace operations
   (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9)

 Instead of inserting the new file extent item if we deleted existing
 file extent items covering our target file range, also allow to insert
 the new file extent item if we didn't find any existing items to delete
 and replace_extent != 0, since in this case our caller would do another
 tree search to insert the new file extent item anyway, therefore just
 combine the two tree searches into a single one, saving cpu time, reducing
 lock contention and reducing btree node/leaf COW operations.

 This covers the case where applications keep doing tail append writes to
 files, which for example is the case of Apache CouchDB (its database and
 view index files are always open with O_APPEND).

 (I'm tracking a bug which is very hard to reproduce, and the stack trace
 seems to point to this area.)

 Even though I know that this has been merged, I still have to say that this
 just makes the code really hard to maintain.

 __btrfs_drop_extents() has already been one of the most complex functions
 since it was written, but now it's become more and more complex!

 I'm not sure whether the performance gain justifies that kind of complexity.
 To be honest, try asking yourself how much time you'll spend re-understanding
 the code and all the details.

The changes (this and the previous one mentioned in the change log)
essentially only add an if statement at the end of the function, which
has useful comments describing its purpose. It didn't change the logic
in the big while loop, which is/was basically the whole function, that
does the work of processing extent items and deleting them.

Therefore I disagree that it added such a huge amount of complexity.

Thanks, and sorry for the debugging frustration you are going through.


 thanks,
 -liubo


 Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
 ---
  fs/btrfs/file.c |   52 ++--
  1 file changed, 30 insertions(+), 22 deletions(-)

 diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
 index 0165b86..006af2f 100644
 --- a/fs/btrfs/file.c
 +++ b/fs/btrfs/file.c
 @@ -720,7 +720,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
   if (drop_cache)
   btrfs_drop_extent_cache(inode, start, end - 1, 0);

 - if (start >= BTRFS_I(inode)->disk_i_size)
 + if (start >= BTRFS_I(inode)->disk_i_size && !replace_extent)
   modify_tree = 0;

   while (1) {
 @@ -938,34 +938,42 @@ next_slot:
    * Set path->slots[0] to first slot, so that after the delete
    * if items are move off from our leaf to its immediate left or
    * right neighbor leafs, we end up with a correct and adjusted
 -  * path->slots[0] for our insertion.
 +  * path->slots[0] for our insertion (if replace_extent != 0).
    */
   path->slots[0] = del_slot;
   ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
   if (ret)
   btrfs_abort_transaction(trans, root, ret);
 + }

 - leaf = path->nodes[0];
 - /*
 -  * leaf eb has flag EXTENT_BUFFER_STALE if it was deleted (that
 -  * is, its contents got pushed to its neighbors), in which case
 -  * it means path->locks[0] == 0
 -  */
 - if (!ret && replace_extent && leafs_visited == 1 &&
 -     path->locks[0] &&
 -     btrfs_leaf_free_space(root, leaf) >=
 -     sizeof(struct btrfs_item) + extent_item_size) {
 -
 - key.objectid = ino;
 - key.type = BTRFS_EXTENT_DATA_KEY;
 - key.offset = start;
 - setup_items_for_insert(root, path, &key,
 -                        &extent_item_size,
 -                        extent_item_size,
 -                        sizeof(struct btrfs_item) +
 -                        extent_item_size, 1);
 - *key_inserted = 1;
 + leaf = path->nodes[0];
 + /*
 +  * If btrfs_del_items() was called, it might have deleted a leaf, in
 +  * which case it unlocked our path, so check path->locks[0] matches a
 +  * write lock.
 +  */
 + if (!ret && replace_extent && leafs_visited == 1 &&
 +     (path->locks[0] == BTRFS_WRITE_LOCK_BLOCKING ||
 +      path->locks[0] == BTRFS_WRITE_LOCK) &&
 +     btrfs_leaf_free_space(root, leaf) >=
 +     sizeof(struct btrfs_item) + extent_item_size) {
 +
 + key.objectid = ino;
 + key.type = BTRFS_EXTENT_DATA_KEY;
 + key.offset = start;
 + if 

[PATCH 1/3] btrfs-progs: print qgroup excl as unsigned

2014-05-07 Thread Mark Fasheh
It's unsigned in the structure definition.

Reviewed-by: Mark Fasheh mfas...@suse.de
---
 print-tree.c | 12 ++--
 qgroup.c |  4 ++--
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/print-tree.c b/print-tree.c
index 7263b09..adef94a 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -884,18 +884,18 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l)
 qg_info = btrfs_item_ptr(l, i,
  struct btrfs_qgroup_info_item);
 printf("\t\tgeneration %llu\n"
-"\t\treferenced %lld referenced compressed %lld\n"
-"\t\texclusive %lld exclusive compressed %lld\n",
+"\t\treferenced %llu referenced compressed %llu\n"
+"\t\texclusive %llu exclusive compressed %llu\n",
   (unsigned long long)
   btrfs_qgroup_info_generation(l, qg_info),
-  (long long)
+  (unsigned long long)
   btrfs_qgroup_info_referenced(l, qg_info),
-  (long long)
+  (unsigned long long)
   btrfs_qgroup_info_referenced_compressed(l,
   qg_info),
-  (long long)
+  (unsigned long long)
   btrfs_qgroup_info_exclusive(l, qg_info),
-  (long long)
+  (unsigned long long)
   btrfs_qgroup_info_exclusive_compressed(l,
  qg_info));
break;
diff --git a/qgroup.c b/qgroup.c
index 94d1feb..368b262 100644
--- a/qgroup.c
+++ b/qgroup.c
@@ -203,11 +203,11 @@ static void print_qgroup_column(struct btrfs_qgroup *qgroup,
 print_qgroup_column_add_blank(BTRFS_QGROUP_QGROUPID, len);
 break;
 case BTRFS_QGROUP_RFER:
-   len = printf("%lld", qgroup->rfer);
+   len = printf("%llu", qgroup->rfer);
 print_qgroup_column_add_blank(BTRFS_QGROUP_RFER, len);
 break;
 case BTRFS_QGROUP_EXCL:
-   len = printf("%lld", qgroup->excl);
+   len = printf("%llu", qgroup->excl);
print_qgroup_column_add_blank(BTRFS_QGROUP_EXCL, len);
break;
case BTRFS_QGROUP_PARENT:
-- 
1.8.4



[PATCH 0/3] btrfs-progs: add quota group verify to btrfsck

2014-05-07 Thread Mark Fasheh
Hi,
The following 3 patches add support to btrfsck to check the counts
in subvolume quota groups. With these patches a user can run btrfsck against
a volume and if quota is enabled, qgroup data will be checked against the
actual space used on disk. I also added a --qgroup-report option that will
run the qgroup checker (only) and print out a full report of all qgroups.

The patches can be pulled from the following branch:
git://github.com/markfasheh/btrfs-progs-patches.git qgroup-verify

The patches can also be viewed:
https://github.com/markfasheh/btrfs-progs-patches/tree/qgroup-verify

The first two patches set up for qgroups:

- The change in patch #1 is optional. It corrects the print of qgroup bytes
to be %llu as they are unsigned values.  This means however that corrupted
groups will no longer show a negative value but instead an unrealistically
large one.  It's my opinion that '-1' and '18446744073709551615' both look
pretty obviously broken when put in 'qgroup show' output so I'm going for
correctness. Here's the difference in output:

qgroupid rfer       excl
-------- ----       ----
0/5      16384      16384
0/257    4109430784 -1429504

qgroupid rfer       excl
-------- ----       ----
0/5      16384      16384
0/257    4109430784 18446744073708122112


- Patch 2 imports the ulist code from kernel. Any qgroup code that deals
with resolving refs to roots needs this so that it can insert into a 'list'
that guarantees unique items.

- Patch 3 adds the actual code to do the work of adding up referenced and
exclusive bytecounts.

This involves walking the extent tree and recording refs. We then resolve
implied refs by walking down from each interior node. Finally, shared ref
roots are found and each extent is accounted to any roots that reference it.
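
As an illustration of the "list that guarantees unique items" idea behind
patch 2, here is a minimal Python sketch. It is not the ulist code from the
patch: the class name UList, the method names and the reachable() helper are
all hypothetical, and a dict stands in for the rb-tree lookup. It only shows
why a unique, insertion-ordered set that can grow while being walked is handy
for resolving refs without visiting a node twice.

```python
class UList:
    """Toy model of a ulist: unique integer keys kept in insertion order."""

    def __init__(self):
        self._order = []  # insertion order, plays the role of the linked list
        self._aux = {}    # key -> auxiliary value, stands in for the rb-tree

    def add(self, val, aux=0):
        """Add val; return True if it was new, False if already present."""
        if val in self._aux:
            return False
        self._aux[val] = aux
        self._order.append(val)
        return True

    def __iter__(self):
        # Walk by index so that items appended during the walk are visited too.
        i = 0
        while i < len(self._order):
            yield self._order[i]
            i += 1


def reachable(graph, root):
    """Enumerate every node reachable from root, visiting each node once."""
    seen = UList()
    seen.add(root)
    for node in seen:
        for child in graph.get(node, ()):
            seen.add(child)  # duplicates are silently ignored
    return list(seen)


if __name__ == "__main__":
    # Hypothetical reference graph with a cycle (3 -> 1) and a shared child (3).
    graph = {1: [2, 3], 2: [3, 4], 3: [1], 4: []}
    print(reachable(graph, 1))
```

The loop has the same shape as the pseudo-code in the ulist.c header comment:
add the root, iterate, and keep adding children while iterating.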

Here's what it looks like now if you run btrfsck against a filesystem with
a couple corrupted qgroups:

Checking filesystem on /dev/vdb2
UUID: 8203ca66-9858-4e3f-b447-5bbaacf79c02
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
checking quota groups
Counts for qgroup id: 257 are different
our:    referenced 4124762112 referenced compressed 4124762112
disk:   referenced 4109430784 referenced compressed 4109430784
diff:   referenced 15331328 referenced compressed 15331328
our:    exclusive 901120 exclusive compressed 901120
disk:   exclusive 18446744073708122112 exclusive compressed 18446744073708122112
diff:   exclusive 2330624 exclusive compressed 2330624
Counts for qgroup id: 280 are different
our:    referenced 3750768640 referenced compressed 3750768640
disk:   referenced 3750768640 referenced compressed 3750768640
our:    exclusive 14749696 exclusive compressed 14749696
disk:   exclusive 11882496 exclusive compressed 11882496
diff:   exclusive 2867200 exclusive compressed 2867200
found 1009512957 bytes used err is 0
total csum bytes: 3955388
total tree bytes: 346292224
total fs tree bytes: 331939840
total extent tree bytes: 9338880
btree space waste bytes: 48141929
file data blocks allocated: 6477553664
 referenced 6062055424
Btrfs v3.14.1-3-gc8c1814
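
The numbers in this report can be cross-checked with a few lines of Python.
This is only a sketch of the arithmetic, not the checker's code: the helper
as_s64() is hypothetical, and reading the on-disk value as a signed 64-bit
integer is an assumption that happens to reproduce the "diff" lines shown
above.

```python
U64 = 1 << 64


def as_s64(value):
    """Reinterpret an unsigned 64-bit value as signed (two's complement)."""
    value %= U64
    return value - U64 if value >= (1 << 63) else value


# Values copied from the qgroup 257 lines of the report.
disk_excl = 18446744073708122112
our_excl = 901120

# The huge unsigned value is the same bit pattern that %lld printed as negative.
print(as_s64(disk_excl))

# Difference between the computed and the (signed) on-disk exclusive count.
print(abs(our_excl - as_s64(disk_excl)))

# Referenced counts for qgroup 257 and exclusive counts for qgroup 280.
print(4124762112 - 4109430784)
print(14749696 - 11882496)
```

Read this way, the corrupted exclusive count of qgroup 257 is the same
-1429504 that 'qgroup show' printed before patch 1.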

There's a minor issue in that we'll also print out qgroups for deleted
subvolumes as they still persist on disk (not shown here). I'm pretty sure
we can fix that with a followup patch to just check them against existing
subvolumes when we initially read our qgroup info from disk.

Thanks,
--Mark


[PATCH 2/3] btrfs-progs: import ulist

2014-05-07 Thread Mark Fasheh
qgroup-verify.c wants this for walking root refs.

Signed-off-by: Mark Fasheh mfas...@suse.de
---
 Makefile |   3 +-
 kerncompat.h |   2 +-
 ulist.c  | 253 +++
 ulist.h  |  66 
 4 files changed, 322 insertions(+), 2 deletions(-)
 create mode 100644 ulist.c
 create mode 100644 ulist.h

diff --git a/Makefile b/Makefile
index da05197..202013e 100644
--- a/Makefile
+++ b/Makefile
@@ -9,7 +9,8 @@ CFLAGS = -g -O1 -fno-strict-aliasing
 objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
- qgroup.o raid6.o free-space-cache.o list_sort.o props.o
+ qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
+ ulist.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/kerncompat.h b/kerncompat.h
index f370cd8..652275e 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -235,7 +235,7 @@ static inline long IS_ERR(const void *ptr)
 
 #define BUG_ON(c) assert(!(c))
 #define WARN_ON(c) assert(!(c))
-
+#define ASSERT(c) assert(c)
 
 #define container_of(ptr, type, member) ({  \
 const typeof( ((type *)0)->member ) *__mptr = (ptr);\
diff --git a/ulist.c b/ulist.c
new file mode 100644
index 000..60fdc09
--- /dev/null
+++ b/ulist.c
@@ -0,0 +1,253 @@
+/*
+ * Copyright (C) 2011 STRATO AG
+ * written by Arne Jansen sensi...@gmx.net
+ * Distributed under the GNU GPL license version 2.
+ */
+
+//#include <linux/slab.h>
+#include <stdlib.h>
+#include "kerncompat.h"
+#include "ulist.h"
+#include "ctree.h"
+
+/*
+ * ulist is a generic data structure to hold a collection of unique u64
+ * values. The only operations it supports is adding to the list and
+ * enumerating it.
+ * It is possible to store an auxiliary value along with the key.
+ *
+ * A sample usage for ulists is the enumeration of directed graphs without
+ * visiting a node twice. The pseudo-code could look like this:
+ *
+ * ulist = ulist_alloc();
+ * ulist_add(ulist, root);
+ * ULIST_ITER_INIT(&uiter);
+ *
+ * while ((elem = ulist_next(ulist, &uiter)) {
+ * for (all child nodes n in elem)
+ * ulist_add(ulist, n);
+ * do something useful with the node;
+ * }
+ * ulist_free(ulist);
+ *
+ * This assumes the graph nodes are adressable by u64. This stems from the
+ * usage for tree enumeration in btrfs, where the logical addresses are
+ * 64 bit.
+ *
+ * It is also useful for tree enumeration which could be done elegantly
+ * recursively, but is not possible due to kernel stack limitations. The
+ * loop would be similar to the above.
+ */
+
+/**
+ * ulist_init - freshly initialize a ulist
+ * @ulist: the ulist to initialize
+ *
+ * Note: don't use this function to init an already used ulist, use
+ * ulist_reinit instead.
+ */
+void ulist_init(struct ulist *ulist)
+{
+   INIT_LIST_HEAD(&ulist->nodes);
+   ulist->root = RB_ROOT;
+   ulist->nnodes = 0;
+}
+
+/**
+ * ulist_fini - free up additionally allocated memory for the ulist
+ * @ulist: the ulist from which to free the additional memory
+ *
+ * This is useful in cases where the base 'struct ulist' has been statically
+ * allocated.
+ */
+static void ulist_fini(struct ulist *ulist)
+{
+   struct ulist_node *node;
+   struct ulist_node *next;
+
+   list_for_each_entry_safe(node, next, &ulist->nodes, list) {
+   kfree(node);
+   }
+   ulist->root = RB_ROOT;
+   INIT_LIST_HEAD(&ulist->nodes);
+}
+
+/**
+ * ulist_reinit - prepare a ulist for reuse
+ * @ulist: ulist to be reused
+ *
+ * Free up all additional memory allocated for the list elements and reinit
+ * the ulist.
+ */
+void ulist_reinit(struct ulist *ulist)
+{
+   ulist_fini(ulist);
+   ulist_init(ulist);
+}
+
+/**
+ * ulist_alloc - dynamically allocate a ulist
+ * @gfp_mask:  allocation flags to for base allocation
+ *
+ * The allocated ulist will be returned in an initialized state.
+ */
+struct ulist *ulist_alloc(gfp_t gfp_mask)
+{
+   struct ulist *ulist = kmalloc(sizeof(*ulist), gfp_mask);
+
+   if (!ulist)
+   return NULL;
+
+   ulist_init(ulist);
+
+   return ulist;
+}
+
+/**
+ * ulist_free - free dynamically allocated ulist
+ * @ulist: ulist to free
+ *
+ * It is not necessary to call ulist_fini before.
+ */
+void ulist_free(struct ulist *ulist)
+{
+   if (!ulist)
+   return;
+   ulist_fini(ulist);
+   kfree(ulist);
+}
+
+static struct ulist_node *ulist_rbtree_search(struct ulist *ulist, u64 val)
+{
+   struct rb_node *n = ulist->root.rb_node;
+   struct ulist_node *u = NULL;
+
+   while (n) {
+   u = rb_entry(n, struct ulist_node, 

[PATCH 3/3] btrfs-progs: add quota group verify code

2014-05-07 Thread Mark Fasheh
This patch adds functionality (in qgroup-verify.c) to compute bytecounts in
subvolume quota groups. The original groups are read in and stored in memory
so that after we compute our own bytecounts, we can compare them with those
on disk. A print function is provided to do this comparison and show the
results on the console.

A 'qgroup check' pass is added to btrfsck. If any subvolume quota groups
differ from what we compute, the differences for them are printed.  We also
provide an option '--qgroup-report' which will run only the quota check code
and print a report on all quota groups.  Other than making it possible to
verify that our qgroup changes work correctly, this mode can also be used in
xfstests for automated checking after qgroup tests.

This patch does not address the following:
- compressed counts are identical to non compressed, because kernel doesn't
  make the distinction yet.  Adding the code to verify compressed counts
  shouldn't be hard at all though once kernel can do this.
- It is only concerned with subvolume quota groups (like most of
  btrfs-progs).

Signed-off-by: Mark Fasheh mfas...@suse.de
---
 Makefile|2 +-
 cmds-check.c|   24 ++
 ctree.h |   10 +
 disk-io.c   |   16 +-
 print-tree.c|2 +-
 print-tree.h|1 +
 qgroup-verify.c | 1085 +++
 qgroup-verify.h |   25 ++
 8 files changed, 1161 insertions(+), 4 deletions(-)
 create mode 100644 qgroup-verify.c
 create mode 100644 qgroup-verify.h

diff --git a/Makefile b/Makefile
index 202013e..51e5264 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
  qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
- ulist.o
+ ulist.o qgroup-verify.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/cmds-check.c b/cmds-check.c
index d195e7a..5401ad9 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -38,6 +38,7 @@
 #include "commands.h"
 #include "free-space-cache.h"
 #include "btrfsck.h"
+#include "qgroup-verify.h"
 
 static u64 bytes_used = 0;
 static u64 total_csum_bytes = 0;
@@ -6427,6 +6428,7 @@ static struct option long_options[] = {
    { "init-csum-tree", 0, NULL, 0 },
    { "init-extent-tree", 0, NULL, 0 },
    { "backup", 0, NULL, 0 },
+   { "qgroup-report", 0, NULL, 'Q' },
    { NULL, 0, NULL, 0}
 };
 
@@ -6439,6 +6441,7 @@ const char * const cmd_check_usage[] = {
    "--repair                    try to repair the filesystem",
    "--init-csum-tree            create a new CRC tree",
    "--init-extent-tree          create a new extent tree",
+   "--qgroup-report             print a report on qgroup consistency",
    NULL
 };
 
@@ -6453,6 +6456,7 @@ int cmd_check(int argc, char **argv)
u64 num;
int option_index = 0;
int init_csum_tree = 0;
+   int qgroup_report = 0;
enum btrfs_open_ctree_flags ctree_flags =
OPEN_CTREE_PARTIAL | OPEN_CTREE_EXCLUSIVE;
 
@@ -6479,6 +6483,9 @@ int cmd_check(int argc, char **argv)
    printf("using SB copy %llu, bytenr %llu\n", num,
   (unsigned long long)bytenr);
break;
+   case 'Q':
+   qgroup_report = 1;
+   break;
case '?':
case 'h':
usage(cmd_check_usage);
@@ -6526,6 +6533,14 @@ int cmd_check(int argc, char **argv)
 
    root = info->fs_root;
    uuid_unparse(info->super_copy->fsid, uuidbuf);
+   if (qgroup_report) {
+       printf("Print quota groups for %s\nUUID: %s\n", argv[optind],
+              uuidbuf);
+       ret = qgroup_verify_all(info);
+       if (ret == 0)
+           print_qgroup_report(1);
+       goto close_out;
+   }
    printf("Checking filesystem on %s\nUUID: %s\n", argv[optind], uuidbuf);
 
    if (!extent_buffer_uptodate(info->tree_root->node) ||
@@ -6629,11 +6644,20 @@ int cmd_check(int argc, char **argv)
free(bad);
}
 
+   if (info->quota_enabled) {
+       int err;
+       fprintf(stderr, "checking quota groups\n");
+       err = qgroup_verify_all(info);
+       if (err)
+           goto out;
+   }
+
    if (!list_empty(&root->fs_info->recow_ebs)) {
        fprintf(stderr, "Transid errors in file system\n");
ret = 1;
}
 out:
+   print_qgroup_report(0);
if (found_old_backref) { 

Re: [RFC PATCH 0/3] Btrfs: add xxhash algorithm

2014-05-07 Thread Darrick J. Wong
On Wed, May 07, 2014 at 01:08:06PM +0200, Tomasz Torcz wrote:
 On Wed, May 07, 2014 at 06:56:29PM +0800, Liu Bo wrote:
  xxHash is an extremely fast non-cryptographic hash algorithm, working at
  speeds close to RAM limits.[1]  And xxhash is a 32-bit hash, same as crc32.
  
  Here is the hash comparison extracted from the link[1]:
  (single thread, Windows Seven 32 bits, using Open Source's SMHasher on a
  Core 2 Duo @3GHz)
  
  Name     Speed      Q.Score   Author
  xxHash   5.4 GB/s   10
  CRC32    0.43 GB/s  9
  
 
   Core 2 Duo is an awfully old CPU. Since 2008, Intel CPUs have had a crc32
 instruction, hugely speeding up CRC operations.

Just for kicks I (sloppily) benchmarked a few of the kernel's hash
implementations on a Core i5-3320M CPU @3.3GHz: 

xxhash: 6.0GB/s
crc32c-intel: 11.5GB/s
crc32c (no hw accel): 1.8GB/s

--D
  
 
 -- 
 Tomasz Torcz God, root, what's the difference?
 xmpp: zdzich...@chrome.pl God is more forgiving.
 


[PATCH] xfstests: fix flink test

2014-05-07 Thread Josef Bacik
I don't have flink support in my xfsprogs, but it doesn't fail with command not
found or whatever, it fails because I don't have the -T option.  So fix
_require_xfs_io_command to check for an invalid option and not run.  This way I
get notrun instead of a failure.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 common/rc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/common/rc b/common/rc
index 5c13db5..4fa7e63 100644
--- a/common/rc
+++ b/common/rc
@@ -1258,6 +1258,8 @@ _require_xfs_io_command()
        _notrun "xfs_io $command support is missing"
    echo $testio | grep -q "Operation not supported" && \
        _notrun "xfs_io $command failed (old kernel/wrong fs?)"
+   echo $testio | grep -q "invalid option" && \
+       _notrun "xfs_io $command support is missing"
 }
 
 # Check that a fs has enough free space (in 1024b blocks)
-- 
1.8.3.1



Re: [RFC PATCH 0/3] Btrfs: add xxhash algorithm

2014-05-07 Thread Gregory Maxwell
On Wed, May 7, 2014 at 1:50 PM, Darrick J. Wong darrick.w...@oracle.com wrote:
 Just for kicks I (sloppily) benchmarked a few of the kernel's hash
 implementations on a Core i5-3320M CPU @3.3GHz:
 xxhash: 6.0GB/s
 crc32c-intel: 11.5GB/s
 crc32c (no hw accel): 1.8GB/s

CRC also usually has the very mild data recovery advantage that if
your error is just a bitflip you can correct it using the crc in a
computationally efficient manner, potentially enabling fancy recovery
tools... so if it were merely equal in speed you'd still probably
prefer to use a CRC.
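
A minimal sketch of that bitflip-correction idea, under stated assumptions: it
uses Python's zlib.crc32 (CRC-32, not the crc32c that btrfs uses), a small
fixed block size, and hypothetical helper names. Because a CRC is linear up to
a constant, the XOR of the stored and recomputed checksums (the "syndrome")
depends only on the error pattern, so one table lookup locates a single
flipped bit.

```python
import zlib


def single_bit_syndromes(length):
    """Map syndrome -> bit index for every 1-bit error in a `length`-byte block."""
    zero_crc = zlib.crc32(bytes(length))
    table = {}
    for bit in range(length * 8):
        err = bytearray(length)
        err[bit // 8] = 1 << (bit % 8)
        # crc(data ^ err) ^ crc(data) == crc(err) ^ crc(zeros) for equal lengths
        table[zlib.crc32(bytes(err)) ^ zero_crc] = bit
    return table


def repair_single_bitflip(data, stored_crc, table):
    """Return the corrected block, or None if it is not a 0- or 1-bit error."""
    syndrome = zlib.crc32(data) ^ stored_crc
    if syndrome == 0:
        return bytes(data)  # checksum already matches
    bit = table.get(syndrome)
    if bit is None:
        return None  # more than one bit is wrong; this sketch gives up
    fixed = bytearray(data)
    fixed[bit // 8] ^= 1 << (bit % 8)
    return bytes(fixed)


if __name__ == "__main__":
    block = bytes(range(64))
    crc = zlib.crc32(block)
    table = single_bit_syndromes(len(block))

    damaged = bytearray(block)
    damaged[17] ^= 0x02  # flip one bit
    print(repair_single_bitflip(bytes(damaged), crc, table) == block)
```

Building the table this way is quadratic in the block size, which is fine for
a demonstration; a real recovery tool would presumably compute syndromes more
cleverly.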


Re: [PATCH v2 1/3] xfstests/btrfs: add qgroup rescan stress test

2014-05-07 Thread Josef Bacik

On 03/09/2014 11:44 PM, Wang Shilong wrote:

The test flow is to run fsstress after triggering a quota rescan.
The rule is simple: we just remove all files and directories,
sync the filesystem and see if the qgroup's ref and excl are nodesize.

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
v1-v2:
switch into new helper _run_btrfs_util_prog()
---
  tests/btrfs/041 | 76 +
  tests/btrfs/041.out |  3 +++
  tests/btrfs/group   |  1 +
  3 files changed, 80 insertions(+)
  create mode 100644 tests/btrfs/041
  create mode 100644 tests/btrfs/041.out

diff --git a/tests/btrfs/041 b/tests/btrfs/041
new file mode 100644
index 000..92bd080
--- /dev/null
+++ b/tests/btrfs/041
@@ -0,0 +1,76 @@
+#! /bin/bash
+# FSQA Test No. btrfs/041
+#
+# Quota rescan stress test, we run fsstress and quota rescan concurrently
+#
+#---
+# Copyright (C) 2014 Fujitsu.  All rights reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+rm -f $seqres.full
+
+run_check _scratch_mkfs -b 1g --nodesize 4096
+run_check _scratch_mount
+


Add -o nospace_cache here please, otherwise I don't get the same output.


+# -w ensures that the only ops are ones which cause write I/O
+run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
+   $FSSTRESS_AVOID /dev/null
+
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
+   $SCRATCH_MNT/snap1 >> $seqres.full 2>&1


_run_btrfs_util_prog will already redirect to $seqres.full, you don't 
need this part.



+
+run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 1000 \
+   $FSSTRESS_AVOID /dev/null
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+
+#ignore removing subvolume errors
+rm -rf $SCRATCH_MNT/*  /dev/null
+
+_run_btrfs_util_prog filesystem sync $SCRATCH_MNT >> $seqres.full 2>&1


Same here.


+_run_btrfs_util_prog qgroup show $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
+   | $AWK_PROG '{print $1 $2 $3 }'
+


You can't use _run_btrfs_util_prog here, it will eat the output.  You 
need to use $BTRFS_UTIL_PROG instead.  Fix these up and resend, this is 
a really important test and I needed it to make sure my qgroups patch 
was right (which it is now.)  Thanks,


Josef


[PATCH] Btrfs: add sanity tests for new qgroup accounting code

2014-05-07 Thread Josef Bacik
This exercises the various parts of the new qgroup accounting code.  We do some
basic stuff and do some things with the shared refs to make sure all that code
works.  I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself;
hopefully this will be useful in the future.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 fs/btrfs/Makefile |   2 +-
 fs/btrfs/backref.c|   4 +
 fs/btrfs/ctree.c  |   4 +
 fs/btrfs/ctree.h  |   3 +
 fs/btrfs/disk-io.c|  18 +-
 fs/btrfs/disk-io.h|   1 +
 fs/btrfs/extent-tree.c|  17 ++
 fs/btrfs/extent_io.c  |  47 +
 fs/btrfs/extent_io.h  |   2 +
 fs/btrfs/qgroup.c |  23 +++
 fs/btrfs/super.c  |   3 +
 fs/btrfs/tests/btrfs-tests.c  |  96 +
 fs/btrfs/tests/btrfs-tests.h  |   9 +
 fs/btrfs/tests/inode-tests.c  |  35 +---
 fs/btrfs/tests/qgroup-tests.c | 468 ++
 fs/btrfs/transaction.h|   1 +
 16 files changed, 696 insertions(+), 37 deletions(-)
 create mode 100644 fs/btrfs/tests/qgroup-tests.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index ae837d2..b566ef3 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -17,4 +17,4 @@ btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
tests/extent-buffer-tests.o tests/btrfs-tests.o \
-   tests/extent-io-tests.o tests/inode-tests.o
+   tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 10db21f..f09aa18 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -900,7 +900,11 @@ again:
goto out;
BUG_ON(ret == 0);
 
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+   if (trans && likely(trans->type != __TRANS_DUMMY)) {
+#else
if (trans) {
+#endif
/*
 * look if there are updates for this ref queued and lock the
 * head
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 208a84d..aa849e0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1503,6 +1503,10 @@ static inline int should_cow_block(struct 
btrfs_trans_handle *trans,
   struct btrfs_root *root,
   struct extent_buffer *buf)
 {
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+   if (unlikely(root->dummy_root))
+   return 0;
+#endif
/* ensure we can see the force_cow */
smp_rmb();
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 33a1b27..96dae25 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1781,6 +1781,7 @@ struct btrfs_root {
int in_radix;
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
int dummy_root;
+   u64 alloc_bytenr;
 #endif
u64 defrag_trans_start;
struct btrfs_key defrag_progress;
@@ -4096,6 +4097,8 @@ static inline int btrfs_defrag_cancelled(struct 
btrfs_fs_info *fs_info)
 /* Sanity test specific functions */
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 void btrfs_test_destroy_inode(struct inode *inode);
+int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
+  u64 rfer, u64 excl);
 #endif
 
 #endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d965f51..009baaa 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1114,6 +1114,11 @@ struct extent_buffer *btrfs_find_tree_block(struct 
btrfs_root *root,
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
 u64 bytenr, u32 blocksize)
 {
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+   if (unlikely(root->dummy_root))
+       return alloc_test_extent_buffer(root->fs_info, bytenr,
+                                       blocksize);
+#endif
    return alloc_extent_buffer(root->fs_info, bytenr, blocksize);
 }
 
@@ -1296,6 +1301,7 @@ struct btrfs_root *btrfs_alloc_dummy_root(void)
return ERR_PTR(-ENOMEM);
__setup_root(4096, 4096, 4096, 4096, root, NULL, 1);
    root->dummy_root = 1;
+   root->alloc_bytenr = 0;
 
return root;
 }
@@ -2095,7 +2101,7 @@ static void free_root_pointers(struct btrfs_fs_info 
*info, int chunk_root)
    free_root_extent_buffers(info->chunk_root);
 }
 
-static void del_fs_roots(struct btrfs_fs_info *fs_info)
+void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info)
 {
int ret;
struct btrfs_root *gang[8];
@@ -2984,7 +2990,7 @@ fail_qgroup:
 fail_trans_kthread:
    kthread_stop(fs_info->transaction_kthread);
    btrfs_cleanup_transaction(fs_info->tree_root);
-   del_fs_roots(fs_info);
+   btrfs_free_fs_roots(fs_info);
 fail_cleaner:
    kthread_stop(fs_info->cleaner_kthread);
 
@@ -3519,8 +3525,10 @@ void 

Re: raid0 vs single, and should we allow -mdup by default on SSDs?

2014-05-07 Thread Mitch Harder
On Wed, May 7, 2014 at 3:52 AM, Marc MERLIN m...@merlins.org wrote:
 On Wed, May 07, 2014 at 09:29:41AM +0100, Hugo Mills wrote:
 On Wed, May 07, 2014 at 01:18:40AM -0700, Marc MERLIN wrote:
  On Tue, May 06, 2014 at 07:39:12PM +, Duncan wrote:
   That appears to be a very good use of either -d raid0 or -d single, yes.
   And since you're apparently not streaming such high resolution video that
   you NEED the raid0, single does indeed give you a somewhat better chance
   at recovery.
 
  zoneminder saves 'video' as a stream of independent small jpegs, so I'm
  good. Actually come to think of it they're so small that they probably
  all ended up in the raid1 metadata. That also means that I'm not getting
  twice the storage space like I planned to. Oh well...

There's a mount option to change the threshold at which files are
 inlined in metadata: max_inline=bytes. You could play with that for
 this particular use-case.

 Oh cool, thank you.


Since each non-inlined file will occupy a minimum of 4k, you may find
that inlining will still save space even if it is duplicated.

Even if they are duplicated in the metadata under RAID1, inlining a
bunch of 256 byte files will still be more space efficient than
storing them as regular files.

But if most of the files are in the 2k-3k range, it may be more
efficient to store them as regular files.
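
The trade-off above can be put into numbers with a rough sketch. The 4k
minimum and the two metadata copies come from this thread; everything else is
an assumption made for illustration: the function names are hypothetical, data
is taken to be stored once (single or raid0), and per-item metadata overhead
is ignored.

```python
BLOCK_SIZE = 4096  # minimum on-disk footprint of a non-inlined file


def inlined_cost(size, metadata_copies=2):
    """Bytes used when the file body lives in metadata (duplicated by RAID1)."""
    return size * metadata_copies


def extent_cost(size, block_size=BLOCK_SIZE):
    """Bytes used when the file is stored as a regular extent (one data copy)."""
    blocks = max(1, -(-size // block_size))  # round up, at least one block
    return blocks * block_size


def inlining_saves_space(size, metadata_copies=2):
    """True if storing the file inline takes strictly less space."""
    return inlined_cost(size, metadata_copies) < extent_cost(size)


if __name__ == "__main__":
    for size in (256, 2048, 3072):
        print(size, inlined_cost(size), extent_cost(size),
              inlining_saves_space(size))
```

With these assumptions the break-even point is half a block: below 2k the
duplicated inline copy is still smaller than one 4k block, above it the
regular file wins.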


URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?

2014-05-07 Thread Marc MERLIN
In a moment of irony, my laptop's boot SSD's btrfs filesystem crashed
last night with my btrfs talk slides still open on it. It went read-only
overnight, but the system did not crash.

Please tell me ASAP if you need anything off the filesystem before I recover it
since I'm travelling, and need to bring my laptop back up to a working state
ASAP (I'll save the irony of showing up at my talk with "Err, I can't
give my btrfs talk, btrfs crashed on my laptop").

I'm not interested in partial recovery, I have hourly backups on my
secondary drive on my laptop (thankfully) and was able to boot from that
drive (double thankfully). Good thing I plan ahead :)

If there is something you'd like me to try to recover the filesystem
or to get more data off it to diagnose the bug, please let me know ASAP.

Otherwise, I'll just wipe it and recover from my disk backup, but
obviously this is bad.


Details:
My system didn't crash, but the filesystem went read only, and of course
couldn't syslog the error.
Thankfully I was saved by remote syslog which did work:

kernel: [545039.443412] ------------[ cut here ]------------
kernel: [545039.443429] WARNING: CPU: 2 PID: 556 at fs/btrfs/inode.c:4927 
btrfs_invalidate_inode

kernel: [545039.443432] Modules linked in: e1000e iwlmvm mac80211 iwlwifi 
cfg80211 xhci_hcd usb_storage rndis_host cdc_ether btusb uvcvideo usbnet 
ehci_pci ehci_hcd usbcore usb_common tun sg nls_utf8 nls_cp437 vfat fat 
rpcsec_gss_krb5 nfsv4 ctr ccm ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack 
xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables 
ebtable_nat ebtables x_tables ppdev cpufreq_powersave cpufreq_userspace 
cpufreq_conservative cpufreq_stats rfcomm bnep autofs4 binfmt_misc uinput nfsd 
auth_rpcgss nfs_acl nfs lockd fscache sunrpc configs parport_pc lp parport 
input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs 
videobuf2_vmalloc videobuf2_memops videobuf2_core videodev bluetooth 
6lowpan_iphc media joydev arc4 snd_hda_codec_hdmi snd_hda_codec_realtek 
snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss 
snd_mixer_oss thinkpad_acpi x86_pkg_temp_thermal s
kernel: nd_pcm intel_powerclamp nvram coretemp snd_seq_midi snd_seq_midi_event 
kvm_intel snd_rawmidi kvm crct10dif_pclmul snd_seq crc32_pclmul rtsx_pci_ms 
iTCO_wdt iTCO_vendor_support ghash_clmulni_intel snd_seq_device memstick 
rtsx_pci_sdmmc snd_timer lpc_ich pcspkr microcode psmouse i2c_i801 serio_raw 
snd rtsx_pci soundcore tpm_tis rfkill tpm ac battery intel_smartconnect wmi 
evdev processor sata_sil24 r8169 mii fuse fan raid456 multipath mmc_block 
mmc_core dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_crypt dm_mod 
async_raid6_recov async_pq async_xor async_memcpy async_tx blowfish_x86_64 
blowfish_common ecb xts crc32c_intel aesni_intel aes_x86_64 glue_helper lrw 
gf128mul ablk_helper cryptd ptp pps_core thermal [last unloaded: e1000e]
kernel: [545039.443693] CPU: 2 PID: 556 Comm: btrfs-transacti Tainted: G        W    3.14.0-amd64-i915-preempt-20140216 #2
kernel: [545039.443697] Hardware name: LENOVO 20BECT0/20BECT0, BIOS GMET28WW (1.08 ) 09/18/2013
kernel: [545039.443701]   8800cd9f3d80 8160a06d 

kernel: [545039.443718]  8800cd9f3db8 81050025 81234676 
88040665c000
kernel: [545039.443727]  8800cd9f3e30 880406f708b8 880402181000 
8800cd9f3dc8
kernel: [545039.443735] Call Trace:
kernel: [545039.443746]  [8160a06d] dump_stack+0x4e/0x7a
kernel: [545039.443754]  [81050025] warn_slowpath_common+0x7f/0x98
kernel: [545039.443761]  [81234676] ? btrfs_invalidate_inodes+0x2f/0x12e
kernel: [545039.443768]  [810500ec] warn_slowpath_null+0x1a/0x1c
kernel: [545039.443775]  [81234676] btrfs_invalidate_inodes+0x2f/0x12e
kernel: [545039.443784]  [81227ac3] btrfs_cleanup_transaction+0x3b2/0x43f
kernel: [545039.443792]  [81227c92] transaction_kthread+0x142/0x1ab
kernel: [545039.443799]  [81227b50] ? btrfs_cleanup_transaction+0x43f/0x43f
kernel: [545039.443807]  [8106bc62] kthread+0xae/0xb6
kernel: [545039.443815]  [8106bbb4] ? __kthread_parkme+0x61/0x61
kernel: [545039.443822]  [8161637c] ret_from_fork+0x7c/0xb0
kernel: [545039.443829]  [8106bbb4] ? __kthread_parkme+0x61/0x61
kernel: [545039.443834] ---[ end trace 3c290eaa69000df6 ]---

Now, if I try to mount it, I get:
[   17.234587] BTRFS: device label btrfs_pool1 devid 1 transid 415424 /dev/mapper/cryptroot
[   17.236873] BTRFS info (device dm-0): disk space caching is enabled
[   17.243687] BTRFS: bad tree block start 10983188636980216968 828930883584
[   17.245986] BTRFS: bad tree block start 12509109177217855588 828930883584
[   17.248174] BTRFS: failed to read tree root on dm-0
[   17.325141] BTRFS: open_ctree failed

mount -o ro,recovery gives:
[  412.572216] BTRFS: device label 

Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?

2014-05-07 Thread Chris Mason

On 05/07/2014 07:39 PM, Marc MERLIN wrote:

In a moment of irony, my laptop's boot SSD's btrfs filesystem crashed
last night with my btrfs talk slides still open on it. The filesystem went
read-only overnight, but the system itself did not crash.

Please tell me ASAP if you need anything off the filesystem before I recover it
since I'm travelling, and need to bring my laptop back up to a working state
ASAP (I'll save the irony of showing up at my talk with "Err, I can't
give my btrfs talk, btrfs crashed on my laptop").

I'm not interested in partial recovery, I have hourly backups on my
secondary drive on my laptop (thankfully) and was able to boot from that
drive (double thankfully). Good thing I plan ahead :)

If there is something you'd like me to try to recover the filesystem
or to get more data off it to diagnose the bug, please let me know ASAP.

Otherwise, I'll just wipe it and recover from my disk backup, but
obviously this is bad.


Hi Marc,

Looks like you're on 3.14, did this have the fixes from my git tree that 
went into 3.15-rc?


For now I'd say that if you can make a dd image of the FS, please do so. 
 Otherwise, I don't want to suck down your time right before the trip.


-chris


Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?

2014-05-07 Thread Marc MERLIN
On Wed, May 07, 2014 at 08:38:38PM -0400, Chris Mason wrote:
 Looks like you're on 3.14, did this have the fixes from my git tree
 that went into 3.15-rc?
 
You're correct, it's running 3.14.0. Considering that it's my main laptop
that I kind of need to work, I avoid rc kernels if possible :)
But if I had known that 3.14 had corruption problems, I'd have
re-thought that :)
(besides my report, were there other ones I missed? Is 3.14.0 something
to avoid for now?)
(yes, I know 3.14.3 is out now, I should upgrade)

 For now I'd say that if you can make a dd image of the FS, please do
 so.  Otherwise, I don't want to suck down your time right before the
 trip.

A full dd image is not practical, it's 1TB and I have nowhere to put it.
I could do an image if you'd like, and upload it when I have proper
internet (I'm thinking it's likely going to be a 1GB upload)

(By the way, I'm already on the trip: I have 1h before my next
plane and a bit of time tonight (in 10h, my time) to upload stuff
or more logs if that helps.)

But more importantly, I have my main file server at home running 3.14.0
too. Is there a risk of known corruption, or nothing known yet?

Or if you'd like output of fsck in dry-run mode, I can do that too.

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


[PATCH] Btrfs-progs: fsck: add an option to check data csums

2014-05-07 Thread Wang Shilong
This patch adds an option '--check-data-csum' to verify data csums.
fsck won't check data csums unless users specify this option explicitly.
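
As a rough illustration of what per-sector data checksum verification amounts to, here is a standalone userspace sketch in C. It is not the code from this patch, which relies on btrfs_csum_data() and btrfs_csum_final(); the function names below are hypothetical, and the sketch assumes the checksum is CRC-32C seeded with all ones and inverted at the end. Retrying other mirrors on mismatch is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t sketch_crc32c(const unsigned char *buf, size_t len)
{
	uint32_t crc = ~(uint32_t)0;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/*
 * Walk the buffer one sector at a time and compare each computed
 * checksum with the stored one. Returns the index of the first
 * mismatching sector, or -1 if every sector matches.
 */
long sketch_first_bad_sector(const unsigned char *data, size_t len,
			     size_t sectorsize, const uint32_t *expected)
{
	size_t index = 0;

	for (size_t off = 0; off + sectorsize <= len; off += sectorsize) {
		if (sketch_crc32c(data + off, sectorsize) != expected[index])
			return (long)index;
		index++;
	}
	return -1;
}
```

A caller would pass the extent data, its length, the sector size, and the array of stored checksums; a real checker would then retry the failing sector from another mirror, as the patch does with its mirror loop.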

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
 Documentation/btrfs-check.txt |   2 +
 cmds-check.c  | 122 --
 2 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/Documentation/btrfs-check.txt b/Documentation/btrfs-check.txt
index 485a49c..bc10755 100644
--- a/Documentation/btrfs-check.txt
+++ b/Documentation/btrfs-check.txt
@@ -30,6 +30,8 @@ try to repair the filesystem.
 create a new CRC tree.
 --init-extent-tree::
 create a new extent tree.
+--check-data-csum::
+check data csums.
 
 EXIT STATUS
 ---
diff --git a/cmds-check.c b/cmds-check.c
index 103efc5..b53d49c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -53,6 +53,7 @@ static LIST_HEAD(delete_items);
 static int repair = 0;
 static int no_holes = 0;
 static int init_extent_tree = 0;
+static int check_data_csum = 0;
 
 struct extent_backref {
struct list_head list;
@@ -3634,6 +3635,106 @@ static int check_space_cache(struct btrfs_root *root)
return error ? -EINVAL : 0;
 }
 
+static int read_extent_data(struct btrfs_root *root, char *data,
+			u64 logical, u64 len, int mirror)
+{
+	u64 offset = 0;
+	struct btrfs_multi_bio *multi = NULL;
+	struct btrfs_fs_info *info = root->fs_info;
+	struct btrfs_device *device;
+	int ret = 0;
+	u64 read_len;
+	unsigned long bytes_left = len;
+
+	while (bytes_left) {
+		read_len = bytes_left;
+		device = NULL;
+		ret = btrfs_map_block(&info->mapping_tree, READ,
+				logical + offset, &read_len, &multi,
+				mirror, NULL);
+		if (ret) {
+			fprintf(stderr, "Couldn't map the block %llu\n",
+				logical + offset);
+			goto error;
+		}
+		device = multi->stripes[0].dev;
+
+		if (device->fd == 0)
+			goto error;
+
+		if (read_len > root->sectorsize)
+			read_len = root->sectorsize;
+		if (read_len > bytes_left)
+			read_len = bytes_left;
+
+		ret = pread64(device->fd, data + offset, read_len,
+			      multi->stripes[0].physical);
+		if (ret != read_len)
+			goto error;
+		offset += read_len;
+		bytes_left -= read_len;
+		kfree(multi);
+		multi = NULL;
+	}
+	return 0;
+error:
+	kfree(multi);
+	return -EIO;
+}
+
+static int check_extent_csums(struct btrfs_root *root, u64 bytenr,
+			u64 num_bytes, unsigned long leaf_offset,
+			struct extent_buffer *eb) {
+
+	u64 offset = 0;
+	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
+	char *data;
+	u32 crc;
+	unsigned long tmp;
+	char result[csum_size];
+	char out[csum_size];
+	int ret = 0;
+	__s64 cmp;
+	int mirror;
+	int num_copies = btrfs_num_copies(&root->fs_info->mapping_tree,
+					bytenr, num_bytes);
+
+	BUG_ON(num_bytes % root->sectorsize);
+	data = malloc(root->sectorsize);
+	if (!data)
+		return -ENOMEM;
+
+	while (offset < num_bytes) {
+		mirror = 0;
+again:
+		ret = read_extent_data(root, data, bytenr + offset,
+				root->sectorsize, mirror);
+		if (ret)
+			goto out;
+
+		crc = ~(u32)0;
+		crc = btrfs_csum_data(NULL, (char *)data, crc,
+				      root->sectorsize);
+		btrfs_csum_final(crc, result);
+
+		tmp = leaf_offset + offset / root->sectorsize * csum_size;
+		read_extent_buffer(eb, out, tmp, csum_size);
+		cmp = memcmp(out, result, csum_size);
+		if (cmp) {
+			fprintf(stderr, "mirror: %d range bytenr: %llu, len: %d checksum mismatch\n",
+				mirror, bytenr + offset, root->sectorsize);
+			if (mirror < num_copies - 1) {
+				mirror += 1;
+				goto again;
+			}
+		}
+		offset += root->sectorsize;
+	}
+out:
+	free(data);
+	return ret;
+}
+
 static int check_extent_exists(struct btrfs_root *root, u64 bytenr,
   u64 num_bytes)
 {
@@ -3771,6 +3872,8 @@ static int check_csums(struct btrfs_root *root)
 	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
int errors = 0;
int ret;
+   u64 data_len;
+   unsigned long 

Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?

2014-05-07 Thread Marc MERLIN
On Wed, May 07, 2014 at 05:43:44PM -0700, Marc MERLIN wrote:
 A full dd image is not practical, it's 1TB and I have nowhere to put it.
 I could do an image if you'd like, and upload it when I have proper
 internet (I'm thinking it's likely going to be a 1GB upload)

In the meantime, here is fsck output:
legolas:/boot/grub# btrfsck /dev/mapper/disk1 2>&1 | tee /tmp/fsck
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=10983188636980216968
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
Check tree block failed, want=828930883584, have=12509109177217855588
read block failed check_tree_block
Couldn't read tree root
Critical roots corrupted, unable to fsck the FS
Checking filesystem on /dev/mapper/disk1
UUID: 4850ee22-bf32-4131-a841-02abdb4a5ba6

Let me know if I should try 
--init-csum-tree and/or --init-extent-tree

legolas:/# /sbin/btrfs-find-root /dev/mapper/disk1 
Super think's the tree root is at 828930883584, chunk root 20979712
Well block 12585312256 seems great, but generation doesn't match, have=410782, want=415424 level 0
(...)
Well block 82629248 seems great, but generation doesn't match, have=415420, want=415424 level 0
Found tree root at 828930887680 gen 415424 level 0
legolas:/# 

I noted that:
828930887680 - 828930883584 = 4096

So the tree root that btrfs-find-root located sits 4096 bytes past the
address the superblock points to? Could that be my problem?

Can btrfs restore be used to navigate the filesystem and look for files and
patterns without dumping the entire filesystem, which I don't have room for?

In the meantime, I didn't get it to work anyway:
legolas:/var/local/space/nobck# btrfs restore -t 828930887680 /dev/mapper/disk1 restore
Couldn't setup extent tree
Couldn't read fs root: -2
extent buffer leak: start 828930887680 len 4096

Now, even if that worked, 
https://btrfs.wiki.kernel.org/index.php/Restore#Advanced_usage
says I can use -r to only restore a subvolume, but I don't know its objectid.
How would I do this?

(I don't actually really need the data, I'm just trying to learn what I
would do if I did)

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: [PATCH] btrfs-progs: doc: link btrfsck to btrfs-check

2014-05-07 Thread Qu Wenruo


-------- Original Message --------
Subject: Re: [PATCH] btrfs-progs: doc: link btrfsck to btrfs-check
From: David Sterba <dste...@suse.cz>
To: Qu Wenruo <quwen...@cn.fujitsu.com>
Date: 2014-04-18 22:48

On Thu, Apr 17, 2014 at 08:47:28AM +0800, Qu Wenruo wrote:

@@ -73,6 +74,7 @@ install: install-man
  install-man: man
$(INSTALL) -d -m 755 $(DESTDIR)$(man8dir)
$(INSTALL) -m 644 $(GZ_MAN8) $(DESTDIR)$(man8dir)
+   $(LNS) btrfs-check.txt $(DESTDIR)$(man8dir)

Shouldn't the source of the soft link be btrfs-check.8.gz?
I forgot to mention that the destination is also wrong. This would make
$(DESTDIR)$(man8dir)/btrfs-check.8.gz a symlink that points to itself
(an infinite loop).
The correct rule should look like the following:
+   $(LNS) btrfs-check.8.gz $(DESTDIR)$(man8dir)/btrfsck.8.gz

Thanks,
Qu


@@ -47,4 +49,3 @@ SEE ALSO
  `mkfs.btrfs`(8),
  `btrfs-scrub`(8),
  `btrfs-rescue`(8)
-`btrfsck`(8)

Sorry to bother you, but 'btrfs-scrub'/'btrfs-rescue' and 'btrfs-restore'
also seem to mention 'btrfsck' and may need that reference removed as well.

Thanks for catching them, I'll fix it up.


Re: [PATCH V2 10/10] Btrfs: reclaim the reserved metadata space at background

2014-05-07 Thread Miao Xie
On Mon, 10 Mar 2014 09:35:13 -0400, Josef Bacik wrote:
 On 03/06/2014 12:55 AM, Miao Xie wrote:
 Before applying this patch, the task had to reclaim the metadata
 space by itself if the metadata space was not enough. And When the
 task started the space reclamation, all the other tasks which
 wanted to reserve the metadata space were blocked. At some cases,
 they would be blocked for a long time, it made the performance
 fluctuate wildly.

 So we introduce the background metadata space reclamation, when the
 space is about to be exhausted, we insert a reclaim work into the
 workqueue, the worker of the workqueue helps us to reclaim the
 reserved space at the background. By this way, the tasks needn't
 reclaim the space by themselves at most cases, and even if the
 tasks have to reclaim the space or are blocked for the space
 reclamation, they will get enough space more quickly.

 We needn't worry about the early enospc problem because all the
 reclaim work is serialized by the lock.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 
 This causes generic/015 to fail with early enospc, I'm kicking this
 patch out, I'll take the rest.  Thanks,

It is not an early enospc problem.

This test checks whether a file's space is released immediately after the
file is deleted. In fact, the result of the test is unstable, because the
kernel may be syncing the file data when we delete it; if so, the space of
the file is not released immediately.

But the case described above is rare, because the fs in this test is just
50MB and most machines have much more memory (maybe > 1GB). There are not
many dirty pages, so the background flusher may not be woken up
immediately; no one holds the inode of the test file after we delete it,
and its space can be released immediately.

After applying this patch, we flush the dirty pages because our background
metadata space reclaimer finds that the metadata space is about to be used up
(< 5% of the total metadata size) and needs to flush dirty pages to reclaim
some delalloc metadata space. That is, this patch makes the above case happen
more easily.

Anyway, we need to improve this patch even though it is not a bug. I will
send out a new one.

Thanks
Miao

 
 Josef
 
 



[PATCH V3] Btrfs: reclaim the reserved metadata space at background

2014-05-07 Thread Miao Xie
Before applying this patch, a task had to reclaim the metadata space
by itself if there was not enough of it. And when the task started
the space reclamation, all the other tasks which wanted to reserve
metadata space were blocked. In some cases, they would be blocked for
a long time, which made the performance fluctuate wildly.

So we introduce background metadata space reclamation: when the space
is about to be exhausted, we insert a reclaim work into the workqueue, and
the worker of the workqueue helps us reclaim the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves in
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.
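
For readers following along, the "about to be exhausted" condition can be sketched in plain C, going by the threshold logic visible in the extent-tree.c hunk of this patch. This is a userspace illustration only, with hypothetical names: it assumes div_factor_fine(x, p) means "p percent of x", it takes the result of the 1MB can_overcommit() probe as a plain flag, and it leaves out both the initial overcommit check and the final clamp, which is cut off in this archive.

```c
#include <stdint.h>

/* Assumed meaning of div_factor_fine(num, factor): factor percent of num. */
uint64_t sketch_percent_of(uint64_t num, unsigned factor)
{
	return num * factor / 100;
}

/*
 * Bytes by which current usage exceeds the reclaim threshold.
 * The threshold is 95% of the total when 1MB could still be
 * overcommitted, and 90% otherwise. Returns 0 when usage is at or
 * below the threshold, meaning no background reclaim is needed.
 */
uint64_t sketch_bytes_over_threshold(uint64_t used, uint64_t total,
				     int can_overcommit_1m)
{
	uint64_t expected = sketch_percent_of(total, can_overcommit_1m ? 95 : 90);

	return used > expected ? used - expected : 0;
}
```

With total = 1000 and used = 960, the sketch yields 10 when overcommit is possible and 60 when it is not; with used = 940 and overcommit possible it yields 0.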

Here is my test result(Tested by compilebench):
 Memory:2GB
 CPU:   2Cores * 1CPU
 Partition: 40GB(SSD)

Test command:
 # compilebench -D mnt -m

Without this patch:
 intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
 compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
 read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
 delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)

With this patch:
 intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
 compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
 read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
 delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v2 - v3:
- change the condition that the background reclamation starts.
---
 fs/btrfs/ctree.h   |   6 +++
 fs/btrfs/disk-io.c |   3 ++
 fs/btrfs/extent-tree.c | 105 -
 fs/btrfs/super.c   |   1 +
 4 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4c48df5..f264edf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -33,6 +33,7 @@
 #include <asm/kmap_types.h>
 #include <linux/pagemap.h>
 #include <linux/btrfs.h>
+#include <linux/workqueue.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -1313,6 +1314,8 @@ struct btrfs_stripe_hash_table {
 
 #define BTRFS_STRIPE_HASH_TABLE_BITS 11
 
+void btrfs_init_async_reclaim_work(struct work_struct *work);
+
 /* fs_info */
 struct reloc_control;
 struct btrfs_device;
@@ -1688,6 +1691,9 @@ struct btrfs_fs_info {
 
struct semaphore uuid_tree_rescan_sem;
unsigned int update_uuid_tree_gen:1;
+
+   /* Used to reclaim the metadata space in the background. */
+   struct work_struct async_reclaim_work;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 029d46c..475889a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2291,6 +2291,7 @@ int open_ctree(struct super_block *sb,
 	atomic_set(&fs_info->balance_cancel_req, 0);
 	fs_info->balance_ctl = NULL;
 	init_waitqueue_head(&fs_info->balance_wait_q);
+	btrfs_init_async_reclaim_work(&fs_info->async_reclaim_work);
 
 	sb->s_blocksize = 4096;
 	sb->s_blocksize_bits = blksize_bits(4096);
@@ -3603,6 +3604,8 @@ int close_ctree(struct btrfs_root *root)
 	/* clear out the rbtree of defraggable inodes */
 	btrfs_cleanup_defrag_inodes(fs_info);
 
+	cancel_work_sync(&fs_info->async_reclaim_work);
+
 	if (!(fs_info->sb->s_flags & MS_RDONLY)) {
 		ret = btrfs_commit_super(root);
 		if (ret)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1306487..5a5e156 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4201,6 +4201,104 @@ static int flush_space(struct btrfs_root *root,
 
return ret;
 }
+
+static inline u64
+btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
+				 struct btrfs_space_info *space_info)
+{
+	u64 used;
+	u64 expected;
+	u64 to_reclaim;
+
+	to_reclaim = min_t(u64, num_online_cpus() * 1024 * 1024,
+				16 * 1024 * 1024);
+	spin_lock(&space_info->lock);
+	if (can_overcommit(root, space_info, to_reclaim,
+			   BTRFS_RESERVE_FLUSH_ALL)) {
+		to_reclaim = 0;
+		goto out;
+	}
+
+	used = space_info->bytes_used + space_info->bytes_reserved +
+	       space_info->bytes_pinned + space_info->bytes_readonly +
+	       space_info->bytes_may_use;
+	if (can_overcommit(root, space_info, 1024 * 1024,
+			   BTRFS_RESERVE_FLUSH_ALL))
+		expected = div_factor_fine(space_info->total_bytes, 95);
+	else
+		expected = div_factor_fine(space_info->total_bytes, 90);
+
+	if (used > expected)
+		to_reclaim = used - expected;
+	else
+		to_reclaim = 0;
+   to_reclaim = min(to_reclaim, 

Re: btrfs issues in 3.14

2014-05-07 Thread Liu Bo
On Wed, May 07, 2014 at 09:35:06AM -0300, Kenny MacDermid wrote:
 On Tue, May 6, 2014 at 11:22 PM, Liu Bo bo.li@oracle.com wrote:
 
  What does sysrq+w say when the hang happens?
 
 The whole system isn't hung, I may have explained that wrong. The
 system will hang if I try to shutdown, and the process will hang if I
 try to kill -9 it.
 
 It looks like the browser is in this state currently so I did an 'echo
 w > /proc/sysrq-trigger' and have attached the full dmesg with the
 browser issues and the output.

Those stacks show the blocked tasks are waiting for a page's writeback, but
they don't show what blocks the endio process of that page.

I'd recommend you try the latest 3.15.0-rc4 or btrfs-next, as many fixes
have been merged during this period.

thanks,
-liubo


[PATCH 2/2] btrfs-progs: update man page for btrfs-show-super

2014-05-07 Thread Gui Hecheng
Add the '-f' option to the btrfs-show-super man page.
With this option, the sys chunk array and backup roots info
are also shown.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 Documentation/btrfs-show-super.txt | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-show-super.txt b/Documentation/btrfs-show-super.txt
index e8e17ab..074700f 100644
--- a/Documentation/btrfs-show-super.txt
+++ b/Documentation/btrfs-show-super.txt
@@ -20,8 +20,13 @@ Mainly used for debug purpose.
 
 OPTIONS
 ---
+-f::
+Print full superblock information.
++
+Including the system chunk array and backup roots.
+
 -a::
-Print all the superblock information.
+Print information of all superblocks.
 +
 If this option is given, '-i' option will be ignored.
 
-- 
1.8.1.4



Re: [PATCH v2 1/3] xfstests/btrfs: add qgroup rescan stress test

2014-05-07 Thread Wang Shilong

On 05/08/2014 04:58 AM, Josef Bacik wrote:

On 03/09/2014 11:44 PM, Wang Shilong wrote:

Test flow is to run fsstress after triggering quota rescan.
The rule is simple: we just remove all files and directories,
sync the filesystem and see if qgroup's ref and excl are nodesize.

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
v1-v2:
switch into new helper _run_btrfs_util_prog()
---
  tests/btrfs/041 | 76 
+

  tests/btrfs/041.out |  3 +++
  tests/btrfs/group   |  1 +
  3 files changed, 80 insertions(+)
  create mode 100644 tests/btrfs/041
  create mode 100644 tests/btrfs/041.out

diff --git a/tests/btrfs/041 b/tests/btrfs/041
new file mode 100644
index 000..92bd080
--- /dev/null
+++ b/tests/btrfs/041
@@ -0,0 +1,76 @@
+#! /bin/bash
+# FSQA Test No. btrfs/041
+#
+# Quota rescan stress test, we run fsstress and quota rescan concurrently
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2014 Fujitsu.  All rights reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+#
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1
+
+_cleanup()
+{
+cd /
+rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+rm -f $seqres.full
+
+run_check _scratch_mkfs -b 1g --nodesize 4096
+run_check _scratch_mount
+


Add -o nospace_cache here please, otherwise I don't get the same output.


I am a little confused: why do we need to specify this mount option explicitly?
As far as I know, the space cache is not included in qgroup space accounting.

Thanks,
Wang



+# -w ensures that the only ops are ones which cause write I/O
+run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
+$FSSTRESS_AVOID >/dev/null
+
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
+   $SCRATCH_MNT/snap1 >> $seqres.full 2>&1


_run_btrfs_util_prog will already redirect to $seqres.full, you don't 
need this part.



+
+run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 1000 \
+   $FSSTRESS_AVOID /dev/null
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+
+#ignore removing subvolume errors
+rm -rf $SCRATCH_MNT/*  /dev/null
+
+_run_btrfs_util_prog filesystem sync $SCRATCH_MNT >> $seqres.full 2>&1


Same here.

+_run_btrfs_util_prog qgroup show $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
+| $AWK_PROG '{print $1 $2 $3 }'
+


You can't use _run_btrfs_util_prog here, it will eat the output. You 
need to use $BTRFS_UTIL_PROG instead.  Fix these up and resend, this 
is a really important test and I needed it to make sure my qgroups 
patch was right (which it is now.)  Thanks,


Josef
.





Re: [PATCH] btrfs/035: update clone test to expect EOPNOTSUPP

2014-05-07 Thread Liu Bo
On Wed, May 07, 2014 at 02:33:18PM +0200, David Disseldorp wrote:
 With kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, the first
 clone-range overwrite attempt now fails with EOPNOTSUPP, rather than
 tripping a Btrfs BUG_ON().
 
 This test now trips a new Btrfs bug, in which EIO is returned for
 subsequent reads following the second clone range ioctl.
 

Hi David,

Something different here, I didn't get EIO on 3.15.0-rc4.

thanks,
-liubo

 Signed-off-by: David Disseldorp dd...@suse.de
 ---
  tests/btrfs/035 | 11 +++
  tests/btrfs/035.out |  5 +
  2 files changed, 16 insertions(+)
 
 diff --git a/tests/btrfs/035 b/tests/btrfs/035
 index 6808179..c9530f6 100755
 --- a/tests/btrfs/035
 +++ b/tests/btrfs/035
 @@ -57,21 +57,32 @@ src_str=aa
  echo -n $src_str > $SCRATCH_MNT/src
  
  $CLONER_PROG $SCRATCH_MNT/src  $SCRATCH_MNT/src.clone1
 +cat $SCRATCH_MNT/src.clone1
 +echo
  
  src_str=bbcc
  
  echo -n $src_str > $SCRATCH_MNT/src
  
  $CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone2
 +cat $SCRATCH_MNT/src.clone2
 +echo
  
 +# Prior to kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, this clone
 +# resulted in a BUG_ON in __btrfs_drop_extents(). The kernel now returns
 +# EOPNOTSUPP up to userspace.
  snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone1 | awk '{print $5}'`
  echo "attempting ioctl (src.clone1 src)"
  $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
   $SCRATCH_MNT/src.clone1 $SCRATCH_MNT/src
 +cat $SCRATCH_MNT/src
 +echo
  
  snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone2 | awk '{print $5}'`
  echo "attempting ioctl (src.clone2 src)"
  $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
   $SCRATCH_MNT/src.clone2 $SCRATCH_MNT/src
 +# BUG: subsequent access attempts currently result in EIO...
 +cat $SCRATCH_MNT/src
  
  status=0 ; exit
 diff --git a/tests/btrfs/035.out b/tests/btrfs/035.out
 index f86cadf..0ea2c4f 100644
 --- a/tests/btrfs/035.out
 +++ b/tests/btrfs/035.out
 @@ -1,3 +1,8 @@
  QA output created by 035
 +aa
 +bbcc
  attempting ioctl (src.clone1 src)
 +clone failed: Operation not supported
 +bbcc
  attempting ioctl (src.clone2 src)
 +bbcc
 -- 
 1.8.4.5
 


Re: [PATCH] xfstests: fix flink test

2014-05-07 Thread Eric Sandeen
On 5/7/14, 3:54 PM, Josef Bacik wrote:
 I don't have flink support in my xfsprogs, but it doesn't fail with command not
 found or whatever, it fails because I don't have the -T option.  So fix
 _require_xfs_io_command to check for an invalid option and not run.  This way I
 get notrun instead of a failure.  Thanks,

This actually doesn't work for me on an old kernel, if that matters; it
fails with:

/mnt/test: Is a directory

and nothing catches that.  Old xfsprogs tries to open the file
in question RDWR even before it gets to the -T option (which
would fail, I guess), and you can't do that for directories.

So I suppose we could explicitly test for that when checking
flink:

[ $command = flink ] && echo $testio | grep -q "Is a directory" && \
	_notrun "xfs_io flink support is missing"

or alternately, first just run xfs_io w/ the command but no file;
today, at least, that works:

[root@bp-05 xfstests]# xfs_io -c flink
command flink not found
[root@bp-05 xfstests]# xfs_io -c pread
[root@bp-05 xfstests]# 

so could do this before the case statement:

$XFS_IO_PROG -c "$command" 2>&1 | grep -q "not found" && \
	_notrun "xfs_io $command support is missing"

but that might be subject to future changes in xfs_io command
parsing...
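
As a rough sketch of how the output checks discussed above could be combined
in one place: the helper name classify_xfs_io_output is hypothetical and not
part of xfstests, and it only inspects output that has already been captured
rather than running xfs_io itself. The matched strings are the ones quoted in
this thread.

```shell
# Hypothetical helper (not in xfstests): map captured xfs_io output to a
# "not run" reason. Prints nothing when the command looks usable.
classify_xfs_io_output()
{
	local command="$1"
	local testio="$2"

	case "$testio" in
	*"not found"*|*"invalid option"*)
		# xfs_io does not know the command or one of its options.
		echo "xfs_io $command support is missing"
		;;
	*"Operation not supported"*)
		echo "xfs_io $command failed (old kernel/wrong fs?)"
		;;
	*"Is a directory"*)
		# Old xfs_io opens the target read-write before parsing
		# options; treat that as missing support only for flink.
		[ "$command" = "flink" ] && \
			echo "xfs_io flink support is missing"
		;;
	esac
	return 0
}
```

A caller would pass the captured output, e.g.
classify_xfs_io_output "$command" "$testio", and call _notrun with the
result if it is non-empty.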

-Eric


 Signed-off-by: Josef Bacik jba...@fb.com
 ---
  common/rc | 2 ++
  1 file changed, 2 insertions(+)
 
 diff --git a/common/rc b/common/rc
 index 5c13db5..4fa7e63 100644
 --- a/common/rc
 +++ b/common/rc
 @@ -1258,6 +1258,8 @@ _require_xfs_io_command()
  		_notrun "xfs_io $command support is missing"
  	echo $testio | grep -q "Operation not supported" && \
  		_notrun "xfs_io $command failed (old kernel/wrong fs?)"
 +	echo $testio | grep -q "invalid option" && \
 +		_notrun "xfs_io $command support is missing"
  }
  
  # Check that a fs has enough free space (in 1024b blocks)
 -- 
 1.8.3.1
 



[PATCH] Btrfs: remove OPT_acl parse when acl disabled

2014-05-07 Thread Guangliang Zhao
Even when CONFIG_BTRFS_FS_POSIX_ACL is not defined, acl can still be
enabled with a mount option. Since fs/btrfs/acl.o is not built in that
case, the mount option appears to be supported but is silently
ignored.

Signed-off-by: Guangliang Zhao lucienc...@gmail.com
---
 fs/btrfs/super.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 363404b..68ae27c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -579,9 +579,11 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				goto out;
 			}
 			break;
+#ifdef CONFIG_BTRFS_FS_POSIX_ACL
 		case Opt_acl:
 			root->fs_info->sb->s_flags |= MS_POSIXACL;
 			break;
+#endif
 		case Opt_noacl:
 			root->fs_info->sb->s_flags &= ~MS_POSIXACL;
 			break;
-- 
1.7.9.5
