[PATCH v2] btrfs-progs: Fix a buffer overflow causing segfault in fstests/btrfs/069

2015-01-06 Thread Qu Wenruo
The newly introduced search_chunk_tree_for_fs_info() does not count devid 0
in fi_args->num_devices. This causes a buffer overflow, because
get_device_info() later fills di_args with the devid 0 entry as well.

This can be triggered by fstests/btrfs/069 and by any operation that needs
to iterate over all devices, such as 'fi show' or 'dev stat', while a
replace is running.

The fix is to do an extra probe specifically for devid 0 after
search_chunk_tree_for_fs_info() and to adjust num_devices if needed.
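
The off-by-one behind the overflow can be illustrated with a small standalone sketch. This is not the btrfs-progs code: dev_exists() is a hypothetical stand-in for get_device_info(), and the device table, MAX_ID, and the three counting helpers are assumptions made up for the illustration. It only shows why a count that skips devid 0 is one short of the number of entries written during a replace, and how an extra probe for devid 0 corrects it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device table: index == devid, true if the device exists. */
#define MAX_ID 3
static const bool present[MAX_ID + 1] = { true, true, true, false };

/* Stand-in for get_device_info(): 0 on success, -1 if there is no such devid. */
int dev_exists(int devid)
{
	return (devid >= 0 && devid <= MAX_ID && present[devid]) ? 0 : -1;
}

/* Mimics a device count that ignores devid 0 (the replace target). */
size_t count_without_devid0(void)
{
	size_t n = 0;

	for (int id = 1; id <= MAX_ID; id++)
		if (dev_exists(id) == 0)
			n++;
	return n;
}

/* Same count, followed by an explicit probe for devid 0. */
size_t count_with_probe(void)
{
	size_t n = count_without_devid0();

	if (dev_exists(0) == 0)
		n++;
	return n;
}

/* Number of array entries a loop over devid 0..MAX_ID would fill. */
size_t entries_written(void)
{
	size_t n = 0;

	for (int id = 0; id <= MAX_ID; id++)
		if (dev_exists(id) == 0)
			n++;
	return n;
}
```

With this table, an array sized by count_without_devid0() holds two entries while the fill loop writes three; sizing it by count_with_probe() makes the two agree.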

Reported-by: Tsutomu Itoh <t-i...@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
Signed-off-by: Gui Hecheng <guihc.f...@cn.fujitsu.com>
---
 utils.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/utils.c b/utils.c
index af0a8fe..6581568 100644
--- a/utils.c
+++ b/utils.c
@@ -1934,8 +1934,10 @@ int get_fs_info(char *path, struct btrfs_ioctl_fs_info_args *fi_args,
 	int ret = 0;
 	int ndevs = 0;
 	int i = 0;
+	int replacing = 0;
 	struct btrfs_fs_devices *fs_devices_mnt = NULL;
 	struct btrfs_ioctl_dev_info_args *di_args;
+	struct btrfs_ioctl_dev_info_args tmp;
 	char mp[BTRFS_PATH_NAME_MAX + 1];
 	DIR *dirstream = NULL;
 
@@ -2003,6 +2005,19 @@ int get_fs_info(char *path, struct btrfs_ioctl_fs_info_args *fi_args,
 		ret = search_chunk_tree_for_fs_info(fd, fi_args);
 		if (ret)
 			goto out;
+
+		/*
+		 * search_chunk_tree_for_fs_info() does not count devid 0,
+		 * so probe for it manually here.
+		 */
+		ret = get_device_info(fd, 0, &tmp);
+		if (!ret) {
+			fi_args->num_devices++;
+			ndevs++;
+			replacing = 1;
+			if (i == 0)
+				i++;
+		}
 	}
 
 	if (!fi_args->num_devices)
@@ -2014,6 +2029,8 @@ int get_fs_info(char *path, struct btrfs_ioctl_fs_info_args *fi_args,
 		goto out;
 	}
 
+	if (replacing)
+		memcpy(di_args, &tmp, sizeof(tmp));
 	for (; i <= fi_args->max_id; ++i) {
 		ret = get_device_info(fd, i, &di_args[ndevs]);
 		if (ret == -ENODEV)
-- 
2.2.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFE: per-subvolume timestamp that is updated on every change to a subvolume

2015-01-06 Thread Qu Wenruo


-------- Original Message --------
Subject: Re: RFE: per-subvolume timestamp that is updated on every change to a subvolume
From: Qu Wenruo <quwen...@cn.fujitsu.com>
To: Lennart Poettering <lenn...@poettering.net>, linux-btrfs@vger.kernel.org
Date: 2015-01-06 14:02

-------- Original Message --------
Subject: RFE: per-subvolume timestamp that is updated on every change to a subvolume
From: Lennart Poettering <lenn...@poettering.net>
To: linux-btrfs@vger.kernel.org
Date: 2015-01-06 01:27

> > Heya!
> >
> > I am looking for a nice way to query the overall last modification
> > timestamp of a subvolume, i.e. the most recent mtime of *any* file or
> > directory within a subvolume. Ideally, I think, there would be a
> > btrfs_timespec field for this in struct btrfs_root_item, alas there
> > isn't afaics. Any chance this can be added?
>
> In fact, btrfs_root_item contains one btrfs_inode_item, which contains
> the a/c/m/otime.
>
> But I am not sure whether it contains the time you need.
>
> I'd better add acmotime output for inode_item in btrfs-debug-tree and
> try it myself.
>
> Thanks,
> Qu

The acmotime values of the inode_item in root_item are not used,
so it seems anyone could use them to record the timestamp for your purpose.

Thanks,
Qu

> > Or is there another workable way to query this value? Maybe determine
> > it from the current generation of a subvolume or so? Is that tracked?
> > Ideas?
> >
> > Lennart


Re: [PATCH] btrfs: reada: Remove unused function

2015-01-06 Thread Jiri Kosina
On Mon, 5 Jan 2015, David Sterba wrote:

> > Remove the function btrfs_reada_detach() that is not used anywhere.
> >
> > This was partially found by using a static code analysis program called
> > cppcheck.
> >
> > Signed-off-by: Rickard Strandqvist <rickard_strandqv...@spectrumdigital.se>
>
> No please, this function is part of public readahead API and similar
> patch has been NACKed several times.

BTW how is this any kind of API for anybody, given it's not exported to 
modules?

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH] btrfs: reada: Remove unused function

2015-01-06 Thread David Sterba
On Tue, Jan 06, 2015 at 11:42:07AM +0100, Jiri Kosina wrote:
> On Mon, 5 Jan 2015, David Sterba wrote:
>
> > > Remove the function btrfs_reada_detach() that is not used anywhere.
> > >
> > > This was partially found by using a static code analysis program called
> > > cppcheck.
> > >
> > > Signed-off-by: Rickard Strandqvist <rickard_strandqv...@spectrumdigital.se>
> >
> > No please, this function is part of public readahead API and similar
> > patch has been NACKed several times.
>
> BTW how is this any kind of API for anybody, given it's not exported to
> modules?

Scratch 'public' from the sentence, that was misleading. The API is internal to
btrfs. The readahead can work in synchronous and asynchronous modes, and this
function is the API for the async mode.

Documented at reada.c:

/*
 * This is the implementation for the generic read ahead framework.
 *
 * To trigger a readahead, btrfs_reada_add must be called. It will start
 * a read ahead for the given range [start, end) on tree root. The returned
 * handle can either be used to wait on the readahead to finish
 * (btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
 *
...

I've experimented with it for readdir speedups, but I haven't finished that due
to other problems.


Re: btrfs_inode_item's otime?

2015-01-06 Thread Chris Samuel
On Tue, 6 Jan 2015 10:47:00 PM Chris Samuel wrote:

> On Mon, 5 Jan 2015 06:21:52 PM Lennart Poettering wrote:
>
> > It should be easy to initialize it to the mtime when the inode is
> > first created...
>
> This I agree with, well worth doing anyway.
>
> I'll see if I can knock up a patch.

Sadly it appears that the btrfs code sets mtime/ctime/atime at inode creation
via the normal filesystem inode structure, not through its own, and as that
doesn't include otime I'm afraid it's out of my league.  Worth a shot though!

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC




Re: btrfs_inode_item's otime?

2015-01-06 Thread Chris Samuel
On Mon, 5 Jan 2015 06:21:52 PM Lennart Poettering wrote:

> Is this on purpose, or simply an oversight?


The only hint I can see that it's deliberate is the comment in fs/btrfs/send.c 
that says:

/* TODO Add otime support when the otime patches get into upstream */

However...

> It should be easy to initialize it to the mtime when the inode is
> first created...

This I agree with, well worth doing anyway.

I'll see if I can knock up a patch.

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC




[PATCH] btrfs: qgroup: move WARN_ON() to the correct location.

2015-01-06 Thread Dongsheng Yang
In function qgroup_excl_accounting(), we need to WARN when
qg->excl is less than what we want to free; the same applies to
child and parent qgroups. But currently, for a parent qgroup, the
WARN_ON() is located after qg->excl has been reduced, so it can
trigger even when the free is perfectly valid.

This patch moves the WARN_ON() before the update of qg->excl.
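
The effect of the ordering can be shown with a small userspace sketch. It is only an illustration under assumptions: struct fake_qgroup and both helper functions are hypothetical and merely model an exclusive byte counter, they are not the kernel code. Each helper returns whether the warning condition would be true for a free of num_bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a qgroup's exclusive byte counter. */
struct fake_qgroup {
	uint64_t excl;
};

/*
 * Check placed AFTER the subtraction: it compares the already reduced
 * counter against num_bytes, so it can fire on a perfectly valid free.
 */
bool warns_when_checked_after(struct fake_qgroup *qg, uint64_t num_bytes)
{
	qg->excl -= num_bytes;
	return qg->excl < num_bytes;
}

/*
 * Check placed BEFORE the subtraction: it fires only when more bytes
 * are freed than are currently accounted.
 */
bool warns_when_checked_before(struct fake_qgroup *qg, uint64_t num_bytes)
{
	bool warn = qg->excl < num_bytes;

	qg->excl -= num_bytes;
	return warn;
}
```

Freeing 60 bytes from a counter of 100 is valid, yet the "after" variant compares 40 against 60 and reports a problem; the "before" variant compares 100 against 60 and stays quiet.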

Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
---
 fs/btrfs/qgroup.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 48b60db..97159a8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1431,9 +1431,8 @@ static int qgroup_excl_accounting(struct btrfs_fs_info *fs_info,
 		qgroup = u64_to_ptr(unode->aux);
 		qgroup->rfer += sign * oper->num_bytes;
 		qgroup->rfer_cmpr += sign * oper->num_bytes;
+		WARN_ON(sign < 0 && qgroup->excl < oper->num_bytes);
 		qgroup->excl += sign * oper->num_bytes;
-		if (sign < 0)
-			WARN_ON(qgroup->excl < oper->num_bytes);
 		qgroup->excl_cmpr += sign * oper->num_bytes;
 		qgroup_dirty(fs_info, qgroup);
 
-- 
1.8.4.2



Re: [PATCH] btrfs: suppress a build warning on building 32bit kernel

2015-01-06 Thread David Sterba
On Mon, Jan 05, 2015 at 05:03:29PM +0900, Satoru Takeuchi wrote:
> > -	failrec = (struct io_failure_record *)state->private;
> > +	failrec = (struct io_failure_record *)(unsigned long)state->private;
> >
> > We're always using the 'private' data to store a pointer to
> > 'struct io_failure_record *', please change the definition in
> > 'struct extent_state' instead of the typecasting.
>
> Current definition is as follows.
>
> ===
> struct extent_state {
> ...
> 	/* for use by the FS */
> 	u64 private;
> };
> ===
>
> Is it OK to change u64 private to struct io_failure_record *failrec
> and change {set,get}_state_private() to {set,get}_state_failrec()?
> Or is it better to keep the name private as is and just change its type
> to unsigned long or (void *)?

I've looked at the implied changes that set/get functions renaming would
need, also to keep the code sane. It does not seem to be small enough to
fold in this patch so please go on with adding the typecasts. The code
could use some cleanups but bugfixes first.
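
For readers unfamiliar with the warning, the typecast approach can be sketched in userspace C. This is an illustration, not the kernel change: struct fake_failrec and both helpers are hypothetical, and uintptr_t is used here where the kernel patch uses unsigned long. The point is that a 64-bit integer is first narrowed to an integer of pointer width before being converted to a pointer, which is what avoids the size-mismatch warning on 32-bit builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type; only its address matters here. */
struct fake_failrec {
	int dummy;
};

/* Store a pointer in a 64-bit integer field. */
uint64_t ptr_to_private(struct fake_failrec *rec)
{
	return (uint64_t)(uintptr_t)rec;
}

/*
 * Convert back through an integer type of pointer width. Casting the
 * 64-bit value directly to a pointer is what compilers complain about
 * when pointers are only 32 bits wide.
 */
struct fake_failrec *private_to_ptr(uint64_t private_val)
{
	return (struct fake_failrec *)(uintptr_t)private_val;
}
```

The round trip returns the original pointer on both 32-bit and 64-bit targets, since a valid pointer always fits in uintptr_t.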


Re: BTRFS: Transaction aborted (error -5)

2015-01-06 Thread Dyweni - BTRFS

Hi,



> Try to mount with -o recovery with either kernel (newer is pretty much
> always better). If that doesn't work then you should try upgrading
> btrfs-progs to 3.18 (released dozens of hours ago) and run 'btrfs
> check' on the volume and report the results. I don't recommend using
> --repair option just yet.



Mounting with -o recovery yields the same errors in dmesg output as without
-o recovery.

This is even after upgrading the kernel to 3.18.1 and btrfs-progs to 3.18.


BTRFS check yields this:


# btrfs check /dev/sdd1
Checking filesystem on /dev/sdd1
UUID: adad9bea-fc42-4411-bfda-345111934fda
checking extents
checksum verify failed on 588447744 found 6E0D3115 wanted F90C810B
checksum verify failed on 588447744 found 6E0D3115 wanted F90C810B
checksum verify failed on 588447744 found 24492BB3 wanted F90C810B
checksum verify failed on 588447744 found 6E0D3115 wanted F90C810B
Csum didn't match
owner ref check failed [588447744 16384]
ref mismatch on [151193784320 32768] extent item 0, found 1
Backref 151193784320 root 277 owner 36161 offset 3833856 num_refs 0 not found in extent tree
Incorrect local backref count on 151193784320 root 277 owner 36161 offset 3833856 found 1 wanted 0 back 0xad8be18
backpointer mismatch on [151193784320 32768]
ref mismatch on [151193817088 32768] extent item 0, found 1
Backref 151193817088 root 277 owner 36161 offset 3915776 num_refs 0 not found in extent tree
Incorrect local backref count on 151193817088 root 277 owner 36161 offset 3915776 found 1 wanted 0 back 0xad8bf00
backpointer mismatch on [151193817088 32768]
ref mismatch on [151193849856 180224] extent item 0, found 1
Backref 151193849856 root 277 owner 36355 offset 3112960 num_refs 0 not found in extent tree
Incorrect local backref count on 151193849856 root 277 owner 36355 offset 3112960 found 1 wanted 0 back 0xab333f0
backpointer mismatch on [151193849856 180224]
ref mismatch on [151194030080 3145728] extent item 0, found 7
Backref 151194030080 root 277 owner 36187 offset 1048576 num_refs 0 not found in extent tree
Incorrect local backref count on 151194030080 root 277 owner 36187 offset 1048576 found 7 wanted 0 back 0x9b82580
backpointer mismatch on [151194030080 3145728]
ref mismatch on [151197175808 32768] extent item 0, found 1
Backref 151197175808 root 277 owner 36361 offset 2523136 num_refs 0 not found in extent tree
Incorrect local backref count on 151197175808 root 277 owner 36361 offset 2523136 found 1 wanted 0 back 0xa0f5568
backpointer mismatch on [151197175808 32768]
ref mismatch on [151197208576 32768] extent item 0, found 1
Backref 151197208576 root 277 owner 36361 offset 2572288 num_refs 0 not found in extent tree
Incorrect local backref count on 151197208576 root 277 owner 36361 offset 2572288 found 1 wanted 0 back 0xa783490
backpointer mismatch on [151197208576 32768]
ref mismatch on [151197241344 32768] extent item 0, found 1
Backref 151197241344 root 277 owner 36361 offset 2621440 num_refs 0 not found in extent tree
Incorrect local backref count on 151197241344 root 277 owner 36361 offset 2621440 found 1 wanted 0 back 0xa4d67e8
backpointer mismatch on [151197241344 32768]
ref mismatch on [151197274112 32768] extent item 0, found 1
Backref 151197274112 root 277 owner 36361 offset 2703360 num_refs 0 not found in extent tree
Incorrect local backref count on 151197274112 root 277 owner 36361 offset 2703360 found 1 wanted 0 back 0x925de30
backpointer mismatch on [151197274112 32768]
ref mismatch on [151197306880 16384] extent item 0, found 1
Backref 151197306880 root 277 owner 36361 offset 3637248 num_refs 0 not found in extent tree
Incorrect local backref count on 151197306880 root 277 owner 36361 offset 3637248 found 1 wanted 0 back 0x916a658
backpointer mismatch on [151197306880 16384]
ref mismatch on [151197323264 983040] extent item 0, found 3
Backref 151197323264 root 277 owner 36208 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 151197323264 root 277 owner 36208 offset 0 found 3 wanted 0 back 0xb18a1e0
backpointer mismatch on [151197323264 983040]
ref mismatch on [151198306304 32768] extent item 0, found 1
Backref 151198306304 root 277 owner 36086 offset 3780608 num_refs 0 not found in extent tree
Incorrect local backref count on 151198306304 root 277 owner 36086 offset 3780608 found 1 wanted 0 back 0xb30f878
backpointer mismatch on [151198306304 32768]
ref mismatch on [151198339072 98304] extent item 0, found 1
Backref 151198339072 root 277 owner 36396 offset 901120 num_refs 0 not found in extent tree
Incorrect local backref count on 151198339072 root 277 owner 36396 offset 901120 found 1 wanted 0 back 0x99bfc58
backpointer mismatch on [151198339072 98304]
ref mismatch on [151198437376 16384] extent item 0, found 1
Backref 151198437376 root 277 owner 36396 offset 1015808 num_refs 0 not found in extent tree
Incorrect local backref count on 151198437376 root 277 owner 36396 offset 1015808 found 1 wanted 0 back 0x99bfd40


Re: BTRFS: Transaction aborted (error -5)

2015-01-06 Thread Dyweni - BTRFS

Hi,


> > [32079.815291] BTRFS info (device sdd1): disk space caching is enabled
> > [32082.419524] BTRFS: sdd1 checksum verify failed on 588447744 wanted F90C810B found 6E0D3115 level 0
> > [32114.418433] BTRFS: sdd1 checksum verify failed on 588447744 wanted F90C810B found 6E0D3115 level 0
> > [32125.951446] BTRFS: sdd1 checksum verify failed on 588447744 wanted F90C810B found 6E0D3115 level 0
> > [32125.959497] BTRFS: sdd1 checksum verify failed on 588447744 wanted F90C810B found 24492BB3 level 0
>
> Well I'm no expert, but it seems suspicious to me it doesn't find what
> it wants on a particular block twice, but then on the 3rd attempt it
> finds something different on the same block which also isn't what it
> wants. So that sounds like a device problem to me. Is this an SSD?
> What are your mount options (are you using discard)? And what's the
> metadata profile, is it single or DUP? I'm gonna guess it's an SSD
> with single copy of metadata which is why this isn't self-correcting.




So I finished testing the drive using 'badblocks -n -s -v' (the
non-destructive read-write mode).  It came back clean, no bad blocks
found.  This I did with the entire drive unmounted.

Yet, still, the file system reports the errors shortly after mounting.
(See below)

This drive is an older spinning type drive.  This is the drive as
reported by 'lsscsi':

[3:0:0:0]    disk    ATA      WDC WD1001FALS-0 1D05  /dev/sdd

Newegg lists it as a 'Western Digital WD Black WD1001FALS 1TB 7200 RPM
32MB Cache SATA 3.0Gb/s 3.5" Internal Hard Drive Bare Drive'

The disk is attached to the system via this, as reported by 'lspci':

01:09.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 02)

(Not sure why it lists it as a RAID controller or a PCI-X controller, as
I use it as a simple SATA controller and it plugs into a regular 32-bit
PCI slot).

The motherboard is a Micro-Star MS-6570, with an AMD Athlon XP 3000+
(2171 MHz) processor and 2GB of RAM.

Mount options are only: noatime

BTRFS Profile is:

# btrfs fi df /var/lib/ceph/osd/ceph-1/
Data, single: total=185.01GiB, used=183.39GiB
System, DUP: total=8.00MiB, used=48.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=367.19MiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=128.00MiB, used=0.00B






[162288.768747] BTRFS info (device sdc1): disk space caching is enabled
[162290.463003] BTRFS info (device sdd1): disk space caching is enabled
[162335.594094] BTRFS: sdd1 checksum verify failed on 588447744 wanted F90C810B found 6E0D3115 level 0
[162335.595476] BTRFS: sdd1 checksum verify failed on 588447744 wanted F90C810B found 6E0D3115 level 0
[162335.602066] BTRFS: sdd1 checksum verify failed on 588447744 wanted F90C810B found 24492BB3 level 0
[162335.602075] ------------[ cut here ]------------
[162335.602085] WARNING: CPU: 0 PID: 31841 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x43/0x110()
[162335.602086] BTRFS: Transaction aborted (error -5)
[162335.602087] Modules linked in: iscsi_trgt(O)
[162335.602094] CPU: 0 PID: 31841 Comm: btrfs-cleaner Tainted: G           O   3.18.1-gentoo-20150104-0921 #1
[162335.602096] Hardware name:/MS-6570, BIOS 6.00 PG 11/07/2003
[162335.602097]  e68a5e68 e68a5e68 e68a5e28 c14e48a4 e68a5e58 c10345a0 c15cbefc e68a5e84
[162335.602101]  7c61 c15d895b 0104 c11cff13 c11cff13 fffb f4d23800 c150d330
[162335.602104]  e68a5e70 c10345ee 0009 e68a5e68 c15cbefc e68a5e84 e68a5e9c c11cff13
[162335.602108] Call Trace:
[162335.602117]  [<c14e48a4>] dump_stack+0x16/0x18
[162335.602122]  [<c10345a0>] warn_slowpath_common+0x70/0x90
[162335.602125]  [<c11cff13>] ? __btrfs_abort_transaction+0x43/0x110
[162335.602127]  [<c11cff13>] ? __btrfs_abort_transaction+0x43/0x110
[162335.602130]  [<c10345ee>] warn_slowpath_fmt+0x2e/0x30
[162335.602133]  [<c11cff13>] __btrfs_abort_transaction+0x43/0x110
[162335.602138]  [<c11ea884>] btrfs_run_delayed_refs.part.73+0xd4/0x1d0
[162335.602140]  [<c11ea98f>] btrfs_run_delayed_refs+0xf/0x20
[162335.602143]  [<c11f96f4>] btrfs_should_end_transaction+0x34/0x50
[162335.602146]  [<c11e8ef9>] btrfs_drop_snapshot+0x1c9/0x740
[162335.602149]  [<c11fb152>] btrfs_clean_one_deleted_snapshot+0x62/0x90
[162335.602152]  [<c11f2a49>] cleaner_kthread+0xd9/0x110
[162335.602155]  [<c11f2970>] ? btrfs_destroy_pinned_extent+0x120/0x120
[162335.602160]  [<c1047415>] kthread+0x95/0xb0
[162335.602164]  [<c14e9100>] ret_from_kernel_thread+0x20/0x30
[162335.602166]  [<c1047380>] ? kthread_worker_fn+0xb0/0xb0
[162335.602168] ---[ end trace ba640116f371d2ff ]---
[162335.602171] BTRFS: error (device sdd1) in btrfs_run_delayed_refs:2792: errno=-5 IO failure
[162335.602173] BTRFS info (device sdd1): forced readonly









Re: BTRFS: Transaction aborted (error -5)

2015-01-06 Thread Dyweni - BTRFS

Hi,

BTRFS check on /dev/sdc1 reveals everything looks ok:

# btrfs check /dev/sdc1
Checking filesystem on /dev/sdc1
UUID: 26ed1033-429a-444f-97cc-ce8103db4c39
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
found 195515710524 bytes used err is 0
total csum bytes: 205915200
total tree bytes: 407355392
total fs tree bytes: 94830592
total extent tree bytes: 31588352
btree space waste bytes: 100867438
file data blocks allocated: 537492316160
 referenced 195656101888
Btrfs v3.18




(/dev/sdd1 and /dev/sdc1 are the only two btrfs file systems in this
machine).

Oddly, when the problem with /dev/sdd1 started, problems with /dev/sdc1 were
also reported, but /dev/sdc1 managed to fix itself.

Below is the complete dmesg output from when the problems first started
until /dev/sdd1 went readonly with errors.

The strangest part of all of this is that the dmesg output shows no
errors about the drive being physically bad.
(I ran badblocks -nsv on both /dev/sdd and /dev/sdc, and it confirmed 0
bad blocks for both drives).





[25581.099684] BTRFS: sdd1 checksum verify failed on 521797632 wanted 8F2F5FEC found 3E879EFE level 0
[25581.105441] BTRFS: read error corrected: ino 1 off 521797632 (dev /dev/sdd1 sector 1035520)
[25581.105612] BTRFS: read error corrected: ino 1 off 521801728 (dev /dev/sdd1 sector 1035528)
[25581.105784] BTRFS: read error corrected: ino 1 off 521805824 (dev /dev/sdd1 sector 1035536)
[25581.105956] BTRFS: read error corrected: ino 1 off 521809920 (dev /dev/sdd1 sector 1035544)
[2.799514] BTRFS: sdd1 checksum verify failed on 680296448 wanted AB0E191F found 192D4134 level 0
[2.856199] BTRFS: read error corrected: ino 1 off 680296448 (dev /dev/sdd1 sector 1345088)
[2.860571] BTRFS: read error corrected: ino 1 off 680300544 (dev /dev/sdd1 sector 1345096)
[2.909634] BTRFS: read error corrected: ino 1 off 680304640 (dev /dev/sdd1 sector 1345104)
[2.909876] BTRFS: read error corrected: ino 1 off 680308736 (dev /dev/sdd1 sector 1345112)
[29292.777237] BTRFS: sdc1 checksum verify failed on 937738240 wanted F4196CDA found AF30B394 level 0
[29292.778022] BTRFS: sdc1 checksum verify failed on 937738240 wanted F4196CDA found AF30B394 level 0
[29292.781889] BTRFS: read error corrected: ino 1 off 937738240 (dev /dev/sdc1 sector 1847904)
[29292.782054] BTRFS: read error corrected: ino 1 off 937742336 (dev /dev/sdc1 sector 1847912)
[29292.782224] BTRFS: read error corrected: ino 1 off 937746432 (dev /dev/sdc1 sector 1847920)
[29292.782399] BTRFS: read error corrected: ino 1 off 937750528 (dev /dev/sdc1 sector 1847928)
[29691.731107] BTRFS: sdd1 checksum verify failed on 610877440 wanted 5A8006E7 found 1CFE4A20 level 0
[29691.791550] BTRFS: read error corrected: ino 1 off 610877440 (dev /dev/sdd1 sector 1209504)
[29691.793252] BTRFS: read error corrected: ino 1 off 610881536 (dev /dev/sdd1 sector 1209512)
[29691.793608] BTRFS: read error corrected: ino 1 off 610885632 (dev /dev/sdd1 sector 1209520)
[29691.793797] BTRFS: read error corrected: ino 1 off 610889728 (dev /dev/sdd1 sector 1209528)
[34626.017914] BTRFS: sdd1 checksum verify failed on 737181696 wanted 15D7099D found B6A2A7A9 level 0
[34626.022656] BTRFS: read error corrected: ino 1 off 737181696 (dev /dev/sdd1 sector 1456192)
[34626.022867] BTRFS: read error corrected: ino 1 off 737185792 (dev /dev/sdd1 sector 1456200)
[34626.023107] BTRFS: read error corrected: ino 1 off 737189888 (dev /dev/sdd1 sector 1456208)
[34626.023314] BTRFS: read error corrected: ino 1 off 737193984 (dev /dev/sdd1 sector 1456216)
[37057.349996] BTRFS: sdc1 checksum verify failed on 701792256 wanted A7BD5067 found 87EF0602 level 0
[37057.424920] BTRFS: read error corrected: ino 1 off 701792256 (dev /dev/sdc1 sector 1387072)
[37057.425178] BTRFS: read error corrected: ino 1 off 701796352 (dev /dev/sdc1 sector 1387080)
[37057.450174] BTRFS: read error corrected: ino 1 off 701800448 (dev /dev/sdc1 sector 1387088)
[37057.453476] BTRFS: read error corrected: ino 1 off 701804544 (dev /dev/sdc1 sector 1387096)
[38283.714855] BTRFS: sdd1 checksum verify failed on 190169088 wanted 27D1E032 found 585B1651 level 0
[38283.715349] BTRFS: sdd1 checksum verify failed on 190169088 wanted 27D1E032 found 585B1651 level 0
[38283.724140] BTRFS: read error corrected: ino 1 off 190169088 (dev /dev/sdd1 sector 387808)
[38283.724313] BTRFS: read error corrected: ino 1 off 190173184 (dev /dev/sdd1 sector 387816)
[38283.724485] BTRFS: read error corrected: ino 1 off 190177280 (dev /dev/sdd1 sector 387824)
[38283.724648] BTRFS: read error corrected: ino 1 off 190181376 (dev /dev/sdd1 sector 387832)
[38385.874438] BTRFS: sdd1 checksum verify failed on 472825856 wanted 937078F5 found 7FCB4F87 level 0
[38385.897113] BTRFS: read error corrected: ino 1 off 472825856 (dev /dev/sdd1 sector 939872)
[38385.897336] BTRFS: read error corrected: ino 1 off 472829952 (dev /dev/sdd1 sector 939880)

[PATCH] fstests: add generic test for fsync after unlink

2015-01-06 Thread Filipe Manana
This test is motivated by an fsync issue discovered in btrfs.
The issue was that after fsyncing an inode that got its link count
decremented, and the new link count is greater than zero, after the
fsync log replay the inode's parent directory metadata became
inconsistent - it had a wrong i_size which prevented the directory
from ever being removed (rmdir always failed with -ENOTEMPTY, even
if the directory had no more child inodes).

The btrfs issue was fixed by the following linux kernel patch:

Btrfs: fix directory inconsistency after fsync log replay

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 tests/generic/039 | 102 ++
 tests/generic/039.out |   2 +
 tests/generic/group   |   1 +
 3 files changed, 105 insertions(+)
 create mode 100755 tests/generic/039
 create mode 100644 tests/generic/039.out

diff --git a/tests/generic/039 b/tests/generic/039
new file mode 100755
index 0000000..85646f9
--- /dev/null
+++ b/tests/generic/039
@@ -0,0 +1,102 @@
+#! /bin/bash
+# FS QA Test No. 039
+#
+# This test is motivated by an fsync issue discovered in btrfs.
+# The issue was that after fsyncing an inode that got its link count
+# decremented, and the new link count is greater than zero, after the
+# fsync log replay the inode's parent directory metadata became
+# inconsistent - it had a wrong i_size which prevented the directory
+# from ever being removed (rmdir always failed with -ENOTEMPTY, even
+# if the directory had no more child inodes).
+#
+# The btrfs issue was fixed by the following linux kernel patch:
+#
+#Btrfs: fix directory inconsistency after fsync log replay
+#
+#---
+# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana <fdman...@suse.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+status=1   # failure is the default!
+
+_cleanup()
+{
+   _cleanup_flakey
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+
+_init_flakey
+_mount_flakey
+
+# Create a test file with 2 hard links.
+mkdir -p $SCRATCH_MNT/a/b
+echo "hello world" > $SCRATCH_MNT/a/b/foo
+ln $SCRATCH_MNT/a/b/foo $SCRATCH_MNT/a/b/bar
+
+# Make sure all metadata and data are durably persisted.
+sync
+
+# Now remove one of the hard links and fsync the inode.
+rm -f $SCRATCH_MNT/a/b/bar
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/b/foo
+
+# Simulate a crash/power loss. This makes sure the next mount
+# will see an fsync log and will replay that log.
+
+_load_flakey_table $FLAKEY_DROP_WRITES
+_unmount_flakey
+
+_load_flakey_table $FLAKEY_ALLOW_WRITES
+_mount_flakey
+
+# Remove the last hard link of the file and attempt to remove its parent
+# directory - this failed in btrfs because the fsync log and replay code
+# didn't decrement the parent directory's i_size - this made the btrfs
+# rmdir implementation always fail with -ENOTEMPTY.
+#
+# The parent directory's metadata inconsistency was also detected by btrfs'
+# fsck tool, which is run automatically by the fstests framework when the
+# test finishes.
+rm -f $SCRATCH_MNT/a/b/foo
+rmdir $SCRATCH_MNT/a/b
+rmdir $SCRATCH_MNT/a
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/generic/039.out b/tests/generic/039.out
new file mode 100644
index 0000000..d4e7ef6
--- /dev/null
+++ b/tests/generic/039.out
@@ -0,0 +1,2 @@
+QA output created by 039
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 1e89848..6af5a1a 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -41,6 +41,7 @@
 036 auto aio rw stress
 037 metadata auto quick
 038 auto stress
+039 metadata auto quick
 053 acl repair auto quick
 062 attr udf auto quick
 068 other auto freeze dangerous stress
-- 
2.1.3


[PATCH] Btrfs: fix directory inconsistency after fsync log replay

2015-01-06 Thread Filipe Manana
If we have an inode (file) with a link count greater than 1, remove
one of its hard links, fsync the inode, power fail/crash and then
replay the fsync log on the next mount, we end up with the parent
directory's metadata inconsistent: its i_size still reflects the
deleted hard link. This prevents the directory from ever being
deleted, as its i_size can never decrease to BTRFS_EMPTY_DIR_SIZE
even if all of its child inodes are deleted.

This is easy to reproduce with the following excerpt from a test case
for xfstests that I just made (and it passes with xfs and ext4):

mkdir $SCRATCH_MNT/testdir
echo "hello world" > $SCRATCH_MNT/testdir/foo
ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar

# Make sure all metadata and data are durably persisted.
sync

# Now remove one of the hard links and fsync the inode.
rm -f $SCRATCH_MNT/testdir/bar
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo

# Simulate a crash/power loss. This makes sure the next mount
# will see an fsync log and will replay that log.

_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey

_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey

# Remove the last hard link of the file and attempt to remove its parent
# directory - this failed in btrfs because the fsync log and replay code
# didn't decrement the parent directory's i_size - this made the btrfs
# rmdir implementation always fail with -ENOTEMPTY.
#
# The parent directory's metadata inconsistency was also detected by btrfs'
# fsck tool, which is run automatically by the fstests framework when the
# test finishes.
rm -f $SCRATCH_MNT/testdir/foo
rmdir $SCRATCH_MNT/testdir

To fix this just make sure that on unlink, if the inode's link count is
greater than 1 and its parent inode is not yet in the fsync log, we end
up logging the parent inode.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/tree-log.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 9a02da1..1d65a46 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4272,6 +4272,9 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
 	struct dentry *old_parent = NULL;
 	int ret = 0;
 	u64 last_committed = root->fs_info->last_trans_committed;
+	const struct dentry * const first_parent = parent;
+	const bool did_unlink = (BTRFS_I(inode)->last_unlink_trans >
+				 last_committed);
 
 	sb = inode->i_sb;
 
@@ -4327,7 +4330,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
 		goto end_trans;
 	}
 
-	inode_only = LOG_INODE_EXISTS;
 	while (1) {
 		if (!parent || !parent->d_inode || sb != parent->d_inode->i_sb)
 			break;
@@ -4336,8 +4338,22 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
 		if (root != BTRFS_I(inode)->root)
 			break;
 
+		/*
+		 * On unlink we must make sure our immediate parent directory
+		 * inode is fully logged. This is to prevent leaving dangling
+		 * directory index entries and a wrong directory inode's i_size.
+		 * Not doing so can result in a directory being impossible to
+		 * delete after log replay (rmdir will always fail with error
+		 * -ENOTEMPTY).
+		 */
+		if (did_unlink && parent == first_parent)
+			inode_only = LOG_INODE_ALL;
+		else
+			inode_only = LOG_INODE_EXISTS;
+
 		if (BTRFS_I(inode)->generation >
-		    root->fs_info->last_trans_committed) {
+		    root->fs_info->last_trans_committed ||
+		    inode_only == LOG_INODE_ALL) {
 			ret = btrfs_log_inode(trans, root, inode, inode_only,
 					      0, LLONG_MAX, ctx);
 			if (ret)
-- 
2.1.3
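
The decision the patch adds to the parent walk can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct sketch_inode`, `sketch_did_unlink()`, `pick_log_mode()` and `must_log_parent()` are hypothetical names that do not exist in the kernel, and the struct keeps only the two fields the patch compares against `last_trans_committed`.

```c
#include <stdbool.h>
#include <stdint.h>

/* The two logging modes named in the patch. */
enum log_mode { LOG_INODE_ALL, LOG_INODE_EXISTS };

/* Hypothetical, reduced stand-in for the per-inode state the patch reads. */
struct sketch_inode {
	uint64_t generation;        /* transaction that created the inode */
	uint64_t last_unlink_trans; /* last transaction that unlinked one of its names */
};

/* True if a name of the inode was unlinked after the last committed transaction. */
static bool sketch_did_unlink(const struct sketch_inode *inode,
			      uint64_t last_committed)
{
	return inode->last_unlink_trans > last_committed;
}

/* Only the immediate parent of an unlinked inode is logged in full. */
static enum log_mode pick_log_mode(bool did_unlink, bool is_first_parent)
{
	return (did_unlink && is_first_parent) ? LOG_INODE_ALL
					       : LOG_INODE_EXISTS;
}

/* A parent is logged if it is new in this transaction or must be logged in full. */
static bool must_log_parent(const struct sketch_inode *parent,
			    uint64_t last_committed, enum log_mode mode)
{
	return parent->generation > last_committed || mode == LOG_INODE_ALL;
}
```

With a directory created in an already committed transaction, the old condition (generation check only) would skip it; the extra `LOG_INODE_ALL` term is what forces the immediate parent into the log after an unlink.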



Re: [PATCH v3 0/3] Btrfs: Enhancment for qgroup.

2015-01-06 Thread Satoru Takeuchi
Hi Yang,

On 2015/01/05 15:16, Dongsheng Yang wrote:
> Hi Josef and others,
>
> This patch set is about enhancing qgroup.
>
> [1/3]: fix a bug about qgroup leak when we exceed quota limit,
>   It is reviewed by Josef.
> [2/3]: introduce a new accounter in qgroup to close a window where
>   user will exceed the limit by qgroup. It looks good to Josef.
> [3/3]: a new patch to fix a bug reported by Satoru.

I tested your patchset v3. Although it's far better
than patchset v2, there is still one problem.
When I wrote 1.5GiB to a subvolume with a 1.0GiB limit,
only 1.0GiB - 139 blocks (in this case, 1KiB/block) were written.

I think the user should be able to write exactly 1.0GiB in this case.

* Test result

===
+ mkfs.btrfs -f /dev/vdb
Btrfs v3.17
See http://btrfs.wiki.kernel.org for more information.

Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdb
nodesize 16384 leafsize 16384 sectorsize 4096 size 30.00GiB
+ mount /dev/vdb /root/btrfs-auto-test/
+ ret=0
+ btrfs quota enable /root/btrfs-auto-test/
+ btrfs subvolume create /root/btrfs-auto-test//sub
Create subvolume '/root/btrfs-auto-test/sub'
+ btrfs qgroup limit 1G /root/btrfs-auto-test//sub
+ dd if=/dev/zero of=/root/btrfs-auto-test//sub/file bs=1024 count=150
dd: error writing '/root/btrfs-auto-test//sub/file': Disk quota exceeded
1048438+0 records in    # Tried to write 1GiB - 138 KiB
1048437+0 records out   # Succeeded to write 1GiB - 139 KiB
1073599488 bytes (1.1 GB) copied, 19.0247 s, 56.4 MB/s
===
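
The "1GiB - 139 KiB" figure follows directly from the dd record counts above. A small sketch of the arithmetic, where `blocks_in_limit()` and `shortfall_blocks()` are hypothetical helper names introduced only for this illustration:

```c
#include <stdint.h>

/* Number of whole blocks of block_size bytes that fit under the limit. */
static uint64_t blocks_in_limit(uint64_t limit_bytes, uint64_t block_size)
{
	return limit_bytes / block_size;
}

/* Blocks the limit would allow but that dd did not manage to write. */
static uint64_t shortfall_blocks(uint64_t limit_bytes, uint64_t block_size,
				 uint64_t records_out)
{
	return blocks_in_limit(limit_bytes, block_size) - records_out;
}
```

For a 1GiB limit and bs=1024, 1048576 blocks fit under the limit; with 1048437 records out the shortfall is 139 blocks, and 1048437 * 1024 gives the 1073599488 bytes that dd reports.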

* note

I ran the reproducer five times and the result differed
slightly each time.

=====================
#   Written
---------------------
1   1GiB - 139 KiB
2   1GiB - 139 KiB
3   1GiB - 145 KiB
4   1GiB - 135 KiB
5   1GiB - 135 KiB
=====================

So I think the problem comes from timing.

When I changed the block size from 1KiB to 1MiB,
the difference in bytes got larger.

=====================
#   Written
---------------------
1   1GiB - 1 MiB
2   1GiB - 1 MiB
3   1GiB - 1 MiB
4   1GiB - 1 MiB
5   1GiB - 1 MiB
=====================

Thanks,
Satoru

 
> BTW, I have some other plans about qgroup in my TODO list:
>
> Kernel:
> a). adjust the accounters in parent qgroup when we move
> the child qgroup.
>   Currently, when we move a qgroup, the parent qgroup
> will not be updated at the same time. This will cause some wrong
> numbers in qgroup.
>
> b). add an ioctl to show the qgroup info.
>   Command "btrfs qgroup show" is showing the qgroup info
> read from qgroup tree. But there is some information in memory
> which is not synced into device. Then it will show some outdated
> numbers.
>
> c). limit and account size in 3 modes, data, metadata and both.
>   qgroup is accounting the size both of data and metadata
> together, but to a user, the data size is the most useful to them.
>
> d). remove a subvolume related qgroup when subvolume is deleted and
> there is no other reference to it.
>
> user-tool:
> a). Add the unit of B/K/M/G to btrfs qgroup show.
> b). get the information via ioctl rather than reading it from
> btree. Will keep the old way as a fallback for compatibility.
>
> Any comment and suggestion is welcome. :)
>
> Yang
>
> Dongsheng Yang (3):
>    Btrfs: qgroup: free reserved in exceeding quota.
>    Btrfs: qgroup: Introduce a may_use to account
>      space_info->bytes_may_use.
>    Btrfs: qgroup, Account data space in more proper timings.
>
>   fs/btrfs/extent-tree.c | 41 +++---
>   fs/btrfs/file.c        |  9 ---
>   fs/btrfs/inode.c       | 18 -
>   fs/btrfs/qgroup.c      | 68 +++---
>   fs/btrfs/qgroup.h      |  4 +++
>   5 files changed, 117 insertions(+), 23 deletions(-)
 



Re: [PATCH] btrfs: clear bio reference after submit_one_bio()

2015-01-06 Thread Satoru Takeuchi
Hi Naota,

On 2015/01/06 1:01, Naohiro Aota wrote:
> After submit_one_bio(), `bio' can go away. However submit_extent_page()
> leaves `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
> It will cause an invalid paging request when submit_extent_page() is called
> next time.
>
> I reproduced the ENOMEM case with the following script (need
> CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).

I confirmed that this problem reproduces with 3.19-rc3 and
does not reproduce with 3.19-rc3 plus your patch.

Tested-by: Satoru Takeuchi takeuchi_sat...@jp.fujitsu.com

Thank you for reporting this problem with the reproducer
and fixing it too.
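
The idea behind the fix, as described in the quoted text, can be sketched in user space like this. It is not the actual patch: `struct sketch_bio`, `sketch_submit_one()` and `sketch_submit_page()` are hypothetical stand-ins, and the only assumption carried over is that the submit step consumes the bio whether or not it succeeds.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bio the caller keeps a pointer to. */
struct sketch_bio {
	int sectors;
};

/*
 * Stand-in for the submit step: the bio is gone afterwards (freed here),
 * even when the submission itself reports a failure.
 */
static int sketch_submit_one(struct sketch_bio *bio, int fail)
{
	free(bio);
	return fail ? -ENOMEM : 0;
}

/*
 * Stand-in for the caller-side helper: after submitting, always clear the
 * caller's reference so that the next call cannot touch a stale pointer.
 */
static int sketch_submit_page(struct sketch_bio **bio_ret, int fail)
{
	int ret = 0;

	if (*bio_ret) {
		ret = sketch_submit_one(*bio_ret, fail);
		*bio_ret = NULL; /* cleared on success and on failure */
	}
	return ret;
}
```

Without the unconditional clear, a failed submission would leave `*bio_ret` pointing at freed memory, and the next call would dereference it, which matches the paging-request oops shown below.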

  NOTE:
  I used v3.19-rc3's tools/testing/fault-injection/failcmd.sh
  for the following ./failcmd.sh.

  ./failcmd.sh -p $percent -t $times -i $interval \
  --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \
  -- \
  cat $directory/file > /dev/null

* 3.19-rc1 + your patch

===
# ./run
512+0 records in
512+0 records out
# 
===

* 3.19-rc3

===
# ./run
512+0 records in
512+0 records out
[  188.433726] run (776): drop_caches: 1
[  188.682372] FAULT_INJECTION: forcing a failure.
name fail_page_alloc, interval 100, probability 111000, space 0, times 3
[  188.689986] CPU: 0 PID: 954 Comm: cat Not tainted 3.19.0-rc3-ktest #1
[  188.693834] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Bochs 01/01/2011
[  188.698466]  0064 88007b343618 816e5563 
88007fc0fc78
[  188.702730]  81c655c0 88007b343638 813851b5 
0010
[  188.707043]  0002 88007b343768 81188126 
88007b3435a8
[  188.711283] Call Trace:
[  188.712620]  [816e5563] dump_stack+0x45/0x57
[  188.715330]  [813851b5] should_fail+0x135/0x140
[  188.718218]  [81188126] __alloc_pages_nodemask+0xd6/0xb30
[  188.721567]  [81339075] ? blk_rq_map_sg+0x35/0x170
[  188.724558]  [a0010705] ? virtio_queue_rq+0x145/0x2b0 [virtio_blk]
[  188.728191]  [a01bd00f] ? btrfs_submit_compressed_read+0xcf/0x4d0 
[btrfs]
[  188.732079]  [811d99fb] ? kmem_cache_alloc+0x1cb/0x230
[  188.735153]  [81181265] ? mempool_alloc_slab+0x15/0x20
[  188.738188]  [811cee1a] alloc_pages_current+0x9a/0x120
[  188.741153]  [a01bd0e9] btrfs_submit_compressed_read+0x1a9/0x4d0 
[btrfs]
[  188.744835]  [a0178621] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
[  188.748225]  [a018b7b3] ? lookup_extent_mapping+0x13/0x20 [btrfs]
[  188.751547]  [a0179c08] ? btrfs_get_extent+0x98/0xad0 [btrfs]
[  188.754656]  [a01901d7] submit_one_bio+0x67/0xa0 [btrfs]
[  188.757554]  [a0193f27] submit_extent_page.isra.35+0xd7/0x1c0 
[btrfs]
[  188.760981]  [a019509d] __do_readpage+0x31d/0x7b0 [btrfs]
[  188.763920]  [a0195f10] ? btrfs_create_repair_bio+0x110/0x110 
[btrfs]
[  188.767382]  [a0179b70] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs]
[  188.770671]  [a018f88d] ? btrfs_lookup_ordered_range+0x13d/0x180 
[btrfs]
[  188.774366]  [a01958ca] 
__extent_readpages.constprop.42+0x2ba/0x2d0 [btrfs]
[  188.778031]  [a0179b70] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs]
[  188.781241]  [a01969b9] extent_readpages+0x169/0x1b0 [btrfs]
[  188.784322]  [a0179b70] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs]
[  188.789014]  [a0176b0f] btrfs_readpages+0x1f/0x30 [btrfs]
[  188.792028]  [8118bf5c] __do_page_cache_readahead+0x18c/0x1f0
[  188.795078]  [8118c09f] ondemand_readahead+0xdf/0x260
[  188.797702]  [a016c5df] ? btrfs_congested_fn+0x5f/0xa0 [btrfs]
[  188.800718]  [8118c291] page_cache_async_readahead+0x71/0xa0
[  188.803650]  [8118017f] generic_file_read_iter+0x40f/0x5e0
[  188.806480]  [811f43be] new_sync_read+0x7e/0xb0
[  188.808832]  [811f55d8] __vfs_read+0x18/0x50
[  188.811068]  [811f569a] vfs_read+0x8a/0x140
[  188.813298]  [811f5796] SyS_read+0x46/0xb0
[  188.815486]  [81125806] ? __audit_syscall_exit+0x1f6/0x2a0
[  188.818293]  [816eb8e9] system_call_fastpath+0x12/0x17
[  188.821005] BUG: unable to handle kernel paging request at 0001000c
[  188.821984] IP: [a01901b3] submit_one_bio+0x43/0xa0 [btrfs]
[  188.821984] PGD 7bad3067 PUD 0 
[  188.821984] Oops:  [#1] SMP 
[  188.821984] Modules linked in: ip6table_filter ip6_tables ebtable_nat 
ebtables bnep bluetooth rfkill btrfs xor raid6_pq microcode 8139too serio_raw 
virtio_balloon 8139cp mii nfsd auth_rpcgss nfs_acl lockd grace sunrpc 
virtio_blk ata_generic pata_acpi
[  188.821984] CPU: 1 PID: 954 Comm: cat Not tainted 3.19.0-rc3-ktest #1
[  

[PATCH] Btrfs: lookup for block group only if needed when freeing a tree block

2015-01-06 Thread Filipe Manana
Very often our extent buffer's header generation doesn't match the current
transaction's id or it is also referenced by other trees (snapshots), so
we don't need the corresponding block group cache object. Therefore only
search for it if we are going to use it, so we avoid an unnecessary search
in the block groups rbtree (and acquiring and releasing its spinlock).

Freeing a tree block is performed when COWing or deleting a node/leaf,
which implies we are holding the node/leaf's parent node lock, therefore
reducing the amount of time spent when freeing a tree block helps reduce
the amount of time we are holding the parent node's lock.

For example, for a run of xfstests/generic/083, the block group cache
object was needed only 682 times for a total of 226691 calls to free
a tree block.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a80b971..5a45253 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6205,7 +6205,6 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			   struct extent_buffer *buf,
 			   u64 parent, int last_ref)
 {
-	struct btrfs_block_group_cache *cache = NULL;
 	int pin = 1;
 	int ret;
 
@@ -6221,17 +6220,20 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 	if (!last_ref)
 		return;
 
-	cache = btrfs_lookup_block_group(root->fs_info, buf->start);
-
 	if (btrfs_header_generation(buf) == trans->transid) {
+		struct btrfs_block_group_cache *cache;
+
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 			ret = check_ref_cleanup(trans, root, buf->start);
 			if (!ret)
 				goto out;
 		}
 
+		cache = btrfs_lookup_block_group(root->fs_info, buf->start);
+
 		if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
 			pin_down_extent(root, cache, buf->start, buf->len, 1);
+			btrfs_put_block_group(cache);
 			goto out;
 		}
 
@@ -6239,6 +6241,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
 		btrfs_update_reserved_bytes(cache, buf->len, RESERVE_FREE, 0);
+		btrfs_put_block_group(cache);
 		trace_btrfs_reserved_extent_free(root, buf->start, buf->len);
 		pin = 0;
 	}
@@ -6253,7 +6256,6 @@ out:
 	 * anymore.
 	 */
 	clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
-	btrfs_put_block_group(cache);
 }
 
 /* Can return -ENOMEM */
-- 
2.1.3
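
The pattern in this patch (look the object up only on the path that uses it, and drop the reference on that same path) can be sketched in user space as follows. All names here are hypothetical: `sketch_lookup()` and `sketch_put()` stand in for the lookup/put pair, and `lookup_calls` is a counter added only so the saving can be observed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reference-counted object standing in for the block group cache. */
struct sketch_cache {
	int refs;
};

static struct sketch_cache the_cache;
static unsigned long lookup_calls; /* how many lookups were actually performed */

/* Stand-in for the lookup: takes a reference and counts the call. */
static struct sketch_cache *sketch_lookup(void)
{
	lookup_calls++;
	the_cache.refs++;
	return &the_cache;
}

/* Stand-in for the put: drops the reference taken by the lookup. */
static void sketch_put(struct sketch_cache *cache)
{
	cache->refs--;
}

/*
 * Returns true if the block was handed back to free space. The lookup is
 * only done once we know the block belongs to the current transaction.
 */
static bool sketch_free_tree_block(uint64_t buf_generation, uint64_t transid,
				   bool last_ref)
{
	struct sketch_cache *cache;

	if (!last_ref)
		return false;
	if (buf_generation != transid)
		return false; /* the cache object is not needed on this path */

	cache = sketch_lookup();
	/* ... the real code would add the extent to free space here ... */
	sketch_put(cache);
	return true;
}
```

In the xfstests/generic/083 run quoted in the commit message, only 682 of 226691 calls took the path that needs the cache object, so with this ordering the remaining calls skip the rbtree search and its spinlock entirely.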



Re: [PATCH] btrfs: qgroup: move WARN_ON() to the correct location.

2015-01-06 Thread Satoru Takeuchi
On 2015/01/06 21:54, Dongsheng Yang wrote:
> In function qgroup_excl_accounting(), we need to WARN when
> qg->excl is less than what we want to free, the same for child
> and parents. But currently, for parent qgroup, the WARN_ON()
> is located after freeing qg->excl. It will WARN out even when we
> free it normally.
>
> This patch moves this WARN_ON() before freeing qg->excl.
>
> Signed-off-by: Dongsheng Yang yangds.f...@cn.fujitsu.com

Reviewed-by: Satoru Takeuchi takeuchi_sat...@jp.fujitsu.com
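
Why the position of the check matters can be shown with plain unsigned arithmetic. In this sketch `warn_checked_before()` and `warn_checked_after()` are hypothetical helpers that return whether the warning would fire; they are not kernel functions, and `excl` is modelled as a bare 64-bit counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Apply a signed change of num_bytes to an unsigned 64-bit counter. */
static void sketch_apply(uint64_t *excl, int sign, uint64_t num_bytes)
{
	if (sign < 0)
		*excl -= num_bytes; /* wraps around on underflow, like u64 */
	else
		*excl += num_bytes;
}

/* Check first, then update: fires only if we free more than we have. */
static bool warn_checked_before(uint64_t *excl, int sign, uint64_t num_bytes)
{
	bool warned = sign < 0 && *excl < num_bytes;

	sketch_apply(excl, sign, num_bytes);
	return warned;
}

/* Update first, then check: compares the already reduced value. */
static bool warn_checked_after(uint64_t *excl, int sign, uint64_t num_bytes)
{
	sketch_apply(excl, sign, num_bytes);
	return sign < 0 && *excl < num_bytes;
}
```

Freeing exactly what is accounted (10 out of 10) is a normal case, yet the check-after variant fires because the counter is already 0. Conversely, freeing 10 out of 5 wraps the counter to a huge value, so the check-after variant stays silent on the one case it was meant to catch.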

> ---
>   fs/btrfs/qgroup.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 48b60db..97159a8 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1431,9 +1431,8 @@ static int qgroup_excl_accounting(struct btrfs_fs_info *fs_info,
>   		qgroup = u64_to_ptr(unode->aux);
>   		qgroup->rfer += sign * oper->num_bytes;
>   		qgroup->rfer_cmpr += sign * oper->num_bytes;
> +		WARN_ON(sign < 0 && qgroup->excl < oper->num_bytes);
>   		qgroup->excl += sign * oper->num_bytes;
> -		if (sign < 0)
> -			WARN_ON(qgroup->excl < oper->num_bytes);
>   		qgroup->excl_cmpr += sign * oper->num_bytes;
>   		qgroup_dirty(fs_info, qgroup);
   
 



Re: Data recovery after RBD I/O error

2015-01-06 Thread Jérôme Poulin
On Mon, Jan 5, 2015 at 6:59 AM, Austin S Hemmelgarn
ahferro...@gmail.com wrote:
> Secondly, I would highly recommend not using ANY non-cluster-aware FS on top
> of a clustered block device like RBD


For my use-case, this is just a single server using the RBD device. No
clustering involved on the BTRFS side of things. However, it was really
useful to take snapshots (just like LVM) before modifying the
filesystem in any way.


Re: kernel BUG at /home/apw/COD/linux/fs/btrfs/inode.c:3123!

2015-01-06 Thread Satoru Takeuchi

Hi Tomasz,

On 2014/12/20 8:28, Tomasz Chmielewski wrote:

Get this BUG with 3.18.1 (pasted at the bottom of the email).
Below all actions from creating the fs to BUG. I did not attempt to reproduce.


I tried to reproduce this problem and have some questions.



# mkfs.btrfs /dev/vdb
Btrfs v3.17.3
See http://btrfs.wiki.kernel.org for more information.

Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdb
 nodesize 16384 leafsize 16384 sectorsize 4096 size 256.00GiB

# mount -o noatime /dev/vdb /mnt/test/
# cd /mnt/test
# btrfs sub cre subvolume
Create subvolume './subvolume'
# dd if=/dev/urandom of=bigfile.img bs=64k


Did you really run this command? I think it would fill up the
whole of /dev/vdb. And is the file really bigfile.img
rather than subvolume/bigfile.img?


^C91758+0 records in
91757+0 records out
6013386752 bytes (6.0 GB) copied, 374.777 s, 16.0 MB/s
# btrfs sub list /mnt/test/
ID 257 gen 16 top level 5 path subvolume

# btrfs quota enable /mnt/test

# btrfs qgroup show /mnt/test
qgroupid rfer       excl
-------- ----       ----
0/5      16384      16384
0/257    6013403136 6013403136

# dd if=/dev/urandom of=bigfile2.img bs=64k
^C47721+0 records in
47720+0 records out
3127377920 bytes (3.1 GB) copied, 194.641 s, 16.1 MB/s


If bigfile.img is directly under /mnt/test, I can't understand
how this command managed to write another 3 GiB.



# btrfs qgroup show /mnt/test
qgroupid rfer       excl
-------- ----       ----
0/5      16384      16384
0/257    8704049152 8704049152
root@srv2:/mnt/test/subvolume# sync
root@srv2:/mnt/test/subvolume# btrfs qgroup show /mnt/test
qgroupid rfer       excl
-------- ----       ----
0/5      16384      16384
0/257    9140781056 9140781056

# dd if=/dev/urandom of=bigfile3.img bs=64k
^C3617580+0 records in
3617579+0 records out
237081657344 bytes (237 GB) copied, 14796 s, 16.0 MB/s


The same question applies here.

Thanks,
Satoru



# df -h
Filesystem  Size  Used Avail Use% Mounted on
(...)
/dev/vdb256G  230G   25G  91% /mnt/test


# btrfs qgroup show /mnt/test
qgroupid rfer         excl
-------- ----         ----
0/5      16384        16384
0/257    245960245248 245960245248

# ls -l
total 240451584
-rw-r--r-- 1 root root   3127377920 Dec 19 20:06 bigfile2.img
-rw-r--r-- 1 root root 237081657344 Dec 20 00:15 bigfile3.img
-rw-r--r-- 1 root root   6013386752 Dec 19 20:02 bigfile.img

# rm bigfile3.img

# sync

# dmesg
(...)
[   95.055420] BTRFS: device fsid 97f98279-21e7-4822-89be-3aed9dc05f2c devid 1 
transid 3 /dev/vdb
[  118.446509] BTRFS info (device vdb): disk space caching is enabled
[  118.446518] BTRFS: flagging fs with big metadata feature
[  118.452176] BTRFS: creating UUID tree
[  575.189412] BTRFS info (device vdb): qgroup scan completed
[15948.234826] [ cut here ]
[15948.234883] kernel BUG at /home/apw/COD/linux/fs/btrfs/inode.c:3123!
[15948.234906] invalid opcode:  [#1] SMP
[15948.234925] Modules linked in: nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 
nf_log_common xt_LOG ipt_REJECT nf_reject_ipv4 xt_tcpudp nf_conntrack_ipv4 
nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables 
dm_crypt btrfs xor crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
aesni_intel ppdev aes_x86_64 lrw raid6_pq gf128mul glue_helper ablk_helper 
cryptd serio_raw mac_hid pvpanic 8250_fintek parport_pc i2c_piix4 lp parport 
psmouse qxl ttm floppy drm_kms_helper drm
[15948.235172] CPU: 0 PID: 3274 Comm: btrfs-cleaner Not tainted 
3.18.1-031801-generic #201412170637
[15948.235193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[15948.235222] task: 880036708a00 ti: 88007b97c000 task.ti: 
88007b97c000
[15948.235240] RIP: 0010:[c0458ec9]  [c0458ec9] 
btrfs_orphan_add+0x1a9/0x1c0 [btrfs]
[15948.235305] RSP: 0018:88007b97fc98  EFLAGS: 00010286
[15948.235318] RAX: ffe4 RBX: 88007b80a800 RCX: 
[15948.235333] RDX: 219e RSI: 0004 RDI: 880079418138
[15948.235349] RBP: 88007b97fcd8 R08: 88007fc1cae0 R09: 88007ad272d0
[15948.235366] R10:  R11: 0010 R12: 88007a2d9500
[15948.235381] R13: 8800027d60e0 R14: 88007b80ac58 R15: 0001
[15948.235401] FS:  () GS:88007fc0() 
knlGS:
[15948.235418] CS:  0010 DS:  ES:  CR0: 80050033
[15948.235432] CR2: 7f0489ff CR3: 7a5e CR4: 001407f0
[15948.235464] Stack:
[15948.235473]  88007b97fcd8 c0497acf 88007b809800 
88003c207400
[15948.235498]  88007b809800 88007ad272d0 88007a2d9500 
0001
[15948.235521]  88007b97fd58 c04412e0 880079418000 
0004c0427fea
[15948.235551] Call Trace:
[15948.235601]  [c0497acf] ? 

Re: btrfs_inode_item's otime?

2015-01-06 Thread David Sterba
On Mon, Jan 05, 2015 at 06:21:52PM +0100, Lennart Poettering wrote:
> btrfs' btrfs_inode_item structure contains a field for the birth time
> of a file, .otime. This field could be quite useful, and I'd like to
> make use of it. I can query it with the BTRFS_IOC_TREE_SEARCH ioctl
> from userspace, alas it appears that the entry is never actually
> initialized to anything other than 0?
>
> Is this on purpose, or simply an oversight? It should be easy to
> initialize it to the mtime when the inode is first created...

It's probably just lack of implementation due to lack of interface to
userspace, but we should set it.

> I am aware of the discussions about introducing the birth time as
> something queryable with a future xstat() call. Even if that
> high-level API doesn't exist yet, and even if it might be messy to use
> BTRFS_IOC_TREE_SEARCH to query the otime currently, I think it would
> be good to properly initialize the field, so that pre-existing file
> systems would report useful data when xstat() is added one day...

Agreed.

> (Of course, even without xstat(), I think it would be good to have an
> unprivileged ioctl to query the otime in btrfs... the TREE_SEARCH
> ioctl after all requires privileges...)

Adding this interface is a different question. I do not like to add
ioctls that do too specialized things that normally fit into a generic
interface like the xstat example. We could use the object properties
instead (ie. export the otime as an extended attribute), but the work on
that has stalled and it's not ready to just simply add the otime in
advance.


Re: btrfs_inode_item's otime?

2015-01-06 Thread David Sterba
On Tue, Jan 06, 2015 at 11:43:22PM +1100, Chris Samuel wrote:
> On Tue, 6 Jan 2015 10:47:00 PM Chris Samuel wrote:
>
> > On Mon, 5 Jan 2015 06:21:52 PM Lennart Poettering wrote:
> >
> > > It should be easy to initialize it to the mtime when the inode is
> > > first created...
> >
> > This I agree with, well worth doing anyway.
> >
> > I'll see if I can knock up a patch.
>
> Sadly it appears that the btrfs code sets mtime/ctime/atime at inode creation
> via the normal filesystem inode structure, not through its own, and as that
> doesn't include otime I'm afraid it's out of my league.  Worth a shot though!

Set the otime in btrfs_new_inode after the call to fill_inode_item.
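
A user-space sketch of what that suggestion amounts to, assuming (as proposed earlier in the thread) that the birth time is simply initialized from the mtime chosen at creation. The types `sketch_timespec` and `sketch_inode_item` and the helper `sketch_init_times()` are hypothetical; they are not the on-disk btrfs structures or any kernel function.

```c
#include <stdint.h>

/* Hypothetical, simplified timestamp (seconds plus nanoseconds). */
struct sketch_timespec {
	int64_t sec;
	uint32_t nsec;
};

/* Hypothetical inode item keeping only the four timestamps under discussion. */
struct sketch_inode_item {
	struct sketch_timespec atime;
	struct sketch_timespec ctime;
	struct sketch_timespec mtime;
	struct sketch_timespec otime; /* birth time, currently left at 0 */
};

/*
 * Fill the regular timestamps for a newly created inode, then set the
 * birth time once from the mtime; later updates would leave otime alone.
 */
static void sketch_init_times(struct sketch_inode_item *item,
			      struct sketch_timespec now)
{
	item->atime = now;
	item->ctime = now;
	item->mtime = now;
	item->otime = item->mtime;
}
```

Because otime is written only at creation, file systems created after such a change would already carry useful birth times by the time an interface like xstat() exposes them.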