Re: [CORRUPTION FILESYSTEM] Corrupted and unrecoverable file system during the snapshot receive

2016-12-22 Thread Giuseppe Della Bianca
(synthetic resend)

Hi.

It is possible that transfers, deletions and other operations happen at the 
same time, but not on the same subvolume.

My script checks that there are no transfers in progress on the same 
subvolume.

It is possible that the same subvolume is mounted several times (my script 
does a temporary mount at the beginning and an unmount at the end).


Thanks for all.


P.S. Sorry for my bad English.


Gdb


On Wednesday, 21 December 2016 at 23:14:44, Xin Zhou wrote:
> Hi,
> A race condition can happen if multiple transfers run to the same
> destination at once. Could you tell how many transfers the scripts are
> running at a time to a specific HDD?
> 
> Thanks,
> Xin
>  
> 
> Sent: Wednesday, December 21, 2016 at 1:11 PM
> From: "Chris Murphy" 
> To: No recipient address
> Cc: "Giuseppe Della Bianca" , "Xin Zhou" ,
> "Btrfs BTRFS" 
> Subject: Re: [CORRUPTION FILESYSTEM] Corrupted and unrecoverable file
> system during the snapshot receive
> On Wed, Dec 21, 2016 at 2:09 PM, Chris Murphy  
wrote:
> > What about CONFIG_BTRFS_FS_CHECK_INTEGRITY? And then using check_int
> > mount option?
> 
> This slows things down, and in that case it might avoid the problem if
> it's the result of a race condition.
> 
> --
> Chris Murphy



Re: [CORRUPTION FILESYSTEM] Corrupted and unrecoverable file system during the snapshot receive

2016-12-22 Thread Giuseppe Della Bianca
(synthetic resend)

I'll try to compile the kernel with that config option enabled and mount 
with the check_int option.

I will keep in mind the side effects of that option.


Thanks for all.


P.S. Sorry for my bad English.



Chris Murphy wrote:
> On Wed, Dec 21, 2016 at 2:09 PM, Chris Murphy  
wrote:
> > What about CONFIG_BTRFS_FS_CHECK_INTEGRITY? And then using check_int
> > mount option?
> 
> This slows things down, and in that case it might avoid the problem if
> it's the result of a race condition.



strange btrfs deadlock

2016-12-22 Thread Christoph Anton Mitterer
Hey.

Had the following on a Debian sid:
Linux heisenberg 4.8.0-2-amd64 #1 SMP Debian 4.8.11-1 (2016-12-02)
x86_64 GNU/Linux


I was basically copying data between several filesystems all on SATA
disks attached via USB.

Unfortunately I have only a little data...



The first part may be totally unrelated... here I was doing some
recursive diff between data on sdb and sdc (both mounted ro), when I
connected a 3rd disk to the same USB3.0 hub on which the other two
disks were already connected.

That somehow made sdc fail... (interestingly, sdb seemed to continue
working).

Dec 23 04:36:04 heisenberg kernel: [38080.618202] BTRFS info (device dm-1): 
disk space caching is enabled
Dec 23 04:36:18 heisenberg kernel: [38093.903212] bash (7006): drop_caches: 3
Dec 23 04:58:44 heisenberg kernel: [39440.832610] scsi host7: uas_pre_reset: 
timed out
Dec 23 04:58:44 heisenberg kernel: [39440.832760] sd 7:0:0:0: [sdc] tag#4 
uas_zap_pending 0 uas-tag 5 inflight: CMD 
Dec 23 04:58:44 heisenberg kernel: [39440.832767] sd 7:0:0:0: [sdc] tag#4 CDB: 
Read(10) 28 00 3f 03 45 48 00 04 00 00
Dec 23 04:58:44 heisenberg kernel: [39440.832777] sd 7:0:0:0: [sdc] tag#5 
uas_zap_pending 0 uas-tag 6 inflight: CMD 
Dec 23 04:58:44 heisenberg kernel: [39440.832780] sd 7:0:0:0: [sdc] tag#5 CDB: 
Read(10) 28 00 3f 03 49 48 00 04 00 00
Dec 23 04:58:44 heisenberg kernel: [39440.832785] sd 7:0:0:0: [sdc] tag#6 
uas_zap_pending 0 uas-tag 7 inflight: CMD 
Dec 23 04:58:44 heisenberg kernel: [39440.832788] sd 7:0:0:0: [sdc] tag#6 CDB: 
Read(10) 28 00 3f 03 4d 48 00 04 00 00
Dec 23 04:58:44 heisenberg kernel: [39440.832792] sd 7:0:0:0: [sdc] tag#8 
uas_zap_pending 0 uas-tag 9 inflight: CMD 
Dec 23 04:58:44 heisenberg kernel: [39440.832796] sd 7:0:0:0: [sdc] tag#8 CDB: 
Read(10) 28 00 3f 03 51 48 00 04 00 00
Dec 23 04:58:44 heisenberg kernel: [39440.832858] sd 7:0:0:0: [sdc] tag#4 
FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Dec 23 04:58:44 heisenberg kernel: [39440.832864] sd 7:0:0:0: [sdc] tag#4 CDB: 
Read(10) 28 00 3f 03 45 48 00 04 00 00
Dec 23 04:58:44 heisenberg kernel: [39440.832870] blk_update_request: I/O 
error, dev sdc, sector 1057178952
Dec 23 04:58:44 heisenberg kernel: [39440.832917] sd 7:0:0:0: [sdc] tag#5 
FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Dec 23 04:58:44 heisenberg kernel: [39440.832921] sd 7:0:0:0: [sdc] tag#5 CDB: 
Read(10) 28 00 3f 03 49 48 00 04 00 00
Dec 23 04:58:44 heisenberg kernel: [39440.832924] blk_update_request: I/O 
error, dev sdc, sector 1057179976
Dec 23 04:58:44 heisenberg kernel: [39440.832937] BTRFS error (device dm-2): 
bdev /dev/mapper/data-c errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Dec 23 04:58:44 heisenberg kernel: [39440.832959] sd 7:0:0:0: [sdc] tag#6 
FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Dec 23 04:58:44 heisenberg kernel: [39440.832963] sd 7:0:0:0: [sdc] tag#6 CDB: 
Read(10) 28 00 3f 03 4d 48 00 04 00 00
Dec 23 04:58:44 heisenberg kernel: [39440.832966] blk_update_request: I/O 
error, dev sdc, sector 1057181000
Dec 23 04:58:44 heisenberg kernel: [39440.832980] sd 7:0:0:0: [sdc] tag#8 
FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Dec 23 04:58:44 heisenberg kernel: [39440.832985] sd 7:0:0:0: [sdc] tag#8 CDB: 
Read(10) 28 00 3f 03 51 48 00 04 00 00
Dec 23 04:58:44 heisenberg kernel: [39440.832988] blk_update_request: I/O 
error, dev sdc, sector 1057182024
Dec 23 04:58:44 heisenberg kernel: [39440.832995] BTRFS error (device dm-2): 
bdev /dev/mapper/data-c errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Dec 23 04:58:44 heisenberg kernel: [39440.833807] sd 7:0:0:0: [sdc] 
Synchronizing SCSI cache
Dec 23 04:58:45 heisenberg kernel: [39441.072663] sd 7:0:0:0: [sdc] Synchronize 
Cache(10) failed: Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
Dec 23 04:58:45 heisenberg kernel: [39441.096973] usb 4-2.4: Disable of 
device-initiated U1 failed.
Dec 23 04:58:45 heisenberg kernel: [39441.100670] usb 4-2.4: Disable of 
device-initiated U2 failed.
Dec 23 04:58:45 heisenberg kernel: [39441.107663] usb 4-2.4: Set SEL for 
device-initiated U1 failed.
Dec 23 04:58:45 heisenberg kernel: [39441.55] usb 4-2.4: Set SEL for 
device-initiated U2 failed.
Dec 23 04:58:45 heisenberg kernel: [39441.188752] usb 4-2.4: reset SuperSpeed 
USB device number 4 using xhci_hcd
Dec 23 04:58:45 heisenberg kernel: [39441.225703] scsi host8: uas
Dec 23 04:58:45 heisenberg kernel: [39441.227043] scsi 8:0:0:0: Direct-Access   
  Seagate  Expansion0636 PQ: 0 ANSI: 6
Dec 23 04:58:45 heisenberg kernel: [39441.429443] sd 8:0:0:0: Attached scsi 
generic sg2 type 0
Dec 23 04:58:45 heisenberg kernel: [39441.429572] sd 8:0:0:0: [sdd] 3907029167 
512-byte logical blocks: (2.00 TB/1.82 TiB)
Dec 23 04:58:45 heisenberg kernel: [39441.430756] sd 8:0:0:0: [sdd] Write 
Protect is off
Dec 23 04:58:45 heisenberg kernel: [39441.430764] sd 8:0:0:0: [sdd] Mode Sense: 
2b 00 10 08
Dec 23 04:58:45 heisenberg kernel: [39441.431593] sd 8:0:0:0: [sdd] Write 
cache: enabled, read 

Re: [PATCH] Btrfs: adjust outstanding_extents counter properly when dio write is split

2016-12-22 Thread Anand Jain


On 12/23/16 09:13, Liu Bo wrote:

Currently btrfs dio does not deal well with a split dio write: if a dio
write is split into several segments due to the lack of contiguous space,
a large dio write like 'dd bs=1G count=1' can end up with an incorrect
outstanding_extents counter, and endio will complain loudly with an
assertion.

This fixes the problem by compensating the outstanding_extents
counter in the inode if a large dio write gets split.


 The fix works. Thanks, Liu Bo, for working on this.

 Tested-by: Anand Jain 


Reported-by: Anand Jain 
Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a4c8796..4175987 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7641,11 +7641,18 @@ static void adjust_dio_outstanding_extents(struct inode 
*inode,
 * within our reservation, otherwise we need to adjust our inode
 * counter appropriately.
 */
-   if (dio_data->outstanding_extents) {
+   if (dio_data->outstanding_extents >= num_extents) {
dio_data->outstanding_extents -= num_extents;
} else {
+   /*
+* If dio write length has been split due to no large enough
+* contiguous space, we need to compensate our inode counter
+* appropriately.
+*/
+   u64 num_needed = num_extents - dio_data->outstanding_extents;
+
spin_lock(&BTRFS_I(inode)->lock);
-   BTRFS_I(inode)->outstanding_extents += num_extents;
+   BTRFS_I(inode)->outstanding_extents += num_needed;
spin_unlock(&BTRFS_I(inode)->lock);
}
 }
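
As context for the 'dd bs=1G count=1' case above: a stand-alone user-space
sketch (not kernel code; BTRFS_MAX_EXTENT_SIZE is assumed to be 128 MiB here)
of how many outstanding extents such a write maps to up front. When the write
is later split into smaller segments, the per-segment counts no longer line
up with this number, which is what the patch compensates for.

#include <stdio.h>
#include <stdint.h>

#define BTRFS_MAX_EXTENT_SIZE   (128ULL * 1024 * 1024)  /* assumed 128 MiB */

static uint64_t count_max_extents(uint64_t len)
{
        /* every started 128 MiB chunk needs its own outstanding extent */
        return (len + BTRFS_MAX_EXTENT_SIZE - 1) / BTRFS_MAX_EXTENT_SIZE;
}

int main(void)
{
        uint64_t len = 1024ULL * 1024 * 1024;   /* dd bs=1G count=1 */

        printf("outstanding extents reserved for %llu bytes: %llu\n",
               (unsigned long long)len,
               (unsigned long long)count_max_extents(len));
        return 0;
}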




Re: [PATCH] btrfs-progs: Get the highest inode for lost+found

2016-12-22 Thread Qu Wenruo



At 12/23/2016 08:47 AM, Goldwyn Rodrigues wrote:



On 12/20/2016 06:57 PM, Qu Wenruo wrote:



At 12/20/2016 08:08 PM, Goldwyn Rodrigues wrote:

From: Goldwyn Rodrigues 

root->highest_inode is not accurate at the time of creating a lost+found
and it fails because the highest_inode+1 is already present. This
could be
because of fixes after highest_inode is set. Instead, search
for the highest inode in the tree and use it for lost+found.

This makes root->highest_inode unnecessary and hence deleted.


This is much better than recording it in root.



Signed-off-by: Goldwyn Rodrigues 
---
 cmds-check.c | 46 +++---
 ctree.h  |  1 -
 disk-io.c|  1 -
 3 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 1dba298..a55d00d 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2853,6 +2853,31 @@ out:
 return ret;
 }

+static int get_highest_inode(struct btrfs_trans_handle *trans,
+struct btrfs_root *root,
+struct btrfs_path *path,
+u64 *highest_ino)
+{
+struct btrfs_key key, found_key;
+int ret;
+
+btrfs_init_path(path);
+key.objectid = BTRFS_LAST_FREE_OBJECTID;
+key.offset = -1;
+key.type = BTRFS_INODE_ITEM_KEY;
+ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+if (ret == 1) {
+btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+path->slots[0] - 1);
+*highest_ino = found_key.objectid;
+ret = 0;
+}


I think such search may cause problem.

If the fs uses inode_map mount option, each fs tree will have a tailing
FREE_INO and FREE_SPACE items.

And FREE_INO/FREE_SPACE are all over LAST_FREE_OBJECTID.

item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
inode generation 3 transid 7 size 0 nbytes 16384
block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
sequence 0 flags 0x1(none)
item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
inode ref index 0 namelen 2 name: ..
item 2 key (FREE_INO INODE_ITEM 0) itemoff 15951 itemsize 160
inode generation 0 transid 7 size 0 nbytes 0
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 24 flags 0x0(NOCOMPRESS|PREALLOC)
item 3 key (FREE_SPACE UNTYPED 0) itemoff 15910 itemsize 41
location key (FREE_INO INODE_ITEM 0)
cache generation 0 entries 0 bitmaps 0


In that case, such search will point to the FREE_INO slot, and always
return -EOVERFLOW.

What about checking the objectid, and if it's larger than
LAST_FREE_OBJECTID, trying to search the previous slot?


We are starting from LAST_FREE_OBJECTID, which is -256ULL and smaller
than FREE_INO (-12ULL): -256ULL < -12ULL.
Won't a search for (slot - 1) result in something smaller than


Oh, my fault, I didn't notice the "path->slots[0] - 1" used in 
btrfs_item_key_to_cpu().


I always assumed we should use btrfs_previous_item() to get the previous 
item, not use slot - 1 directly.


BTW, btrfs_previous_item() seems safer, since it can handle cases like 
slots[0] == 0, even though that won't happen here since LAST_FREE_OBJECTID 
will not be used by any inode.
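
For illustration, a rough and untested sketch of that variant (the
btrfs_search_slot()/btrfs_previous_item() signatures are assumed to match
btrfs-progs' ctree.h; this is not a tested replacement for the patch):

static int get_highest_inode(struct btrfs_trans_handle *trans,
                             struct btrfs_root *root,
                             struct btrfs_path *path,
                             u64 *highest_ino)
{
        struct btrfs_key key, found_key;
        int ret;

        btrfs_init_path(path);
        key.objectid = BTRFS_LAST_FREE_OBJECTID;
        key.offset = -1;
        key.type = BTRFS_INODE_ITEM_KEY;
        ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
        if (ret < 0)
                goto out;
        /* walk back to the previous INODE_ITEM; this also covers slots[0] == 0 */
        ret = btrfs_previous_item(root, path, 0, BTRFS_INODE_ITEM_KEY);
        if (ret > 0) {
                ret = -ENOENT;
        } else if (ret == 0) {
                btrfs_item_key_to_cpu(path->nodes[0], &found_key,
                                      path->slots[0]);
                *highest_ino = found_key.objectid;
                if (*highest_ino >= BTRFS_LAST_FREE_OBJECTID)
                        ret = -EOVERFLOW;
        }
out:
        btrfs_release_path(path);
        return ret;
}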


Feel free to add my reviewed tag:

Reviewed-by: Qu Wenruo 

Thanks,
Qu


-256ULL? IOW, if it results in -12ULL then it is not a valid inode
anyway and hence should return -EOVERFLOW anyway.




The other parts look good to me.

Thanks,
Qu

+if (*highest_ino >= BTRFS_LAST_FREE_OBJECTID)
+ret = -EOVERFLOW;
+btrfs_release_path(path);
+return ret;
+}
+
 static int repair_inode_nlinks(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct btrfs_path *path,
@@ -2898,11 +2923,9 @@ static int repair_inode_nlinks(struct
btrfs_trans_handle *trans,
 }

 if (rec->found_link == 0) {
-lost_found_ino = root->highest_inode;
-if (lost_found_ino >= BTRFS_LAST_FREE_OBJECTID) {
-ret = -EOVERFLOW;
+ret = get_highest_inode(trans, root, path, &lost_found_ino);
+if (ret < 0)
 goto out;
-}
 lost_found_ino++;
 ret = btrfs_mkdir(trans, root, dir_name, strlen(dir_name),
  BTRFS_FIRST_FREE_OBJECTID, &lost_found_ino,
@@ -3266,21 +3289,6 @@ static int check_inode_recs(struct btrfs_root
*root,
 }

 /*
- * We need to record the highest inode number for later 'lost+found'
- * dir creation.
- * We must select an ino not used/referred by any existing inode, or
- * 'lost+found' ino may be a missing ino in a corrupted leaf,
- * this may cause 'lost+found' dir has wrong nlinks.
- */
-cache = last_cache_extent(inode_cache);
-if (cache) {
-node = container_of(cache, struct ptr_node, cache);
-rec = node->data;
-if (rec->ino > root->highest_inode)
-

[PATCH] Btrfs: adjust outstanding_extents counter properly when dio write is split

2016-12-22 Thread Liu Bo
Currently btrfs dio does not deal well with a split dio write: if a dio
write is split into several segments due to the lack of contiguous space,
a large dio write like 'dd bs=1G count=1' can end up with an incorrect
outstanding_extents counter, and endio will complain loudly with an
assertion.

This fixes the problem by compensating the outstanding_extents
counter in the inode if a large dio write gets split.

Reported-by: Anand Jain 
Signed-off-by: Liu Bo 
---
 fs/btrfs/inode.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a4c8796..4175987 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7641,11 +7641,18 @@ static void adjust_dio_outstanding_extents(struct inode 
*inode,
 * within our reservation, otherwise we need to adjust our inode
 * counter appropriately.
 */
-   if (dio_data->outstanding_extents) {
+   if (dio_data->outstanding_extents >= num_extents) {
dio_data->outstanding_extents -= num_extents;
} else {
+   /*
+* If dio write length has been split due to no large enough
+* contiguous space, we need to compensate our inode counter
+* appropriately.
+*/
+   u64 num_needed = num_extents - dio_data->outstanding_extents;
+
spin_lock(&BTRFS_I(inode)->lock);
-   BTRFS_I(inode)->outstanding_extents += num_extents;
+   BTRFS_I(inode)->outstanding_extents += num_needed;
spin_unlock(&BTRFS_I(inode)->lock);
}
 }
-- 
2.5.5



Re: [PATCH] btrfs-progs: Get the highest inode for lost+found

2016-12-22 Thread Goldwyn Rodrigues


On 12/20/2016 06:57 PM, Qu Wenruo wrote:
> 
> 
> At 12/20/2016 08:08 PM, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues 
>>
>> root->highest_inode is not accurate at the time of creating a lost+found
>> and it fails because the highest_inode+1 is already present. This
>> could be
>> because of fixes after highest_inode is set. Instead, search
>> for the highest inode in the tree and use it for lost+found.
>>
>> This makes root->highest_inode unnecessary and hence deleted.
> 
> This is much better than recording it in root.
> 
>>
>> Signed-off-by: Goldwyn Rodrigues 
>> ---
>>  cmds-check.c | 46 +++---
>>  ctree.h  |  1 -
>>  disk-io.c|  1 -
>>  3 files changed, 27 insertions(+), 21 deletions(-)
>>
>> diff --git a/cmds-check.c b/cmds-check.c
>> index 1dba298..a55d00d 100644
>> --- a/cmds-check.c
>> +++ b/cmds-check.c
>> @@ -2853,6 +2853,31 @@ out:
>>  return ret;
>>  }
>>
>> +static int get_highest_inode(struct btrfs_trans_handle *trans,
>> +struct btrfs_root *root,
>> +struct btrfs_path *path,
>> +u64 *highest_ino)
>> +{
>> +struct btrfs_key key, found_key;
>> +int ret;
>> +
>> +btrfs_init_path(path);
>> +key.objectid = BTRFS_LAST_FREE_OBJECTID;
>> +key.offset = -1;
>> +key.type = BTRFS_INODE_ITEM_KEY;
>> +ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>> +if (ret == 1) {
>> +btrfs_item_key_to_cpu(path->nodes[0], &found_key,
>> +path->slots[0] - 1);
>> +*highest_ino = found_key.objectid;
>> +ret = 0;
>> +}
> 
> I think such search may cause problem.
> 
> If the fs uses inode_map mount option, each fs tree will have a tailing
> FREE_INO and FREE_SPACE items.
> 
> And FREE_INO/FREE_SPACE are all over LAST_FREE_OBJECTID.
> 
> item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
> inode generation 3 transid 7 size 0 nbytes 16384
> block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0
> sequence 0 flags 0x1(none)
> item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
> inode ref index 0 namelen 2 name: ..
> item 2 key (FREE_INO INODE_ITEM 0) itemoff 15951 itemsize 160
> inode generation 0 transid 7 size 0 nbytes 0
> block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
> sequence 24 flags 0x0(NOCOMPRESS|PREALLOC)
> item 3 key (FREE_SPACE UNTYPED 0) itemoff 15910 itemsize 41
> location key (FREE_INO INODE_ITEM 0)
> cache generation 0 entries 0 bitmaps 0
> 
> 
> In that case, such search will point to the FREE_INO slot, and always
> return -EOVERFLOW.
> 
> What about checking the objectid, and if it's larger than
> LAST_FREE_OBJECTID, trying to search the previous slot?

We are starting from LAST_FREE_OBJECTID, which is -256ULL and smaller
than FREE_INO (-12ULL): -256ULL < -12ULL.
Won't a search for (slot - 1) result in something smaller than
-256ULL? IOW, if it results in -12ULL then it is not a valid inode
anyway and hence should return -EOVERFLOW anyway.
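
A throwaway user-space check of that ordering, with the two constants
written out as quoted above rather than taken from ctree.h:

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
        uint64_t last_free = (uint64_t)-256;    /* BTRFS_LAST_FREE_OBJECTID */
        uint64_t free_ino  = (uint64_t)-12;     /* FREE_INO objectid */

        printf("LAST_FREE_OBJECTID = 0x%" PRIx64 "\n", last_free);
        printf("FREE_INO           = 0x%" PRIx64 "\n", free_ino);
        printf("LAST_FREE_OBJECTID < FREE_INO: %s\n",
               last_free < free_ino ? "yes" : "no");
        return 0;
}

So, as unsigned objectids, FREE_INO sorts after LAST_FREE_OBJECTID, and the
slot just before the search position can only hold a lower, valid objectid.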


> 
> The other parts look good to me.
> 
> Thanks,
> Qu
>> +if (*highest_ino >= BTRFS_LAST_FREE_OBJECTID)
>> +ret = -EOVERFLOW;
>> +btrfs_release_path(path);
>> +return ret;
>> +}
>> +
>>  static int repair_inode_nlinks(struct btrfs_trans_handle *trans,
>> struct btrfs_root *root,
>> struct btrfs_path *path,
>> @@ -2898,11 +2923,9 @@ static int repair_inode_nlinks(struct
>> btrfs_trans_handle *trans,
>>  }
>>
>>  if (rec->found_link == 0) {
>> -lost_found_ino = root->highest_inode;
>> -if (lost_found_ino >= BTRFS_LAST_FREE_OBJECTID) {
>> -ret = -EOVERFLOW;
>> +ret = get_highest_inode(trans, root, path, &lost_found_ino);
>> +if (ret < 0)
>>  goto out;
>> -}
>>  lost_found_ino++;
>>  ret = btrfs_mkdir(trans, root, dir_name, strlen(dir_name),
>>  BTRFS_FIRST_FREE_OBJECTID, &lost_found_ino,
>> @@ -3266,21 +3289,6 @@ static int check_inode_recs(struct btrfs_root
>> *root,
>>  }
>>
>>  /*
>> - * We need to record the highest inode number for later 'lost+found'
>> - * dir creation.
>> - * We must select an ino not used/referred by any existing inode, or
>> - * 'lost+found' ino may be a missing ino in a corrupted leaf,
>> - * this may cause 'lost+found' dir has wrong nlinks.
>> - */
>> -cache = last_cache_extent(inode_cache);
>> -if (cache) {
>> -node = container_of(cache, struct ptr_node, cache);
>> -rec = node->data;
>> -if (rec->ino > root->highest_inode)
>> -root->highest_inode = rec->ino;
>> -}
>> -
>> -/*
>>   * We need to repair backrefs first because we could change some
>> of the
>>   * errors in the inode recs.
>>   *
>> 

Re: btrfs_log2phys: cannot lookup extent mapping

2016-12-22 Thread Xin Zhou
Hi,
If the change of disk format between versions is precisely documented,
it is plausible to create a utility to convert the old volume to new ones,
trigger the workflow, upgrade the kernel and boots up for mounting the new 
volume.
Currently, the btrfs wiki shows partial content of the on-disk format.
Thanks,
Xin
 
 

Sent: Wednesday, December 21, 2016 at 6:50 AM
From: "David Hanke" 
To: linux-btrfs@vger.kernel.org
Subject: Re: btrfs_log2phys: cannot lookup extent mapping
Hi Duncan,

Thank you for your reply. If I've emailed the wrong list, please let me
know. What I hear you saying, in short, is that btrfs is not yet fully
stable but current 4.x versions may work better. I'm willing to upgrade,
but I'm told that the upgrade process may result in total failure, and
I'm not sure I can trust the contents of the volume either way. Given
that, it seems I must backup the backup, erase and start over. What
would you do?

Thank you,

David


On 12/20/16 17:24, Duncan wrote:
> David Hanke posted on Tue, 20 Dec 2016 09:52:25 -0600 as excerpted:
>
>> I've been using a btrfs-based volume for backups, but lately the
>> system's been filling the syslog with errors like "btrfs_log2phys:
>> cannot lookup extent mapping for 7129125486592" at the rate of hundreds
>> per second. (Please see output below for more details.) Despite the
>> errors, the files I've looked at appear to be written and read
>> successfully.
>>
>> I'm wondering if the contents of the volume are trustworthy and whether
>> this problem is resolvable without backing up, erasing and starting
>> over?
>>
>> Thank you!
>>
>> David
>>
>>
>> # uname -a
>> Linux backup2 3.0.101.RNx86_64.3 #1 SMP Wed Apr 1 16:02:14 PDT 2015
>> x86_64 GNU/Linux
>>
>> # btrfs --version
>> Btrfs v3.17.3
> FWIW...
>
> [TL;DR: see the four bottom line choices, at the bottom.]
>
> This is the upstream btrfs development and discussion list for a
> filesystem that's still stabilizing (that is, not fully stable and
> mature) and that remains under heavy development and bug fixing. As
> such, list focus is heavily forward looking, with an extremely strong
> recommendation to use current kernels (and to a lesser extent btrfs
> userspace) if you're going to be running btrfs, as these have all the
> latest bugfixes.
>
> Put a different way, the general view and strong recommendation of the
> list is that because btrfs is still under heavy development, with bug
> fixes, some more major than others, every kernel cycle, while we
> recognize that choosing to run old and stale^H^Hble kernels and userspace
> is a legitimate choice on its own, that choice of stability over support
> for the latest and greatest, is viewed as incompatible with choosing to
> run a still under heavy development filesystem. Choosing one OR the
> other is strongly recommended.
>
> For list purposes, we recommend and best support the last two kernel
> release series in two tracks, LTS/long-term-stable, or current release
> track. On the LTS track, that's the LTS 4.4 and 4.1 series. On the
> current track, 4.9 is the latest release, so 4.9 and 4.8 are best
> supported.
>
> Meanwhile, it's worth keeping in mind that the experimental label and
> accompanying extremely strong "eat your babies" level warnings weren't
> peeled off until IIRC 3.12 or so, meaning anything before that is not
> only ancient history in list terms, but also still labeled as "eat your
> babies" level experimental. Why anyone choosing to run an ancient eat-
> your-babies level experimental version of a filesystem that's now rather
> more stable and mature, tho not yet fully stabilized, is beyond me. If
> they're interested in newer filesystems, running newer and less buggy
> versions is reasonable; if they're interested in years-stale level of
> stability, then running such filesystems, especially when still labeled
> eat-your-babies level experimental back then, seems an extremely odd
> choice indeed.
>
> Of course, on-list we do recognize that various distros did and do offer
> support at some level for older than list-recommended version btrfs, in
> part because they backport fixes from newer versions. However, because
> we're forward development focused we don't track what patches these
> distros may or may not have backported and thus aren't in a good position
> to provide good support for them. Instead, users choosing to use such
> kernels are generally asked to choose between upgrading to something
> reasonably supportable on-list if they wish to go that route, or referred
> back to their distros for the support they're in a far better position to
> offer, since they know what they've backported and what they haven't,
> while we don't.
>
> As for btrfs userspace, the way btrfs works, during normal runtime,
> userspace primarily calls the kernel to do the real work, so userspace
> version isn't as big a deal unless you're trying to use a feature only
> supported by newer versions, except that if it's /too/ old, the 

Re: OOM: Better, but still there on

2016-12-22 Thread Nils Holland
On Thu, Dec 22, 2016 at 08:17:19PM +0100, Michal Hocko wrote:
> TL;DR I still do not see what is going on here and it still smells like
> multiple issues. Please apply the patch below on _top_ of what you had.

I've run the usual procedure again with the new patch on top and the
log is now up at:

http://ftp.tisys.org/pub/misc/boerne_2016-12-22_2.log.xz

As a little side note: It is likely, but I cannot completely say for
sure yet, that this issue is rather easy to reproduce. When I had some
time today at work, I set up a fresh Debian Sid installation in a VM
(32 bit PAE kernel, 4 GB RAM, btrfs as root fs). I used some late 4.9rc(8?)
kernel supplied by Debian - they don't seem to have 4.9 final yet and I
didn't come around to build and use a custom 4.9 final kernel, probably
even with your patches. But the 4.9rc kernel there seemed to behave very much
the same as the 4.9 kernel on my real 32 bit machines does: All I had
to do was unpack a few big tarballs - firefox, libreoffice and the
kernel are my favorites - and the machine would start OOMing.

This might suggest - although I have to admit, again, that this is
inconclusive, as I've not used a final 4.9 kernel - that you could
very easily reproduce the issue yourself by just setting up a 32 bit
system with a btrfs filesystem and then unpacking a few huge tarballs.
Of course, I'm more than happy to continue giving any patches sent to
me a spin, but I thought I'd still mention this in case it makes
things easier for you. :-)

Greetings
Nils


[josef-btrfs:inet-rework 1/6] net/ipv4/inet_connection_sock.c:45:38: note: in expansion of macro 'sk_v6_rcv_saddr'

2016-12-22 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
inet-rework
head:   749825bc60f7224225ced1dbed77d3cc2b0bd72f
commit: a36b30653769d1e20ff0df41533a2766453ced1a [1/6] inet: collapse ipv4/v6 
rcv_saddr_equal functions into one
config: cris-etrax-100lx_v2_defconfig (attached as .config)
compiler: cris-linux-gcc (GCC) 6.2.0
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout a36b30653769d1e20ff0df41533a2766453ced1a
# save the attached .config to linux build tree
make.cross ARCH=cris 

All error/warnings (new ones prefixed by >>):

   In file included from include/net/inet_sock.h:27:0,
from include/net/inet_connection_sock.h:23,
from net/ipv4/inet_connection_sock.c:19:
   net/ipv4/inet_connection_sock.c: In function 'ipv6_rcv_saddr_equal':
>> include/net/sock.h:339:36: error: 'const struct sock_common' has no member 
>> named 'skc_v6_rcv_saddr'; did you mean 'skc_rcv_saddr'?
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
   ^
>> net/ipv4/inet_connection_sock.c:45:38: note: in expansion of macro 
>> 'sk_v6_rcv_saddr'
 int addr_type = ipv6_addr_type(&sk->sk_v6_rcv_saddr);
 ^~~
>> include/net/sock.h:339:36: error: 'const struct sock_common' has no member 
>> named 'skc_v6_rcv_saddr'; did you mean 'skc_rcv_saddr'?
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
   ^
   net/ipv4/inet_connection_sock.c:71:27: note: in expansion of macro 
'sk_v6_rcv_saddr'
 ipv6_addr_equal(&sk->sk_v6_rcv_saddr, sk2_rcv_saddr6))
  ^~~

vim +/sk_v6_rcv_saddr +45 net/ipv4/inet_connection_sock.c

13   *  2 of the License, or(at your option) any later version.
14   */
15  
16  #include 
17  #include 
18  
  > 19  #include 
20  #include 
21  #include 
22  #include 
23  #include 
24  #include 
25  #include 
26  #include 
27  #include 
28  
29  #ifdef INET_CSK_DEBUG
30  const char inet_csk_timer_bug_msg[] = "inet_csk BUG: unknown timer 
value\n";
31  EXPORT_SYMBOL(inet_csk_timer_bug_msg);
32  #endif
33  
34  /* match_wildcard == true:  IPV6_ADDR_ANY equals to any IPv6 addresses 
if IPv6
35   *  only, and any IPv4 addresses if not IPv6 
only
36   * match_wildcard == false: addresses must be exactly the same, i.e.
37   *  IPV6_ADDR_ANY only equals to IPV6_ADDR_ANY,
38   *  and 0.0.0.0 equals to 0.0.0.0 only
39   */
40  static int ipv6_rcv_saddr_equal(const struct sock *sk, const struct 
sock *sk2,
41  bool match_wildcard)
42  {
43  const struct in6_addr *sk2_rcv_saddr6 = inet6_rcv_saddr(sk2);
44  int sk2_ipv6only = inet_v6_ipv6only(sk2);
  > 45  int addr_type = ipv6_addr_type(&sk->sk_v6_rcv_saddr);
46  int addr_type2 = sk2_rcv_saddr6 ? 
ipv6_addr_type(sk2_rcv_saddr6) : IPV6_ADDR_MAPPED;
47  
48  /* if both are mapped, treat as IPv4 */

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: OOM: Better, but still there on

2016-12-22 Thread Michal Hocko
TL;DR I still do not see what is going on here and it still smells like
multiple issues. Please apply the patch below on _top_ of what you had.

On Thu 22-12-16 11:10:29, Nils Holland wrote:
[...]
> http://ftp.tisys.org/pub/misc/boerne_2016-12-22.log.xz

It took me a while to realize that tracepoint and printk messages are
not sorted by the timestamp. Some massaging has fixed that
$ xzcat boerne_2016-12-22.log.xz | sed -e 's@.*192.168.17.32:6665 
\[[[:space:]]*\([0-9\.]\+\)\] @\1 @' -e 
's@.*192.168.17.32:53062[[:space:]]*\([^[:space:]]\+\)[[:space:]].*[[:space:]]\([0-9\.]\+\):@\2
 \1@' | sort -k1 -n -s

461.757468 kswapd0-32 mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 
nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=1
461.757501 kswapd0-32 mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 
nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 
nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 p
riority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757504 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 
total_inactive=11852 inactive=0 total_active=118195 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757508 kswapd0-32 mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 
nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=1
461.757535 kswapd0-32 mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 
nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 
nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 p
riority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757537 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 
total_inactive=11820 inactive=0 total_active=118195 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757543 kswapd0-32 mm_vmscan_lru_isolate: isolate_mode=0 classzone=1 order=0 
nr_requested=32 nr_scanned=32 nr_skipped=0 nr_taken=32 lru=1
461.757584 kswapd0-32 mm_vmscan_lru_shrink_inactive: nid=0 nr_scanned=32 
nr_reclaimed=32 nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 
nr_activate=0 nr_ref_keep=0 nr_unmap_fail=0 p
riority=2 flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
461.757588 kswapd0-32 mm_vmscan_inactive_list_is_low: nid=0 
total_inactive=11788 inactive=0 total_active=118195 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
[...]
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=9939 
inactive=0 total_active=120208 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=9939 
inactive=0 total_active=120208 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722379 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=89 
inactive=0 total_active=1301 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722385 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=0 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722386 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=0 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722391 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=0 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722391 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=0 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722396 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=1 
inactive=0 total_active=21 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722396 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=131 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722397 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=1 
inactive=0 total_active=21 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
482.722397 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=0 
inactive=0 total_active=131 active=0 ratio=1 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
482.722401 cat-2974 mm_vmscan_inactive_list_is_low: nid=0 total_inactive=450730 
inactive=0 total_active=206026 active=0 ratio=1 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
484.144971 collect2 invoked oom-killer: 
gfp_mask=0x27080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=0, 
order=0, oom_score_adj=0
[...]
484.146871 Node 0 active_anon:100688kB inactive_anon:380kB 
active_file:1296560kB inactive_file:1848044kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB mapped:32180kB dirty:20896kB 
writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 40960kB anon_thp: 776kB 
writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
484.147097 DMA free:4004kB min:788kB low:984kB high:1180kB active_anon:0kB 
inactive_anon:0kB active_file:8016kB inactive_file:12kB unevictable:0kB 
writepending:68kB present:15992kB managed:15916kB mlocked:0kB 
slab_reclaimable:2652kB slab_unreclaimable:1224kB kernel_stack:8kB 

[josef-btrfs:inet-rework 6/6] net/ipv4/inet_connection_sock.c:256:32: error: 'const struct inet_connection_sock_af_ops' has no member named 'rcv_saddr_equal'

2016-12-22 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
inet-rework
head:   749825bc60f7224225ced1dbed77d3cc2b0bd72f
commit: 749825bc60f7224225ced1dbed77d3cc2b0bd72f [6/6] inet: reset 
tb->fastreuseport when adding a reuseport sk
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
git checkout 749825bc60f7224225ced1dbed77d3cc2b0bd72f
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   net/ipv4/inet_connection_sock.c: In function 'sk_reuseport_match':
>> net/ipv4/inet_connection_sock.c:256:32: error: 'const struct 
>> inet_connection_sock_af_ops' has no member named 'rcv_saddr_equal'
 if (!inet_csk(sk)->icsk_af_ops->rcv_saddr_equal(sk, sk2, true))
   ^~

vim +256 net/ipv4/inet_connection_sock.c

   250   * without fastreuseport and then was reset, as we can only 
know that
   251   * the fastsock has no potential bind conflicts with the rest 
of the
   252   * possible socks on the owners list.
   253   */
   254  if (tb->fastreuseport == FASTREUSEPORT_ANY)
   255  return 1;
 > 256  if (!inet_csk(sk)->icsk_af_ops->rcv_saddr_equal(sk, sk2, true))
   257  return 0;
   258  return 1;
   259  }

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[josef-btrfs:inet-rework 1/6] net/ipv4/inet_hashtables.c:461:5: error: conflicting types for '__inet_hash'

2016-12-22 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
inet-rework
head:   749825bc60f7224225ced1dbed77d3cc2b0bd72f
commit: a36b30653769d1e20ff0df41533a2766453ced1a [1/6] inet: collapse ipv4/v6 
rcv_saddr_equal functions into one
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
git checkout a36b30653769d1e20ff0df41533a2766453ced1a
# save the attached .config to linux build tree
make ARCH=i386 

All error/warnings (new ones prefixed by >>):

>> net/ipv4/inet_hashtables.c:461:5: error: conflicting types for '__inet_hash'
int __inet_hash(struct sock *sk, struct sock *osk)
^~~
   In file included from net/ipv4/inet_hashtables.c:25:0:
   include/net/inet_hashtables.h:206:5: note: previous declaration of 
'__inet_hash' was here
int __inet_hash(struct sock *sk, struct sock *osk,
^~~
   In file included from include/linux/linkage.h:6:0,
from include/linux/kernel.h:6,
from include/linux/list.h:8,
from include/linux/module.h:9,
from net/ipv4/inet_hashtables.c:16:
   net/ipv4/inet_hashtables.c:492:15: error: conflicting types for '__inet_hash'
EXPORT_SYMBOL(__inet_hash);
  ^
   include/linux/export.h:58:21: note: in definition of macro '___EXPORT_SYMBOL'
 extern typeof(sym) sym;  \
^~~
>> net/ipv4/inet_hashtables.c:492:1: note: in expansion of macro 'EXPORT_SYMBOL'
EXPORT_SYMBOL(__inet_hash);
^
   In file included from net/ipv4/inet_hashtables.c:25:0:
   include/net/inet_hashtables.h:206:5: note: previous declaration of 
'__inet_hash' was here
int __inet_hash(struct sock *sk, struct sock *osk,
^~~
--
>> net/ipv4/udp.c:229:5: error: conflicting types for 'udp_lib_get_port'
int udp_lib_get_port(struct sock *sk, unsigned short snum,
^~~~
   In file included from net/ipv4/udp_impl.h:3:0,
from net/ipv4/udp.c:115:
   include/net/udp.h:206:5: note: previous declaration of 'udp_lib_get_port' 
was here
int udp_lib_get_port(struct sock *sk, unsigned short snum,
^~~~
   In file included from include/linux/linkage.h:6:0,
from include/linux/kernel.h:6,
from include/asm-generic/bug.h:13,
from arch/x86/include/asm/bug.h:35,
from include/linux/bug.h:4,
from include/linux/thread_info.h:11,
from arch/x86/include/asm/uaccess.h:9,
from net/ipv4/udp.c:82:
   net/ipv4/udp.c:341:15: error: conflicting types for 'udp_lib_get_port'
EXPORT_SYMBOL(udp_lib_get_port);
  ^
   include/linux/export.h:58:21: note: in definition of macro '___EXPORT_SYMBOL'
 extern typeof(sym) sym;  \
^~~
>> net/ipv4/udp.c:341:1: note: in expansion of macro 'EXPORT_SYMBOL'
EXPORT_SYMBOL(udp_lib_get_port);
^
   In file included from net/ipv4/udp_impl.h:3:0,
from net/ipv4/udp.c:115:
   include/net/udp.h:206:5: note: previous declaration of 'udp_lib_get_port' 
was here
int udp_lib_get_port(struct sock *sk, unsigned short snum,
^~~~
--
   net/ipv6/inet6_hashtables.c: In function 'inet6_hash':
>> net/ipv6/inet6_hashtables.c:271:9: error: too few arguments to function 
>> '__inet_hash'
  err = __inet_hash(sk, NULL);
^~~
   In file included from net/ipv6/inet6_hashtables.c:22:0:
   include/net/inet_hashtables.h:206:5: note: declared here
int __inet_hash(struct sock *sk, struct sock *osk,
^~~
--
   net/ipv6/udp.c: In function 'udp_v6_get_port':
>> net/ipv6/udp.c:106:36: warning: passing argument 3 of 'udp_lib_get_port' 
>> makes pointer from integer without a cast [-Wint-conversion]
 return udp_lib_get_port(sk, snum, hash2_nulladdr);
   ^~
   In file included from net/ipv6/udp_impl.h:3:0,
from net/ipv6/udp.c:56:
   include/net/udp.h:206:5: note: expected 'int (*)(const struct sock *, const 
struct sock *, bool) {aka int (*)(const struct sock *, const struct sock *, 
_Bool)}' but argument is of type 'unsigned int'
int udp_lib_get_port(struct sock *sk, unsigned short snum,
^~~~
>> net/ipv6/udp.c:106:9: error: too few arguments to function 'udp_lib_get_port'
 return udp_lib_get_port(sk, snum, hash2_nulladdr);
^~~~
   In file included from net/ipv6/udp_impl.h:3:0,
from net/ipv6/udp.c:56:
   include/net/udp.h:206:5: note: declared here
int udp_lib_get_port(struct sock *sk, unsigned short snum,
^~~~
>> net/ipv6/udp.c:107:1: warning: control reaches end of non-void function 
>> [-Wreturn-type]
}
^

vim +/__inet_hash +461 

Re: btrfs_log2phys: cannot lookup extent mapping

2016-12-22 Thread Austin S. Hemmelgarn

On 2016-12-22 10:14, Adam Borowski wrote:

On Thu, Dec 22, 2016 at 10:11:35AM +, Duncan wrote:

Given the maturing-but-not-yet-fully-stable-and-mature state of btrfs
today, being no further from a usable current backup than the data you're
willing to lose, at least worst-case, remains an even stronger
recommendation than it is on fully mature and stable filesystem, kernel
and hardware.


The usual rant about backups which I snipped is 110%[1] right, however I
disagree that btrfs is worse than other filesystems for data safety.

On one hand, btrfs:
* is buggy
* fails the KISS principle to a ridiculous degree
* lacks logic people take for granted (especially on RAID)
On the other, other filesystems:
* suffer from silent data loss every time the disk doesn't notice an error!
  Allowing silent data loss fails the most basic requirement for a
  filesystem.  Btrfs at least makes that loss noisy (single) so you can
  recover from backups, or handles it (redundant RAID).
No, allowing silent data loss fails the most basic requirement for a 
_storage system_.  A filesystem is generally a key component in a data 
storage system, but people regularly conflate the two as having the same 
meaning, which is absolutely wrong.  Most traditional filesystems are 
designed under the assumption that if someone cares about at-rest data 
integrity, they will purchase hardware to ensure at-rest data integrity. 
 This is a perfectly reasonable stance, especially considering that 
ensuring at-rest data integrity is _hard_ (BTRFS is better at it than 
most filesystems, but it still can't do it to the degree that most of 
the people who actually require it need).  A filesystem's job is 
traditionally to organize things, not verify them or provide redundancy.

* don't have frequent snapshots to save you from human error (including
  other software)
* make backups time-costly.  rsync needs to at least stat everything, on a
  populated disk that's often half an hour or more, on btrfs a no-op backup
  takes O(1).

These two points I agree on, despite me not using snapshots or send/receive.


So sorry, but I had enough woe with those "fully mature and stable"
filesystems.  Thus I use btrfs pretty much everywhere, backing up my crap
every 24 hours, important bits every 3 hours.
I use BTRFS pretty much everywhere too.  I've also had more catastrophic 
failures from BTRFS than any other filesystem I've used except FAT (NTFS 
is a close third).  I've also recovered sanely without needing a new 
filesystem and a full data restoration on ext4, FAT, and even XFS more 
than I have on BTRFS (ext4 and FAT are well enough documented that I can 
put a broken filesystem back together by hand if needed (and have done 
so on multiple occasions)).


That said, the two of us and most of the other list regulars have a much 
better understanding of the involved risks than a significant majority 
of 'normal' users, partly because we have done our research regarding 
this, and partly because we're watching the list regularly.  For us, the 
risk is a calculated one, for anyone who's just trying it out for 
laughs, or happened to get it because the distro they picked happened to 
use it by default though, it's a very much unknown risk.


Ignoring the checksumming, COW, and multi-device support in BTRFS, 
pretty much everything else wins in terms of reliability by a pretty 
significant margin (and in terms of performance too, even mounted with 
no checksumming and no COW for everything but metadata, ext4 and XFS 
still beat the tar out of BTRFS in terms of performance).  BTRFS crashes 
more, and fails harder than any other first-class (listed on the main 
'Filesystems' menu, not in 'Misc Filesystems') filesystem in the 
mainline Linux kernel right now.  For it to be reliable, the devices 
need to be monitored, the filesystems need to be curated, and you 
absolutely have to understand the risks.  Given this, for a vast 
majority of users, BTRFS _is_ worse on average for data safety than 
almost any other filesystem in the kernel.



Re: btrfs_log2phys: cannot lookup extent mapping

2016-12-22 Thread Adam Borowski
On Thu, Dec 22, 2016 at 10:11:35AM +, Duncan wrote:
> Given the maturing-but-not-yet-fully-stable-and-mature state of btrfs 
> today, being no further from a usable current backup than the data you're 
> willing to lose, at least worst-case, remains an even stronger 
> recommendation than it is on fully mature and stable filesystem, kernel 
> and hardware.

The usual rant about backups which I snipped is 110%[1] right, however I
disagree that btrfs is worse than other filesystems for data safety.

On one hand, btrfs:
* is buggy
* fails the KISS principle to a ridiculous degree
* lacks logic people take for granted (especially on RAID)
On the other, other filesystems:
* suffer from silent data loss every time the disk doesn't notice an error!
  Allowing silent data loss fails the most basic requirement for a
  filesystem.  Btrfs at least makes that loss noisy (single) so you can
  recover from backups, or handles it (redundant RAID).
* don't have frequent snapshots to save you from human error (including
  other software)
* make backups time-costly.  rsync needs to at least stat everything, on a
  populated disk that's often half an hour or more, on btrfs a no-op backup
  takes O(1).

So sorry, but I had enough woe with those "fully mature and stable"
filesystems.  Thus I use btrfs pretty much everywhere, backing up my crap
every 24 hours, important bits every 3 hours.


Meow!

[1]. Above 100% as it's more true than people read it.
-- 
Autotools hint: to do a zx-spectrum build on a pdp11 host, type:
  ./configure --host=zx-spectrum --build=pdp11


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2016-12-22 Thread Jeff Layton
On Thu, 2016-12-22 at 00:45 -0800, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > 
> > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > this, these filesystems must log the inode to disk whenever the
> > i_version counter changes. That has a non-zero performance impact,
> > especially on write-heavy workloads, because we end up dirtying the
> > inode metadata on every write, not just when the times change. [1]
> 
> Do you have numbers to justify these changes?

I have numbers. As to whether they justify the changes, I'm not sure.
This helps a lot on an (admittedly nonsensical) 1-byte write workload. On
XFS, with this fio jobfile:

8<--
[global]
direct=0
size=2g
filesize=512m
bsrange=1-1
timeout=60
numjobs=1
directory=/mnt/scratch

[f1]
filename=randwrite
rw=randwrite
8<--

Unpatched kernel:
  WRITE: io=7707KB, aggrb=128KB/s, minb=128KB/s, maxb=128KB/s, mint=6msec, 
maxt=6msec

Patched kernel:
  WRITE: io=12701KB, aggrb=211KB/s, minb=211KB/s, maxb=211KB/s, mint=6msec, 
maxt=6msec

So quite a difference there and it's pretty consistent across runs. If I
change the jobfile to have "direct=1" and "bsrange=4k-4k", then any
variation between the two doesn't seem to be significant (numbers vary
as much between runs on the same kernels and are roughly the same).

Playing with buffered I/O sizes between 1 byte and 4k shows that as the
I/O sizes get larger, this makes less difference (which is what I'd
expect).

Previous testing with ext4 shows roughly the same results. btrfs shows
some benefit here but significantly less than with ext4 or xfs. Not sure
why that is yet -- maybe CoW effects?

That said, I don't have a great test rig for this. I'm using VMs with a
dedicated LVM volume that's on a random SSD I had laying around. It
could use testing on a wider set of configurations and workloads.

I was also hoping that others may have workloads that they think might
be (postively or negatively) affected by these changes. If you can think
of any in particular, then I'm interested to hear about them.

-- 
Jeff Layton 


Re: [bug report] btrfs: root->fs_info cleanup, add fs_info convenience variables

2016-12-22 Thread Jeff Mahoney
On 12/22/16 7:53 AM, Dan Carpenter wrote:
> Hello Jeff Mahoney,
> 
> This is a semi-automatic email about new static checker warnings.

Hi Dan -

Thanks for the report.  We've already seen this one and the right fix is
to remove the checks in btrfs_get_name since exportfs won't pass
negative dentries.  The original reporter has submitted a patch for that.
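
(Not that submitted patch, just a rough sketch of the kind of change being
described, against the function context quoted below; with exportfs
guaranteeing positive dentries, the check can simply go away, which also
removes the dereference-before-check Smatch is flagging:)

 	struct inode *dir = d_inode(parent);
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	...
-	if (!dir || !inode)
-		return -EINVAL;
-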

-Jeff

> The patch 0b246afa62b0: "btrfs: root->fs_info cleanup, add fs_info 
> convenience variables" from Jun 22, 2016, leads to the following 
> Smatch complaint:
> 
> fs/btrfs/export.c:238 btrfs_get_name()
>warn: variable dereferenced before check 'inode' (see line 226)
> 
> fs/btrfs/export.c
>225struct inode *dir = d_inode(parent);
>226struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  ^^^
> Patch adds dereference.
> 
>227struct btrfs_path *path;
>228struct btrfs_root *root = BTRFS_I(dir)->root;
>229struct btrfs_inode_ref *iref;
>230struct btrfs_root_ref *rref;
>231struct extent_buffer *leaf;
>232unsigned long name_ptr;
>233struct btrfs_key key;
>234int name_len;
>235int ret;
>236u64 ino;
>237
>238if (!dir || !inode)
>  ^
> Old code checked for NULL.
> 
>239return -EINVAL;
>240
> 
> regards,
> dan carpenter
> 


-- 
Jeff Mahoney
SUSE Labs





Re: [RFC PATCH v1 30/30] fs: convert i_version counter over to an atomic64_t

2016-12-22 Thread Jeff Layton
On Thu, 2016-12-22 at 10:38 +0200, Amir Goldstein wrote:
> On Wed, Dec 21, 2016 at 7:03 PM, Jeff Layton  wrote:
> > 
> > The spinlock is only used to serialize callers that want to increment
> > the counter. We can achieve the same thing with an atomic64_t and
> > get the i_lock out of this codepath.
> > 
> 
> Cool work! See some nits and suggestions below.
> 
> > 
> > +/*
> > + * We borrow the top bit in the i_version to use as a flag to tell us 
> > whether
> > + * it has been queried since we last bumped it. If it has, then we must 
> > bump
> > + * it and set the flag. Note that this means that we have to handle 
> > wrapping
> > + * manually.
> > + */
> > +#define INODE_I_VERSION_QUERIED (1ULL<<63)
> > +
> >  /**
> >   * inode_set_iversion - set i_version to a particular value
> >   * @inode: inode to set
> > @@ -1976,7 +1980,7 @@ static inline void inode_dec_link_count(struct inode 
> > *inode)
> >  static inline void
> >  inode_set_iversion(struct inode *inode, const u64 new)
> >  {
> > -   inode->i_version = new;
> > +   atomic64_set(&inode->i_version, new);
> >  }
> > 
> 
> Maybe needs an overflow sanity check !(new & INODE_I_VERSION_QUERIED)??
> See API change suggestion below.
> 
> 

Possibly. Note that in some cases (when the i_version can be stored on
disk across a remount), we need to ensure that we set this flag when the
inode is read in from disk. It's always possible that we'll get a query
for it and then crash, so we always set the flag just in case.
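
So the read-in helper ends up doing something along these lines (a sketch of
the intent only, reusing the names from this RFC, not necessarily the exact
hunk in the series):

static inline void
inode_set_iversion_read(struct inode *inode, const u64 new)
{
        /* mark the value as already queried, so the first bump after a
         * remount or crash is guaranteed to produce a new observable value */
        atomic64_set(&inode->i_version, new | INODE_I_VERSION_QUERIED);
}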

> > 
> >  /**
> > @@ -2010,16 +2011,26 @@ inode_set_iversion_read(struct inode *inode, const 
> > u64 new)
> >  static inline bool
> >  inode_inc_iversion(struct inode *inode, bool force)
> >  {
> > -   bool ret = false;
> > +   u64 cur, old, new;
> > +
> > +   cur = (u64)atomic64_read(&inode->i_version);
> > +   for (;;) {
> > +   /* If flag is clear then we needn't do anything */
> > +   if (!force && !(cur & INODE_I_VERSION_QUERIED))
> > +   return false;
> > +
> > +   new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
> > +
> > +   /* Did we overflow into flag bit? Reset to 0 if so. */
> > +   if (unlikely(new == INODE_I_VERSION_QUERIED))
> > +   new = 0;
> > 
> 
> Did you consider changing f_version type and the signature of the new
> i_version API to set/get s64 instead of u64?
> 
> It makes a bit more sense from API users perspective to know that
> the valid range for version is >=0.
> 
> file->f_version is not the only struct member used to store
> i_version. nfs and xfs have other struct members for that, but even
> if all those members are not changed to type s64, the explicit cast
> to (s64) and back to (u64) will serve as a good documentation in
> the code about the valid range of version in the new API.
> 

This API is definitely not set in stone. That said, we have to consider
that there are really three classes of filesystems here:

1) ones that treat i_version as an opaque value: Mostly AFS and NFS,
as they get this value from the server. These both can also use the
entire u64 field, so we need to ensure that we don't monkey with the
flag bit on them.

2) filesystems that just use it internally: These don't set MS_I_VERSION
and mostly use it to detect directory changes that occur during readdir.
i_version is initialized to some value (0 or 1) when the struct inode is
allocated and bump it on directory changes.

3) filesystems where the kernel manages it completely: these set
MS_I_VERSION and the kernel handles bumping it on writes. Currently,
this is btrfs, ext4 and xfs. These are persistent across remounts as
well.

So, we have to ensure that this API encompasses all 3 of these use
cases.

> >  /**
> > @@ -2080,7 +2099,7 @@ inode_get_iversion(struct inode *inode)
> >  static inline s64
> >  inode_cmp_iversion(const struct inode *inode, const u64 old)
> >  {
> > -   return (s64)inode->i_version - (s64)old;
> > +   return (s64)(atomic64_read(&inode->i_version) << 1) - (s64)(old << 
> > 1);
> >  }
> > 
> 
> IMO, it is better for the API to determine that 'old' is a valid value
> returned from
> inode_get_iversion* and therefore should not have the MSB set.
> Unless the reason you chose to shift those 2 values is because it is cheaper
> than masking INODE_I_VERSION_QUERIED??
> 
> 

No, we need to do that in order to handle wraparound correctly. We want
this check to work something like the time_before/after macros in the
kernel that handle jiffies wraparound.

So, the sign returned here matters, as positive values indicate that the
current one is "newer" than the old one. That's the main reason for the
shift here.
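
A stand-alone illustration of those two properties (not the kernel helper
itself, but using the same top-bit layout as this series):

#include <stdio.h>
#include <stdint.h>

#define INODE_I_VERSION_QUERIED (1ULL << 63)

static int64_t cmp_iversion(uint64_t cur_raw, uint64_t old)
{
        /* shifting both values left by one drops bit 63 (the QUERIED flag)
         * and makes the signed difference of the 63-bit counters behave
         * like the jiffies time_after() comparison across a wrap */
        return (int64_t)(cur_raw << 1) - (int64_t)(old << 1);
}

int main(void)
{
        uint64_t v = 42;

        /* 1) the flag bit does not affect the comparison (prints 0) */
        printf("flag ignored: %lld\n",
               (long long)cmp_iversion(v | INODE_I_VERSION_QUERIED, v));

        /* 2) a freshly wrapped counter still compares as newer than a
         *    large pre-wrap sample (prints a positive value) */
        printf("wraparound:   %lld\n",
               (long long)cmp_iversion(1, INODE_I_VERSION_QUERIED - 1));
        return 0;
}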

Note that this should be documented here too; I plan to add that
for the next revision.

Thanks for the comments so far!
-- 
Jeff Layton 

[bug report] btrfs: root->fs_info cleanup, add fs_info convenience variables

2016-12-22 Thread Dan Carpenter
Hello Jeff Mahoney,

This is a semi-automatic email about new static checker warnings.

The patch 0b246afa62b0: "btrfs: root->fs_info cleanup, add fs_info 
convenience variables" from Jun 22, 2016, leads to the following 
Smatch complaint:

fs/btrfs/export.c:238 btrfs_get_name()
 warn: variable dereferenced before check 'inode' (see line 226)

fs/btrfs/export.c
   225  struct inode *dir = d_inode(parent);
   226  struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 ^^^
Patch adds dereference.

   227  struct btrfs_path *path;
   228  struct btrfs_root *root = BTRFS_I(dir)->root;
   229  struct btrfs_inode_ref *iref;
   230  struct btrfs_root_ref *rref;
   231  struct extent_buffer *leaf;
   232  unsigned long name_ptr;
   233  struct btrfs_key key;
   234  int name_len;
   235  int ret;
   236  u64 ino;
   237  
   238  if (!dir || !inode)
 ^
Old code checked for NULL.

   239  return -EINVAL;
   240  

regards,
dan carpenter


Re: [bug or by design ?] btrfs defrag compression does not persist

2016-12-22 Thread Austin S. Hemmelgarn

On 2016-12-21 21:28, Anand Jain wrote:


A quick design specific question.

The following command converts file-data-extents to the specified
encoder (lzo).

  $ btrfs filesystem defrag -v -r -f -clzo dir/

However, the lzo compression does not persist through later file
modifications, as the above operation does not update the
btrfs.compression property.

Question:
 I wonder if this should be a bug or if it's by design?
 What could be the main use case to _associate compression
 at the time of defrag_?
While I didn't write the tool, I'm pretty certain that it was by design. 
 The primary use cases as I see them for compressing using defrag are:
1. Repacking files after mounting without compression (I do this myself 
whenever I have to boot into a different distro than what I already have 
on the system since I always use compress=none when mounting from 
recovery media).

2. Actually compressing files after marking them for compression.
3. Using multi-tier in-line compression (for example, 'live' data is LZO 
compressed, data which has sat idle for a while is ZLIB compressed).



 The no-conflict fix will be to add another option to make
 it persistent.
This ideally should be better documented (nothing says it will persist, 
but nothing I can find says it won't either), but I personally feel that 
defrag shouldn't be mucking about with file attributes beyond the stuff 
that's implicitly updated by shuffling extents around.
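
For what it's worth, if persistence is the goal, the property interface
already covers it (assuming a btrfs-progs new enough to have 'btrfs
property'); setting it on a directory should also be inherited by new files
created there:

  $ btrfs property set dir/ compression lzo
  $ btrfs property get dir/ compression
  compression=lzo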

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OOM: Better, but still there on

2016-12-22 Thread Tetsuo Handa
Nils Holland wrote:
> Well, the issue is that I could only do everything via ssh today and
> don't have any physical access to the machines. In fact, both seem to
> have suffered a genuine kernel panic, which is also visible in the
> last few lines of the log I provided today. So, basically, both
> machines are now sitting at my home in panic state and I'll only be
> able to resurrect them wheh I'm physically there again tonight.

# echo 10 > /proc/sys/kernel/panic
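
(That sysctl makes the kernel reboot automatically N seconds after a panic
instead of sitting dead until someone can reach the console; a persistent
variant, assuming a sysctl.d-style setup, would be:)

  # /etc/sysctl.d/99-panic.conf -- auto-reboot 10 seconds after a panic
  kernel.panic = 10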
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OOM: Better, but still there on

2016-12-22 Thread Nils Holland
On Thu, Dec 22, 2016 at 11:27:25AM +0100, Michal Hocko wrote:
> On Thu 22-12-16 11:10:29, Nils Holland wrote:
> 
> > However, the log comes from machine #2 again today, as I'm
> > unfortunately forced to try this via VPN from work to home today, so I
> > have exactly one attempt per machine before it goes down and locks up
> > (and I can only restart it later tonight).
> 
> This is really surprising to me. Are you sure that you have sysrq
> configured properly? At least sysrq+b shouldn't depend on any memory
> allocations and should allow you to reboot immediately. A sysrq+m right
> before the reboot might turn out to be helpful as well.

Well, the issue is that I could only do everything via ssh today and
don't have any physical access to the machines. In fact, both seem to
have suffered a genuine kernel panic, which is also visible in the
last few lines of the log I provided today. So, basically, both
machines are now sitting at my home in panic state and I'll only be
able to resurrect them when I'm physically there again tonight. But
that was expected; I could have waited with the test until I'm at
home, which makes things easier, but I thought the sooner I can
provide a log for you to look at, the better. ;-)

Greetings
Nils
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [bug or by design ?] btrfs defrag compression does not persist

2016-12-22 Thread Duncan
Anand Jain posted on Thu, 22 Dec 2016 10:28:28 +0800 as excerpted:

> A quick design specific question.
> 
> The following command converts file-data-extents to the specified
> encoder (lzo).
> 
>$ btrfs filesystem defrag -v -r -f -clzo dir/
> 
> However, the lzo compression does not persist through later file
> modifications, as the above operation does not update the
> btrfs.compression property.
> 
> Question:
>   I wonder if this should be a bug or if it's by design? What could be
>   the main use case to _associate compression at the time of defrag_?
>   The no-conflict fix will be to add another option to make it
>   persistent.

I'd say it's a feature (whether deliberately designed that way or not).  
Presumably, if you want real-time compression, you're using the 
appropriate mount option or already setting the property on the file(s) 
in question.  Which leaves defrag-with-compression runs for one-time 
compression of cold data that's unlikely to change further, or for 
periodic compression runs during slow periods or the like, when you know 
it's not going to be an issue, even if it would be an issue if done in 
real-time production, thus the reason you're not running with the option 
already.

A practical example would be someone using the compress=lzo mount option 
for real-time work, then using defrag-and-compress to compress to zlib 
level either during slow periods, or when files have been rotated out of 
active use and are no longer being actively rewritten/appended.  They 
don't /want/ further modifications compressed to the same level in real-
time.
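
In command form, such a periodic recompression pass over cold data might look
like this (path purely illustrative):

  $ btrfs filesystem defrag -v -r -czlib /srv/archive/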

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OOM: Better, but still there on

2016-12-22 Thread Michal Hocko
On Thu 22-12-16 11:10:29, Nils Holland wrote:
> On Wed, Dec 21, 2016 at 08:36:59AM +0100, Michal Hocko wrote:
> > TL;DR
> > there is another version of the debugging patch. Just revert the
> > previous one and apply this one instead. It's still not clear what
> > is going on but I suspect either some misaccounting or unexpected
> > pages on the LRU lists. I have added one more tracepoint, so please
> > enable also mm_vmscan_inactive_list_is_low.
> 
> Right, I did just that and can provide a new log. I was also able, in
> this case, to reproduce the OOM issues again and not just the "page
> allocation stalls" that were the only thing visible in the previous
> log.

Thanks a lot for testing! I will have a look later today.

> However, the log comes from machine #2 again today, as I'm
> unfortunately forced to try this via VPN from work to home today, so I
> have exactly one attempt per machine before it goes down and locks up
> (and I can only restart it later tonight).

This is really surprising to me. Are you sure that you have sysrq
configured properly? At least sysrq+b shouldn't depend on any memory
allocations and should allow you to reboot immediately. A sysrq+m right
before the reboot might turn out to be helpful as well.
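
For example (assuming CONFIG_MAGIC_SYSRQ is enabled on those kernels), from
the ssh session:

  # make sure all SysRq functions are enabled, dump memory state, and --
  # as a last resort -- force an immediate reboot without syncing
  echo 1 > /proc/sys/kernel/sysrq
  echo m > /proc/sysrq-trigger
  echo b > /proc/sysrq-trigger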
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs_log2phys: cannot lookup extent mapping

2016-12-22 Thread Duncan
David Hanke posted on Wed, 21 Dec 2016 08:50:02 -0600 as excerpted:

> Thank you for your reply. If I've emailed the wrong list, please let me
> know.

Well, it's the right list... for /current/ btrfs.  For 3.0, as I said 
your distro lists may be more appropriate.  But from the below you do 
seem willing to upgrade, so...

> What I hear you saying, in short, is that btrfs is not yet fully
> stable but current 4.x versions may work better.

Yes.

> I'm willing to upgrade,
> but I'm told that the upgrade process may result in total failure, and
> I'm not sure I can trust the contents of the volume either way. Given
> that, it seems I must backup the backup, erase and start over. What
> would you do?

That's exactly what I'd do, but...

Given the maturing-but-not-yet-fully-stable state of btrfs today, staying 
no further from a usable current backup than the data you're willing to 
lose, at least worst-case, is an even stronger recommendation than it is 
on a fully mature and stable filesystem, kernel and hardware.  (And even 
on such a stable system, any sysadmin worth the name defines the real 
value of data by the extent to which it is backed up; no backup means 
it's simply not worth the trouble, and the loss of the data is a smaller 
loss than the loss of the resources and hassle required to back it up as 
insurance, regardless of any claims to the contrary.)

Knowing that, I do have reasonable backups, and while they aren't always 
current, I take seriously the backup-or-lack-thereof-defines-value idea 
discussed above, so if I lose something due to not having a backup, I 
swallow hard and know I must have considered the time saved worth it...

Which is a long way of saying I have my backups closer at hand and am 
more willing to use them and lose what wasn't backed up, than some.  So 
it's easier for me to say that's what I'd do, than it would be for some.  
I actually make it a point to keep my data in partitions sized to be 
reasonable to manage, with equivalently sized partitions elsewhere for 
the backups, to multiple levels in many cases, tho some are rather old.  
So freshening or restoring a backup is simply a matter of copying from 
one partition (or pair of partitions given that many of them are btrfs 
raid1 pair-mirrors) to another, deliberately pre-provisioned to the same 
size, for use /as/ the working and backup copies.  Similarly, falling 
back to a backup is simply a matter of ensuring the appropriate physical 
media is connected, and either mounting it as a backup, or switching a 
couple entries in fstab, and mounting it in place of the original.

So it's relatively easy here, but only because I've taken pains to set it 
up to make it so.

Meanwhile, btrfs does have some tools that can /sometimes/ help recover 
data off of unmountable fs' that would otherwise be "in the backup gap".  
Btrfs restore has helped me save that "backup gap" data a few times -- it 
may not have been worth the trouble of a backup when the risk was still 
theoretical, and I'd have accepted the loss if it came to it, but that 
didn't mean it wasn't worth spending a bit more time trying to save it, 
successfully in my case, once I knew I was actually in the recovery or 
loss situation.
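
For anyone who hasn't used it, the basic invocation is along these lines
(device and target path hypothetical):

  # btrfs restore -v /dev/sdb1 /mnt/scratch/recovered/

which copies whatever it can read off the unmountable filesystem into a
scratch area, without needing to mount it.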

Tho in your case it looks like you are seeing the warnings before it 
gets to that point, and it's a backup already (so you presumably have 
the live data in most cases), and you can still mount and read most or 
all of it, so it's just a question of the time and hassle to do it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OOM: Better, but still there on

2016-12-22 Thread Nils Holland
On Wed, Dec 21, 2016 at 08:36:59AM +0100, Michal Hocko wrote:
> TL;DR
> there is another version of the debugging patch. Just revert the
> previous one and apply this one instead. It's still not clear what
> is going on but I suspect either some misaccounting or unexpected
> pages on the LRU lists. I have added one more tracepoint, so please
> enable also mm_vmscan_inactive_list_is_low.

Right, I did just that and can provide a new log. I was also able, in
this case, to reproduce the OOM issues again and not just the "page
allocation stalls" that were the only thing visible in the previous
log. However, the log comes from machine #2 again today, as I'm
unfortunately forced to try this via VPN from work to home today, so I
have exactly one attempt per machine before it goes down and locks up
(and I can only restart it later tonight). Machine #1 failed to
produce good looking results during its one attempt, but what machine #2
produced seems to be exactly what we've been trying to track down, and so
its log is now up at:

http://ftp.tisys.org/pub/misc/boerne_2016-12-22.log.xz

Greetings
Nils
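
For anyone reproducing this: enabling the extra tracepoint -- assuming
tracefs is mounted in the usual place and the debugging patch puts it in the
vmscan event group -- is just:

  echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_inactive_list_is_low/enable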
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9 v2] scope GFP_NOFS api

2016-12-22 Thread Michal Hocko
Are there any objections to the approach and can we have this merged to
the mm tree?

Dave has expressed that patch 2 should be dropped for now. I will do that
in the next submission, but I do not want to resubmit until there is a
consensus on this.

What do developers other than the xfs/ext4 people think about this API? Can we find
a way to use it?
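
For reference, the scoped interface proposed by the series boils down to the
following pattern (a sketch using the names from the series, not a finished
conversion of any particular filesystem):

	/* Mark a section where the fs holds locks that reclaim re-entering the
	 * filesystem could deadlock on; allocations inside the section are
	 * implicitly treated as GFP_NOFS, so callers need not pass it. */
	unsigned int flags = memalloc_nofs_save();
	/* ... locking and allocations ... */
	memalloc_nofs_restore(flags);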
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2016-12-22 Thread Christoph Hellwig
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> Only btrfs, ext4, and xfs implement it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes. That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]

Do you have numbers to justify these changes?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 30/30] fs: convert i_version counter over to an atomic64_t

2016-12-22 Thread Amir Goldstein
On Wed, Dec 21, 2016 at 7:03 PM, Jeff Layton  wrote:
> The spinlock is only used to serialize callers that want to increment
> the counter. We can achieve the same thing with an atomic64_t and
> get the i_lock out of this codepath.
>

Cool work! See some nits and suggestions below.

> +/*
> + * We borrow the top bit in the i_version to use as a flag to tell us whether
> + * it has been queried since we last bumped it. If it has, then we must bump
> + * it and set the flag. Note that this means that we have to handle wrapping
> + * manually.
> + */
> +#define INODE_I_VERSION_QUERIED (1ULL<<63)
> +
>  /**
>   * inode_set_iversion - set i_version to a particular value
>   * @inode: inode to set
> @@ -1976,7 +1980,7 @@ static inline void inode_dec_link_count(struct inode 
> *inode)
>  static inline void
>  inode_set_iversion(struct inode *inode, const u64 new)
>  {
> -   inode->i_version = new;
> +   atomic64_set(&inode->i_version, new);
>  }
>

Maybe this needs an overflow sanity check for !(new & INODE_I_VERSION_QUERIED)?
See the API change suggestion below.
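
Something along these lines, perhaps (a sketch of the suggested check, not
part of the posted patch):

	static inline void
	inode_set_iversion(struct inode *inode, const u64 new)
	{
		/* callers must never hand in a value with the flag bit set */
		WARN_ON_ONCE(new & INODE_I_VERSION_QUERIED);
		atomic64_set(&inode->i_version, new);
	}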


>  /**
> @@ -2010,16 +2011,26 @@ inode_set_iversion_read(struct inode *inode, const 
> u64 new)
>  static inline bool
>  inode_inc_iversion(struct inode *inode, bool force)
>  {
> -   bool ret = false;
> +   u64 cur, old, new;
> +
> +   cur = (u64)atomic64_read(&inode->i_version);
> +   for (;;) {
> +   /* If flag is clear then we needn't do anything */
> +   if (!force && !(cur & INODE_I_VERSION_QUERIED))
> +   return false;
> +
> +   new = (cur & ~INODE_I_VERSION_QUERIED) + 1;
> +
> +   /* Did we overflow into flag bit? Reset to 0 if so. */
> +   if (unlikely(new == INODE_I_VERSION_QUERIED))
> +   new = 0;
>

Did you consider changing f_version type and the signature of the new
i_version API to set/get s64 instead of u64?

It makes a bit more sense from an API user's perspective to know that
the valid range for version is >=0.

file->f_version is not the only struct member used to store
i_version. nfs and xfs have other struct members for that, but even
if all those members are not changed to type s64, the explicit cast
to (s64) and back to (u64) will serve as good documentation in
the code about the valid range of version in the new API.

>  /**
> @@ -2080,7 +2099,7 @@ inode_get_iversion(struct inode *inode)
>  static inline s64
>  inode_cmp_iversion(const struct inode *inode, const u64 old)
>  {
> -   return (s64)inode->i_version - (s64)old;
> +   return (s64)(atomic64_read(&inode->i_version) << 1) - (s64)(old << 1);
>  }
>

IMO, it is better for the API to determine that 'old' is a valid value
returned from inode_get_iversion* and therefore should not have the MSB set.
Unless the reason you chose to shift those 2 values is because it is cheaper
than masking INODE_I_VERSION_QUERIED??


Cheers,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html