Re: [PATCH V2] Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state

2015-12-13 Thread Alex Lyakas
[Resending in plain text, apologies.]

Hi Chandan, Josef, Chris,

I am not sure I understand the fix to the problem.

It may happen that when updating the device tree, we need to allocate a new
chunk via do_chunk_alloc() (while we are holding the device tree root node
locked). This is a legitimate thing for find_free_extent() to do. And the
do_chunk_alloc() call may lead to a call to
btrfs_create_pending_block_groups(), which will try to update the device
tree. This may happen through the direct call to
btrfs_create_pending_block_groups() that exists in do_chunk_alloc(), or
perhaps through the __btrfs_end_transaction() that find_free_extent() does
after it has completed chunk allocation (although in that case it will use
the transaction that already exists in current->journal_info).
So the deadlock may still happen?

Thanks,
 Alex.

>
>
> On Mon, Nov 2, 2015 at 6:52 PM, Chris Mason  wrote:
>>
>> On Mon, Nov 02, 2015 at 01:59:46PM +0530, Chandan Rajendra wrote:
>> > When executing generic/001 in a loop on a ppc64 machine (with both
>> > sectorsize
>> > and nodesize set to 64k), the following call trace is observed,
>>
>> Thanks Chandan, I hit this same trace on x86-64 with 16K nodes.
>>
>> -chris


Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references

2015-12-13 Thread Alex Lyakas
Hi Filipe Manana,

My understanding of selecting delayed refs to run or merging them is
far from complete. Can you please explain what will happen in the
following scenario:

1) Ref1 is created, as you explain
2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
up with an EXTENT_ITEM and an inline extent back ref
3) Ref2 and Ref3 are added
4) Somebody calls __btrfs_run_delayed_refs()

At this point, we cannot merge Ref2 and Ref3, because they might be
referencing tree blocks of completely different trees, thus
comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
run, because we prefer BTRFS_ADD_DELAYED_REF over
BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
now, because we already have Ref1 in the extent tree.

So something should prevent us from running Ref3 before running Ref2.
We should run Ref2 first, which should get rid of the EXTENT_ITEM and
the inline backref, and then run Ref3 to create a new backref with a
proper owner. What is that something?

Can you please point out what I am missing?

Also, can such scenario happen in 3.18 kernel, which still has an
rbtree per ref-head? Looking at the code, I don't see anything
preventing that from happening.

Thanks,
Alex.


On Sun, Oct 25, 2015 at 8:51 PM,   wrote:
> From: Filipe Manana 
>
> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
> references implementation in order to fix certain problems with qgroups.
> However that rework introduced one more regression that leads to the
> following trace when running delayed references for metadata:
>
> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
> [35908.065201] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor 
> raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc 
> loop fuse parport_pc psmouse i2
> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW 
>   4.3.0-rc5-btrfs-next-17+ #1
> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: 
> 88010c4c8000
> [35908.065201] RIP: 0010:[]  [] 
> insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP: 0018:88010c4cbb08  EFLAGS: 00010293
> [35908.065201] RAX:  RBX: 88008a661000 RCX: 
> 
> [35908.065201] RDX: a04dd58f RSI: 0001 RDI: 
> 
> [35908.065201] RBP: 88010c4cbb40 R08: 1000 R09: 
> 88010c4cb9f8
> [35908.065201] R10:  R11: 002c R12: 
> 
> [35908.065201] R13: 88020a74c578 R14:  R15: 
> 
> [35908.065201] FS:  () GS:88023edc() 
> knlGS:
> [35908.065201] CS:  0010 DS:  ES:  CR0: 8005003b
> [35908.065201] CR2: 015e8708 CR3: 000102185000 CR4: 
> 06e0
> [35908.065201] Stack:
> [35908.065201]  88010c4cbb18 0f37 88020a74c578 
> 88015a408000
> [35908.065201]  880154a44000  0005 
> 88010c4cbbd8
> [35908.065201]  a0492b9a 0005  
> 
> [35908.065201] Call Trace:
> [35908.065201]  [] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
> [35908.065201]  [] ? __btrfs_run_delayed_refs+0x4d4/0xd33 
> [btrfs]
> [35908.065201]  [] __btrfs_run_delayed_refs+0xafa/0xd33 
> [btrfs]
> [35908.065201]  [] ? join_transaction.isra.10+0x25/0x41f 
> [btrfs]
> [35908.065201]  [] ? join_transaction.isra.10+0xa8/0x41f 
> [btrfs]
> [35908.065201]  [] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
> [35908.065201]  [] delayed_ref_async_start+0x3c/0x7b [btrfs]
> [35908.065201]  [] normal_work_helper+0x14c/0x32a [btrfs]
> [35908.065201]  [] btrfs_extent_refs_helper+0x12/0x14 
> [btrfs]
> [35908.065201]  [] process_one_work+0x24a/0x4ac
> [35908.065201]  [] worker_thread+0x206/0x2c2
> [35908.065201]  [] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [] kthread+0xef/0xf7
> [35908.065201]  [] ? kthread_parkme+0x24/0x24
> [35908.065201]  [] ret_from_fork+0x3f/0x70
> [35908.065201]  [] ? kthread_parkme+0x24/0x24
> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 
> 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 
> 0b 4c 8b 45 30 8b 4d 28 45 31
> [35908.065201] RIP  [] 
> insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201]  RSP 
> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>
> This happens because the new delayed references code no longer merges
> delayed references that have different sequence values. 

Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Alex Lyakas
Hi Filipe Manana,

Can't the call to btrfs_create_pending_block_groups() cause a
deadlock, like in
http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
call updates the device tree, and we may be calling do_chunk_alloc()
from find_free_extent() when holding a lock on the device tree root
(because we want to COW a block of the device tree).

My understanding from Josef's chunk allocator rework
(http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
when allocating a new chunk we do not immediately update the
device/chunk tree. We keep the new chunk in "pending_chunks" and in
"new_bgs" on a transaction handle, and we actually update the
chunk/device tree only when we are done with a particular transaction
handle. This way we avoid that sort of deadlocks.

But this patch breaks this rule, as it may make us update the
device/chunk tree in the context of chunk allocation, which is the
scenario that the rework was meant to avoid.

Can you please point me at what I am missing?

Thanks,
Alex.


On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval  wrote:
> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
>> finishing block group creation"), introduced in 4.2-rc1, the following
>> test was failing due to exhaustion of the system array in the superblock:
>>
>>   #!/bin/bash
>>
>>   truncate -s 100T big.img
>>   mkfs.btrfs big.img
>>   mount -o loop big.img /mnt/loop
>>
>>   num=5
>>   sz=10T
>>   for ((i = 0; i < $num; i++)); do
>>   echo fallocate $i $sz
>>   fallocate -l $sz /mnt/loop/testfile$i
>>   done
>>   btrfs filesystem sync /mnt/loop
>>
>>   for ((i = 0; i < $num; i++)); do
>> echo rm $i
>> rm /mnt/loop/testfile$i
>> btrfs filesystem sync /mnt/loop
>>   done
>>   umount /mnt/loop
>>
>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
>> allocation of system block groups. This happened because the test creates
>> a large number of data block groups per transaction and when committing
>> the transaction we start the writeout of the block group caches for all
>> the new (dirty) block groups, which results in pre-allocating space
>> for each block group's free space cache using the same transaction handle.
>> That in turn often leads to creation of more block groups, and all get
>> attached to the new_bgs list of the same transaction handle to the point
>> of getting a list with over 1500 elements, and creation of new block groups
>> leads to the need of reserving space in the chunk block reserve and often
>> creating a new system block group too.
>>
>> So that made us quickly exhaust the chunk block reserve/system space info,
>> because as of the commit mentioned before, we do reserve space for each
>> new block group in the chunk block reserve, unlike before where we would
>> not and would at most allocate one new system block group and therefore
>> would only ensure that there was enough space in the system space info to
>> allocate 1 new block group even if we ended up allocating thousands of
>> new block groups using the same transaction handle. That worked most of
>> the time because the computed required space at check_system_chunk() is
>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
>> that all nodes/leaves in a path will be COWed and split) and since the
>> updates to the chunk tree all happen at btrfs_create_pending_block_groups
>> it is unlikely that a path needs to be COWed more than once (unless
>> writepages() for the btree inode is called by mm in between) and that
>> compensated for the need of creating any new nodes/leaves in the chunk
>> tree.
>>
>> So fix this by ensuring we don't accumulate too large a list of new block
>> groups in a transaction's handles new_bgs list, inserting/updating the
>> chunk tree for all accumulated new block groups and releasing the unused
>> space from the chunk block reserve whenever the list becomes sufficiently
>> large. This is a generic solution even though the problem currently can
>> only happen when starting the writeout of the free space caches for all
>> dirty block groups (btrfs_start_dirty_block_groups()).
>>
>> Reported-by: Omar Sandoval 
>> Signed-off-by: Filipe Manana 
>
> Thanks a lot for taking a look.
>
> Tested-by: Omar Sandoval 
>
>> ---
>>  fs/btrfs/extent-tree.c | 18 ++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 171312d..07204bf 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4227,6 +4227,24 @@ out:
>>   space_info->chunk_alloc = 0;
>>   spin_unlock(&space_info->lock);
>>   mutex_unlock(&fs_info->chunk_mutex);
>> + /*
>> +  * When we allocate a new chunk we reserve space in the chunk block
>> +  * reserve 

Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)

2015-12-13 Thread Christoph Anton Mitterer
On Wed, 2015-12-09 at 13:36 +, Duncan wrote:
> Answering the BTW first, not to my knowledge, and I'd be
> skeptical.  In 
> general, btrfs is cowed, and that's the focus.  To the extent that
> nocow 
> is necessary for fragmentation/performance reasons, etc, the idea is
> to 
> try to make cow work better in those cases, for example by working on
> autodefrag to make it better at handling large files without the
> scaling 
> issues it currently has above half a gig or so, and thus to confine
> nocow 
> to a smaller and smaller niche use-case, rather than focusing on
> making 
> nocow better.
> Of course it remains to be seen how much better they can do with 
> autodefrag, etc, but at this point, there's way more project 
> possibilities than people to develop them, so even if they do find
> they 
> can't make cow work much better for these cases, actually working on
> nocow 
> would still be rather far down the list, because there's so many
> other 
> improvement and feature opportunities that will get the focus
> first.  
> Which in practice probably puts it in "it'd be nice, but it's low
> enough 
> priority that we're talking five years out or more, unless of course 
> someone else qualified steps up and that's their personal itch they
> want 
> to scratch", territory.
I guess I'll split out my answer on that, in a fresh thread about
checksums for nodatacow later, hoping to attract some more devs there
:-)

I think, however (again with my naive understanding of how CoW works
and what it inherently implies), that there cannot be a really good
solution to the fragmentation problem for DB/etc. files.

And as such, I'd think that having a checksumming feature for
nodatacow as well, even if it's not perfect, is definitely worth it.


> As for the updated checksum after modification, the problem with that
> is 
> that in the mean time, the checksum wouldn't verify,
Well, one could implement some locking... but I don't see the
general problem here: if the block is still being written (and I count
updating the metadata, including the checksum, as part of that) it
cannot be read anyway, can it? It may be only half written and the data
returned would be garbage.


>  and while btrfs 
> could of course keep status in memory during normal operations,
> that's 
> not the problem, the problem is what happens if there's a crash and
> in-
> memory state vaporizes.  In that case, when btrfs remounted, it'd
> have no 
> way of knowing why the checksum didn't match, just that it didn't,
> and 
> would then refuse access to that block in the file, because for all
> it 
> knows, it /is/ a block error.
And this would only happen in the rare case that something crashes,
where it's quite likely anyway that this non-CoWed block will be
garbage.
I'll talk about that more in the separate thread... so let's move
things there.


> Same here.  In fact, my most anticipated feature is N-way-mirroring, 
Hmm... not totally sure about that...
AFAIU, N-way mirroring is what the currently (wrongly) called RAID1 in
btrfs is, i.e. having N replicas of everything on M devices, right?
In other words, not being an N-parity RAID and not guaranteeing that
*any* N disks could fail, right?

Hmm I guess that would be definitely nice to have, especially since
then we could have true RAID1, i.e. N=M.

But it's probably rather important for those scenarios where either
resilience matters a lot... and/or those where write speed doesn't
matter but read speed does, right?

Taking the example of our use case at the university, i.e. the LHC
Tier-2 we run,... that would rather be uninteresting.
We typically have storage nodes (and many of them) of say 16-24
devices, and based on funding constraints, resilience concerns and IO
performance, we place them in RAID6 (yeah, I know, RAID5 is faster, but
even with hot spares in place, practice too often led to lost RAIDs).

Especially for the bigger nodes, with more disks, we'd rather have an
N-parity RAID, where any N disks can fail... of course performance
considerations may kill that desire again ;)


> It is a big and basic feature, but turning it off isn't the end of
> the 
> world, because then it's still the same level of reliability other 
> solutions such as raid generally provide.
Sure... I never meant it as "loss to what we already have in other
systems"... but as "loss compared to how awesome[0] btrfs could be ;-)"


> But as it happens, both VM image management and databases tend to
> come 
> with their own integrity management, in part precisely because the 
> filesystem could never provide that sort of service.
Well, that's only partially true, to my knowledge.
a) I don't know of hypervisors doing that at all.
b) DBs of course have their journal, but that protects only against
crashes... not against bad blocks, nor does it help you decide which
block is good when you have multiple copies.


> After all, you can always decide not to run it if you're worried
> about the space effects it's going to have

Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references

2015-12-13 Thread Filipe Manana
On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas  wrote:
> Hi Filipe Manana,
>
> My understanding of selecting delayed refs to run or merging them is
> far from complete. Can you please explain what will happen in the
> following scenario:
>
> 1) Ref1 is created, as you explain
> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
> up with an EXTENT_ITEM and an inline extent back ref
> 3) Ref2 and Ref3 are added
> 4) Somebody calls __btrfs_run_delayed_refs()
>
> At this point, we cannot merge Ref2 and Ref3, because they might be
> referencing tree blocks of completely different trees, thus
> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
> run, because we prefer BTRFS_ADD_DELAYED_REF over
> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
> now, because we already have Ref1 in the extent tree.

No, that won't happen. If the ref (Ref3) is for a different tree, then
it has a different inline extent backref from Ref1
(lookup_inline_extent_backref returns -ENOENT and not 0).

If they are all for the same tree, it means Ref3 is not merged with
Ref2 because they have different seq numbers and a seq value exists in
fs_info->tree_mod_seq_list, and we skip Ref3 through
btrfs_check_delayed_seq() until that seq number goes away from
tree_mod_seq_list. If no seq number exists in tree_mod_seq_list then
we merge it (Ref3) through btrfs_merge_delayed_refs(), called when
running delayed refs, with Ref2 (which removes both refs since one is
"-1" and the other "+1").

Iow, after this regression fix, no behaviour changed from releases before 4.2.

>
> So something should prevent us from running Ref3 before running Ref2.
> We should run Ref2 first, which should get rid of the EXTENT_ITEM and
> the inline backref, and then run Ref3 to create a new backref with a
> proper owner. What is that something?
>
> Can you please point out what I am missing?
>
> Also, can such scenario happen in 3.18 kernel, which still has an
> rbtree per ref-head? Looking at the code, I don't see anything
> preventing that from happening.
>
> Thanks,
> Alex.
>
>
> On Sun, Oct 25, 2015 at 8:51 PM,   wrote:
>> From: Filipe Manana 
>>
>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>> references implementation in order to fix certain problems with qgroups.
>> However that rework introduced one more regression that leads to the
>> following trace when running delayed references for metadata:
>>
>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>> [35908.065201] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor 
>> raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache 
>> sunrpc loop fuse parport_pc psmouse i2
>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW
>>4.3.0-rc5-btrfs-next-17+ #1
>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>> [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: 
>> 88010c4c8000
>> [35908.065201] RIP: 0010:[]  [] 
>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>> [35908.065201] RSP: 0018:88010c4cbb08  EFLAGS: 00010293
>> [35908.065201] RAX:  RBX: 88008a661000 RCX: 
>> 
>> [35908.065201] RDX: a04dd58f RSI: 0001 RDI: 
>> 
>> [35908.065201] RBP: 88010c4cbb40 R08: 1000 R09: 
>> 88010c4cb9f8
>> [35908.065201] R10:  R11: 002c R12: 
>> 
>> [35908.065201] R13: 88020a74c578 R14:  R15: 
>> 
>> [35908.065201] FS:  () GS:88023edc() 
>> knlGS:
>> [35908.065201] CS:  0010 DS:  ES:  CR0: 8005003b
>> [35908.065201] CR2: 015e8708 CR3: 000102185000 CR4: 
>> 06e0
>> [35908.065201] Stack:
>> [35908.065201]  88010c4cbb18 0f37 88020a74c578 
>> 88015a408000
>> [35908.065201]  880154a44000  0005 
>> 88010c4cbbd8
>> [35908.065201]  a0492b9a 0005  
>> 
>> [35908.065201] Call Trace:
>> [35908.065201]  [] __btrfs_inc_extent_ref+0x8b/0x208 
>> [btrfs]
>> [35908.065201]  [] ? __btrfs_run_delayed_refs+0x4d4/0xd33 
>> [btrfs]
>> [35908.065201]  [] __btrfs_run_delayed_refs+0xafa/0xd33 
>> [btrfs]
>> [35908.065201]  [] ? join_transaction.isra.10+0x25/0x41f 
>> [btrfs]
>> [35908.065201]  [] ? join_transaction.isra.10+0xa8/0x41f 
>> [btrfs]
>> [35908.065201]  [] btrfs_run_delayed_refs+0x75/0x1dd 
>> [btrfs]
>> [35908.065201]  [] delayed_ref_async_start+0x3c/0x7b 
>> [btrfs]
>> [35908.065201]  [] 

Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Filipe Manana
On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas  wrote:
> Hi Filipe Manana,
>
> Can't the call to btrfs_create_pending_block_groups() cause a
> deadlock, like in
> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
> call updates the device tree, and we may be calling do_chunk_alloc()
> from find_free_extent() when holding a lock on the device tree root
> (because we want to COW a block of the device tree).
>
> My understanding from Josef's chunk allocator rework
> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
> when allocating a new chunk we do not immediately update the
> device/chunk tree. We keep the new chunk in "pending_chunks" and in
> "new_bgs" on a transaction handle, and we actually update the
> chunk/device tree only when we are done with a particular transaction
> handle. This way we avoid that sort of deadlocks.
>
> But this patch breaks this rule, as it may make us update the
> device/chunk tree in the context of chunk allocation, which is the
> scenario that the rework was meant to avoid.
>
> Can you please point me at what I am missing?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff

>
> Thanks,
> Alex.
>
>
> On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval  wrote:
>> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
>>> From: Filipe Manana 
>>>
>>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
>>> finishing block group creation"), introduced in 4.2-rc1, the following
>>> test was failing due to exhaustion of the system array in the superblock:
>>>
>>>   #!/bin/bash
>>>
>>>   truncate -s 100T big.img
>>>   mkfs.btrfs big.img
>>>   mount -o loop big.img /mnt/loop
>>>
>>>   num=5
>>>   sz=10T
>>>   for ((i = 0; i < $num; i++)); do
>>>   echo fallocate $i $sz
>>>   fallocate -l $sz /mnt/loop/testfile$i
>>>   done
>>>   btrfs filesystem sync /mnt/loop
>>>
>>>   for ((i = 0; i < $num; i++)); do
>>> echo rm $i
>>> rm /mnt/loop/testfile$i
>>> btrfs filesystem sync /mnt/loop
>>>   done
>>>   umount /mnt/loop
>>>
>>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
>>> allocation of system block groups. This happened because the test creates
>>> a large number of data block groups per transaction and when committing
>>> the transaction we start the writeout of the block group caches for all
>>> the new (dirty) block groups, which results in pre-allocating space
>>> for each block group's free space cache using the same transaction handle.
>>> That in turn often leads to creation of more block groups, and all get
>>> attached to the new_bgs list of the same transaction handle to the point
>>> of getting a list with over 1500 elements, and creation of new block groups
>>> leads to the need of reserving space in the chunk block reserve and often
>>> creating a new system block group too.
>>>
>>> So that made us quickly exhaust the chunk block reserve/system space info,
>>> because as of the commit mentioned before, we do reserve space for each
>>> new block group in the chunk block reserve, unlike before where we would
>>> not and would at most allocate one new system block group and therefore
>>> would only ensure that there was enough space in the system space info to
>>> allocate 1 new block group even if we ended up allocating thousands of
>>> new block groups using the same transaction handle. That worked most of
>>> the time because the computed required space at check_system_chunk() is
>>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
>>> that all nodes/leaves in a path will be COWed and split) and since the
>>> updates to the chunk tree all happen at btrfs_create_pending_block_groups
>>> it is unlikely that a path needs to be COWed more than once (unless
>>> writepages() for the btree inode is called by mm in between) and that
>>> compensated for the need of creating any new nodes/leaves in the chunk
>>> tree.
>>>
>>> So fix this by ensuring we don't accumulate too large a list of new block
>>> groups in a transaction's handles new_bgs list, inserting/updating the
>>> chunk tree for all accumulated new block groups and releasing the unused
>>> space from the chunk block reserve whenever the list becomes sufficiently
>>> large. This is a generic solution even though the problem currently can
>>> only happen when starting the writeout of the free space caches for all
>>> dirty block groups (btrfs_start_dirty_block_groups()).
>>>
>>> Reported-by: Omar Sandoval 
>>> Signed-off-by: Filipe Manana 
>>
>> Thanks a lot for taking a look.
>>
>> Tested-by: Omar Sandoval 
>>
>>> ---
>>>  fs/btrfs/extent-tree.c | 18 ++
>>>  1 file changed, 18 insertions(+)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 171312d..07204bf 

Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references

2015-12-13 Thread Alex Lyakas
Hi Filipe,

Thank you for the explanation.

On Sun, Dec 13, 2015 at 5:43 PM, Filipe Manana  wrote:
> On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas  wrote:
>> Hi Filipe Manana,
>>
>> My understanding of selecting delayed refs to run or merging them is
>> far from complete. Can you please explain what will happen in the
>> following scenario:
>>
>> 1) Ref1 is created, as you explain
>> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
>> up with an EXTENT_ITEM and an inline extent back ref
>> 3) Ref2 and Ref3 are added
>> 4) Somebody calls __btrfs_run_delayed_refs()
>>
>> At this point, we cannot merge Ref2 and Ref3, because they might be
>> referencing tree blocks of completely different trees, thus
>> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
>> run, because we prefer BTRFS_ADD_DELAYED_REF over
>> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
>> now, because we already have Ref1 in the extent tree.
>
> No, that won't happen. If the ref (Ref3) is for a different tree, then
> it has a different inline extent backref from Ref1
> (lookup_inline_extent_backref returns -ENOENT and not 0).
Understood. So in this case, we will first add an inline ref for Ref3,
and later drop the Ref1 inline ref via update_inline_extent_backref()
by truncating the EXTENT_ITEM. All in the same transaction.


>
> If they are all for the same tree, it means Ref3 is not merged with
> Ref2 because they have different seq numbers and a seq value exists in
> fs_info->tree_mod_seq_list, and we skip Ref3 through
> btrfs_check_delayed_seq() until that seq number goes away from
> tree_mod_seq_list.
Ok, so we won't process this ref-head at all, until the "seq problem"
disappears.

> If no seq number exists in tree_mod_seq_list then
> we merge it (Ref3) through btrfs_merge_delayed_refs(), called when
> running delayed refs, with Ref2 (which removes both refs since one is
> "-1" and the other "+1").
So in this case we don't care that the inline ref we have in the
EXTENT_ITEM was actually inserted on behalf of Ref1, because it's for
the same EXTENT_ITEM and for the same root. So Ref3 and Ref1 are fully
equivalent. Interesting.

Thanks!
Alex.

>
> Iow, after this regression fix, no behaviour changed from releases before 4.2.
>
>>
>> So something should prevent us from running Ref3 before running Ref2.
>> We should run Ref2 first, which should get rid of the EXTENT_ITEM and
>> the inline backref, and then run Ref3 to create a new backref with a
>> proper owner. What is that something?
>>
>> Can you please point out what I am missing?
>>
>> Also, can such scenario happen in 3.18 kernel, which still has an
>> rbtree per ref-head? Looking at the code, I don't see anything
>> preventing that from happening.
>>
>> Thanks,
>> Alex.
>>
>>
>> On Sun, Oct 25, 2015 at 8:51 PM,   wrote:
>>> From: Filipe Manana 
>>>
>>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>>> references implementation in order to fix certain problems with qgroups.
>>> However that rework introduced one more regression that leads to the
>>> following trace when running delayed references for metadata:
>>>
>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>> [35908.065201] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor 
>>> raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache 
>>> sunrpc loop fuse parport_pc psmouse i2
>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW   
>>> 4.3.0-rc5-btrfs-next-17+ #1
>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>>> [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: 
>>> 88010c4c8000
>>> [35908.065201] RIP: 0010:[]  [] 
>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201] RSP: 0018:88010c4cbb08  EFLAGS: 00010293
>>> [35908.065201] RAX:  RBX: 88008a661000 RCX: 
>>> 
>>> [35908.065201] RDX: a04dd58f RSI: 0001 RDI: 
>>> 
>>> [35908.065201] RBP: 88010c4cbb40 R08: 1000 R09: 
>>> 88010c4cb9f8
>>> [35908.065201] R10:  R11: 002c R12: 
>>> 
>>> [35908.065201] R13: 88020a74c578 R14:  R15: 
>>> 
>>> [35908.065201] FS:  () GS:88023edc() 
>>> knlGS:
>>> [35908.065201] CS:  0010 DS:  ES:  CR0: 8005003b
>>> [35908.065201] CR2: 015e8708 CR3: 000102185000 CR4: 
>>> 06e0
>>> [35908.065201] Stack:
>>> [35908.065201]  88010c4cbb18 0f37 88020a74c578 

Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Alex Lyakas
Thank you, Filipe. Now it is clearer.
Fortunately, in my 3.18 kernel I do not have do_chunk_alloc() calling
btrfs_create_pending_block_groups(), so I cannot hit this deadlock.
But I can hit the issue that this call is meant to fix.

Thanks,
Alex.


On Sun, Dec 13, 2015 at 5:45 PM, Filipe Manana  wrote:
> On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas  wrote:
>> Hi Filipe Manana,
>>
>> Can't the call to btrfs_create_pending_block_groups() cause a
>> deadlock, like in
>> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
>> call updates the device tree, and we may be calling do_chunk_alloc()
>> from find_free_extent() when holding a lock on the device tree root
>> (because we want to COW a block of the device tree).
>>
>> My understanding from Josef's chunk allocator rework
>> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
>> when allocating a new chunk we do not immediately update the
>> device/chunk tree. We keep the new chunk in "pending_chunks" and in
>> "new_bgs" on a transaction handle, and we actually update the
>> chunk/device tree only when we are done with a particular transaction
>> handle. This way we avoid that sort of deadlocks.
>>
>> But this patch breaks this rule, as it may make us update the
>> device/chunk tree in the context of chunk allocation, which is the
>> scenario that the rework was meant to avoid.
>>
>> Can you please point me at what I am missing?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff
>
>>
>> Thanks,
>> Alex.
>>
>>
>> On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval  wrote:
>>> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
 From: Filipe Manana 

 Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
 finishing block group creation"), introduced in 4.2-rc1, the following
 test was failing due to exhaustion of the system array in the superblock:

   #!/bin/bash

   truncate -s 100T big.img
   mkfs.btrfs big.img
   mount -o loop big.img /mnt/loop

   num=5
   sz=10T
   for ((i = 0; i < $num; i++)); do
   echo fallocate $i $sz
   fallocate -l $sz /mnt/loop/testfile$i
   done
   btrfs filesystem sync /mnt/loop

   for ((i = 0; i < $num; i++)); do
 echo rm $i
 rm /mnt/loop/testfile$i
 btrfs filesystem sync /mnt/loop
   done
   umount /mnt/loop

 This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
 allocation of system block groups. This happened because the test creates
 a large number of data block groups per transaction and when committing
 the transaction we start the writeout of the block group caches for all
 the new (dirty) block groups, which results in pre-allocating space
 for each block group's free space cache using the same transaction handle.
 That in turn often leads to creation of more block groups, and all get
 attached to the new_bgs list of the same transaction handle to the point
 of getting a list with over 1500 elements, and creation of new block groups
 leads to the need of reserving space in the chunk block reserve and often
 creating a new system block group too.

 So that made us quickly exhaust the chunk block reserve/system space info,
 because as of the commit mentioned before, we do reserve space for each
 new block group in the chunk block reserve, unlike before where we would
 not and would at most allocate one new system block group and therefore
 would only ensure that there was enough space in the system space info to
 allocate 1 new block group even if we ended up allocating thousands of
 new block groups using the same transaction handle. That worked most of
 the time because the computed required space at check_system_chunk() is
 very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
 that all nodes/leaves in a path will be COWed and split) and since the
 updates to the chunk tree all happen at btrfs_create_pending_block_groups
 it is unlikely that a path needs to be COWed more than once (unless
 writepages() for the btree inode is called by mm in between) and that
 compensated for the need of creating any new nodes/leaves in the chunk
 tree.

 So fix this by ensuring we don't accumulate too large a list of new block
 groups in a transaction's handles new_bgs list, inserting/updating the
 chunk tree for all accumulated new block groups and releasing the unused
 space from the chunk block reserve whenever the list becomes sufficiently
 large. This is a generic solution even though the problem currently can
 only happen when starting the writeout of the free space caches for all
 dirty block groups 

Re: Very various speed of grep operation on btrfs partition

2015-12-13 Thread Михаил Гаврилов
OK, I made another experiment. I bought a new HDD and formatted it with
the btrfs file system. I also increased the size of the grep data and
made a bash script which automates the testing:

#!/bin/bash

#For testing on windows machine
#grep_path='/cygdrive/e/Sources/inside'
#For testing on new HDD
#grep_path='/run/media/mikhail/eaa531cd-25f4-4e00-b31f-22665faa9768/sources/inside'
#For testing in real life
grep_path='/home/mikhail/sources/inside'
command="grep -rn 'float:left;display: block;height: 24px;line-height: 1.2em;position: relative;text-align: center;white-space: nowrap;width: 80px;' '$grep_path'"
log_file='res.log'

exec 3>&1 1>>${log_file} 2>&1
while true
do
   (( count++ ))
   echo "PASS: $count at $(date +%T)" | tee /dev/fd/3
   echo "$command" | tee /dev/fd/3
   eval "{ time $command > /dev/null; } |& tee /dev/fd/3"
done
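
Note: successive passes over the same path will mostly measure the page
cache rather than the disk, so for comparable cold-cache numbers one
could drop the caches before each timed pass (needs root; drop_caches is
a standard Linux interface), e.g.:

sync
echo 3 > /proc/sys/vm/drop_caches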


And I got very interesting results:

Linux btrfs on the new HDD: 6.441s (the same result as in the synthetic tests)
Linux btrfs on the real-data HDD (94% used): 16m52.036s. Very bad; why???
The data is the same as in the first variant.
Windows NTFS on the new HDD: 1m27.643s

I am really disappointed that in real life (my home folder) I get such bad results.
Is it possible to make an HDD that is 94% used perform as fast as an empty one?
Both hard disks are the same model, an ST4000NM0033-9ZM170.


--
Best Regards,
Mike Gavrilov.


Determine whether a file is a reflink or not

2015-12-13 Thread Ivan Sizov
Is there a way to view the CoW structure, e.g. to know whether a file is
still just a reflink or has been modified? I copied many files from a
snapshot with --reflink=always and I want to know which files were
modified since the copying. Calculating checksums seems far too slow.
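
The closest I can think of, as a sketch (assuming an unmodified reflinked
copy still maps to exactly the same physical extents as the snapshot
original, and that nothing like balance or defragment relocated them in
between):

#!/bin/bash
# Usage: same-extents.sh <copy> <snapshot-original>
# Compare physical extent maps via filefrag (from e2fsprogs); identical
# maps suggest the copy is still a pure, unmodified reflink.
phys() { filefrag -v "$1" | awk '/^[[:space:]]*[0-9]+:/ {print $4, $5}'; }
if [ "$(phys "$1")" = "$(phys "$2")" ]; then
    echo "identical extent maps: likely an unmodified reflink"
else
    echo "extent maps differ: modified (or relocated) since the copy"
fi

Would that be reliable, or is there a proper interface for this?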

-- 
Ivan Sizov


Re: Still not production ready

2015-12-13 Thread Marc MERLIN
On Sun, Dec 13, 2015 at 11:35:08PM +0100, Martin Steigerwald wrote:
> Hi!
> 
> For me it is still not production ready. Again I ran into:
> 
> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random 
> write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
 
Sorry you're having issues. I haven't seen this before myself.
I couldn't find the kernel version you're using in your Email or the bug
you filed (quick scan).

That's kind of important :)

Marc
 
> No matter whether SLES 12 uses it as default for root, no matter whether 
> Fujitsu and Facebook use it: I will not let this onto any customer machine 
> without lots and lots of underprovisioning and rigorous free space 
> monitoring. 
> Actually I will renew my recommendations in my trainings to be careful with 
> BTRFS.
> 
> From my experience the monitoring would check for:
> 
> merkaba:~> btrfs fi show /home
> Label: 'home'  uuid: […]
> Total devices 2 FS bytes used 156.31GiB
> devid1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
> devid2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
> 
> If "used" is same as "size" then make big fat alarm. It is not sufficient for 
> it to happen. It can run for quite some time just fine without any issues, 
> but 
> I never have seen a kworker thread using 100% of one core for extended period 
> of time blocking everything else on the fs without this condition being met.
> 
> 
> In addition to that, the last time I tried, scrub aborted on any of my BTRFS 
> filesystems. I reported that in another thread here, which got completely ignored so 
> far. I think I could go back to a 4.2 kernel to make this work.
> 
> 
> I am not going to bother going into more detail on any of this, as I get the 
> impression that my bug reports and feedback get ignored. So I spare myself the 
> time to do this work for now.
> 
> 
> The only thing I wonder now is whether this all could be because my /home is 
> already more than one and a half years old. Maybe newly created filesystems are 
> created in a way that prevents these issues? But it already has a nice global reserve:
> 
> merkaba:~> btrfs fi df /
> Data, RAID1: total=27.98GiB, used=24.07GiB
> System, RAID1: total=19.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=536.80MiB
> GlobalReserve, single: total=192.00MiB, used=0.00B
> 
> 
> Actually when I see that this free space thing is still not fixed for good I 
> wonder whether it is fixable at all. Is this an inherent issue of BTRFS or 
> more generally COW filesystem design?
> 
> I think it got somewhat better. It took much longer to come into that state 
> again than last time, but still, blocking like this is *no* option for a 
> *production ready* filesystem.
> 
> 
> 
> I am seriously considering switching to XFS for my production laptop again, 
> because I never saw any of these free space issues with any of the XFS or ext4 
> filesystems I used in the last 10 years.
> 
> Thanks,
> -- 
> Martin

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: attacking btrfs filesystems via UUID collisions?

2015-12-13 Thread Christoph Anton Mitterer
On Fri, 2015-12-11 at 16:06 -0700, Chris Murphy wrote:
> For anything but a new and empty Btrfs volume
What's the influence of the fs being new/empty?

> this hypothetical
> attack would be a ton easier to do on LVM and mdadm raid because they
> have a tiny amount of metadata to spoof compared to a Btrfs volume
> with even a little bit of data on it.
Uhm I haven't said that other systems properly handle this kind of
attack. ;-)
Guess that would need to be evaluated...


>  I think this concern is overblown.
I don't think so. Let me give you an example: there is an attack[0]
against crypto where the attacker listens via a smartphone's microphone
and extracts keys based on the acoustics of a computer where GnuPG runs.
This is surely not an attack many people would have considered even
remotely possible, but in fact it works, at least under lab conditions.

I guess the same applies to possible attack vectors like this one here.
The stronger actual crypto gets, and the more robust software gets in
terms of classical security holes (buffer overruns and so on), the more
attackers will try to go alternative ways.


> I'm suggesting bitwise identical copies being created is not what is
> wanted most of the time, except in edge cases.
mhh... well, there's the VM case, e.g. duplicating a template VM,
booting it and deploying software. Guess that's already common enough.
There are people who want to use btrfs on top of LVM and use the
snapshot functionality of that... another use case.
Some people may want to use it on top of MD (for whatever reason)... at
least in the mirroring RAID case, the kernel would see the same btrfs
twice.

Apart from that, btrfs should be a general-purpose fs, not just a
desktop or server fs.
So edge cases like forensics (where it's common to create bitwise
identical images) shouldn't be forgotten either.


> > >If your workflow requires making an exact copy (for the shelf or
> > > for
> > > an emergency) then dd might be OK. But most often it's used
> > > because
> > > it's been easy, not because it's a good practice.
> > Ufff.. I wouldn't go that far as to call something here bad or good
> > practice.
> 
> It's not just bad practice, it's sufficiently sloppy that it's very
> nearly user sabotage. That this is due to innocent ignorance, and a
> long standing practice that's bad advice being handed down from
> previous generations doesn't absolve the practice and mean we should
> invent esoteric work arounds for what is not a good practice. We have
> all sorts of exhibits why it's not a good idea.
Well, if you don't give any real arguments or technical reasons (apart
from "working around software that doesn't handle this well"), I
consider this just a repetition of the baseless claim that the
long-standing practice is bad.


> I disagree. It was due to the rudimentary nature of earlier
> filesystems' metadata paradigm that it worked. That's no longer the
> case.
Well, in the end it's of course up to the developers to decide whether
this is acceptable or not, but being on the admin/end-user side, I can
at least say that not everyone out there would accept "this is no longer
the case" as a valid explanation when their fs was corrupted or attacked.


> Sure, the kernel code should get smarter about refusing to mount in
> ambiguous cases, so that a file system isn't nerfed. That shouldn't
> happen. But we also need to get away from this idea that dd is
> actually an appropriate tool for making a file system copy.
Uhm... your view is a bit narrow here... again, take the forensics
example.

But apart from that... I never said that dd should be the regular tool
for people to copy a btrfs image. Typically it would simply be slower
than other means.

But for some solutions, it may still be the better choice, or at least
the only choice implemented right now (e.g. I wouldn't know of a
hypervisor system that looks at an existing disk image, finds any
btrfs in it (possibly "hidden" below further block layers), and
cleanly copies the data into a freshly created btrfs image with the
same structure).
AFAIK, there's not even a solution right now that copies a complete
btrfs, with snapshots etc., preserving all reflinks. At least nothing
official that works in one command.

Long story short, I think we can agree that - dd or not - corruption
or attack vectors shouldn't be possible.
Even if only to protect against the btrfs-on-hardware-RAID1 case
which is accidentally switched to JBOD mode...
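
As an aside, for the plain-dd case there is at least a mitigation these
days: newer btrfs-progs (>= 4.1, AFAIK) can regenerate the filesystem
UUID on the clone while it is unmounted. A sketch (the device names are
placeholders):

# clone the device, then give the copy a fresh fsid before both copies
# are ever visible to the kernel at the same time
dd if=/dev/sda1 of=/dev/sdb1 bs=1M
btrfstune -u /dev/sdb1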


Cheers,
Chris.


[0] http://www.tau.ac.il/~tromer/papers/acoustic-20131218.pdf




Still not production ready

2015-12-13 Thread Martin Steigerwald
Hi!

For me it is still not production ready. Again I ran into:

btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random 
write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401


No matter whether SLES 12 uses it as default for root, no matter whether 
Fujitsu and Facebook use it: I will not let this onto any customer machine 
without lots and lots of underprovisioning and rigorous free space monitoring. 
Actually I will renew my recommendations in my trainings to be careful with 
BTRFS.

From my experience the monitoring would check for:

merkaba:~> btrfs fi show /home
Label: 'home'  uuid: […]
Total devices 2 FS bytes used 156.31GiB
devid1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
devid2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home

If "used" is same as "size" then make big fat alarm. It is not sufficient for 
it to happen. It can run for quite some time just fine without any issues, but 
I never have seen a kworker thread using 100% of one core for extended period 
of time blocking everything else on the fs without this condition being met.
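
Roughly, as a sketch of that check (the mount point and the awk field
positions are assumptions that may need adjusting to the btrfs-progs
version at hand):

#!/bin/bash
# Alarm when any device of the fs is fully allocated ("used" == "size").
btrfs fi show /home | awk '
  $1 == "devid" && $4 == $6 {
      print "ALARM: " $8 " fully allocated (" $6 " of " $4 ")"
  }'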


In addition to that, the last time I tried, scrub aborted on any of my BTRFS 
filesystems. I reported that in another thread here, which got completely ignored so 
far. I think I could go back to a 4.2 kernel to make this work.


I am not going to bother going into more detail on any of this, as I get the 
impression that my bug reports and feedback get ignored. So I spare myself the 
time to do this work for now.


The only thing I wonder now is whether this all could be because my /home is 
already more than one and a half years old. Maybe newly created filesystems are 
created in a way that prevents these issues? But it already has a nice global reserve:

merkaba:~> btrfs fi df /
Data, RAID1: total=27.98GiB, used=24.07GiB
System, RAID1: total=19.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=536.80MiB
GlobalReserve, single: total=192.00MiB, used=0.00B


Actually when I see that this free space thing is still not fixed for good I 
wonder whether it is fixable at all. Is this an inherent issue of BTRFS or 
more generally COW filesystem design?

I think it got somewhat better. It took much longer to come into that state 
again than last time, but still, blocking like this is *no* option for a 
*production ready* filesystem.



I am seriously considering switching to XFS for my production laptop again, because 
I never saw any of these free space issues with any of the XFS or ext4 
filesystems I used in the last 10 years.

Thanks,
-- 
Martin


Kernel lockup, might be helpful log.

2015-12-13 Thread Birdsarenice
I've finally finished deleting all those nasty unreliable Seagate drives 
from my array. During the process I crashed my server - over, and over, 
and over. Completely gone - screen blank, controls unresponsive, no 
network activity (no, I don't have root on btrfs - data only). Most 
annoying, but I think btrfs survived it all somehow - it's scrubbing now.


Meanwhile, I did get lucky: At one crash I happened to be logged in and 
was able to hit dmesg seconds before it went completely. So what I have 
here is information that looks like it'll help you track down a 
rarely-encountered and hard-to-reproduce bug which can cause the system 
to lock up completely in event of certain types of hard drive failure. 
It might be nothing, but perhaps someone will find it of use - because 
it'd be a tricky one to both reproduce and get a good error report if it 
did occur.


I see an 'invalid opcode' error in here, that's pretty unusual - and 
again it even gives a file name and line number to look at. The root 
cause of all my issues is the NCQ issue with Seagate 8TB archive drives, 
which is Someone Else's Problem - but I think some good can come of 
this, as these exotic forms of corruption and weird drive semi-failures 
have revealed ways in which btrfs's error handling could be made more 
graceful.


Meanwhile I remain impressed that btrfs appears to have kept all my data 
intact even through all these issues.
[11668.697976] BTRFS info (device sde1): relocating block group 5932520046592 
flags 17
[11676.977183] BTRFS info (device sde1): found 20 extents
[11686.138376] BTRFS info (device sde1): found 20 extents
[11686.567242] BTRFS info (device sde1): relocating block group 5935741272064 
flags 17
[11695.452025] BTRFS info (device sde1): found 17 extents
[11704.627191] BTRFS info (device sde1): found 17 extents
[11705.966792] BTRFS info (device sde1): relocating block group 5938962497536 
flags 17
[11715.343790] BTRFS info (device sde1): found 15 extents
[11724.219660] BTRFS info (device sde1): found 15 extents
[11724.910970] BTRFS info (device sde1): relocating block group 5940036239360 
flags 17
[11733.289804] BTRFS info (device sde1): found 22 extents
[11741.538676] BTRFS info (device sde1): found 22 extents
[11742.019752] BTRFS info (device sde1): relocating block group 5941109981184 
flags 17
[11751.676514] BTRFS info (device sde1): found 14 extents
[11759.404371] [ cut here ]
[11759.404439] kernel BUG at ../fs/btrfs/extent-tree.c:1832!
[11759.404514] invalid opcode:  [#1] PREEMPT SMP 
[11759.404600] Modules linked in: xt_nat nf_conntrack_ipv6 nf_defrag_ipv6 
ip6table_filter ip6_tables xt_conntrack xt_tcpudp ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_filter iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables af_packet 
bridge stp llc iscsi_ibft iscsi_boot_sysfs btrfs xor x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul 
crc32c_intel raid6_pq aesni_intel aes_x86_64 lrw gf128mul iTCO_wdt glue_helper 
ablk_helper iTCO_vendor_support cryptd pcspkr i2c_i801 ib_mthca lpc_ich tpm_tis 
8250_fintek ie31200_edac mfd_core shpchp battery edac_core thermal tpm video 
fan button processor hid_generic usbhid uas usb_storage amdkfd amd_iommu_v2 
radeon igb dca i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
[11759.405914]  fb_sys_fops ttm drm xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore 
usb_common e1000e ptp pps_core fjes vhost_net tun vhost macvtap macvlan sg 
rpcrdma sunrpc rdma_cm iw_cm ib_ipoib ib_cm ib_sa ib_umad ib_mad ib_core ib_addr
[11759.406328] CPU: 2 PID: 2060 Comm: btrfs Not tainted 4.3.0-2-default #1
[11759.406414] Hardware name: FUJITSU PRIMERGY TX100 S3P/D3009-B1, BIOS 
V4.6.5.3 R1.10.0 for D3009-B1x 12/18/2012
[11759.406555] task: 88042f832040 ti: 88041cae4000 task.ti: 
88041cae4000
[11759.406659] RIP: 0010:[]  [] 
insert_inline_extent_backref+0xc6/0xd0 [btrfs]
[11759.406815] RSP: 0018:88041cae7830  EFLAGS: 00010293
[11759.406889] RAX:  RBX:  RCX: 0001
[11759.406986] RDX: 8800 RSI: 0001 RDI: 
[11759.407085] RBP: 88041cae7890 R08: 4000 R09: 88041cae7748
[11759.407184] R10:  R11: 0003 R12: 880412615800
[11759.407283] R13:  R14:  R15: 8800c92aef50
[11759.407383] FS:  7f2e3b1678c0() GS:88042fd0() 
knlGS:
[11759.407497] CS:  0010 DS:  ES:  CR0: 80050033
[11759.407576] CR2: 55f473f59f28 CR3: 0004180be000 CR4: 001406e0
[11759.407675] Stack:
[11759.407706]   0102  

[11759.407831]  0001 88041170d800 32b6 
88041170d800
[11759.407949]  88030f0203b0 8800c92aef50 0102 
88040b22e000
[11759.408069] Call Trace:
[11759.408127]  

Re: attacking btrfs filesystems via UUID collisions?

2015-12-13 Thread Christoph Anton Mitterer
On Sat, 2015-12-12 at 02:34 +0100, S.J. wrote:
> A bit more about the dd-is-bad-topic:
> 
> IMHO it doesn't matter at all.
Yes, fully agree.


> a) For this specific problem here, fixing a security problem
> automatically
> fixes the risk of data corruption because careless cloning+mounting
> (without UUID adjustments) too.
> So, if the user likes to use dd with its disadvantages, like waiting 
> hours to
> copy lots of free space, and bad practice, etc.etc., why should it
> concern
> the Btrfs developers and/or us here?
> 
> b) At wider scope; while Btrfs is more complex than Xfs etc.,
> currently
> there is no other reason why things could go bad when dd'ing
> something.
> As long as this holds, is there really a place in the official Btrfs 
> documentation
> for telling the users "dd is bad [practice]"?
> ...
fully agree as well. :-)


Cheers,
Chris.



Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-12-13 Thread Qu Wenruo



Laurent Bonnaud wrote on 2015/12/11 15:21 +0100:

On 04/12/2015 01:47, Qu Wenruo wrote:


[run btrfsck]


I did that too, with an old btrfsck version (4.0), and it found the
following errors.
Then I did a btrfsck --repair, and I have been able to complete my "du -s" test.
The next step will be to run a "btrfs scrub" to check whether data loss did happen...


Glad to hear that btrfsck --repair can fix it.
It seems to be a space cache problem, and normally mounting with -o
clear_cache should handle it.

But btrfsck --repair should also handle it well.
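
E.g. something like the following one-time mount (the mount point is
just a placeholder; the free space cache is then rebuilt on later
mounts):

mount -o clear_cache /dev/sdb1 /mnt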

Thanks,
Qu



# btrfsck /dev/sdb1
Checking filesystem on /dev/sdb1
UUID: f6d4db2e-962b-42db-87b1-35064a4d38e0
checking extents
checking free space cache
block group 314635714560 has wrong amount of free space
failed to load free space cache for block group 314635714560
There is no free space entry for 353290420224-353290764288
There is no free space entry for 353290420224-353827291136
cache appears valid but isnt 353290420224
There is no free space entry for 541732175872-541732208640
There is no free space entry for 541732175872-542268981248
cache appears valid but isnt 541732110336
Wanted bytes 32768, found 262144 for off 1008273178624
Wanted bytes 536625152, found 262144 for off 1008273178624
cache appears valid but isnt 1008272932864
block group 1475887497216 has wrong amount of free space
failed to load free space cache for block group 1475887497216
block group 1823242977280 has wrong amount of free space
failed to load free space cache for block group 1823242977280
There is no free space entry for 1827001073664-1827002810368
There is no free space entry for 1827001073664-1827537944576
cache appears valid but isnt 1827001073664
There is no free space entry for 1969305501696-1969305518080
There is no free space entry for 1969305501696-1969842290688
cache appears valid but isnt 1969305419776
There is no free space entry for 2021381947392-2021381963776
There is no free space entry for 2021381947392-2021918769152
cache appears valid but isnt 2021381898240
There is no free space entry for 2027287478272-2027287724032
There is no free space entry for 2027287478272-2027824349184
cache appears valid but isnt 2027287478272
There is no free space entry for 2143889227776-2143889244160
There is no free space entry for 2143889227776-2144426000384
cache appears valid but isnt 2143889129472
found 1977224107644 bytes used err is -22
total csum bytes: 1925245108
total tree bytes: 5773115392
total fs tree bytes: 3504685056
total extent tree bytes: 156975104
btree space waste bytes: 780048699
file data blocks allocated: 1971884707840
  referenced 1971875930112
btrfs-progs v4.0






Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)

2015-12-13 Thread Christoph Anton Mitterer
Two more on these:

On Thu, 2015-11-26 at 00:33 +0000, Hugo Mills wrote:
> > 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
>    After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.
AFAIU, the one that gets fragmented then is the snapshot, right, and
the "original" will stay in place where it was? (Which is of course
good, because one probably marked it nodatacow to avoid that
fragmentation problem on internal writes.)

I'd assume the same happens when I do a reflink cp.

Can one make a copy where one still has atomicity (which I guess
implies CoW), but where the destination file isn't heavily fragmented
afterwards? I.e. there's some pre-allocation, and then cp really does
copy each block (just with everything at the state of the time when I
started cp, not including any other internal changes made on the
source in between).
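
Something like the following might give that (a sketch; the paths are
illustrative, and note that the snapshot itself triggers exactly one
round of CoW on later writes to the nodatacow original):

  # frozen point-in-time view of the subvolume
  btrfs subvolume snapshot -r /data /data/.tmp_snap
  # real block-by-block copy, no reflink, so the result is compact
  cp --reflink=never /data/.tmp_snap/vm.img /data/vm.img.copy
  btrfs subvolume delete /data/.tmp_snap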


And one more:
You both said auto-defrag is generally recommended.
Does that also apply for SSDs (where we want to avoid unnecessary
writes)?
It does seem to get enabled, when SSD mode is detected.
What would it actually do on an SSD?


Cheers,
Chris.



dear developers, can we have notdatacow + checksumming, plz?

2015-12-13 Thread Christoph Anton Mitterer
(consider that question being asked with that face on: http://goo.gl/LQaOuA)

Hey.

I've had some discussions on the list these days about not having
checksumming with nodatacow (mostly with Hugo and Duncan).

They both basically told me it wouldn't be straightforwardly possible
without CoW, and Duncan thinks it may not be so much necessary, but
none of them could give me really hard arguments why it cannot work
(or perhaps I was just too stupid to understand them ^^)... while at
the same time I think that it would be of utmost importance to have
checksumming (real-world examples below).

Also, I remember that in 2014, Ted Ts'o told me that there are some
plans ongoing to get data checksumming into ext4, with possibly even
some guy at RH actually doing it sooner or later.

Since these threads were rather admin-work-centric, developers may have
skipped them; therefore, I decided to write down some thoughts, label
them with a more attractive subject, and give the topic some bigger
attention.
O:-)




1) Motivation: why it makes sense to have checksumming (especially
in the nodatacow case)


I think of all major btrfs features I know of (apart from the CoW
itself and having things like reflinks), checksumming is perhaps the
one that distinguishes it the most from traditional filesystems.

Sure we have snapshots, multi-device support and compression - but we
could have had that as well with LVM and software/hardware RAID... (and
ntfs supported compression IIRC ;) ).
Of course, btrfs does all that in a much smarter way, I know, but it's
nothing generally new.
The *data* checksumming at filesystem level, to my knowledge, is,
however. Especially since it's always verified. Awesome. :-)


When one starts to get a bit deeper into btrfs (from the admin/end-user 
side) one sooner or later stumbles across the recommendation/need to
use nodatacow for certain types of data (DBs, VM images, etc.) and the
reason, AFAIU, being the inherent fragmentation that comes along with
the CoW, which is especially noticeable for those types of files with
lots of random internal writes.
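
(For reference, the usual way that recommendation is applied; chattr +C
only takes effect on empty files, so it's typically set on a directory
so that new files inherit it. Paths are illustrative:)

  mkdir /var/lib/vms
  chattr +C /var/lib/vms
  truncate -s 20G /var/lib/vms/disk.img   # new file, inherits nodatacow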

Now Duncan implied that this could improve in the future, with the
auto-defragmentation getting (even) better, defrag becoming usable
again for those that do snapshots or reflinked copies, and btrfs itself
generally maturing more and more.
But I kinda wonder to what extent one will really be able to solve
what seems to me a CoW-inherent "problem"...
Even *if* one can make the auto-defrag much smarter, it would still
mean that such files, like big DBs, VMs, or scientific datasets that
are internally rewritten, may get more or less constantly defragmented.
That may be quite undesired...
a) for performance reasons (when I consider our research software which
often has IO as the limiting factor and where we want as much IO being
used by actual programs as possible)...
b) SSDs...
Not really sure about that; btrfs seems to enable the autodefrag even
when an SSD is detected,... what is it doing? Placing the block in a
smart way on different chips so that accesses can be better
parallelised by the controller?
Anyway, (a) alone could already be argument enough not to solve the
problem by a smart [auto-]defrag, should that actually be implemented.

So I think having nodatacow is great and not just a workaround till
everything else gets better to handle these cases.
Thus, checksumming, which is such a vital feature, should also be
possible for that.


Duncan also mentioned that in some of those cases, the integrity is
already protected by the application layer, making it less important to
have it at the fs layer.
Well, this may be true for file-sharing protocols, but I wouldn't know
that relational DBs really do checksumming of the data.
They have journals, of course, but these protect against crashes, not
against silent block errors and that like.
And I wouldn't know that VM hypervisors do checksumming (but perhaps
I've just missed that).

Here I can give a real-world example, from the Tier-2 that I run for
LHC at work/university.
We have large amounts of storage (perhaps not as large as what Google
and Facebook have, or what the NSA stores about us)... but it's still
some ~ 2PiB, or a bit more.
That's managed with some special storage management software called
dCache. dCache even stores checksums, but per file, so that means for
normal reads these cannot be verified (well, technically it's
supported, but with our usual file sizes it is not working), so what
remains are scrubs.
For the two PiB, we have some... roughly 50-60 nodes, each with
something between 12 and 24 disks, usually in either one or two RAID6
volumes, all different kinds of hard disks.
And we do run these scrubs quite rarely, since it costs IO that could
be used for actual computing jobs (a problem that wouldn't be there
with how btrfs calculates the sums on read, the data is then read
anyway)... so likely there are even more errors that are just never
noticed, because the datasets are removed 

[PATCH] btrfs-progs: Format change for btrfs fi df

2015-12-13 Thread Qu Wenruo
The GlobalReserve space in 'btrfs fi df' is always confusing for a lot
of users.
It is not a special chunk type like DATA or METADATA; it's in fact a
sub-type of METADATA.

So change the output to skip GlobalReserve by default, and add its
total to the metadata used value.
Also add a new option '-r|--reserve' to show the GlobalReserve, but
skip the profile field for it.
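
With the patch applied, the output would look roughly like this (a
sketch derived from the print_df() change below; the sizes are made
up):

  # btrfs filesystem df -r /mnt
  Data, single: total=8.00GiB, used=6.42GiB
  System, DUP: total=32.00MiB, used=16.00KiB
  Metadata, DUP: total=1.00GiB, used=712.00MiB
   \- GlobalReserve: reserved=512.00MiB, used=0.00B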

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-filesystem.asciidoc |  8 ++
 cmds-filesystem.c   | 51 ++---
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/Documentation/btrfs-filesystem.asciidoc 
b/Documentation/btrfs-filesystem.asciidoc
index 31cd51b..510c23f 100644
--- a/Documentation/btrfs-filesystem.asciidoc
+++ b/Documentation/btrfs-filesystem.asciidoc
@@ -22,6 +22,14 @@ Show space usage information for a mount point.
 +
 `Options`
 +
+-r|--reserve
+also show Global Reserve space info.
++
+Global Reserve space is reserved space from metadata. It's reserved for Btrfs
+metadata COW.
++
+It will be counted as 'used' space in metadata space info.
++
 -b|--raw
 raw numbers in bytes, without the 'B' suffix
 -h|--human-readable
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 25317fa..26e62e0 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -123,6 +123,8 @@ static const char * const filesystem_cmd_group_usage[] = {
 static const char * const cmd_filesystem_df_usage[] = {
"btrfs filesystem df [options] ",
"Show space usage information for a mount point",
+   "",
+   "-r|--reserve   show global reserve space info"
HELPINFO_UNITS_SHORT_LONG,
NULL
 };
@@ -175,12 +177,32 @@ static int get_df(int fd, struct btrfs_ioctl_space_args 
**sargs_ret)
return 0;
 }
 
-static void print_df(struct btrfs_ioctl_space_args *sargs, unsigned unit_mode)
+static void print_df(struct btrfs_ioctl_space_args *sargs, unsigned unit_mode,
+int show_reserve)
 {
u64 i;
+   u64 global_reserve = 0;
struct btrfs_ioctl_space_info *sp = sargs->spaces;
 
+   /* First iterate to get global reserve space size */
for (i = 0; i < sargs->total_spaces; i++, sp++) {
+   if (sp->flags & BTRFS_SPACE_INFO_GLOBAL_RSV)
+   global_reserve = sp->total_bytes;
+   }
+
+   for (i = 0, sp = sargs->spaces; i < sargs->total_spaces; i++, sp++) {
+   if (sp->flags & BTRFS_SPACE_INFO_GLOBAL_RSV) {
+   if (!show_reserve)
+   continue;
+   printf(" \\- %s: reserved=%s, used=%s\n",
+   btrfs_group_type_str(sp->flags),
+   pretty_size_mode(sp->total_bytes, unit_mode),
+   pretty_size_mode(sp->used_bytes, unit_mode));
+   continue;
+   }
+
+   if (sp->flags & BTRFS_BLOCK_GROUP_METADATA)
+   sp->used_bytes += global_reserve;
printf("%s, %s: total=%s, used=%s\n",
btrfs_group_type_str(sp->flags),
btrfs_group_profile_str(sp->flags),
@@ -196,14 +218,35 @@ static int cmd_filesystem_df(int argc, char **argv)
int fd;
char *path;
DIR *dirstream = NULL;
+   int show_reserve = 0;
unsigned unit_mode;
 
unit_mode = get_unit_mode_from_arg(&argc, argv, 1);
 
-   if (argc != 2 || argv[1][0] == '-')
+   while (1) {
+   int c;
+   static const struct option long_options[] = {
+   { "reserve", no_argument, NULL, 'r'},
+   { NULL, 0, NULL, 0}
+   };
+
+   c = getopt_long(argc, argv, "r", long_options, NULL);
+   if (c < 0)
+   break;
+   switch (c) {
+   case 'r':
+   show_reserve = 1;
+   break;
+   default:
+   usage(cmd_filesystem_df_usage);
+   }
+   }
+
+   argc = argc - optind;
+   if (check_argc_exact(argc, 1))
usage(cmd_filesystem_df_usage);
 
-   path = argv[1];
+   path = argv[optind];
 
fd = btrfs_open_dir(path, &dirstream, 1);
if (fd < 0)
@@ -212,7 +255,7 @@ static int cmd_filesystem_df(int argc, char **argv)
ret = get_df(fd, &sargs);
 
if (ret == 0) {
-   print_df(sargs, unit_mode);
+   print_df(sargs, unit_mode, show_reserve);
free(sargs);
} else {
fprintf(stderr, "ERROR: get_df failed %s\n", strerror(-ret));
-- 
2.6.3





Re: btrfs check inconsistency with raid1, part 1

2015-12-13 Thread Qu Wenruo



Chris Murphy wrote on 2015/12/13 21:16 -0700:

Part 1= What to do about it? This post.
Part 2 = How I got here? I'm still working on the write up, so it's
not yet posted.

Summary:

2 dev (spinning rust) raid1 for data and metadata.
kernel 4.2.6, btrfs-progs 4.2.2

btrfs check with devid 1 and 2 present produces thousands of scary
messages, e.g.
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000


Checked the full output.
The interesting part is, the calculated result is always E4E3BDB6, and 
wanted is always all 0.


I assume E4E3BDB6 is crc32 of all 0 data.
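
(That guess is easy to check with a few lines of C; a sketch assuming
btrfs's crc32c, i.e. the Castagnoli polynomial with a ~0 seed and a
final inversion, run over the node past its 32-byte csum field. The
exact offset and length here are my assumptions:)

  #include <stdio.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* bitwise CRC32C, reflected polynomial 0x82F63B78 */
  static uint32_t crc32c(const unsigned char *buf, size_t len)
  {
          uint32_t crc = 0xFFFFFFFFu;
          while (len--) {
                  crc ^= *buf++;
                  for (int k = 0; k < 8; k++)
                          crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u
                                          : crc >> 1;
          }
          return ~crc;
  }

  int main(void)
  {
          unsigned char node[16384];              /* nodesize, as below */
          memset(node, 0, sizeof(node));
          /* btrfs checksums everything after the csum field itself */
          printf("%08X\n", crc32c(node + 32, sizeof(node) - 32));
          return 0;
  }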


If there is a full disk dump, it will be much easier to find where the 
problem is.

But I'm afraid it won't be possible.

At least, 'btrfs-debug-tree -t 2' should help to locate what's wrong 
with the bytenr in the warning.



The good news is, the fs seems to be OK without major problems:
apart from the csum errors, btrfsck doesn't give any other error/warning.


btrfs check with devid 1 or devid2 separate (the other is missing)
produces no such scary messages at all, but instead messages e.g.
failed to load free space cache for block group 357585387520

a. This inconsistency is unexpected.
b. the 'btrfs check' with combined devices gives no insight to the
seriousness of "checksum verify failed" messages, or what the solution
is.


I guess btrfsck assembled the devices wrongly, but that's just my
personal guess.
And since I can't reproduce it in my test environment, it won't be easy
to find the root cause.



c. combined or separate+degraded, read-only mounts succeed with no
errors in user space or dmesg; only normal mount messages happen. With
both devs ro mounted, I was able to completely btrfs send/receive the
most recent two ro snapshots comprising 100% (minus stale historical)
data on the drive, with zero errors reported.
d. no read-write mount attempt has happened since "the incident" which
will be detailed in part 2.


Details:


The full devid1&2 btrfs check is long and not very interesting, so
I've put that here:
https://drive.google.com/open?id=0B_2Asp8DGjJ9Vjd0VlNYb09LVFU

btrfs-show-super shows some differences, values denoted as
devid1/devid2. If there's no split, those values are the same for both
devids.


generation              4924/4923
root                    714189258752/714188554240
sys_array_size          129
chunk_root_generation   4918
root_level              1
chunk_root              715141414912
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             1500312748032
bytes_used              537228206080
sectorsize              4096
nodesize                16384
[snip]
cache_generation        4924/4923
uuid_tree_generation    4924/4923
[snip]
dev_item.total_bytes    750156374016
dev_item.bytes_used     541199433728

Perhaps useful, is at the time of "the incident" this volume was rw
mounted, but was being used by a single process only: btrfs send. So
it was used as a source. No writes, other than btrfs's own generation
increment, were happening.

So in theory, this should perhaps be the simplest case of "what do I
do now?" and even makes me wonder if a normal rw mount should just fix
this up: either btrfs uses generation 4924 and updates all changes
from 4923 and 4924 automatically to devid2 so they are now in sync, or
it automatically discards generation 4924 from devid1, so both devices
are in sync.

The workload, circumstances of "the incident", the general purpose of
btrfs, and the likelihood a typical user would never have even become
aware of "the incident" until much later than I did, makes me strongly
feel like Btrfs should be able to completely recover from this, with
just a rw mount and eventually the missync'd generations will
autocorrect. But I don't know that. And I get essentially no advice
from btrfs check results.

So. What's the theory in this case? And then does it differ from reality?


Personally speaking, it may be a false alert from btrfsck.
So in this case, I can't provide much help.

If you're brave enough, mount it rw to see what will happen (although it
may mount just OK).


Thanks,
Qu




Re: Will "btrfs check --repair" fix the mounting problem?

2015-12-13 Thread Qu Wenruo



Chris Murphy wrote on 2015/12/11 11:24 -0700:

On Fri, Dec 11, 2015 at 10:50 AM, Ivan Sizov  wrote:

Btrfs crashes in few seconds after mounting RW.
If it's important: the volume was converted from ext4. "ext2_saved"
subvolume still presents.

dmesg:
[  625.998387] BTRFS info (device sda1): disk space caching is enabled
[  625.998392] BTRFS: has skinny extents
[  627.727708] BTRFS: checking UUID tree
[  708.514128] [ cut here ]
[  708.514161] WARNING: CPU: 1 PID: 2263 at fs/btrfs/extent-tree.c:6255 
__btrfs_free_extent.isra.68+0x8c8/0xd70 [btrfs]()
[  708.514164] Modules linked in: bnep bluetooth rfkill ip6t_rpfilter 
ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_broute bridge ebtable_filter 
ebtable_nat ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
ip6table_raw ip6table_security ip6table_mangle ip6table_filter ip6_tables 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
iptable_raw iptable_security iptable_mangle gpio_ich coretemp kvm_intel kvm 
iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_hda_codec_generic 
snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device 
lpc_ich snd_pcm snd_timer ppdev snd i2c_i801 mei_me mei soundcore parport_pc 
parport shpchp tpm_infineon tpm_tis tpm acpi_cpufreq nfsd auth_rpcgss nfs_acl 
lockd grace isofs squashfs btrfs xor raid6_pq i915 hid_logitech_hidpp
[  708.514277]  8021q garp stp video llc mrp i2c_algo_bit drm_kms_helper r8169 
uas crc32c_intel drm serio_raw mii hid_logitech_dj usb_storage scsi_dh_rdac 
scsi_dh_emc scsi_dh_alua sunrpc loop
[  708.514311] CPU: 1 PID: 2263 Comm: btrfs-transacti Not tainted 
4.2.3-300.fc23.x86_64 #1
[  708.514315] Hardware name: MSI MS-7636/H55M-P31(MS-7636)   , BIOS V1.9 
09/14/2010
[  708.514319]   f50458a6 880066b03ad8 
81771fca
[  708.514326]    880066b03b18 
8109e4a6
[  708.514332]  0002 00252f595000 fffe 

[  708.514338] Call Trace:
[  708.514349]  [] dump_stack+0x45/0x57
[  708.514359]  [] warn_slowpath_common+0x86/0xc0
[  708.514365]  [] warn_slowpath_null+0x1a/0x20
[  708.514391]  [] __btrfs_free_extent.isra.68+0x8c8/0xd70 
[btrfs]
[  708.514429]  [] ? find_ref_head+0x5a/0x80 [btrfs]
[  708.514456]  [] __btrfs_run_delayed_refs+0x998/0x1080 
[btrfs]


Not completely sure, but it may be related to a regression in 4.2.
The regression itself is already fixed, but has not been backported to
4.2 as far as I know.


So, I'd recommend reverting to 4.1 and seeing if things get better.
Fortunately, btrfs already aborted the transaction before things got worse.


[  708.514477]  [] btrfs_run_delayed_refs.part.73+0x74/0x270 
[btrfs]
[  708.514496]  [] btrfs_run_delayed_refs+0x15/0x20 [btrfs]
[  708.514518]  [] btrfs_commit_transaction+0x56/0xad0 [btrfs]
[  708.514541]  [] transaction_kthread+0x214/0x230 [btrfs]
[  708.514564]  [] ? btrfs_cleanup_transaction+0x500/0x500 
[btrfs]
[  708.514569]  [] kthread+0xd8/0xf0
[  708.514574]  [] ? kthread_worker_fn+0x160/0x160
[  708.514581]  [] ret_from_fork+0x3f/0x70
[  708.514585]  [] ? kthread_worker_fn+0x160/0x160
[  708.514588] ---[ end trace 673f3bf2295a ]---
[  708.514594] BTRFS info (device sda1): leaf 535035904 total ptrs 204 free 
space 4451
[  708.514598]  item 0 key (159696797696 169 0) itemoff 16250 itemsize 33
[  708.514601]  extent refs 1 gen 21134 flags 2
[  708.514604]  tree block backref root 2
[  708.514609]  item 1 key (159696830464 169 1) itemoff 16217 itemsize 33
[  708.514612]  extent refs 1 gen 21134 flags 2
[  708.514615]  tree block backref root 2
[  708.514619]  item 2 key (159696846848 169 0) itemoff 16184 itemsize 33

*** a lot of similar messages ***

[  708.516923]  item 203 key (159711268864 169 0) itemoff 9551 itemsize 33
[  708.516927]  extent refs 1 gen 21082 flags 2
[  708.516930]  tree block backref root 384
[  708.516937] BTRFS error (device sda1): unable to find ref byte nr 
159708172288 parent 0 root 385  owner 2 offset 0
[  708.516944] [ cut here ]
[  708.516975] WARNING: CPU: 1 PID: 2263 at fs/btrfs/extent-tree.c:6261 
__btrfs_free_extent.isra.68+0x92f/0xd70 [btrfs]()
[  708.516979] BTRFS: Transaction aborted (error -2)
[  708.516982] Modules linked in: bnep bluetooth rfkill ip6t_rpfilter 
ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_broute bridge ebtable_filter 
ebtable_nat ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 
ip6table_raw ip6table_security ip6table_mangle ip6table_filter ip6_tables 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
iptable_raw iptable_security iptable_mangle gpio_ich coretemp kvm_intel kvm 
iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_hda_codec_generic 
snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device 
lpc_ich 

btrfs check inconsistency with raid1, part 1

2015-12-13 Thread Chris Murphy
Part 1= What to do about it? This post.
Part 2 = How I got here? I'm still working on the write up, so it's
not yet posted.

Summary:

2 dev (spinning rust) raid1 for data and metadata.
kernel 4.2.6, btrfs-progs 4.2.2

btrfs check with devid 1 and 2 present produces thousands of scary
messages, e.g.
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000

btrfs check with devid 1 or devid2 separate (the other is missing)
produces no such scary messages at all, but instead messages e.g.
failed to load free space cache for block group 357585387520

a. This inconsistency is unexpected.
b. the 'btrfs check' with combined devices gives no insight to the
seriousness of "checksum verify failed" messages, or what the solution
is.
c. combined or separate+degraded, read-only mounts succeed with no
errors in user space or dmesg; only normal mount messages happen. With
both devs ro mounted, I was able to completely btrfs send/receive the
most recent two ro snapshots comprising 100% (minus stale historical)
data on the drive, with zero errors reported.
d. no read-write mount attempt has happened since "the incident" which
will be detailed in part 2.


Details:


The full devid1&2 btrfs check is long and not very interesting, so
I've put that here:
https://drive.google.com/open?id=0B_2Asp8DGjJ9Vjd0VlNYb09LVFU

btrfs-show-super shows some differences, values denoted as
devid1/devid2. If there's no split, those values are the same for both
devids.


generation              4924/4923
root                    714189258752/714188554240
sys_array_size          129
chunk_root_generation   4918
root_level              1
chunk_root              715141414912
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             1500312748032
bytes_used              537228206080
sectorsize              4096
nodesize                16384
[snip]
cache_generation        4924/4923
uuid_tree_generation    4924/4923
[snip]
dev_item.total_bytes    750156374016
dev_item.bytes_used     541199433728

Perhaps useful, is at the time of "the incident" this volume was rw
mounted, but was being used by a single process only: btrfs send. So
it was used as a source. No writes, other than btrfs's own generation
increment, were happening.

So in theory, this should perhaps be the simplest case of "what do I
do now?" and even makes me wonder if a normal rw mount should just fix
this up: either btrfs uses generation 4924 and updates all changes
from 4923 and 4924 automatically to devid2 so they are now in sync, or
it automatically discards generation 4924 from devid1, so both devices
are in sync.

The workload, circumstances of "the incident", the general purpose of
btrfs, and the likelihood a typical user would never have even become
aware of "the incident" until much later than I did, makes me strongly
feel like Btrfs should be able to completely recover from this, with
just a rw mount and eventually the missync'd generations will
autocorrect. But I don't know that. And I get essentially no advice
from btrfs check results.

So. What's the theory in this case? And then does it differ from reality?


-- 
Chris Murphy


Re: [PATCH V2] Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state

2015-12-13 Thread Chandan Rajendra
On Sunday 13 Dec 2015 12:18:55 Alex Lyakas wrote:
> [Resending in plain text, apologies.]
> 
> Hi Chandan, Josef, Chris,
> 
> I am not sure I understand the fix to the problem.
> 
> It may happen that when updating the device tree, we need to allocate a new
> chunk via do_chunk_alloc (while we are holding the device tree root node
> locked). This is a legitimate thing for find_free_extent() to do. And
> do_chunk_alloc() call may lead to call to
> btrfs_create_pending_block_groups(), which will try to update the device
> tree. This may happen due to direct call to
> btrfs_create_pending_block_groups() that exists in do_chunk_alloc(), or
> perhaps by __btrfs_end_transaction() that find_free_extent() does after it
> completed chunk allocation (although in this case it will use the
> transaction that already exists in current->journal_info).
> So the deadlock still may happen?

Hello Alex,

The "global block reservation" (see btrfs_fs_info->global_block_rsv) aims to
solve this problem. I don't claim to have understood the behaviour of
global_block_rsv completely. However, the global block reservation makes sure
that we have enough free space reserved (see update_global_block_rsv()) for
future operations on:
- Extent tree
- Checksum tree
- Device tree 
- Tree root tree and
- Quota tree.

Tasks changing the device tree should get their space requirements satisfied
from the global block reservation. Hence such changes to the device tree
should not end up forcing find_free_extent() to allocate a new chunk.
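
If it helps intuition, here is a minimal sketch of that idea
(simplified, with made-up names; as far as I can tell the real logic
lives around use_block_rsv() in fs/btrfs/extent-tree.c):

  #include <errno.h>
  #include <stdint.h>

  struct block_rsv {
          uint64_t size;          /* target reservation       */
          uint64_t reserved;      /* currently reserved bytes */
  };

  /* try to take bytes out of one reservation */
  static int rsv_use_bytes(struct block_rsv *rsv, uint64_t bytes)
  {
          if (rsv->reserved < bytes)
                  return -ENOSPC;
          rsv->reserved -= bytes;
          return 0;
  }

  /* metadata reservation with a fallback: updates to the extent,
   * csum, device, root and quota trees may dip into the pre-filled
   * global reserve, so they never recurse into chunk allocation */
  static int reserve_metadata(struct block_rsv *rsv,
                              struct block_rsv *global_rsv,
                              uint64_t bytes)
  {
          if (rsv_use_bytes(rsv, bytes) == 0)
                  return 0;
          return rsv_use_bytes(global_rsv, bytes);
  }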

-- 
chandan



Re: Still not production ready

2015-12-13 Thread Duncan
Qu Wenruo posted on Mon, 14 Dec 2015 10:08:16 +0800 as excerpted:

> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
>> Hi!
>>
>> For me it is still not production ready.
> 
> Yes, this is the *FACT* and not everyone has a good reason to deny it.

In the above sentence, I /think/ you (Qu) agree with Martin (and I) that 
btrfs shouldn't be considered production ready... yet, and the first part 
of the sentence makes it very clear that you feel strongly about the 
*FACT*, but the second half of the sentence (after *FACT*) doesn't parse 
well in English, thus leaving the entire sentence open to interpretation, 
tho it's obvious either way that you feel strongly about it. =:^\

At the risk of getting it completely wrong, what I /think/ you meant to 
say is (as expanded in typically Duncan fashion =:^)...

Yes, this is the *FACT*, though some people have reasons to deny it.

Presumably, said reasons would include the fact that various distros are 
trying to sell enterprise support contracts to customers very eager to 
have the features that btrfs provides, and said customers are willing to 
pay for assurances that the solutions they're buying are "production 
ready", whether that's actually the case or not, presumably because said 
payment is (in practice) simply ensuring there's someone else to pin the 
blame on if things go bad.

And the demonstration of that would be the continued fact that people 
otherwise unnecessarily continue to pay rather large sums of money for 
that very assurance, when in practice, they'd get equal or better support 
not worrying about that payment, but instead actually making use of free-
of-cost resources such as this list.


[Linguistic analysis, see frequent discussion of this topic at Language 
Log, which I happen to subscribe to as I find this sort of thing 
interesting, for more commentary and examples of the same general issue: 
http://languagelog.net ]

The problem with the sentence as originally written, is that English 
doesn't deal well with multi-negation, sometimes considering each 
negation an inversion of the previous (as do most programming languages 
and thus programmers), while other times or as read/heard/interpreted by 
others repeated negation may be considered a strengthening of the 
original negation.

Regardless, mis-negation due to speaker/writer confusion is quite common 
even among native English speakers/writers.

The negating words in question here are "not" and "deny".  If you will 
note, my rewrite kept "deny", but rewrote the "not" out of the sentence, 
so there's only one negative to worry about, making the meaning much 
clearer as the reader's mind isn't left trying to figure out what the 
speaker meant with the double-negative (mistake? deliberate canceling out 
of the first negative with the second? deliberate intensifier?)  and thus 
unable to be sure one way or the other what was meant.

And just in case there would have been doubt, the explanation then makes 
doubly obvious what I think your intent was by expanding on it.  Of 
course that's easy to do as I entirely agree.

OTOH if I'm mistaken as to your intent and you meant it the other way... 
well then you'll need to do the explaining as then the implication is 
that some people have good reasons to deny it and you agree with them, 
but without further expansion, I wouldn't know where you're trying to go 
with that claim.


Just in case there's any doubt left of my own opinion on the original 
claim of not production ready in the above discussion, let me be 
explicit:  I (too) agree with Martin (and I think with Qu) that btrfs 
isn't yet production ready.  But I don't believe you'll find many on the 
list taking issue with that, as I think everybody on-list agrees, btrfs 
/isn't/ production ready.  Certainly pretty much just that has been 
repeatedly stated in individualized style by many posters including 
myself, and I've yet to see anyone take serious issue with it.

>> No matter whether SLES 12 uses it as default for root, no matter
>> whether Fujitsu and Facebook use it: I will not let this onto any
>> customer machine without lots and lots of underprovisioning and
>> rigorous free space monitoring.
>> Actually I will renew my recommendations in my trainings to be careful
>> with BTRFS.

... And were I to put money on it, my money would be on every regular on-
list poster 100% agreeing with that. =:^)

>>
>>  From my experience the monitoring would check for:
>>
>> merkaba:~> btrfs fi show /home
>>  Label: 'home'  uuid: […]
>>  Total devices 2 FS bytes used 156.31GiB
>>  devid 1 size 170.00GiB used 164.13GiB path /dev/[path1]
>>  devid 2 size 170.00GiB used 164.13GiB path /dev/[path2]
>>
>> If "used" is same as "size" then make big fat alarm. It is not
>> sufficient for it to happen. It can run for quite some time just fine
>> without any issues, but I never have seen a kworker thread using 100%
>> of one core for extended period of time 
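
(For what it's worth, that check is nearly a one-liner; a sketch that
assumes the 'btrfs fi show' output format quoted above and compares the
rounded values exactly as displayed:)

  btrfs fi show /home | awk '
          $1 == "devid" && $4 == $6 {
                  print "ALARM: " $8 " fully allocated (" $4 ")"
          }'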

Re: Kernel lockup, might be helpful log.

2015-12-13 Thread Duncan
Birdsarenice posted on Sun, 13 Dec 2015 22:55:19 + as excerpted:

> Meanwhile, I did get lucky: At one crash I happened to be logged in and
> was able to hit dmesg seconds before it went completely. So what I have
> here is information that looks like it'll help you track down a
> rarely-encountered and hard-to-reproduce bug which can cause the system
> to lock up completely in event of certain types of hard drive failure.
> It might be nothing, but perhaps someone will find it of use - because
> it'd be a tricky one to both reproduce and get a good error report if it
> did occur.
> 
> I see an 'invalid opcode' error in here, that's pretty unusual

Disclaimer:  I'm a list regular and (small-scale) sysadmin, not a dev,
and most certainly not a btrfs dev.  Take what I say with that in mind,
tho I've been active on-list for over a year and thus now have a
reasonable level of practical btrfs experience in sysadmin
configuration and crisis recovery.

You could well be quite correct with the unusual crash log and its value,
I'll leave that up to the devs to decide, but that "invalid opcode: 0000"
bit is in fact not at all unusual on btrfs.  Tho I can say it fooled me
originally as well, because it certainly /looks/ both suspicious and in
general unusual.

Based on how a dev explained it to me, I believe btrfs actually
deliberately uses opcode 0000 to trigger a semi-controlled crash in
instances where code that "should never happen" actually gets executed
for some reason, leaving the kernel in an unknown and thus untrustworthy
state, not reliable enough to write to storage devices and do a
controlled shutdown.  That's of course why the tracebacks are there, to
help the devs figure out where it was and what triggered it, but the 0000
opcode itself is actually quite frequently found in these tracebacks,
because it's the method chosen to deliberately trigger them.
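
(For the curious, the mechanism on x86 is tiny; a sketch, not the
kernel's actual BUG() macro:)

  /* ud2 is architecturally defined to raise an invalid-opcode
   * exception; the kernel's BUG() emits it, and the trap handler
   * prints the oops and backtrace seen in reports like this one */
  static inline void bug_trap(void)
  {
          __asm__ volatile("ud2");
  }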

I'd guess the same technique is actually used in various other (non-
btrfs) kernel code as well, but in fully stable code it actually is very 
rarely seen, precisely because it /does/ mean the kernel reached code 
that it is never expected to reach, meaning something specific went wrong 
to get to that point, and in fully stable code, it's rare that any code 
paths actually leading to that sort of execution point remain, as they've 
all been found over the years.

But of course btrfs, while no longer experimental, remains "still 
stabilizing and maturing, not yet fully stable or mature", so there's 
still code paths left that do still occasionally reach these intended to 
be unreachable code points, and when that happens, triggering a crash and 
hopefully getting a traceback that helps the devs figure out which code 
path has the bug and why, is a good thing to do, and this is apparently 
the way it's done.

(BTW, compliments on the nick and email address. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



btrfs check inconsistency with raid1, backstory, part 2

2015-12-13 Thread Chris Murphy
(sdb and sdc are a raid0, the btrfs receive destination; they don't
come up in this dmesg at all)

sdd and sde are a raid1, the btrfs send source

All four devices are in USB enclosures on a new Intel NUC that's had
~32 hours burnin with memtest. Both file systems had received many
scrubs before with zero errors all time, include most recently within
the past few days. What's new about today's setup is the NUC, and the
fact all four drives are directly connected, not to a USB hub. So
right off the bat I'm going to suspect hardware problems are due to
insufficient USB bus power.

kernel messages:
https://drive.google.com/open?id=0B_2Asp8DGjJ9Z0hhbVUwakF5Y2c

Lines 6-28
I'm pretty sure what happens first is sdd is producing spurious data
(bad tree block start). I can't tell if the 'read error correct'
messages are fixing sde or sdd? In any case, since the last thing that
happened before this was a passing scrub, none of these corrections
written to disk are warranted and are suspect. Maybe what happens is
the reads are bad but the writes back to the device (corrections) are
OK.

Next, lines 29 to 48: it looks like there is a USB bus reset, sde vanishes
off the bus, and reappears as sdf, at which point thousands of write
errors ensue. And then I became aware of all of this, did an abrupt
shutdown (poweroff -f) at which point the journal ends.

And then we go to part 1 for what I did next to try to recover.

-- 
Chris Murphy


Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-13 Thread Russell Coker
On Mon, 14 Dec 2015 03:59:18 PM Christoph Anton Mitterer wrote:
> I've had some discussions on the list these days about not having
> checksumming with nodatacow (mostly with Hugo and Duncan).
> 
> They both basically told me it wouldn't be straightforwardly possible
> without CoW,
> and Duncan thinks it may not be so much necessary, but none of them
> could give me really hard arguments, why it cannot work (or perhaps I
> was just too stupid to understand them ^^)... while at the same time I
> think that it would be generally utmost important to have checksumming
> (real world examples below).

My understanding of BTRFS is that the metadata referencing data blocks has the
checksums for those blocks, and the blocks which link to that metadata (e.g.
directory entries referencing file metadata) have checksums of those.  For each
metadata block there is a new version that is eventually linked from a new
version of the tree root.

This means that the regular checksum mechanisms can't work with nocow data.  A
filesystem can have checksums just pointing to data blocks, but then you need
to cater for the case where a corrupt metadata block points to an old version
of a data block with a matching checksum.  The way that BTRFS works, with an
entire checksummed tree, means that there's no possibility of pointing to an
old version of a data block.

The NetApp-published research into hard drive errors indicates that they are
usually small in number and located in small areas of the disk.  So if BTRFS
had a nocow file, with any storage method other than dup you would have
metadata and file data far enough apart that they are not likely to be hit by
the same corruption (and the same thing would apply to most Ext4 inode tables
and data blocks).  I think that a file mode where there were checksums on data
blocks with no checksums on the metadata tree would be useful.  But it would
require a moderate amount of coding, and there's lots of other things that the
developers are working on.
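
To make the failure mode concrete, a toy sketch (nothing here is real
btrfs code; the names and the checksum are made up for illustration):

  #include <stdint.h>
  #include <string.h>

  /* stand-in for a real checksum; the point is the ordering, not
   * the hash */
  static uint32_t toy_csum(const char *p, size_t n)
  {
          uint32_t h = 0;
          while (n--)
                  h = h * 31 + (unsigned char)*p++;
          return h;
  }

  struct csum_item { uint32_t crc; };

  /* nodatacow overwrite: the data and its checksum live in different
   * blocks and are written separately, so a crash in the window
   * below leaves new data with an old checksum, indistinguishable
   * from silent corruption */
  static void nocow_write(char *block, struct csum_item *ci,
                          const char *new_data, size_t len)
  {
          memcpy(block, new_data, len);
          /* <-- crash window: new data on disk, old checksum */
          ci->crc = toy_csum(block, len);
  }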

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Bloghttp://doc.coker.com.au/


Re: btrfs check inconsistency with raid1, part 1

2015-12-13 Thread Chris Murphy
Thanks for the reply.


On Sun, Dec 13, 2015 at 10:48 PM, Qu Wenruo  wrote:
>
>
> Chris Murphy wrote on 2015/12/13 21:16 -0700:
>> btrfs check with devid 1 and 2 present produces thousands of scary
>> messages, e.g.
>> checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
>
>
> Checked the full output.
> The interesting part is, the calculated result is always E4E3BDB6, and
> wanted is always all 0.
>
> I assume E4E3BDB6 is crc32 of all 0 data.
>
>
> If there is a full disk dump, it will be much easier to find where the
> problem is.
> But I'm afraid it won't be possible.

What is a full disk dump? I can try to see if it's possible. Main
thing though is only if it can make Btrfs overall better, because I
don't need this volume repaired, there's no data loss (backups!) so
this volume's purpose now is for study.


> At least, 'btrfs-debug-tree -t 2' should help to locate what's wrong with
> the bytenr in the warning.

Both devs attached (not mounted).

[root@f23a ~]# btrfs-debug-tree -t 2 /dev/sdb > btrfsdebugtreet2_verb.txt
checksum verify failed on 714189570048 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189570048 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189471744 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189471744 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189357056 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189750272 found E4E3BDB6 wanted 00000000
checksum verify failed on 714189750272 found E4E3BDB6 wanted 00000000

https://drive.google.com/open?id=0B_2Asp8DGjJ9NUdmdXZFQ1Myek0


>
>
> The good news is, the fs seems to be OK without major problem.
> As except the csum error, btrfsck doesn't give other error/warning.

Yes, I think so. Main issue here seems to be the scary warnings and
uncertainty what the user should do next, if anything at all.

> I guess btrfsck assembled the devices wrongly, but that's just my personal
> guess.
> And since I can't reproduce it in my test environment, it won't be easy to
> find the root cause.

It might be reproducible. More on that in the next email. Easy to get
you remote access if useful.


>> So. What's the theory in this case? And then does it differ from reality?
>
>
> Personally speaking, it may be a false alert from btrfsck.
> So in this case, I can't provide much help.
>
> If you're brave enough, mount it rw to see what will happen (although it
> may mount just OK).

I'm brave enough. I'll give it a try tomorrow unless there's another
request for more info before then.


-- 
Chris Murphy


Re: Still not production ready

2015-12-13 Thread Qu Wenruo



Duncan wrote on 2015/12/14 06:21 +:

Qu Wenruo posted on Mon, 14 Dec 2015 10:08:16 +0800 as excerpted:


Martin Steigerwald wrote on 2015/12/13 23:35 +0100:

Hi!

For me it is still not production ready.


Yes, this is the *FACT* and not everyone has a good reason to deny it.


In the above sentence, I /think/ you (Qu) agree with Martin (and I) that
btrfs shouldn't be considered production ready... yet, and the first part
of the sentence makes it very clear that you feel strongly about the
*FACT*, but the second half of the sentence (after *FACT*) doesn't parse
well in English, thus leaving the entire sentence open to interpretation,
tho it's obvious either way that you feel strongly about it. =:^\


Oh, my poor English... :(

The latter half is just in case someone considers btrfs stable in some
respects.




At the risk of getting it completely wrong, what I /think/ you meant to
say is (as expanded in typically Duncan fashion =:^)...

Yes, this is the *FACT*, though some people have reasons to deny it.


Right! That's what I want to say!!



Presumably, said reasons would include the fact that various distros are
trying to sell enterprise support contracts to customers very eager to
have the features that btrfs provides, and said customers are willing to
pay for assurances that the solutions they're buying are "production
ready", whether that's actually the case or not, presumably because said
payment is (in practice) simply ensuring there's someone else to pin the
blame on if things go bad.

And the demonstration of that would be the continued fact that people
otherwise unnecessarily continue to pay rather large sums of money for
that very assurance, when in practice, they'd get equal or better support
not worrying about that payment, but instead actually making use of free-
of-cost resources such as this list.


[Linguistic analysis, see frequent discussion of this topic at Language
Log, which I happen to subscribe to as I find this sort of thing
interesting, for more commentary and examples of the same general issue:
http://languagelog.net ]

The problem with the sentence as originally written, is that English
doesn't deal well with multi-negation, sometimes considering each
negation an inversion of the previous (as do most programming languages
and thus programmers), while other times or as read/heard/interpreted by
others repeated negation may be considered a strengthening of the
original negation.

Regardless, mis-negation due to speaker/writer confusion is quite common
even among native English speakers/writers.

The negating words in question here are "not" and "deny".  If you will
note, my rewrite kept "deny", but rewrote the "not" out of the sentence,
so there's only one negative to worry about, making the meaning much
clearer as the reader's mind isn't left trying to figure out what the
speaker meant with the double-negative (mistake? deliberate canceling out
of the first negative with the second? deliberate intensifier?)  and thus
unable to be sure one way or the other what was meant.

And just in case there would have been doubt, the explanation then makes
doubly obvious what I think your intent was by expanding on it.  Of
course that's easy to do as I entirely agree.

OTOH if I'm mistaken as to your intent and you meant it the other way...
well then you'll need to do the explaining as then the implication is
that some people have good reasons to deny it and you agree with them,
but without further expansion, I wouldn't know where you're trying to go
with that claim.


Just in case there's any doubt left of my own opinion on the original
claim of not production ready in the above discussion, let me be
explicit:  I (too) agree with Martin (and I think with Qu) that btrfs
isn't yet production ready.  But I don't believe you'll find many on the
list taking issue with that, as I think everybody on-list agrees, btrfs
/isn't/ production ready.  Certainly pretty much just that has been
repeatedly stated in individualized style by many posters including
myself, and I've yet to see anyone take serious issue with it.


No matter whether SLES 12 uses it as default for root, no matter
whether Fujitsu and Facebook use it: I will not let this onto any
customer machine without lots and lots of underprovisioning and
rigorous free space monitoring.
Actually I will renew my recommendations in my trainings to be careful
with BTRFS.


... And were I to put money on it, my money would be on every regular on-
list poster 100% agreeing with that. =:^)



  From my experience the monitoring would check for:

merkaba:~> btrfs fi show /home
  Label: 'home'  uuid: […]
  Total devices 2 FS bytes used 156.31GiB
  devid 1 size 170.00GiB used 164.13GiB path /dev/[path1]
  devid 2 size 170.00GiB used 164.13GiB path /dev/[path2]

If "used" is same as "size" then make big fat alarm. It is not
sufficient for it to happen. It can run for quite some time just fine
without any 

Re: Kernel lockup, might be helpful log.

2015-12-13 Thread Chris Murphy
I can't help with the call traces. But several (not all) of the hard
resetting link messages are hallmark cases where the SCSI command
timer's default of 30 seconds is being hit while the drive itself is
hung up doing a sector read recovery (multiple attempts). It's worth
seeing if 'smartctl -l scterc' on the device reports back that SCT ERC
is supported and just disabled, meaning you can change it to something
sane with 'smartctl -l scterc,70,70', which will make the drive time
out before the Linux kernel command timer does. That'll let Btrfs do
the right thing, rather than constantly getting poked in both eyes by
link resets.
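
Concretely, assuming the drive is /dev/sda and supports SCT ERC:

  # report the current SCT error recovery control settings
  smartctl -l scterc /dev/sda
  # set read and write recovery limits to 7.0 seconds
  smartctl -l scterc,70,70 /dev/sda
  # if the drive doesn't support SCT ERC, raise the kernel's command
  # timer instead (seconds, per device, not persistent across boots)
  echo 180 > /sys/block/sda/device/timeout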


Chris Murphy